Evaluation of Distributed Filesystems for Docker Swarm

January 5, 2018

This blog post documents my search for a decent distributed, replicated filesystem for hosting Docker swarm volumes. It is a work in progress. I started this evaluation because GlusterFS is incredibly slow.

Current result: XtreemFS is about 40× faster than GlusterFS with a simple ls, but it is unstable. LizardFS looks quite good, even though there are several problems and it must be well tuned.

Conclusion

Edit: Currently, I’d rather recommend CephFS. LizardFS tends to become slow when the amount of data becomes huge. But my CephFS is still relatively small, so I cannot yet provide a final conclusion.

After all tests and experiences, I recommend LizardFS, with the following recommendations:

master server requires: little CPU, huge RAM, fast IO
use dedicated servers, don’t use the same servers as docker worker
add a lot of memory for the master (>32GB, I use 72GB)
monitor your resources
do daily snapshots, but avoid hourly snapshots
don’t use btrfs

Setup

There are four servers:

universum (40 GB RAM, 24 TB HDD, 1.7 GHz Intel Xeon with 8 cores)
- docker: swarm master node
- storage: master node
urknall (16 GB RAM, 30 TB HDD, 1.3 GHz AMD Athlon with 2 cores)
- docker: swarm client node
- storage: client node, replicates
pulsar (16 GB RAM, 17 TB HDD, 800 MHz AMD Athlon with 2 cores)
- docker: swarm client node (not yet)
- storage: client node, replicates
raum (8 GB RAM, 21 TB HDD, 2.3 GHz Intel Celeron with 2 cores)
- docker: swarm client node
- storage: client node, replicates

The servers universum, urknall and pulsar are connected to a 1 Gbit/s Ethernet switch with short 0.5 m Cat 6 cables. Server raum used to be in another room, connected through powerline (Devolo dLAN 1200+), but after some performance problems I moved it to the same location as the others.

The two servers universum and urknall form a Docker swarm that starts containers on any of the servers, so they must have access to a shared filesystem.

Results

Filesystem	Installation	Performance	Documentation	Result
GlusterFS	simple in theory, unstable in practice	extremely poor, unusable	poor, outdated, wrong, hidden	do not use
Minio	server: very simple; client: failed (3rd party, with FUSE or Docker volume plugin)	unknown	server: rudimentary but correct; client: missing overall view	complete failure
XtreemFS	simple, apart from some problems	much faster than GlusterFS	basics good, except for replication	do not use – unstable, processes crash, recovery does not work
LizardFS	very easy	faster than XtreemFS	very good, very clear	needs performance configuration, but problems are resolvable; good community support

Performance

Tested on each filesystem with an identical data set of 6.4 GB (according to du -hs).

Filesystem	ls /root/dir	du -hs /root/dir
Native	0.002s	0.758s
GlusterFS	8.746s	2h 31min 17.543s
XtreemFS	0.202s	32.756s
LizardFS	0.004s	2.585s
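
Timings like the ones in this table can be collected with a small bash helper. The following is only a sketch, not necessarily how the numbers above were measured: the function name elapsed and the default directory are assumptions for illustration, and results depend heavily on whether the metadata is already cached.

```shell
#!/usr/bin/env bash
# elapsed CMD [ARGS...]: run a command, discard its output and print the
# wall-clock time in seconds (bash "time" keyword, %R = real time).
elapsed() {
  local TIMEFORMAT=%R
  { time "$@" >/dev/null 2>&1; } 2>&1
}

# Assumption: the directory to test is passed as first argument;
# /var/lizardfs is only a hypothetical default.
DIR=${1:-/var/lizardfs}
if [ -d "$DIR" ]; then
  echo "ls:     $(elapsed ls "$DIR")s"
  echo "du -hs: $(elapsed du -hs "$DIR")s"
fi
```

Run it once per mount point, ideally several times each, and compare the values with the native filesystem.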

Details

Minio

Unfortunately, Minio / s3fs is not POSIX compliant, so it cannot be used for Docker volumes. Minio can only be used if the application natively implements the AWS S3 API. Docker volume drivers do not.

Server Installation

According to the official documentation:

create a docker secret for secret and access key:

echo "AKIAIOSFODNN7EXAMPLE" | docker secret create access_key -
echo "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" | docker secret create secret_key -

create directories for local docker volumes to store minio data on both servers universum and urknall:
```
sudo mkdir -p /var/lib/minio/disk{1,2}
```

deploy this yaml file:

version: '3.3'

services:

  minio1:
    image: minio/minio
    ports:
      - 9001:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk1
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == universum  

  minio2:
    image: minio/minio
    ports:
      - 9002:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk2
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == universum  

  minio3:
    image: minio/minio
    ports:
      - 9003:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk1
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == urknall

  minio4:
    image: minio/minio
    ports:
      - 9004:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk2
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == urknall

secrets:
  access_key:
    external: true
  secret_key:
    external: true

point the browser to http://universum:9001, log in with the secrets, and create buckets and files in the buckets as expected

Client Installation

There are basically two options: mount a bucket into the operating system using FUSE, or use a Docker volume plugin. For both, you need to apt install s3fs.

Fuse s3fs

Follow the instructions to install rexray:

install the Debian package:

curl -sSL https://dl.bintray.com/emccode/rexray/install | sh

create the configuration in /etc/rexray/config.yml:

sudo tee /etc/rexray/config.yml <<EOF
libstorage:
  service: s3fs
s3fs:
  accessKey: AKIAIOSFODNN7EXAMPLE
  secretKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  endpoint: http://universum:9001
  region: us-east-1
  disablePathStyle: false
  options:
  - url=http://universum:9001
  - use_path_request_style
  - allow_other
  - nonempty
EOF

start service:

sudo systemctl enable rexray
sudo systemctl start rexray

create a bucket, e.g. testbucket:
```
sudo rexray volume new testbucket
```
mount the test bucket:
```
sudo rexray volume mount testbucket
```

Now I can create files in /var/lib/rexray/volumes/testbucket that also appear in the Minio Browser on the web, and I can see a copy of the files on server urknall, so replication works as expected.

Unfortunately, s3fs does not support the full POSIX standard. Even worse, it is not possible to create subdirectories, so sudo mkdir /var/lib/rexray/volumes/testbucket/test fails.

Result: Not usable for docker volumes.
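
A quick way to check whether a mounted directory supports the basic operations that containers typically rely on is a small smoke test. This is a minimal sketch and by no means a full POSIX compliance test; the function name posix_smoke_test and the choice of operations are my assumptions.

```shell
#!/usr/bin/env bash
# posix_smoke_test DIR: try mkdir, write, rename, symlink, chmod and read
# below DIR. Prints "ok" on success, "failed" (exit status 1) otherwise.
posix_smoke_test() {
  local dir=$1 tmp
  tmp="$dir/.smoke-test.$$"
  if mkdir "$tmp" 2>/dev/null \
     && echo "data" > "$tmp/a" \
     && mv "$tmp/a" "$tmp/b" \
     && ln -s b "$tmp/link" \
     && chmod 600 "$tmp/b" \
     && [ "$(cat "$tmp/link")" = "data" ]; then
    rm -rf "$tmp"
    echo "ok"
  else
    rm -rf "$tmp"          # clean up whatever was created before the failure
    echo "failed"
    return 1
  fi
}
```

Called as sudo bash -c with the function loaded, or simply as root, posix_smoke_test /var/lib/rexray/volumes/testbucket should report failed on the s3fs mount described above, because the very first step (mkdir) does not work there.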

Volume plugin

The original project minio/minifs is deprecated. The plugin rexray/s3fs is recommended instead. First problem: installation of the Docker volume plugin rexray/s3fs sometimes fails and sometimes succeeds, so it is very unpredictable.

Installation command: docker plugin install rexray/s3fs S3FS_OPTIONS="url=http://universum:9001" S3FS_ENDPOINT="http://universum:9001" S3FS_ACCESSKEY="AKIAIOSFODNN7EXAMPLE" S3FS_SECRETKEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" --alias s3fs

Next, let’s try to use the plugin in a simple docker volume:

marc@universum:~$ docker volume create --driver s3fs newvol
newvol
marc@universum:~$ docker run -it --rm --volume newvol:/data ubuntu bash
root@05ff1576f1d1:/# ls /data
root@05ff1576f1d1:/# echo "this is a test" > /data/test1
root@05ff1576f1d1:/# cat /data/test1
this is a test
root@05ff1576f1d1:/# exit

So far it works as expected, and I can even create directories, but the file test1 does not appear in bucket newvol in the Minio Browser. Even worse, when I try to run a new container with the same bucket, it consistently fails:

marc@universum:~$ docker run -it --rm --volume newvol:/data ubuntu bash
docker: Error response from daemon: error while mounting volume '/var/lib/docker/plugins/1da4e15929fc167acf5a8cc8c84c0e611b130b503081fdd828e6c8ef3ec0c3bb/rootfs': VolumeDriver.Mount: docker-legacy: Mount: newvol: failed: error mounting s3fs bucket.

I got no help in the chat, nor from Google.

Result: Complete failure.

XtreemFS

When I added MRC and DIR replication, everything became unstable and started blocking. Finally, all data was lost. It does not seem to be stable enough for a production environment that stores important data. Processes often stop and remain in status active (exited); the service does not restart, and even if it is restarted, it fails again and again with log messages such as «Currently there is a failover in progress. Please try again later». Even after hours, the services still do not start.

So I cannot recommend it.

Installation

Install the repository and the software (use unstable):

sudo apt-add-repository "deb http://download.opensuse.org/repositories/home:/xtreemfs/xUbuntu_16.04 ./"
wget -q http://download.opensuse.org/repositories/home:/xtreemfs:/unstable/xUbuntu_16.04/Release.key -O - | sudo apt-key add -
sudo apt-get update
sudo apt-get install xtreemfs-client xtreemfs-server xtreemfs-tools

Define your servers in the following variables; the master server comes first:

REPL=( universum urknall raum )
num=${#REPL[*]}

Basic

Follow the simple instructions, on all servers:

for f in /etc/xos/xtreemfs/*.properties; do
  sudo sed -i 's,.*\(listen\.address *= *\).*,\1'"$(hostname -I | cut -d' ' -f1)"',' $f
  sudo sed -i 's,.*\(hostname *= *\).*,\1'"$(hostname)"',' $f
  sudo sed -i 's,\( *dir_service\.host *= *\).*,\1'"$(hostname)"',' $f
  sudo sed -i 's,.*\(startup\.wait_for_dir *= *\).*,\186400,' $f
done
sudo sed -i 's,.*\(dir_service\.host *= *\).*,\1'"$(hostname)"',' /etc/xos/xtreemfs/default_dir
sudo sed -i "/$(hostname)/d" /etc/hosts

Redundant MRC

On all hosts (start with the master server):

sudo sed -i 's,.*\(babudb\.sync *= * \).*,\1FDATASYNC,' /etc/xos/xtreemfs/mrcconfig.properties
sudo sed -i 's,.*\(babudb\.repl\.localhost *= * \).*,\1'"$(hostname)"',' /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
sudo sed -i 's,.*\(babudb\.repl\.sync\.n *= *\).*,\1'$(((num+2)/2))',' /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
sudo sed -i '/babudb\.repl\.participant\.[0-9]/d' /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
for ((i=0; i<num; ++i)); do
  sudo tee -a /etc/xos/xtreemfs/server-repl-plugin/mrc.properties <<EOF
babudb.repl.participant.$((i)) = ${REPL[$i]}
babudb.repl.participant.$((i)).port = 35676
EOF
done
sudo sed -i 's,.*\(babudb\.plugin\.0 *= * .*\),\1,' /etc/xos/xtreemfs/mrcconfig.properties
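
The expression $(((num+2)/2)) written to babudb.repl.sync.n above evaluates to floor(num/2)+1, that is, a strict majority of the participants. That this is the intended meaning is my reading of the arithmetic, not something taken from the XtreemFS documentation. A small sketch (the function name majority is hypothetical) shows the values:

```shell
#!/usr/bin/env bash
# majority N: same integer arithmetic as in the sed command above.
majority() {
  echo $(( ($1 + 2) / 2 ))
}

for n in 1 2 3 4 5; do
  echo "participants=$n sync.n=$(majority "$n")"
done
```

With the three servers in REPL, two participants must acknowledge a write; with four or five servers it would be three.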

Redundant DIR

On all hosts (start with the master server):

sudo sed -i 's,.*\(babudb\.sync *= * \).*,\1FDATASYNC,' /etc/xos/xtreemfs/dirconfig.properties
sudo sed -i 's,.*\(babudb\.repl\.localhost *= * \).*,\1'"$(hostname)"',' /etc/xos/xtreemfs/server-repl-plugin/dir.properties
sudo sed -i 's,.*\(babudb\.repl\.sync\.n *= *\).*,\1'$(((num+2)/2))',' /etc/xos/xtreemfs/server-repl-plugin/dir.properties
sudo sed -i '/babudb\.repl\.participant\.[0-9]/d' /etc/xos/xtreemfs/server-repl-plugin/dir.properties
for ((i=0; i<num; ++i)); do
  sudo tee -a /etc/xos/xtreemfs/server-repl-plugin/dir.properties <<EOF
babudb.repl.participant.$((i)) = ${REPL[$i]}
babudb.repl.participant.$((i)).port = 35678
EOF
done
for f in /etc/xos/xtreemfs/osdconfig.properties /etc/xos/xtreemfs/mrcconfig.properties; do
  sudo sed -i '/dir_service[0-9]\./d' $f
  for ((i=num-1; i>0; --i)); do
    sudo sed -i '/^dir_service\.port *=/adir_service'$i'.port = 32638' $f
    sudo sed -i '/^dir_service\.port *=/adir_service'$i'.host = '${REPL[$i]} $f
 done
done
sudo sed -i 's,.*\(babudb.plugin.0 *= * .*\),\1,' /etc/xos/xtreemfs/dirconfig.properties

Start

On all servers start up all services all at the same time (otherwise they crash after some minutes if the peers are not ready):

sudo systemctl enable xtreemfs-dir xtreemfs-mrc xtreemfs-osd
sudo systemctl start xtreemfs-dir xtreemfs-mrc xtreemfs-osd

Create Data Volume

On any server, create a new volume (execute it only once):

mkfs.xtreemfs -u root -g docker -m 770 -a POSIX --chown-non-root $(hostname)/volumes

Redundant OSD

For the new volume, enable replication over all hosts listed in REPL (execute it only once, on any server):

sudo xtfsutil --set-drp --replication-policy WqRq --replication-factor ${num} /var/xtreemfs

Mount Volume

On all servers, mount the volume:

sudo mkdir /var/xtreemfs
sudo sed -i '/'"$(hostname)"'.*volumes.*xtreemfs/d' /etc/fstab
sudo tee -a /etc/fstab <<EOF
$(srv=($(hostname) ${REPL[*]/$(hostname)}); IFS=,; echo "${srv[*]}")/volumes /var/xtreemfs xtreemfs defaults,_netdev,allow_other 0 0
EOF
sudo mount /var/xtreemfs
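
The subshell in the fstab line above builds a comma-separated server list that starts with the local host, followed by the remaining entries of REPL. The sketch below spells out the same idea as a function; the name mount_source is hypothetical. Unlike the ${REPL[*]/$(hostname)} substitution, it compares whole names, so a host name that is a substring of another one is not mangled.

```shell
#!/usr/bin/env bash
# mount_source LOCAL HOST...: print LOCAL first, then every other host,
# joined by commas.
mount_source() {
  local local_host=$1 out=$1 h
  shift
  for h in "$@"; do
    [ "$h" = "$local_host" ] || out="$out,$h"
  done
  echo "$out"
}

mount_source urknall universum urknall raum   # prints urknall,universum,raum
```

On a real server the call would be mount_source "$(hostname)" "${REPL[@]}", with /volumes appended to form the first fstab field.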

Tests

Next step: run docker volumes in /var/xtreemfs:

Uninstall

To completely uninstall XtreemFS, including removing all data, and permanently lose everything, just call:

sudo umount /var/xtreemfs
sudo apt autoremove --purge 'xtreemfs-*'
sudo rm -rf /etc/xos /var/xtreemfs /var/lib/xtreemfs

Problems

Java Version in xtfs_cleanup

The tool xtfs_cleanup does not run on Ubuntu 16.04. I reported a bug.

IP Address Mess

Mount from remote host fails, probably because my local DHCP / DNS does not support IPv6:

marc@pulsar:~$ sudo mount.xtreemfs universum/volumes /var/xtreemfs
[sudo] password for marc: 
[ E |  1/ 4 12:06:58.335 | 7f07fede2700   ] operation failed: call_id=1 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host 
[ E |  1/ 4 12:06:58.336 | 7f080783b740   ] Got no response from server 2a02:168:6a39:0:f603:43ff:fefc:21bc:32636 (3d1ef865-c7be-41d8-8837-640693ffdd12), retrying (infinite attempts left) (Possible reason: The server is using SSL, and the client is not.), waiting 12.0 more seconds till next attempt.
[ E |  1/ 4 12:07:13.335 | 7f07fede2700   ] operation failed: call_id=2 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host 
[ E |  1/ 4 12:07:28.339 | 7f07fede2700   ] operation failed: call_id=3 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host 
[ E |  1/ 4 12:07:43.339 | 7f07fede2700   ] operation failed: call_id=4 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host

There is a discussion on this; obviously the problem is still not solved.

Note: The following step is not necessary when using the final solution of specifying all external IP addresses!

On server urknall, I run:

echo 1 | sudo tee /proc/sys/net/ipv6/conf/eno1/disable_ipv6
sudo systemctl restart xtreemfs-dir xtreemfs-mrc xtreemfs-osd

But there is still a problem:

marc@urknall:~$ sudo mount.xtreemfs universum/volumes /var/xtreemfs
[sudo] Passwort für marc: 
[ E |  1/ 4 13:03:32.533 | 7f6352ffd700   ] operation failed: call_id=1 errno=5 message=could not connect to host '172.18.0.1:32636': Connection refused 
[ E |  1/ 4 13:03:32.533 | 7f635b9b4740   ] Got no response from server 172.18.0.1:32636 (3d1ef865-c7be-41d8-8837-640693ffdd12), retrying (infinite attempts left) (Possible reason: The server is using SSL, and the client is not.), waiting 15.0 more seconds till next attempt.
marc@urknall:~$ getent hosts universum | cut -d' ' -f1
192.168.99.137

So the services bind to the wrong interface. Fix it on all servers, and also remove the host’s own name from /etc/hosts (it often points to 127.0.0.1 instead of the external IP address):

for f in /etc/xos/xtreemfs/*.properties; do
  sudo sed -i 's,.*\(listen.address *= *\).*,\1'"$(hostname -I | cut -d' ' -f1)"',' $f
  sudo sed -i 's,.*\(hostname *= *\).*,\1'"$(hostname)"',' $f
done
sudo grep listen.address /etc/xos/xtreemfs/*.properties
sudo sed -i "/$(hostname)/d" /etc/hosts

Then restart the services.
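
Before restarting, it can help to verify that the host name no longer resolves to a loopback address. This is only a sketch under the assumption that getent is available; the function name is_loopback is hypothetical.

```shell
#!/usr/bin/env bash
# is_loopback ADDR: succeed if ADDR is an IPv4 or IPv6 loopback address.
is_loopback() {
  case $1 in
    127.*|::1) return 0 ;;
    *)         return 1 ;;
  esac
}

# Resolve the local host name the same way the services would.
name=$(uname -n)
addr=$(getent hosts "$name" 2>/dev/null | awk '{ print $1; exit }')
if is_loopback "$addr"; then
  echo "warning: $name resolves to $addr, check /etc/hosts"
fi
```

If the warning appears, the host name is still mapped to a loopback address, and other servers would be told to connect to an address they cannot reach.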

LizardFS

Following the instructions, I install:

universum: master, chunkserver
urknall: chunkserver, metalogger
pulsar: chunkserver, metalogger
raum: chunkserver, cgi-server

Master Setup

sudo apt-get install lizardfs-master
sudo cp /usr/share/doc/lizardfs-master/examples/* /etc/lizardfs
echo "192.168.99.0/24 / rw,alldirs,maproot=0" | sudo tee -a /etc/lizardfs/mfsexports.cfg
sudo cp /var/lib/lizardfs/metadata.mfs.empty /var/lib/lizardfs/metadata.mfs
sudo systemctl enable lizardfs-master
sudo systemctl start lizardfs-master

Shadow Master

MASTER=universum
echo "PERSONALITY = shadow" | sudo tee -a /etc/lizardfs/mfsmaster.cfg
echo "MASTER_HOST = $MASTER" | sudo tee -a /etc/lizardfs/mfsmaster.cfg

Chunkserver Setup

MASTER=universum
sudo apt-get install lizardfs-chunkserver
sudo mkdir /var/lib/lizardfs/chunk
sudo chown lizardfs:lizardfs /var/lib/lizardfs/chunk
echo "/var/lib/lizardfs/chunk" | sudo tee /etc/lizardfs/mfshdd.cfg
echo "MASTER_HOST = $MASTER" | sudo tee /etc/lizardfs/mfschunkserver.cfg
sudo systemctl enable lizardfs-chunkserver
sudo systemctl start lizardfs-chunkserver

Metalogger Setup

MASTER=universum
sudo apt-get install lizardfs-metalogger
echo "MASTER_HOST = $MASTER" | sudo tee /etc/lizardfs/mfsmetalogger.cfg
sudo systemctl enable lizardfs-metalogger
sudo systemctl start lizardfs-metalogger

CGI Server

MASTER=universum
sudo apt-get install lizardfs-cgiserv

Administration

sudo apt-get install lizardfs-adm
man lizardfs-admin

Client

MASTER=universum
sudo apt-get install lizardfs-client
sudo mkdir /var/lizardfs
sudo mfsmount -H $MASTER /var/lizardfs

Add mount command to /etc/fstab:

echo "mfsmount /var/lizardfs fuse rw,mfsmaster=$MASTER,mfsdelayedinit 0 0" | sudo tee -a /etc/fstab

Tests

Copying the data from GlusterFS (read from the local filesystem) finished after about 1 day:

sudo rsync -avP /var/gluster/volumes/ /var/lizardfs/

Problems

I had several problems, but at least got good feedback from the community.

So I ended up with the following configuration:

Master server configuration in file /etc/mfs/mfsmaster.cfg:

LOAD_FACTOR_PENALTY = 0.5
ENDANGERED_CHUNKS_PRIORITY = 0.6
REJECT_OLD_CLIENTS = 1
CHUNKS_WRITE_REP_LIMIT = 20
CHUNKS_READ_REP_LIMIT = 100

Chunk server configuration in file /etc/mfs/mfschunkserver.cfg:

MASTER_HOST = universum
HDD_TEST_FREQ = 3600
ENABLE_LOAD_FACTOR = 1
NR_OF_NETWORK_WORKERS = 10
NR_OF_HDD_WORKERS_PER_NETWORK_WORKER = 4
PERFORM_FSYNC = 0

CephFS

tbd

Comments

Marc Wäckerlin – Docker Swarm and GlusterFS on January 5, 2018 at 14:21

[…] Do not use GlusterFS! It is extremely slow and very unstable! I am now going to remove my glusterfs and to evaluate new filesystems. […]

Daniel on February 20, 2018 at 12:36

Hi Marc,

thanks for sharing all that information, it helped me a lot.

I got the rexray docker plugin to work with minio using the following command:

docker plugin install rexray/s3fs \
S3FS_OPTIONS="allow_other,use_path_request_style,nonempty,url=http://localhost:9001" \
S3FS_ACCESSKEY=minio \
S3FS_SECRETKEY=1235654egegwfd24qf \
S3FS_ENDPOINT=http://localhost:9001 \
--alias s3fs --grant-all-permissions

After creating a volume and starting a container, I could create files and directories from within the container, and everything shows up in the Minio Browser.

Hope that helped, but I’m still a bit unsure regarding the Docker plugin approach, because it is a bit of a black box and you cannot describe all of your services as infrastructure as code inside a single compose file, especially in swarm mode.
Maybe it would be better to create a custom rexray image/container and share /var/lib/rexray/volumes between containers, but I don’t know if this works at all with all the mounting going on.

Regards,
Daniel

Marc Wäckerlin – PC-Engines APU.2D4 as Docker Worker on December 21, 2018 at 10:51

[…] Currently I have a docker swarm running six APU.2D4 as workers and my old laptop as swarm master. In future, the APU.2D4 will also become master and the router will be configured to forward requests from the internet directly to the active leader. The mass-storage is on 4 HP ProLiant servers that form a LizardFS. […]