Evaluation of Distributed Filesystems for Docker Swarm
This blog post documents my search for a decent distributed, replicated filesystem for hosting docker swarm volumes. It is a work in progress. I started this evaluation because GlusterFS is incredibly slow.
Current result: XtreemFS is 40× faster than GlusterFS with a simple ls, but it is unstable. LizardFS looks quite good, even though there are several problems and it must be well tuned.
Conclusion
Edit: Currently, I’d rather recommend CephFS. LizardFS tends to become slow when the amount of data grows huge. But my CephFS is still relatively small, so I cannot yet provide a final conclusion.
After all tests and experiences, I recommend LizardFS with the following configuration:
- master server requires: little CPU, huge RAM, fast IO
- use dedicated servers, don’t use the same servers as docker worker
- add a lot of memory for the master (>32GB, I use 72GB)
- monitor your resources
- do daily snapshots, but avoid hourly snapshots
- don’t use btrfs
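As a starting point for the memory recommendation, here is a minimal sketch that checks whether a host clears the 32 GB mark. The function names `mem_total_gb` and `has_enough_ram` are made up for this example; the only assumption is a Linux host that provides `/proc/meminfo`.

```shell
#!/bin/sh
# mem_total_gb [FILE]: print total RAM in whole GB, read from a
# meminfo-style file (default: /proc/meminfo, values are in kB).
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# has_enough_ram MIN_GB [FILE]: succeed if the host has more than MIN_GB of RAM.
# The 32 GB threshold used in the usage note is the one from the list above.
has_enough_ram() {
    [ "$(mem_total_gb "${2:-/proc/meminfo}")" -gt "$1" ]
}
```

Usage: `has_enough_ram 32 || echo "master needs more RAM"`.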
Setup
There are four servers:
- universum (40 GB RAM, 24 TB HDD, 1,7 GHz Intel Xeon with 8 Cores)
  - docker: swarm master node
  - storage: master node
- urknall (16 GB RAM, 30 TB HDD, 1,3 GHz AMD Athlon with 2 Cores)
  - docker: swarm client node
  - storage: client node, replicates
- pulsar (16 GB RAM, 17 TB HDD, 800 MHz AMD Athlon with 2 Cores)
  - docker: swarm client node (not yet)
  - storage: client node, replicates
- raum (8 GB RAM, 21 TB HDD, 2,3 GHz Intel Celeron with 2 Cores)
  - docker: swarm client node
  - storage: client node, replicates
The servers universum, urknall and pulsar are connected to a 1 Gbit/s Ethernet switch with short 0,5 m Cat6 cables. Server raum used to be in another room, connected through powerline (Devolo dLAN 1200+), but after some performance problems I moved it to the same location as the others.
The two servers universum and urknall form a docker swarm that starts containers on any of the servers, so they must have access to a shared file system.
Results
Filesystem | Installation | Performance | Documentation | Result |
---|---|---|---|---|
GlusterFS | simple in theory, unstable in practice | extremely poor, unusable | poor, outdated, wrong, hidden | do not use |
Minio | | unknown | | complete failure |
XtreemFS | simple, apart from some problems | much faster than GlusterFS | basics good, except for replication | do not use – unstable, processes crash, recovery does not work |
LizardFS | very easy | faster than XtreemFS | very good, very clear | needs performance configuration, but problems are resolvable; good community support |
Performance
Each filesystem was tested with the same 6,4 GB of data (according to du -hs).
Filesystem | ls /root/dir | du -hs /root/dir |
---|---|---|
Native | 0,002s | 0,758s |
GlusterFS | 8,746s | 2h 31min 17,543s |
XtreemFS | 0,202s | 32,756s |
LizardFS | 0,004s | 2,585s |
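Timings like these can be collected with a small helper. The following is only a sketch, not the exact procedure used for the table: `elapsed_ms` and `speedup` are hypothetical names, and the millisecond resolution relies on GNU date supporting `%N`.

```shell
#!/bin/sh
# elapsed_ms CMD [ARGS...]: run the command, discard its output and
# print the elapsed wall-clock time in milliseconds (needs GNU date).
elapsed_ms() {
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# speedup SLOW FAST: print how many times faster FAST is than SLOW.
# Both values must use a decimal point, not the decimal comma of the table.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}
```

For example, `elapsed_ms ls /root/dir` times a directory listing, and `speedup 8.746 0.202` prints 43.3 for the ls column of GlusterFS versus XtreemFS, which is where the roughly 40× figure comes from.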
Details
Minio
Unfortunately, Minio / s3fs is not POSIX compliant, so it cannot be used for docker volumes. Minio can only be used if the application natively implements the AWS S3 API. Docker volume drivers do not.
Server Installation
According to the official documentation:
- create a docker secret for secret and access key:
echo "AKIAIOSFODNN7EXAMPLE" | docker secret create access_key -
echo "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" | docker secret create secret_key -
- create directories for local docker volumes to store minio data on both servers universum and urknall:
sudo mkdir -p /var/lib/minio/disk{1,2}
- deploy this yaml file:
version: '3.3'
services:
  minio1:
    image: minio/minio
    ports:
      - 9001:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk1
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == universum
  minio2:
    image: minio/minio
    ports:
      - 9002:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk2
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == universum
  minio3:
    image: minio/minio
    ports:
      - 9003:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk1
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == urknall
  minio4:
    image: minio/minio
    ports:
      - 9004:9000
    volumes:
      - type: bind
        source: /var/lib/minio/disk2
        target: /data
    command: server http://minio1/data http://minio2/data http://minio3/data http://minio4/data
    secrets:
      - secret_key
      - access_key
    deploy:
      placement:
        constraints:
          - node.hostname == urknall
secrets:
  access_key:
    external: true
  secret_key:
    external: true
- point the browser to http://universum:9001, log in with the secrets, create buckets and files in the buckets as expected
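One detail of the secret creation step is easy to overlook: echo appends a newline, so the secret stored by docker is one byte longer than the key. Whether Minio is affected by this is not something I verified; the sketch below only shows the difference in length and uses printf as an alternative.

```shell
#!/bin/sh
# Compare how many bytes each variant would hand to "docker secret create".
key="AKIAIOSFODNN7EXAMPLE"   # the example access key from the steps above

bytes_with_echo=$(echo "$key" | wc -c)           # key plus trailing newline
bytes_with_printf=$(printf '%s' "$key" | wc -c)  # key only

echo "echo:   $((bytes_with_echo)) bytes"
echo "printf: $((bytes_with_printf)) bytes"
```

To store the key without the newline, `printf '%s' "$key" | docker secret create access_key -` can be used in place of the echo variant.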
Client Installation
There are basically two options: mount a bucket into the operating system using fuse, or use a docker volume plugin. For both you need to apt install s3fs.
Fuse s3fs
Follow the instructions to install rexray:
- install the debian package;
curl -sSL https://dl.bintray.com/emccode/rexray/install | sh
- create the configuration in /etc/rexray/config.yml:
sudo tee /etc/rexray/config.yml <<EOF
libstorage:
  service: s3fs
s3fs:
  accessKey: AKIAIOSFODNN7EXAMPLE
  secretKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  endpoint: http://universum:9001
  region: us-east-1
  disablePathStyle: false
  options:
  - url=http://universum:9001
  - use_path_request_style
  - allow_other
  - nonempty
EOF
- start service:
sudo systemctl enable rexray
sudo systemctl start rexray
- create a bucket, e.g. testbucket:
sudo rexray volume new testbucket
- mount the test bucket:
sudo rexray volume mount testbucket
Now I can create files in /var/lib/rexray/volumes/testbucket that also appear in the Minio Browser on the web, and I can see a copy of the files on server urknall, so the replication works as expected.
Unfortunately, s3fs does not support the full POSIX standard. Even worse, it is not possible to create subdirectories, so sudo mkdir /var/lib/rexray/volumes/testbucket/test fails.
Result: Not usable for docker volumes.
Volume plugin
The original project minio/minifs is deprecated. Plugin rexray/s3fs is recommended instead. First problem: the installation of the docker volume plugin rexray/s3fs sometimes fails and sometimes succeeds, so it is very unpredictable.
Installation command:
docker plugin install rexray/s3fs \
  S3FS_OPTIONS="url=http://universum:9001" \
  S3FS_ENDPOINT="http://universum:9001" \
  S3FS_ACCESSKEY="AKIAIOSFODNN7EXAMPLE" \
  S3FS_SECRETKEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  --alias s3fs
Next, let’s try to use the plugin in a simple docker volume:
marc@universum:~$ docker volume create --driver s3fs newvol
newvol
marc@universum:~$ docker run -it --rm --volume newvol:/data ubuntu bash
root@05ff1576f1d1:/# ls /data
root@05ff1576f1d1:/# echo "this is a test" > /data/test1
root@05ff1576f1d1:/# cat /data/test1
this is a test
root@05ff1576f1d1:/# exit
So far it works as expected, and I can even create directories, but the file test1 does not appear in bucket newvol on the Minio Browser. Even worse, when I try to run a new container with the same bucket, it constantly fails:
marc@universum:~$ docker run -it --rm --volume newvol:/data ubuntu bash
docker: Error response from daemon: error while mounting volume '/var/lib/docker/plugins/1da4e15929fc167acf5a8cc8c84c0e611b130b503081fdd828e6c8ef3ec0c3bb/rootfs': VolumeDriver.Mount: docker-legacy: Mount: newvol: failed: error mounting s3fs bucket.
I didn’t get help in the chat, nor from google.
Result: Complete failure.
XtreemFS
When I added MRC and DIR replication, everything became unstable and started blocking. Finally, all data were lost. It does not seem stable enough for a production environment that stores important data. Processes often stop and remain in status active (exited). The service does not restart by itself, and even if it is restarted manually, it fails again and again with logs such as «Currently there is a failover in progress. Please try again later». Even after hours, the services still do not start.
So I cannot recommend it.
Installation
Install the repository and the software (use unstable):
sudo apt-add-repository "deb http://download.opensuse.org/repositories/home:/xtreemfs/xUbuntu_16.04 ./"
wget -q http://download.opensuse.org/repositories/home:/xtreemfs:/unstable/xUbuntu_16.04/Release.key -O - | sudo apt-key add -
sudo apt-get update
sudo apt-get install xtreemfs-client xtreemfs-server xtreemfs-tools
Define your servers in the following variables, the master server is the first one:
REPL=( universum urknall raum )
num=${#REPL[*]}
Basic
Follow the simple instructions, on all servers:
for f in /etc/xos/xtreemfs/*.properties; do
    sudo sed -i 's,.*\(listen\.address *= *\).*,\1'"$(hostname -I | cut -d' ' -f1)"',' "$f"
    sudo sed -i 's,.*\(hostname *= *\).*,\1'"$(hostname)"',' "$f"
    sudo sed -i 's,\( *dir_service\.host *= *\).*,\1'"$(hostname)"',' "$f"
    sudo sed -i 's,.*\(startup\.wait_for_dir *= *\).*,\186400,' "$f"
done
sudo sed -i 's,.*\(dir_service\.host *= *\).*,\1'"$(hostname)"',' /etc/xos/xtreemfs/default_dir
sudo sed -i "/$(hostname)/d" /etc/hosts
Redundant MRC
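To see what the listen.address substitution in the loop above does, it can be tried on a throw-away file first. This is a sketch: `set_listen_address` is a hypothetical wrapper around the same sed expression, and the sample file content in the usage note is made up.

```shell
#!/bin/sh
# set_listen_address FILE ADDR: apply the listen.address substitution from
# the loop above to a single file. Because the pattern starts with ".*",
# a leading "# " is replaced as well, so a commented-out entry becomes active.
set_listen_address() {
    sed -i 's,.*\(listen\.address *= *\).*,\1'"$2"',' "$1"
}
```

With a file that contains the line `# listen.address = 127.0.0.1`, calling `set_listen_address sample.properties 192.168.99.137` turns that line into `listen.address = 192.168.99.137` and leaves all other lines untouched.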
On all hosts (start with the master server):
sudo sed -i 's,.*\(babudb\.sync *= * \).*,\1FDATASYNC,' /etc/xos/xtreemfs/mrcconfig.properties
sudo sed -i 's,.*\(babudb\.repl\.localhost *= * \).*,\1'"$(hostname)"',' /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
sudo sed -i 's,.*\(babudb\.repl\.sync\.n *= *\).*,\1'$(((num+2)/2))',' /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
sudo sed -i '/babudb\.repl\.participant\.[0-9]/d' /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
for ((i=0; i<num; ++i)); do
    sudo tee -a /etc/xos/xtreemfs/server-repl-plugin/mrc.properties <<EOF
babudb.repl.participant.$((i)) = ${REPL[$i]}
babudb.repl.participant.$((i)).port = 35676
EOF
done
sudo sed -i 's,.*\(babudb\.plugin\.0 *= * .*\),\1,' /etc/xos/xtreemfs/mrcconfig.properties
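The value written to babudb.repl.sync.n is computed as (num+2)/2 with integer division, which works out to a simple majority of the participants. A small sketch of that arithmetic, with `sync_n` as a hypothetical helper name:

```shell
#!/bin/sh
# sync_n NUM: the value the sed command above writes to babudb.repl.sync.n,
# i.e. (NUM + 2) / 2 with integer division.
sync_n() {
    echo $(( ($1 + 2) / 2 ))
}

for n in 1 2 3 4 5; do
    echo "$n servers -> sync.n = $(sync_n "$n")"
done
```

For the three servers in REPL this yields 2, for four servers it would yield 3.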
Redundant DIR
On all hosts (start with the master server):
sudo sed -i 's,.*\(babudb\.sync *= * \).*,\1FDATASYNC,' /etc/xos/xtreemfs/dirconfig.properties
sudo sed -i 's,.*\(babudb\.repl\.localhost *= * \).*,\1'"$(hostname)"',' /etc/xos/xtreemfs/server-repl-plugin/dir.properties
sudo sed -i 's,.*\(babudb\.repl\.sync\.n *= *\).*,\1'$(((num+2)/2))',' /etc/xos/xtreemfs/server-repl-plugin/dir.properties
sudo sed -i '/babudb\.repl\.participant\.[0-9]/d' /etc/xos/xtreemfs/server-repl-plugin/dir.properties
for ((i=0; i<num; ++i)); do
    sudo tee -a /etc/xos/xtreemfs/server-repl-plugin/dir.properties <<EOF
babudb.repl.participant.$((i)) = ${REPL[$i]}
babudb.repl.participant.$((i)).port = 35678
EOF
done
for f in /etc/xos/xtreemfs/osdconfig.properties /etc/xos/xtreemfs/mrcconfig.properties; do
    sudo sed -i '/dir_service[0-9]\./d' $f
    for ((i=num-1; i>0; --i)); do
        sudo sed -i '/^dir_service\.port *=/adir_service'$i'.port = 32638' $f
        sudo sed -i '/^dir_service\.port *=/adir_service'$i'.host = '${REPL[$i]} $f
    done
done
sudo sed -i 's,.*\(babudb.plugin.0 *= * .*\),\1,' /etc/xos/xtreemfs/dirconfig.properties
Start
On all servers, start all services at the same time (otherwise they crash after a few minutes if the peers are not ready):
sudo systemctl enable xtreemfs-dir xtreemfs-mrc xtreemfs-osd
sudo systemctl start xtreemfs-dir xtreemfs-mrc xtreemfs-osd
Create Data Volume
On any server, create a new volume (execute it only once):
mkfs.xtreemfs -u root -g docker -m 770 -a POSIX --chown-non-root $(hostname)/volumes
Redundant OSD
For the new volume, enable replication across all hosts listed in REPL (execute it only once on any server):
sudo xtfsutil --set-drp --replication-policy WqRq --replication-factor ${num} /var/xtreemfs
Mount Volume
On all servers, mount the volume:
sudo mkdir /var/xtreemfs
sudo sed -i '/'"$(hostname)"'.*volumes.*xtreemfs/d' /etc/fstab
sudo tee -a /etc/fstab <<EOF
$(srv=($(hostname) ${REPL[*]/$(hostname)}); IFS=,; echo "${srv[*]}")/volumes /var/xtreemfs xtreemfs defaults,_netdev,allow_other 0 0
EOF
sudo mount /var/xtreemfs
Tests
Next step: run docker volumes in /var/xtreemfs.
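The command substitution in the fstab entry builds a comma-separated server list with the local host first, followed by the remaining servers. The sketch below shows the same idea as a plain function; `server_list` is a hypothetical name, and unlike the bash array expression it compares whole host names instead of removing a substring.

```shell
#!/bin/sh
# server_list LOCAL HOST...: print LOCAL first, then every other HOST,
# joined by commas; LOCAL is skipped if it appears in the list again.
server_list() {
    local_host=$1
    shift
    out=$local_host
    for h in "$@"; do
        [ "$h" = "$local_host" ] || out="$out,$h"
    done
    echo "$out"
}
```

On urknall, `server_list "$(hostname)" universum urknall raum` prints `urknall,universum,raum`, so the resulting mount source would be `urknall,universum,raum/volumes`.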
Uninstall
To completely uninstall XtreemFS, including removing all data, and permanently lose everything, just call:
sudo umount /var/xtreemfs
sudo apt autoremove --purge xtreemfs-*
sudo rm -rf /etc/xos /var/xtreemfs /var/lib/xtreemfs
Problems
Java Version in xtfs_cleanup
The tool xtfs_cleanup does not run on Ubuntu 16.04. I reported a bug.
IP Address Mess
Mount from remote host fails, probably because my local DHCP / DNS does not support IPv6:
marc@pulsar:~$ sudo mount.xtreemfs universum/volumes /var/xtreemfs
[sudo] password for marc:
[ E | 1/ 4 12:06:58.335 | 7f07fede2700 ] operation failed: call_id=1 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host
[ E | 1/ 4 12:06:58.336 | 7f080783b740 ] Got no response from server 2a02:168:6a39:0:f603:43ff:fefc:21bc:32636 (3d1ef865-c7be-41d8-8837-640693ffdd12), retrying (infinite attempts left) (Possible reason: The server is using SSL, and the client is not.), waiting 12.0 more seconds till next attempt.
[ E | 1/ 4 12:07:13.335 | 7f07fede2700 ] operation failed: call_id=2 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host
[ E | 1/ 4 12:07:28.339 | 7f07fede2700 ] operation failed: call_id=3 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host
[ E | 1/ 4 12:07:43.339 | 7f07fede2700 ] operation failed: call_id=4 errno=5 message=could not connect to host '2a02:168:6a39:0:f603:43ff:fefc:21bc:32636': No route to host
There is a discussion on this; obviously the problem is still not solved.
Note: The following step is not necessary when using the final solution of specifying all external IP addresses!
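When debugging this kind of problem, it helps to see at a glance whether a name resolves to an IPv4 or an IPv6 address. A minimal sketch, with `addr_family` as a hypothetical helper that only looks at the shape of the string and does not validate the address:

```shell
#!/bin/sh
# addr_family ADDR: print "ipv6" if ADDR contains a colon, "ipv4" if it
# looks like a dotted quad, and "unknown" otherwise. No validation is done.
addr_family() {
    case $1 in
        *:*)     echo ipv6 ;;
        *.*.*.*) echo ipv4 ;;
        *)       echo unknown ;;
    esac
}
```

Usage: `addr_family "$(getent hosts universum | cut -d' ' -f1)"` shows which kind of address the resolver returns first for the master.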
On server urknall, I run:
echo 1 | sudo tee /proc/sys/net/ipv6/conf/eno1/disable_ipv6
sudo systemctl restart xtreemfs-dir xtreemfs-mrc xtreemfs-osd
But there is still a problem:
marc@urknall:~$ sudo mount.xtreemfs universum/volumes /var/xtreemfs
[sudo] Passwort für marc:
[ E | 1/ 4 13:03:32.533 | 7f6352ffd700 ] operation failed: call_id=1 errno=5 message=could not connect to host '172.18.0.1:32636': Connection refused
[ E | 1/ 4 13:03:32.533 | 7f635b9b4740 ] Got no response from server 172.18.0.1:32636 (3d1ef865-c7be-41d8-8837-640693ffdd12), retrying (infinite attempts left) (Possible reason: The server is using SSL, and the client is not.), waiting 15.0 more seconds till next attempt.
marc@urknall:~$ getent hosts universum | cut -d' ' -f1
192.168.99.137
So the services bind to the wrong interface. Fix it on all servers, and also remove the server's own hostname from /etc/hosts (it often points to 127.0.0.1 instead of the external IP address):
for f in /etc/xos/xtreemfs/*.properties; do
    sudo sed -i 's,.*\(listen.address *= *\).*,\1'"$(hostname -I | cut -d' ' -f1)"',' $f
    sudo sed -i 's,.*\(hostname *= *\).*,\1'"$(hostname)"',' $f
done
sudo grep listen.address /etc/xos/xtreemfs/*.properties
sudo sed -i "/$(hostname)/d" /etc/hosts
Then restart the services.
LizardFS
Following the instructions, I install:
- universum: master, chunkserver
- urknall: chunkserver, metalogger
- pulsar: chunkserver, metalogger
- raum: chunkserver, cgi-server
Master Setup
sudo apt-get install lizardfs-master
sudo cp /usr/share/doc/lizardfs-master/examples/* /etc/lizardfs
echo "192.168.99.0/24 / rw,alldirs,maproot=0" | sudo tee -a /etc/lizardfs/mfsexports.cfg
sudo cp /var/lib/lizardfs/metadata.mfs.empty /var/lib/lizardfs/metadata.mfs
sudo systemctl enable lizardfs-master
sudo systemctl start lizardfs-master
Shadow Master
MASTER=universum
echo "PERSONALITY = shadow" | sudo tee -a mfsmaster.cfg
echo "MASTER_HOST = $MASTER" | sudo tee -a mfsmaster.cfg
Chunkserver Setup
MASTER=universum
sudo apt-get install lizardfs-chunkserver
sudo mkdir /var/lib/lizardfs/chunk
sudo chown lizardfs.lizardfs /var/lib/lizardfs/chunk
echo "/var/lib/lizardfs/chunk" | sudo tee /etc/lizardfs/mfshdd.cfg
echo "MASTER_HOST = $MASTER" | sudo tee /etc/lizardfs/mfschunkserver.cfg
sudo systemctl enable lizardfs-chunkserver
sudo systemctl start lizardfs-chunkserver
Metalogger Setup
MASTER=universum
sudo apt-get install lizardfs-metalogger
echo "MASTER_HOST = $MASTER" | sudo tee /etc/lizardfs/mfsmetalogger.cfg
sudo systemctl enable lizardfs-metalogger
sudo systemctl start lizardfs-metalogger
CGI Server
MASTER=universum
sudo apt-get install lizardfs-cgiserv
Administration
sudo apt-get install lizardfs-adm
man lizardfs-admin
Client
MASTER=universum
sudo apt-get install lizardfs-client
sudo mkdir /var/lizardfs
sudo mfsmount -H $MASTER /var/lizardfs
Add mount command to /etc/fstab:
echo "mfsmount /var/lizardfs fuse rw,mfsmaster=$MASTER,mfsdelayedinit 0 0" | sudo tee -a /etc/fstab
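Because tee -a appends unconditionally, running the command twice leaves two identical entries in /etc/fstab. A minimal sketch of an idempotent variant, with `append_once` as a hypothetical helper name:

```shell
#!/bin/sh
# append_once FILE LINE: append LINE to FILE unless an identical line
# is already present (whole-line, fixed-string comparison).
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

The function writes with the permissions of the calling user, so for /etc/fstab it has to run in a root shell, for example `append_once /etc/fstab "mfsmount /var/lizardfs fuse rw,mfsmaster=$MASTER,mfsdelayedinit 0 0"`.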
Tests
Copy the data from glusterfs (read from the local filesystem); this finished after about 1 day:
sudo rsync -avP /var/gluster/volumes/ /var/lizardfs/
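After such a long copy it is worth checking that both trees really hold the same data. The following is a sketch under the assumption that comparing file names and cksum checksums is sufficient; `tree_digest` and `same_tree` are hypothetical helper names. Ownership and permissions are not compared, and every file is read once, which can take a long time on a slow filesystem.

```shell
#!/bin/sh
# tree_digest DIR: print one checksum that summarises all regular files
# below DIR (relative paths plus content checksums, sorted by path).
tree_digest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort -k3 | cksum )
}

# same_tree SRC DST: succeed if both directories exist and contain the
# same regular files with the same content.
same_tree() {
    [ -d "$1" ] && [ -d "$2" ] &&
        [ "$(tree_digest "$1")" = "$(tree_digest "$2")" ]
}
```

Usage: `same_tree /var/gluster/volumes /var/lizardfs && echo "copy verified"`.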
Problems
I had several problems, but at least got good feedback from the community:
- lizardfs becomes slow, unstable when growing and with replication
- Snapshot Usage Consideration
- temporary metadata file exists
- Trash Does Not Cleanup
So I ended up with the following configuration:
Master server configuration in file /etc/mfs/mfsmaster.cfg:
LOAD_FACTOR_PENALTY = 0.5
ENDANGERED_CHUNKS_PRIORITY = 0.6
REJECT_OLD_CLIENTS = 1
CHUNKS_WRITE_REP_LIMIT = 20
CHUNKS_READ_REP_LIMIT = 100
Chunk server configuration in file /etc/mfs/mfschunkserver.cfg:
MASTER_HOST = universum
HDD_TEST_FREQ = 3600
ENABLE_LOAD_FACTOR = 1
NR_OF_NETWORK_WORKERS = 10
NR_OF_HDD_WORKERS_PER_NETWORK_WORKER = 4
PERFORM_FSYNC = 0
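Settings like these can be applied repeatably with a small helper instead of editing the files by hand. This is a sketch, assuming the simple `KEY = VALUE` format shown here; `set_option` is a hypothetical name, and keys or values containing `|`, `&` or regular expression characters are not handled.

```shell
#!/bin/sh
# set_option FILE KEY VALUE: set "KEY = VALUE" in a config file.
# Every existing KEY line, including commented-out ones, is replaced;
# if there is none, the setting is appended to the end of the file.
set_option() {
    file=$1
    key=$2
    value=$3
    if grep -Eq "^[#[:space:]]*$key[[:space:]]*=" "$file" 2>/dev/null; then
        sed -i "s|^[#[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}
```

Usage, run as root: `set_option /etc/mfs/mfschunkserver.cfg PERFORM_FSYNC 0`.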
CephFS
tbd
Marc Wäckerlin – Docker Swarm and GlusterFS on 5 January 2018 at 14:21
[…] Do not use GlusterFS! It is extremely slow and very unstable! I am now going to remove my glusterfs and to evaluate new filesystems. […]
Daniel on 20 February 2018 at 12:36
Hi Marc,
thanks for sharing all that information, it helped me a lot.
I got the rexray docker plugin to work with minio using the following command:
docker plugin install rexray/s3fs \
S3FS_OPTIONS="allow_other,use_path_request_style,nonempty,url=http://localhost:9001" \
S3FS_ACCESSKEY=minio \
S3FS_SECRETKEY=1235654egegwfd24qf \
S3FS_ENDPOINT=http://localhost:9001 \
--alias s3fs --grant-all-permissions
After creating a volume and starting a container, I could create files and directories from within the container, and everything shows up in the Minio Browser.
Hope that helped, but I’m still a bit unsure about the docker plugin approach, because it is a bit of a black box and you cannot describe all of your services as infrastructure as code inside a single compose file, especially in swarm mode.
Maybe it would be better to create a custom rexray image/container and share /var/lib/rexray/volumes between containers, but I don’t know if this works at all with all the mounting going on.
Regards,
Daniel
Marc Wäckerlin – PC-Engines APU.2D4 as Docker Worker on 21 December 2018 at 10:51
[…] Currently I have a docker swarm running six APU.2D4 as workers and my old laptop as swarm master. In future, the APU.2D4 will also become master and the router will be configured to forward requests from the internet directly to the active leader. The mass-storage is on 4 HP ProLiant servers that form a LizardFS. […]