@hopeseekr
Created February 3, 2018 04:01
Putting Docker on its own pseudo filesystem

Docker on BTRFS is very buggy and can leave a system fully unusable. It can butcher the underlying BTRFS filesystem so badly that it uses far more disk space than it needs and reaches a state where it cannot even delete an image, forcing drastic actions up to and including reformatting the entire affected BTRFS root file system.

According to the official Docker documentation:

btrfs requires a dedicated block storage device such as a physical disk. This block device must be formatted for Btrfs and mounted into /var/lib/docker/.

In my experience, you will still run into issues even if you use a dedicated partition. No, it seems it requires a standalone hard drive, which is a luxury many computers just simply cannot afford.

See "Docker gradually exhausts disk space on BTRFS" (#27653) for details of exactly what I have run into. See also "docker does not remove btrfs subvolumes when destroying container".

A pseudo filesystem is a filesystem that is contained inside an otherwise-ordinary file and mounted by the OS. This guide will show you how to set one up and use it exclusively for Docker images and containers in a way that will NOT cripple your BTRFS file system, while still allowing you to store it in normal BTRFS subvolume snapshots.

Steps to migrate /var/lib/docker from a subdirectory to a dedicated pseudo filesystem.

System Preparation:

  1. BACKUP ANY IMPORTANT self-made Docker images! This guide will destroy all of your existing images and containers.
     docker save image/name -o image_name.docker; bzip2 image_name.docker
  2. Open up a terminal and run the command sudo watch -n10 df /var/lib/docker. Pay attention to the total space available. Because BTRFS deletes files from the system only when the disk is inactive, it is important to know when certain processes have really finished, or if they are even happening. In a BTRFS file system that is corrupted by Docker, many times no file will actually be removed from the underlying file system. If this happens, refer to the Drastic Actions section.
  3. Make a BTRFS volume snapshot!! We are messing with your core file system. It is important to make a snapshot. If all goes to Hell, refer to the Drastic Actions section for how to restore the snapshot and get quickly back to work.
     sudo mkdir /snaps
     sudo btrfs subvolume snapshot / /snaps/root-$(date '+%Y-%m-%d')-pre
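
Step 1 above backs up one image at a time. If you have several, a loop over every tagged image may be more convenient. The following is only a minimal sketch: the function names image_to_filename and backup_all_images and the choice of destination directory are illustrative assumptions, not part of the original guide.

```shell
#!/bin/sh
# Sketch only: function names and file layout are assumptions, not part of the guide.

# Turn an image reference such as "repo/name:tag" into a flat file name
# by replacing "/" and ":" with "_".
image_to_filename() {
    printf '%s\n' "$1" | tr '/:' '__'
}

# Save every tagged local image into the directory given as $1, then compress it.
backup_all_images() {
    dest=$1
    docker images --format '{{.Repository}}:{{.Tag}}' |
        grep -v '<none>' |
        while IFS= read -r image; do
            out="$dest/$(image_to_filename "$image").docker"
            docker save -o "$out" "$image" && bzip2 "$out"
        done
}
```

After sourcing the functions, something like backup_all_images /path/to/backups would write one .docker.bz2 file per tagged image. Untagged (<none>) images are skipped because they cannot be referred to by name.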

Clean Up Docker /var/lib/docker files.

  1. Delete all of the docker containers: docker rm $(docker ps -aq)

    Afterwards, docker ps -aq should return nothing.

  2. Delete all of the docker images: docker rmi -f $(docker images -q)
     NOTE: If you do not see any activity for several minutes, it is indicative of a BTRFS meltdown. To verify for sure, run sudo du -hs /var/lib/docker. If it is still running after 3-5 minutes, refer to the Drastic Actions section.

    Afterwards, docker images -q should return nothing.

  3. Stop docker: sudo systemctl stop docker
     NOTE: When docker has butchered the BTRFS file system, it often CANNOT be stopped via this step. Fortunately, a simple system reboot resolves this issue. Do that now if you encounter this problem.

  3b. Ensure that docker is completely stopped: ps aux | grep docker
  4. Explore the /var/lib/docker directory:

sudo -s
cd /var/lib/docker
du -h --max-depth=1 | sort -h

Because you have deleted literally 100% of the files which docker stores, your /var/lib/docker should be virtually empty. Maybe a few MB max. However, if Docker has been abusing the underlying root BTRFS system, many GBs will often still be stored.

  5. Attempt to remove all of the files manually. DO NOT USE THE rm COMMAND! This will not work, and if it does, you will have irreversibly corrupted your BTRFS system. Go immediately to the Drastic Actions section if you have accidentally done so.

As discussed in "nuking old and broken /var/lib/docker directories is non-trivial", the only safe way to remove broken /var/lib/docker files on BTRFS is to do the following:

for subvolume in /var/lib/docker/btrfs/subvolumes/*; do
    # Quote the path so unusual directory names are passed through intact.
    btrfs subvolume delete "$subvolume"
done
  6. Ensure that all docker BTRFS subvolumes have been destroyed: btrfs subvolume list /
     You should not see any entries with the path /var/lib/docker.
  7. Manually remove all the other files in /var/lib/docker. Only do this after the subvolumes above are gone: rm -r /var/lib/docker/*
     Ensure that it is empty: ls should print nothing and du -h . should report 0 disk space used.

If all has gone well, you now have a BTRFS file system that is devoid of all docker-related images, containers and various metadata and caches. Congratulations!
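
The check that no docker subvolumes remain can be made less error-prone than reading the list by eye. The sketch below assumes that each line of btrfs subvolume list output ends in "path <relative path>"; count_docker_subvolumes is an illustrative helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch only: count_docker_subvolumes is an illustrative helper name.

# Read "btrfs subvolume list" output on stdin and print how many
# subvolumes still live under var/lib/docker.
count_docker_subvolumes() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'path .*var/lib/docker/' || true
}
```

Used as sudo btrfs subvolume list / | count_docker_subvolumes, it should print 0 once the cleanup has succeeded. The pattern is deliberately loose so that layouts where paths carry a prefix such as @/ are also matched.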

Create the pseudo file system

  1. Ensure that you are the root user. sudo -s
  2. Create the pseudo filesystem: The best place to store file-based pseudo filesystems is in /media.

Estimate how much space you will need, or want to reserve, for Docker images. I find that 10-20 GB is far more than enough for properly functioning systems.

cd /media
fallocate -l 10G docker-volume.img
mkfs.ext4 docker-volume.img
mount -o loop -t ext4 /media/docker-volume.img /var/lib/docker
df -h
# You should see: /dev/loop0      9.8G   37M  9.3G   1% /var/lib/docker
umount /var/lib/docker
  3. Add the pseudo filesystem to the "mount on boot" config: echo "/media/docker-volume.img /var/lib/docker ext4 defaults 0 0" >> /etc/fstab
  4. Test mount it: mount /var/lib/docker
  5. Restart docker and confirm that it is using the pseudo filesystem:
systemctl start docker
systemctl stop docker
cd /media
ls /var/lib/docker    # You should see many subdirectories. 
du -h /var/lib/docker # It should report approximately 35 directories, and about 256 KB of space used.
                      # You should NOT see any mention of BTRFS subvolumes.
umount /var/lib/docker
du -h /var/lib/docker # You should see: 0	/var/lib/docker/
  6. Now reboot the system and confirm that the volume has auto-mounted and that docker is using it.
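
The echo ... >> /etc/fstab step appends unconditionally, so running it twice leaves a duplicate entry. A hedged sketch of an idempotent variant follows; add_fstab_entry is an illustrative helper, and it writes the same "ext4 defaults 0 0" fields the guide uses.

```shell
#!/bin/sh
# Sketch only: add_fstab_entry is an illustrative helper, not part of the guide.

# Append "<image> <mountpoint> ext4 defaults 0 0" to the fstab file given
# as $1, unless a non-comment line already uses that mount point.
add_fstab_entry() {
    fstab=$1
    image=$2
    mountpoint=$3
    if awk -v mp="$mountpoint" \
        '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        return 0   # already present, nothing to do
    fi
    printf '%s %s ext4 defaults 0 0\n' "$image" "$mountpoint" >> "$fstab"
}
```

As root, add_fstab_entry /etc/fstab /media/docker-volume.img /var/lib/docker would then be safe to repeat. Trying it against a copy of /etc/fstab first is a cheap way to confirm it behaves as expected.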

Congratulations! You have now moved Docker volumes from BTRFS to a pseudo ext4 file system, which docker supports much better!

  7. IMPORTANT: Take a new snapshot of the fixed system and remove the one we made at the beginning of this guide.
sudo btrfs subvolume snapshot / /snaps/root-$(date '+%Y-%m-%d')
# If the -pre snapshot was taken on an earlier day, substitute its actual name here.
sudo btrfs subvolume delete /snaps/root-$(date '+%Y-%m-%d')-pre

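If you ever run into a corrupted /var/lib/docker in the future, simply stop docker, unmount /var/lib/docker, sudo rm /media/docker-volume.img and repeat this guide from the "Create the pseudo file system" section (skipping the /etc/fstab step, which is already done). It is much better than risking your entire BTRFS file system to docker's buggy implementation!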
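
Recreating the image can be scripted. The following is a minimal sketch under the guide's assumptions (image at /media/docker-volume.img, 10G, an existing /etc/fstab entry for /var/lib/docker). The names run, reset_docker_volume and DRY_RUN are hypothetical and introduced here only for illustration.

```shell
#!/bin/sh
# Sketch only: run, reset_docker_volume and DRY_RUN are illustrative names.

# Execute a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Throw away the image file and build a fresh, empty ext4 image in its place.
# Stops at the first failing step. Needs root when not in dry-run mode.
reset_docker_volume() {
    image=${1:-/media/docker-volume.img}
    size=${2:-10G}
    run systemctl stop docker &&
    run umount /var/lib/docker &&
    run rm -f "$image" &&
    run fallocate -l "$size" "$image" &&
    run mkfs.ext4 "$image" &&
    run mount /var/lib/docker &&
    run systemctl start docker
}
```

Running DRY_RUN=1 reset_docker_volume first prints the seven commands without touching anything, which is a reasonable sanity check before doing it for real as root.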

Drastic Actions

Attempt a BTRFS restore

Things didn't go so well? Unfortunately, this happens.

First things first, attempt to restore an older snapshot that may not be corrupted.

Follow the guide here: Using Btrfs for Easy Backup and Rollback

If that fails, restore the snapshot taken in the prep stage of this guide. That will at least get you back to the same state your system was in before you started all of this.

Attempt via a rescue disk

Mount the partition while inside a recovery system like System Rescue CD and reattempt this guide from the very beginning.

When I was in a total desperate situation where Docker had consumed so much of the file system that basic commands would not run, this method saved me.

Back up and Reformat the entire system.

In early 2017, no matter what I tried, nothing worked. If you find yourself in this unfortunate state, back up all of your important files, maybe via a recovery system, and reformat the machine. I still recommend BTRFS as it is vastly superior to all other mainstream file systems. Just don't use it with docker!

Be sure to leave your horror story on the official Docker bug reports for this issue:

@wmutschl

I compared docker images + containers on btrfs vs overlay2 and it was like 50 GB vs 5 GB - is there some way to measure btrfs usage in such a way that takes CoW into consideration, so that it produces similar numbers as running "du" on /var/lib/docker with overlay2?

You might want to check out https://ownyourbits.com/2017/12/06/check-disk-space-of-your-btrfs-snapshots-with-btrfs-du/

@plantroon

plantroon commented Aug 22, 2021

I tried one last thing which freed about 10 GB for me (basically it should find duplicates and reflink them):
chrt -i 0 duperemove -A -h -d -r -v -b4k --dedupe-options=noblock,same --lookup-extents=yes --io-threads=1
(source: https://wiki.tnonline.net/w/Btrfs/Deduplication/Duperemove)

But in the end I switched to overlay2. It will make overall btrfs management easier.

@eriteric

thank you so much

@pavel-perina

pavel-perina commented Mar 28, 2023

I would like to ask if it's a real problem or people are just confused (including me). I have OpenSuse Leap 15.5 beta.

What I found is that /var/lib/docker/btrfs occupied 43GB of space and went down to 38GB when I deleted all containers and unused images.

df -h reports 52GB used total. Later after reading this and a few horror stories, I found that in my home directory I have an old nextcloud volume and its tarball backup, which are like 2x17GB. 38+2x17 is 72GB, which is already 20GB more than what df reports for the whole disk, and likely 10GB is in other directories.

docker system df says I have 4.1GB in 5 images (yes, one of them has 2GB), 0GB in containers, 600MB in volumes (nextcloud is archived) which is realistic

btrfs fi du -s /var/lib/docker/*
     Total   Exclusive  Set shared  Filename
  35.23GiB    11.76MiB     3.85GiB  /var/lib/docker/btrfs
  72.00KiB    72.00KiB       0.00B  /var/lib/docker/buildkit
 188.00KiB   188.00KiB       0.00B  /var/lib/docker/containerd
     0.00B       0.00B       0.00B  /var/lib/docker/containers
   8.04MiB     8.04MiB       0.00B  /var/lib/docker/image
  80.00KiB    80.00KiB       0.00B  /var/lib/docker/network
     0.00B       0.00B       0.00B  /var/lib/docker/plugins
     0.00B       0.00B       0.00B  /var/lib/docker/runtimes
     0.00B       0.00B       0.00B  /var/lib/docker/swarm
     0.00B       0.00B       0.00B  /var/lib/docker/tmp
     0.00B       0.00B       0.00B  /var/lib/docker/trust
 566.00MiB   566.00MiB       0.00B  /var/lib/docker/volumes

Now this is a bit weird. First, image dir is nearly empty. Second, btrfs directory says that "Set shared" 3.85GB matches almost exactly 4.1GB reported by docker system df command. 35.23GiB is 37.8GB which matches 38G reported by df command. I guess btrfs contains both real data and snapshots.

I'm very new to using docker, but my rough understanding is that every subvolume/id is a snapshot created during the image build process.
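
The unit comparison in the comment above (GiB as printed by btrfs fi du versus GB as printed by docker system df and df) can be checked with a one-line conversion. This is just an arithmetic sketch; gib_to_gb is an illustrative name.

```shell
#!/bin/sh
# Sketch only: gib_to_gb is an illustrative helper name.

# Convert a size in GiB (2^30 bytes) to GB (10^9 bytes), one decimal place.
gib_to_gb() {
    awk -v gib="$1" 'BEGIN { printf "%.1f\n", gib * 1024 * 1024 * 1024 / 1e9 }'
}
```

gib_to_gb 35.23 prints 37.8 and gib_to_gb 3.85 prints 4.1, which agrees with the figures quoted in the comment.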

@knirch

knirch commented Apr 30, 2023

might be well to mention docker builder prune which cleared up a lot of what I thought was "dumb bug leftovers", as the guide goes into nuking the site from orbit approach, I think for idio^wnewbies like me some of the low-hanging fruit could be listed :)

@dim-geo

dim-geo commented May 14, 2023

I had also some problems with slow storage and mounted ext4 like this:
mount -o loop,noatime,commit=60,barrier=0 -t ext4 /media/docker-volume.img /var/lib/docker
This can corrupt ext4 filesystem, but it's a risk I am willing to take.

@BradenM

BradenM commented Apr 11, 2024

As of linux68, using:
mount -o loop,noatime,commit=60,barrier=0 -t ext4 /media/docker-volume.img /var/lib/docker

began failing to mount with:

mount: /docker: /dev/loop0 already mounted or mount point busy.
       dmesg(1) may have more information after failed mount system call.

The loop device is then automatically detached from what I can gather once this happens.

I suspect this kernel change is related:
https://lore.kernel.org/lkml/20240105-vfs-super-4092d802972c@brauner/

From my scan, it implements a new safety mechanism relating to writing to block devices:

Note that this effectively only prevents modification of the particular block
device's page cache by other writers. The actual device content can still be
modified by other means - e.g. by issuing direct scsi commands, by doing
writes through devices lower in the storage stack (e.g. in case loop devices,
DM, or MD are involved) etc. But blocking direct modifications of the block
device page cache is enough to give filesystems a chance to perform data
validation when loading data from the underlying storage and thus prevent
kernel crashes.

Not sure if (assuming this is the cause) preventing loop mount like this was intentional or not, but nevertheless you can still manually setup the loop device + mount post-boot with no issue:

$ sudo losetup -fP --show /mnt/storage/@docker/media/docker-volume.img
/dev/loop1
$ sudo mount -o noatime,commit=60,barrier=0 -t ext4 /dev/loop1 /docker

Simple enough to set this up as pre-exec for the systemd service or something of the like.
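
One possible shape for such a pre-exec helper is sketched below. It follows the manual losetup + mount workaround above, with two changes that are my own assumptions rather than anything from the comment: it reuses a loop device if the image is already attached (via losetup -j), and it mounts with only noatime, leaving out commit=60,barrier=0 because of the corruption risk mentioned earlier in the thread. The name mount_docker_image is hypothetical.

```shell
#!/bin/sh
# Sketch only: mount_docker_image is an illustrative helper name.

# Attach the image file $1 to a loop device (reusing one if it is already
# attached) and mount that device at $2. Needs root.
mount_docker_image() {
    image=$1
    target=$2
    # "losetup -j FILE" lists loop devices backed by FILE as "/dev/loopN: ...".
    dev=$(losetup -j "$image" | cut -d: -f1 | head -n 1)
    if [ -z "$dev" ]; then
        # -f picks the first free loop device, --show prints its name.
        dev=$(losetup -f --show "$image") || return 1
    fi
    mount -t ext4 -o noatime "$dev" "$target"
}
```

A script that defines this function and then calls, for example, mount_docker_image /media/docker-volume.img /var/lib/docker could be run before the docker service starts. Whether it avoids the mount failure on a given kernel is something to verify on the affected machine.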

Just running:
mount -o loop,noatime,commit=60,barrier=0 -t ext4 /media/docker-volume.img /var/lib/docker

post-boot fails with the same error, in case you were wondering.
