@holmanb
Last active April 2, 2024 23:37

bcachefs multi-device with systemd

Multi-device bcachefs mounts in fstab cause the system to hang on boot. Additionally, systemd integration for mounting degraded needs work before it will "just work" for users through the same mount interface that they expect from mount.

During boot, the system should attempt to mount the filesystem degraded. Once a mount succeeds, any disks belonging to that filesystem that appear later should be added to the already mounted filesystem.

Fix broken systemd behavior

There are two options. Option #1 is preferred because it offers a better UI and requires no changes to upstream systemd.

Custom fstab generator

Override systemd's auto-generated .mount file by creating a masked .mount unit to prevent the auto-generated one from blocking boot.

Pros

  • no upstream systemd code required
  • transparent to users

Cons

  • kinda hacky to leave behind a useless unit file
  • requires a systemd generator
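
The masking step above could be sketched as follows. This is a minimal sketch, not the planned implementation: `escape_path`, `mount_unit_name`, and `mask_unit` are hypothetical helpers, the escaping is a simplified version of what `systemd-escape --path` does (it does not resolve "." or ".." components), and it assumes the generator writes into the early directory it receives as its second argument, which systemd.generator(7) documents as taking precedence over all other unit configuration.

```rust
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Simplified take on systemd's path escaping (assumption: no "." or ".."
/// components in the input). "/" becomes "-", and every byte outside
/// [A-Za-z0-9:_.] is written as \xHH.
fn escape_path(path: &str) -> String {
    let parts: Vec<&str> = path.split('/').filter(|p| !p.is_empty()).collect();
    if parts.is_empty() {
        // The root directory is special-cased as "-".
        return "-".to_string();
    }
    let joined = parts.join("/");
    let mut out = String::new();
    for (i, b) in joined.bytes().enumerate() {
        match b {
            b'/' => out.push('-'),
            // A leading dot is escaped so the unit name is not a hidden file.
            b'.' if i == 0 => out.push_str("\\x2e"),
            b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b':' | b'_' | b'.' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("\\x{:02x}", b)),
        }
    }
    out
}

/// Name of the .mount unit that corresponds to a mount point.
fn mount_unit_name(mount_point: &str) -> String {
    format!("{}.mount", escape_path(mount_point))
}

/// Mask a unit by symlinking it to /dev/null inside `generator_dir`.
fn mask_unit(generator_dir: &Path, unit: &str) -> io::Result<()> {
    symlink("/dev/null", generator_dir.join(unit))
}
```

For a mount point of /mnt/data, `mount_unit_name` yields `mnt-data.mount`, and `mask_unit` leaves a symlink of that name pointing at /dev/null in the chosen generator directory.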

Support multiple-device degraded mount with systemd

A new generator[1], fstab-generator-bcachefs, will parse /etc/fstab via getmntent() and create a unit[2] per filesystem and a unit per block device. The filesystem unit must not have an ordering dependency on the device units, but should start when any of the device units starts. Both the filesystem and device units are triggered by the udev rule via SYSTEMD_WANTS[4].
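
As a rough illustration of the parsing step, the sketch below extracts bcachefs entries from fstab text. Several things here are assumptions rather than part of the design: it splits fields on whitespace in plain Rust instead of calling getmntent() (so octal escapes such as \040 are not decoded), it assumes the multi-device spec is a colon-separated list such as /dev/sda:/dev/sdb, and `BcachefsEntry`, `parse_fstab`, and `device_unit_names` are hypothetical names. The device unit names follow the `bcachefs_device_add_$name.service` scheme from the udev rule and assume the spec uses plain /dev paths.

```rust
/// One bcachefs line from fstab. Field names are illustrative only.
#[derive(Debug, PartialEq)]
struct BcachefsEntry {
    devices: Vec<String>,
    mount_point: String,
    options: String,
}

/// Collect every bcachefs entry from fstab-formatted text.
/// Blank lines, comments, and other filesystem types are skipped.
fn parse_fstab(contents: &str) -> Vec<BcachefsEntry> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let (spec, dir, fstype) = (fields.next()?, fields.next()?, fields.next()?);
            if fstype != "bcachefs" {
                return None;
            }
            Some(BcachefsEntry {
                // Assumption: member devices are separated by ':'.
                devices: spec.split(':').map(String::from).collect(),
                mount_point: dir.to_string(),
                options: fields.next().unwrap_or("defaults").to_string(),
            })
        })
        .collect()
}

/// One device-add unit name per member device, keyed by the kernel name
/// (the last path component, e.g. "sda" for "/dev/sda").
fn device_unit_names(entry: &BcachefsEntry) -> Vec<String> {
    entry
        .devices
        .iter()
        .map(|dev| {
            let name = dev.rsplit('/').next().unwrap_or(dev);
            format!("bcachefs_device_add_{}.service", name)
        })
        .collect()
}
```

A real generator would then write one filesystem unit per `BcachefsEntry` and one unit per name returned by `device_unit_names` into its output directory.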

The udev rule will check whether the filesystem is mounted. If it is not yet mounted, the rule will attempt to mount it by pulling in the filesystem mount unit via SYSTEMD_WANTS. If the filesystem is already mounted, the rule will request adding the device via the ioctl.
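
The decision described above can be written out as a small function over the contents of /proc/mounts. This is only a sketch of the logic, not code that udev would run: `Action`, `is_mounted`, and `decide` are hypothetical names, and it assumes that comparing the mount point and filesystem type fields exactly is enough to identify the filesystem.

```rust
/// What the rule should request for a newly appeared member device.
#[derive(Debug, PartialEq)]
enum Action {
    /// Filesystem not mounted yet: start the mount service.
    StartMount,
    /// Filesystem already mounted (degraded): add the device to it.
    AddDevice,
}

/// True if `mount_point` is listed in /proc/mounts-style text as bcachefs.
/// Each line has the form: source mount_point fstype options dump pass
fn is_mounted(proc_mounts: &str, mount_point: &str) -> bool {
    proc_mounts.lines().any(|line| {
        let mut fields = line.split_whitespace();
        let _source = fields.next();
        fields.next() == Some(mount_point) && fields.next() == Some("bcachefs")
    })
}

/// Pick the action for a device whose filesystem belongs at `mount_point`.
fn decide(proc_mounts: &str, mount_point: &str) -> Action {
    if is_mounted(proc_mounts, mount_point) {
        Action::AddDevice
    } else {
        Action::StartMount
    }
}
```

Matching whole fields avoids treating /mnt/pool2 as a hit when looking for /mnt/pool, which a plain substring search would do.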

Example udev rule:

SUBSYSTEM!="block", GOTO="bcachefs_end"
ACTION=="remove", GOTO="bcachefs_end"
ENV{ID_FS_TYPE}!="bcachefs", GOTO="bcachefs_end"
ENV{SYSTEMD_READY}=="0", GOTO="bcachefs_end"

# 1) get the mount point of the filesystem from blkid
#    generator can populate udev db with key/value pairs [5]
#    generator can alternatively populate a file with environment variables
#    generator will populate BLKID -> mount point lookup values
# 2) from a udev rule, is the filesystem that it is a member of currently mounted?
#    check if fs is mounted using /proc/mounts?
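# sets BCACHEFS_MOUNT_PATH
IMPORT{file}="/run/bcachefs/mount-map"
# sets BCACHEFS_MOUNTED to 1 if the mount path appears in /proc/mounts, else 0
IMPORT{program}="/bin/sh -c 'grep -qs $env{BCACHEFS_MOUNT_PATH} /proc/mounts && echo BCACHEFS_MOUNTED=1 || echo BCACHEFS_MOUNTED=0'"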


# fstab-generator-bcachefs created the service file
# `mount-bcachefs-$env{BCACHEFS_MOUNT_PATH}.service` during early boot from the
# contents of /etc/fstab.
#
# Mounting directly from a udev rule isn't allowed. Start a systemd service
# which will attempt to mount the filesystem. When split-brain detection is
# complete, mounting with -o degraded will be added to the unit file by default.
ENV{BCACHEFS_MOUNTED}=="0", ENV{SYSTEMD_WANTS}+="mount-bcachefs-$env{BCACHEFS_MOUNT_PATH}.service"

# fstab-generator-bcachefs created the service file
# `bcachefs_device_add_$name.service` during early boot from the contents of
# /etc/fstab.
#
# The mountpoint already exists, so add the newly online device to the
# filesystem via `bcachefs device add`. The following line will only occur when
# mounting degraded previously succeeded.
#
# This shouldn't block (it is just an ioctl), so it probably doesn't need to be a service
ENV{BCACHEFS_MOUNTED}=="1", ENV{SYSTEMD_WANTS}+="bcachefs_device_add_$name.service"

LABEL="bcachefs_end"
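
The mount service that the rule pulls in might be rendered by the generator roughly as below. This is a sketch under several assumptions: `render_mount_service` is a hypothetical helper, the /usr/bin/mount and /usr/bin/umount paths are guesses, the colon-separated device spec is assumed, and the ordering lines mirror the default dependencies that systemd.mount(5) describes for local mounts (After=local-fs-pre.target, Before=local-fs.target, Before= and Conflicts= on umount.target). The `degraded` flag stands in for the -o degraded default that is planned once split-brain detection is complete.

```rust
/// Render the text of a oneshot service that mounts a bcachefs filesystem.
/// `devices` are the member devices to pass to mount, `mount_point` is the
/// target directory, and `degraded` adds `-o degraded` to the command.
fn render_mount_service(devices: &[&str], mount_point: &str, degraded: bool) -> String {
    let opts = if degraded { "-o degraded " } else { "" };
    // Assumption: member devices are joined with ':' on the command line.
    let spec = devices.join(":");
    format!(
        "[Unit]\n\
         Description=Mount bcachefs filesystem at {mp}\n\
         DefaultDependencies=no\n\
         After=local-fs-pre.target\n\
         Before=local-fs.target umount.target\n\
         Conflicts=umount.target\n\
         \n\
         [Service]\n\
         Type=oneshot\n\
         RemainAfterExit=yes\n\
         ExecStart=/usr/bin/mount -t bcachefs {opts}{spec} {mp}\n\
         ExecStop=/usr/bin/umount {mp}\n",
        mp = mount_point,
        opts = opts,
        spec = spec,
    )
}
```

Type=oneshot with RemainAfterExit=yes keeps the unit active after the mount command exits, so a later device event that wants the same unit does not run the mount again.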

[1] Written in Rust, using getmntent()/setmntent(), modeled after systemd's fstab-generator

[2] A .mount unit would be preferred for its implicit systemd dependencies and expected behavior, but a .mount unit must depend on the .device unit of the What= in its definition. This means that one device must be selected, and that device will unconditionally block boot, defeating the purpose of redundancy. An alternative would be a .service that includes the default dependencies described in systemd.mount, so that it blocks other units according to the expected dependency rules.

[3] slices might be convenient for grouping disks

[4] man:systemd.device(5)

[5] https://www.freedesktop.org/software/systemd/man/latest/udevadm.html#-p2
