@whitslack
Last active October 21, 2023 18:47

Early Userspace without Initramfs

If you've built your own kernel with all necessary storage-controller and file-system drivers built in, then you may have no need of an early userspace environment. However, if you want to do anything non-trivial with your root file system (LVM, LUKS, etc.), then you need an early userspace to set up and mount it. The traditional mechanism for this is initramfs, but building and maintaining an initramfs image is awkward and tiresome. Initramfs is a sledgehammer when, nine times out of ten, all you need is a screwdriver. This guide details a method of booting into an early userspace environment located in an ordinary file system on a physical disk partition, where an init script in this environment in turn sets up and mounts the real root file system and pivots into it.

Setting Up the Basic Environment

In order to employ this method of booting your system, you will need a traditional (non-LVM) disk partition containing a file system that your kernel can mount without needing to load any modules. This guide will henceforth refer to this partition as the boot device.

Important: This guide assumes that your boot device is /dev/sda1 and your root device is /dev/sda3. Be sure to make all appropriate substitutions in the steps throughout this guide, lest you obliterate something you shouldn't.

  1. Format and mount the boot device.

    # mkfs.ext4 -L Boot -O ^has_journal /dev/sda1
    
    # mkdir -p /boot
    
    # mount -o noatime /dev/sda1 /boot
    
  2. Create the basic file-system hierarchy and populate /etc/fstab.

    # mkdir -p /boot/{dev,etc,mnt,proc,run,sys,tmp,var}
    
    # ln -s /run /tmp /boot/var/
    
    # cat > /boot/etc/fstab <<EOF
    /dev/pts	/dev/pts	devpts	noexec,nosuid	0 0
    /proc	/proc	proc	nodev,noexec,nosuid	0 0
    /run	/run	tmpfs	nodev,nosuid	0 0
    /sys	/sys	sysfs	nodev,noexec,nosuid	0 0
    /tmp	/tmp	tmpfs	nodev,nosuid	0 0
    EOF
    
  3. Emerge a very minimal system.

    Important: Change amd64 below to your actual CPU type, if necessary.

    # mkdir -p /boot/etc/portage/profile
    
    # ln -s /usr/portage/profiles/prefix/linux-standalone/amd64 /boot/etc/portage/make.profile
    
    # emerge --info | grep '^ACCEPT_KEYWORDS=' >> /boot/etc/portage/profile/make.defaults
    
    # echo 'FEATURES="nodoc noinfo noman"' >> /boot/etc/portage/profile/make.defaults
    
    # cat > /boot/etc/portage/profile/packages <<EOF
    -*app-arch/bzip2
    -*app-arch/gzip
    -*app-arch/tar
    -*app-arch/xz-utils
    -*app-shells/bash:0
    -*net-misc/rsync
    -*net-misc/wget
    -*sys-apps/coreutils
    -*sys-apps/diffutils
    -*sys-apps/file
    -*>=sys-apps/findutils-4.4
    -*sys-apps/gawk
    -*sys-apps/grep
    -*sys-apps/less
    -*sys-apps/man-pages
    -*sys-apps/net-tools
    -*sys-apps/sed
    -*sys-apps/which
    -*sys-devel/binutils
    -*sys-devel/gcc
    -*sys-devel/gnuconfig
    -*sys-devel/make
    -*>=sys-devel/patch-2.6.1
    -*sys-process/procps
    -*sys-process/psmisc
    -*virtual/editor
    -*virtual/man
    -*virtual/os-headers
    -*virtual/package-manager
    -*virtual/pager
    -*virtual/service-manager
    -*virtual/ssh
    
    *sys-libs/glibc
    EOF
    
    # cat >> /boot/etc/portage/profile/package.use << EOF
    sys-apps/busybox -static
    sys-apps/util-linux -cramfs
    EOF
    
    # emerge --root=/boot --config-root=/boot @system
    
  4. Create the init scripts that will boot your system. We begin with a basic setup here and will add goodies in later sections of this guide.

    Important: Change /dev/sda3 below to your actual root device.

    # cat > /boot/init.sh <<'EOF'
    #!/bin/busybox sh
    set -e
    
    for each in /init.d/* ; do
    	. "${each}"
    done
    EOF
    
    # chmod 0700 /boot/init.sh
    
    # mkdir /boot/init.d
    
    # cat > /boot/init.d/00-mounts <<EOF
    mkdir /dev/pts /dev/shm
    mount /dev/pts
    mount /proc
    mount /run
    mount /sys
    mount /tmp
    EOF
    
    # cat > /boot/init.d/40-printk << EOF
    echo 1 > /proc/sys/kernel/printk
    EOF
    
    # cat > /boot/init.d/50-mountroot <<EOF
    mount --ro /dev/sda3 /mnt
    EOF
    
    # cat > /boot/init.d/69-printk << EOF
    echo 7 > /proc/sys/kernel/printk
    EOF
    
    # cat > /boot/init.d/99-pivotroot <<EOF
    umount /tmp /sys /run /proc /dev/pts
    mount --move /dev /mnt/dev
    cd /mnt
    pivot_root . boot
    exec chroot . /sbin/init < dev/console > dev/console 2>&1
    EOF
    
  5. Install your kernel.

    Important: This guide assumes that you have set CONFIG_DEVTMPFS_MOUNT=y in your kernel configuration. If you have not, you must set it and recompile your kernel, or you will have problems.

    # make -C /usr/src/linux install
    
    # ln -sr /boot/vmlinuz{-*,}
    
  6. Install a bootloader. Extlinux is simple and works well.

    # emerge -n sys-boot/syslinux
    
    # mkdir /boot/extlinux
    
    # extlinux --install /boot/extlinux
    
    # cat /usr/share/syslinux/mbr.bin > /dev/sda
    
    # cat > /boot/extlinux/extlinux.conf <<EOF
    DEFAULT linux
    
    LABEL linux
    	KERNEL /vmlinuz
    	APPEND root=/dev/sda1 rootwait init=/init.sh
    EOF
    

At this point, you may wish to reboot your system to your new boot device, to test that your new early userspace environment is working. This may require marking the boot partition as "active" (using fdisk or similar) and/or reconfiguring your BIOS settings to change your default boot device. These steps are outside the scope of this guide.
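Before rebooting, it may be worth confirming the one kernel option this guide depends on. The following is a minimal sketch, assuming your kernel configuration lives at /usr/src/linux/.config (inferred from the make -C /usr/src/linux install step); the function name has_devtmpfs_mount is made up for illustration.

```shell
# has_devtmpfs_mount: succeed only if the given kernel .config contains the
# exact line CONFIG_DEVTMPFS_MOUNT=y. The function name is illustrative.
has_devtmpfs_mount() {
	config=${1:-/usr/src/linux/.config}   # assumed default location
	grep -qx 'CONFIG_DEVTMPFS_MOUNT=y' "$config"
}
```

It could be used as `has_devtmpfs_mount || echo 'Set CONFIG_DEVTMPFS_MOUNT=y and rebuild' >&2`. Matching the whole line (-x) keeps the check from being fooled by a "# CONFIG_DEVTMPFS_MOUNT is not set" comment.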

If all goes well, you should not observe any difference versus your traditional boot. However, you now have an environment capable of running commands before the root file system is mounted, meaning you can do fun things like full-disk encryption.
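The init.sh from step 4 sources every file in /init.d rather than executing it, and the glob expands in sorted order, which is why the fragments carry numeric prefixes. The sketch below reproduces that mechanism in a throwaway directory; the fragment names and the order variable are invented for the demonstration.

```shell
# run_fragments: source every file in a directory, in glob (sorted) order,
# the same way init.sh walks /init.d.
run_fragments() {
	dir=$1
	for each in "$dir"/* ; do
		. "$each"
	done
}

# Build three throwaway fragments, deliberately created out of order.
demo=$(mktemp -d)
echo 'order="${order}50 "' > "$demo/50-second"
echo 'order="${order}00 "' > "$demo/00-first"
echo 'order="${order}99 "' > "$demo/99-last"

order=
run_fragments "$demo"
echo "sourced in order: ${order}"
rm -r "$demo"
```

Because the fragments are sourced into one shell, a variable set by an early fragment is still visible to a later one. Fragments that start a background process and fragments that stop it again can rely on this.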

Interactive Rescue Environment

It may not be immediately obvious, but you now have almost everything you need for an interactive rescue environment, which you can optionally boot into to do emergency maintenance tasks such as running fsck on your root file system. You just need to assemble a few additional pieces.

  1. Symlink /sbin/init to BusyBox so there's a real init for the kernel to start.

    # ln -s ../bin/busybox /boot/sbin/init
    
  2. Create an inittab.

    # cat > /boot/etc/inittab <<EOF
    ::sysinit:/bin/busybox mkdir /dev/pts /dev/shm
    ::sysinit:/bin/busybox mount -a
    
    ::respawn:-/bin/busybox sh
    
    ::shutdown:/bin/busybox killall5
    ::shutdown:/bin/busybox umount -a -r
    EOF
    
  3. Add an option to the bootloader configuration for booting into the rescue environment.

    # cat >> /boot/extlinux/extlinux.conf <<EOF
    LABEL rescue
    	KERNEL /vmlinuz
    	APPEND root=/dev/sda1 rootwait
    EOF
    

    Notice that the only difference between this new rescue label and the default linux label is the lack of init=/init.sh in the kernel command line. The kernel executes /sbin/init by default.

  4. You may wish to install additional utilities for diagnosing problems with your root file system.

    Note: The packages shown here are just examples; you could install packages specific to the file systems you use.

    # emerge --root=/boot --config-root=/boot sys-fs/e2fsprogs sys-fs/xfsprogs
    

To enter into your new rescue environment when booting, hold down the Shift or Alt key (or engage Caps Lock or Scroll Lock) before the kernel loads, and a boot: prompt will appear. Type rescue and press Enter.
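The inittab from step 2 may look odd with its leading double colons. BusyBox init reads each line as id:runlevels:action:process, where the id field names a controlling tty (empty means the console), the runlevels field is ignored, and a leading "-" on the process requests a login shell. The sketch below, whose function name is illustrative, lists the action and process of each entry.

```shell
# list_inittab: print "action<TAB>process" for each entry in a BusyBox-style
# inittab (fields: id:runlevels:action:process).
# The function name is illustrative, not part of BusyBox.
list_inittab() {
	while IFS=: read -r id runlevels action process ; do
		case $id in '#'*) continue ;; esac   # skip comment lines
		[ -n "$action" ] || continue         # skip blank lines
		printf '%s\t%s\n' "$action" "$process"
	done < "${1:-/boot/etc/inittab}"
}
```

Running `list_inittab /boot/etc/inittab` shows which commands run at sysinit, which are respawned, and which run at shutdown.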

Networking Support with DHCP

You can add networking support to your early userspace environment fairly easily. This is useful if you need to mount network shares or you wish to allow remote control of the environment over SSH.

Important: Change eth0 in the scripts below to your actual network device name. Note that there is no udev in the early userspace environment, so the network device name will be whatever the kernel assigns, not the persistent name that udev assigns later in the boot process.

  1. Symlink /etc/resolv.conf to /run/resolv.conf, as /etc may be read-only during boot.

    # ln -s /run/resolv.conf /boot/etc/
    
  2. Add an init script to bring up your network device and run BusyBox's DHCP client.

    # cat > /boot/init.d/10-network <<'EOF'
    ip link set up dev eth0
    
    udhcpc -f -i eth0 &
    pid_udhcpc=$!
    EOF
    

    Note: If you need to send a host name and/or client ID, perhaps to cause your DHCP server to return a fixed IP address mapping, you can add to the udhcpc command line (before the ampersand) -x hostname:<your-hostname> and/or -x 0x3d:<your-client-ID> (with no colons in the client ID, just hex digits, and no angle brackets).

  3. Add an init script to stop the DHCP client and deconfigure the network interface, so that your later boot scripts can start with a clean slate.

    # cat > /boot/init.d/89-network <<'EOF'
    kill "${pid_udhcpc}"
    wait "${pid_udhcpc}" || :
    
    ip -4 addr flush dev eth0
    ip link set down dev eth0
    EOF
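
The start and stop fragments above follow a common shell pattern: record the PID of a background job in one place, then kill and reap it in another. The sketch below shows the pattern in isolation, with sleep standing in for udhcpc; the function and variable names are illustrative.

```shell
# run_and_stop: start a background job, stop it, and reap it.
# "sleep 300" is a stand-in for a long-running daemon such as udhcpc.
run_and_stop() (
	set -e
	sleep 300 &
	pid_daemon=$!               # PID of the most recent background job

	kill "${pid_daemon}"        # sends SIGTERM
	# wait returns the job's exit status, which is non-zero for a job killed
	# by a signal; the "||" keeps set -e from aborting the script here.
	status=0
	wait "${pid_daemon}" || status=$?
	echo "stand-in daemon reaped with status ${status}"
)
run_and_stop
```

This is why the stop fragment ends its wait with "|| :". init.sh runs with set -e, so the non-zero status that wait reports for a killed job would otherwise abort the boot script.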
    

Remote Control over SSH

It is possible to run an SSH server in the early userspace environment. This is useful if you need to enter a passphrase to unlock an encrypted storage device but may not always have physical access to the console.

  1. Emerge the Dropbear SSH server.

    # echo 'net-misc/dropbear -shadow -zlib' >> /boot/etc/portage/package.use
    
    # emerge --root=/boot --config-root=/boot net-misc/dropbear
    
  2. Install your host keys, converting them to Dropbear's format.

    # mkdir /boot/etc/dropbear
    
    # /boot/usr/bin/dropbearconvert openssh dropbear /etc/ssh/ssh_host_dsa_key /boot/etc/dropbear/dropbear_dss_host_key
    
    # /boot/usr/bin/dropbearconvert openssh dropbear /etc/ssh/ssh_host_rsa_key /boot/etc/dropbear/dropbear_rsa_host_key
    
    # /boot/usr/bin/dropbearconvert openssh dropbear /etc/ssh/ssh_host_ecdsa_key /boot/etc/dropbear/dropbear_ecdsa_host_key
    
  3. Add init scripts to start and stop the Dropbear server.

    # cat > /boot/init.d/11-dropbear <<'EOF'
    dropbear -F -P '' -I 60 &
    pid_dropbear=$!
    EOF
    
    # cat > /boot/init.d/88-dropbear <<'EOF'
    kill "${pid_dropbear}"
    wait "${pid_dropbear}" || :
    EOF
    
  4. Copy your authorized_keys file.

    # mkdir -p /boot/root/.ssh
    
    # cp -a ~/.ssh/authorized_keys /boot/root/.ssh/
    
  5. Install the default user and group manifests.

    # cp -a /usr/share/baselayout/{passwd,group} /boot/etc/
    
  6. Change the root user's shell to /bin/sh, since Bash is not installed.

    # ln -s busybox /boot/bin/sh
    
    # chsh --root /boot --shell /bin/sh root
    
  7. Add an init script to pause the boot process at a prompt, to allow for remote access.

    # cat > /boot/init.d/49-pause <<EOF
    read -r -p 'Press Enter to continue boot...'
    EOF
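
The three dropbearconvert commands in step 2 differ only in the key type, and a system may not have all three host keys. As a hedged sketch, the loop below converts only the keys that exist. The function name and the DROPBEARCONVERT override (useful for a dry run with echo) are assumptions for illustration; the paths and the dsa-to-dss naming are taken from step 2.

```shell
# convert_host_keys: convert each OpenSSH host key found in directory $1
# into Dropbear format in directory $2, skipping key types that are absent.
# Set DROPBEARCONVERT=echo to print the commands instead of running them.
convert_host_keys() {
	src=$1 dst=$2
	for pair in dsa:dss rsa:rsa ecdsa:ecdsa ; do
		ssh_type=${pair%%:*}      # name used in the OpenSSH file name
		db_type=${pair##*:}       # name used in the Dropbear file name
		key="$src/ssh_host_${ssh_type}_key"
		[ -f "$key" ] || continue
		${DROPBEARCONVERT:-/boot/usr/bin/dropbearconvert} openssh dropbear \
			"$key" "$dst/dropbear_${db_type}_host_key"
	done
}
```

For the layout in this guide, the call would be `convert_host_keys /etc/ssh /boot/etc/dropbear`.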
    

Full-Disk Encryption with LUKS

The impetus for all of this, of course, is to allow for complex root file system mounts, which cannot be achieved simply with kernel command-line arguments. The following section of this guide details how to convert an existing root partition in place (i.e., preserving the existing file system and its contents) to an encrypted partition and how to set up the early userspace environment to prompt for the passphrase to mount the root file system contained in this partition.

  1. Before you begin, verify that your disk has free space available to shift the start of your root partition by at least 1032 sectors toward the beginning of the disk.

    # sfdisk -lq /dev/sda
    Device     Boot    Start        End    Sectors  Size Id Type
    /dev/sda1  *        2048    1048575    1046528  511M 83 Linux
    /dev/sda2        1048576   16777215   15728640  7.5G 82 Linux swap / Solaris
    /dev/sda3       16777216 2147483647 2130706432 1016G 83 Linux
    

    Shown above is an example of a typical partition layout, with a small boot partition first, followed by a swap partition, followed by the large root partition. In this case, the swap partition can be deleted and created anew with a slightly smaller size, to make room for expanding the root partition into the vacated space.

    Important: If your partition layout lacks sufficient free space to relocate your root partition by at least 1032 sectors closer to the beginning of your disk, then do not continue with this guide!

  2. Emerge cryptsetup.

    # cat >> /boot/etc/portage/package.use <<EOF
    sys-fs/cryptsetup -gcrypt kernel
    sys-fs/lvm2 -thin device-mapper-only
    EOF
    
    # echo 'sys-apps/baselayout-2.2' >> /boot/etc/portage/profile/package.provided
    
    # emerge --root=/boot --config-root=/boot sys-fs/cryptsetup
    
  3. Determine the number of sectors needed for the LUKS header.

    # dd if=/dev/null of=/tmp/tmp.img bs=1M seek=64
    
    # LOOPDEV=$(losetup -f --show /tmp/tmp.img)
    
    # /boot/sbin/cryptsetup luksFormat -q --align-payload 1 "${LOOPDEV}"
    Enter passphrase: [press Enter here]
    
    # /boot/sbin/cryptsetup luksDump "${LOOPDEV}" | grep '^Payload offset:'
    Payload offset: 2056
    
    # losetup -d "${LOOPDEV}"
    
    # rm /tmp/tmp.img
    

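    Note: If you do not have enough space to grow your root partition by the number of sectors reported as the "Payload offset", then repeat this step, but add --cipher aes-cbc-essiv:sha256 --key-size 128 to the luksFormat command. These parameters should result in the smallest possible LUKS header. If you still do not have enough space, then you must not continue with this guide!

    Note: The "Payload offset" line shown above is what luksDump prints for a LUKS1 header. Newer cryptsetup releases create LUKS2 headers by default, which are laid out differently and are not covered by the sector counts in this guide, so check which format your luksFormat produces before relying on them.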

  4. Before proceeding, make a full backup of your file system to an external disk. Even if you perform all of the following steps perfectly, a power glitch or a kernel panic during the encryption process will trash your file system irreparably. You have been warned!

  5. Rewrite the 50-mountroot init script.

    # cat > /boot/init.d/50-mountroot <<EOF
    until cryptsetup luksOpen /dev/sda3 root ; do : ; done
    mount --ro /dev/mapper/root /mnt
    EOF
    
  6. If you created 49-pause earlier, you should delete it now, as it is no longer useful.

    # rm -f /boot/init.d/49-pause
    
  7. Reboot into your shiny new interactive rescue environment. You cannot perform the remaining steps while your root file system is mounted.

  8. Use sfdisk to extend your root partition toward the beginning of the disk by exactly the number of sectors reported earlier by luksDump as the "Payload offset". Also, change its type to e8, which is the standard partition type for a LUKS partition.

    Important: The numbers shown below are examples only. You will need to use the actual numbers reported by sfdisk for your disk, decreasing the size of the swap partition, decreasing the start of the root partition, and increasing the size of the root partition, all by the exact number of sectors reported earlier as the "Payload offset".

    If you have ANY DOUBTS about what you are doing, STOP NOW!

    # sfdisk /dev/sda
    
    Welcome to sfdisk (util-linux 2.27.1).
    Changes will remain in memory only, until you decide to write them.
    Be careful before using the write command.
    
    Checking that no-one is using this disk right now ... OK
    
    Disk /dev/sda: 1 TiB, 1099511627776 bytes, 2147483648 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0xf04ad805
    
    Old situation:
    
    Device     Boot    Start        End    Sectors  Size Id Type
    /dev/sda1  *        2048    1048575    1046528  511M 83 Linux
    /dev/sda2        1048576   16777215   15728640  7.5G 82 Linux swap / Solaris
    /dev/sda3       16777216 2147483647 2130706432 1016G 83 Linux
    

    Write down the "Old situation" in case you need to go back to it.

    Type 'help' to get more information.
    
    >>> 2048,1046528,83,*
    Created a new DOS disklabel with disk identifier 0xa63ad8c1.
    Created a new partition 1 of type 'Linux' and of size 511 MiB.
    /dev/sda1 :         2048      1048575 (511M) Linux
    /dev/sda2: 1048576,15726584,82
    Created a new partition 2 of type 'Linux swap / Solaris' and of size 7.5 GiB.
    /dev/sda2 :      1048576     16775159 (7.5G) Linux swap / Solaris
    /dev/sda3: 16775160,2130708488,e8
    Created a new partition 3 of type 'Unknown' and of size 1016 GiB.
    /dev/sda3 :     16775160   2147483647 (1016G) unknown
    /dev/sda4: 0,0
    Ignoring partition.
    All partitions used.
    
    New situation:
    
    Device     Boot    Start        End    Sectors  Size Id Type
    /dev/sda1  *        2048    1048575    1046528  511M 83 Linux
    /dev/sda2        1048576   16775159   15726584  7.5G 82 Linux swap / Solaris
    /dev/sda3       16775160 2147483647 2130708488 1016G e8 unknown
    

    Verify that the ending sector of your root partition is the same in the "New situation" as in the "Old situation" and that its size has increased by the "Payload offset" amount. Also verify that its type is now e8.

    Do you want to write this to disk? [Y]es/[N]o: y
    
    The partition table has been altered.
    Calling ioctl() to re-read partition table.
    Syncing disks.
    
  9. If you shrank a swap partition, you must run mkswap to reinitialize its header with the new size.

    # mkswap /dev/sda2
    
  10. Set up a loop device pointing at your file system, which is now at a positive offset into the partition.

    # losetup -f --show --offset $((2056*512)) /dev/sda3
    /dev/loop0
    
  11. Verify that the loop device is pointing at your file system.

    # blkid /dev/loop0
    /dev/loop0: UUID="6f5401f8-12df-4e17-9935-5478f161d51a" TYPE="ext4"
    

    If you do not see a TYPE=, then you've made a mistake somewhere.

  12. Format the LUKS partition. Use the same parameters to luksFormat as you used earlier when you determined the "Payload offset".

    Important: If you do not use the same parameters to luksFormat as you used earlier, you may accidentally overwrite the beginning of your file system, which would be Very Bad.

    # cryptsetup luksFormat --align-payload 1 /dev/sda3
    WARNING!
    ========
    This will overwrite data on /dev/sda3 irrevocably.
    
    Are you sure? (Type uppercase yes): YES
    Enter passphrase: [type a strong passphrase here]
    Verify passphrase: [repeat the same passphrase here]
    
  13. Open the LUKS partition.

    # cryptsetup luksOpen /dev/sda3 root
    Enter passphrase for /dev/sda3: [type your passphrase here]
    
  14. Encrypt your file system in place.

    # dd if=/dev/loop0 of=/dev/mapper/root bs=512
    

    Go have a nap. This will take several hours. I hope you have stable power.

  15. Verify that the mapped device contains your file system.

    # blkid /dev/mapper/root
    /dev/mapper/root: UUID="6f5401f8-12df-4e17-9935-5478f161d51a" TYPE="ext4"
    

    The UUID and TYPE should be the same as reported by blkid earlier.

  16. Reboot and cross your fingers.

    # reboot
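
The sector arithmetic in step 8 is easy to get wrong by hand, so it may help to compute the sfdisk input lines rather than type them from memory. The sketch below assumes the same three-partition layout as the example (swap immediately before root); the function name is illustrative.

```shell
# shifted_layout: print sfdisk input lines (start,size,type) for the swap and
# root partitions after moving the start of root toward the front of the disk.
# Arguments: swap_start swap_size root_start root_size payload_offset
shifted_layout() {
	swap_start=$1 swap_size=$2 root_start=$3 root_size=$4 offset=$5
	# Swap keeps its start and shrinks by the offset.
	echo "${swap_start},$((swap_size - offset)),82"
	# Root starts earlier and grows by the same amount, so its end is unchanged.
	echo "$((root_start - offset)),$((root_size + offset)),e8"
}

# Numbers from the example "Old situation", with a payload offset of 2056:
shifted_layout 1048576 15728640 16777216 2130706432 2056
```

With the example numbers this prints 1048576,15726584,82 and 16775160,2130708488,e8, matching the lines entered at the sfdisk prompt. The same offset, multiplied by 512, gives the byte offset passed to losetup in step 10.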
    

Remote Unlocking of Encrypted Root

So now your system is encrypted and prompts you for the passphrase during boot, but what happens if the power flickers while you're away and without physical access to the console? You'd like to be able to SSH in and enter the passphrase to get your system booted up again. Well, you can.

  1. Emerge screen.

    # emerge --root=/boot --config-root=/boot app-misc/screen
    
  2. Rewrite the 50-mountroot init script.

    # cat > /boot/init.d/50-mountroot <<EOF
    openvt -sw screen busybox sh -c 'until cryptsetup luksOpen /dev/sda3 root ; do : ; done' || :
    chvt 1
    deallocvt
    mount --ro /dev/mapper/root /mnt
    EOF
    
  3. Change the root user's shell to /usr/bin/screen.

    # chsh --root /boot --shell /usr/bin/screen root
    

Now reboot. When you see the passphrase prompt, try SSH'ing in from another computer. You will see the same passphrase prompt. Enter the passphrase on either machine to continue the boot process.
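The 50-mountroot script relies on a small retry idiom: until runs the command repeatedly and only falls through once it succeeds, so a mistyped passphrase simply produces another prompt. The sketch below shows the idiom with try_unlock, a hypothetical stand-in for cryptsetup luksOpen that succeeds on its third call.

```shell
# try_unlock: hypothetical stand-in for "cryptsetup luksOpen /dev/sda3 root".
# It fails twice and succeeds on the third call.
attempts=0
try_unlock() {
	attempts=$((attempts + 1))
	[ "${attempts}" -ge 3 ]
}

# Same shape as the loop in 50-mountroot: retry until the command succeeds.
until try_unlock ; do : ; done
echo "unlocked after ${attempts} attempts"
```

Because the loop's condition is the command itself, set -e does not abort the script on a failed attempt.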

@desultory

How is this better than using an initramfs? It seems like pretty much all of the work required to make one, without actually getting one. You can just make a similar minimal environment and use CONFIG_INITRAMFS_SOURCE to bake it into the kernel. If you don't need kernel modules, this is especially simple and straightforward.

@whitslack
Author

@desultory: Here are some reasons that I prefer to have my early userspace be in a real file system rather than an initramfs image:

  • The early userspace is in a real file system. This means it can be mounted read/write on a running system, and ordinary tools can be used to update it in place rather than needing to write scripts to compose it a la mkinitramfs. I especially appreciate this convenience, as I can simply set ROOT=/mnt/early PORTAGE_CONFIG_ROOT=/mnt/early and use Gentoo Portage to upgrade all the packages I have in my early userspace environment, with nothing further to do after it completes. Of course, you could build your initramfs from a staging tree rather than using mkinitramfs scripts, but if you're going to have a staging tree anyway, you might as well make it its own filesystem and boot directly into it.
  • initramfs requires an awkward method for switching to the real root file system. Since it doesn't use a real file system but instead extracts the initramfs cpio archive directly into the rootfs, an in-memory file system akin to tmpfs, the only way to recover the RAM used by the extracted files is to delete them, which you can only do before you've mounted the real root file system at /. This typically necessitates a kludge wherein a single, purpose-built executable performs the recursive delete, mount, and exec. By contrast, using a real file system as your early userspace means you can simply call pivot_root to move the real root file system to / and move the early userspace to a mountpoint beneath the real root file system, whereafter you can unmount it if you desire or leave it mounted since its files aren't consuming any RAM like initramfs files do (until they're deleted).
  • In the case of using CONFIG_INITRAMFS_SOURCE, you have to rebuild your kernel image whenever you make any change to your early userspace environment. Not a big deal, but it's another step that can be carelessly forgotten.
  • My early userspace file system is actually on a mdraid mirror set, so I can simply edit my kernel command line to boot a different mirror of my early userspace file system if my first drive goes bad.

All that said, if you still prefer for your boot loader to load a compressed cpio archive into RAM and then for your kernel to decompress and extract that archive into your rootfs, and then for your startup scripts to eventually run a kludge tool to delete all the files that the kernel extracted into rootfs before mounting the real root file system there, then by all means, you should do just that! You have options.

@desultory

I found this from https://wiki.gentoo.org/wiki/Talk:Dm-crypt_full_disk_encryption btw

I agree working with a plain filesystem is preferable, that is what I do, but I simply set CONFIG_INITRAMFS_SOURCE in my kernel config and it compiles that directory straight into it. That works with emerge --root and emerge --config-root but I generally only need to use emerge --root because using my system portage config is fine. I never actually pack the CPIO, I simply let the kernel do it.

Yeah, the switch is awkward, having to exec switch_root /root/path /init/path, but is this really that different from pivot_root? Would your method work with an EFI stub kernel, or does it require a bootloader? When using an initramfs, is the whole environment really left behind? I thought it was cleared, I don't see /dev/ram0 after I switch_root.

The main reason I prefer to embed an initramfs into my kernel is so it can all be signed and secure booted. It makes it easy to ensure that my initramfs, kernel, and kernel command line have not been tampered with.

Also feel free to check this: https://github.com/desultory/custom-initramfs/tree/main
It builds an initramfs into a dir using lddtree

@whitslack
Author

I simply set CONFIG_INITRAMFS_SOURCE in my kernel config and it compiles that directory straight into it.

So you don't have a separate initramfs file for your boot loader to read into memory? That's an improvement over the old way, for sure.

Yeah, the switch is awkward, having to exec switch_root /root/path /init/path, but is this really that different from pivot_root?

The two methods are very different. pivot_root is one system call that atomically swaps two mountpoints in the VFS. switch_root is an abomination (IMHO).

Would your method work with an EFI stub kernel, or does it require a bootloader?

Any method of loading the kernel into RAM and jumping the CPU into its entry point should work, provided you can specify the kernel command line.

When using an initramfs, is the whole environment really left behind?

Yes and no. One of the tasks that switch_root does is to recursively delete all the files that the kernel unpacked into the rootfs.

I thought it was cleared, I don't see /dev/ram0 after I switch_root.

/dev/ram0 hasn't been used in ages. That's how the old initrd mechanism worked. initramfs simply extracts files from the cpio archive into the rootfs, which is the tmpfs-like, in-memory file system that the kernel mounts at / at startup.

The main reason I prefer to embed an initramfs into my kernel is so it can all be signed and secure booted. It makes it easy to ensure that my initramfs, kernel, and kernel command line have not been tampered with.

That's an excellent reason to do it. If I were getting into signing my kernel, I would arrive at the same conclusion as you have regarding embedding the initramfs image into the kernel.

@desultory

So you don't have a separate initramfs file for your boot loader to read into memory? That's an improvement over the old way, for sure.

Yeah, I think it makes more sense to do this way, and it means I don't even need to use a bootloader. I end up booting from a single file.

The two methods are very different. pivot_root is one system call that atomically swaps two mountpoints in the VFS. switch_root is an abomination (IMHO).

What's wrong with it? I don't know much about what happens under the hood.

Yes and no. One of the tasks that switch_root does is to recursively delete all the files that the kernel unpacked into the rootfs.

So what is left behind?

/dev/ram0 hasn't been used in ages. That's how the old initrd mechanism worked. initramfs simply extracts files from the cpio archive into the rootfs, which is the tmpfs-like, in-memory file system that the kernel mounts at / at startup.

Good to know. Is it tmpfs-like, or is it just a tmpfs? How does it differ?

That's an excellent reason to do it. If I were getting into signing my kernel, I would arrive at the same conclusion as you have regarding embedding the initramfs image into the kernel.

Signing your kernel and setting up secure boot isn't too hard: https://wiki.gentoo.org/wiki/Secure_Boot

I think dist-kernel also supports automatic secure boot signing

@whitslack
Author

switch_root is an abomination (IMHO).

What's wrong with it? I don't know much about what happens under the hood.

switch_root violates the time-honored Unix tradition of each command-line utility doing one thing and doing it well. switch_root performs a whole sequence of syscalls that are grouped into a single executable only out of a necessity to avoid exec'ing any new processes while the sequence is in progress. It's a hack.

So what is left behind?

After switch_root execs the new init process (thereby closing the last file descriptions on unlinked inodes in the rootfs and allowing them and their data blocks to be purged from memory), nothing.

Good to know. Is it tmpfs-like, or is it just a tmpfs? How does it differ?

It's not a tmpfs. If you examine /proc/mounts, you can observe that the type of the file system initially mounted at / is rootfs, not tmpfs. That said, I strongly suspect the implementation of rootfs inherits from that of tmpfs with few if any differences other than name.
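As a rough way to look at this yourself, the sketch below prints the file-system type of every entry mounted at / in a mounts table. The function name is illustrative. Whether a rootfs line shows up at all may depend on the kernel version, so an empty or single-line result is inconclusive.

```shell
# root_fs_types: print the type of each entry mounted at "/" in a mounts
# table (fields: device mountpoint type options dump pass).
root_fs_types() {
	awk '$2 == "/" { print $3 }' "${1:-/proc/mounts}"
}
```

Calling `root_fs_types` with no argument reads /proc/mounts; passing a saved copy of the file works the same way.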

Signing your kernel and setting up secure boot isn't too hard

Sure, but I have no reason to do it. None of my Linux systems have UEFI firmwares.
