@whitslack
Last active October 21, 2023 18:47
Early Userspace without Initramfs

If you've built your own kernel with all necessary storage-controller and file-system drivers built in, then you may have no need of an early userspace environment. However, if you want to do anything non-trivial with your root file system (LVM, LUKS, etc.), then you need an early userspace to set up and mount it. The traditional mechanism for this is initramfs, but building and maintaining an initramfs image is awkward and tiresome. Initramfs is a sledgehammer when, nine times out of ten, all you need is a screwdriver. This guide details a method of booting into an early userspace environment located in an ordinary file system on a physical disk partition, where an init script in this environment in turn sets up and mounts the real root file system and pivots into it.

Setting Up the Basic Environment

In order to employ this method of booting your system, you will need a traditional (non-LVM) disk partition containing a file system that your kernel can mount without needing to load any modules. This guide will henceforth refer to this partition as the boot device.

Important: This guide assumes that your boot device is /dev/sda1 and your root device is /dev/sda3. Be sure to make all appropriate substitutions in the steps throughout this guide, lest you obliterate something you shouldn't.

  1. Format and mount the boot device.

    # mkfs.ext4 -L Boot -O ^has_journal /dev/sda1
    
    # mkdir -p /boot
    
    # mount -o noatime /dev/sda1 /boot
    
  2. Create the basic file-system hierarchy and populate /etc/fstab.

    # mkdir -p /boot/{dev,etc,mnt,proc,run,sys,tmp,var}
    
    # ln -s /run /tmp /boot/var/
    
    # cat > /boot/etc/fstab <<EOF
    /dev/pts	/dev/pts	devpts	noexec,nosuid	0 0
    /proc	/proc	proc	nodev,noexec,nosuid	0 0
    /run	/run	tmpfs	nodev,nosuid	0 0
    /sys	/sys	sysfs	nodev,noexec,nosuid	0 0
    /tmp	/tmp	tmpfs	nodev,nosuid	0 0
    EOF
    
  3. Emerge a very minimal system.

    Important: Change amd64 below to your actual CPU type, if necessary.

    # mkdir -p /boot/etc/portage/profile
    
    # ln -s /usr/portage/profiles/prefix/linux-standalone/amd64 /boot/etc/portage/make.profile
    
    # emerge --info | grep '^ACCEPT_KEYWORDS=' >> /boot/etc/portage/profile/make.defaults
    
    # echo 'FEATURES="nodoc noinfo noman"' >> /boot/etc/portage/profile/make.defaults
    
    # cat > /boot/etc/portage/profile/packages <<EOF
    -*app-arch/bzip2
    -*app-arch/gzip
    -*app-arch/tar
    -*app-arch/xz-utils
    -*app-shells/bash:0
    -*net-misc/rsync
    -*net-misc/wget
    -*sys-apps/coreutils
    -*sys-apps/diffutils
    -*sys-apps/file
    -*>=sys-apps/findutils-4.4
    -*sys-apps/gawk
    -*sys-apps/grep
    -*sys-apps/less
    -*sys-apps/man-pages
    -*sys-apps/net-tools
    -*sys-apps/sed
    -*sys-apps/which
    -*sys-devel/binutils
    -*sys-devel/gcc
    -*sys-devel/gnuconfig
    -*sys-devel/make
    -*>=sys-devel/patch-2.6.1
    -*sys-process/procps
    -*sys-process/psmisc
    -*virtual/editor
    -*virtual/man
    -*virtual/os-headers
    -*virtual/package-manager
    -*virtual/pager
    -*virtual/service-manager
    -*virtual/ssh
    
    *sys-libs/glibc
    EOF
    
    # cat >> /boot/etc/portage/profile/package.use << EOF
    sys-apps/busybox -static
    sys-apps/util-linux -cramfs
    EOF
    
    # emerge --root=/boot --config-root=/boot @system
    
  4. Create the init scripts that will boot your system. We begin with a basic setup here and will add goodies in later sections of this guide.

    Important: Change /dev/sda3 below to your actual root device.

    # cat > /boot/init.sh <<'EOF'
    #!/bin/busybox sh
    set -e
    
    for each in /init.d/* ; do
    	. "${each}"
    done
    EOF
    
    # chmod 0700 /boot/init.sh
    
    # mkdir /boot/init.d
    
    # cat > /boot/init.d/00-mounts <<EOF
    mkdir /dev/pts /dev/shm
    mount /dev/pts
    mount /proc
    mount /run
    mount /sys
    mount /tmp
    EOF
    
    # cat > /boot/init.d/40-printk << EOF
    echo 1 > /proc/sys/kernel/printk
    EOF
    
    # cat > /boot/init.d/50-mountroot <<EOF
    mount --ro /dev/sda3 /mnt
    EOF
    
    # cat > /boot/init.d/69-printk << EOF
    echo 7 > /proc/sys/kernel/printk
    EOF
    
    # cat > /boot/init.d/99-pivotroot <<EOF
    umount /tmp /sys /run /proc /dev/pts
    mount --move /dev /mnt/dev
    cd /mnt
    pivot_root . boot
    exec chroot . /sbin/init < dev/console > dev/console 2>&1
    EOF
    
  5. Install your kernel.

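    Important: This guide assumes that you have set CONFIG_DEVTMPFS_MOUNT=y in your kernel configuration, so that the kernel mounts devtmpfs on /dev before it runs init. The init scripts depend on this (00-mounts creates directories under /dev, and 99-pivotroot moves the /dev mount). If the option is not set, you must set it and recompile your kernel.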

    # make -C /usr/src/linux install
    
    # ln -sr /boot/vmlinuz{-*,}
    
  6. Install a bootloader. Extlinux is simple and works well.

    # emerge -n sys-boot/syslinux
    
    # mkdir /boot/extlinux
    
    # extlinux --install /boot/extlinux
    
    # cat /usr/share/syslinux/mbr.bin > /dev/sda
    
    # cat > /boot/extlinux/extlinux.conf <<EOF
    DEFAULT linux
    
    LABEL linux
    	KERNEL /vmlinuz
    	APPEND root=/dev/sda1 rootwait init=/init.sh
    EOF
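
One detail worth checking whenever a script is written through a here-document, as in step 4: with an unquoted delimiter (<<EOF) the shell expands $variables while it writes the file, whereas with a quoted delimiter (<<'EOF') they are written literally and expanded only when the script runs. The sketch below shows the difference. It writes to a scratch directory from mktemp rather than to /boot, and the file names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare unquoted and quoted here-document delimiters.
# Everything is written to a scratch directory, not to /boot.
scratch=$(mktemp -d)
each=''    # empty in the writing shell, as it would be at the root prompt

# Unquoted delimiter: ${each} is expanded now, by the shell writing the file.
cat > "$scratch/unquoted.sh" <<EOF
. "${each}"
EOF

# Quoted delimiter: ${each} is written literally, to be expanded at run time.
cat > "$scratch/quoted.sh" <<'EOF'
. "${each}"
EOF

unquoted=$(cat "$scratch/unquoted.sh")
quoted=$(cat "$scratch/quoted.sh")
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
rm -r "$scratch"
```

Any init script that refers to a variable or to $! needs the quoted form to survive being written to disk.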
    

At this point, you may wish to reboot your system to your new boot device, to test that your new early userspace environment is working. This may require marking the boot partition as "active" (using fdisk or similar) and/or reconfiguring your BIOS settings to change your default boot device. These steps are outside the scope of this guide.

If all goes well, you should not observe any difference versus your traditional boot. However, you now have an environment capable of running commands before the root file system is mounted, meaning you can do fun things like full-disk encryption.
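
The init.sh loop from step 4 relies on two shell behaviors: a glob expands in sorted name order, which is why the fragments carry numeric prefixes, and sourced files share one shell, so a variable set in an early fragment is visible in a later one. A minimal sketch of both, using a scratch directory in place of /init.d and made-up fragment names:

```shell
#!/bin/sh
# Sketch: source numbered fragments in glob order, as init.sh does.
# initd is a scratch stand-in for /init.d; the fragment names are examples.
initd=$(mktemp -d)

# Created out of order on purpose; the glob sorts them by name.
echo 'log="$log 99-last"'            > "$initd/99-last"
echo 'log="10-first"; shared=hello'  > "$initd/10-first"
echo 'log="$log 50-middle:$shared"'  > "$initd/50-middle"

log=
for each in "$initd"/* ; do
	. "$each"    # sourced, so variables persist between fragments
done

echo "$log"
rm -r "$initd"
```

This prints "10-first 50-middle:hello 99-last", showing both the ordering and the shared variable.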

Interactive Rescue Environment

It may not be immediately obvious, but you now have almost everything you need for an interactive rescue environment, which you can optionally boot into to do emergency maintenance tasks such as running fsck on your root file system. You just need to assemble a few additional pieces.

  1. Symlink /sbin/init to BusyBox so there's a real init for the kernel to start.

    # ln -s ../bin/busybox /boot/sbin/init
    
  2. Create an inittab.

    # cat > /boot/etc/inittab <<EOF
    ::sysinit:/bin/busybox mkdir /dev/pts /dev/shm
    ::sysinit:/bin/busybox mount -a
    
    ::respawn:-/bin/busybox sh
    
    ::shutdown:/bin/busybox killall5
    ::shutdown:/bin/busybox umount -a -r
    EOF
    
  3. Add an option to the bootloader configuration for booting into the rescue environment.

    # cat >> /boot/extlinux/extlinux.conf <<EOF
    LABEL rescue
    	KERNEL /vmlinuz
    	APPEND root=/dev/sda1 rootwait
    EOF
    

    Notice that the only difference between this new rescue label and the default linux label is the lack of init=/init.sh in the kernel command line. The kernel executes /sbin/init by default.

  4. You may wish to install additional utilities for diagnosing problems with your root file system.

    Note: The packages shown here are just examples; you could install packages specific to the file systems you use.

    # emerge --root=/boot --config-root=/boot sys-fs/e2fsprogs sys-fs/xfsprogs
    

To enter into your new rescue environment when booting, hold down the Shift or Alt key (or engage Caps Lock or Scroll Lock) before the kernel loads, and a boot: prompt will appear. Type rescue and press Enter.

Networking Support with DHCP

You can add networking support to your early userspace environment fairly easily. This is useful if you need to mount network shares or you wish to allow remote control of the environment over SSH.

Important: Change eth0 in the scripts below to your actual network device name. Note that there is no udev in the early userspace environment, so the network device name will be whatever the kernel assigns, not the persistent name that udev assigns later in the boot process.

  1. Symlink /etc/resolv.conf to /run/resolv.conf, as /etc may be read-only during boot.

    # ln -s /run/resolv.conf /boot/etc/
    
  2. Add an init script to bring up your network device and run BusyBox's DHCP client.

    # cat > /boot/init.d/10-network <<'EOF'
    ip link set up dev eth0
    
    udhcpc -f -i eth0 &
    pid_udhcpc=$!
    EOF
    

    Note: If your DHCP server needs a host name and/or client ID (for example, to return a fixed IP address mapping), add -x hostname:<your-hostname> and/or -x 0x3d:<your-client-ID> to the udhcpc command line, before the ampersand. Omit the angle brackets, and write the client ID as plain hex digits with no colons.

  3. Add an init script to stop the DHCP client and deconfigure the network interface, so that your later boot scripts can start with a clean slate.

    # cat > /boot/init.d/89-network <<'EOF'
    kill "${pid_udhcpc}"
    wait "${pid_udhcpc}" || :
    
    ip -4 addr flush dev eth0
    ip link set down dev eth0
    EOF
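
The two network scripts follow a start/stop pattern that can be tried safely outside the boot environment. In the sketch below, "sleep 300" is a stand-in for "udhcpc -f -i eth0", and pid_helper is an illustrative variable name. It shows why the wait needs a fallback under "set -e": after a kill, wait reports a non-zero status.

```shell
#!/bin/sh
# Sketch: start a background helper, remember its PID, and stop it later.
# "sleep 300" stands in for "udhcpc -f -i eth0".
set -e

sleep 300 &
pid_helper=$!          # PID of the most recent background job

# ... other boot steps would run here ...

kill "$pid_helper"
# wait returns the helper's exit status, which is non-zero after a signal.
# The guide discards it with "|| :"; here it is captured instead. Either
# way, the "||" keeps "set -e" from aborting the script at this point.
status=0
wait "$pid_helper" || status=$?

echo "helper exited with status $status"
```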
    

Remote Control over SSH

It is possible to run an SSH server in the early userspace environment. This is useful if you need to enter a passphrase to unlock an encrypted storage device but may not always have physical access to the console.

  1. Emerge the Dropbear SSH server.

    # echo 'net-misc/dropbear -shadow -zlib' >> /boot/etc/portage/package.use
    
    # emerge --root=/boot --config-root=/boot net-misc/dropbear
    
  2. Install your host keys, converting them to Dropbear's format.

    # mkdir /boot/etc/dropbear
    
    # /boot/usr/bin/dropbearconvert openssh dropbear /etc/ssh/ssh_host_dsa_key /boot/etc/dropbear/dropbear_dss_host_key
    
    # /boot/usr/bin/dropbearconvert openssh dropbear /etc/ssh/ssh_host_rsa_key /boot/etc/dropbear/dropbear_rsa_host_key
    
    # /boot/usr/bin/dropbearconvert openssh dropbear /etc/ssh/ssh_host_ecdsa_key /boot/etc/dropbear/dropbear_ecdsa_host_key
    
  3. Add init scripts to start and stop the Dropbear server.

    # cat > /boot/init.d/11-dropbear <<'EOF'
    dropbear -F -P '' -I 60 &
    pid_dropbear=$!
    EOF
    
    # cat > /boot/init.d/88-dropbear <<'EOF'
    kill "${pid_dropbear}"
    wait "${pid_dropbear}" || :
    EOF
    
  4. Copy your authorized_keys file.

    # mkdir -p /boot/root/.ssh
    
    # cp -a ~/.ssh/authorized_keys /boot/root/.ssh/
    
  5. Install the default user and group manifests.

    # cp -a /usr/share/baselayout/{passwd,group} /boot/etc/
    
  6. Change the root user's shell to /bin/sh, since Bash is not installed.

    # ln -s busybox /boot/bin/sh
    
    # chsh --root /boot --shell /bin/sh root
    
  7. Add an init script to pause the boot process at a prompt, to allow for remote access.

    # cat > /boot/init.d/49-pause <<EOF
    read -r -p 'Press Enter to continue boot...'
    EOF
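
Not every host has all three key types from step 2. As a hedged sketch, the loop below prints (but does not run) one dropbearconvert command per key file that exists. The scratch directory and its empty dummy key files are stand-ins for /etc/ssh, so that the loop can be tried without touching real keys.

```shell
#!/bin/sh
# Sketch: print (not run) a dropbearconvert command for each host key found.
# keydir is a scratch stand-in for /etc/ssh; the key files are empty dummies.
keydir=$(mktemp -d)
touch "$keydir/ssh_host_rsa_key" "$keydir/ssh_host_ecdsa_key"

cmds=
# Each pair is OpenSSH-name:Dropbear-name, matching the file names in step 2.
for pair in dsa:dss rsa:rsa ecdsa:ecdsa ; do
	src="$keydir/ssh_host_${pair%%:*}_key"
	dst="/boot/etc/dropbear/dropbear_${pair##*:}_host_key"
	[ -e "$src" ] || continue      # skip key types this host does not have
	cmds="${cmds}dropbearconvert openssh dropbear $src $dst
"
done

printf '%s' "$cmds"
rm -r "$keydir"
```

To use the idea for real, point keydir at /etc/ssh, drop the touch and rm lines, and review the printed commands before running them.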
    

Full-Disk Encryption with LUKS

The impetus for all of this, of course, is to allow for complex root file system mounts, which cannot be achieved simply with kernel command-line arguments. The following section of this guide details how to convert an existing root partition in place (i.e., preserving the existing file system and its contents) to an encrypted partition and how to set up the early userspace environment to prompt for the passphrase to mount the root file system contained in this partition.

  1. Before you begin, verify that your disk has free space available to shift the start of your root partition by at least 1032 sectors toward the beginning of the disk.

    # sfdisk -lq /dev/sda
    Device     Boot    Start        End    Sectors  Size Id Type
    /dev/sda1  *        2048    1048575    1046528  511M 83 Linux
    /dev/sda2        1048576   16777215   15728640  7.5G 82 Linux swap / Solaris
    /dev/sda3       16777216 2147483647 2130706432 1016G 83 Linux
    

    Shown above is an example of a typical partition layout, with a small boot partition first, followed by a swap partition, followed by the large root partition. In this case, the swap partition can be deleted and created anew with a slightly smaller size, to make room for expanding the root partition into the vacated space.

    Important: If your partition layout lacks sufficient free space to relocate your root partition by at least 1032 sectors closer to the beginning of your disk, then do not continue with this guide!

  2. Emerge cryptsetup.

    # cat >> /boot/etc/portage/package.use <<EOF
    sys-fs/cryptsetup -gcrypt kernel
    sys-fs/lvm2 -thin device-mapper-only
    EOF
    
    # echo 'sys-apps/baselayout-2.2' >> /boot/etc/portage/profile/package.provided
    
    # emerge --root=/boot --config-root=/boot sys-fs/cryptsetup
    
  3. Determine the number of sectors needed for the LUKS header.

    # dd if=/dev/null of=/tmp/tmp.img bs=1M seek=64
    
    # LOOPDEV=$(losetup -f --show /tmp/tmp.img)
    
    # /boot/sbin/cryptsetup luksFormat -q --align-payload 1 "${LOOPDEV}"
    Enter passphrase: [press Enter here]
    
    # /boot/sbin/cryptsetup luksDump "${LOOPDEV}" | grep '^Payload offset:'
    Payload offset: 2056
    
    # losetup -d "${LOOPDEV}"
    
    # rm /tmp/tmp.img
    

    Note: If you do not have enough space to grow your root partition by the number of sectors reported as the "Payload offset", then repeat this step, but add --cipher aes-cbc-essiv:sha256 --key-size 128 to the luksFormat command. These parameters should result in the smallest possible LUKS header. If you still do not have enough space, then you must not continue with this guide!

  4. Before proceeding, make a full backup of your file system to an external disk. Even if you perform all of the following steps perfectly, a power glitch or a kernel panic during the encryption process will trash your file system irreparably. You have been warned!

  5. Rewrite the 50-mountroot init script.

    # cat > /boot/init.d/50-mountroot <<EOF
    until cryptsetup luksOpen /dev/sda3 root ; do : ; done
    mount --ro /dev/mapper/root /mnt
    EOF
    
  6. If you created 49-pause earlier, you should delete it now, as it is no longer useful.

    # rm -f /boot/init.d/49-pause
    
  7. Reboot into your shiny new interactive rescue environment. You cannot perform the remaining steps while your root file system is mounted.

  8. Use sfdisk to extend your root partition toward the beginning of the disk by exactly the number of sectors reported earlier by luksDump as the "Payload offset". Also, change its type to e8, which is the standard partition type for a LUKS partition.

    Important: The numbers shown below are examples only. You will need to use the actual numbers reported by sfdisk for your disk, decreasing the size of the swap partition, decreasing the start of the root partition, and increasing the size of the root partition, all by the exact number of sectors reported earlier as the "Payload offset".

    If you have ANY DOUBTS about what you are doing, STOP NOW!

    # sfdisk /dev/sda
    
    Welcome to sfdisk (util-linux 2.27.1).
    Changes will remain in memory only, until you decide to write them.
    Be careful before using the write command.
    
    Checking that no-one is using this disk right now ... OK
    
    Disk /dev/sda: 1 TiB, 1099511627776 bytes, 2147483648 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0xf04ad805
    
    Old situation:
    
    Device     Boot    Start        End    Sectors  Size Id Type
    /dev/sda1  *        2048    1048575    1046528  511M 83 Linux
    /dev/sda2        1048576   16777215   15728640  7.5G 82 Linux swap / Solaris
    /dev/sda3       16777216 2147483647 2130706432 1016G 83 Linux
    

    Write down the "Old situation" in case you need to go back to it.

    Type 'help' to get more information.
    
    >>> 2048,1046528,83,*
    Created a new DOS disklabel with disk identifier 0xa63ad8c1.
    Created a new partition 1 of type 'Linux' and of size 511 MiB.
    /dev/sda1 :         2048      1048575 (511M) Linux
    /dev/sda2: 1048576,15726584,82
    Created a new partition 2 of type 'Linux swap / Solaris' and of size 7.5 GiB.
    /dev/sda2 :      1048576     16775159 (7.5G) Linux swap / Solaris
    /dev/sda3: 16775160,2130708488,e8
    Created a new partition 3 of type 'Unknown' and of size 1016 GiB.
    /dev/sda3 :     16775160   2147483647 (1016G) unknown
    /dev/sda4: 0,0
    Ignoring partition.
    All partitions used.
    
    New situation:
    
    Device     Boot    Start        End    Sectors  Size Id Type
    /dev/sda1  *        2048    1048575    1046528  511M 83 Linux
    /dev/sda2        1048576   16775159   15726584  7.5G 82 Linux swap / Solaris
    /dev/sda3       16775160 2147483647 2130708488 1016G e8 unknown
    

    Verify that the ending sector of your root partition is the same in the "New situation" as in the "Old situation" and that its size has increased by the "Payload offset" amount. Also verify that its type is now e8.

    Do you want to write this to disk? [Y]es/[N]o: y
    
    The partition table has been altered.
    Calling ioctl() to re-read partition table.
    Syncing disks.
    
  9. If you shrank a swap partition, you must run mkswap to reinitialize its header with the new size.

    # mkswap /dev/sda2
    
  10. Set up a loop device pointing at your file system, which is now at a positive offset into the partition.

    # losetup -f --show --offset $((2056*512)) /dev/sda3
    /dev/loop0
    
  11. Verify that the loop device is pointing at your file system.

    # blkid /dev/loop0
    /dev/loop0: UUID="6f5401f8-12df-4e17-9935-5478f161d51a" TYPE="ext4"
    

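    If you do not see a TYPE=, then the loop device is not pointing at the start of your file system, most likely because the offset or the new partition start is wrong. Do not continue until this check passes.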

  12. Format the LUKS partition. Use the same parameters to luksFormat as you used earlier when you determined the "Payload offset".

    Important: If you do not use the same parameters to luksFormat as you used earlier, you may accidentally overwrite the beginning of your file system, which would be Very Bad.

    # cryptsetup luksFormat --align-payload 1 /dev/sda3
    WARNING!
    ========
    This will overwrite data on /dev/sda3 irrevocably.
    
    Are you sure? (Type uppercase yes): YES
    Enter passphrase: [type a strong passphrase here]
    Verify passphrase: [repeat the same passphrase here]
    
  13. Open the LUKS partition.

    # cryptsetup luksOpen /dev/sda3 root
    Enter passphrase for /dev/sda3: [type your passphrase here]
    
  14. Encrypt your file system in place.

    # dd if=/dev/loop0 of=/dev/mapper/root bs=512
    

    Go have a nap. This will take several hours. I hope you have stable power.

  15. Verify that the mapped device contains your file system.

    # blkid /dev/mapper/root
    /dev/mapper/root: UUID="6f5401f8-12df-4e17-9935-5478f161d51a" TYPE="ext4"
    

    The UUID and TYPE should be the same as reported by blkid earlier.

  16. Reboot and cross your fingers.

    # reboot
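
The sector arithmetic in steps 1, 3, and 8 can be double-checked with a short script before touching the partition table. This is a sketch under a stated assumption: the LUKS1 header layout is taken to be 8 sectors of header followed by 8 key slots, each holding 4000 times the key length and rounded up to a multiple of 8 sectors, with --align-payload 1. That assumption reproduces the 2056 and 1032 figures quoted in this guide, but the value printed by luksDump remains the one to trust. The function name and variable names are illustrative.

```shell
#!/bin/sh
# Sketch: estimate the LUKS1 payload offset, then derive new sfdisk values.
# Assumed layout: 8 header sectors + 8 key slots, each key_bytes * 4000 bytes
# rounded up to a multiple of 8 sectors (512-byte sectors throughout).
luks1_payload_offset() {
	key_bytes=$(($1 / 8))
	slot_sectors=$(( (key_bytes * 4000 + 511) / 512 ))
	slot_sectors=$(( (slot_sectors + 7) / 8 * 8 ))
	echo $((8 + 8 * slot_sectors))
}

offset_256=$(luks1_payload_offset 256)
offset_128=$(luks1_payload_offset 128)
echo "256-bit key: $offset_256 sectors, 128-bit key: $offset_128 sectors"

# Example values from the "Old situation" in this guide, not your disk's.
offset=$offset_256
swap_start=1048576  swap_size=15728640
root_start=16777216 root_size=2130706432

new_swap_size=$((swap_size - offset))
new_root_start=$((root_start - offset))
new_root_size=$((root_size + offset))

# The root partition's last sector must not move.
old_root_end=$((root_start + root_size - 1))
new_root_end=$((new_root_start + new_root_size - 1))

echo "swap: $swap_start,$new_swap_size,82"
echo "root: $new_root_start,$new_root_size,e8"
[ "$old_root_end" -eq "$new_root_end" ] && echo "root end unchanged: $new_root_end"
```

With the example numbers, the two partition lines match what was typed at the sfdisk prompt in step 8.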
    

Remote Unlocking of Encrypted Root

So now your system is encrypted and prompts you for the passphrase during boot, but what happens if the power flickers while you're away and without physical access to the console? You'd like to be able to SSH in and enter the passphrase to get your system booted up again. Well, you can.

  1. Emerge screen.

    # emerge --root=/boot --config-root=/boot app-misc/screen
    
  2. Rewrite the 50-mountroot init script.

    # cat > /boot/init.d/50-mountroot <<EOF
    openvt -sw screen busybox sh -c 'until cryptsetup luksOpen /dev/sda3 root ; do : ; done' || :
    chvt 1
    deallocvt
    mount --ro /dev/mapper/root /mnt
    EOF
    
  3. Change the root user's shell to /usr/bin/screen.

    # chsh --root /boot --shell /usr/bin/screen root
    

Now reboot. When you see the passphrase prompt, try SSH'ing in from another computer. You will see the same passphrase prompt. Enter the passphrase on either machine to continue the boot process.
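
Both versions of 50-mountroot wrap cryptsetup in an "until ... ; do : ; done" loop, which re-runs the command until it succeeds, so a mistyped passphrase simply produces another prompt. A minimal sketch of the idiom follows. Here try_unlock is a hypothetical stand-in for "cryptsetup luksOpen /dev/sda3 root" that fails twice and then succeeds.

```shell
#!/bin/sh
# Sketch: the "until CMD ; do : ; done" retry idiom used in 50-mountroot.
# try_unlock is a hypothetical stand-in for "cryptsetup luksOpen /dev/sda3 root";
# it fails twice (as if a wrong passphrase were typed) and then succeeds.
attempts=0
try_unlock() {
	attempts=$((attempts + 1))
	[ "$attempts" -ge 3 ]
}

until try_unlock ; do
	:    # no-op body; the loop simply re-runs the command
done

echo "unlocked after $attempts attempts"
```

Because the failing command is the loop condition, "set -e" in init.sh does not abort the boot on a wrong passphrase.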

@whitslack
Author

switch_root is an abomination (IMHO).

What's wrong with it? I don't know much about what happens under the hood.

switch_root violates the time-honored Unix tradition of each command-line utility doing one thing and doing it well. switch_root performs a whole sequence of syscalls that are grouped into a single executable only out of a necessity to avoid exec'ing any new processes while the sequence is in progress. It's a hack.

So what is left behind?

After switch_root execs the new init process (thereby closing the last file descriptions on unlinked inodes in the rootfs and allowing them and their data blocks to be purged from memory), nothing.

Good to know, is it tmpfs like or is it just a tmpfs? how does it differ?

It's not a tmpfs. If you examine /proc/mounts, you can observe that the type of the file system initially mounted at / is rootfs, not tmpfs. That said, I strongly suspect the implementation of rootfs inherits from that of tmpfs with few if any differences other than name.

Signing your kernel and setting up secure boot isn't too hard

Sure, but I have no reason to do it. None of my Linux systems have UEFI firmwares.
