smira/README.md

## README.md

      
## kexec

### Syscalls


1. Load the new kernel + initrd from files:
   - kexec_file_load
   - Go syscall
2. Call reboot with LINUX_REBOOT_CMD_KEXEC:
   - reboot
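The two steps above can be sketched in Go roughly as follows. This is a minimal sketch, not machined's actual code: it uses only the standard `syscall` package, the names `loadKernel` and `kexecReboot` are hypothetical, and the hard-coded syscall number 320 is an assumption that holds for x86_64 only (the `KexecFileLoad` wrapper in `golang.org/x/sys/unix` avoids that per-architecture detail).

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// sysKexecFileLoad is the kexec_file_load syscall number on x86_64.
// Assumption: other architectures use different numbers.
const sysKexecFileLoad = 320

// loadKernel stages a kernel and initrd for a later kexec reboot (step 1).
func loadKernel(kernelPath, initrdPath, cmdline string) error {
	kernel, err := os.Open(kernelPath)
	if err != nil {
		return fmt.Errorf("open kernel: %w", err)
	}
	defer kernel.Close()

	initrd, err := os.Open(initrdPath)
	if err != nil {
		return fmt.Errorf("open initrd: %w", err)
	}
	defer initrd.Close()

	// The kernel expects a NUL-terminated command line and a length
	// that includes the terminator.
	cmdlinePtr, err := syscall.BytePtrFromString(cmdline)
	if err != nil {
		return fmt.Errorf("invalid cmdline: %w", err)
	}

	_, _, errno := syscall.Syscall6(
		sysKexecFileLoad,
		kernel.Fd(), initrd.Fd(),
		uintptr(len(cmdline)+1),
		uintptr(unsafe.Pointer(cmdlinePtr)),
		0, // flags
		0, // unused
	)
	if errno != 0 {
		return fmt.Errorf("kexec_file_load: %w", errno)
	}
	return nil
}

// kexecReboot jumps into the previously loaded kernel (step 2).
func kexecReboot() error {
	return syscall.Reboot(syscall.LINUX_REBOOT_CMD_KEXEC)
}

func main() {
	if len(os.Args) != 4 {
		fmt.Fprintln(os.Stderr, "usage: kexec <kernel> <initrd> <cmdline>")
		return
	}
	if err := loadKernel(os.Args[1], os.Args[2], os.Args[3]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := kexecReboot(); err != nil {
		fmt.Fprintln(os.Stderr, "reboot:", err)
		os.Exit(1)
	}
}
```

Both calls need CAP_SYS_BOOT, so without it the program is expected to fail with a permission error instead of rebooting.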


(2) means that kexec is only available if reboot() is available as well.
There are two kexec_load-related syscalls:

- kexec_load, which takes arbitrary memory (enabled via CONFIG_KEXEC)
- kexec_file_load, which takes file descriptors and might do signature validation (enabled via CONFIG_KEXEC_FILE)

KSPP (the Kernel Self Protection Project) talks only about CONFIG_KEXEC, not about CONFIG_KEXEC_FILE. At the same time, it recommends setting the kernel.kexec_load_disabled sysctl, which disables both flavors, not just kexec_load.
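To check whether that sysctl is active on a running system, one option is to read it from procfs. The sketch below assumes the standard path /proc/sys/kernel/kexec_load_disabled; the helper name `parseKexecLoadDisabled` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const kexecLoadDisabledPath = "/proc/sys/kernel/kexec_load_disabled"

// parseKexecLoadDisabled interprets the sysctl file contents:
// a non-zero value means loading new kexec images is disabled.
func parseKexecLoadDisabled(contents string) (bool, error) {
	v, err := strconv.Atoi(strings.TrimSpace(contents))
	if err != nil {
		return false, fmt.Errorf("unexpected sysctl value %q: %w", contents, err)
	}
	return v != 0, nil
}

func main() {
	data, err := os.ReadFile(kexecLoadDisabledPath)
	if err != nil {
		// For example, a kernel without kexec support or a restricted /proc.
		fmt.Println("cannot read sysctl:", err)
		return
	}
	disabled, err := parseKexecLoadDisabled(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("kexec image loading disabled:", disabled)
}
```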
kexec source code.
### Capabilities

Both kexec_file_load and reboot require the CAP_SYS_BOOT capability.
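A process can check whether it currently holds CAP_SYS_BOOT by inspecting the CapEff bitmask in /proc/self/status. The sketch below assumes the capability's bit number is 22 (its value in linux/capability.h); the helper name `hasEffectiveCap` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysBoot is the bit number of CAP_SYS_BOOT in linux/capability.h.
const capSysBoot = 22

// hasEffectiveCap parses the contents of /proc/<pid>/status and reports
// whether capability bit c is set in the effective set (CapEff).
func hasEffectiveCap(status string, c uint) (bool, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		hexMask := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		mask, err := strconv.ParseUint(hexMask, 16, 64)
		if err != nil {
			return false, fmt.Errorf("parse CapEff %q: %w", hexMask, err)
		}
		return mask&(uint64(1)<<c) != 0, nil
	}
	return false, fmt.Errorf("CapEff line not found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	ok, err := hasEffectiveCap(string(data), capSysBoot)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("CAP_SYS_BOOT effective:", ok)
}
```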
Calling reboot() inside a user namespace doesn't reboot the system; it reboots the namespace, killing it (proof). Strictly speaking, reboot(2) documents this behavior for non-initial PID namespaces, with the CAP_SYS_BOOT check made against the user namespace that owns the PID namespace. If LINUX_REBOOT_CMD_KEXEC is used, the call fails with EINVAL. This in turn means that a container can't actually use kexec unless it breaks out of its namespace (and if it does, security is compromised anyway).
We can further limit kexec by dropping the CAP_SYS_BOOT capability for any process forked from machined (init). The path towards that is not yet entirely clear to me, but here are some pointers:

- Go issue about runtime, threads and global stuff like setting capabilities
- runc setting capabilities
- os/exec can set Ambient capabilities
- PR_SET_NO_NEW_PRIVS
- article on capabilities

Creating a user namespace re-enables all capabilities, but capabilities inside the user namespace are limited to the resources scoped under that namespace (more info).
In other words, to protect kexec from being used by processes other than machined:

- For processes forked directly from machined (which include udevd, containerd, etc.), we can try to drop capabilities as we fork into those processes.
- For containers created by containerd (both system and k8s), kexec shouldn't be available, as they reside in a user namespace.
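For the first case, one possible approach (an untested sketch, not a settled design) is to drop CAP_SYS_BOOT from the bounding set of a dedicated OS thread and fork the child from that thread, so that machined's other threads keep the capability. The helper names `inBoundingSet` and `startWithoutSysBoot` are hypothetical, the prctl constants are copied from linux/prctl.h, and /bin/true merely stands in for a real child such as containerd.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

const (
	capSysBoot    = 22 // CAP_SYS_BOOT from linux/capability.h
	prCapbsetRead = 23 // PR_CAPBSET_READ from linux/prctl.h
	prCapbsetDrop = 24 // PR_CAPBSET_DROP from linux/prctl.h
)

// inBoundingSet reports whether capability c is in the bounding set
// of the calling thread.
func inBoundingSet(c uintptr) (bool, error) {
	r, _, errno := syscall.RawSyscall(syscall.SYS_PRCTL, prCapbsetRead, c, 0)
	if errno != 0 {
		return false, errno
	}
	return r == 1, nil
}

// startWithoutSysBoot starts cmd from a thread whose bounding set no
// longer contains CAP_SYS_BOOT, so the child inherits the reduced set.
func startWithoutSysBoot(cmd *exec.Cmd) error {
	errCh := make(chan error, 1)
	go func() {
		// Lock and never unlock: the Go runtime terminates the modified
		// thread when this goroutine exits instead of reusing it.
		runtime.LockOSThread()
		_, _, errno := syscall.RawSyscall(syscall.SYS_PRCTL, prCapbsetDrop, capSysBoot, 0)
		if errno != 0 {
			errCh <- fmt.Errorf("PR_CAPBSET_DROP: %w", errno)
			return
		}
		// The fork happens on the calling (locked) thread.
		errCh <- cmd.Start()
	}()
	return <-errCh
}

func main() {
	ok, err := inBoundingSet(capSysBoot)
	fmt.Println("CAP_SYS_BOOT in bounding set:", ok, err)

	cmd := exec.Command("/bin/true")
	if err := startWithoutSysBoot(cmd); err != nil {
		fmt.Println("start failed (CAP_SETPCAP is required):", err)
		return
	}
	fmt.Println("child exited:", cmd.Wait())
}
```

Dropping a capability from the bounding set only limits what the child can gain across execve; a capability that is already in the inheritable or ambient set would need to be cleared separately, so whether this alone is sufficient for machined remains an open question.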