Run a systemd container using cgroupv2 [NOTES]

In theory this would allow the nested systemd init to manage its own resources via slices and scopes - kind of like with LXC's nested mode but without the nasty security implication of bind mounting the real cgroupfs into the container.

Running a systemd container is not the only thing that this would enable - together with fuse-overlayfs it might allow one to run containers inside containers more securely.

The problem is that by default the nested cgroup is mounted ro into the container, which should not be necessary according to my research. It gets mounted rw as expected when userns-remap is enabled in Docker, which is not desirable for me. I am not sure if docker/moby/containerd is at fault here or if it's a limitation of Linux control groups or user namespaces. It would be great if somebody could point me in the right direction. I'd be happy even if you prove me completely wrong and point out a fault in my reasoning :)

My full writeup and explanation are in a serverfault answer.


INB4 this becomes a flame war

IMHO this is not always an anti-pattern! It is a legitimate use-case for all kinds of CI, testing, local-development and other workloads. Among others, this approach is used by Ansible Molecule; I've seen it used and sought after many times, and I personally have multiple use-cases that would greatly benefit. After all, this is essentially what LXC/LXD does, and I've seen large-scale production deployments based on it.

Also, even if you don't use it for containers running init, it is still a security enhancement.


Upon consideration I wonder if it's even possible to do without userns-remapping. It might be that the kernel does not support clone/unshare inside a child namespace of UID 0?

Theoretically namespaced clone/unshare calls do not require CAP_SYS_ADMIN (they are unprivileged), but I'm not sure if it's applicable here.

Via namespaces(7):

Creation of new namespaces using clone(2) and unshare(2) in most cases requires the CAP_SYS_ADMIN capability, since, in the new namespace, the creator will have the power to change global resources that are visible to other processes that are subsequently created in, or join the namespace. User namespaces are the exception: since Linux 3.8, no privilege is required to create a user namespace.
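This is easy to check on a host that allows unprivileged user namespaces (a quick sanity check, unrelated to docker itself; the output is a sample, the second uid_map column is your real UID):

$ unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
0
         0       1000          1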

The relation and interaction between the user namespace and control group kernel features is quite confusing, and I am still trying to wrap my head around it. When you add systemd into the mix it becomes a real mind-boggler.


Just a small note: it's not enough that your kernel has cgroupv2 enabled. Depending on the Linux distribution, the bundled systemd might still prefer to use v1 by default.

You can tell systemd to use cgroupv2 via kernel cmdline parameter:
systemd.unified_cgroup_hierarchy=1

It might also be necessary to explicitly disable hybrid cgroupv1 support to avoid problems, using: systemd.legacy_systemd_cgroup_controller=0

Or completely disable cgroupv1 in the kernel with: cgroup_no_v1=all
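On a GRUB-based distro these can be added roughly like this (a sketch; the grub config path and regeneration command differ between distributions):

# /etc/default/grub - append to the existing value:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"

# regenerate the config and reboot (some distros use grub2-mkconfig
# and a different output path):
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot

# afterwards this should print 'cgroup2fs':
stat -fc %T /sys/fs/cgroup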

tl;dr

It seems to me that this use case is not explicitly supported yet. You can almost get it working but not quite.

The root cause

When systemd sees a unified cgroupfs at /sys/fs/cgroup it assumes it should be able to write to it, which normally would be possible, but is not the case here.

The basics

First of all, you need to create a systemd slice for docker containers and tell docker to use it - my current /etc/docker/daemon.json:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "features": { "buildkit": true },
  "experimental": true,
  "cgroup-parent": "docker.slice"
}

Note: Not all of these options are necessary. The most important one is cgroup-parent. The cgroupdriver should already be switched to "systemd" by default.
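For completeness, the slice referenced above can be declared as a trivial unit file - a minimal sketch, with purely illustrative limits (systemd also creates slices on demand, so the unit is only needed to attach settings to it):

# /etc/systemd/system/docker.slice
[Unit]
Description=Slice for docker containers

[Slice]
# optional collective limits for everything under the slice, e.g.:
# MemoryMax=8G
# CPUWeight=512

Reload and restart afterwards with: sudo systemctl daemon-reload && sudo systemctl restart docker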

Each slice gets its own nested cgroup. There is one caveat though: each group may only be a "leaf" or an "intermediary". Once a process takes ownership of a cgroup, no other can manage it. This means that the actual container process needs, and will get, its own private group attached below the configured one, in the form of a systemd scope.

Reference: Please find more about systemd resource control, handling of cgroup namespaces and delegation.

Note: At this point the docker daemon should use --cgroupns private by default, but you can force it anyway.
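Forcing it, and checking what the daemon actually uses, looks like this (the image name is a placeholder):

# check the daemon's settings (expect these values on a cgroupv2 host
# with the systemd driver):
docker info | grep -i cgroup
# Cgroup Driver: systemd
# Cgroup Version: 2

# force a private cgroup namespace explicitly:
docker run -it --cgroupns=private your-systemd-image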

Now a newly started container will get its own group which should be available in a path that (depending on your setup) resembles:

/sys/fs/cgroup/your_docker_parent.slice/your_container.scope

And here is the important part: you must not mount a volume into the container's /sys/fs/cgroup. The path to its private group mentioned above should get mounted there automatically.
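You can confirm what actually got mounted from inside the container (assuming findmnt is present in the image; sample output):

findmnt /sys/fs/cgroup
# TARGET          SOURCE   FSTYPE    OPTIONS
# /sys/fs/cgroup  cgroup2  cgroup2   ro,nosuid,nodev,noexec,relatime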

The goal

Now, in theory, the container should be able to manage this delegated, private group by itself almost fully. This would allow its own init process to create child groups.
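If the mount were rw, that delegated management would boil down to plain filesystem operations - a sketch of what a nested init does, assuming a single-process container:

# inside the container, at its delegated cgroup root:
mkdir /sys/fs/cgroup/payload

# move ourselves out of the root first; cgroupv2 forbids enabling
# controllers for children of a cgroup that still has processes in it:
echo $$ > /sys/fs/cgroup/payload/cgroup.procs

# enable controllers for child groups (they must be listed in
# cgroup.controllers, i.e. delegated to this cgroup):
echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control

# set a limit on the leaf:
echo 256M > /sys/fs/cgroup/payload/memory.max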

The problem

The problem is that the /sys/fs/cgroup path in the container gets mounted read-only. I've checked apparmor rules and switched seccomp to unconfined to no avail.

The hypothesis

I am not completely certain yet - my current hypothesis is that this is a security feature of docker/moby/containerd. Without private groups it makes perfect sense to mount this path ro.

Potential solutions

What I've also discovered is that enabling user namespace remapping causes the private /sys/fs/cgroup to be mounted rw as expected!

This is far from perfect though - the cgroup mount (among others) has the wrong ownership: it's owned by the real system root (UID 0) while the container has been remapped to a completely different user. Once I manually adjusted the owner, the container was able to start a systemd init successfully.
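The manual adjustment was along these lines (a sketch; dockremap is docker's default remap user, and the subordinate UID range comes from /etc/subuid and will differ between setups):

# on the host: find the first subordinate UID of the remap user
grep dockremap /etc/subuid
# dockremap:100000:65536

# hand the delegated scope to the container's remapped root
sudo chown -R 100000:100000 \
    /sys/fs/cgroup/your_docker_parent.slice/your_container.scope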

I suspect this is a deficiency of docker's userns remapping feature and might be fixed sooner or later. Keep in mind that I might be wrong about this - I did not confirm.

Discussion

Userns remapping has a lot of drawbacks, and the best possible scenario for me would be to get the cgroupfs mounted rw without it. I still don't know if this is done on purpose or if it's some kind of limitation of the cgroup/userns implementation.


Background

Now that docker supports cgroups v2 I would like to take full advantage of it.

When I run a container with a private group using --cgroupns=private, the nested cgroup2 filesystem created by the systemd scope gets mounted into the container's /sys/fs/cgroup path properly; however, docker mounts it read-only by default:

cgroup2 on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec)

Rationale

Technical considerations

I think that this is legacy behaviour which was correct for cgroupv1, where the system-global cgroupfs was mounted into the container and rw rights would have been a gaping security hole.

According to my knowledge, a nested cgroup with delegated controllers should be able to write into /sys/fs/cgroup by design, without negative security implications.

Target use-cases

Right now, running containers with a (nested) systemd init or another container runtime requires multiple hacks which seriously compromise security and have portability problems.

Solving this problem would enable an easier, more secure and possibly even transparent mechanism for:

  • allowing containers with a nested systemd init to manage their own resources via slices and scopes - kind of like LXC's nested mode, but without the nasty security implication of bind mounting the real cgroupfs into the container
  • allowing nested containerized workloads with the help of fuse-overlayfs

The goal

My goal is to adjust the code so that the cgroup2 filesystem is mounted read-write when a container is run with a private cgroupns with delegated controllers.

The problem

The problem is that I don't really know where to look. Which part of the stack is actually responsible for this? Is it docker, moby, containerd, runc or maybe systemd?

So far I've found the default settings in the moby project, but they are for cgroupv1.

Where do I find the code that I need to modify and submit a PR to?

PS For a more detailed writeup see my answer on serverfault and my post on r/docker.

@c3-mgruchacz

@pinkeen I'm running into a similar issue - were you ever able to resolve it, or at least find the right repo to submit a PR to?
For my use case I don't want to run systemd in the container, but I do want my application, running in a container with --cgroupns=private and Delegate=True, to be able to manage its own cgroup sub-hierarchy. The motivation is that when processes are forked from the main process running in the container, I want separate resource limits to apply collectively to the forked processes, so that if they OOM they don't cause the main process to be OOM-killed along with them.
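For reference, with a writable cgroupfs (e.g. via the remount trick in the next comment) that pattern is a few lines of shell run by the main process; worker_command is a placeholder:

# move the main process into its own leaf so controllers can be
# enabled on the (then empty) container root
mkdir -p /sys/fs/cgroup/main /sys/fs/cgroup/workers
echo $$ > /sys/fs/cgroup/main/cgroup.procs
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control

# a collective limit for the forked workers; an OOM kill in 'workers'
# will not touch the process in 'main'
echo 512M > /sys/fs/cgroup/workers/memory.max
worker_command &
echo $! > /sys/fs/cgroup/workers/cgroup.procs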

@techninja1008

For anyone who finds this wanting to achieve a similar thing, I've managed to successfully get it working with a combination of the following:

  1. Launch the container with CAP_SYS_ADMIN (also make sure that the container has its own cgroup namespace - this should be the default though)
  2. Use a shim in the container wrapping your existing entrypoint that looks vaguely like the following:
    #!/bin/bash
    # swap the runtime's read-only cgroup2 mount for a read-write one;
    # inside the cgroup namespace it is scoped to this container's root
    umount /sys/fs/cgroup
    mount -t cgroup2 -o rw,relatime,nsdelegate,memory_recursiveprot cgroup2 /sys/fs/cgroup
    # drop the cap needed only for mounting, then run the real entrypoint
    exec capsh --drop=cap_sys_admin -- -c 'exec your_existing_entrypoint_here'
    

You can do that either by baking it into your container image (and referring to it appropriately - e.g. you may need to include capsh or mount/umount in your image) or by bind mounting a specially prepared volume from the host containing your script + interpreter, mount, umount and capsh.
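The bind-mount variant could look roughly like this (a sketch; /opt/shim is an arbitrary host directory holding the script and the tools it needs):

# SYS_ADMIN is needed only until the shim drops it
docker run -it \
    --cap-add SYS_ADMIN \
    --cgroupns=private \
    -v /opt/shim:/shim:ro \
    --entrypoint /shim/cgroup-shim.sh \
    your-image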

The script works by unmounting the /sys/fs/cgroup that was placed there by the container runtime (eg docker) and mounting one in its place as read-write. Because you're already running inside the cgroup namespace at this point, the mount will automatically be scoped to the namespace's cgroup root. capsh is then used to drop CAP_SYS_ADMIN (required to do the mounting) before replacing itself with your existing entrypoint.

If you don't care about dropping CAP_SYS_ADMIN after doing the remounting, you can omit capsh and just exec directly.

More about the behavior of this can be found at https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#namespace

@liyimeng

@techninja1008 nice trick! What I don't get is how you start the systemd service - don't you need to run systemd as the entrypoint?
If I do that, it drops me to a login prompt, which blocks me from running exec capsh --drop=cap_sys_admin -- -c 'exec your_existing_entrypoint_here'
