References
- https://www.kernel.org/doc/Documentation/cgroup-v2.txt
- https://www.kernel.org/doc/html/v5.2/admin-guide/cgroup-v2.html
- https://systemd.io/CGROUP_DELEGATION.html
- Grab a Fedora Rawhide (soon-to-be-branched-as-31) image
- prepare cloud-init data:
$ cat <<'EOF' > uci-data-guest
#cloud-config
password: guest
users:
  - name: guest
    passwd: $1$xyz$NupBwZXNoMXD8NQwzjRW/0
    groups: wheel
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh-authorized-keys:
      - <your-ssh-public-key>
ssh_pwauth: True
datasource_list: [ NoCloud, None ]
EOF
$ cloud-localds uci-data-guest.img uci-data-guest
- grab a helper for running qemu: https://gist.githubusercontent.com/bboozzoo/47e3de78551850bae96d5274403cc120/raw/1f8b3de57563f1b7ebad43268deafc8686d2a2fb/run-qemu
- ./run-qemu Fedora-Cloud-Base-Rawhide-20190711.n.1.x86_64.qcow2 -drive file=uci-data-guest.img,if=virtio,index=1,snapshot=on -smp
- connect via ssh to the provided port
- edit /etc/default/grub and regenerate the config:
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
#### add systemd.unified_cgroup_hierarchy
GRUB_CMDLINE_LINUX="no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 systemd.unified_cgroup_hierarchy"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true
then:
$ grub2-mkconfig -o /boot/grub2/grub.cfg
- OR use grubby:
$ sudo grubby --update-kernel=ALL --args='systemd.unified_cgroup_hierarchy=1'
- reboot
- verify:
[guest@localhost ~]$ mount|grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
- workaround buggy rich dependency handling:
$ sudo dnf install kernel-modules -y
NOTE: may need to reboot if kernel-core got updated.
- snapd
$ sudo dnf install snapd -y
$ sudo dnf builddep snapd -y
- install hello-world
- running hello-world fails once s-c (snap-confine) tries to access the freezer:
$ SNAPD_DEBUG=1 SNAP_CONFINE_DEBUG=1 snap run hello-world
2019/07/12 13:58:46.707278 cmd_linux.go:70: DEBUG: re-exec not supported on distro "fedora" yet
DEBUG: umask reset, old umask was 02
DEBUG: security tag: snap.hello-world.hello-world
DEBUG: executable: /usr/lib/snapd/snap-exec
DEBUG: confinement: non-classic
DEBUG: base snap: core
DEBUG: ruid: 1000, euid: 0, suid: 0
DEBUG: rgid: 1000, egid: 0, sgid: 0
DEBUG: creating lock directory /run/snapd/lock (if missing)
DEBUG: opening lock directory /run/snapd/lock
DEBUG: opening lock file: /run/snapd/lock/.lock
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: acquiring exclusive lock (scope (global), uid 0)
DEBUG: sanity timeout reset and disabled
DEBUG: ensuring that snap mount directory is shared
DEBUG: unsharing snap namespace directory
DEBUG: releasing lock 5
DEBUG: opened snap-update-ns executable as file descriptor 5
DEBUG: opened snap-discard-ns executable as file descriptor 6
DEBUG: creating lock directory /run/snapd/lock (if missing)
DEBUG: opening lock directory /run/snapd/lock
DEBUG: opening lock file: /run/snapd/lock/hello-world.lock
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: acquiring exclusive lock (scope hello-world, uid 0)
DEBUG: sanity timeout reset and disabled
DEBUG: initializing mount namespace: hello-world
DEBUG: forked support process 1158
DEBUG: helper process waiting for command
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: block device of snap core, revision 7270 is 7:0
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: joining preserved mount namespace for inspection
DEBUG: block device of the root filesystem is 7:0
DEBUG: sanity timeout reset and disabled
DEBUG: preserved mount namespace can be reused
DEBUG: joined preserved mount namespace hello-world
DEBUG: joining preserved per-user mount namespace
DEBUG: unsharing the mount namespace (per-user)
DEBUG: sc_setup_user_mounts: hello-world
DEBUG: NOT preserving per-user mount namespace
cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory
cgroup superblock magic:
#define CGROUP_SUPER_MAGIC 0x27e0eb
#define CGROUP2_SUPER_MAGIC 0x63677270
[root@localhost guest]# stat -f -c %t /sys/fs/cgroup/
63677270
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
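stat can also map the magic to a filesystem-type name; %T should report cgroup2fs for a cgroup2 mount:
[root@localhost guest]# stat -f -c %T /sys/fs/cgroup/
cgroup2fs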
TODO: Do we need nsdelegate?
TODO: check for the nsdelegate mount option
nsdelegate: Consider cgroup namespaces as delegation boundaries. This option is system wide and can only be set on mount or modified through remount from the init namespace.
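A quick way to check for the option on a running system (findmnt is part of util-linux); it prints nsdelegate only when the option is set:
$ findmnt -n -o OPTIONS /sys/fs/cgroup | tr ',' '\n' | grep -x nsdelegate
nsdelegate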
Available:
$ cat /sys/fs/cgroup/testing/cgroup.controllers
memory pids
Enable via cgroup.subtree_control:
$ echo -n "+memory +pids" > /sys/fs/cgroup/testing/cgroup.subtree_control
Once controllers are enabled in cgroup.subtree_control, the cgroup itself cannot be occupied by any processes (only leaf groups can hold them):
$ echo 812 > /sys/fs/cgroup/testing/cgroup.procs
bash: echo: write error: Device or resource busy
$ mkdir /sys/fs/cgroup/testing/group1
$ echo 812 > /sys/fs/cgroup/testing/group1/cgroup.procs
$ cat /sys/fs/cgroup/testing/group1/cgroup.procs
812
TODO: find out why only the memory and pids controllers are available in the cgroup
$ cat /sys/fs/cgroup/testing/cgroup.events
populated 1
frozen 0
$ cat /sys/fs/cgroup/testing/group1/cgroup.events
populated 1
frozen 0
TODO: does this support epoll-like handling?
There is no separate freezer controller; freezing is core functionality exposed via cgroup.freeze:
[root@localhost guest]# echo 1 > /sys/fs/cgroup/testing/group1/cgroup.freeze
[root@localhost guest]# echo 0 > /sys/fs/cgroup/testing/group1/cgroup.freeze
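Regarding the epoll question: per cgroup-v2.txt, a change of the frozen state generates a file-modified event on cgroup.events, so it can be waited for with poll/inotify instead of re-reading in a loop. A rough sketch, assuming inotify-tools is installed:
[root@localhost guest]# inotifywait -e modify /sys/fs/cgroup/testing/group1/cgroup.events && cat /sys/fs/cgroup/testing/group1/cgroup.events
(in another shell)
[root@localhost guest]# echo 1 > /sys/fs/cgroup/testing/group1/cgroup.freeze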
- eBPF, bpf(BPF_PROG_LOAD, ...) with bpf_attr.prog_type = BPF_PROG_TYPE_CGROUP_DEVICE, attached with BPF_CGROUP_DEVICE
- device firewall as implemented in systemd https://github.com/systemd/systemd/blob/master/src/core/bpf-devices.c
- the BPF program receives a struct bpf_cgroup_dev_ctx *, https://elixir.bootlin.com/linux/v5.2.2/source/include/uapi/linux/bpf.h#L3367
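For inspection, attached cgroup BPF programs (including device programs) can be listed per cgroup, e.g. with bpftool if it is installed:
$ sudo bpftool cgroup tree /sys/fs/cgroup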
In the root ns:
[root@localhost guest]# cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
After CLONE_NEWCGROUP:
[root@localhost guest]# unshare -f -C /bin/bash
[root@localhost guest]# cat /proc/self/cgroup
0::/
Move self into a cgroup:
[root@localhost guest]# export PS1='[\u@\h \W]\$ $(cat /proc/self/cgroup) $ '
[root@localhost guest]# 0::/user.slice/user-1000.slice/session-1.scope $
[root@localhost guest]# 0::/user.slice/user-1000.slice/session-1.scope $ echo 0 > /sys/fs/cgroup/testing/group1/cgroup.procs
[root@localhost guest]# 0::/testing/group1 $
unshare into a new cgroup namespace; the current cgroup becomes the namespace root:
[root@localhost guest]# 0::/testing/group1 $ unshare -C -f -m
[root@localhost guest]# 0::/ $
Mounting cgroup2 from inside the namespace mounts the namespace's root cgroup:
[root@localhost guest]# 0::/ $ ls /sys/fs/cgroup/testing/group1/
cgroup.controllers cgroup.max.descendants cgroup.threads io.pressure memory.low memory.pressure memory.swap.max
cgroup.events cgroup.procs cgroup.type memory.current memory.max memory.stat pids.current
cgroup.freeze cgroup.stat cpu.pressure memory.events memory.min memory.swap.current pids.events
cgroup.max.depth cgroup.subtree_control cpu.stat memory.high memory.oom.group memory.swap.events pids.max
[root@localhost guest]# 0::/ $ mkdir /foo
[root@localhost guest]# 0::/ $ mount -t cgroup2 none /foo
[root@localhost guest]# 0::/ $ ls /foo/
cgroup.controllers cgroup.max.descendants cgroup.threads io.pressure memory.low memory.pressure memory.swap.max
cgroup.events cgroup.procs cgroup.type memory.current memory.max memory.stat pids.current
cgroup.freeze cgroup.stat cpu.pressure memory.events memory.min memory.swap.current pids.events
cgroup.max.depth cgroup.subtree_control cpu.stat memory.high memory.oom.group memory.swap.events pids.max
$ cat /usr/bin/run-test
#!/bin/sh
set -e
set -x
if [ "$1" = "" ]; then
    echo "$0 <cgroup-group-path>"
    exit 1
fi
delegate=0
if [ "$1" = "--delegate" ]; then
    shift
    delegate=1
fi
# cgroup we were started in (3rd field of the 0:: entry in /proc/self/cgroup)
rootgr=$(cut -f3 -d: < /proc/self/cgroup)
grname="$(systemd-escape -u -p "$1")"
gr="/sys/fs/cgroup/$grname"
if [ "$delegate" = "1" ]; then
    # as a delegate, nest the requested group under our own cgroup
    ls -l "/sys/fs/cgroup/$rootgr" >&2
    gr="/sys/fs/cgroup/$rootgr/$grname"
    mkdir -p "$gr"
    ls -l "$gr" >&2
fi
echo "using group $gr"
# move ourselves into the target group
echo 0 > "$gr/cgroup.procs"
echo "cgroup: $(cat /proc/self/cgroup)"
while true; do
    echo "running"
    sleep 10
done
With --delegate, the script assumes it is running as a delegate and creates the hierarchy under its own cgroup.
$ systemctl cat run-test@.service
# /etc/systemd/system/run-test@.service
[Unit]
Description=test prog
[Service]
Type=simple
ExecStart=/usr/bin/run-test %I
[Install]
WantedBy=default.target
Status is confusing: the service shows Tasks: 0 because the main process moved itself out of the unit's cgroup:
[guest@localhost snapd]$ sudo systemctl status run-test@testing-group1
● run-test@testing-group1.service - test prog
Loaded: loaded (/etc/systemd/system/run-test@.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-23 12:16:03 UTC; 3s ago
Main PID: 19514 (run-test)
Tasks: 0 (limit: 1669)
Memory: 1.1M
CPU: 40ms
CGroup: /system.slice/system-run\x2dtest.slice/run-test@testing-group1.service
‣ 19514 /bin/sh /usr/bin/run-test testing/group1
The service moved itself to the testing/group1 group:
[guest@localhost snapd]$ systemd-cgls | head -10
Control group /:
-.slice
├─testing
│ └─group1
│ ├─19514 /bin/sh /usr/bin/run-test testing/group1
│ └─19557 sleep 10
Once stopped, /sys/fs/cgroup/testing/group1 is left behind.
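The leftover group is unpopulated at that point, so it can be removed with a plain rmdir:
$ sudo rmdir /sys/fs/cgroup/testing/group1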
With Delegate=true, the service can freely (?) manage the hierarchy under its own cgroup:
[guest@localhost snapd]$ sudo systemctl cat run-test-delegate@testing-group1
# /etc/systemd/system/run-test-delegate@.service
[Unit]
Description=test prog
[Service]
Type=simple
ExecStart=/usr/bin/run-test --delegate %I
Delegate=true
[Install]
WantedBy=default.target
Status looks sane, reflects nested cgroup:
[guest@localhost snapd]$ sudo systemctl status run-test-delegate@testing-group1
● run-test-delegate@testing-group1.service - test prog
Loaded: loaded (/etc/systemd/system/run-test-delegate@.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-23 12:15:37 UTC; 6min ago
Main PID: 19491 (run-test)
Tasks: 2 (limit: 1669)
Memory: 2.1M
CPU: 220ms
CGroup: /system.slice/system-run\x2dtest\x2ddelegate.slice/run-test-delegate@testing-group1.service
└─testing
└─group1
├─19491 /bin/sh /usr/bin/run-test --delegate testing/group1
└─19656 sleep 10
Whole hierarchy is cleaned up when stopping.
Problems:
- easily identify whether any apps using given snap are running
- be able to freeze processes for mount ns changes
- try not to escape from systemd's view of processes
Emulate snap-confine:
#!/bin/sh
set -e
set -x
delegate=0
if [ "$1" = "--delegate" ]; then
    shift
    delegate=1
fi
if [ "$1" = "--cgroup" ]; then
    shift
    # cgroup we were started in (3rd field of the 0:: entry in /proc/self/cgroup)
    rootgr=$(cut -f3 -d: < /proc/self/cgroup)
    grname="$1"
    shift
    gr="/sys/fs/cgroup/$grname"
    if [ "$delegate" = "1" ]; then
        ls -l "/sys/fs/cgroup/$rootgr" >&2
        gr="/sys/fs/cgroup/$rootgr/$grname"
    fi
    mkdir -p "$gr"
    echo "using group $gr"
    # move ourselves into the target group before exec'ing the payload
    echo 0 > "$gr/cgroup.procs"
fi
echo "cgroup: $(cat /proc/self/cgroup)"
exec /bin/sh "$@"
Slices group processes together hierarchically. There are implicit slices created by
systemd (-.slice, system.slice, user.slice etc.). System services end up under
system.slice. Users logged in via systemd-logind are automatically placed
under user.slice/user-<uid>.slice. The - encodes a hierarchy:
foo-bar-baz.slice is /foo.slice/foo-bar.slice/foo-bar-baz.slice.
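The path <-> unit-name mapping can be checked with systemd-escape, e.g.:
$ systemd-escape --path --suffix=slice /foo/bar/baz
foo-bar-baz.slice
$ systemd-escape --unescape --path foo-bar-baz
/foo/bar/baz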
Use a statically declared snaps.slice:
# /etc/systemd/system/snaps.slice
[Unit]
Description=slice for snaps
Snapd could create a slice for each snap:
# /etc/systemd/system/snaps-foo.slice
[Unit]
Description=slice for snap foo
NOTE: slices cannot have Delegate= set:
Jul 25 07:50:02 localhost systemd[1]: /etc/systemd/system/snaps.slice:5: Delegate= setting not supported for this unit type, ignoring.
# /etc/systemd/system/snap.foo.svc.service
[Unit]
Description=foo
[Service]
Type=simple
ExecStart=/usr/bin/s-c -c 'while true; do echo running; sleep 10; done'
Slice=snaps-foo.slice
[Install]
WantedBy=default.target
Started service looks fine:
[root@localhost guest]# systemctl status snap.foo.svc
● snap.foo.svc.service - foo
Loaded: loaded (/etc/systemd/system/snap.foo.svc.service; disabled; vendor preset: disabled)
Active: active (running) since Thu 2019-07-25 12:21:53 UTC; 2s ago
Main PID: 4730 (sh)
Tasks: 2 (limit: 1665)
Memory: 772.0K
CPU: 15ms
CGroup: /snaps.slice/snaps-foo.slice/snap.foo.svc.service <--- NOTE
├─4730 /bin/sh -c while true; do echo running; sleep 10; done
└─4731 sleep 10
Easy check whether any services from the snap are running:
[root@localhost guest]# cat /sys/fs/cgroup/snaps.slice/snaps-foo.slice/pids.current
2
The hierarchy is cleaned up on stop. systemctl status snaps-foo.slice shows
the processes of the snap.
User processes run under the user slice & session scope:
[guest@localhost ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
[root@localhost guest]# cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
A scope manages a group of processes; scopes can only be started via the DBus API.
Processes occupy the same cgroup as their parent, so processes started from the shell will be in the same slice & scope as the shell process.
systemd-run can create scopes and slices under the default manager (i.e. the
user's systemd manager, or the system manager).
[guest@localhost ~]$ systemd-run --slice snap-foo --unit foo --user sleep 360
Running as unit: foo.service
[guest@localhost ~]$ systemctl status --user foo
● foo.service - /usr/bin/sleep 360
Loaded: loaded (/run/user/1000/systemd/transient/foo.service; transient)
Transient: yes
Active: active (running) since Thu 2019-07-25 12:50:57 UTC; 4s ago
Main PID: 5268 (sleep)
Tasks: 1 (limit: 1665)
Memory: 192.0K
CPU: 5ms
CGroup: /user.slice/user-1000.slice/user@1000.service/snap.slice/snap-foo.slice/foo.service
└─5268 /usr/bin/sleep 360
[guest@localhost ~]$ cat /proc/5268/cgroup
0::/user.slice/user-1000.slice/user@1000.service/snap.slice/snap-foo.slice/foo.service
Similarly, a new scope is created under the current root:
[guest@localhost ~]$ systemd-run --scope --unit foo --user bash
Running scope as unit: foo.scope
[guest@localhost ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/user@1000.service/foo.scope
[guest@localhost ~]$ sleep 360 &
[1] 5360
[guest@localhost ~]$ systemctl status --user foo.scope
● foo.scope - /usr/bin/bash
Loaded: loaded (/run/user/1000/systemd/transient/foo.scope; transient)
Transient: yes
Active: active (running) since Thu 2019-07-25 12:55:38 UTC; 14s ago
Tasks: 4 (limit: 1665)
Memory: 3.8M
CPU: 167ms
CGroup: /user.slice/user-1000.slice/user@1000.service/foo.scope
├─5330 /usr/bin/bash
├─5360 sleep 360
├─5361 systemctl status --user foo.scope
└─5362 less
NOTE we could start snaps under a scope; those would be tracked in cgroups
under user/user-%U/snap.foo.scope (or a slice). The downside is that freezing
and checking would require snap* to browse the whole tree (see the sketch below).
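A rough sketch of such a tree walk (snap and unit names are illustrative), listing every cgroup belonging to a snap foo together with its processes:
$ find /sys/fs/cgroup -type d \( -name 'snap.foo.*' -o -name 'snaps-foo.slice' \) \
      -exec sh -c 'echo "$1:"; cat "$1/cgroup.procs"' _ {} \;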
We can escape the cgroup for simple apps ourselves:
[root@localhost guest]# /usr/bin/s-c --cgroup snaps.slice/snaps-foo.slice/snap.foo.app2 -c 'sleep 9001' &
[root@localhost guest]# systemctl status snaps-foo.slice
● snaps-foo.slice - slice for snap foo
Loaded: loaded (/etc/systemd/system/snaps-foo.slice; static; vendor preset: disabled)
Active: active since Thu 2019-07-25 12:21:53 UTC; 58min ago
Tasks: 3
Memory: 1.8M
CPU: 1.414s
CGroup: /snaps.slice/snaps-foo.slice
├─snap.foo.app2
│ └─5702 sleep 9001
└─snap.foo.svc.service
├─4730 /bin/sh -c while true; do echo running; sleep 10; done
└─5717 sleep 10
The pid count is then easily accessible:
[root@localhost guest]# cat /sys/fs/cgroup/snaps.slice/snaps-foo.slice/pids.current
3
But the process is clearly no longer under user-1000.slice/session-1.scope.
NOTE s-c would need to set up the tree correctly.
NOTE-2: this would be a regression from the cgroup v1 setup, where processes remain visible under the user's slice.
Try a user service:
# /etc/systemd/user/snap.foo.app.service
[Unit]
Description=foo
[Service]
Type=simple
ExecStart=/usr/bin/s-c -c 'while true; do echo running; sleep 10; done'
Slice=snaps-foo.slice
[Install]
WantedBy=default.target
This is the cgroup it ends up in:
[guest@localhost ~]$ systemctl status --user snap.foo.app
● snap.foo.app.service - foo
Loaded: loaded (/etc/systemd/user/snap.foo.app.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2019-07-25 09:54:31 UTC; 3h 30min ago
Main PID: 2735 (sh)
Tasks: 2 (limit: 1665)
Memory: 1.2M
CPU: 5.241s
CGroup: /user.slice/user-1000.slice/user@1000.service/snaps.slice/snaps-foo.slice/snap.foo.app.service
├─2735 /bin/sh -c while true; do echo running; sleep 10; done
└─5774 sleep 10
The cgroup is clearly nested under user-1000.slice.
This is now addressed in https://github.com/snapcore/snapd/compare/master...mvo5:cgroups-v2?expand=1