Documentation

Fedora 31

$ cat <<'EOF' > uci-data-guest
#cloud-config
password: guest
users:
  - name: guest
    passwd: $1$xyz$NupBwZXNoMXD8NQwzjRW/0
    groups: wheel
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh-authorized-keys:
      - <your-ssh-public-key>
ssh_pwauth: True
datasource_list: [ NoCloud, None ]
EOF
$ cloud-localds uci-data-guest.img uci-data-guest
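
The seed image can then be attached as a second drive when booting the cloud image (a sketch; the qcow2 file name depends on which Fedora cloud image was downloaded):

$ qemu-system-x86_64 -enable-kvm -m 2048 -nographic \
    -drive file=Fedora-Cloud-Base-31.qcow2,if=virtio \
    -drive file=uci-data-guest.img,if=virtio,format=raw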

Enable unified hierarchy

  • edit /etc/default/grub and regenerate the config:
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
#### add systemd.unified_cgroup_hierarchy
GRUB_CMDLINE_LINUX="no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 systemd.unified_cgroup_hierarchy"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

then:

$ grub2-mkconfig -o /boot/grub2/grub.cfg
  • OR use grubby:
$ sudo grubby --update-kernel=ALL --args='systemd.unified_cgroup_hierarchy=1'
  • reboot

  • verify:

[guest@localhost ~]$ mount|grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)

snapd

  • work around buggy rich dependency handling:
$ sudo dnf install kernel-modules -y

NOTE: may need to reboot if kernel-core got updated.

  • snapd
$ sudo dnf install snapd -y
$ sudo dnf builddep snapd -y
  • install hello-world:
$ sudo snap install hello-world

  • running hello-world fails once s-c (snap-confine) tries to access the freezer:

$ SNAPD_DEBUG=1 SNAP_CONFINE_DEBUG=1 snap run hello-world
2019/07/12 13:58:46.707278 cmd_linux.go:70: DEBUG: re-exec not supported on distro "fedora" yet
DEBUG: umask reset, old umask was   02
DEBUG: security tag: snap.hello-world.hello-world
DEBUG: executable:   /usr/lib/snapd/snap-exec
DEBUG: confinement:  non-classic
DEBUG: base snap:    core
DEBUG: ruid: 1000, euid: 0, suid: 0
DEBUG: rgid: 1000, egid: 0, sgid: 0
DEBUG: creating lock directory /run/snapd/lock (if missing)
DEBUG: opening lock directory /run/snapd/lock
DEBUG: opening lock file: /run/snapd/lock/.lock
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: acquiring exclusive lock (scope (global), uid 0)
DEBUG: sanity timeout reset and disabled
DEBUG: ensuring that snap mount directory is shared
DEBUG: unsharing snap namespace directory
DEBUG: releasing lock 5
DEBUG: opened snap-update-ns executable as file descriptor 5
DEBUG: opened snap-discard-ns executable as file descriptor 6
DEBUG: creating lock directory /run/snapd/lock (if missing)
DEBUG: opening lock directory /run/snapd/lock
DEBUG: opening lock file: /run/snapd/lock/hello-world.lock
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: acquiring exclusive lock (scope hello-world, uid 0)
DEBUG: sanity timeout reset and disabled
DEBUG: initializing mount namespace: hello-world
DEBUG: forked support process 1158
DEBUG: helper process waiting for command
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: block device of snap core, revision 7270 is 7:0
DEBUG: sanity timeout initialized and set for 30 seconds
DEBUG: joining preserved mount namespace for inspection
DEBUG: block device of the root filesystem is 7:0
DEBUG: sanity timeout reset and disabled
DEBUG: preserved mount namespace can be reused
DEBUG: joined preserved mount namespace hello-world
DEBUG: joining preserved per-user mount namespace
DEBUG: unsharing the mount namespace (per-user)
DEBUG: sc_setup_user_mounts: hello-world
DEBUG: NOT preserving per-user mount namespace
cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory

cgroupv2

statfs()

cgroup superblock magic:

#define CGROUP_SUPER_MAGIC	0x27e0eb
#define CGROUP2_SUPER_MAGIC	0x63677270
[root@localhost guest]# stat -f -c %t /sys/fs/cgroup/
63677270
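
A shell sketch of how a program (say, s-c) could branch on the magic; the values come from the defines above, and stat prints the hex value without leading zeros:

# sketch: decide v1 vs v2 from the superblock magic
magic=$(stat -f -c %t /sys/fs/cgroup)
case "$magic" in
        63677270) echo "unified hierarchy (cgroup2)" ;;
        27e0eb) echo "legacy cgroup (v1)" ;;
        *) echo "something else (tmpfs on a hybrid setup?)" ;;
esac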

Mount options

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)

TODO Do we need nsdelegate?

TODO check for nsdelegate mount option

nsdelegate

Consider cgroup namespaces as delegation boundaries. This option is system
wide and can only be set on mount or modified through remount from the
init namespace.
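
A possible check for the TODO above, using findmnt from util-linux:

$ findmnt -no OPTIONS /sys/fs/cgroup | tr ',' '\n' | grep -x nsdelegate
nsdelegate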

Controllers

Available:

$ cat /sys/fs/cgroup/testing/cgroup.controllers
memory pids

Enable via cgroup.subtree_control.

$ echo -n "+memory +pids" > /sys/fs/cgroup/testing/cgroup.subtree_control

With controllers enabled for its children, the cgroup itself cannot be occupied by any processes (the "no internal processes" rule):

$ echo 812 > /sys/fs/cgroup/testing/cgroup.procs
bash: echo: write error: Device or resource busy
$ mkdir /sys/fs/cgroup/testing/group1
$ echo 812 > /sys/fs/cgroup/testing/group1/cgroup.procs
$ cat /sys/fs/cgroup/testing/group1/cgroup.procs
812

TODO: find out why only the memory and pids controllers are available in the cgroup?
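
Likely answer: a child's cgroup.controllers lists exactly what is enabled in the parent's cgroup.subtree_control, and systemd apparently enables only memory and pids at the root by default. A sketch (run as root; the names must appear in /sys/fs/cgroup/cgroup.controllers):

# what the root currently delegates to its children
cat /sys/fs/cgroup/cgroup.subtree_control
# enable more controllers one level down ...
echo '+cpu +io' > /sys/fs/cgroup/cgroup.subtree_control
# ... they should now show up in the child's cgroup.controllers
cat /sys/fs/cgroup/testing/cgroup.controllers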

Events

$ cat /sys/fs/cgroup/testing/cgroup.events
populated 1
frozen 0
$ cat /sys/fs/cgroup/testing/group1/cgroup.events
populated 1
frozen 0

TODO: does this support epoll-like handling?
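
Per the kernel docs, a value change in cgroup.events generates a file-modified event, so it should be possible to wait on it with poll() (POLLPRI) or inotify. A quick check, assuming inotify-tools is installed:

$ inotifywait -e modify /sys/fs/cgroup/testing/group1/cgroup.events &
$ echo 1 > /sys/fs/cgroup/testing/group1/cgroup.freeze
/sys/fs/cgroup/testing/group1/cgroup.events MODIFY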

Freezer

No separate controller.

[root@localhost guest]# echo 1 > /sys/fs/cgroup/testing/group1/cgroup.freeze
[root@localhost guest]# echo 0 > /sys/fs/cgroup/testing/group1/cgroup.freeze
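
Freezing is asynchronous; the write returns immediately and the group reaches the frozen state some time later. A sketch of waiting for it via cgroup.events:

$ echo 1 > /sys/fs/cgroup/testing/group1/cgroup.freeze
$ until grep -q 'frozen 1' /sys/fs/cgroup/testing/group1/cgroup.events; do sleep 0.1; done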

Device controller

Namespace

In the root ns:

[root@localhost guest]# cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

After CLONE_NEWCGROUP:

[root@localhost guest]# unshare -f -C /bin/bash
[root@localhost guest]# cat /proc/self/cgroup
0::/

Move self into cgroup

[root@localhost guest]# export PS1='[\u@\h \W]\$ $(cat /proc/self/cgroup) $ '
[root@localhost guest]# 0::/user.slice/user-1000.slice/session-1.scope $
[root@localhost guest]# 0::/user.slice/user-1000.slice/session-1.scope $ echo 0 > /sys/fs/cgroup/testing/group1/cgroup.procs
[root@localhost guest]# 0::/testing/group1 $

unshare to a new cgroup namespace, the current cgroup becomes the root

[root@localhost guest]# 0::/testing/group1 $ unshare -C -f -m
[root@localhost guest]# 0::/ $

Mounting cgroup2 mounts the root cgroup:

[root@localhost guest]# 0::/ $ ls /sys/fs/cgroup/testing/group1/
cgroup.controllers  cgroup.max.descendants  cgroup.threads  io.pressure     memory.low        memory.pressure      memory.swap.max
cgroup.events       cgroup.procs            cgroup.type     memory.current  memory.max        memory.stat          pids.current
cgroup.freeze       cgroup.stat             cpu.pressure    memory.events   memory.min        memory.swap.current  pids.events
cgroup.max.depth    cgroup.subtree_control  cpu.stat        memory.high     memory.oom.group  memory.swap.events   pids.max
[root@localhost guest]# 0::/ $ mkdir /foo
[root@localhost guest]# 0::/ $ mount -t cgroup2 none /foo
[root@localhost guest]# 0::/ $ ls /foo/
cgroup.controllers  cgroup.max.descendants  cgroup.threads  io.pressure     memory.low        memory.pressure      memory.swap.max
cgroup.events       cgroup.procs            cgroup.type     memory.current  memory.max        memory.stat          pids.current
cgroup.freeze       cgroup.stat             cpu.pressure    memory.events   memory.min        memory.swap.current  pids.events
cgroup.max.depth    cgroup.subtree_control  cpu.stat        memory.high     memory.oom.group  memory.swap.events   pids.max

Experiments

$ cat /usr/bin/run-test
#!/bin/sh

set -e
set -x

if [ "$1" = "" ]; then
        echo "$0 [--delegate] <cgroup-group-path>"
        exit 1
fi

delegate=0

if [ "$1" = "--delegate" ]; then
        shift
        delegate=1
fi

# the cgroup we were started in
rootgr=$(cut -f3 -d: < /proc/self/cgroup)

# unescape the systemd-encoded instance name, e.g. testing-group1 -> testing/group1
grname="$(systemd-escape -u -p "$1")"
gr="/sys/fs/cgroup/$grname"
if [ "$delegate" = "1" ]; then
        ls -l "/sys/fs/cgroup/$rootgr" >&2
        # as a delegate, nest the group under our own cgroup
        gr="/sys/fs/cgroup/$rootgr/$grname"
        mkdir -p "$gr"
        ls -l "$gr" >&2
fi

echo "using group $gr"
# move ourselves into the target group
echo 0 > "$gr/cgroup.procs"
echo "cgroup: $(cat /proc/self/cgroup)"

while true; do
        echo "running"
        sleep 10
done

With --delegate, the script assumes it is a delegate and creates the hierarchy under its own cgroup.

Non-delegate

$ systemctl cat run-test@.service
# /etc/systemd/system/run-test@.service
[Unit]
Description=test prog

[Service]
Type=simple
ExecStart=/usr/bin/run-test %I

[Install]
WantedBy=default.target

Status is confusing:

[guest@localhost snapd]$ sudo systemctl status run-test@testing-group1
● run-test@testing-group1.service - test prog
   Loaded: loaded (/etc/systemd/system/run-test@.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-07-23 12:16:03 UTC; 3s ago
 Main PID: 19514 (run-test)
    Tasks: 0 (limit: 1669)
   Memory: 1.1M
      CPU: 40ms
   CGroup: /system.slice/system-run\x2dtest.slice/run-test@testing-group1.service
           ‣ 19514 /bin/sh /usr/bin/run-test testing/group1

The service moved itself to the testing/group1 group:

[guest@localhost snapd]$ systemd-cgls | head -10
Control group /:
-.slice
├─testing
│ └─group1
│   ├─19514 /bin/sh /usr/bin/run-test testing/group1
│   └─19557 sleep 10

Once stopped, /sys/fs/cgroup/testing/group1 is left behind.
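
The leftover groups are empty at that point, so they can be cleaned up with plain rmdir:

$ sudo rmdir /sys/fs/cgroup/testing/group1 /sys/fs/cgroup/testing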

Delegate

Can freely (?) manage the hierarchy under its own cgroup

[guest@localhost snapd]$ sudo systemctl cat run-test-delegate@testing-group1
# /etc/systemd/system/run-test-delegate@.service
[Unit]
Description=test prog

[Service]
Type=simple
ExecStart=/usr/bin/run-test --delegate %I
Delegate=true

[Install]
WantedBy=default.target

Status looks sane, reflects nested cgroup:

[guest@localhost snapd]$ sudo systemctl status run-test-delegate@testing-group1
● run-test-delegate@testing-group1.service - test prog
   Loaded: loaded (/etc/systemd/system/run-test-delegate@.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-07-23 12:15:37 UTC; 6min ago
 Main PID: 19491 (run-test)
    Tasks: 2 (limit: 1669)
   Memory: 2.1M
      CPU: 220ms
   CGroup: /system.slice/system-run\x2dtest\x2ddelegate.slice/run-test-delegate@testing-group1.service
           └─testing
             └─group1
               ├─19491 /bin/sh /usr/bin/run-test --delegate testing/group1
               └─19656 sleep 10

Whole hierarchy is cleaned up when stopping.

Snapd refresh app awareness

broken

Snap-like MVP

Problems:

  • easily identify whether any apps using a given snap are running
  • be able to freeze processes for mount ns changes
  • try not to escape from systemd's view of processes

Attempt slices

Emulate snap-confine:

#!/bin/sh
set -e
set -x

delegate=0
if [ "$1" = "--delegate" ]; then
        shift
        delegate=1
fi

if [ "$1" = "--cgroup" ]; then
        shift
        # the cgroup we were started in
        rootgr=$(cut -f3 -d: < /proc/self/cgroup)

        grname="$1"
        shift

        gr="/sys/fs/cgroup/$grname"
        if [ "$delegate" = "1" ]; then
                ls -l "/sys/fs/cgroup/$rootgr" >&2
                # as a delegate, nest the group under our own cgroup
                gr="/sys/fs/cgroup/$rootgr/$grname"
        fi

        mkdir -p "$gr"
        echo "using group $gr"
        # move ourselves into the requested group
        echo 0 > "$gr/cgroup.procs"
fi

echo "cgroup: $(cat /proc/self/cgroup)"
# hand over to the shell with whatever arguments remain (e.g. -c '...')
exec /bin/sh "$@"

Slices group processes together hierarchically. systemd creates a number of implicit slices (-.slice, system.slice, user.slice etc.). System services end up under system.slice. Users logged in via systemd-logind are automatically placed under user.slice/user-<uid>.slice. The - encodes the hierarchy: foo-bar-baz.slice becomes /foo.slice/foo-bar.slice/foo-bar-baz.slice.

Use statically declared snaps.slice:

# /etc/systemd/system/snaps.slice
[Unit]
Description=slice for snaps

Snapd could create a slice for each snap:

# /etc/systemd/system/snaps-foo.slice
[Unit]
Description=slice for snap foo

NOTE: slices cannot have Delegate= set.

Jul 25 07:50:02 localhost systemd[1]: /etc/systemd/system/snaps.slice:5: Delegate= setting not supported for this unit type, ignoring.

Snap service

# /etc/systemd/system/snap.foo.svc.service
[Unit]
Description=foo

[Service]
Type=simple
ExecStart=/usr/bin/s-c -c 'while true; do echo running; sleep 10; done'
Slice=snaps-foo.slice

[Install]
WantedBy=default.target

Started service looks fine:

[root@localhost guest]# systemctl status snap.foo.svc
● snap.foo.svc.service - foo
   Loaded: loaded (/etc/systemd/system/snap.foo.svc.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-07-25 12:21:53 UTC; 2s ago
 Main PID: 4730 (sh)
    Tasks: 2 (limit: 1665)
   Memory: 772.0K
      CPU: 15ms
   CGroup: /snaps.slice/snaps-foo.slice/snap.foo.svc.service  <--- NOTE
           ├─4730 /bin/sh -c while true; do echo running; sleep 10; done
           └─4731 sleep 10

An easy check for whether any services from the snap are running:

[root@localhost guest]# cat /sys/fs/cgroup/snaps.slice/snaps-foo.slice/pids.current 
2

The hierarchy is cleaned up on stop. systemctl status snaps-foo.slice shows the processes of the snap.
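
This should also cover the freezer problem from the list above: writing to cgroup.freeze on the snap's slice freezes the whole subtree (a sketch, using the slice paths set up above):

# freeze every process of snap foo, then thaw
echo 1 > /sys/fs/cgroup/snaps.slice/snaps-foo.slice/cgroup.freeze
echo 0 > /sys/fs/cgroup/snaps.slice/snaps-foo.slice/cgroup.freeze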

User process

User processes run under the user slice & session scope.

[guest@localhost ~]$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/session-1.scope
[root@localhost guest]# cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/session-1.scope

A scope manages a group of externally created processes. Scopes can only be started via the D-Bus API.

New processes occupy the same cgroup as their parent, so processes started from the shell will be in the same slice & scope as the shell process.

systemd-run can create scopes and slices under the default manager (i.e. user's systemd manager, or the system systemd manager).

[guest@localhost ~]$ systemd-run --slice snap-foo --unit foo --user sleep 360
Running as unit: foo.service
[guest@localhost ~]$ systemctl status --user foo
● foo.service - /usr/bin/sleep 360
   Loaded: loaded (/run/user/1000/systemd/transient/foo.service; transient)
Transient: yes
   Active: active (running) since Thu 2019-07-25 12:50:57 UTC; 4s ago
 Main PID: 5268 (sleep)
    Tasks: 1 (limit: 1665)
   Memory: 192.0K
      CPU: 5ms
   CGroup: /user.slice/user-1000.slice/user@1000.service/snap.slice/snap-foo.slice/foo.service
           └─5268 /usr/bin/sleep 360
[guest@localhost ~]$ cat /proc/5268/cgroup
0::/user.slice/user-1000.slice/user@1000.service/snap.slice/snap-foo.slice/foo.service

Similarly, a new scope is created under the current root:

[guest@localhost ~]$ systemd-run --scope --unit foo --user bash
Running scope as unit: foo.scope
[guest@localhost ~]$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/user@1000.service/foo.scope
[guest@localhost ~]$ sleep 360 &
[1] 5360
[guest@localhost ~]$ systemctl status --user foo.scope
● foo.scope - /usr/bin/bash
   Loaded: loaded (/run/user/1000/systemd/transient/foo.scope; transient)
Transient: yes
   Active: active (running) since Thu 2019-07-25 12:55:38 UTC; 14s ago
    Tasks: 4 (limit: 1665)
   Memory: 3.8M
      CPU: 167ms
   CGroup: /user.slice/user-1000.slice/user@1000.service/foo.scope
           ├─5330 /usr/bin/bash
           ├─5360 sleep 360
           ├─5361 systemctl status --user foo.scope
           └─5362 less

NOTE we could start snaps under a scope; those would be tracked in cgroups under user/user-%U/snap.foo.scope (or a slice). The downside is that freezing and checking would require the snap* tooling to browse the whole tree.

Escaping

We can escape the cgroup for simple apps ourselves:

[root@localhost guest]# /usr/bin/s-c --cgroup snaps.slice/snaps-foo.slice/snap.foo.app2 -c 'sleep 9001' &
[root@localhost guest]# systemctl status snaps-foo.slice
● snaps-foo.slice - slice for snap foo
   Loaded: loaded (/etc/systemd/system/snaps-foo.slice; static; vendor preset: disabled)
   Active: active since Thu 2019-07-25 12:21:53 UTC; 58min ago
    Tasks: 3
   Memory: 1.8M
      CPU: 1.414s
   CGroup: /snaps.slice/snaps-foo.slice
           ├─snap.foo.app2
           │ └─5702 sleep 9001
           └─snap.foo.svc.service
             ├─4730 /bin/sh -c while true; do echo running; sleep 10; done
             └─5717 sleep 10

The pid count is then easily accessible:

[root@localhost guest]# cat /sys/fs/cgroup/snaps.slice/snaps-foo.slice/pids.current 
3

But the process is clearly no longer under user-1000.slice/session-1.scope.

NOTE s-c would need to set up the tree correctly.

NOTE-2 this would be a regression from the cgroupv1 setup, where processes remain visible under the user's slice.

User services

Try a user service:

# /etc/systemd/user/snap.foo.app.service
[Unit]
Description=foo

[Service]
Type=simple
ExecStart=/usr/bin/s-c -c 'while true; do echo running; sleep 10; done'
Slice=snaps-foo.slice

[Install]
WantedBy=default.target

This is the cgroup it ends up in:

[guest@localhost ~]$ systemctl status --user snap.foo.app
● snap.foo.app.service - foo
   Loaded: loaded (/etc/systemd/user/snap.foo.app.service; disabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-07-25 09:54:31 UTC; 3h 30min ago
 Main PID: 2735 (sh)
    Tasks: 2 (limit: 1665)
   Memory: 1.2M
      CPU: 5.241s
   CGroup: /user.slice/user-1000.slice/user@1000.service/snaps.slice/snaps-foo.slice/snap.foo.app.service
           ├─2735 /bin/sh -c while true; do echo running; sleep 10; done
           └─5774 sleep 10

The cgroup is clearly nested under user-1000.slice.
