@ganeshmaharaj
Last active April 4, 2022 16:56
automate update all cgroup cpus in a system using udev rules

cgroups, containers and udev rules

This gist mostly documents my journey of understanding how cpusets are set up for containers when cores go offline, and how I used udev rules to automate updating the cpuset files.

PS: This topic is only valid for cgroups v1. I have yet to try this out with cgroups v2 and see how it behaves.

Problem

From within a container, setting thread affinity to a core that was offlined and then onlined again will fail, even though the core is back online and you can schedule onto it from the host.

cgroups and cpusets

I will not cover cgroups in depth here. There are a ton of articles on the internet that explain cgroups much better than I can. I will stick with cpusets and what I observed about them with containers.

cgroups control the allocation and placement of processes onto CPUs for all the containers created in a system. In systems with cgroups v1, this creates /sys/fs/cgroup/cpuset/<path-defined-by-container-runtimes>/<container>/cpuset.cpus, which shows the CPUs that the process can be placed on. There is a top-level parent cpuset.cpus file at /sys/fs/cgroup/cpuset that is the overarching list of CPUs that can be used. The maximum and the minimum amount of CPU needed for the process are defined at /sys/fs/cgroup/cpu.

Now when we offline a core in the system, every cpuset.cpus that lists that core is updated to drop it, which tells the processes that they can no longer use it. For example, if your system has 8 cores and all of them can be used, /sys/fs/cgroup/cpuset/cpuset.cpus would contain 0-7. Let us say we offline core 2; the file would now read 0-1,3-7.
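To make the list format concrete, here is a minimal sketch that expands a cpuset list into individual CPU numbers. The function name expand_cpu_list is my own invention for illustration and is not part of any cgroup tooling; it assumes the input uses only comma-separated numbers and ranges.

```shell
#!/bin/sh
# expand_cpu_list is a hypothetical helper, not part of any cgroup tooling.
# It turns a cpuset list such as "0-1,3-7" into "0 1 3 4 5 6 7".
# The body runs in a subshell so the IFS change does not leak to the caller.
expand_cpu_list() (
    list=$1
    out=""
    IFS=,
    for part in $list; do
        case $part in
            *-*)
                # A range like 3-7: walk from the start to the end value.
                i=${part%-*}
                end=${part#*-}
                while [ "$i" -le "$end" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single CPU number.
                out="$out $part"
                ;;
        esac
    done
    echo "${out# }"
)

# On a cgroups v1 host you could feed it the parent cpuset:
# expand_cpu_list "$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)"
```

Comparing the expanded lists of a parent and a child makes a missing core easy to spot.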

If the core is brought back online, the parent cpuset.cpus is updated to reflect this change, but none of the sub cpuset.cpus files are. My hypothesis is that when a core is offlined, the kernel has to notify all the processes that the core is not available, to avoid possible errors when pinning threads to it. When the core is turned back on, however, the kernel has no way to know whether each sub cpuset definition was set that way by the user or is a result of the core going offline.

In our particular use case, we offline a core and bring it back online from within a container, and then can no longer use that core because the CPU sets are not updated.
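A quick way to see this state is to list every child cpuset whose content differs from the parent. The sketch below is only a diagnostic; list_stale_cpusets is a hypothetical name, the default root assumes the usual cgroups v1 mount point, and it relies on the -mindepth option found in GNU and BSD find. Keep in mind that a difference may also be a restriction someone set on purpose.

```shell
#!/bin/sh
# list_stale_cpusets is a hypothetical helper. It prints every child
# cpuset.cpus file under ROOT whose content differs from ROOT/cpuset.cpus.
# ROOT defaults to the usual cgroups v1 cpuset mount point (an assumption).
list_stale_cpusets() {
    root=${1:-/sys/fs/cgroup/cpuset}
    parent=$(cat "$root/cpuset.cpus") || return 1
    # -mindepth 2 skips the parent file itself.
    find "$root" -mindepth 2 -name cpuset.cpus | while IFS= read -r f; do
        current=$(cat "$f")
        if [ "$current" != "$parent" ]; then
            echo "$f: $current (parent: $parent)"
        fi
    done
}

# Example: list_stale_cpusets
```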

Solution

Our workaround was to literally copy the contents of the parent cpuset.cpus into all the sub cpuset.cpus files, which allows our container to proceed normally. Now, can we automate this?
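Done by hand, the workaround could look roughly like the sketch below. sync_cpusets is a hypothetical helper name and the default root is the usual cgroups v1 mount point. It walks directories top-down because in cgroups v1 a child cpuset has to be a subset of its parent, so the parent must be widened first. Like the manual copy, it overwrites any restriction that was set deliberately.

```shell
#!/bin/sh
# sync_cpusets is a hypothetical helper that mirrors the manual workaround:
# it writes the parent cpuset.cpus into every descendant cgroup.
# Needs root when pointed at the real cgroup filesystem.
sync_cpusets() {
    root=${1:-/sys/fs/cgroup/cpuset}
    parent=$(cat "$root/cpuset.cpus") || return 1
    # find prints a directory before the directories nested inside it,
    # so each cgroup is updated before its own children.
    find "$root" -mindepth 1 -type d | while IFS= read -r dir; do
        if [ -f "$dir/cpuset.cpus" ]; then
            echo "$parent" > "$dir/cpuset.cpus"
        fi
    done
}

# Example (as root): sync_cpusets
```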

UDEV rules to the rescue.

udev allows automated actions in userspace based on events from devices in the system. The kernel notifies the udev daemon about events related to the devices, and udev rules get executed if their conditions match the event. USB media devices are the most common examples you will find from a quick internet search.

These steps helped me figure out how to set up my rule.

  • udevadm info -a -p <DEVICE> helps you figure out the subsystem and various attributes for your device. In my case, I just ran that against a single cpu device.
$ udevadm info -a  -p /sys/devices/system/cpu/cpu3

Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.

  looking at device '/devices/system/cpu/cpu3':
    KERNEL=="cpu3"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}=="  (null)"
    ATTRS{offline}==""
    ATTRS{online}=="0-7"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-7"
  • The one piece of information that I was not able to find here is which ACTIONs each subsystem honors. The default rules files that come with the base distro are a good resource for this. /usr/lib/udev/rules.d is the default path in most distributions. From those I found that we can run a command when any CPU comes back online.
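One way to skim the default rules for the actions they match on is a small grep, sketched below. list_rule_actions is a hypothetical helper name, and the default directory is an assumption based on where 50-udev-default.rules lives on my system; pass whichever directory holds your distribution's rules.

```shell
#!/bin/sh
# list_rule_actions is a hypothetical helper. It prints the distinct ACTION
# matches (ACTION=="..." or ACTION!="...") found in the rules files under DIR.
list_rule_actions() {
    dir=${1:-/usr/lib/udev/rules.d}
    grep -rhoE 'ACTION(==|!=)"[^"]*"' "$dir" | sort -u
}

# Example: list_rule_actions /usr/lib/udev/rules.d
```

To watch the events live instead, udevadm monitor --kernel --subsystem-match=cpu prints the kernel events for CPU devices while you offline and online a core in another terminal.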

  • We need the parent cpuset.cpus to be copied into all the sub-cgroups only when a CPU comes back online.

$ cat /etc/udev/rules.d/45-cpu-online.rules 
SUBSYSTEM=="cpu",ACTION=="online",RUN+="/bin/sh -c 'for c in /sys/fs/cgroup/cpuset/**/cpuset.cpus; do cat /sys/fs/cgroup/cpuset/cpuset.cpus > $c; done ; for c in /sys/fs/cgroup/cpuset/**/**/cpuset.cpus; do cat /sys/fs/cgroup/cpuset/cpuset.cpus > $c; done'"

Let me expand a bit more on the rule. SUBSYSTEM attaches the rule to a specific subsystem; in our case, only the CPU. ACTION names the event the rule attaches to; in our case, only when a CPU comes online. RUN+= appends our command to the list of commands to run for the event. In our case, the two loops find every cpuset.cpus one and two directory levels below the parent (in /bin/sh, ** matches a single path component, just like *) and copy the parent file into it.

  • Test if our rule will run. udevadm lets you check whether your rule will take effect when a particular action happens, using the udevadm test command.
$ udevadm test -a online /sys/devices/system/cpu/cpu3
calling: test
version 239 (239-58.el8)
...
...
...
rules contain 49152 bytes tokens (4096 * 12 bytes), 18470 bytes strings
2595 strings (34311 bytes), 1760 de-duplicated (16677 bytes), 836 trie nodes used
RUN '/bin/sh -c 'for c in /sys/fs/cgroup/cpuset/**/cpuset.cpus; do cat /sys/fs/cgroup/cpuset/cpuset.cpus > $c; done ; for c in /sys/fs/cgroup/cpuset/**/**/cpuset.cpus; do cat /sys/fs/cgroup/cpuset/cpuset.cpus > $c; done'' /etc/udev/rules.d/45-cpu-online.rules:1
IMPORT builtin 'hwdb' /usr/lib/udev/rules.d/50-udev-default.rules:14
IMPORT builtin 'hwdb' returned non-zero
ACTION=online
DEVPATH=/devices/system/cpu/cpu3
...
...

The output shows that the rule will get executed when the online action occurs.

  • Now test by offlining and onlining a CPU. You will notice that the file gets copied over and every sub-cgroup's cpuset.cpus matches that of the parent.
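That final check can be sketched as two small helpers. Both names, cycle_cpu and cpusets_match, are made up for illustration, and the default paths assume the standard sysfs and cgroups v1 locations. On a real system this needs root, and the udev rule needs a moment to run, hence the sleep in the example.

```shell
#!/bin/sh
# cycle_cpu is a hypothetical helper: it offlines and then onlines one CPU
# by writing to its sysfs "online" file.
cycle_cpu() {
    cpu=$1
    sysroot=${2:-/sys/devices/system/cpu}
    echo 0 > "$sysroot/cpu$cpu/online" || return 1
    echo 1 > "$sysroot/cpu$cpu/online" || return 1
}

# cpusets_match is a hypothetical helper: it returns 0 when every
# descendant cpuset.cpus under ROOT has the same content as the parent.
cpusets_match() {
    root=${1:-/sys/fs/cgroup/cpuset}
    parent=$(cat "$root/cpuset.cpus") || return 1
    # The loop runs in a subshell; its exit status becomes the function's.
    find "$root" -mindepth 2 -name cpuset.cpus | while IFS= read -r f; do
        [ "$(cat "$f")" = "$parent" ] || exit 1
    done
}

# Example (as root):
# cycle_cpu 3 && sleep 1 && cpusets_match && echo "cpusets in sync"
```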