Programs that want to bind themselves to free cores present some challenges
when run within Docker.
CPU pinning allows a program to request that it be assigned exclusively to a core or set of cores. This improves cache locality and other factors that matter for extremely CPU-bound processes, especially in combination with other configuration that prevents the scheduler from assigning other processes to those cores.
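On Linux the underlying mechanism is the sched_setaffinity(2) system call (the same call taskset uses). As a minimal sketch, Python exposes it directly:

```python
import os

# Pin the calling process to a single core and confirm the new mask.
# os.sched_setaffinity wraps the Linux sched_setaffinity(2) system call,
# the same call taskset uses under the hood. (Linux-only.)
original = os.sched_getaffinity(0)   # 0 means "the calling process"
target = min(original)               # pick one core we are allowed to use

os.sched_setaffinity(0, {target})
print(os.sched_getaffinity(0))       # now a single-element set

os.sched_setaffinity(0, original)    # restore the original mask
```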
A common operation in such programs is to look for "free" CPU cores with
various heuristics. One such heuristic is a check that no other process is
bound to the core. This information is available via cpuset values in procfs.
But because /proc is not bind mounted by default in Docker, if a process pins
itself to CPU 0 in container A, a process looking at /proc in container B will
not see this reflected in its procfs (it will be reflected in A's procfs,
though). This may also be the case when one of the pinned programs is running
on the host and the other is in a container, but I haven't tried this. I hope
the isolation holds there too, because otherwise that would constitute a
sandbox escape / information leak: containerized processes should not be able
to learn that other processes are bound to certain cores. In this respect
Docker is behaving (in my opinion) correctly.
The key takeaway is that the system call to pin a process to a core is
successful but the resultant state is not reflected in the procfs of other
containers.
My first thought to solve this was to simply perform core isolation at the
container level. Docker has a flag that can be used to restrict a container to
a certain set of CPUs (--cpuset-cpus). If we can restrict container A to
cores 0,1 and container B to cores 2,3, then our heuristics will work again;
each container has 2 dedicated cores, so provided nothing on the host is bound
to those cores and procfs shows only those 2 cores, procfs becomes accurate and
we can determine which cores do not have bound processes. Unfortunately this
does not work. While --cpuset-cpus does work, in the sense that only 2 CPUs
are usable by the container, procfs still reflects all host cores. If
the host has 8 CPUs and you use --cpuset-cpus 4,5, you will still see all 8
within the container. However, if you now try to pin to core 0, the system
call will fail. See for yourself:
# docker run -it --cpuset-cpus 4,5 467c321fce69 bash
root@b4f35b17820a:/# lscpu | grep "CPU(s)"
CPU(s): 8
On-line CPU(s) list: 0-7
NUMA node0 CPU(s): 0-7
root@b4f35b17820a:/#
root@b4f35b17820a:/# for i in `seq 0 7`; do taskset -c $i echo "hi"; done
taskset: failed to set pid 28's affinity: Invalid argument
taskset: failed to set pid 29's affinity: Invalid argument
taskset: failed to set pid 30's affinity: Invalid argument
taskset: failed to set pid 31's affinity: Invalid argument
hi
hi
taskset: failed to set pid 34's affinity: Invalid argument
taskset: failed to set pid 35's affinity: Invalid argument
Unfortunately it seems the only available solutions at this point are either to
bind mount procfs - which works, but is really undesirable given the contents
of /proc - or to modify your program to recover from a failed pinning attempt,
look for the next "free" CPU, and try again. In combination with
--cpuset-cpus, it should eventually discover which cores have been assigned
to the container. This can also be done outside the program with the above
taskset loop, retrying until it succeeds. Of course, this only helps if the
program has the ability to turn its own pinning functionality off, or if you
can modify it to do so - in which case, it's cleaner to just implement the
above retry logic in the program itself.
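That in-program fallback - try each core in turn and skip the ones the container's cpuset forbids - can be sketched like this (pin_to_any_allowed_core is a hypothetical helper, not from any particular program):

```python
import os

def pin_to_any_allowed_core():
    """Try to pin this process to each core in turn.

    Inside a container restricted with --cpuset-cpus, sched_setaffinity
    fails (EINVAL, raised as OSError) for cores outside the cpuset, so we
    simply move on to the next candidate. Returns the core we ended up
    on, or None if every attempt failed.
    """
    for cpu in range(os.cpu_count()):
        try:
            os.sched_setaffinity(0, {cpu})  # 0 = the calling process
            return cpu
        except OSError:
            continue  # not in this container's cpuset; try the next one
    return None
```

A real program would layer its "is this core free?" heuristics on top of this loop rather than trusting procfs alone.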
Relevant Docker issue: moby/moby#20770