Last active December 24, 2015 17:49
Recent systemd changes for cgroups and namespaces.
systemd 208 released
* This release removes high-level support for the
MemorySoftLimit= cgroup setting. The underlying kernel
cgroup attribute memory.soft_limit= is currently badly
designed and likely to be removed from the kernel API in its
current form, hence we shouldn't expose it for now.
* The memory.use_hierarchy cgroup attribute is now enabled for
all cgroups systemd creates in the memory cgroup
hierarchy. This option is likely to become the built-in
default in the kernel anyway, and the non-hierarchical mode
never made much sense in the intrinsically hierarchical
cgroup system.
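
To check whether hierarchical accounting is actually enabled for a given memory cgroup, the attribute can simply be read back. The following is a minimal sketch, not part of systemd: the helper name uses_hierarchy() is made up for illustration, and the directory in the usage note is an assumption about where the memory controller is mounted.

```python
from pathlib import Path


def uses_hierarchy(cgroup_dir):
    """Return True if memory.use_hierarchy reads "1" in the given cgroup directory.

    cgroup_dir is assumed to be a directory in the (v1) memory cgroup hierarchy.
    """
    attr = Path(cgroup_dir) / "memory.use_hierarchy"
    return attr.read_text().strip() == "1"
```

On a typical setup this might be called as uses_hierarchy("/sys/fs/cgroup/memory/system.slice"), but the mount point and the cgroup name depend on the system.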
[systemd-devel] [ANNOUNCE] systemd 207
[systemd-devel] [ANNOUNCE] systemd 206
* Creation of "dead" device nodes has been moved from udev
into kmod and tmpfiles. Previously, udev would read the kmod
databases to pre-generate dead device nodes based on meta
information contained in kernel modules, so that these would
be auto-loaded on access rather than at boot. As this
doesn't really have much to do with exposing actual
kernel devices to userspace this has always been slightly
alien in the udev codebase. Following the new scheme kmod
will now generate a runtime snippet for tmpfiles from the
module meta information and it is now tmpfiles' job to
create the nodes. This also allows overriding access and
other parameters for the nodes using the usual tmpfiles
facilities. As a side effect this allows us to remove the
CAP_SYS_MKNOD capability bit from udevd entirely.
[systemd-devel] [ANNOUNCE] systemd 205
* Two new unit types have been introduced:
Scope units are very similar to service units, but are
created out of pre-existing processes -- instead of PID 1
forking off the processes. By using scope units it is
possible for system services and applications to group their
own child processes (worker processes) in a powerful way
which then may be used to organize them, or kill them
together, or apply resource limits on them.
Slice units may be used to partition system resources in a
hierarchical fashion and then assign other units to them. By
default there are now three slices: system.slice (for all
system services), user.slice (for all user sessions),
machine.slice (for VMs and containers).
Slices and scopes have been introduced primarily in
context of the work to move cgroup handling to a
single-writer scheme, where only PID 1
creates/removes/manages cgroups.
* A new mini-daemon "systemd-machined" has been added which
may be used by virtualization managers to register local
VMs/containers. nspawn has been updated accordingly, and
libvirt will be updated shortly. machined will collect a bit
of meta information about the VMs/containers, and assign
them their own scope unit (see above). The collected
meta-data is then made available via the "machinectl" tool,
and exposed in "ps" and similar tools. machined/machinectl
is compile-time optional.
* As discussed earlier, the low-level cgroup configuration
options ControlGroup=, ControlGroupModify=,
ControlGroupPersistent=, ControlGroupAttribute= have been
removed. Please use high-level attribute settings instead as
well as slice units.
* A new bus call SetUnitProperties() has been added to alter
various runtime parameters of a unit. This is primarily
useful to alter cgroup parameters dynamically in a nice way,
but will be extended later on to make more properties
modifiable at runtime. systemctl gained a new set-properties
command that wraps this call.
* nspawn will now inform the user explicitly that kernels with
audit enabled break containers, and suggest that the user turn
off audit.
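
To illustrate the hierarchical partitioning that slice units provide: a slice's name encodes its position in the tree as a dash-separated path, so a nested slice ends up in a nested cgroup directory. The sketch below shows that mapping under this assumption. The function name slice_to_cgroup_path() and the example slice names are hypothetical, and this is not the code systemd itself uses.

```python
def slice_to_cgroup_path(slice_name):
    """Expand a slice unit name into a nested, relative cgroup path.

    Assumes the dash-separated naming convention for slices, e.g.
    "a-b.slice" is a child of "a.slice".
    """
    suffix = ".slice"
    if not slice_name.endswith(suffix):
        raise ValueError("not a slice unit name: %r" % slice_name)
    stem = slice_name[:-len(suffix)]
    if stem == "-":
        # "-.slice" denotes the root of the slice tree.
        return ""
    parts = stem.split("-")
    if any(part == "" for part in parts):
        raise ValueError("empty path component in %r" % slice_name)
    # Each prefix of the dash-separated name is one directory level.
    return "/".join(
        "-".join(parts[:depth]) + suffix for depth in range(1, len(parts) + 1)
    )
```

For example, a hypothetical "machine-web.slice" would map to "machine.slice/machine-web.slice", while "system.slice" maps to itself.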
[systemd-devel] [ANNOUNCE] systemd 204
[systemd-devel] [ANNOUNCE] systemd 203
* systemd-nspawn will now store meta information about a
container on the container's cgroup as extended attribute
fields, including the root directory.
* The cgroup hierarchy has been reworked in many ways. All
objects that any of systemd's components create in the cgroup
tree are now suffixed. More specifically, user sessions are
now placed in cgroups suffixed with ".session", users in
cgroups suffixed with ".user", and nspawn containers in
cgroups suffixed with ".nspawn". Furthermore, all cgroup
names are now escaped in a simple scheme to avoid collision
of userspace object names with kernel filenames. This work
is preparation for making these objects relocatable in the
cgroup tree, in order to allow easy resource partitioning of
these objects without causing naming conflicts.
* gained a new call
sd_get_machine_names() to enumerate running containers and
VMs (currently only supported by very new libvirt and
nspawn). sd_login_monitor can now be used to watch
VMs/containers coming and going.
* systemd will no longer allow manipulating service paths in
the name=systemd:/system cgroup tree using ControlGroup= in
units. (But is still fine with it in all other dirs.)
* There's a new systemd-nspawn@.service service file that may
be used to easily run nspawn containers as system
services. With the container's root directory in
/var/lib/container/foobar it is now sufficient to run
"systemctl start systemd-nspawn@foobar.service" to boot it.
* systemd-cgls gained a new parameter "--machine" to list only
the processes within a certain container.
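
The cgroup name escaping mentioned above can be pictured roughly as follows. This is a simplified sketch of one possible scheme: a name is prefixed with "_" when it could be mistaken for a kernel-provided file or already starts with "_" or ".". The exact rules systemd applies may differ, and the controller list and function names here are assumptions made for illustration.

```python
# File names the kernel itself places in cgroup directories (not exhaustive).
KERNEL_NAMES = {"notify_on_release", "release_agent", "tasks"}

# Assumed set of controller prefixes; the real set depends on the kernel.
DEFAULT_CONTROLLERS = ("cpu", "cpuacct", "memory", "blkio", "devices", "freezer")


def escape_cgroup_name(name, controllers=DEFAULT_CONTROLLERS):
    """Prefix name with "_" if it might collide with a kernel attribute file."""
    if not name or name[0] in "_." or name in KERNEL_NAMES:
        return "_" + name
    # Kernel attributes look like "<controller>.<attr>" or "cgroup.<attr>".
    prefix, dot, _ = name.rpartition(".")
    if dot and (prefix == "cgroup" or prefix in controllers):
        return "_" + name
    return name


def unescape_cgroup_name(name):
    """Undo escape_cgroup_name() by dropping a single leading underscore."""
    return name[1:] if name.startswith("_") else name
```

With this scheme, reading names back only requires removing one leading underscore, which keeps the unescaping side trivial.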
[systemd-devel] [ANNOUNCE] systemd 202
* systemd-nspawn now places all containers in the new /machine
top-level cgroup directory in the name=systemd
hierarchy. libvirt will soon do the same, so that we get a
uniform separation of /system, /user and /machine for system
services, user processes and containers/virtual
machines. This new cgroup hierarchy is also useful to stick
stable names to specific container instances, which can be
recognized later this way (this name may be controlled
via systemd-nspawn's new -M switch). libsystemd-login also
gained a new call sd_pid_get_machine_name() to retrieve the
name of the container/VM a specific process belongs to.
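
To see how a machine name could be recovered from this layout, the sketch below parses the contents of /proc/PID/cgroup (one "ID:controllers:path" entry per line) and looks at the component directly below /machine in the name=systemd hierarchy. It only illustrates the directory layout and is not how sd_pid_get_machine_name() is implemented. Both helper names are hypothetical, and any escaping or suffix in the component is left untouched.

```python
def systemd_cgroup_path(proc_cgroup_text):
    """Return the path in the name=systemd hierarchy, or None if absent.

    proc_cgroup_text is the content of a /proc/<pid>/cgroup file.
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and "name=systemd" in parts[1].split(","):
            return parts[2]
    return None


def machine_name_from_path(path):
    """Return the component right below /machine, or None for other paths."""
    components = [c for c in (path or "").split("/") if c]
    if len(components) >= 2 and components[0] == "machine":
        return components[1]
    return None
```

Typical use would be to read /proc/<pid>/cgroup, pass the text to systemd_cgroup_path() and feed the result to machine_name_from_path().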
[systemd-devel] [ANNOUNCE] systemd 201
* systemd-cgtop now optionally shows summed up CPU times of
cgroups. Press '%' while running cgtop to switch between
percentage and absolute mode. This is useful to determine
which cgroups use up the most CPU time over the entire
runtime of the system. systemd-cgtop has also been updated
to be 'pipeable' for processing with further shell tools.
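
The difference between the two display modes comes down to simple arithmetic on a cumulative CPU time counter. The sketch below assumes such a counter in nanoseconds (as the cpuacct.usage attribute provides) sampled twice; the function names are made up, and this is not necessarily how cgtop computes its columns.

```python
def cpu_percent(usage_before_ns, usage_after_ns, interval_s):
    """CPU use between two samples as a percentage of one CPU.

    The usage values are cumulative CPU time in nanoseconds; interval_s is
    the wall-clock time between the two samples in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta_ns = usage_after_ns - usage_before_ns
    return 100.0 * delta_ns / (interval_s * 1e9)


def cpu_seconds(usage_ns):
    """Absolute mode: total CPU time consumed so far, in seconds."""
    return usage_ns / 1e9
```

A cgroup that consumed half a second of CPU time during a one second interval comes out at 50 percent, and values above 100 are possible on multi-CPU machines.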
[systemd-devel] [ANNOUNCE] systemd 200
[systemd-devel] [ANNOUNCE] systemd 199
[systemd-devel] [ANNOUNCE] systemd 198
* Resource limits (as exposed by the various control group
controllers) can now be controlled dynamically at runtime
for all units. More specifically, you can now use a command
like "systemctl set-cgroup-attr foobar.service cpu.shares
2000" to alter the CPU shares a specific service gets. These
settings are stored persistently on disk, and thus allow the
administrator to easily adjust the resource usage of
services with a few simple commands. This dynamic resource
management logic is also available to other programs via the
bus. Almost any kernel cgroup attribute and controller is
* nspawn will now implicitly add the CAP_AUDIT_WRITE and
CAP_AUDIT_CONTROL capabilities to the capabilities set for
the container. This makes it easier to boot unmodified
Fedora systems in a container, which however still requires
audit=0 to be passed on the kernel command line. Auditing in
kernel and userspace is unfortunately still too broken in
context of containers, hence we recommend compiling it out
of the kernel or using audit=0. Hopefully this will be fixed
one day for good in the kernel.
* nspawn gained the new --bind= and --bind-ro= parameters to
bind mount specific directories from the host into the
* nspawn will now mount its own devpts file system instance
into the container, in order not to leak pty devices from
the host into the container.
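
As a worked example for the cpu.shares setting in the first item above: CPU shares are relative weights, so what "2000" means depends on the siblings competing at the same level. The sketch below assumes all siblings are fully CPU-bound and that unconfigured groups carry the kernel default of 1024 shares; the function name is hypothetical.

```python
def cpu_share_fraction(shares, sibling_shares):
    """Fraction of CPU time a group gets under full contention.

    shares is the group's own cpu.shares value, sibling_shares the values of
    all competing groups at the same level of the hierarchy.
    """
    total = shares + sum(sibling_shares)
    if total <= 0:
        raise ValueError("total shares must be positive")
    return shares / total
```

With two busy siblings at the default of 1024, a service set to 2000 shares would get 2000 / 4048, a little under half of the CPU time. With no contention the shares have no effect.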
[systemd-devel] [ANNOUNCE] systemd 197
* nspawn may now be invoked without a controlling TTY. This
makes it suitable for invocation as its own service. This
may be used to set up a simple containerized server system
using only core OS tools.
* systemd and nspawn can now accept socket file descriptors
when they are started for socket activation. This enables
implementation of socket activated nspawn
containers. i.e. think about autospawning an entire OS image
when the first SSH or HTTP connection is received. We expect
that similar functionality will also be added to libvirt-lxc
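
On the receiving side, socket activation boils down to a small environment variable protocol: the passed descriptors start at fd 3, LISTEN_FDS gives their count, and LISTEN_PID names the process they are meant for. The following is a minimal sketch of that check, loosely modelled on what sd_listen_fds() does; the function name and the explicit environ/pid parameters are assumptions added to keep it testable.

```python
import os

# First file descriptor passed in by the service manager.
SD_LISTEN_FDS_START = 3


def listen_fds(environ=None, pid=None):
    """Return the list of file descriptors passed via socket activation.

    Returns an empty list if the variables are missing, malformed, or
    addressed to a different process.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []
    if count < 0:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))
```

Each returned descriptor can then be wrapped, for instance with socket.socket(fileno=fd), and used with accept() as usual.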
[systemd-devel] [ANNOUNCE] systemd v196
[systemd-devel] [ANNOUNCE] systemd 195
Oh, and one more thing. In Fedora I added
"cap_dac_override,cap_sys_ptrace+ep" as file capabilities to
/usr/bin/systemd-detect-virt, so that this useful tool works for
unprivileged users too. (Yeah, cap_sys_ptrace sounds crazy, but Linux
sucks, it's required to read a few things off /proc/1/). The systemd
makefile will do the same, but if you package systemd for your distro
with RPM or suchlike you probably need to declare this explicitly in
your spec file. Note that not adding these caps is not a problem, you'll
just get a clean permission error if you run it as non-privileged
user. Also nothing depends on this being run as unprivileged user that I
was aware of, so this is really just about making a useful tool more
widely available, and not really a dependency for anything.
[systemd-devel] [RELEASE] systemd 194
[systemd-devel] [ANNOUNCE] systemd 193
[systemd-devel] [ANNOUNCE] systemd 192
* We don't mount the "cpuset" controller anymore together with
"cpu" and "cpuacct", as "cpuset" groups generally cannot be
started if no parameters are assigned to it. "cpuset" hence
broke code that assumed it could create "cpu" groups and
just start them.
[systemd-devel] [ANNOUNCE] systemd 191
* nspawn will now create a symlink /etc/localtime in the
container environment, copying the host's timezone
setting. Previously this has been done via a bind mount, but
since symlinks cannot be bind mounted this has now been
changed to create/update the appropriate symlink.
[systemd-devel] [ANNOUNCE] systemd 190
* We will now mount the cgroup controllers cpu, cpuacct,
cpuset and the controllers net_cls, net_prio together by
* nspawn containers will now have a virtualized boot
ID. (i.e. /proc/sys/kernel/random/boot_id is now mounted
over with a randomized ID at container initialization). This
has the effect of making "journalctl -b" do the right thing
in a container.
* We now support virtualized reboot() in containers, as
supported by newer kernels. We will fall back to exit() if
CAP_SYS_REBOOT is not available to the container. Also,
nspawn makes use of this now and will actually reboot the
container if the containerized OS asks for that.
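
The boot ID mentioned above can be read like any other file, which makes it easy to confirm that a container sees a different value than its host. A minimal sketch follows; the function name is hypothetical, and stripping the dashes is a choice made here to get a compact 32-character form.

```python
from pathlib import Path

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the boot ID with the dashes of its UUID formatting removed."""
    return Path(path).read_text().strip().replace("-", "")
```

Running read_boot_id() once on the host and once inside an nspawn container should then yield two different values.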
[systemd-devel] [ANNOUNCE] systemd v189
* The logic for file system namespace (ReadOnlyDirectories=,
ReadWriteDirectories=, PrivateTmp=) has been reworked not to
require pivot_root() anymore. This means fewer temporary
directories are created below /tmp for this feature.
* nspawn containers will now see and receive all submounts
made on the host OS below the root file system of the
* nspawn containers will now be run with /dev/stdin, /dev/fd/
and similar symlinks pre-created. This makes running shells
as container init process a lot more fun.
[systemd-devel] [ANNOUNCE] systemd 188
* cgtop gained a new -n switch (similar to top), to configure
the maximum number of iterations to run for. It also gained
-b, to run in batch mode (accepting no input).
[systemd-devel] [ANNOUNCE] systemd 187
* nspawn gained a new --link-journal= switch (and quicker: -j)
to link the container journal with the host. This makes it
very easy to centralize log viewing on the host for all
guests while still keeping the journal files separated.
[systemd-devel] [ANNOUNCE] systemd 186
* systemd-nspawn gained a new --capability= switch to pass
additional capabilities to the container.
* The notify socket is in the abstract namespace again, in
order to support daemons which chroot() at start-up.
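
An abstract namespace socket has no path in the file system, which is why a daemon can still reach it after chroot(). From the client side, the only special handling is that a leading "@" in $NOTIFY_SOCKET stands for the NUL byte that marks an abstract address. The sketch below shows that translation and a bare-bones notification; it is not the sd_notify() implementation, and the helper names and the optional environ parameter are assumptions.

```python
import os
import socket


def notify_address(value):
    """Translate a NOTIFY_SOCKET value into an AF_UNIX address (bytes).

    A leading "@" denotes the abstract namespace and becomes a NUL byte;
    anything else is treated as a regular file system path.
    """
    raw = os.fsencode(value)
    if raw.startswith(b"@"):
        return b"\0" + raw[1:]
    return raw


def sd_notify(state, environ=None):
    """Send a state string such as "READY=1"; return False if no socket is set."""
    environ = os.environ if environ is None else environ
    value = environ.get("NOTIFY_SOCKET")
    if not value:
        return False
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), notify_address(value))
    return True
```

A daemon would call sd_notify("READY=1") once its start-up is complete; outside of a service manager the call simply returns False.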