The process identifier (a.k.a. process ID or PID) is a number used by most operating system kernels, such as those of Unix, macOS and Windows, to uniquely identify an active process.
In Unix-like operating systems, PID 1 is usually the init¹ process, spawned by the kernel and responsible for starting and shutting down the rest of the system.
A process can create child processes using the fork system call. At any given time, the running processes can therefore be represented as a tree. For example:
─┬─ PID 1 (init)
 ├─┬─ PID 2
 │ └─┬─ PID 4
 │   └─── PID 7
 ├── PID 9
 ├── PID 3
 └── PID 6
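To make the parent/child relationship concrete, here is a minimal Python sketch of a single fork. The helper name spawn_child is made up for illustration, and a Unix-like system is assumed (os.fork is not available on Windows).

```python
import os

def spawn_child():
    """Fork once; return (parent_pid, child_pid, ppid_reported_by_child)."""
    read_fd, write_fd = os.pipe()
    parent_pid = os.getpid()
    child_pid = os.fork()
    if child_pid == 0:
        # Child branch: fork() returned 0 here.
        try:
            os.close(read_fd)
            os.write(write_fd, str(os.getppid()).encode())
        finally:
            os._exit(0)  # leave at once, without running the parent's cleanup code
    # Parent branch: fork() returned the new child's PID.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported_ppid = int(pipe.read())
    os.waitpid(child_pid, 0)  # collect the child so it does not linger
    return parent_pid, child_pid, reported_ppid

if __name__ == "__main__":
    parent, child, seen = spawn_child()
    print(f"parent {parent} forked child {child}; child saw parent {seen}")
```

Each call adds one edge to the tree: the child's parent PID is the PID of the process that called fork.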
When a process completes execution (via the exit system call), it is not removed from the process table. Instead, it is marked as being in the terminated state. The entry in the process table is needed for the parent process to be able to read its exit status via the wait system call.
Once the exit status is read, the process entry is removed from the process table and is said to have been reaped.
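The life cycle described above can be observed directly. The following is a rough sketch, assuming Linux (it relies on os.waitid with WNOWAIT and on the layout of /proc/<pid>/stat) and Python 3.9 or later; observe_zombie is a hypothetical helper name.

```python
import os

def observe_zombie(exit_code=7):
    """Fork a child that exits at once.

    Returns (state_before_wait, exit_code_read, entry_still_present).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: terminate immediately

    # Block until the child has exited, but leave its process-table entry alone.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # The field right after "(comm)" in /proc/<pid>/stat is the process state.
    with open(f"/proc/{pid}/stat") as stat:
        state = stat.read().rsplit(")", 1)[1].split()[0]

    # Reading the exit status reaps the child and frees the entry.
    _, status = os.waitpid(pid, 0)
    code = os.waitstatus_to_exitcode(status)
    return state, code, os.path.exists(f"/proc/{pid}")

if __name__ == "__main__":
    print(observe_zombie())
```

Between the child's exit and the parent's waitpid call, the kernel reports the child in state Z (zombie); after the call, its entry is gone.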
Whenever a process completes execution, the system sends a signal (SIGCHLD) to its parent. Because of bugs in their implementation, or because this was not a concern for their developers, many applications either ignore or mishandle this signal and do not properly reap terminated child processes. Those unreaped children become zombie processes.
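As the ancestor of all active processes, PID 1 has a special responsibility: when a parent exits without reaping its children, those orphans are re-parented to PID 1, which is expected to watch out for zombie processes and reap them.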
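As an illustration of what "properly reaping" can look like, here is a small Python sketch of a SIGCHLD handler that collects every terminated child. It is not how any particular init system is implemented; reap_children, demo and the reaped dictionary are names invented for this example, and Python 3.9+ on a Unix-like system is assumed.

```python
import os
import signal
import time

reaped = {}  # pid -> exit code, filled in by the handler

def reap_children(signum=None, frame=None):
    """Collect every child that has already exited, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped[pid] = os.waitstatus_to_exitcode(status)

def demo(n=3):
    """Fork n short-lived children and let the handler reap them."""
    reaped.clear()
    previous = signal.signal(signal.SIGCHLD, reap_children)
    try:
        pids = []
        for code in range(n):
            pid = os.fork()
            if pid == 0:
                os._exit(code)  # child: exit with a distinct code
            pids.append(pid)
        deadline = time.monotonic() + 5
        while len(reaped) < n and time.monotonic() < deadline:
            time.sleep(0.01)  # handlers run between sleeps
        return pids
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
```

The handler loops because several children may exit before it gets to run, and pending SIGCHLD signals are not queued one per child.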
Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources.
Unlike virtual machines, containers are managed by the same kernel as the host operating system they are launched from, but they run in their own namespaces. The processes inside a container are isolated from the processes outside of the container.
The first process launched by a container typically becomes PID 1. For Docker, that's either the entry point or, if no entry point is set, the command. If the process spawns child processes but doesn't properly handle signals, this can lead to zombie processes exhausting the process table of the namespace, and ultimately to a crash of the container.
To avoid this, developers from the Docker community created several replacements for the init process. Those are usually statically linked so they can be quickly installed in any image without having to pull any dependency, and focus on signal handling and zombie process reaping. Two popular choices are dumb-init and tini.
With the binary pulled and made available in the image, it's very easy to set either one as PID 1:
# tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/your/program", "-and", "-its", "arguments"]
# dumb-init
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/your/program", "-and", "-its", "arguments"]
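To give an idea of what such a minimal init does once it runs as PID 1, here is a rough Python sketch. It is not the actual implementation of tini or dumb-init, only an illustration of the two jobs described above: forwarding signals to the real program and reaping every child. The function name run_as_init and the list of forwarded signals are assumptions made for this example (Python 3.9+, Unix-like system).

```python
import os
import signal
import sys

FORWARDED = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)  # assumed subset

def run_as_init(argv):
    """Run argv as the main child, forward signals to it, reap everything.

    Returns the main child's exit code (negative if a signal killed it,
    127 if the program could not be executed).
    """
    child_pid = os.fork()
    if child_pid == 0:
        try:
            os.execvp(argv[0], argv)  # only returns (by raising) on failure
        finally:
            os._exit(127)

    def forward(signum, frame):
        try:
            os.kill(child_pid, signum)  # pass the signal on to the real program
        except ProcessLookupError:
            pass  # the child is already gone

    # Note: a signal arriving before this line is not forwarded (sketch only).
    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        while True:
            # Blocks until any child exits, including orphans adopted by PID 1.
            pid, status = os.waitpid(-1, 0)
            if pid == child_pid:
                return os.waitstatus_to_exitcode(status)
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_as_init(sys.argv[1:]))
```

Saved under a hypothetical name such as mini_init.py, it would be invoked as `python mini_init.py /your/program -and -its arguments`. The real tools do more than this, which is one reason to use them rather than a homemade wrapper.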
Should I really care?
Probably. You can stop worrying about signal handling and zombie process reaping only if both of these conditions are met:
- Your container launches only one process.
- That process never creates any child process.
When using docker run, you can now use the --init flag to indicate that an init process should be used as PID 1 in the container. The default init process used is the first docker-init executable found in the system path of the Docker daemon process. This docker-init binary, included in the default installation, is backed by tini.
The capability was also added in version 3.7 of the docker-compose.yml format, which now supports passing init: true to any service definition.
This means that if you're primarily using docker run or docker compose, you can remove dumb-init or tini from your Dockerfile and live the carefree life! Or… can you, really?
Enter Kubernetes. Unfortunately, there is no support yet for automatically inserting an init process in the containers of a pod. If, like me, you think it would be cool to stop having to manually add tini or dumb-init in every Dockerfile just because of k8s, please add a 👍🏼 reaction on issue #84210!
Meanwhile, if you don't care about container isolation (you probably should, though), enabling shareProcessNamespace for your pod makes all of its containers share a single process namespace. As a weird benefit, this promotes Kubernetes's famous pause container to PID 1, enabling proper zombie reaping and signal rewiring for free.
Footnotes
1. The process's actual name may vary from system to system, e.g. init, systemd, launchd…
This is an eloquent and accurate summary of the issue at hand.
One thing to note is that you don't always know what will spawn or fork a child process:
Why should you care?
Because unless you do, you'll spend an inordinate amount of time figuring out why some containers in the fleet die unpredictably.