jcayzac/On PID 1 and Containers.md

## On PID 1 and Containers.md

      
The process identifier (a.k.a. process ID or PID) is a number used by most operating system kernels, such as those of Unix, macOS and Windows, to uniquely identify an active process.
In Unix-like operating systems, PID 1 is usually the init¹ process spawned by the kernel and responsible for starting and shutting down the rest of the system.
A process can create child processes using the fork system call. This means that, at any given time, the running processes can be represented as a tree. For example:
─┬─ PID 1 (init)
 ├─┬─ PID 2
 │ └─┬─ PID 4
 │   └─── PID 7
 ├── PID 9
 ├── PID 3
 └── PID 6
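
As a minimal sketch of this parent/child relationship, the Python snippet below forks once and has the child report its parent's PID back through a pipe. The function name `spawn_child` is made up for illustration, and the code assumes a Unix-like system where `os.fork` is available.

```python
import os


def spawn_child():
    """Fork once; return (child_pid, parent_pid_as_seen_by_the_child)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child branch: tell the parent who we think our parent is,
        # then exit immediately without running the parent's cleanup code.
        os.close(read_fd)
        os.write(write_fd, str(os.getppid()).encode())
        os._exit(0)

    # Parent branch: read the child's message until end-of-file.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reported_parent = int(reader.read())
    os.waitpid(pid, 0)  # collect the child's exit status
    return pid, reported_parent


if __name__ == "__main__":
    child, seen = spawn_child()
    print(f"process {os.getpid()} forked child {child}; child saw parent {seen}")
```

Each call adds one edge to the tree: the calling process is the parent node and the new PID is its child.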

When a process completes execution (via the exit system call), it is not immediately removed from the process table. Instead, it is marked as being in the terminated state. The entry in the process table is kept so that the parent process can read the exit status via the wait system call.
Once the exit status is read, the process entry is removed from the process table and is said to have been reaped.
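The lifecycle above can be observed directly. The following sketch is Linux-only, because it reads the process state from `/proc/<pid>/stat`; the helper names `proc_state` and `zombie_demo` are hypothetical. It waits for a child to terminate without reaping it (using the `WNOWAIT` option), looks at its state, and only then reaps it. On Linux, a terminated but unreaped process is reported with the state letter `Z` (zombie).

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the entry is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Layout is "pid (comm) state ..."; comm may itself contain spaces or
    # parentheses, so split on the last closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo(exit_code=7):
    """Return (state before reaping, exit code, state after reaping)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child terminates right away

    # Block until the child has terminated, but leave it in the process table.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = proc_state(pid)

    # Reading the exit status with waitpid is what reaps the child.
    _, status = os.waitpid(pid, 0)
    after = proc_state(pid)
    return before, os.WEXITSTATUS(status), after


if __name__ == "__main__":
    print(zombie_demo())
```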
Whenever a process completes execution, the kernel sends the SIGCHLD signal to its parent. Because of bugs in their implementation, or because this was not a concern for their developers, many applications either leave this signal unhandled or mishandle it, and never reap their terminated child processes. Those linger in the process table as zombie processes.
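For illustration, here is one way a well-behaved parent might react to that signal, which on Unix-like systems is SIGCHLD. This is only a sketch: the names `reap_children`, `spawn` and `demo` are made up, and it assumes the code runs in the main thread of a Python process on a Unix-like system. The handler loops because several child exits can be collapsed into a single signal delivery.

```python
import os
import signal
import time

reaped = {}  # pid -> raw wait status, filled in by the handler


def reap_children(signum, frame):
    """SIGCHLD handler: collect every child that has already exited."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children remain, but none has exited yet
        reaped[pid] = status


def spawn(exit_code):
    """Fork a child that exits immediately with the given code."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid


def demo(codes=(0, 1, 2)):
    """Spawn one child per code and return {pid: exit_code} once all are reaped."""
    signal.signal(signal.SIGCHLD, reap_children)
    pids = [spawn(code) for code in codes]
    deadline = time.monotonic() + 5
    while not all(pid in reaped for pid in pids) and time.monotonic() < deadline:
        time.sleep(0.01)
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)  # restore the default disposition
    return {pid: os.WEXITSTATUS(reaped[pid]) for pid in pids}


if __name__ == "__main__":
    print(demo())
```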
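When a parent exits before its children, the orphaned children are normally adopted by PID 1. As the ancestor of all active processes, PID 1 therefore has the special responsibility to watch out for zombie processes and reap them.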
Enter containers

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources.
Unlike virtual machines, containers share the kernel of the host operating system they are launched from, but run in their own namespaces. The processes inside a container are isolated from the processes outside of the container.
The first process launched by a container typically becomes PID 1. For Docker, that's either the entry point or, if no entry point is set, the command. If the process spawns child processes but doesn't properly handle signals, this can lead to zombie processes exhausting the process table of the namespace, and ultimately to a crash of the container.
To avoid this, developers from the Docker community created several replacements for the init process. Those are usually statically linked so they can be quickly installed in any image without having to pull any dependency, and focus on signal handling and zombie process reaping. Two popular choices are dumb-init and tini.
With the binary pulled and made available in the image, it's very easy to set either one as PID 1:
# tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/your/program", "-and", "-its", "arguments"]

# dumb-init
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/your/program", "-and", "-its", "arguments"]
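
To make concrete what such an init process does, here is a deliberately small sketch in Python. It is not how tini or dumb-init are implemented; it only shows the two duties mentioned above, signal forwarding and zombie reaping. The function name `run_as_init` and the choice of forwarded signals are assumptions made for this example.

```python
import os
import signal
import sys


def run_as_init(argv):
    """Run argv as a child, forward signals to it, reap children, return its exit code."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replace the child with the real program
        except OSError:
            os._exit(127)  # conventional "command not found" status

    def forward(signum, frame):
        # Relay the signal to the main child; it may already be gone.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass

    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, forward)

    while True:
        # Blocks until any child exits. When running as PID 1, this also
        # collects orphans that the kernel re-parented to this process.
        pid, status = os.wait()
        if pid != child:
            continue  # an adopted orphan: reaped, nothing else to do
        if os.WIFSIGNALED(status):
            return 128 + os.WTERMSIG(status)  # shell convention for signal deaths
        return os.WEXITSTATUS(status)


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_as_init(sys.argv[1:]))
```

Saved under a hypothetical name such as `mini_init.py`, it could be started as `python mini_init.py /your/program -and -its arguments`. Real init replacements handle more edge cases, so this is meant for understanding rather than for production images.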


Should I really care?
Probably. You can stop worrying about signal handling and zombie process reaping only if both of these conditions are met:

Your container launches only one process.
That process never creates any child process.


It got better!

When using docker run, you can now use the --init flag to indicate that an init process should be used as the PID 1 in the container. The default init process used is the first docker-init executable found in the system path of the Docker daemon process. This docker-init binary, included in the default installation, is backed by tini.
The capability was also added in version 3.7 of the docker-compose.yml format, which now supports passing init: true to any service definition.
This means that if you're primarily using docker run or docker compose, you can remove dumb-init or tini from your Dockerfile and live the carefree life! Or… can you, really?
…but not that better yet 😥

Enter Kubernetes. Unfortunately, there is no support yet for automatically inserting an init process in the containers of a pod. If, like me, you think it would be cool to stop having to manually add tini or dumb-init in every Dockerfile just because of k8s, please add a 👍🏼 reaction on issue #84210!
Meanwhile, if you don't care about container isolation (you probably should, though), enabling shareProcessNamespace for your pod makes all its containers share a single process namespace. As a weird benefit, this promotes Kubernetes's famous pause container to PID 1, enabling proper zombie reaping and signal rewiring for free.
Footnotes


¹ The actual name of the process may vary from system to system, e.g. init, systemd, launchd…