@jcayzac
Last active September 8, 2021 09:18
Explainer for init in docker / k8s

The process identifier (a.k.a. process ID or PID) is a number used by most operating system kernels, such as those of Unix, macOS and Windows, to uniquely identify an active process.

In Unix-like operating systems, PID 1 is usually the init[1] process spawned by the kernel and responsible for starting and shutting down the rest of the system.

A process can create child processes using the fork system call. As a result, the processes running at any given time form a tree. For example:

─┬─ PID 1 (init)
 ├─┬─ PID 2
 │ └─┬─ PID 4
 │   └─── PID 7
 ├── PID 9
 ├── PID 3
 └── PID 6

When a process completes execution (via the exit system call), it is not removed from the process table. Instead, it is marked as being in the terminated state. The entry in the process table is needed for the parent process to be able to read its exit status via the wait system call.

Once the exit status is read, the process entry is removed from the process table and is said to have been reaped.
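The exit/wait handshake described above can be sketched in a few lines of Python. This is only an illustration for a Unix-like system: the function name run_child_and_reap is hypothetical, and the child does nothing except exit with the requested status.

```python
import os


def run_child_and_reap(exit_code: int) -> int:
    """Fork a child that exits at once, then reap it and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately, skipping Python's cleanup handlers.
        os._exit(exit_code)
    # Parent: from the moment the child exits until waitpid() returns,
    # the child's entry stays in the process table (it is a zombie).
    _, status = os.waitpid(pid, 0)
    # waitpid() read the exit status, so the entry is now removed (reaped).
    return os.WEXITSTATUS(status)
```

Calling run_child_and_reap(7) returns 7. A second waitpid on the same PID would fail with ChildProcessError, because the entry no longer exists.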

Whenever a process completes execution, the kernel sends a SIGCHLD signal to its parent. Because of bugs in their implementation, or because this was not a concern for their developers, many applications either ignore or mishandle this signal and never reap their terminated child processes. Those children linger as zombie processes.
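On Unix-like systems the signal in question is SIGCHLD. A minimal sketch of a parent that handles it properly might look like the following, in Python. The names reaped, reap_children and spawn_and_collect are hypothetical, and the children do nothing but exit. The handler loops because several exits can be coalesced into a single SIGCHLD delivery.

```python
import os
import signal
import time

reaped = {}  # pid -> exit code, filled in by the SIGCHLD handler


def reap_children(signum, frame):
    """SIGCHLD handler: reap every child that has terminated so far."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children remain, but none has terminated yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)


def spawn_and_collect(count, timeout=5.0):
    """Fork `count` children exiting with codes 0..count-1 and wait until all are reaped."""
    reaped.clear()
    previous = signal.signal(signal.SIGCHLD, reap_children)
    try:
        pids = []
        for code in range(count):
            pid = os.fork()
            if pid == 0:
                os._exit(code)  # child: exit immediately
            pids.append(pid)
        deadline = time.monotonic() + timeout
        while not all(p in reaped for p in pids) and time.monotonic() < deadline:
            time.sleep(0.01)
        return {p: reaped.get(p) for p in pids}
    finally:
        # Restore whatever handler was installed before.
        if previous is not None:
            signal.signal(signal.SIGCHLD, previous)
```

An application that skips the handler, or calls wait only once per signal, can leave some of these children behind as zombies.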

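When a parent exits before its children, the orphaned children are typically re-parented to PID 1. As the ancestor of all active processes, PID 1 therefore has the special responsibility to watch out for zombie processes and reap them.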

Enter containers

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources.

Unlike virtual machines, containers share the kernel of the host operating system they are launched from, but run in their own namespaces. The processes inside a container are isolated from the processes outside of the container.

The first process launched by a container typically becomes PID 1. For Docker, that's either the entry point or, if no entry point is set, the command. If the process spawns child processes but doesn't properly handle signals, this can lead to zombie processes exhausting the process table of the namespace, and ultimately to a crash of the container.

To avoid this, developers from the Docker community created several replacements for the init process. Those are usually statically linked so they can be quickly installed in any image without having to pull any dependency, and focus on signal handling and zombie process reaping. Two popular choices are dumb-init and tini.
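To give a rough idea of what such an init replacement does, here is a deliberately simplified Python sketch. It is not how tini or dumb-init are implemented; the name mini_init, the set of forwarded signals and the 127 / 128+N exit code conventions are assumptions borrowed from common shell practice.

```python
import os
import signal

# Signals passed on to the main child (an illustrative subset).
FORWARDED = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)


def mini_init(argv):
    """Run argv as a child, forward signals to it, reap every child, return its exit code."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replace the child with the real program
        except OSError:
            pass
        os._exit(127)  # exec failed

    def forward(signum, frame):
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass  # the child is already gone

    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        while True:
            # Blocks until any child terminates. When running as PID 1 this
            # also reaps orphans that were re-parented to us.
            pid, status = os.wait()
            if pid == child:
                if os.WIFEXITED(status):
                    return os.WEXITSTATUS(status)
                return 128 + os.WTERMSIG(status)
    finally:
        # Restore the handlers that were in place before.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

The real tools handle many more cases (process groups, signals arriving before the handlers are installed, and so on), which is one reason to use them rather than rolling your own.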

With the binary pulled and made available in the image, it's very easy to set either one as PID 1:

# tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/your/program", "-and", "-its", "arguments"]

# dumb-init
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/your/program", "-and", "-its", "arguments"]

Should I really care?

Probably. You can stop worrying about signal handling and zombie process reaping only if all of these conditions are met:

  • Your container launches only one process.
  • That process never creates any child process.

It got better!

When using docker run, you can now use the --init flag to indicate that an init process should be used as the PID 1 in the container. The default init process used is the first docker-init executable found in the system path of the Docker daemon process. This docker-init binary, included in the default installation, is backed by tini.

The capability was also added in version 3.7 of the docker-compose.yml format, which now supports passing init: true to any service definition.

This means that if you're primarily using docker run or docker compose, you can remove dumb-init or tini from your Dockerfile and live the carefree life! Or… can you, really?

…but not that much better yet 😥

Enter Kubernetes. Unfortunately, there is no support yet for automatically inserting an init process in the containers of a pod. If, like me, you think it would be cool to stop having to manually add tini or dumb-init in every Dockerfile just because of k8s, please add a 👍🏼 reaction on issue #84210!

Meanwhile, if you don't care about container isolation (you probably should, though), enabling shareProcessNamespace for your pod makes all of its containers share a single PID namespace. As a weird benefit, this promotes Kubernetes's famous pause container to PID 1, enabling proper zombie reaping and signal rewiring for free.

Footnotes

  1. The process's actual name varies from system to system, e.g. init, systemd, launchd.

arnaudlacour commented Jun 18, 2021

This is an eloquent and accurate summary of the issue at hand.
One thing to note is that you don't always know what will spawn or fork a child process:

  • this is especially true when using frameworks: some were written without containers in mind and create sub-processes, for example to read the /proc filesystem or even to run system commands (seen in monitoring tools, for example)
  • children created through fork, vfork or posix_spawn would not get reaped, and the leak could eventually lead to the container terminating in the most obscure of ways
  • even if you control your product end-to-end, you may later realize in production that your customers start processes of their own. Granted, they'd be breaking the proper pattern of operation in the first place, but that doesn't mean you shouldn't guard against that improper but common practice.

Why should you care?
Because unless you do, you'll be left spending inordinate amounts of time working out why some containers in the fleet die unpredictably.

jcayzac commented Sep 8, 2021

Also, there is the issue of mismapped signals. For those, you still need dumb-init's signal rewiring capabilities, which docker-init doesn't provide.
