raunakkathuria/docker_internals.md

## docker_internals.md

      
Docker internals

Underlying technologies

To understand Docker completely, you first need to understand the underlying technologies that make it possible. This post mainly covers:

Namespace
cgroups
Union File System
libcontainer

Namespace

Definition from Wikipedia

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources. Resources may exist in multiple spaces. Examples of such resources are process IDs, hostnames, user IDs, file names, and some names associated with network access, and interprocess communication.

Namespaces are a fundamental aspect of containers on Linux. They isolate global resources between processes, which makes them a way to limit what a process can see. This isolation is essential for containers to work.
Example

Let's see how it works with an example.
unshare - Run a program with some namespaces unshared from the parent.
$ unshare -h
Usage:
 unshare [options] [<program> [<argument>...]]

Run a program with some namespaces unshared from the parent.

Options:
 -m, --mount[=<file>]      unshare mounts namespace
 -u, --uts[=<file>]        unshare UTS namespace (hostname etc)
 -i, --ipc[=<file>]        unshare System V IPC namespace
 -n, --net[=<file>]        unshare network namespace
 -p, --pid[=<file>]        unshare pid namespace
 ...
Start a shell in a new UTS (Unix Time Sharing) namespace:
root@db9326789cbc:/$ unshare -u /bin/sh # -u selects the UTS namespace; see unshare -h
$ hostname child # set hostname on new UTS namespace
$ hostname
child
$ exit
$ hostname # it does not change anything on parent host
parent
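If you do not have root, the same experiment can usually be repeated with a user namespace. The sketch below assumes a util-linux unshare that supports --user and --map-root-user and a kernel that allows unprivileged user namespaces. Neither is guaranteed, so it falls back to a message. The name child is just the example hostname used above.

```shell
#!/bin/sh
# Record the hostname seen by the parent shell.
before=$(uname -n)

# --user --map-root-user gives the new process root inside its own user
# namespace, which is enough to rename the host in the new UTS namespace.
# Assumption: unprivileged user namespaces are enabled on this kernel.
if ! unshare --user --map-root-user --uts \
     sh -c 'hostname child && echo "inside:  $(uname -n)"' 2>/dev/null; then
  echo "could not create the namespaces (missing unshare or not permitted)"
fi

# The parent's hostname is untouched either way.
after=$(uname -n)
echo "outside: $after"
```

Whether or not the namespaces could be created, the hostname printed on the last line is the one the parent shell started with.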
Let's check the process tree. The shell running in the new namespace gets its own PID (1385), and its parent PID is the PID of the main shell (1 in this case):
root@db9326789cbc:/$ unshare -u /bin/sh # create a new shell in new UTS namespace
$ ps -ef --forest # inside new UTS
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 05:09 pts/0    00:00:00 /bin/bash
root      1385     1  0 06:14 pts/0    00:00:00 /bin/sh # new shell
root      1388  1385  0 06:14 pts/0    00:00:00  \_ ps -ef --forest
$ exit
You can see which namespaces a process belongs to by checking /proc/[pid]/ns:
root@db9326789cbc:/home/tutorial$ ls -l /proc/self/ns/uts
lrwxrwxrwx 1 root root 0 Aug 20 06:37 /proc/self/ns/uts -> 'uts:[4026533163]' # parent UTS

root@db9326789cbc:/home/tutorial$ unshare -u /bin/sh
$ ls -l /proc/self/ns/uts
lrwxrwxrwx 1 root root 0 Aug 20 06:50 /proc/self/ns/uts -> 'uts:[4026533147]' # child UTS, separate from parent
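The comparison above can be scripted. This is a minimal sketch that assumes a Linux /proc filesystem. The helper names ns_id and same_ns are made up for this example and are not part of any tool.

```shell
#!/bin/sh
# ns_id TYPE [PID]: print the namespace identifier, e.g. uts:[4026531838].
# Without a PID it reports the namespaces of the calling shell's child,
# which are the same as the shell's own.
ns_id() {
  readlink "/proc/${2:-self}/ns/$1"
}

# same_ns TYPE PID_A PID_B: succeed if both processes share that namespace.
# Returns 2 if either link cannot be read (no such PID, no permission).
same_ns() {
  a=$(ns_id "$1" "$2") || return 2
  b=$(ns_id "$1" "$3") || return 2
  [ "$a" = "$b" ]
}

# List a few namespaces of the current shell.
for type in uts pid mnt net ipc; do
  printf '%-4s %s\n' "$type" "$(ns_id "$type" "$$")"
done

if same_ns uts "$$" "$PPID"; then
  echo "this shell and its parent share a UTS namespace"
fi
```

Two processes are in the same namespace exactly when the links resolve to the same identifier, which is why a plain string comparison is enough here.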
cgroups

Definition from Wikipedia

cgroups (abbreviated from control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

The primary design goal of cgroups is to provide a unified interface to many different use cases, from controlling single processes to full operating system-level virtualization (as provided by OpenVZ, Linux-VServer or LXC, for example). Cgroups provides:

Resource limiting: groups can be set to not exceed a configured memory limit, which also includes the file system cache
Prioritization: some groups may get a larger share of CPU utilization or disk I/O throughput
Accounting: measures a group's resource usage, which may be used, for example, for billing purposes
Control: freezing groups of processes, their check-pointing and restarting

Example

Let's see how it works with an example.
Install the necessary packages
On Ubuntu or Debian, type:
apt-get install libcgroup1 cgroup-tools
Creating cgroups and moving processes

A cgroup filesystem initially contains a single root cgroup, '/', which all processes belong to. A new cgroup is created by creating a directory in the cgroup filesystem:
$ mkdir /sys/fs/cgroup/memory/mg1
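Limit the memory for anything running under the cgroup mg1 to 20 MB (20,000,000 bytes). The memory.limit_in_bytes file used here belongs to the cgroup v1 memory controller: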
root@dd3d48548fdb:/home/tutorial$ echo 20000000 | tee /sys/fs/cgroup/memory/mg1/memory.limit_in_bytes
A process may be moved to this cgroup by writing its PID into the cgroup's cgroup.procs file:
echo [PID] > /sys/fs/cgroup/memory/mg1/cgroup.procs
You can verify the cgroup of PID by:
$ ps -o cgroup [PID]
Note: if a task exceeds its defined limits, the kernel will intervene and, in some cases, kill that task.
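The paths used above exist only on hosts that mount the cgroup v1 memory hierarchy. On a cgroup v2 (unified) host the limit lives in a file called memory.max instead. The sketch below guesses which layout is mounted. The function names are invented for this example, and the detection rule (a cgroup.controllers file at the root means v2) is a simplification that ignores hybrid setups.

```shell
#!/bin/sh
# cgroup_version [ROOT]: print v2, v1 or unknown for the hierarchy at ROOT.
cgroup_version() {
  root=${1:-/sys/fs/cgroup}
  if [ -f "$root/cgroup.controllers" ]; then
    echo v2          # unified hierarchy exposes this file at its root
  elif [ -d "$root/memory" ]; then
    echo v1          # per-controller directories, as used in this post
  else
    echo unknown
  fi
}

# memory_limit_file GROUP [ROOT]: print the file that holds the memory limit
# of GROUP. Fails if no known layout is found.
memory_limit_file() {
  root=${2:-/sys/fs/cgroup}
  case $(cgroup_version "$root") in
    v2) echo "$root/$1/memory.max" ;;
    v1) echo "$root/memory/$1/memory.limit_in_bytes" ;;
    *)  return 1 ;;
  esac
}

echo "cgroup version: $(cgroup_version)"
memory_limit_file mg1 || echo "no cgroup filesystem found under /sys/fs/cgroup"
```

On v2 the memory controller also has to be enabled for child groups before memory.max appears, so treat the printed path as a starting point rather than a guarantee.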
You can also use the utilities provided by the libcgroup package to simplify the above steps.
$ sudo cgcreate -g memory:mg1 # create memory cgroup
$ echo 50000000 | sudo tee /sys/fs/cgroup/memory/mg1/memory.limit_in_bytes # assign memory size
$ sudo cgexec -g memory:mg1 ~/test.sh # run the script under mg1 cgroup
$ ps -o cgroup [PID] # verify
$ sudo cgdelete memory:mg1 # clean up and remove the cgroup
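The membership that ps -o cgroup prints is also available in /proc/[PID]/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The sketch below pulls out the path for one controller. The function name cgroup_path is made up for this example, and passing an empty controller name to select the cgroup v2 entry (whose controller list is empty) is a convention of this sketch only.

```shell
#!/bin/sh
# cgroup_path CONTROLLER [FILE]: print the cgroup path for CONTROLLER as
# listed in FILE (default: /proc/self/cgroup). Prints nothing if not found.
cgroup_path() {
  controller=$1
  file=${2:-/proc/self/cgroup}
  awk -F: -v c="$controller" '
    {
      path = $0
      sub(/^[^:]*:[^:]*:/, "", path)   # keep everything after the 2nd colon
      if (c == "" && $2 == "") { print path; exit }
      n = split($2, ctrls, ",")        # v1 lines may list several controllers
      for (i = 1; i <= n; i++)
        if (ctrls[i] == c) { print path; exit }
    }
  ' "$file"
}

echo "memory cgroup of this shell: $(cgroup_path memory 2>/dev/null)"
```

For a process started with cgexec -g memory:mg1, the memory line would be expected to end in /mg1. To check another process, pass /proc/[PID]/cgroup as the second argument.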
Union File System

Union file systems, or UnionFS, are file systems that operate by creating layers, making them very lightweight and fast. Docker Engine uses UnionFS to provide the building blocks for containers.
A Docker image is actually a stack of read-only layers that a union file system merges into a single view!


Image source: https://docs.docker.com/storage/storagedriver/
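The layering idea can be imitated with plain directories. The sketch below is only a simulation: real storage drivers such as overlay2 do the merge inside the kernel and also handle deletions, which this ignores. The function name resolve_in_layers and the layer names base, app and container are hypothetical.

```shell
#!/bin/sh
# resolve_in_layers FILE LAYER...: print the path of FILE in the first layer
# that contains it. Layers are given top to bottom, so an upper layer
# shadows the same file in a lower one. Fails if no layer has the file.
resolve_in_layers() {
  file=$1
  shift
  for layer in "$@"; do
    if [ -e "$layer/$file" ]; then
      printf '%s\n' "$layer/$file"
      return 0
    fi
  done
  return 1
}

# Build three throwaway layers: a base image layer, an application layer
# on top of it, and an empty writable container layer on top of both.
demo=$(mktemp -d)
mkdir -p "$demo/base" "$demo/app" "$demo/container"
echo "from base" > "$demo/base/os-release"
echo "from base" > "$demo/base/config"
echo "from app"  > "$demo/app/config"     # shadows base/config

cat "$(resolve_in_layers config "$demo/container" "$demo/app" "$demo/base")"
cat "$(resolve_in_layers os-release "$demo/container" "$demo/app" "$demo/base")"
rm -rf "$demo"
```

The first cat prints "from app" because the app layer sits above the base layer. The second prints "from base" because only the base layer has that file. Writing a new config into the container directory would shadow both without touching the layers below, which is how many containers can share one image.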

References
Official docker docs
https://docs.docker.com/v17.09/engine/userguide/storagedriver/imagesandcontainers/
Others (basic overview though)
https://www.terriblecode.com/blog/how-docker-images-work-union-file-systems-for-dummies/
https://medium.com/@paccattam/drooling-over-docker-2-understanding-union-file-systems-2e9bf204177c
libcontainer and lxc

Docker Engine combines namespaces, control groups, and UnionFS into a wrapper called a container format. The default container format is libcontainer.
Docker 0.9 introduced libcontainer. Before that, Docker relied on LXC to run containers.


Image source: https://www.docker.com/blog/docker-0-9-introducing-execution-drivers-and-libcontainer/

Because of libcontainer, Docker can now manipulate namespaces, control groups, capabilities, AppArmor profiles, network interfaces and firewalling rules out of the box, all in a consistent and predictable way, and without depending on LXC or any other userland package. This drastically reduces the number of moving parts and insulates Docker from the side effects introduced across versions and distributions of LXC.
You can read more about this here: https://www.docker.com/blog/docker-0-9-introducing-execution-drivers-and-libcontainer/
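To make the idea of combining the three pieces concrete, here is a rough dry-run sketch built from the command-line tools used earlier in this post. It is not how libcontainer works internally (libcontainer is a Go library that talks to the kernel directly). The directory layout under the rootfs argument, the cgroup name mg1, and the helper names run and mini_container are all assumptions made for illustration. By default the script only prints the commands. Setting DRY_RUN=0 would execute them and requires root, the cgroup tools, and a cgroup v1 host.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# mini_container ROOTFS CMD...: hypothetical helper that strings together
# a cgroup, a layered root filesystem and fresh namespaces.
mini_container() {
  rootfs=$1
  shift
  # 1. cgroups: create a memory cgroup and cap it (cgroup v1 path).
  run cgcreate -g memory:mg1
  run sh -c 'echo 20000000 > /sys/fs/cgroup/memory/mg1/memory.limit_in_bytes'
  # 2. Union file system: merge a read-only lower layer with a writable one.
  run mount -t overlay overlay \
      -o "lowerdir=$rootfs/lower,upperdir=$rootfs/upper,workdir=$rootfs/work" \
      "$rootfs/merged"
  # 3. Namespaces: start the command in new namespaces, inside the cgroup,
  #    with the merged directory as its root.
  run cgexec -g memory:mg1 \
      unshare --uts --pid --mount --ipc --net --fork \
      chroot "$rootfs/merged" "$@"
}

mini_container /tmp/demo /bin/sh
```

A real runtime also drops capabilities, applies AppArmor profiles, and sets up networking, as the paragraph above lists. This sketch covers only the three building blocks discussed in this post.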
References
LXC - https://www.linuxjournal.com/content/everything-you-need-know-about-linux-containers-part-ii-working-linux-containers-lxc
LXC vs Docker - https://www.upguard.com/articles/docker-vs-lxc