The dotCloud container engine (the ancestor of Docker) started as a Python CLI tool
that acted as a frontend to LXC and AUFS. It was called dc. This is a fake session
using dc. Keep in mind that I haven't used dc in 3 years and I don't have the
code checked out locally right now, so this is only approximate.
# pull an image
# (images were stored in Mercurial repos, using metashelf extension for special files and permissions)
dc image_pull ubuntu@f00db33f
# create a container
# (images also have templates that are evaluated when the container is created)
# (e.g. to put the IP address in /etc/network/interfaces, stuff like that)
dc container_create ubuntu@f00db33f jerome.awesome.ubuntu
# start the container
dc container_start jerome.awesome.ubuntu
# enter the container, like with "docker exec"
dc container_enter jerome.awesome.ubuntu
# there are a bunch of commands to manage port mappings
# the following one will allocate a random port
dc container_connection_add jerome.awesome.ubuntu tcp 80
# check which port was allocated
dc network_ls
So far, so good.
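The templating step mentioned in the session comments can be pictured with a minimal sketch. It assumes `$`-style placeholders via Python's `string.Template`; the actual template format used by dc is not described here, and the template text, `render_interfaces`, and the addresses are all hypothetical.

```python
from string import Template

# Hypothetical template, shaped like a Debian /etc/network/interfaces stanza.
INTERFACES_TEMPLATE = Template(
    "auto eth0\n"
    "iface eth0 inet static\n"
    "    address $address\n"
    "    netmask $netmask\n"
    "    gateway $gateway\n"
)


def render_interfaces(address, netmask, gateway):
    """Fill in the per-container values when the container is created."""
    return INTERFACES_TEMPLATE.substitute(
        address=address, netmask=netmask, gateway=gateway
    )


if __name__ == "__main__":
    print(render_interfaces("10.0.3.12", "255.255.255.0", "10.0.3.1"))
```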
Under the hood, dc was using lxc-start and lxc-stop to manage containers.
Remember that fully working nsenter is a fairly recent thing. In 2010, if you
wanted to enter a container (the equivalent of docker exec), you needed to patch
your kernel so that setns() would support all kinds of namespaces. So we had
an abstract execution engine that would either use the features from our
patched kernels, or fall back to an SSH connection.
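To make the "abstract execution engine" idea concrete, here is a rough sketch of the decision it had to make. It is not the dotCloud code: `build_enter_command` and its parameters are hypothetical, and today's util-linux `nsenter` stands in for the patched-kernel path we had back then.

```python
import shlex


def build_enter_command(command, init_pid=None, ssh_host=None):
    """Return an argv list that runs `command` inside a container.

    Prefer joining the namespaces of the container's init process when the
    kernel allows it; otherwise fall back to SSH into the container.
    """
    if init_pid is not None:
        # Join the mount, UTS, IPC, network and PID namespaces of init_pid.
        return ["nsenter", "--target", str(init_pid),
                "--mount", "--uts", "--ipc", "--net", "--pid",
                "--"] + list(command)
    if ssh_host is not None:
        # ssh takes the remote command as one string, so quote each argument.
        return ["ssh", ssh_host, " ".join(shlex.quote(arg) for arg in command)]
    raise ValueError("need either init_pid or ssh_host")
```

The caller does not care which path is taken; it just runs the returned argv (for instance with `subprocess.run`).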
Since containers are managed by LXC, you don't need a long-running daemon
(and at this point there was no container engine per se). In fact, if you
scratch the surface, you realize that each container has its own long-running
daemon: it's lxc-start (it's similar to rkt or runc) and you connect
to it using an abstract socket (from memory, @/var/lib/lxc/<containername>).
Then, you need to slap an API on top of that, so that you can orchestrate deployments from a central place. Since containers are standalone, the process exposing that API doesn't have to be bullet-proof, and you can update/upgrade/restart it without being worried about your containers being restarted.
In dotCloud's case, we wanted to be able to do something like this:
dotcloud-internal-admin-cli create_container ubuntu jerome.awesome.ubuntu
... and this should create the container on a host with available capacity. Images should be pulled automatically if needed. If the image declares exposed ports, the ports should also be connected to the "routing layer".
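A minimal sketch of what that central command has to decide, under stated assumptions: the host records, the "most free memory wins" policy, and the function names are all made up for illustration (the real placement logic is not described here), and connecting exposed ports to the routing layer is left out.

```python
def pick_host(hosts, needed_mb):
    """Return the host with the most free memory that can fit the container."""
    candidates = [h for h in hosts if h["free_mb"] >= needed_mb]
    if not candidates:
        raise RuntimeError("no host has enough capacity")
    return max(candidates, key=lambda h: h["free_mb"])


def plan_create_container(hosts, image, name, needed_mb):
    """Return (host name, list of per-host operations to run there)."""
    host = pick_host(hosts, needed_mb)
    steps = []
    if image not in host["images"]:
        steps.append(("image_pull", image))  # pull only when missing
    steps.append(("container_create", image, name))
    steps.append(("container_start", name))
    return host["name"], steps
```

The step names mirror the dc subcommands from the session above; each step would be sent to the chosen host rather than run locally.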
So what we did was to expose dc functionality using ZeroRPC. ZeroRPC
is basically RPC over ZeroMQ, using MessagePack to serialize parameters,
return values, and exceptions. MessagePack is similar to JSON, but way
more efficient. (We didn't care much about efficiency except for the
high-traffic use cases like metrics and logs.)
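If "serialize parameters, return values, and exceptions" sounds abstract, here is a toy illustration of the idea. It is not ZeroRPC: JSON from the standard library stands in for MessagePack, a plain function call stands in for the ZeroMQ transport, and `handle_request` and `call` are hypothetical names.

```python
import json


def handle_request(target, raw_request):
    """Server side: decode a call, run it on `target`, encode the outcome."""
    request = json.loads(raw_request)
    try:
        method = getattr(target, request["method"])
        reply = {"result": method(*request["args"])}
    except Exception as exc:
        # Exceptions travel back as data so the caller can re-raise them.
        reply = {"error": {"type": type(exc).__name__, "message": str(exc)}}
    return json.dumps(reply)


def call(target, method, *args):
    """Client side: encode the call, then decode the result or the error."""
    raw_request = json.dumps({"method": method, "args": args})
    reply = json.loads(handle_request(target, raw_request))
    if "error" in reply:
        raise RuntimeError("{type}: {message}".format(**reply["error"]))
    return reply["result"]
```

Because `target` can be any module or object, `call(time, "sleep", 0)` works the same way as calling a method on a container manager instance.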
If you're curious about ZeroRPC, I presented it at PyCon a few years ago. Unfortunately, my French accent was a few orders of magnitude thicker than it is today (which says a lot) so you might struggle to understand me, sorry :(
ZeroRPC allowed us to expose almost any Python module or class like this:
# Expose Python built-in module "time" over port 1234
zerorpc-server --listen tcp://0.0.0.0:1234 time &
# Call time.sleep(4)
zerorpc-client tcp://localhost:1234 sleep 4
ZeroRPC also supports some fan-out topologies, including broadcast (all nodes receiving the function call; return value is discarded) and worker queue (all nodes subscribe to a "hub", you send a function call to the hub, one idle worker will get it, so you get transparent load balancing of requests).
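The two topologies can be mimicked in a few lines. This is only an in-process stand-in to show the dispatch semantics, with no networking and no ZeroRPC API involved; `Hub` and `Worker` are hypothetical names.

```python
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.calls = 0

    def whoami(self):
        self.calls += 1
        return self.name


class Hub:
    """Hands each call to exactly one idle worker (worker-queue topology)."""

    def __init__(self):
        self.idle = deque()

    def subscribe(self, worker):
        self.idle.append(worker)

    def call(self, method, *args):
        if not self.idle:
            raise RuntimeError("no idle worker")
        worker = self.idle.popleft()
        try:
            return getattr(worker, method)(*args)
        finally:
            self.idle.append(worker)  # the worker is idle again

    def broadcast(self, method, *args):
        """Every worker receives the call; return values are discarded."""
        for worker in list(self.idle):
            getattr(worker, method)(*args)
```

With two subscribed workers, successive `hub.call("whoami")` requests alternate between them, which is the "transparent load balancing" part.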
So here we are, with a "containers" service running on each node, letting us do the following operations from a central place:
- create containers
- start/stop/destroy them
Listing containers (and gathering core host metrics) relied on a separate service called "hostinfo".
So thanks to "hostinfo" we can also list all containers from a central place. Cool.
In the very first versions, dotCloud was building your apps "in place", i.e. when you push your code, the code would be copied to a temporary directory in the container (while it's still running the previous version of your app!), the build would happen, then a switcheroo happens (a symlink is updated to point to the new version) and processes are restarted.
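The symlink switcheroo is a classic trick, sketched below under the assumption of a `current` symlink inside the app directory (the real directory layout is not described here, and `switch_release` is a hypothetical name). Creating the new link under a temporary name and renaming it over the old one makes the switch atomic on POSIX systems.

```python
import os
import tempfile


def switch_release(app_dir, new_release):
    """Atomically repoint app_dir/current at new_release."""
    current = os.path.join(app_dir, "current")
    tmp_link = os.path.join(app_dir, ".current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # leftover from an interrupted switch
    os.symlink(new_release, tmp_link)
    # rename(2) replaces the old symlink itself, never the directory it targets.
    os.replace(tmp_link, current)


if __name__ == "__main__":
    app_dir = tempfile.mkdtemp()
    for release in ("v1", "v2"):
        os.mkdir(os.path.join(app_dir, release))
        switch_release(app_dir, release)
    print(os.readlink(os.path.join(app_dir, "current")))
```

Processes still have to be restarted afterwards, since running ones keep using the files they already opened from the previous version.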
To keep things clean and simple, this build system was managed by a separate service,
that accessed the same data structures. So we had the "container manager",
"hostinfo", and the "build manager", all accessing a bunch of containers and
configuration files in the same directory (/var/lib/dotcloud, by the way).
Then we added support for separate builds (probably similar to Heroku's "slugs"). The build would happen in a separate container; then that container image would be transferred to the right host, and a switcheroo would happen (the old container is replaced by the new one).
We had the equivalent of volumes, so by making sure that the old and new containers were on the same host, this process could be used for stateful apps as well. This, by the way, was probably a Very Bad Idea, as ditching stateful apps would have simplified things immensely for us. Keep in mind, though, that we were running not only web apps but also databases like MySQL, PostgreSQL, MongoDB, Redis, etc. I was one of the strong proponents of keeping stateful containers on board, and in retrospect I was almost certainly wrong, since it made our lives way more complicated than they could have been. But I digress!
To keep things simple and reduce impact to existing systems (at this point, we had a bunch of customers that each already generated more than $1K of monthly revenue, and we wanted to play it safe), when we rolled out that new system, it was managed by another service. So now on our hosts we had the "container manager", "hostinfo", the "build manager" (for in-place builds), and the "deploy manager".
(Small parenthesis: we didn't transfer full container images, of course. We
transferred only the AUFS rw layer; so that's the equivalent of a two-line
Dockerfile doing FROM python-nginx-uwsgi and RUN dotcloud-build.sh then
pushing the resulting image around.)
Then we added a few extra services also accessing container data; in no specific order, there was a remote execution manager (used e.g. by the MySQL replication system), a metrics collector, and a bunch of hacks to work around EC2/EBS issues, kernel issues, out of memory killer, etc.; for instance in some scenarios, the OOM killer would leave the container in a weird state and we would need a few special operations to clean it up. In the early days this was manual ops work, but as soon as we had enough data it was automated.
So at this point we have a bunch of services accessing a bunch of on-disk structures. Locking was key. The problem is that some operations are slow, so you don't want to lock when unnecessary (e.g. you don't want to lock everything while you're merely pulling an image). Some operations can fail gracefully (e.g. it's OK if metrics collection fails for a few minutes). Some operations are really important and you absolutely want to know if they went wrong (e.g. the stuff that watches over MySQL replica provisioning). Sometimes it's OK to ignore a container for a bit (e.g. for metrics) but sometimes you absolutely want to know if it's there (because if it's not, a failover mechanism will spin it up somewhere else; so having containers disappearing in a transient manner would be bad).
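To illustrate the different locking needs, here is a small sketch built on advisory `flock(2)` locks, one lock file per container. This is an assumption for the sake of the example, not a description of the mechanism we used; `ContainerLock` and the lock directory layout are hypothetical.

```python
import fcntl
import os


class ContainerLock:
    """Advisory per-container lock based on flock(2)."""

    def __init__(self, lock_dir, container):
        self.path = os.path.join(lock_dir, container + ".lock")
        self.fd = None

    def acquire(self, blocking=True):
        """Return True if the lock was taken, False if busy (non-blocking)."""
        fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o600)
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.flock(fd, flags)
        except BlockingIOError:
            os.close(fd)
            return False
        self.fd = fd
        return True

    def release(self):
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None
```

A metrics collector would call `acquire(blocking=False)` and simply skip the container when it gets `False`, while a deploy would block until it owns the lock. Slow operations like image pulls would not take this lock at all.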
To spice things up further, our ops toolkit was based on the dc CLI tool,
so that tool had to play nice with everything else.
At this point, the process tree looked like this:
- init -+- container
        +- hostinfo
        +- runner
        +- builder
        +- deployer
        +- metrics
        +- oomwrangler
        +- someotherstuff
        +- lxc-start for container X -+- process of container X
        |                             \- other process of container X
        +- lxc-start for container Y --- process of container Y
        \- lxc-start for container Z --- process of container Z
At this point, we really dreamed of a single point of entry to the container engine, to avoid locking issues. At the very least, all container metadata should be mediated by an engine exposing a clean API. We had a pretty good idea of what was needed, and that's what shaped the first versions of the Docker API.
The first versions of Docker were still relying on LXC. The process tree looked very much like the one above, except there is just the Docker Engine instead of our army of random container stuff.
Then, as containers picked up steam, LXC development (which was pretty
much dead, or at least making very slow progress) came back to life,
and in a few months, there were more LXC versions than in the few years before.
This broke Docker a few times, and that's what led to the development of
libcontainer, which made it possible to program cgroups and namespaces directly,
without going through LXC. You could put container processes directly under the
container engine, but having an intermediary process helps a lot,
so that's what we did; it was named dockerinit.
The process tree now looked like this:
- init --- docker -+- dockerinit for container X -+- process of container X
                   |                              \- other process of container X
                   +- dockerinit for container Y --- process of container Y
                   \- dockerinit for container Z --- process of container Z
But now you have a problem: if the docker process is restarted, you end up orphaning all your "dockerinits". For simplicity, docker and dockerinit share a bunch of file descriptors (giving access to the container's stdout and stderr). The idea was to eventually make dockerinit a full-blown, standalone mini-daemon, able to pass FDs around across UNIX sockets, buffer logs, whatever would be needed.
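Passing file descriptors across a UNIX socket is what lets a restarted engine get a container's stdout back from a standalone helper. Here is a minimal sketch of the mechanism only (not dockerinit's code), using `socket.send_fds` and `socket.recv_fds` from Python 3.9+; `hand_over_fd` and `receive_fd` are hypothetical helper names.

```python
import os
import socket


def hand_over_fd(sock, fd):
    """Send an open file descriptor to the peer of a UNIX socket."""
    # At least one byte of regular data must accompany the descriptor.
    socket.send_fds(sock, [b"fd"], [fd])


def receive_fd(sock):
    """Receive one file descriptor sent with hand_over_fd()."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


if __name__ == "__main__":
    helper_side, engine_side = socket.socketpair()  # AF_UNIX pair
    stdout_read, stdout_write = os.pipe()  # stands in for a container's stdout
    hand_over_fd(helper_side, stdout_read)
    received = receive_fd(engine_side)
    os.write(stdout_write, b"hello from the container\n")
    print(os.read(received, 100))
```

The receiver gets a new descriptor number that refers to the same open pipe, so it can keep reading the container's output even though it never created the pipe itself.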
Having a daemon to manage the containers (we're talking low-level management here, i.e. listing, starting, getting basic metrics) is crucial. I'm sorry if I failed to convince you that it was important; but believe me, you don't want to operate containers at scale without some kind of API. (Executing commands over SSH is fine until you have more than 10 containers per machine, then you really want a true API :-))
But at the same time, the Docker Engine has lots of features and complexity: builds, image management, semantic REST API over HTTP, etc.; those features are essential (they are what helped to drive container adoption, while vserver, openvz, jails, zones, LXC, etc. kept containers contained (sorry!) to the hosting world) but it's totally reasonable that you don't want all that code near your production stuff.
So the current solution is to delegate all the low-level management to containerd, and keep the rest in the Docker Engine.
The process tree looks like this:
- init - docker - containerd -+- shim for container X -+- process of container X
                              |                        \- other process of container X
                              +- shim for container Y --- process of container Y
                              \- shim for container Z --- process of container Z
The big upside (which doesn't appear on the diagram) is that the link between docker and containerd can be severed and reestablished, e.g. to restart or upgrade the Docker Engine.
Now, when people show the following process tree:
- systemd -+- rkt -+- process of container X
           |       \- other process of container X
           +- rkt --- process of container Y
           \- rkt --- process of container Z
Something is missing. How do you start additional containers? I can see a few ways to do that:
- create systemd unit files for your containers and re-exec systemd to load them (that's not a realistic solution but with the process tree above, that's the only obvious one!)
- use the systemd API over D-Bus to create units and containers (I think this should be possible, but this essentially turns systemd into a container engine; which is not very wise, because if it crashes, your machine will just kernel panic since it's PID 1)
- use an intermediary process (kubelet, or some systemd subsystem) but then you end up with exactly the same process tree as above, except that you've hidden the intermediary process!
Let me know if I'm missing something, I'd be more than happy to update this to better reflect the situation when using a "raw" OCI runtime.
Why is having systemd unit files for your containers not realistic? I'm not a systemd expert so genuine question. We don't use systemd in production but on my personal system I do and I have a bunch of unit files to run my various containers (mostly a reverse proxy and a couple of websites), not using Rkt, just Docker though.