Created September 27, 2016 20:06

The dotCloud container engine (the ancestor of Docker) started as a Python CLI tool that acted as a frontend to LXC and AUFS. It was called dc. This is a fake session using dc. Keep in mind that I haven't used dc in 3 years and I don't have the code checked out locally right now, so this is only approximate.

# pull an image
# (images were stored in Mercurial repos, using metashelf extension for special files and permissions)
dc image_pull ubuntu@f00db33f
# create a container
# (images also have templates that are evaluated when the container is created)
# (e.g. to put the IP address in /etc/network/interfaces, stuff like that)
dc container_create ubuntu@f00db33f jerome.awesome.ubuntu
# start the container
dc container_start jerome.awesome.ubuntu
# enter the container, like with "docker exec"
dc container_enter jerome.awesome.ubuntu
# there are a bunch of commands to manage port mappings
# the following one will allocate a random port
dc container_connection_add jerome.awesome.ubuntu tcp 80
# check which port was allocated
dc network_ls

So far, so good.

Under the hood, dc was using lxc-start and lxc-stop to manage containers. Remember that fully working nsenter is a fairly recent thing. In 2010, if you wanted to enter a container (the equivalent of docker exec), you needed to patch your kernel so that setns() would support all kinds of namespaces. So we had an abstract execution engine that would either use the features from our patched kernels, or fall back to an SSH connection.
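
The "use the kernel feature if available, otherwise fall back to SSH" choice can be sketched roughly as follows. This is not dc's code: the function name and its parameters are hypothetical, and the modern nsenter tool from util-linux stands in for the patched-kernel path that dc relied on at the time.

```python
import shlex


def build_enter_command(container, command, init_pid=None, ssh_host=None):
    """Return the argv used to run `command` inside `container`.

    Hypothetical helper: prefer namespace entry when the PID of the
    container's init process is known, otherwise fall back to SSH.
    """
    if init_pid is not None:
        # nsenter (util-linux) joins the namespaces of the target PID.
        return ["nsenter", "--target", str(init_pid),
                "--mount", "--uts", "--ipc", "--net", "--pid",
                "--"] + list(command)
    if ssh_host is not None:
        # Fallback: assumes the container runs an SSH server we can reach.
        return ["ssh", ssh_host, shlex.join(command)]
    raise RuntimeError("no way to enter container %r" % container)
```

The caller only sees "run this command in that container"; which transport is used stays an implementation detail of the engine.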

Since containers are managed by LXC, you don't need a long-running daemon (and at this point there was no container engine per se). In fact, if you scratch the surface, you realize that each container has its own long-running daemon: it's lxc-start (it's similar to rkt or runc) and you connect to it using an abstract socket (from memory, @/var/lib/lxc/<containername>).
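
For readers unfamiliar with abstract sockets: the leading "@" is how tools display a NUL byte at the start of the address, meaning the socket lives in a kernel namespace rather than on the filesystem (Linux only). A minimal sketch of connecting to one from Python, assuming the path layout quoted from memory above and assuming a stream socket; both helper names are made up for illustration:

```python
import socket


def lxc_abstract_address(container_name):
    """Build the abstract-namespace address for a container.

    The "@" in "@/var/lib/lxc/<name>" stands for a leading NUL byte.
    The path layout is the one quoted from memory in the text, so treat
    it as an assumption rather than the exact LXC convention.
    """
    return b"\0" + ("/var/lib/lxc/%s" % container_name).encode()


def connect_to_container(container_name):
    """Connect to the per-container lxc-start process (Linux only)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(lxc_abstract_address(container_name))
    return sock
```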

Then, you need to slap an API on top of that, so that you can orchestrate deployments from a central place. Since containers are standalone, the process exposing that API doesn't have to be bullet-proof, and you can update/upgrade/restart it without being worried about your containers being restarted.

In dotCloud's case, we wanted to be able to do something like this:

dotcloud-internal-admin-cli create_container ubuntu jerome.awesome.ubuntu

... and this should create the container on a host with available capacity. Images should be pulled automatically if needed. If the image declares exposed ports, the ports should also be connected to the "routing layer".

So what we did was to expose dc functionality using ZeroRPC. ZeroRPC is basically RPC over ZeroMQ, using MessagePack to serialize parameters, return values, and exceptions. MessagePack is similar to JSON, but way more efficient. (We didn't care much about efficiency except for the high-traffic use cases like metrics and logs.)

If you're curious about ZeroRPC, I presented it at PyCon a few years ago. Unfortunately, my French accent was a few orders of magnitude thicker than it is today (which says a lot) so you might struggle to understand me, sorry :(

ZeroRPC allowed us to expose almost any Python module or class like this:

# Expose Python built-in module "time" over port 1234
zerorpc-server --listen tcp:// time &
# Call time.sleep(4)
zerorpc-client tcp://localhost:1234 sleep 4
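
The idea behind "expose any module" can be illustrated without the real library. The sketch below is not ZeroRPC and does not use its wire format: it uses only the standard library, with JSON standing in for MessagePack and a direct function call standing in for the ZeroMQ transport. The names handle_call and call are made up for this example.

```python
import json
import math


def handle_call(module, request_bytes):
    """Server side: decode a call, run it against `module`, encode the outcome."""
    request = json.loads(request_bytes)
    try:
        func = getattr(module, request["method"])
        reply = {"result": func(*request["args"]), "error": None}
    except Exception as exc:
        # Exceptions are serialized and travel back to the caller too.
        reply = {"result": None, "error": "%s: %s" % (type(exc).__name__, exc)}
    return json.dumps(reply).encode()


def call(module, method, *args):
    """Client side: encode the call, 'send' it, decode the reply."""
    request = json.dumps({"method": method, "args": list(args)}).encode()
    reply = json.loads(handle_call(module, request))
    if reply["error"] is not None:
        raise RuntimeError(reply["error"])
    return reply["result"]
```

With this in place, call(math, "sqrt", 16) goes through the same encode, dispatch, decode round trip that a remote call would, which is all that "exposing a module" amounts to.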

ZeroRPC also supports some fan-out topologies, including broadcast (all nodes receiving the function call; return value is discarded) and worker queue (all nodes subscribe to a "hub", you send a function call to the hub, one idle worker will get it, so you get transparent load balancing of requests).
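
The difference between the two topologies is easier to see in code. This is an in-process toy with no networking, and the Hub and Node classes are hypothetical names for this sketch, not ZeroRPC's API:

```python
from collections import deque


class Hub:
    """In-process stand-in for the two fan-out patterns."""

    def __init__(self):
        self.workers = []      # every subscribed worker
        self.idle = deque()    # workers currently free to take a call

    def subscribe(self, worker):
        self.workers.append(worker)
        self.idle.append(worker)

    def broadcast(self, method, *args):
        # Every node receives the call; return values are discarded.
        for worker in self.workers:
            getattr(worker, method)(*args)

    def dispatch(self, method, *args):
        # Worker queue: exactly one idle worker handles the call.
        worker = self.idle.popleft()
        try:
            return getattr(worker, method)(*args)
        finally:
            self.idle.append(worker)   # back of the queue once done


class Node:
    """A worker that records what it received and answers with its name."""

    def __init__(self, name):
        self.name = name
        self.seen = []

    def ping(self, payload):
        self.seen.append(payload)
        return self.name
```

Successive dispatch calls rotate through the idle workers, which is where the "transparent load balancing" comes from; the caller never names a specific node.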

So here we are, with a "containers" service running on each node, letting us do the following operations from a central place:

  • create containers
  • start/stop/destroy them

Listing containers (and gathering core host metrics) relied on a separate service called "hostinfo".

So thanks to "hostinfo" we can also list all containers from a central place. Cool.

In the very first versions, dotCloud was building your apps "in place", i.e. when you push your code, the code would be copied to a temporary directory in the container (while it's still running the previous version of your app!), the build would happen, then a switcheroo happens (a symlink is updated to point to the new version) and processes are restarted.
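
The symlink switcheroo is a classic trick; here is a minimal sketch of it. The "current" link name and the one-directory-per-release layout are assumptions made for illustration, not the actual dotCloud layout.

```python
import os


def switch_current(app_dir, new_release):
    """Atomically repoint <app_dir>/current at <app_dir>/<new_release>."""
    link = os.path.join(app_dir, "current")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)            # leftover from an interrupted switch
    os.symlink(new_release, tmp_link)  # relative target, same directory
    os.replace(tmp_link, link)         # rename(2) swaps the link in one step
```

Creating the new link under a temporary name and renaming it over the old one means there is no moment where "current" is missing, so processes restarted right after the switch always resolve a complete build.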

To keep things clean and simple, this build system was managed by a separate service, that accessed the same data structures. So we had the "container manager", "hostinfo", and the "build manager", all accessing a bunch of containers and configuration files in the same directory (/var/lib/dotcloud, by the way).

Then we added support for separate builds (probably similar to Heroku's "slugs"). The build would happen in a separate container; then that container image would be transferred to the right host, and a switcheroo would happen (the old container is replaced by the new one).

We had the equivalent of volumes, so by making sure that the old and new containers were on the same host, this process could be used for stateful apps as well. This, by the way, was probably a Very Bad Idea, as ditching stateful apps would have simplified things immensely for us. Keep in mind, though, that we were running not only web apps but also databases like MySQL, PostgreSQL, MongoDB, Redis, etc. I was one of the strong proponents of keeping stateful containers on board, and in retrospect I was most certainly wrong, since it made our lives way more complicated than they could have been. But I digress!

To keep things simple and reduce impact to existing systems (at this point, we had a bunch of customers that each already generated more than $1K of monthly revenue, and we wanted to play safe), when we rolled out that new system, it was managed by another service. So now on our hosts we had the "container manager", "hostinfo", the "build manager" (for in place builds), and the "deploy manager".

(Small parenthesis: we didn't transfer full container images, of course. We transferred only the AUFS rw layer; so that's the equivalent of a two-line Dockerfile doing FROM python-nginx-uwsgi and RUN then pushing the resulting image around.)

Then we added a few extra services also accessing container data; in no specific order, there was a remote execution manager (used e.g. by the MySQL replication system), a metrics collector, and a bunch of hacks to work around EC2/EBS issues, kernel issues, out of memory killer, etc.; for instance in some scenarios, the OOM killer would leave the container in a weird state and we would need a few special operations to clean it up. In the early days this was manual ops work, but as soon as we had enough data it was automated.

So at this point we have a bunch of services accessing a bunch of on-disk structures. Locking was key. The problem is that some operations are slow, so you don't want to lock when unnecessary (e.g. you don't want to lock everything while you're merely pulling an image). Some operations can fail gracefully (e.g. it's OK if metrics collection fails for a few minutes). Some operations are really important and you absolutely want to know if they went wrong (e.g. the stuff that watches over MySQL replica provisioning). Sometimes it's OK to ignore a container for a bit (e.g. for metrics) but sometimes you absolutely want to know if it's here (because if it's not, a failover mechanism will spin it up somewhere else; so having containers disappearing in a transient manner would be bad).
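
To make the "some callers must wait, some should just skip" distinction concrete, here is a small sketch using advisory file locks. It is an illustration of the trade-off, not how the dotCloud services actually did their locking; the helper names and the lock file path are hypothetical, and flock is Unix-only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def container_lock(path, wait=True):
    """Hold an exclusive flock on `path`.

    wait=True  -> block until the lock is free (critical operations).
    wait=False -> raise BlockingIOError at once if someone else holds it
                  (best-effort operations such as metrics collection).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        flags = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(fd, flags)
        yield
    finally:
        os.close(fd)   # closing the descriptor releases the lock


def collect_metrics(lock_path):
    """Best-effort task: skip this round rather than wait for the lock."""
    try:
        with container_lock(lock_path, wait=False):
            return "collected"
    except BlockingIOError:
        return "skipped"
```

A critical operation would call container_lock(path) with the default wait=True and let any failure propagate loudly, while the metrics collector quietly skips a round when a deploy holds the lock.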

To spice things up further, our ops toolkit was based on the dc CLI tool, so that tool had to play nice with everything else.

At this point, the process tree looked like this:

- init -+- container
        +- hostinfo
        +- runner
        +- builder
        +- deployer
        +- metrics
        +- oomwrangler
        +- someotherstuff
        +- lxc-start for container X -+- process of container X
        |                             \- other process of container X
        +- lxc-start for container Y --- process of container Y
        \- lxc-start for container Z --- process of container Z

At this point, we really dreamed of a single point of entry to the container engine, to avoid locking issues. At the very least, all container metadata should be mediated by an engine exposing a clean API. We had a pretty good idea of what was needed, and that's what shaped the first versions of the Docker API.

The first versions of Docker were still relying on LXC. The process tree looked very much like the one above, except there is just the Docker Engine instead of our army of random container stuff.

Then, as containers picked up steam, LXC development (which was pretty much dead, or at least making very slow progress) came to life, and in a few months, there were more LXC versions than in the few years before. This broke Docker a few times, and that's what led to the development of libcontainer, which made it possible to program cgroups and namespaces directly without going through LXC. You could put container processes directly under the container engine, but having an intermediary process helps a lot, so that's what we did; it was named dockerinit.

The process tree now looked like this:

- init --- docker -+- dockerinit for container X -+- process of container X
                   |                              \- other process of container X
                   +- dockerinit for container Y --- process of container Y
                   \- dockerinit for container Z --- process of container Z

But now you have a problem: if the docker process is restarted, you end up orphaning all your "dockerinits". For simplicity, docker and dockerinit share a bunch of file descriptors (giving access to the container's stdout and stderr). The idea was to eventually make dockerinit a full-blown, standalone mini-daemon that could pass FDs around across UNIX sockets, buffer logs, and do whatever else would be needed.
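
Passing file descriptors across a UNIX socket is what would let a restarted engine get a container's stdout back from a surviving helper process. A minimal sketch of the mechanism, assuming Python 3.9+ (for socket.send_fds and socket.recv_fds) on a Unix system; the two helper names and the "stdout" label are made up for this example and say nothing about how dockerinit was actually written:

```python
import os
import socket


def hand_over_fd(sock, fd, label=b"stdout"):
    """Send an open file descriptor to the peer of a UNIX socket."""
    socket.send_fds(sock, [label], [fd])


def receive_fd(sock):
    """Receive one descriptor and its label from a UNIX socket."""
    label, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return label, fds[0]
```

The receiver gets a new descriptor number that refers to the same open file, so whatever it writes or reads goes to the same pipe the sender was holding.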

Having a daemon to manage the containers (we're talking low-level management here, i.e. listing, starting, getting basic metrics) is crucial. I'm sorry if I failed to convince you that it was important; but believe me, you don't want to operate containers at scale without some kind of API. (Executing commands over SSH is fine until you have more than 10 containers per machine, then you really want a true API :-))

But at the same time, the Docker Engine has lots of features and complexity: builds, image management, semantic REST API over HTTP, etc.; those features are essential (they are what helped to drive container adoption, while vserver, openvz, jails, zones, LXC, etc. kept containers contained (sorry!) to the hosting world) but it's totally reasonable that you don't want all that code near your production stuff.

So the current solution is to delegate all the low-level management to containerd, and keep the rest in the Docker Engine.

The process tree looks like this:

- init - docker - containerd -+- shim for container X -+- process of container X
                              |                        \- other process of container X
                              +- shim for container Y --- process of container Y
                              \- shim for container Z --- process of container Z

The big upside (which doesn't appear on the diagram) is that the link between docker and containerd can be severed and reestablished, e.g. to restart or upgrade the Docker Engine.

Now, when people show the following process tree:

- systemd -+- rkt -+- process of container X
           |       \- other process of container X
           +- rkt --- process of container Y
           \- rkt --- process of container Z

Something is missing. How do you start additional containers? I can see a few ways to do that:

  • create systemd unit files for your containers and re-exec systemd to load them (that's not a realistic solution but with the process tree above, that's the only obvious one!)
  • use systemd API over DBUS to create units and containers (I think this should be possible, but this essentially turns systemd into a container engine; which is not very wise, because if it crashes, your machine will just kernel panic since it's PID1)
  • use an intermediary process (kubelet, or some systemd subsystem) but then you end up with exactly the same process tree as above, except that you've hidden the intermediary process!

Let me know if I'm missing something, I'd be more than happy to update this to better reflect the situation when using a "raw" OCI runtime.


yifan-gu commented Oct 3, 2016

Re @blixtra Thanks for explaining, btw we are already using systemd-run to monitor the rkt processes in rktelet today.

So from the perspective of Kubernetes, the process tree will first look like:

- systemd
    |_______ kubelet.service

Then, kubelet calls systemd-run rkt app sandbox to create an empty pod. Note that an empty pod will have a systemd running inside as PID 1.

- systemd
    |_______ kubelet.service
    |_______ transient rkt service
                |________ stage1 (nspawn, or kvm)
                             |__________ systemd inside stage1

And, later, more containers can be added as services with rkt app add xxx. This is done by adding more unit files for the systemd inside stage1.

- systemd
    |_______ kubelet.service
    |_______ transient rkt service
                |________ systemd-nspawn
                             |__________ systemd
                                            |____________ process of container X
                                            |____________ process of container Y

And if we add another pod, then:

- systemd
    |_______ kubelet.service
    |_______ transient rkt service
                |________ systemd-nspawn
                             |__________ systemd
                                            |____________ process of container X
                                            |____________ process of container Y

    |_______ transient rkt service
                |________ systemd-nspawn
                             |__________ systemd
                                            |____________ process of container XX
                                            |____________ process of container YY

The systemd-nspawn is what we call stage1; it actually contains nspawn + systemd + other necessary bundles, packaged into an image.
The stage1 is swappable: besides nspawn, rkt also supports stage1-kvm (runs containers inside a VM) and stage1-fly (runs containers without namespaces, just chroot).

Feel free to ping us on the Slack channels or on the kubernetes-sig-rktnetes Google group if there are any questions on this.
Happy to explain more :)


jpetazzo commented Aug 9, 2017

Damn, I had never received notifications for comments here. 😰

I'd like to thank @blixtra for his remarks, and I'm going to try to explain my reasoning.

Regarding the first point [creating systemd unit files and re-exec'ing systemd] [...]
The recommended way to do that is to use systemd-run, which is documented in this part of the rkt docs.

OK, I suppose that this will use the DBUS API to start the unit.

The reason why I wrote that it wasn't reasonable, is when you have lots of short-lived containers (i.e. thousands per minute).
You certainly don't want to re-exec PID1 1000 times per second.

So, assuming that systemd-run uses the DBUS API ... Let's move to the next point.

Point 2 I find a bit odd. systemd (PID1) is designed to manage processes. Each process it manages has associated namespaces, cgroups, etc. regardless of whether it's a containerized app or not.

Yes! But:

  • some people want host networking while some people want CNI or CNM
  • most people will want to load LSM profiles for their containers
  • lots of people will want "volumes" (in the Docker sense of the term)
  • most people want namespaces and cgroups
  • most people want the ability to attach arbitrary labels to containers

How do you decide which features are "process-related" (and should be handled by systemd) and which features are "container-related" (and should be handled by rkt and the surrounding environment)?

For instance, when I wrote this, I couldn't find an easy way to cleanly implement overlay networking (because the approach in systemd's doc was "you should use host networking"). Maybe it's changed today (maybe I can create a network sandbox with systemd, and then configure it with a plugin, and then attach other processes/containers to it?)

With the last point I'm not 100% sure what you mean. What is the hidden intermediary step?
With the current approach, a unit file is generated (manually or by some tool - kubelet, systemd, etc.)
and the process (container) is run. This is fairly direct.

Good point. What I meant is that there is a process on the side (e.g. kubelet), and it has
to keep track of the running containers (since for systemd it's just a bunch of processes).
That takes me back to the original issues of the dotCloud platform, where we had a bunch
of extra state (the state that we needed, but that wasn't stored by LXC) and keeping that
in sync was challenging.

After thinking about it some more -- I think it all boils down to deciding where to draw
the lines between the different subsystems.

I'm back in Berlin for a bit by the way; I hope we'll be able to discuss that (and more) some time :-)
