jpetazzo/README.md Secret

## README.md

      
The dotCloud container engine (the ancestor of Docker) started as a Python CLI tool
that acted as a frontend to LXC and AUFS. It was called dc. This is a fake session
using dc. Keep in mind that I haven't used dc in 3 years and I don't have the
code checked out locally right now, so this is only approximate.

    # pull an image
    # (images were stored in Mercurial repos, using metashelf extension for special files and permissions)
    dc image_pull ubuntu@f00db33f
    # create a container
    # (images also have templates that are evaluated when the container is created)
    # (e.g. to put the IP address in /etc/network/interfaces, stuff like that)
    dc container_create ubuntu@f00db33f jerome.awesome.ubuntu
    # start the container
    dc container_start jerome.awesome.ubuntu
    # enter the container, like with "docker exec"
    dc container_enter jerome.awesome.ubuntu
    # there are a bunch of commands to manage port mappings
    # the following one will allocate a random port
    dc container_connection_add jerome.awesome.ubuntu tcp 80
    # check which port was allocated
    dc network_ls

So far, so good.
Under the hood, dc was using lxc-start and lxc-stop to manage containers.
Remember that fully working nsenter is a fairly recent thing. In 2010, if you
wanted to enter a container (the equivalent of docker exec), you needed to patch
your kernel so that setns() would support all kinds of namespaces. So we had
an abstract execution engine that would either use the features from our
patched kernels, or fall back to an SSH connection.
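To make the "abstract execution engine" idea concrete, here is a minimal sketch of the fallback logic. It is not the dc code: the `Container` fields, the function names, and the use of today's `nsenter` binary as a stand-in for our patched-kernel path are all assumptions made for illustration. The sketch only builds the command line; it doesn't run anything.

```python
import shutil
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Container:
    name: str                # e.g. "jerome.awesome.ubuntu"
    init_pid: Optional[int]  # host PID of the container's init, if known (hypothetical field)
    ssh_host: str            # address of an sshd inside the container (hypothetical field)


def namespace_entry_available() -> bool:
    """Stand-in check: is a namespace-entering tool installed on this host?"""
    return shutil.which("nsenter") is not None


def build_enter_command(container: Container, argv: List[str],
                        can_enter_namespaces: Optional[bool] = None) -> List[str]:
    """Return the command line that would run `argv` inside the container."""
    if can_enter_namespaces is None:
        can_enter_namespaces = namespace_entry_available()
    if can_enter_namespaces and container.init_pid is not None:
        # Preferred path: join the namespaces of the container's init process.
        return ["nsenter", "--target", str(container.init_pid),
                "--mount", "--uts", "--ipc", "--net", "--pid", "--"] + argv
    # Fallback path: go through an SSH daemon running inside the container.
    return ["ssh", "root@" + container.ssh_host] + argv
```

Callers only ever see `build_enter_command()`; whether the host can enter namespaces directly is a detail hidden behind it.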
Since containers are managed by LXC, you don't need a long-running daemon
(and at this point there was no container engine per se). In fact, if you
scratch the surface, you realize that each container has its own long-running
daemon: it's lxc-start (it's similar to rkt or runc) and you connect
to it using an abstract socket (from memory, @/var/lib/lxc/<containername>).
Then, you need to slap an API on top of that, so that you can orchestrate
deployments from a central place. Since containers are standalone, the
process exposing that API doesn't have to be bullet-proof, and you can
update/upgrade/restart it without being worried about your containers
being restarted.
In dotCloud's case, we wanted to be able to do something like this:

    dotcloud-internal-admin-cli create_container ubuntu jerome.awesome.ubuntu

... and this should create the container on a host with available capacity.
Images should be pulled automatically if needed. If the image declares exposed
ports, the ports should also be connected to the "routing layer".
So what we did was to expose dc functionality using ZeroRPC. ZeroRPC
is basically RPC over ZeroMQ, using MessagePack to serialize parameters,
return values, and exceptions. MessagePack is similar to JSON, but way
more efficient. (We didn't care much about efficiency except for the
high-traffic use cases like metrics and logs.)
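To give a feel for the size difference, here is a rough sketch that hand-encodes one small message using a tiny subset of the MessagePack format (small positive integers, short strings, short lists and maps) and compares it with compact JSON. The message shape is made up for the example and is not ZeroRPC's actual wire format; a real program would use a MessagePack library rather than this toy encoder.

```python
import json


def pack(obj) -> bytes:
    """Encode a small subset of MessagePack (fixint, fixstr, fixarray, fixmap)."""
    if isinstance(obj, bool):
        raise TypeError("booleans are left out of this sketch")
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                                  # positive fixint
    if isinstance(obj, str) and len(obj.encode("utf-8")) <= 31:
        raw = obj.encode("utf-8")
        return bytes([0xA0 | len(raw)]) + raw                # fixstr
    if isinstance(obj, list) and len(obj) <= 15:
        items = b"".join(pack(x) for x in obj)
        return bytes([0x90 | len(obj)]) + items              # fixarray
    if isinstance(obj, dict) and len(obj) <= 15:
        pairs = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + pairs              # fixmap
    raise TypeError("unsupported in this sketch: %r" % (obj,))


call = {"method": "sleep", "args": [4]}   # hypothetical message shape
packed = pack(call)
as_json = json.dumps(call, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))          # prints "21 29"
```

The saving comes from replacing quotes, colons, commas, and brackets with single type-and-length bytes, which adds up quickly on metrics and log streams.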
If you're curious about ZeroRPC, I presented it at PyCon a few years ago.
Unfortunately, my French accent was a few orders of magnitude thicker than
it is today (which says a lot) so you might struggle to understand me, sorry :(
ZeroRPC allowed us to expose almost any Python module or class like this:

    # Expose Python built-in module "time" over port 1234
    zerorpc-server --listen tcp://0.0.0.0:1234 time &
    # Call time.sleep(4)
    zerorpc-client tcp://localhost:1234 sleep 4
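If you don't have ZeroRPC at hand, the same "expose any module" idea can be sketched with the XML-RPC modules from the Python standard library. To be clear, this is a stand-in and not ZeroRPC: the transport is HTTP and XML instead of ZeroMQ and MessagePack, and the `expose()` helper is a name I made up for the example.

```python
import threading
from contextlib import contextmanager
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


@contextmanager
def expose(obj, host="127.0.0.1"):
    """Serve the public callables of `obj` over XML-RPC and yield the server URL."""
    # Port 0 asks the OS for any free port.
    server = SimpleXMLRPCServer((host, 0), logRequests=False)
    server.register_instance(obj)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield "http://%s:%d" % server.server_address
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


def remote_call(url, method, *args):
    """Call `method(*args)` on the object exposed at `url`."""
    with ServerProxy(url) as proxy:
        return getattr(proxy, method)(*args)
```

With this, `with expose(math) as url: remote_call(url, "sqrt", 16)` returns `4.0`. Functions that return `None` (like `time.sleep`) would additionally need `allow_none=True` on the server, which is one of the small frictions ZeroRPC spared us.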
ZeroRPC also supports some fan-out topologies, including broadcast (all nodes
receiving the function call; return value is discarded) and worker queue
(all nodes subscribe to a "hub", you send a function call to the hub,
one idle worker will get it, so you get transparent load balancing of requests).
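The worker queue topology can be imitated inside a single process with threads and a queue. This is only an analogy (no ZeroMQ, no network, and every name here is invented for the example), but it shows the key property: whoever is idle picks up the next call, so the caller never chooses a worker.

```python
import queue
import threading


def run_worker_queue(calls, n_workers=3):
    """Dispatch each (func, args) pair to an idle worker; return {call_id: (worker_id, result)}."""
    hub = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker(worker_id):
        while True:
            item = hub.get()
            if item is None:              # sentinel: no more work for this worker
                return
            call_id, func, args = item
            value = func(*args)
            with results_lock:
                results[call_id] = (worker_id, value)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for call_id, (func, args) in enumerate(calls):
        hub.put((call_id, func, args))
    for _ in threads:
        hub.put(None)                     # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Which worker handles which call is not deterministic, and that is the point: the load spreads itself without any scheduling logic on the caller's side.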
So here we are, with a "containers" service running on each node,
letting us do the following operations from a central place:

- create containers
- start/stop/destroy them

Listing containers (and gathering core host metrics) relied on a separate
service called "hostinfo".
So thanks to "hostinfo" we can also list all containers from a central place.
Cool.
In the very first versions, dotCloud was building your apps "in place", i.e. when
you push your code, the code would be copied to a temporary directory in the container
(while it's still running the previous version of your app!), the build would happen,
then a switcheroo happens (a symlink is updated to point to the new version) and
processes are restarted.
To keep things clean and simple, this build system was managed by a separate service,
that accessed the same data structures. So we had the "container manager",
"hostinfo", and the "build manager", all accessing a bunch of containers and
configuration files in the same directory (/var/lib/dotcloud, by the way).
Then we added support for separate builds (probably similar to Heroku's "slugs").
The build would happen in a separate container; then that container image would
be transferred to the right host, and a switcheroo would happen (the old container is
replaced by the new one).
We had the equivalent of volumes, so by making sure that the old and new containers
were on the same host, this process could be used for stateful apps
as well. This, by the way, was probably a Very Bad Idea, as ditching stateful
apps would have simplified things immensely for us. Keep in mind, though, that
we were running not only web apps but also databases like MySQL, PostgreSQL, MongoDB,
Redis, etc. I was one of the strong proponents of keeping stateful containers on board,
and in retrospect I was most certainly wrong, since it made our lives way more
complicated than they could have been. But I digress!
To keep things simple and reduce the impact on existing systems (at this point, we had
a bunch of customers that each already generated more than $1K of monthly revenue,
and we wanted to play it safe), when we rolled out that new system, it was managed
by another service. So now on our hosts we had the "container manager", "hostinfo",
the "build manager" (for in-place builds), and the "deploy manager".
(Small parenthesis: we didn't transfer full container images, of course. We
transferred only the AUFS rw layer; so that's the equivalent of a two-line
Dockerfile doing FROM python-nginx-uwsgi and RUN dotcloud-build.sh then
pushing the resulting image around.)
Then we added a few extra services also accessing container data; in no specific
order, there was a remote execution manager (used e.g. by the MySQL replication
system), a metrics collector, and a bunch of hacks to work around EC2/EBS issues,
kernel issues, out of memory killer, etc.; for instance in some scenarios,
the OOM killer would leave the container in a weird state and we would need a few
special operations to clean it up. In the early days this was manual ops work,
but as soon as we had enough data it was automated.
So at this point we have a bunch of services accessing a bunch of on-disk
structures. Locking was key. The problem is that some operations are slow,
so you don't want to lock when unnecessary (e.g. you don't want to lock
everything while you're merely pulling an image). Some operations can
fail gracefully (e.g. it's OK if metrics collection fails for a few minutes).
Some operations are really important and you absolutely want to know if
they went wrong (e.g. the stuff that watches over MySQL replica
provisioning). Sometimes it's OK to ignore a container for a bit (e.g. for
metrics) but sometimes you absolutely want to know if it's here (because
if it's not, a failover mechanism will spin it up somewhere else; so having
containers disappearing in a transient manner would be bad).
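Here is a sketch of what those different locking needs can look like, using advisory `flock()` locks on a per-container lock file. This is an illustration and not our actual implementation: the lock file layout and the helper names are assumptions, and `fcntl` makes it UNIX-only. The idea is that writers take an exclusive lock, readers take a shared one, and best-effort operations refuse to wait.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def container_lock(lock_path, exclusive=True, wait=True):
    """Hold an advisory flock() on lock_path for the duration of the block.

    Raises BlockingIOError when wait=False and a conflicting lock is held.
    """
    flags = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not wait:
        flags |= fcntl.LOCK_NB
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, flags)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def collect_metrics(lock_path):
    """Best-effort operation: skip the container if someone is modifying it."""
    try:
        with container_lock(lock_path, exclusive=False, wait=False):
            return "collected"
    except BlockingIOError:
        return "skipped"
```

An operation that must not miss a container (the failover check, for instance) would call `container_lock(..., wait=True)` instead and block until the writer is done. The painful part was not writing such a helper; it was getting half a dozen independently deployed services, plus the CLI, to agree on when to use which mode.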
To spice things up further, our ops toolkit was based on the dc CLI tool,
so that tool had to play nice with everything else.
At this point, the process tree looked like this:

    - init -+- container
            +- hostinfo
            +- runner
            +- builder
            +- deployer
            +- metrics
            +- oomwrangler
            +- someotherstuff
            +- lxc-start for container X -+- process of container X
            |                             \- other process of container X
            +- lxc-start for container Y --- process of container Y
            \- lxc-start for container Z --- process of container Z

At this point, we really dreamed of a single point of entry to the
container engine, to avoid locking issues. At the very least, all
container metadata should be mediated by an engine exposing a clean API.
We had a pretty good idea of what was needed, and that's what shaped
the first versions of the Docker API.
The first versions of Docker were still relying on LXC. The process
tree looked very much like the one above, except there is just the
Docker Engine instead of our army of random container stuff.
Then, as containers picked up steam, LXC development (which was pretty
much dead, or at least making very slow progress) came to life,
and in a few months, there were more LXC versions than in the few years before.
This broke Docker a few times, and that's what led to the development of
libcontainer, which let us program cgroups and namespaces directly without
going through LXC. You could put container processes directly under the
container engine, but having an intermediary process helps a lot,
so that's what we did; it was named dockerinit.
The process tree now looked like this:

    - init --- docker -+- dockerinit for container X -+- process of container X
                       |                              \- other process of container X
                       +- dockerinit for container Y --- process of container Y
                       \- dockerinit for container Z --- process of container Z

But now you have a problem: if the docker process is restarted, you
end up orphaning all your "dockerinits". For simplicity, docker and dockerinit
share a bunch of file descriptors (giving access to the container's stdout
and stderr). The idea was to eventually make dockerinit a full-blown, standalone
mini-daemon, able to pass FDs across UNIX sockets, buffer logs, and do
whatever else would be needed.
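Passing file descriptors across UNIX sockets is what would let a restarted engine get a container's stdout back from the process that kept it open. Here is a minimal sketch of the mechanism using `socket.send_fds()` and `socket.recv_fds()` (Python 3.9+, UNIX only). The helper names and the label are made up, and this is not code from dockerinit.

```python
import os
import socket


def send_fd(sock, fd, label=b"stdout"):
    """Send one open file descriptor, tagged with a short label, over a UNIX socket."""
    socket.send_fds(sock, [label], [fd])


def recv_fd(sock):
    """Receive a (label, fd) pair; the fd is a new descriptor in the receiving process."""
    label, fds, _msg_flags, _address = socket.recv_fds(sock, 1024, 1)
    return label, fds[0]


# Demo within one process: both ends of a socketpair stand in for the
# mini-daemon (sender) and the freshly restarted engine (receiver).
keeper, engine = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()           # stands in for a container's stdout pipe
send_fd(keeper, write_end)
label, received = recv_fd(engine)
os.write(received, b"hello\n")            # the received fd refers to the same pipe
print(label, os.read(read_end, 100))      # prints: b'stdout' b'hello\n'
```

The received descriptor has a different number but points at the same underlying pipe, so the receiver can keep using it after the sender closes its own copy.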
Having a daemon to manage the containers (we're talking low-level management
here, i.e. listing, starting, getting basic metrics) is crucial.
I'm sorry if I failed to convince you that it was important; but believe me,
you don't want to operate containers at scale without some kind of API.
(Executing commands over SSH is fine until you have more than 10 containers
per machine, then you really want a true API :-))
But at the same time, the Docker Engine has lots of features and
complexity: builds, image management, semantic REST API over HTTP, etc.;
those features are essential (they are what helped to drive container
adoption, while vserver, openvz, jails, zones, LXC, etc. kept containers
contained (sorry!) to the hosting world) but it's totally reasonable
that you don't want all that code near your production stuff.
So the current solution is to delegate all the low-level management
to containerd, and keep the rest in the Docker Engine.
The process tree looks like this:

    - init - docker - containerd -+- shim for container X -+- process of container X
                                  |                        \- other process of container X
                                  +- shim for container Y --- process of container Y
                                  \- shim for container Z --- process of container Z

The big upside (which doesn't appear on the diagram) is that the link
between docker and containerd can be severed and reestablished, e.g. to
restart or upgrade the Docker Engine.
Now, when people show the following process tree:

    - systemd -+- rkt -+- process of container X
               |       \- other process of container X
               +- rkt --- process of container Y
               \- rkt --- process of container Z

Something is missing. How do you start additional containers? I can see
a few ways to do that:

- create systemd unit files for your containers and re-exec systemd
  to load them (that's not a realistic solution but with the process
  tree above, that's the only obvious one!)
- use the systemd API over D-Bus to create units and containers (I think
  this should be possible, but this essentially turns systemd into
  a container engine, which is not very wise, because if it crashes,
  your machine will just kernel panic since it's PID 1)
- use an intermediary process (kubelet, or some systemd subsystem)
  but then you end up with exactly the same process tree as
  above, except that you've hidden the intermediary process!

Let me know if I'm missing something, I'd be more than happy to update
this to better reflect the situation when using a "raw" OCI runtime.