Notes from BIDS Docker Workshop day 1, 7 Jan 2016

7 Jan 2016 (Day 1)

Prep

Installing Docker: https://docs.docker.com/engine/installation/mac/

On a typical Linux installation, the Docker client, the Docker daemon, and any containers run directly on your localhost.

...

In an OS X installation, the docker daemon is running inside a Linux VM called default. The default is a lightweight Linux VM made specifically to run the Docker daemon on Mac OS X. The VM runs completely from RAM, is a small ~24MB download, and boots in approximately 5s.

Docker Toolbox: 172 MB download

  • Includes VM

Options:

  • Docker Quickstart Terminal
  • Kitematic (Beta) Visual Management for Docker

(Let's go with the terminal, figuring that's more transferable to Linux.)

Result:

                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \         __/
              \____\_______/


docker is configured to use the default machine with IP 192.168.99.100
For help getting started, check out the docs at https://docs.docker.com

Intro

"Solving problems with software often comes down to solving installation problems with software."

Docker: "Past the hipster phase"

"Build a black box -- translucent box -- of software so that anybody can run it that has Docker installed"

"[Docker is about] cutting through the crud surrounding scientific software. I've been doing scientific software since the early 1990s and it is an incredibly infuriating field."

"At least compared to the scientific community, the open source community has its head screwed on straight."

"All my notes are CC0 ... please do not contact me for permission to do anything with them -- you already have it. ... I get enough email as it is."

"I did a lot of software development as a way not to do my graduate work"

Running docker

Note: docker commands only work with the right environment. The terminal that starts up after installation works; otherwise run eval $(docker-machine env default) to set the environment variables (where default is the machine name, from ~/.docker/machine/machines/).
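For example (illustrative output; the machine name and IP will vary), docker-machine ls shows which machines exist and which one is active:

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL
default   *        virtualbox   Running   tcp://192.168.99.100:2376
$ eval $(docker-machine env default)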

$ docker run ubuntu:14.04 /bin/echo 'Hello world'

"Docker is creating a new container using the Ubuntu image. ... This is a very expensive way to run 'Hello, world', but [compared to a traditional virtual machine] Docker is very lightweight."

$ docker run -it ubuntu:14.04 /bin/bash
root@74405e94cc53:/# _

"This is now its own Linux environment... you can do all the standard things you could do on an [Ubuntu] machine."

"You only need the first three letters of the hash, unless you've got a lot of containers."

-it ("interactive")

I've found the default Docker documentation ... basically you end up with command-line soup ... it's taken me weeks to figure out what all the command-line options are.

If you don't specify the -it, then what happens is, it runs /bin/bash, but it doesn't give you a connection to it ... /bin/bash just exits immediately.

There's basically two ways people run things with docker... in the background... for daemons, webservers, Project Jupyter notebooks ... and then there's the interactive mode, which I use mostly for debugging.

-d ("detached")

It runs, it executes the docker container ... you can see that it's running with docker ps... but it's no longer controllable from the command line. ... In order to stop it you'd have to do something like docker stop.

You can docker attach to this, but note that doesn't get you an interactive shell, it just gets you something you can kill with Ctrl-C (etc.).
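A minimal sketch of the detached workflow (sleep is just a stand-in for a long-running process; substitute the container ID that docker run prints):

$ docker run -d ubuntu:14.04 /bin/sleep 300
$ docker ps                  # the container shows as running
$ docker stop <container-id> # the first few characters of the ID suffice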

--rm ("remove")

Removes the Docker container after you exit
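For example, this container leaves no trace in docker ps -a after it exits:

$ docker run --rm ubuntu:14.04 /bin/echo 'goodbye'
goodbye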

Images and containers

Image

Essentially a filesystem... a starting configuration for your Docker container.

Images can specify a default executable (e.g., run Jupyter notebooks, run something custom).

Container

A Docker container is a running instance of an image.

Every time you do a Docker run, you're basically starting from an unmodified copy of the image that you specified ... it's an isolated environment.

A lot of what I've been thinking about is ways to work with persistent data... we'll show you a couple of different strategies.

root@74405e94cc53:/# touch /tmp/whatever.txt
root@74405e94cc53:/# exit
exit
$ docker cp 74405e94cc53:/tmp/whatever.txt .

Copies a file from a Docker container to your local filesystem; works both ways (cf. scp).
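For example, copying in the other direction (same container as above):

$ docker cp whatever.txt 74405e94cc53:/tmp/whatever.txt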

$ docker start 74405e94cc53
$ docker attach 74405e94cc53

Restarts an existing (stopped) container and attaches your terminal to it.

You'll notice we're using very inconvenient strings ... if you do a docker ps there are randomly assigned more friendly [names].

docker ps shows the running containers.

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
74405e94cc53        ubuntu:14.04        "/bin/bash"         12 minutes ago      Up 5 seconds                            big_albattani

$ docker attach big_albattani
root@74405e94cc53:/#

(Note: This may appear to stall. Just hit enter.)

docker ps -a shows all containers, running and stopped:

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
74405e94cc53        ubuntu:14.04        "/bin/bash"         25 minutes ago      Exited (0) 4 minutes ago                       big_albattani
a02e55c8dc6d        hello-world         "/hello"            38 minutes ago      Exited (0) 2 seconds ago                       admiring_curie

docker images shows all the images (not the containers)

$ docker images
REPOSITORY           TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
jupyter/notebook     latest              383cbc5a8497        43 hours ago        1.024 GB
ubuntu               14.04               c4bea91afef3        2 days ago          187.9 MB
rocker/hadleyverse   latest              fefe3e4e1173        7 days ago          3.001 GB
hello-world          latest              0a6ba66e537a        12 weeks ago        960 B

Images vs. containers

This is how I think about them:

  • The image is basically the filesystem plus some configuration. ... think of this as your initial config.
  • This gets turned into a container: a running container is a fresh copy of that image, executed inside a Linux environment.

The typical thing is that these images are just basically throwaway. ... The real idea underneath is that you're delivering neatly packaged images that have all the stuff you need to run them.

The most important thing is that data is transient. You can basically consider data transient unless you make special provisions for it ... that is one of the things that's kind of a poor fit [for scientific computing].

Cleaning up

Containers:

$ docker rm $(docker ps -a -q)

Images:

$ docker rmi $(docker images | grep "^<none>" | awk '{print $3}')
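An equivalent approach (not shown in the workshop) uses Docker's built-in filter for untagged, dangling images:

$ docker rmi $(docker images -q -f dangling=true)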

Ports

You have to explicitly export each port using the -p command.

Otherwise you could also use SSH port-forwarding or something, but why?
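For example, a sketch using the jupyter/notebook image from the list above (assuming the notebook listens on its usual port 8888; this may vary by image version):

$ docker run -d -p 8888:8888 jupyter/notebook

This maps port 8888 on the Docker host to port 8888 in the container; on the Mac, you'd then browse to http://192.168.99.100:8888 (the VM IP from the Quickstart Terminal banner above).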


Amazon

You can use Docker locally to control a Docker container running in Amazon EC2, via an HTTPS connection and a standard API. Easier than SSH, esp. w/unreliable networks or in a high-latency environment.

Remember to make sure your scripts remove the container at the end of whatever they're doing, so you don't keep incurring Amazon charges.

Travis now supports Docker out of the box. [See e.g. an example with CircleCI rather than Travis.]

Use cases:

  • Reproducible images w/o lock-in to Amazon
  • Portable to other cloud platforms, University resources etc.
  • High-performance computing environments (e.g. NSFCloud Chameleon) starting to experiment w/allowing Docker (security concerns) ... probably a couple of years from mainstream
  • RedHat and some others offer restricted Docker execution environments for more security

We're not likely to end up with something worse than Docker. ... It's just so obvious ... the basic idea of containerization and scriptable infrastructure.

Docker Hub

"Docker Hub is a hosted registry service with public, private, and Official image repositories."

If you're really concerned about reproducibility, you want to point yourself to one of these binary-stable URLs. ... but these images are not binary-stable ... whenever Debian issues a security update ... it updates automatically.
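One way to get a truly stable reference (a sketch, not from the workshop; the digest below is a placeholder) is to pin an image by its content digest instead of a tag:

$ docker pull ubuntu@sha256:<digest>

or, in a Dockerfile:

FROM ubuntu@sha256:<digest>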


Docker components

Docker client, a.k.a. docker

Controls the Docker host via an (HTTPS-based) API.

Docker host

The Linux machine (or VM) that the containers run on (containers are OS-level, using OS-level virtualization).

  • on Linux, could be the same machine the client is running on
  • could be another (hardware) Linux host
  • could be a cloud VM (e.g. Amazon)
  • on the Mac, the Docker host is a Linux VM running in VirtualBox

The Docker host runs the Docker daemon, which handles commands from the client.

This is where all the compute happens.

docker-machine

Automates provisioning and configuring Docker hosts.

If you have Amazon EC2 credentials, you can use docker-machine to automagically create AWS instances that act as Docker hosts.
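A sketch of that (the machine name aws-sandbox is arbitrary; the amazonec2 driver expects AWS credentials in the environment or via flags, which vary by docker-machine version):

$ docker-machine create --driver amazonec2 aws-sandbox
$ eval $(docker-machine env aws-sandbox)
$ docker run --rm ubuntu:14.04 /bin/echo 'running on EC2'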


Docker data volumes

Shared between Docker containers, but not directly accessible from the host

$ docker create -v /mydata --name my_data_vol ubuntu:14.04 /bin/true

f4d0de1a7ded1c1559c00d0f1621956ec7144fbc94d1bd3a787404a06bf57644

Creates a new data-volume container named my_data_vol (the long hash printed is the new container's ID)

$ docker run --volumes-from my_data_vol -t ubuntu:14.04 /bin/bash
root@9e1f75706444:/# 

Creates a new Docker container with the volumes from my_data_vol mounted (here, /mydata, as defined on volume creation).
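For example (a minimal sketch), one container writes to the volume and another reads it back:

$ docker run --rm --volumes-from my_data_vol ubuntu:14.04 /bin/bash -c 'echo hello > /mydata/greeting.txt'
$ docker run --rm --volumes-from my_data_vol ubuntu:14.04 cat /mydata/greeting.txt
hello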

Useful for, e.g., sharing very large data; see e.g. Transcriptomic analysis with Docker containers and data volumes w/o the overhead of copying back and forth from local to the Docker host.

My view is you want to do this thing called "scaling out" -- you want to send the compute to the data.


Building Docker images

  1. Create a directory firsttry, and in that directory create a Dockerfile with the contents:

    FROM ubuntu:14.04
    RUN echo 'echo hello, world' > /home/hello.sh && chmod +x /home/hello.sh
    CMD /home/hello.sh
    
  2. cd to that directory and run docker build:

    $ cd firsttry
    $ docker build -t myhello .
    
  3. run it:

    $ docker run myhello
    hello, world
    
  4. or, alternatively, connect to it interactively ("this is how you end up debugging your Dockerfiles"):

    $ docker run -it myhello /bin/bash
    root@d4c62c3e49ce:/# ls -l home/hello.sh 
    -rwxr-xr-x 1 root root 18 Jan  7 22:06 home/hello.sh
    root@d4c62c3e49ce:/# 
    

The reason the Dockerfile goes in its own directory is that you can put other files in that directory. E.g., this Dockerfile:

FROM ubuntu:14.04
COPY hello.sh /home/hello.sh
CMD /home/hello.sh

Assumes the following directory structure:

.
├── Dockerfile
└── hello.sh

Dockerfiles run on the Docker host, not on the Docker client. The reason that's important is that docker build takes this entire directory and copies it over to the virtual host. That becomes important if you're building stuff on Amazon and you have a 50 MB file here: every time you run docker build, you're copying that 50 MB file over to Amazon from your local host.

Why this is a good idea:

Suppose you're running your Docker client on a tablet ... many of the commands in a Dockerfile may be compilation commands, or ... something heavyweight that requires compilation time and memory.

A real example Dockerfile:

# image: diblab/khmer
FROM ubuntu:15.10
MAINTAINER titus@idyll.org

ENV PACKAGES python-dev zlib1g git python-setuptools g++ make ca-certificates
ENV KHMER_VERSION v2.0

# khmer scripts will be installed in /usr/local/bin
# khmer sandbox will be in /home/khmer/sandbox/

### don't modify things below here for version updates etc.

WORKDIR /home

RUN apt-get update && \
    apt-get install -y --no-install-recommends ${PACKAGES} && \
    apt-get clean

RUN cd /home && \
    git config --global http.sslVerify false && \
    git clone https://github.com/dib-lab/khmer.git -b ${KHMER_VERSION} && \
    cd khmer && \
    python setup.py install && \
    rm -fr build

"I'm told this represents reasonably good practice."

The important bit is it's all running as one command. Each RUN command represents one layer in the filesystem.

A "layer" can be imagined as sort of a version, or a checkpoint... Docker uses UnionFS, which, according to Wikipedia, "allows files and directories of separate file systems, known as branches, to be transparently overlaid, forming a single coherent file system."

The greatest thing about Dockerfiles is that they're transparent and they're explicit.

UnionFS and docker build: caveats

UnionFS allows Docker to, among other things, share layers between different containers, when their setup processes share steps.

If you docker build again on the same Dockerfile, Docker will detect that your Dockerfile hasn't changed and helpfully not do anything. This is fine unless your Dockerfile fetches an external resource that can change, in which case you won't get the changes.

E.g.:

FROM ubuntu:14.04
RUN apt-get install -y curl
RUN curl https://raw.githubusercontent.com/ngs-docs/2016-bids-docker/master/scripts/hello.sh > /home/hello.sh && chmod +x /home/hello.sh
CMD /home/hello.sh

Better yet, even if other lines in the file have changed, Docker will use the cached layers for everything above the first change. (Once a line changes, Docker re-runs everything below it.)

(Note you will at least see ---> Using cache in the output.)

You can do

docker build --no-cache

but that will re-run time-consuming steps that may not have changed. It's often easier just to hack in extra spaces or extra lines or something.
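A slightly more controllable version of that hack (an illustrative sketch, not from the workshop; the variable name CACHE_DATE is arbitrary): insert a throwaway ENV line just above the step you want to force, and bump its value whenever you want a fresh fetch. Everything above the ENV line stays cached; everything below re-runs:

FROM ubuntu:14.04
RUN apt-get install -y curl
# bump this value to invalidate the cache from here down
ENV CACHE_DATE 2016-01-07
RUN curl https://raw.githubusercontent.com/ngs-docs/2016-bids-docker/master/scripts/hello.sh > /home/hello.sh && chmod +x /home/hello.sh
CMD /home/hello.sh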
