Notes from BIDS Docker Workshop day 1, 7 Jan 2016

7 Jan 2016 (Day 1)

Prep

Installing Docker: https://docs.docker.com/engine/installation/mac/

On a typical Linux installation, the Docker client, the Docker daemon, and any containers run directly on your localhost.

...

In an OS X installation, the docker daemon is running inside a Linux VM called default. The default is a lightweight Linux VM made specifically to run the Docker daemon on Mac OS X. The VM runs completely from RAM, is a small ~24MB download, and boots in approximately 5s.

Docker Toolbox: 172 MB download

  • Includes VM

Options:

  • Docker Quickstart Terminal
  • Kitematic (Beta) Visual Management for Docker

(Let's go with the terminal, figuring that's more transferable to Linux.)

Result:

                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \         __/
              \____\_______/


docker is configured to use the default machine with IP 192.168.99.100
For help getting started, check out the docs at https://docs.docker.com

Intro

"Solving problems with software often comes down to solving installation problems with software."

Docker: "Past the hipster phase"

"Build a black box -- translucent box -- of software so that anybody can run it that has Docker installed"

"[Docker is about] cutting through the crud surrounding scientific software. I've been doing scientific software since the early 1990s and it is an incredibly infuriating field."

"At least compared to the scientific community, the open source community has its head screwed on straight."

"All my notes are CC0 ... please do not contact me for permission to do anything with them -- you already have it. ... I get enough email as it is."

"I did a lot of software development as a way not to do my graduate work"

Running docker

Note: docker commands only work with the right environment. The terminal that starts up after installation works; otherwise run eval $(docker-machine env default) to set the environment variables (where default is the machine name, from ~/.docker/machine/machines/).
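For example (illustrative output; the machine name and IP will vary), docker-machine ls shows which machines exist and which one is active:

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL
default   *        virtualbox   Running   tcp://192.168.99.100:2376
$ eval $(docker-machine env default)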

$ docker run ubuntu:14.04 /bin/echo 'Hello world'

"Docker is creating a new container using the Ubuntu image. ... This is a very expensive way to run 'Hello, world', but [compared to a traditional virtual machine] Docker is very lightweight."

$ docker run -it ubuntu:14.04 /bin/bash
root@74405e94cc53:/# _

"This is now its own Linux environment... you can do all the standard things you could do on an [Ubuntu] machine."

"You only need the first three letters of the hash, unless you've got a lot of containers."

-it ("interactive")

I've found the default Docker documentation ... basically you end up with command-line soup ... it's taken me weeks to figure out what all the command-line options are.

If you don't specify the -it, then what happens is, it runs /bin/bash, but it doesn't give you a connection to it ... /bin/bash just exits immediately.

There's basically two ways people run things with docker... in the background... for daemons, webservers, Project Jupyter notebooks ... and then there's the interactive mode, which I use mostly for debugging.

-d ("detached")

It runs, it executes the docker container ... you can see that it's running with docker ps... but it's no longer controllable from the command line. ... In order to stop it you'd have to do something like docker stop.

You can docker attach to this, but note that doesn't get you an interactive shell, it just gets you something you can kill with Ctrl-C (etc.).
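A minimal sketch of the detached workflow (sleep is just a stand-in for a long-running process; substitute the container ID that docker run prints):

$ docker run -d ubuntu:14.04 /bin/sleep 300
$ docker ps                  # the container shows as running
$ docker stop <container-id> # the first few characters of the ID suffice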

--rm ("remove")

Removes the Docker container after you exit
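For example, this container leaves no trace in docker ps -a after it exits:

$ docker run --rm ubuntu:14.04 /bin/echo 'goodbye'
goodbye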

Images and containers

Image

Essentially a filesystem... a starting configuration for your Docker container.

Images can specify a default executable (e.g., run Jupyter notebooks, run something custom).

Container

A Docker container is a running instance of an image.

Every time you do a Docker run, you're basically starting from an unmodified copy of the image that you specified ... it's an isolated environment.

A lot of what I've been thinking about is ways to work with persistent data... we'll show you a couple of different strategies.

root@74405e94cc53:/# touch /tmp/whatever.txt
root@74405e94cc53:/# exit
exit
$ docker cp 74405e94cc53:/tmp/whatever.txt .

Copies a file from a Docker container to your local filesystem; works both ways (cf. scp).
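For example, copying in the other direction (same container as above):

$ docker cp whatever.txt 74405e94cc53:/tmp/whatever.txt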

$ docker start 74405e94cc53
$ docker attach 74405e94cc53

Restarts an existing (stopped) container and attaches your terminal to it.

You'll notice we're using very inconvenient strings ... if you do a docker ps there are randomly assigned more friendly [names].

docker ps shows the running containers.

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
74405e94cc53        ubuntu:14.04        "/bin/bash"         12 minutes ago      Up 5 seconds                            big_albattani

$ docker attach big_albattani
root@74405e94cc53:/#

(Note: This may appear to stall. Just hit enter.)

docker ps -a shows all containers, running and stopped:

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
74405e94cc53        ubuntu:14.04        "/bin/bash"         25 minutes ago      Exited (0) 4 minutes ago                       big_albattani
a02e55c8dc6d        hello-world         "/hello"            38 minutes ago      Exited (0) 2 seconds ago                       admiring_curie

docker images shows all the images (not the containers)

$ docker images
REPOSITORY           TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
jupyter/notebook     latest              383cbc5a8497        43 hours ago        1.024 GB
ubuntu               14.04               c4bea91afef3        2 days ago          187.9 MB
rocker/hadleyverse   latest              fefe3e4e1173        7 days ago          3.001 GB
hello-world          latest              0a6ba66e537a        12 weeks ago        960 B

Images vs. containers

This is how I think about them:

  • The image is basically the filesystem plus some configuration. ... think of this as your initial config.
  • This gets turned into a container: a running container is a fresh copy of that image, executed inside a Linux environment.

The typical thing is that these images are just basically throwaway. ... The real idea underneath is that you're delivering neatly packaged images that have all the stuff you need to run them.

The most important thing is that data is transient. You can basically consider data transient unless you make special provisions for it ... that is one of the things that's kind of a poor fit [for scientific computing].

Cleaning up

Containers:

$ docker rm $(docker ps -a -q)

Images:

$ docker rmi $(docker images | grep "^<none>" | awk '{print $3}')
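An equivalent approach (not shown in the workshop) uses Docker's built-in filter for untagged, dangling images:

$ docker rmi $(docker images -q -f dangling=true)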

Ports

You have to explicitly export each port using the -p command.

Otherwise you could also use SSH port-forwarding or something, but why?
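For example, a sketch using the jupyter/notebook image from the list above (assuming the notebook listens on its usual port 8888; this may vary by image version):

$ docker run -d -p 8888:8888 jupyter/notebook

This maps port 8888 on the Docker host to port 8888 in the container; on the Mac, you'd then browse to http://192.168.99.100:8888 (the VM IP from the Quickstart Terminal banner above).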


Amazon

You can use Docker locally to control a Docker container running in Amazon EC2, via an HTTPS connection and a standard API. Easier than SSH, esp. w/unreliable networks or in a high-latency environment.

Remember to make sure your scripts remove the container at the end of whatever they're doing, so you don't keep incurring Amazon charges.

Travis now supports Docker out of the box. [See e.g. an example with CircleCI rather than Travis.]

Use cases:

  • Reproducible images w/o lock-in to Amazon
  • Portable to other cloud platforms, University resources etc.
  • High-performance computing environments (e.g. NSFCloud Chameleon) starting to experiment w/allowing Docker (security concerns) ... probably a couple of years from mainstream
  • RedHat and some others offer restricted Docker execution environments for more security

We're not likely to end up with something worse than Docker. ... It's just so obvious ... the basic idea of containerization and scriptable infrastructure.

Docker Hub

"Docker Hub is a hosted registry service with public, private, and Official image repositories."

If you're really concerned about reproducibility, you want to point yourself to one of these binary-stable URLs. ... but these images are not binary-stable ... whenever Debian issues a security update ... it updates automatically.
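One way to get a truly stable reference (a sketch, not from the workshop; the digest below is a placeholder) is to pin an image by its content digest instead of a tag:

$ docker pull ubuntu@sha256:<digest>

or, in a Dockerfile:

FROM ubuntu@sha256:<digest>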


Docker components

Docker client, a.k.a. docker

Controls the Docker host via an (HTTPS-based) API.

Docker host

The Linux machine (or VM) that the containers run on (containers are OS-level, using OS-level virtualization).

  • on Linux, could be the same machine the client is running on
  • could be another (hardware) Linux host
  • could be a cloud VM (e.g. Amazon)
  • on the Mac, the Docker host is a Linux VM running in VirtualBox

The Docker host runs the Docker daemon, which handles commands from the client.

This is where all the compute happens.

docker-machine

Automates provisioning and configuring Docker hosts.

If you have Amazon EC2 credentials, you can use docker-machine to automagically create AWS instances that act as Docker hosts.
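A sketch of that (the machine name aws-sandbox is arbitrary; the amazonec2 driver expects AWS credentials in the environment or via flags, which vary by docker-machine version):

$ docker-machine create --driver amazonec2 aws-sandbox
$ eval $(docker-machine env aws-sandbox)
$ docker run --rm ubuntu:14.04 /bin/echo 'running on EC2'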


Docker data volumes

Shared between Docker containers, but not directly accessible from the host

$ docker create -v /mydata --name my_data_vol ubuntu:14.04 /bin/true

f4d0de1a7ded1c1559c00d0f1621956ec7144fbc94d1bd3a787404a06bf57644

Creates a new data-volume container named my_data_vol (the long hash printed is the new container's ID)

$ docker run --volumes-from my_data_vol -t ubuntu:14.04 /bin/bash
root@9e1f75706444:/# 

Creates a new Docker container with the volumes from my_data_vol mounted (here, /mydata, as defined on volume creation).
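For example (a minimal sketch), one container writes to the volume and another reads it back:

$ docker run --rm --volumes-from my_data_vol ubuntu:14.04 /bin/bash -c 'echo hello > /mydata/greeting.txt'
$ docker run --rm --volumes-from my_data_vol ubuntu:14.04 cat /mydata/greeting.txt
hello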

Useful for, e.g., sharing very large data; see e.g. Transcriptomic analysis with Docker containers and data volumes w/o the overhead of copying back and forth from local to the Docker host.

My view is you want to do this thing called "scaling out" -- you want to send the compute to the data.


Building Docker images

  1. Create a directory firsttry, and in that directory create a Dockerfile with the contents:

    FROM ubuntu:14.04
    RUN echo 'echo hello, world' > /home/hello.sh && chmod +x /home/hello.sh
    CMD /home/hello.sh
    
  2. cd to that directory and run docker build:

    $ cd firsttry
    $ docker build -t myhello .
    
  3. run it:

    $ docker run myhello
    hello, world
    
  4. or, alternatively, connect to it interactively ("this is how you end up debugging your Dockerfiles"):

    $ docker run -it myhello /bin/bash
    root@d4c62c3e49ce:/# ls -l home/hello.sh 
    -rwxr-xr-x 1 root root 18 Jan  7 22:06 home/hello.sh
    root@d4c62c3e49ce:/# 
    

The reason the Dockerfile goes in its own directory is that you can put other files in that directory. E.g., this Dockerfile:

FROM ubuntu:14.04
COPY hello.sh /home/hello.sh
CMD /home/hello.sh

Assumes the following directory structure:

.
├── Dockerfile
└── hello.sh

Dockerfiles run on the Docker host, not on the Docker client. The reason that's important is that docker build takes this entire directory and copies it over to the virtual host. That becomes important if you're building stuff on Amazon and you have a 50 MB file here: every time you run docker build, you're copying that 50 MB file over to Amazon from your local host.

Why this is a good idea:

Suppose you're running your Docker client on a tablet ... many of the commands in a Dockerfile may be compilation commands, or ... something heavyweight that requires compilation time and memory.

A real example Dockerfile:

# image: diblab/khmer
FROM ubuntu:15.10
MAINTAINER titus@idyll.org

ENV PACKAGES python-dev zlib1g git python-setuptools g++ make ca-certificates
ENV KHMER_VERSION v2.0

# khmer scripts will be installed in /usr/local/bin
# khmer sandbox will be in /home/khmer/sandbox/

### don't modify things below here for version updates etc.

WORKDIR /home

RUN apt-get update && \
    apt-get install -y --no-install-recommends ${PACKAGES} && \
    apt-get clean

RUN cd /home && \
    git config --global http.sslVerify false && \
    git clone https://github.com/dib-lab/khmer.git -b ${KHMER_VERSION} && \
    cd khmer && \
    python setup.py install && \
    rm -fr build

"I'm told this represents reasonably good practice."

The important bit is it's all running as one command. Each RUN command represents one layer in the filesystem.

A "layer" can be imagined as sort of a version, or a checkpoint... Docker uses UnionFS, which, according to Wikipedia, "allows files and directories of separate file systems, known as branches, to be transparently overlaid, forming a single coherent file system."

The greatest thing about Dockerfiles is that they're transparent and they're explicit.

UnionFS and docker build: caveats

UnionFS allows Docker to, among other things, share layers between different containers, when their setup processes share steps.

If you docker build again on the same Dockerfile, Docker will detect that your Dockerfile hasn't changed and helpfully not do anything. This is fine unless your Dockerfile fetches an external resource that can change, in which case you won't get the changes.

E.g.:

FROM ubuntu:14.04
RUN apt-get install -y curl
RUN curl https://raw.githubusercontent.com/ngs-docs/2016-bids-docker/master/scripts/hello.sh > /home/hello.sh && chmod +x /home/hello.sh
CMD /home/hello.sh

Better yet, even if other lines in the file have changed, Docker will use the cached layers for everything above the first change. (Once a line changes, Docker re-runs everything below it.)

(Note you will at least see ---> Using cache in the output.)

You can do

docker build --no-cache

but that will re-run time-consuming steps that may not have changed. It's often easier just to hack in extra spaces or extra lines or something.
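A slightly more controllable version of that hack (an illustrative sketch, not from the workshop; the variable name CACHE_DATE is arbitrary): insert a throwaway ENV line just above the step you want to force, and bump its value whenever you want a fresh fetch. Everything above the ENV line stays cached; everything below re-runs:

FROM ubuntu:14.04
RUN apt-get install -y curl
# bump this value to invalidate the cache from here down
ENV CACHE_DATE 2016-01-07
RUN curl https://raw.githubusercontent.com/ngs-docs/2016-bids-docker/master/scripts/hello.sh > /home/hello.sh && chmod +x /home/hello.sh
CMD /home/hello.sh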
