Dockerizing tools from source

Justin Payne, May 30 2015

Introduction

I get a lot of value out of putting bioinformatics tools in Docker containers: once they're containerized with an automated build script (called a "Dockerfile"), it's easy to keep them up to date and to manage their installations on different machines without their individual dependencies stepping all over each other. It's also a really convenient way to try out new tools without the risk of borking your carefully maintained "working" Linux install. Tools like boot2docker make Docker images runnable on non-Linux platforms as well, improving the portability of tools that might have platform-specific (or even distro-specific) dependencies. I've done this a few times now, so I thought I'd share some tips and tricks.

  1. What is Docker?

Docker is a tool for containerizing software installs and process state into portable, connectable, versioned VM-like images. I say 'VM-like' because they're not really VMs; rather, as I understand it, they're a layer of abstraction and isolation over the host machine's Linux kernel. (That's why they require a Linux host to run, and why a process inside a Docker container sees all of your processor cores and can use all of the host's RAM. It's also why processes inside a container run with almost no performance impact compared to running natively on the host.)

Docker also provides a repository of such images under version control (the Docker Hub), much like GitHub or Bitbucket. (You can set up your own local Docker registry as well; in fact, it's as easy as docker pull registry!) Docker enables unprecedented flexibility in bioinformatics because you can finally draw a border around your software tools and their various dependencies: shared libraries, specific versions, hard-coded paths, environment variables - anything in global scope that might conceivably conflict between two tools. Now they can live on the same machine in peace, because as far as they're concerned, they've each been installed into their own fresh Linux build.
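Here's a minimal sketch of standing up a local registry and pushing an image to it, using the official registry image; the port, container name, and image name are just examples:

docker run -d -p 5000:5000 --name local_registry registry
docker tag crashfrog/my_image localhost:5000/crashfrog/my_image
docker push localhost:5000/crashfrog/my_image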

  2. Images vs. containers

This one took me a while to understand. An image is a container saved as a kind of checkpoint; images are versioned, and can be pushed to and pulled from the Docker Hub. When you run an image, you create a new container whose state starts to diverge from its source image: files get written, processes run, and so on. You can commit the state of that container as a new image.

You run images; when you do, you create a running container.
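A quick sketch of that lifecycle; the base image and the checkpoint name here are just examples:

docker run -t -i phusion/baseimage /bin/bash    # start a new container from an image
# ...install things, write files, exit...
docker ps -a                                    # list containers (including stopped ones) to find its ID
docker commit <container_id> crashfrog/my_checkpoint    # save that container's state as a new image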

  3. Clone the project

Docker Hub allows you to tie automated Docker builds to GitHub commits, so ultimately you'll want your Dockerfile build file to become part of the project. Therefore, when you're Dockerizing a GitHub project, it's good practice to do so inside the cloned repository, as sketched below. (Also, since all your Dockerfiles will be named Dockerfile, it's a good way to keep them straight.)
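In practice that just looks like this, with me/my_project standing in for the real repository:

git clone https://github.com/me/my_project.git
cd my_project
# the Dockerfile lives at the top level of the repository, next to the code it builds
touch Dockerfile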

  4. Build on phusion/baseimage

Phusion releases and maintains a super-minimal Ubuntu base image that includes little more than apt-get and the normal shell tools. It's a good basis on which to build because they've solved at least some of the issues with Dockerizing on top of standard distros (for instance, a zombie-process bug that can affect long-running Docker containers). It's relatively minimal (so it won't add a lot of size to your container) and a tried-and-true base for many, many Docker builds.
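So a typical Dockerfile starts something like this; the pinned version is just an example, and the CMD line follows phusion's own README so their init system runs by default:

FROM phusion/baseimage:0.9.12
CMD ["/sbin/my_init"]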

  5. Run apt-get with -y

You don't want user prompts to interrupt an automated build as your Dockerfile installs various dependencies, so remember to run apt-get with -y, which answers in the affirmative to any prompts. If you're building from a GitHub project, remember that you've just made git a dependency, so put it in your apt-get section. You'll see what one of those looks like below.

  6. Understand how layers work

Each command (RUN, ADD, etc.) in your Dockerfile commits a "layer", a kind of cached state of the image; in fact, your image is basically the sum of all its layers. This has implications for the ultimate size of your image - if in one command you download 100MB of data, and then in another command you delete it, your image doesn't get any smaller, because it still contains a layer holding that 100MB. However, a layer is only committed at the end of its command, so you can use shell command chaining (e.g. wget http://a.big/file.tar.gz && rm file.tar.gz) to download and use temporary or ephemeral files without permanently growing the size of your image.
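To make that concrete, here are two ways of handling the same tarball, reusing the throwaway URL from above:

# Three commands, three layers: the tarball is baked into the image even though it's "deleted".
RUN wget http://a.big/file.tar.gz
RUN tar xzf file.tar.gz
RUN rm file.tar.gz

# One chained command, one layer: the tarball is gone before the layer is committed.
RUN wget http://a.big/file.tar.gz \
	&& tar xzf file.tar.gz \
	&& rm file.tar.gz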

  7. Pull the codebase last

Docker caches each layer as you build the image, and re-uses those layers when you rebuild. If a layer can't be reused (for instance, because it would have downloaded a different file the second time), then that layer and every subsequent layer get rebuilt. If your ultimate intent is to set up automated Docker builds of a project, the project codebase is the artifact most likely to change, which means that if git clone http://github.com/me/my_project is towards the bottom of the file, you'll be able to reuse as many of your previous layers as possible. Ideally, anything about your image that is likely to change should be towards the end of the file, as in the sketch below.
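A rough skeleton of that ordering, reusing the placeholder project from above (build-essential and the make steps are just stand-ins for whatever your project needs):

FROM phusion/baseimage:0.9.12
# dependencies change rarely, so this layer gets reused on every rebuild
RUN apt-get update -y && apt-get install -y git build-essential && apt-get clean
# the codebase changes often, so only the layers from here down get rebuilt
RUN git clone http://github.com/me/my_project \
	&& cd my_project \
	&& make && make install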

  8. Beware SourceForge

With any luck, I'm famous on Twitter (@crashfrog) for disparaging SourceForge as a hosting solution for bioinformatics projects - primarily because it's so hard to get a static link to source. Downloading anything means negotiating SourceForge's JavaScript redirects, which neither curl nor wget is able to do.

  9. Set an entrypoint

(And know how to ignore it.) The best Docker images set the executable and any non-optional arguments as the ENTRYPOINT, so that running the image at the command line is like running the tool. But you can always override the entrypoint by specifying it as an option to docker run: docker run -t -i --rm --entrypoint /bin/bash crashfrog/my_image will let you 'step into' an image (and clean up the container when you exit), just as if you had SSH'd into it. That can be a useful tool for debugging a build gone wrong.
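One detail worth knowing: if you write the ENTRYPOINT in exec form (a JSON array), anything you pass after the image name on the docker run command line gets appended as arguments to the tool. A sketch, with a hypothetical image name and assuming the tool takes a --version flag:

# In the Dockerfile:
ENTRYPOINT ["/usr/local/bin/andi"]

# At the command line, the trailing arguments go straight to the entrypoint:
docker run --rm crashfrog/my_image --version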

  10. Build your image

The best time to name your image is as you build it, and if you want to push it to the cloud you'll want to tag it (-t) with your Docker Hub username: docker build -t crashfrog/my_image . (note the dot at the end; it tells Docker to look for the Dockerfile in the current directory). If you forgot to tag it, no problem: you can find the image with docker images and re-tag it.
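The fix looks something like this; the image ID is whatever docker images reports for your untagged build:

docker images                             # find the IMAGE ID of the untagged build
docker tag <image_id> crashfrog/my_image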

  11. Host or don't host?

If you commit and push your Docker image to the Docker Hub, it'll be available for anyone to download by its qualified name: docker pull crashfrog/my_image. This presents something of a point of etiquette that the community has not, to my knowledge, resolved: some might see it as appropriative, or even a form of plagiarism, to release someone else's project "branded" with your own identity. In general I find Dockerization too useful not to do it to projects and tools that I really like, but if I'm not in any way associated with a project, I don't "advertise" my images. (They're available on a search of the Docker Hub, however.) That said, I do like to submit a GitHub pull request to add my Dockerfile (once I know it works), and when I do, I try to pass along instructions for how to set up automated builds. That way the Docker image is always up to date, and it's no longer tied to my own GitHub identity - reducing the possibility that I'll be mistaken for a significant contributor to the tool itself.
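Pushing is a two-liner once you have a Docker Hub account (the image name is the example tag from above):

docker login
docker push crashfrog/my_image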

  12. Sample Dockerfile: andi

Here's a more complicated Dockerfile that builds one dependency from source, then configures and builds the project (Fabian Klotzl's andi: https://github.com/EvolBioInf/andi). I've put the comments on their own lines, where Docker ignores them, so the file builds as written:

# Build on specific versions of base images.
FROM phusion/baseimage:0.9.12
MAINTAINER Justin Payne, justin.payne@fda.hhs.gov

# WORKDIR is a combination of mkdir and cd.
WORKDIR /tmp/

# Break your apt-get packages out onto their own lines, for readability.
# (It's considered best practice to alphabetize them, but if there's just a few, meh.)
# Finish with apt-get clean for a smaller layer.
RUN apt-get update -y \
	&& apt-get install -y \
	make \
	cmake \
	g++ \
	git \
	&& apt-get clean

# Build the libdivsufsort dependency from source, then remove the
# source directory to keep this layer small.
RUN git clone https://github.com/y-256/libdivsufsort.git \
	&& cd libdivsufsort \
	&& mkdir build \
	&& cd build \
	&& cmake -DCMAKE_BUILD_TYPE="RELEASE" \
			 -DCMAKE_INSTALL_PREFIX="/usr/local" .. \
	&& make \
	&& make install \
	&& cd /tmp/ \
	&& rm -r libdivsufsort

# Pull the codebase last, so the earlier layers can be reused between builds.
RUN git clone https://github.com/crashfrog/andi.git \
	&& cd andi \
	&& ./configure \
	&& make \
	&& make install \
	&& make installcheck \
	&& cd /tmp/ \
	&& rm -r andi

# Set the entrypoint (in exec form, so arguments to docker run get passed to the tool),
# so that running the image is just like running the tool.
ENTRYPOINT ["/usr/local/bin/andi"]
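Once that's saved as Dockerfile in the repository, building and running looks something like this; the tag is just an example, and I'm assuming andi accepts a --help flag like most GNU-style tools:

docker build -t crashfrog/andi .
docker run --rm crashfrog/andi --help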
Comments? Tips? Tricks? I'd love to hear your thoughts, leave a comment!
