chrisamiller/lsf_and_docker_tutorial.md

## lsf_and_docker_tutorial.md

      
Compute cluster basics

Logging in

Open up a terminal (Terminal/iTerm on Mac, PuTTY or WSL on Windows) and SSH into the cluster, replacing USERNAME with your WUSTL key.
ssh USERNAME@compute1-client-3.ris.wustl.edu

This logs you into a compute client "head node". These are shared by everyone on campus and shouldn't be used for any heavy lifting.  Editing files, etc. is fine, but analyses should be run on a blade.
When you log in, you'll be in your home directory. You can confirm this using the pwd command.
Home directories are limited to 10 GB, so they are mostly for config files and small program installations.  "Real" data will be analyzed on your lab's disk - generally tied to the PI's name - something like /storage1/fs1/timley/Active/...
For today's exercises, the home dir will be just fine, though.
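Since the home directory is capped at 10 GB, it can be handy to check how much of it you are using. The following is a minimal sketch built on the standard `du` command; the function name `check_quota` is made up for this example, and the way the cluster actually accounts for quota may differ from what `du` reports.

```shell
# Report how much space a directory uses and succeed only if it is
# under a limit. Defaults: your home directory and 10 GB.
# (check_quota is a hypothetical helper, not a cluster-provided tool.)
check_quota() {
    local dir="${1:-$HOME}"
    local limit_kb="${2:-$((10 * 1024 * 1024))}"   # 10 GB expressed in KB
    local used_kb
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    echo "$dir: ${used_kb} KB used of ${limit_kb} KB"
    [ "$used_kb" -lt "$limit_kb" ]
}
```

Running `check_quota` with no arguments checks your home directory; `check_quota ~/workshop 1024` would check whether a folder is under 1 MB.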
To simplify using the cluster, there are a couple of things you can add to your ~/.bashrc file, to be automatically loaded when you log in. Copy the following block into your .bashrc file, replacing usernames/groups as necessary:
export LSF_DOCKER_VOLUMES="$LSF_DOCKER_VOLUMES \
/home/USERNAME:/home/USERNAME"

# add additional lines like this once you know where your lab's storage lives 
# /storage1/fs1/timley/Active:/storage1/fs1/timley/Active \

# use the `groups` command to identify the compute group that you 
# have access to, tied to your PI's name/lab
export LSF_COMPUTE_GROUP=compute-timley
export LSF_COMPUTE_QUEUE=general

# the below is totally optional, but can be nice:
# enable color support of ls and also add handy aliases
if [ "$TERM" != "dumb" ]; then
    #eval "`dircolors -b`"
    alias ls='ls --color=auto'
fi

alias ll='ls -lI ".*" --color=auto --hide \. --time-style=long-iso'


Save the file, then log out (exit) and log back in so the new settings are loaded.
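LSF_DOCKER_VOLUMES is a space-separated list of `path:path` pairs, which gets tedious to type once you have several storage locations. As a rough sketch, a small helper can build the list for you; `docker_volumes` is a name invented here, not something provided by the cluster.

```shell
# Turn a list of directories into the space-separated "path:path"
# pairs that LSF_DOCKER_VOLUMES expects.
# (docker_volumes is a hypothetical helper name.)
docker_volumes() {
    local out="" dir
    for dir in "$@"; do
        out="${out:+$out }$dir:$dir"
    done
    echo "$out"
}
```

For example, `export LSF_DOCKER_VOLUMES="$(docker_volumes /home/USERNAME /storage1/fs1/timley/Active)"` sets both mounts at once, and `echo "$LSF_DOCKER_VOLUMES"` after logging back in confirms what is currently set.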
Now, use your unix knowledge to make a new directory called workshop in your home directory, then change directories into it.

  Details/Answer
mkdir workshop

or
mkdir ~/workshop

followed by
cd ~/workshop


Using LSF to submit jobs to the cluster

There are a few things we need to know before submitting jobs.
Interactive Jobs

Let's get a job on an interactive blade
LSF_DOCKER_VOLUMES=/home/USERNAME:/home/USERNAME bsub -Is -M 2G -R 'select[mem>2G] rusage[mem=2G]' -n 1 -q general-interactive -G compute-GROUP -a 'docker(ubuntu:focal)' /bin/bash

After landing that job, let's notice some things:

our home directory is mounted and accessible
tools that weren't previously there, are there now
we used the general-interactive queue
other data volumes are not accessible


bjobs lets you see running jobs
bjobs -l lets you see all the gory details

Type exit to get back to the compute client/head node
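The interactive bsub command above packs a lot into one line. The sketch below rebuilds the same command one flag at a time and prints it instead of running it, so you can see what each piece is for. The function name `interactive_cmd` and its argument order are made up for this example; the flags themselves are the ones used above.

```shell
# Print (rather than run) the interactive bsub command, built flag by flag.
# Usage: interactive_cmd USERNAME GROUP [MEMORY] [IMAGE]
# (interactive_cmd is a hypothetical helper for illustration only.)
interactive_cmd() {
    local user="$1" group="$2" mem="${3:-2G}" image="${4:-ubuntu:focal}"
    local args=(
        -Is                                      # interactive job with a terminal
        -M "$mem"                                # memory limit for the job
        -R "select[mem>$mem] rusage[mem=$mem]"   # find a host with that much memory and reserve it
        -n 1                                     # one slot
        -q general-interactive                   # the interactive queue
        -G "compute-$group"                      # your lab's compute group
        -a "docker($image)"                      # the docker image to run in
        /bin/bash                                # the command to start
    )
    # %q quotes each argument so the printed line can be pasted into a shell
    printf 'LSF_DOCKER_VOLUMES=%q bsub' "/home/$user:/home/$user"
    printf ' %q' "${args[@]}"
    printf '\n'
}
```

Calling `interactive_cmd USERNAME timley 4G` prints a 4 GB variant of the command that you can inspect before pasting it in.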
Non-interactive jobs

When you run a program, you often get two types of output

to your files (often through STDOUT)
to your screen (generally through STDERR)

With LSF, it's the same concept, but stdout/stderr go to files instead of your screen.
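If the two output streams are new to you, this small plain-bash experiment (no LSF involved) shows them being separated and then combined. The function name `two_streams` and the file names are arbitrary choices for the demo.

```shell
# A toy command that writes one line to each stream.
two_streams() {
    echo "result data"            # goes to stdout
    echo "progress message" >&2   # goes to stderr
}

workdir=$(mktemp -d)

# Send each stream to its own file.
two_streams > "$workdir/out.txt" 2> "$workdir/err.txt"

# Or capture both streams in a single log file.
two_streams > "$workdir/both.log" 2>&1
```

Afterwards, `cat "$workdir/out.txt"` shows only the result line, `cat "$workdir/err.txt"` shows only the progress message, and `both.log` holds both lines.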
Running a simple job
bsub -M 2G -R 'select[mem>2G] rusage[mem=2G]' -q general -G compute-timley -oo date.log -a 'docker(ubuntu:xenial)' "date"

Redirecting stdout
bsub -M 2G -R 'select[mem>2G] rusage[mem=2G]' -q general -G compute-timley -oo date.log -a 'docker(ubuntu:xenial)' "bash -c \"date >date.output\""

Using a script to avoid escaping quotes or complicated expressions
echo "date >date.output" > rundate.sh
cat rundate.sh
bsub -M 2G -R 'select[mem>2G] rusage[mem=2G]' -q general -G compute-timley -oo date.log -a 'docker(ubuntu:xenial)' "bash rundate.sh"

Using a job name so that you can track progress
echo "sleep 60" >>rundate.sh
bsub -M 2G -R 'select[mem>2G] rusage[mem=2G]' -q general -G compute-timley -oo date.log -a 'docker(ubuntu:xenial)' -J mydate "bash rundate.sh"
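Once a job has a name, you can ask bjobs about just that job with `-J`. As a sketch, the loop below polls until no unfinished job with that name is left; `wait_for_job` is a name invented for this example, and the 30 second default is an arbitrary choice.

```shell
# Block until no pending or running job with the given name remains.
# Usage: wait_for_job NAME [POLL_SECONDS]
# (wait_for_job is a hypothetical helper, not an LSF command.)
wait_for_job() {
    local name="$1" interval="${2:-30}"
    # bjobs -J filters by job name and -w keeps long names from being
    # truncated. When nothing matches, bjobs prints nothing on stdout,
    # so the grep fails and the loop ends.
    while bjobs -w -J "$name" 2>/dev/null | grep -q -- "$name"; do
        sleep "$interval"
    done
    echo "no unfinished jobs named $name"
}
```

After submitting with `-J mydate`, running `wait_for_job mydate` returns once the job has finished, at which point `cat date.log` should show the LSF report.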

Some useful LSF commands:
bjobs - list jobs
bsub - submit jobs
bkill - kill jobs
If you need to kill a bunch of jobs at once, unix pipes are your friend!  Something like:
bjobs | grep JOBNAME | awk '{print $1}' | xargs -n1 bkill

may be helpful
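Before piping anything into bkill, it is worth previewing which job IDs the pipeline would pick up. A minimal sketch of that dry run, using the same grep and awk steps as the pipeline above (the function name `matching_job_ids` is made up here):

```shell
# Print the job IDs that the kill pipeline would select, without
# killing anything. Reads bjobs output on stdin.
# Note: grep matches anywhere on the line, so a pattern that also
# appears in a user or queue name will select those jobs too.
matching_job_ids() {
    grep -- "$1" | awk '{print $1}'
}
```

Run `bjobs -w | matching_job_ids JOBNAME` to see the list, and only add `| xargs -n1 bkill` once it looks right. LSF's own `bkill -J JOBNAME 0` is documented as doing the same thing without a pipeline; check `man bkill` on the cluster before relying on it.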
If you need to launch hundreds or thousands of jobs, using job groups can help you control the rate at which things run:
https://confluence.ris.wustl.edu/display/~cmiller/Using+LSF+job+groups
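As a rough illustration of the idea (the linked page has the details that apply to this cluster), LSF lets you create a job group with a cap on concurrently running jobs using `bgadd -L`, then submit into it with `bsub -g`. The group path, the limit of 10, the `submit_all` function name and the `wc -l` stand-in command below are all placeholders chosen for this sketch.

```shell
# One-time setup, run by hand: create a group that runs at most 10 jobs at once.
#   bgadd -L 10 /USERNAME/workshop

# Submit one job per input file into the given job group.
# Usage: submit_all GROUP_PATH FILE...
# (submit_all is a hypothetical helper; "wc -l" stands in for real work.)
submit_all() {
    local group="$1"; shift
    local f
    for f in "$@"; do
        bsub -g "$group" -M 2G -R 'select[mem>2G] rusage[mem=2G]' \
            -q general -G compute-timley \
            -oo "$(basename "$f").log" \
            -a 'docker(ubuntu:xenial)' "wc -l $f"
    done
}
```

With this in place, `submit_all /USERNAME/workshop *.txt` queues every file, while the group limit keeps only a handful running at any moment.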
Shortcuts

You can set up shortcuts for getting jobs that don't require quite so much typing. Setting up this alias in your .bashrc file:
alias bsub4='bsub -oo err.log -G compute-c.a.miller -q general -M 4G -R '\''select[mem>4G] rusage[mem=4G]'\'' -a "docker(chrisamiller/docker-genomic-analysis)"'

Will allow you to launch a non-interactive command with a shortened form:
bsub4 "bedtools intersect -h"

gsub is a special alias for doing things in GMS (MGI pipelines and tools). We'll talk about that more in a future session.
isub is a script you can download and set up that has similar syntax.
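An alias can't take parameters, so bsub4 is stuck at 4 GB. If you want to choose the memory each time, a shell function in your .bashrc can do the same job. This is just one possible sketch; the name `bsubmem` is made up, and the group and image are copied from the bsub4 alias above, so swap in your own.

```shell
# Like the bsub4 alias, but the memory (in GB) is the first argument.
# Usage: bsubmem GB "command"
# (bsubmem is a hypothetical name; adjust the group and image to your lab.)
bsubmem() {
    local gb="$1"; shift
    bsub -oo err.log -G compute-c.a.miller -q general \
        -M "${gb}G" -R "select[mem>${gb}G] rusage[mem=${gb}G]" \
        -a "docker(chrisamiller/docker-genomic-analysis)" "$@"
}
```

For example, `bsubmem 8 "bedtools intersect -h"` submits the same command as before with 8 GB requested.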
Some LSF tips:


Don't launch 100 jobs until you've verified that one will run successfully.
If you need to launch 1000 jobs, you're probably doing it wrong.
If you need to launch 100,000 jobs, you're definitely doing it wrong.
If you're launching hundreds of jobs that take only seconds to complete, refactor your code.
Every job has a /tmp/ directory that gets blown away when things are over. In this era of limited/expensive disk, this is useful!
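To make the last two tips concrete, here is a minimal sketch of "refactoring" many tiny jobs into one: a single function loops over all the inputs, keeps its intermediate files in a scratch directory under /tmp, and only prints the final summary. The function name `count_lines_batch` is invented for this example, and counting lines is a stand-in for whatever quick task you actually run.

```shell
# Process many small inputs inside ONE job instead of one job per input.
# Intermediate files live in a scratch directory under /tmp.
# (count_lines_batch is a hypothetical example; wc -l stands in for real work.)
count_lines_batch() {
    local scratch f
    scratch=$(mktemp -d /tmp/batch.XXXXXX)
    for f in "$@"; do
        # per-file step: write an intermediate result to scratch
        wc -l < "$f" > "$scratch/$(basename "$f").count"
    done
    # final step: combine the intermediate results into one number
    cat "$scratch"/*.count | awk '{total += $1} END {print total + 0}'
    rm -rf "$scratch"
}
```

Saved in a script, this could be submitted once with something like "bash myscript.sh" in place of hundreds of separate bsub calls.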


Using Docker

On your laptop:
docker pull ubuntu

What is that doing? It's going to https://hub.docker.com/_/ubuntu and pulling down the image with the "latest" tag
docker run ubuntu

Uhh, nothing happened.  Not quite - it started a container from that image, but you didn't tell it to do anything, so it exited right away!
docker run ubuntu echo "hello world" 

What's cool is that didn't run on macOS, that ran in Linux, and we can prove it:
docker run ubuntu uname -a

But what if we want to do more than one thing at a time?  Run docker interactively!  (bash is the shell that we generally work in)
docker run -it ubuntu /bin/bash

There is one major difference between running docker on the cluster and on your laptop:
whoami

We can't get root access in docker images on the cluster. To oversimplify a complicated topic, that's to prevent people from accessing data they shouldn't be able to.
Here though, we have root, so let's install some software.
python3

fails - it's not installed!
apt-get update
apt-get install python3

Now it should work:
python3
>>> print("Hello World!")
Now, exit out of python and your container
<CTRL-D to quit python>
exit

Let's pop back into our container
docker run -it ubuntu /bin/bash
python3

Wait!  where'd it go?  Let's chat about persistence
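What happened is that every `docker run` starts a brand-new container from the unchanged image, so anything installed in an earlier container isn't there. The old containers still exist in a stopped state until you remove them. A small sketch for listing them, using docker's `ps` filters (the function name `containers_from` is made up for this example):

```shell
# List every container created from an image, including ones that have exited.
# Usage: containers_from IMAGE
# (containers_from is a hypothetical helper name.)
containers_from() {
    docker ps -a --filter "ancestor=$1" --format '{{.ID}}  {{.Status}}'
}
```

`containers_from ubuntu` should show one line per `docker run ubuntu ...` you've done so far. Adding `--rm` to a run, as in `docker run --rm -it ubuntu /bin/bash`, tells docker to delete the container when you exit, which keeps this list short.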
My First Dockerfile

Create a folder
mkdir ubuntu-python
cd ubuntu-python

Create a new text file named "Dockerfile" with the following contents:
# start from base ubuntu
FROM ubuntu:latest

LABEL maintainer="Chris Miller <c.a.miller@wustl.edu>"

# DEBIAN_FRONTEND=noninteractive keeps the install from stopping to ask questions
RUN apt-get -y update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install python3

Save that file in the directory
And build a docker image from that Dockerfile
cd ..
docker build -t chrisamiller/ubuntu-python ubuntu-python/

Let's run it:
docker run -it chrisamiller/ubuntu-python

and verify that python is installed
python3