
Snakemake

Here are the links to various official Snakemake things.

The tutorial is pretty good, but I'll try to brain dump some information below that might help.

In programming and bioinformatics one of the situations we always find ourselves in is:

I have file X, I want to run tool T on it to produce file Y.

This happens all the time, especially when you're working in a Linux/UNIX-style environment. The tool you use and how you need to call it can vary each time, and sometimes there's more than one input or output file, but the general pattern of "input file(s) → tool → output file(s)" is something you have to do over and over and over again, e.g.:

  • You have photo-1234.jpg and want to resize it to create a small preview image for it: convert photo-1234.jpg -resize 100x100 photo-1234-preview.jpg
  • You have a FASTQ file and want to get a report on quality metrics: fastqc -o SRA12345_R1_Report/ SRA12345_R1.fastq.gz
  • You have a pair of FASTQ files representing paired-end reads for a bacterial sample and want to de novo assemble a genome for it: spades.py --pe1-1 SRA12345_R1.fastq.gz --pe1-2 SRA12345_R2.fastq.gz -o SRA12345_R2_assembly/.
  • You have a Python (or R or whatever) script that takes an assembly, performs some analysis on it, and generates a plot: python perform_analysis.py --assembly SRA12345_R2_assembly/contigs.fasta --output plot.pdf

If you just have one or a couple of tasks like this to perform, doing it manually at the command line is fine and lets you get your result and move on with your life.

If you have a lot of these tasks to perform, it can be easier to make a script to do it for you instead of typing out the commands by hand:

#!/usr/bin/env bash
set -euo pipefail

convert photo-001.jpg -resize 100x100 photo-001-preview.jpg
convert photo-002.jpg -resize 100x100 photo-002-preview.jpg
convert photo-003.jpg -resize 100x100 photo-003-preview.jpg
convert photo-004.jpg -resize 100x100 photo-004-preview.jpg
convert photo-005.jpg -resize 100x100 photo-001-preview.jpg
convert photo-006.jpg -resize 100x100 photo-006-preview.jpg
convert photo-007.jpg -resize 100x100 photo-007-preview.jpg

This is probably easier than typing it by hand, but is still kind of error prone (notice the typo in one of the commands). So maybe you make a slightly fancier bash script:

#!/usr/bin/env bash
set -euo pipefail

function make_preview {
    convert "${1}.jpg" -resize 100x100 "${1}-preview.jpg"
}

make_preview photo-001
make_preview photo-002
make_preview photo-003
make_preview photo-004
make_preview photo-005
make_preview photo-006
make_preview photo-007

That's a little better, but there are still some things we might want to improve. If we edit one of the photos and rerun the script, it'll recreate all the preview images even though we've only changed one of them. We could hack something into the script to try to handle that:

#!/usr/bin/env bash
set -euo pipefail

function make_preview {
    IN="${1}.jpg" 
    OUT="${1}-preview.jpg"

    if test "$IN" -nt "$OUT"; then
        # Only do the conversion if the input is newer than the output (-nt
        # means "newer than").
        convert "$IN" -resize 100x100 "$OUT"
    fi
}

make_preview photo-001
make_preview photo-002
make_preview photo-003
make_preview photo-004
make_preview photo-005
make_preview photo-006
make_preview photo-007

So that works, and is kind of okay, but hacking this together for every single tool one-by-one is a pain, especially if a particular step has multiple input files (now you have to check if any input is newer than the output), multiple output files (is the input newer than any output file), or both.
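Just to make the bookkeeping concrete, here's roughly the sort of timestamp check you end up hand-rolling once a step has several inputs and outputs. This is only a sketch in plain Python (the same language Snakemake is built on, as we'll get to below), with made-up file names; it's not part of our actual pipeline:

import os

def needs_rebuild(inputs, outputs):
    """Return True if any output is missing, or if any input is newer than any output."""
    if not all(os.path.exists(out) for out in outputs):
        return True
    newest_input = max(os.path.getmtime(f) for f in inputs)
    oldest_output = min(os.path.getmtime(f) for f in outputs)
    return newest_input > oldest_output

# Hypothetical usage: redo the assembly if either FASTQ is newer than the contigs.
if needs_rebuild(["SRA12345_R1.fastq.gz", "SRA12345_R2.fastq.gz"],
                 ["SRA12345_R2_assembly/contigs.fasta"]):
    print("assembly is out of date, rerun spades here")

Multiply that by every step in a pipeline and it gets old fast.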

Another pattern you often have to deal with is when the output of one tool becomes the input for another, e.g.:

Raw FASTQ Files
      ↓
    [Trimming program (e.g. trimmomatic)]
      ↓
Trimmed FASTQ Files
      ↓
    [Assembler (e.g. SPAdes)]
      ↓
Assembled FASTA file

Often it gets even more complicated, where it's not a single linear list of X→Y→Z. The flow for our project will probably look something like this:

[Figure: snakemake-intro-dag.dot (the DAG of steps for our project)]

Like before, we only want to rebuild an output if the input changed. But now that we have a graph, this isn't necessarily limited to a single file. If we change an input file, we only need to rebuild the output files that depend (transitively) on that file. In our example, if we update the reference there's no need to waste time trimming the FASTQs again.

This problem has been around for as long as UNIX has. The classic example is compiling a program written in C: you compile each file individually, then link the resulting object files into a program, then copy the program somewhere to install it, etc.

The steps of the entire process together form a graph, specifically a Directed Acyclic Graph (DAG) — it's directed because input vs output is directional, and it's acyclic because you can't require something to already exist in order to build itself. DAGs come up in a lot of places in computer science (e.g. the Git commit history is a DAG), so they're worth getting comfortable with for their own sake.

Back to the problem, we have a DAG of tasks we need to do:

[Figure: snakemake-intro-dag.dot again (the same DAG of tasks)]

Doing all of this by hand would be tedious and error-prone, and programmers are often lazy and want to make the computer do the work for them. In the mid-1970s some people at Bell Labs created a tool called make, which lets you define the DAG of your input and output files and handles creating the outputs whenever they don't exist and/or whenever their inputs change. make's syntax is infamously bad, but let's take a look just for the hell of it:

%-preview.jpg: %.jpg
	convert $< -resize 100x100 $@

This says "in order to make a file named whatever-preview.jpg, you need to have a file named whatever.jpg, and then you can create it by running the following convert … command" ($< and $@ are make's cryptic shorthand for "the input file" and "the output file"). So now we can run make foo-preview.jpg and the computer will look for foo.jpg and use it to build the preview (or spit out an error if the input file is missing).

make is used a lot in the Linux programming world to compile software, but it's not super user friendly. Snakemake is a tool that's trying to solve the same basic problem as make (take a DAG of input/outputs and (re)build any outputs that need it), but is a little friendlier (especially for bioinformatics pipelines). The syntax is Python with a few extra bits on top (hence the "snake" in "Snakemake").

In Snakemake, that make rule from before looks like:

rule preview_photo:
    input:
        "{name}.jpg"
    output:
        "{name}-preview.jpg"
    shell:
        "convert {input} -resize 100x100 {output}"

And we'd run it similarly to how we ran make before: snakemake whatever-preview.jpg. (Depending on the version, Snakemake may also insist that you tell it how many CPU cores it's allowed to use, e.g. snakemake --cores 1 whatever-preview.jpg.)

There's something that's easy to miss when you're skimming the examples, but is really important to wrap your head around when working with Snakemake: rules and invocations of Snakemake are always based on the output, not the input.

For example: our rule above takes a photo as input and generates a preview. To call that rule we tell Snakemake "please build whatever-preview.jpg" and Snakemake figures out the input it needs by itself. There's no way to say "please take the input file whatever.jpg and make stuff from it"; you always start at the outputs and work backwards from there. It takes some time to get used to working in that direction, but it's not too bad once you get some practice.
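To make the "work backwards" idea concrete, here's a rough sketch of how the trimming and assembly steps from earlier might be chained together. The directory layout, file names, and the exact trimmomatic command are placeholders, not our real pipeline; the shape is the important part:

# Sketch only: paths, sample naming, and trimming parameters are made up.

rule trim:
    input:
        r1="raw/{sample}_R1.fastq.gz",
        r2="raw/{sample}_R2.fastq.gz"
    output:
        r1="trimmed/{sample}_R1.fastq.gz",
        r1_unpaired="trimmed/{sample}_R1.unpaired.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz",
        r2_unpaired="trimmed/{sample}_R2.unpaired.fastq.gz"
    shell:
        "trimmomatic PE {input.r1} {input.r2} "
        "{output.r1} {output.r1_unpaired} {output.r2} {output.r2_unpaired} "
        "SLIDINGWINDOW:4:20"

rule assemble:
    input:
        r1="trimmed/{sample}_R1.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz"
    output:
        "assemblies/{sample}/contigs.fasta"
    shell:
        "spades.py --pe1-1 {input.r1} --pe1-2 {input.r2} "
        "-o assemblies/{wildcards.sample}"

If we ask for snakemake --cores 4 assemblies/SRA12345/contigs.fasta, Snakemake starts from that output, matches it against the assemble rule (filling in sample=SRA12345), notices that the trimmed FASTQs it needs are themselves outputs of the trim rule, and runs trim first if they're missing or out of date. That's the same transitive-rebuild behaviour from the DAG discussion above, and it also shows how a rule can have several named inputs and outputs.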

Defining all the steps in your workflow in Snakemake takes a bit of time and practice, but once you've done it you get a lot of stuff for free:

  • You can rebuild everything with a single command, without any risk of making typos along the way.
  • Snakemake is smart enough to only rebuild the necessary parts of the DAG when something changes, instead of building everything every time.
  • Snakemake can plug into Slurm on Great Lakes, so it can run individual tasks on separate cluster nodes: if you need to run 48 assemblies, it can run them all in parallel on 48 different nodes to finish faster.
  • You can get some basic statistics (e.g. run time, maximum memory used, etc) automatically with a line or two of Snakemake config.
  • Snakemake can handle running tools from Singularity containers if you want it to, which can help solve the installation/dependency hell that a lot of bioinformatics tools (especially Python-based ones) end up in. (There's a sketch of what these last few features look like in a rule just below.)
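To give a flavour of those last few bullets, here's what the assembly rule from the sketch above might look like with cluster resources, benchmarking, and a container bolted on. The numbers and the container image are made up for illustration; treat this as a sketch, not a recommendation:

rule assemble:
    input:
        r1="trimmed/{sample}_R1.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz"
    output:
        "assemblies/{sample}/contigs.fasta"
    threads: 8
    resources:
        mem_mb=32000,   # made-up numbers, tune per tool
        runtime=120     # minutes
    benchmark:
        "benchmarks/assemble/{sample}.txt"   # Snakemake writes run time, peak memory, etc. here
    container:
        "docker://staphb/spades:3.15.5"      # hypothetical image tag
    shell:
        "spades.py --threads {threads} "
        "--pe1-1 {input.r1} --pe1-2 {input.r2} "
        "-o assemblies/{wildcards.sample}"

When Snakemake is driving Slurm, the threads and resources values are roughly what get passed along in each job request, and the benchmark file is where those basic run-time and memory statistics end up.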

It's worth going through the Snakemake tutorial if you've never used it before to get an idea of how it works: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html

Note: the cluster admins recently added Snakemake as a module on Great Lakes, so you don't have to mess around installing it yourself if you want to give it a try. On Great Lakes you can just module load snakemake and you're all set.
