Snakemake cheat sheet (coupled with GitHub)

You can find the Snakemake documentation at https://snakemake.readthedocs.io/en/stable/.

Installing snakemake on Farnam

You can easily install snakemake on Farnam with mamba:

module load miniconda
mamba create -c conda-forge -c bioconda -n snakemake snakemake
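Then activate the environment before running any Snakemake commands (the environment name snakemake comes from the create command above):

module load miniconda
conda activate snakemake
snakemake --version  # quick check that the install worked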

Basic units of the workflow are rules

Each step in a workflow can be written as a rule. Snakemake evaluates the inputs and outputs of every rule, figures out which rules must run first when others depend on their outputs, and links them all into the workflow. Rules that do not depend on one another are run in parallel.
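For instance, a minimal rule looks like this (a sketch adapted from the Snakemake tutorial; the bwa/samtools command and file names are just illustrations):

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"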

Configuration files give you portability

A portable workflow lets a user re-run it on new data without changing any of the rules. To pass information that is specific to your dataset, snakemake lets you use a configuration file. Information in the configuration file is loaded when the workflow starts, so you can customize the workflow for different datasets by changing file paths and other settings in the configuration file alone, without touching any other files in the workflow.

For instance, in your config.yaml file you can specify samples:

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

And later, when you want to use those files as inputs to a rule, you can retrieve the sample names with expand:

expand("sorted_reads/{sample}.bam", sample=config["samples"])

See this example and explanation here: https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html#step-2-config-files
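Putting the pieces together, a minimal sketch (assuming the file is named config.yaml and that the sorted_reads/{sample}.bam files are produced by other rules):

# load config.yaml so its contents are available in the config dictionary
configfile: "config.yaml"

rule all:
    input:
        expand("sorted_reads/{sample}.bam", sample=config["samples"])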

Input functions and wildcards

Wildcards can be used instead of hardcoding names and file paths. They can be stated in the input and output directives and accessed in the shell directive. See a good explanation of wildcards here: https://endrebak.gitbooks.io/the-snakemake-book/content/chapters/wildcards/wildcards.html
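For example, one rule with a {sample} wildcard covers every sample (a sketch; samtools sort is just a stand-in command):

rule sort_bam:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -o {output} {input}"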

I recently needed to generate results for several values of a variable within a function (it was a threshold value of branch length for pruning branches in a set of phylogenetic trees), and I wanted to create a separate folder for each result. The problem is that if I just used expand to resolve the threshold values in a variable {threshold}, where threshold=[0,1,2], my rule would ask for one long list of all the folders with different thresholds instead of calling each one at a time. It happened just as they explain in the section "expand() in other rule's input: calls".

What worked for me was to create a global wildcard variable,

THRESHOLD_VALS = "0,1,2".split(",")

expand it in the target rule only,

expand("results/reference/treeinform/threshold_{threshold}/{species}.collapsed.fasta.transcripts.fasta", threshold=THRESHOLD_VALS, species=config["species"])

and call {threshold} as a wildcard in every other rule. This way snakemake didn't concatenate all the paths into one giant line but called each of them separately. See the condensed sketch below.
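A condensed sketch of the pattern (rule and script names are placeholders; the point is the expand in the target rule and the bare {threshold} wildcard downstream):

THRESHOLD_VALS = "0,1,2".split(",")

rule all:
    input:
        expand("results/threshold_{threshold}/{species}.collapsed.fasta",
               threshold=THRESHOLD_VALS, species=config["species"])

rule collapse:
    input:
        "results/{species}.trees"
    output:
        "results/threshold_{threshold}/{species}.collapsed.fasta"
    # prune_trees.py is a hypothetical stand-in for the actual pruning step
    shell:
        "prune_trees.py --threshold {wildcards.threshold} {input} > {output}"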

Best practices for structuring a workflow

Snakemake developers suggest we structure the workflow as follows for improved distribution and readability:

+-- .gitignore
+-- README.md
+-- LICENSE.md
+-- workflow
|   +-- rules
|   |   +-- module1.smk
|   |   +-- module2.smk
|   +-- envs
|   |   +-- tool1.yaml
|   |   +-- tool2.yaml
|   +-- scripts
|   |   +-- script1.py
|   |   +-- script2.R
|   +-- notebooks
|   |   +-- notebook1.py.ipynb
|   |   +-- notebook2.r.ipynb
|   +-- report
|   |   +-- plot1.rst
|   |   +-- plot2.rst
|   +-- Snakefile
+-- config
|   +-- config.yaml
|   +-- some-sheet.tsv
+-- results
+-- resources

I especially agree that splitting the rules into smaller files improves readability. See more here.
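The split works through the include directive in the main Snakefile, using the file names from the tree above:

# workflow/Snakefile
include: "rules/module1.smk"
include: "rules/module2.smk"

Paths in include statements are resolved relative to the file that contains them, so rules/ here is relative to workflow/.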

Executing a workflow on the cluster

I often work with large files and run jobs on a cluster for the extra computational power. The cool thing about Snakemake is that it lets you execute the workflow using HPC resources and gives you the flexibility to configure each rule as a separate job.

The way this works can be a little confusing, but you can adjust parameters to match those expected by whichever job scheduler is in place. At Farnam, we use the SLURM job scheduling system.

A few things I learned:

  • The number of threads can be set to any number in a rule. However, you need to provide the number of available threads in your snakemake command or cluster configuration file. For instance, I have to pass the actual --ntasks-per-node parameter to the scheduler, as I would normally do in a job file, otherwise the number I set in the Snakefile won't go through: snakemake --cluster "sbatch -p pi_dunn --time {params.time} --mem {params.mem} --nodes=1 --ntasks-per-node=15" --jobs 2
  • The params keyword can be included in separate rules, so that you can set whatever you need (time, memory, etc.) separately for each of them. See the sketch after this list.
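A sketch of a rule set up this way (the rule name, command, and resource values are arbitrary examples):

rule assemble:
    input:
        "data/samples/A.fastq"
    output:
        "assembly/A.fasta"
    params:
        time="12:00:00",  # walltime handed to sbatch via {params.time}
        mem="40G"         # memory handed to sbatch via {params.mem}
    threads: 15
    # some_assembler is a placeholder command
    shell:
        "some_assembler --threads {threads} -o {output} {input}"

The --cluster string above then pulls {params.time} and {params.mem} from whichever rule is being submitted.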

Setting up a SLURM profile

It is a LOT easier to just set up a cluster profile with default settings for jobs, so you don't have to re-write them every single time. The SLURM profile is available at https://github.com/Snakemake-Profiles/slurm.

Install cookiecutter:

mamba install cookiecutter

Create a Slurm profile in a snakemake configuration folder in your home directory:

mkdir -p ~/.config/snakemake
cd ~/.config/snakemake
cookiecutter https://github.com/Snakemake-Profiles/slurm.git

Create and edit a cluster_config.yml file to fit your needs. Mine looks like this:

__default__:
    partition: pi_dunn
    mail-user: natasha.picciani@yale.edu
    nodes: 1
    ntasks: 1
    cpus-per-task: {threads}
    mem: 40G
    output: "logs/{rule}/{rule}.{wildcards}.slurm.out"
    error: "logs/{rule}/{rule}.{wildcards}.slurm.err"

In the file slurm-submit.py, replace the path in CLUSTER_CONFIG with the path to your cluster_config.yml:

CLUSTER_CONFIG = "/home/nnp9/.config/snakemake/slurm/cluster_config.yml"
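You can then launch the workflow through the profile (assuming you named the profile slurm when cookiecutter prompted you):

snakemake --profile slurm --jobs 10

Snakemake looks the profile up under ~/.config/snakemake/slurm and applies its defaults to every job it submits.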

Tips

It's useful to name log files with the sample name (if applicable) and the rule name.
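A sketch with the log directive, following that naming scheme (the rule and command are illustrative):

rule bwa_map:
    input:
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    log:
        "logs/bwa_map/{sample}.log"  # rule name + sample name
    shell:
        "(bwa mem genome.fa {input} | samtools view -Sb - > {output}) 2> {log}"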

Other utilities

  • To validate the configuration file, you can use a JSON or YAML schema and make sure all inputs are set up correctly, as sketched below.
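A minimal sketch using snakemake's built-in validate function (the schema file name is an example):

# at the top of the Snakefile
from snakemake.utils import validate

configfile: "config.yaml"
validate(config, "config.schema.yaml")

with a matching config.schema.yaml such as:

$schema: "http://json-schema.org/draft-06/schema#"
properties:
    samples:
        type: object
required:
    - samples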

Useful resources:

https://bluegenes.github.io/hpc-snakemake-tips/

http://www.hpc-carpentry.org/hpc-python/13-cluster/index.html

https://lachlandeer.github.io/snakemake-econ-r-tutorial/wildcards-in-target-rules.html

https://edwards.sdsu.edu/research/wildcards-in-snakemake/

a little outdated by now but still useful: https://hackmd.io/7k6JKE07Q4aCgyNmKQJ8Iw?view

Make your project a GitHub repository

The usual advice is to use GitHub from day one of your project so you can keep track of all the changes to the code you write. Once you create a folder for your project, you can start a repository from the existing folder following the instructions here. You basically create the repository on the command line and then link it to your GitHub account:

cd your/project/folder
git init
git add .
git commit -m "First commit"

Once you create your repository, you can connect it to your GitHub account:

  1. Log in to GitHub and create a new repository (give it a name, skip the options)
  2. Follow directions on option "…or push an existing repository from the command line".

Note: As mentioned in the step-by-step guide linked above, you can set the repository URL to the SSH form instead of the HTTPS form the directions show, so you won't have to type your username and password every time you push commits to the origin. If you forgot to do that, like I did, you don't have to delete the repository and start over. You can simply follow the instructions here to change the repository URL (a sketch below). Once you do that, you will likely have to generate SSH keys to push to GitHub.
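A sketch of the URL switch (username and repository are placeholders for your own):

git remote set-url origin git@github.com:username/repository.git
git remote -v  # confirm that origin now points at the SSH URL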

To generate the SSH keys, type on the command line:

ssh-keygen -t rsa -b 4096 -C your.email.linked.to.github

You can cat the file to see the contents:

cat /home/user/.ssh/id_rsa.pub

Copy the contents and paste in https://github.com/settings/ssh/new

That will do!

Installing pre-commit hooks

Complete instructions are at https://pre-commit.com.

  1. Install pre-commit via conda:
conda install -c conda-forge pre-commit
  2. Add a .pre-commit-config.yaml file to your repo:
fail_fast: true
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.4.0
    hooks:
    - id: check-added-large-files
      args: ['--maxkb=100000']
    - id: check-yaml
    - id: end-of-file-fixer
    - id: trailing-whitespace
    - id: fix-encoding-pragma
-   repo: https://github.com/ambv/black
    rev: 21.7b0
    hooks:
    - id: black
  3. Install the pre-commit hooks:
pre-commit install