@darothen
Created July 13, 2016 20:37

This is a simple example and writeup of using snakemake to automate a basic data-processing pipeline.

Motivation

Makefiles and GNU Make are awesome for many reasons, and it's unforgivable for any scientist working with data processing pipelines not to use them throughout their projects. But Makefiles, while feature-rich, are not really an ideal tool for automating complex data processing pipelines. If, by some chance, your analyses simply require you to collect different data, process them with identical procedures, collate them, and produce a plot, then sure - Makefiles will do. But in analyzing climate model output, I've found that I have to do a lot of quirky hacks to fit this sort of workflow model.

A perfect example is the analysis of hierarchical climate model output. It's quite common to run a climate model multiple times in a factorial fashion, changing 2-3 parameters (say, an emissions dataset and a parameterization in the model). While you can pigeon-hole linear data processing pipelines to fit these sorts of climate model output archives, you'll still run into problems very quickly. First and foremost, because many or all of your intermediate files will still be NetCDF files, you shouldn't change your file extension. At best, you can continue to chain descriptors, separated by periods, onto the end of your filename, but these quickly get unwieldy. The biggest problem, though, is that defining the dependencies of your final products (where you collate processed datasets together) often requires massive, deeply-nested loops in your Makefile, and often abstraction into a macro or defined function, which can require complicated secondary expansion rules to work correctly.

If Makefiles had a broader standard library of functions to help generate dependencies, this wouldn't be a problem - and if you're a bash expert, it isn't, because you can just drop obfuscated shell commands into your Makefile to deal with it. But liberally adopting that strategy defeats the purpose of having a reproducible workflow in the first place:

Workflows should be self-documenting and simple to understand

Two features would greatly alleviate the problems inherent in Makefiles. First, if Makefiles supported different scripting languages, then you could use something like Python to expressively and succinctly build the complex expressions that otherwise require serious Makefile-fu. Second, if Makefiles supported multiple wildcards with regular expression matching/expansion, then it would be much more straightforward to build complex, multi-faceted or multi-dimensional pipelines without lots of pre-defined, deeply-nested loops.

Enter Python Tools

Given the two proposed solutions to Makefiles' limitations, an obvious pursuit would be to build a make-like tool in Python. And, in fact, there are quite a few projects which have done this.

I've played with several of them, and it became clear that a third feature is required for a make-killer: simple, non-verbose scripting. Consider, for example, the great tool doit. Doit accomplishes almost everything I'm looking for in a killer workflow manager. Unfortunately, doit task files are extremely verbose and can be very difficult to skim and read. The doit CLI also leaves much to be desired. There's nothing fundamental to doit that requires an overly-verbose syntax; it's just a byproduct of how doit chooses to serialize tasks within a workflow (generator functions yielding dictionaries).
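To give a flavor of that verbosity, here is a rough sketch (mine, not taken from this gist or from doit's documentation) of what a single stage of the pipeline developed below might look like as a doit task generator; the parameter lists and the global_avg() helper are stand-ins for the real processing logic:

# dodo.py -- illustrative sketch only; paths and parameter lists mirror the Snakefile below
import os
from itertools import product

import xarray as xr

ACTS = ["arg_comp", "arg_min_smax", "pcm_comp"]
AERS = ["PD", "PI"]
FIELDS = ["TS", "PS", "PRECL"]

def global_avg(in_fn, out_fn):
    # Stand-in for the real processing step: average over the horizontal dims
    os.makedirs(os.path.dirname(out_fn), exist_ok=True)
    xr.open_dataset(in_fn).mean(dim=["x", "y"]).to_netcdf(out_fn)

def task_global_avg():
    # doit discovers "task_*" generators; each yielded dict defines one sub-task
    for act, aer, field in product(ACTS, AERS, FIELDS):
        in_fn = "data/{0}/{1}/{0}_{1}.{2}.2d_monthly.nc".format(act, aer, field)
        out_fn = "analysis/global_avg/{0}_{1}.{2}.global_avg.nc".format(act, aer, field)
        yield {
            "name": out_fn,
            "file_dep": [in_fn],
            "targets": [out_fn],
            "actions": [(global_avg, [in_fn, out_fn])],
        }

Every input, output, and dependency has to be enumerated by hand inside the generator, and that boilerplate is exactly what clutters larger workflows.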

Enter snakemake. Snakemake is exactly what it sounds like: a Makefile in a Pythonic syntax (rules are written in a clean, YAML-like block syntax that is actually an extension of Python, so arbitrary Python is available everywhere). Snakemake satisfies all three of my criteria: it has a simple, succinct syntax, it allows multiple wildcards with regular expression matching, and it seamlessly lets you use Python code in addition to shell commands. By basing rules on this declarative syntax instead of a Python data structure, snakemake permits expressive yet short and to-the-point build rules. Furthermore, it leverages the fantastic string processing built into Python to perform pattern matching, letting you very easily use multiple, named wildcards in your rules. These make life super easy. On top of this, snakemake ships with sensible helper functions like expand() to compute the cartesian product of multiple parameters in your input/output filenames and touch() to generate stub files.
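For instance (a minimal sketch of my own, not part of the original writeup), expand() just substitutes every combination of the named wildcard values into a filename pattern:

from snakemake.io import expand

# expand() fills in every combination of the named wildcard values
paths = expand("data/{act}/{aer}/{act}_{aer}.TS.2d_monthly.nc",
               act=["arg_comp", "pcm_comp"], aer=["PD", "PI"])
print(paths)
# -> four paths, one per (act, aer) pair, e.g.
#    'data/arg_comp/PD/arg_comp_PD.TS.2d_monthly.nc'
#    'data/pcm_comp/PI/pcm_comp_PI.TS.2d_monthly.nc'

Inside a Snakefile you don't even need the import; expand() is available in the rule definitions directly.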

Snakemake has plenty of other useful features, including the ability to construct the DAG of your workflow (snakemake --dag emits it in Graphviz dot format), to document all the shell commands and steps of your workflow (snakemake -np performs a dry run that prints each command it would execute), and much more. The Snakefile below puts this together for a toy version of the climate-model analysis described above: it generates dummy NetCDF data, computes global averages, and plots comparisons across the different cases.

ACTS = ["arg_comp", "arg_min_smax", "pcm_comp", ]
AERS = ["PD", "PI", ]
REGIONS = ["GLOBAL", "LND", "OCEAN"]
FIELDS = ["TS", "PS", "PRECL", ]

rule make_data:
    output:
        expand("data/{act}/{aer}/{act}_{aer}.{field}.2d_monthly.nc",
               act=ACTS, aer=AERS, field=FIELDS)
    message:
        "Generating dummy data"
    run:
        from itertools import product
        import os
        import xarray as xr
        import numpy as np
        import pandas as pd

        for act, aer, field in product(ACTS, AERS, FIELDS):
            out_dir = "data/{act}/{aer}/".format(act=act, aer=aer)
            out_fn = "{act}_{aer}.{field}.2d_monthly.nc".format(act=act, aer=aer,
                                                                field=field)
            os.makedirs(out_dir, exist_ok=True)

            # Fake a small (x, y, time) field with 5 months of data
            data = np.random.randn(20, 20, 5)
            lin = np.linspace(0, 1, 20)
            ds = xr.Dataset({field: (['x', 'y', 'time'], data)},
                            coords={'x': lin,
                                    'y': lin,
                                    'time': pd.date_range('2001-01-01', periods=5,
                                                          freq='MS')})
            ds.to_netcdf(os.path.join(out_dir, out_fn))

rule global_avg:
    input:
        "data/{act}/{aer}/{act}_{aer}.{field}.2d_monthly.nc"
    output:
        temp("analysis/global_avg/{act}_{aer}.{field}.global_avg.nc")
    run:
        import xarray as xr
        # Average over the horizontal dimensions to get a global-mean timeseries
        ds = xr.open_dataset(input[0])
        ds.mean(dim=["x", "y"]).to_netcdf(output[0])

rule global_avg_comparison:
    input:
        # {{aer}} and {{field}} are escaped so they survive expansion as wildcards;
        # only {act} is expanded over ACTS here
        expand("analysis/global_avg/{act}_{{aer}}.{{field}}.global_avg.nc", act=ACTS)
    output:
        "figs/global_avg/{field}.{aer}.global_avg.png"
    params:
        field="{field}", aer="{aer}"
    run:
        import pandas as pd
        import xarray as xr
        import matplotlib.pyplot as plt

        # Collect the global-average timeseries for each activation scheme
        all_data = []
        for fn in input:
            print(fn)
            all_data.append(xr.open_dataset(fn)[params.field])
        all_data = (
            xr
            .concat(all_data, dim=pd.Index(ACTS, name="act"))
            .to_dataframe()
            .reset_index()
            .pivot(index="time", columns="act", values=params.field)
        )
        all_data.plot()
        plt.savefig(output[0])

# Target rule: request every comparison figure, triggering the full pipeline
rule all_global_avgs:
    input:
        expand("figs/global_avg/{field}.{aer}.global_avg.png",
               field=FIELDS, aer=AERS)

With these rules in place, running snakemake all_global_avgs generates the dummy data, computes the global averages, and produces one comparison figure per (field, aer) combination.