@darothen
Created July 13, 2016 20:37

This is a simple example and writeup of using snakemake to automate a basic data-processing pipeline.

Motivation

Makefiles and GNU Make are awesome for many reasons, and it's unforgivable for any scientist working with data processing pipelines not to use them throughout their projects. But Makefiles, while feature-rich, are not really an ideal tool for automating complex data processing pipelines. If, by some chance, your analyses simply require you to collect different data, process them with identical procedures, collate them, and produce a plot, then sure - Makefiles will do. But in analyzing climate model output, I've found that I have to do a lot of quirky hacks to fit this sort of workflow model.

A perfect example is the analysis of hierarchical climate model output. It's quite common to run a climate model multiple times in a factorial fashion, changing 2-3 parameters (say, an emissions dataset and a parameterization in the model). While you can pigeon-hole linear data processing pipelines to fit these sorts of climate model output archives, you'll still run into problems very quickly. First and foremost, because many or all of your intermediate files will still be NetCDF files, you shouldn't change your file extension. At best, you can continue to chain descriptors, separated by periods, onto the end of your filename, but these quickly get unwieldy. The biggest problem, though, is that defining the dependencies of your final products (where you collate processed datasets together) often requires massive, deeply-nested loops in your Makefile, and often abstraction into a macro or defined function, which can require complicated secondary expansion rules to work correctly.

If Makefiles had a broader standard library of functions to help generate dependencies, this wouldn't be a problem - and if you're a bash expert, it isn't, because you can just drop obfuscated shell commands into your Makefile to deal with it. But liberally adopting that strategy defeats the purpose of having a reproducible workflow in the first place:

Workflows should be self-documenting and simple to understand

Two features would greatly alleviate the problems inherent in Makefiles. First, if Makefiles supported different scripting languages, then you could use something like Python to expressively and succinctly build the complex expressions that otherwise require serious Makefile-fu. Second, if Makefiles supported multiple wildcards with regular expression matching/expansion, then it would be much more straightforward to build complex, multi-faceted or multi-dimensional pipelines without lots of pre-defined, deeply-nested loops.

Enter Python Tools

Given the two proposed solutions to Makefiles' limitations, an obvious pursuit would be to build a make-like tool in Python. And, in fact, there are quite a few projects which have done this.

I've played with several of them, and it became clear that a third feature is required for a make-killer: simple, non-verbose scripting. Consider, for example, the great tool doit. Doit accomplishes almost everything I'm looking for in a killer workflow manager. Unfortunately, doit task files are extremely verbose and can be very difficult to skim and read. The doit CLI also leaves much to be desired. There's nothing fundamental to doit that requires an overly-verbose syntax; it's just a byproduct of how doit chooses to serialize tasks within a workflow (generator functions yielding dictionaries).
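To give a flavor of that verbosity, here is a rough sketch (mine, not taken from this gist or from doit's documentation) of what a single stage of the pipeline developed below might look like as a doit task generator; the parameter lists and the global_avg() helper are stand-ins for the real processing logic:

# dodo.py -- illustrative sketch only; paths and parameter lists mirror the Snakefile below
import os
from itertools import product

import xarray as xr

ACTS = ["arg_comp", "arg_min_smax", "pcm_comp"]
AERS = ["PD", "PI"]
FIELDS = ["TS", "PS", "PRECL"]

def global_avg(in_fn, out_fn):
    # Stand-in for the real processing step: average over the horizontal dims
    os.makedirs(os.path.dirname(out_fn), exist_ok=True)
    xr.open_dataset(in_fn).mean(dim=["x", "y"]).to_netcdf(out_fn)

def task_global_avg():
    # doit discovers "task_*" generators; each yielded dict defines one sub-task
    for act, aer, field in product(ACTS, AERS, FIELDS):
        in_fn = "data/{0}/{1}/{0}_{1}.{2}.2d_monthly.nc".format(act, aer, field)
        out_fn = "analysis/global_avg/{0}_{1}.{2}.global_avg.nc".format(act, aer, field)
        yield {
            "name": out_fn,
            "file_dep": [in_fn],
            "targets": [out_fn],
            "actions": [(global_avg, [in_fn, out_fn])],
        }

Every input, output, and dependency has to be enumerated by hand inside the generator, and that boilerplate is exactly what clutters larger workflows.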

Enter snakemake. Snakemake is exactly what it sounds like: a Makefile in a Pythonic syntax (rules are written in a clean, YAML-like block syntax that is actually an extension of Python, so arbitrary Python is available everywhere). Snakemake satisfies all three of my criteria: it has a simple, succinct syntax, it allows multiple wildcards with regular expression matching, and it seamlessly lets you use Python code in addition to shell commands. By basing rules on this declarative syntax instead of a Python data structure, snakemake permits expressive yet short and to-the-point build rules. Furthermore, it leverages the fantastic string processing built into Python to perform pattern matching, letting you very easily use multiple, named wildcards in your rules. These make life super easy. On top of this, snakemake ships with sensible helper functions like expand() to compute the cartesian product of multiple parameters in your input/output filenames and touch() to generate stub files.
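For instance (a minimal sketch of my own, not part of the original writeup), expand() just substitutes every combination of the named wildcard values into a filename pattern:

from snakemake.io import expand

# expand() fills in every combination of the named wildcard values
paths = expand("data/{act}/{aer}/{act}_{aer}.TS.2d_monthly.nc",
               act=["arg_comp", "pcm_comp"], aer=["PD", "PI"])
print(paths)
# -> four paths, one per (act, aer) pair, e.g.
#    'data/arg_comp/PD/arg_comp_PD.TS.2d_monthly.nc'
#    'data/pcm_comp/PI/pcm_comp_PI.TS.2d_monthly.nc'

Inside a Snakefile you don't even need the import; expand() is available in the rule definitions directly.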

Snakemake has plenty of other useful features, including the ability to construct the DAG of your workflow (snakemake --dag emits it in Graphviz dot format), to document all the shell commands and steps of your workflow (snakemake -np performs a dry run that prints each command it would execute), and much more. The Snakefile below puts this together for a toy version of the climate-model analysis described above: it generates dummy NetCDF data, computes global averages, and plots comparisons across the different cases.

ACTS = ["arg_comp", "arg_min_smax", "pcm_comp", ]
AERS = ["PD", "PI", ]
REGIONS = ["GLOBAL", "LND", "OCEAN"]
FIELDS = ["TS", "PS", "PRECL", ]

rule make_data:
    output:
        expand("data/{act}/{aer}/{act}_{aer}.{field}.2d_monthly.nc",
               act=ACTS, aer=AERS, field=FIELDS)
    message:
        "Generating dummy data"
    run:
        from itertools import product
        import os
        import xarray as xr
        import numpy as np
        import pandas as pd

        for act, aer, field in product(ACTS, AERS, FIELDS):
            out_dir = "data/{act}/{aer}/".format(act=act, aer=aer)
            out_fn = "{act}_{aer}.{field}.2d_monthly.nc".format(act=act, aer=aer,
                                                                field=field)
            os.makedirs(out_dir, exist_ok=True)

            # Fake a small (x, y, time) field with 5 months of data
            data = np.random.randn(20, 20, 5)
            lin = np.linspace(0, 1, 20)
            ds = xr.Dataset({field: (['x', 'y', 'time'], data)},
                            coords={'x': lin,
                                    'y': lin,
                                    'time': pd.date_range('2001-01-01', periods=5,
                                                          freq='MS')})
            ds.to_netcdf(os.path.join(out_dir, out_fn))

rule global_avg:
    input:
        "data/{act}/{aer}/{act}_{aer}.{field}.2d_monthly.nc"
    output:
        temp("analysis/global_avg/{act}_{aer}.{field}.global_avg.nc")
    run:
        import xarray as xr
        # Average over the horizontal dimensions to get a global-mean timeseries
        ds = xr.open_dataset(input[0])
        ds.mean(dim=["x", "y"]).to_netcdf(output[0])

rule global_avg_comparison:
    input:
        # {{aer}} and {{field}} are escaped so they survive expansion as wildcards;
        # only {act} is expanded over ACTS here
        expand("analysis/global_avg/{act}_{{aer}}.{{field}}.global_avg.nc", act=ACTS)
    output:
        "figs/global_avg/{field}.{aer}.global_avg.png"
    params:
        field="{field}", aer="{aer}"
    run:
        import pandas as pd
        import xarray as xr
        import matplotlib.pyplot as plt

        # Collect the global-average timeseries for each activation scheme
        all_data = []
        for fn in input:
            print(fn)
            all_data.append(xr.open_dataset(fn)[params.field])
        all_data = (
            xr
            .concat(all_data, dim=pd.Index(ACTS, name="act"))
            .to_dataframe()
            .reset_index()
            .pivot(index="time", columns="act", values=params.field)
        )
        all_data.plot()
        plt.savefig(output[0])

# Target rule: request every comparison figure, triggering the full pipeline
rule all_global_avgs:
    input:
        expand("figs/global_avg/{field}.{aer}.global_avg.png",
               field=FIELDS, aer=AERS)

With these rules in place, running snakemake all_global_avgs generates the dummy data, computes the global averages, and produces one comparison figure per (field, aer) combination.