endrebak/SEP1.md

## SEP1.md

      
# SEP 1 -- Improved and automated logging

## Abstract

This Snakemake Enhancement Proposal (SEP) suggests several enhancements of
Snakemake's logging and report generating capabilities.

## Introduction

Bioinformaticians often have an enormous library of homemade scripts
continuously producing results and graphs, often for many different people and
groups. Therefore, at a later date, it might be hard to remember how different
results were produced. And even if a git hash is printed on a graph or attached
as metadata to a file, looking through your git history and tracing the path of
execution through a script to find out how your files were made is very
time-consuming. And even this level of meticulousness is not enough: the files
you use or produce are not version controlled and might be stale, modified or of
a different origin than what you intended.
This plight of bioinformaticians is likely a contributor to the "reproducibility
crisis" in the life sciences.
Therefore, making it easy to retrace your steps and find out exactly what code
and what files were used to produce a result is going to be an important step
towards more reproducible science. It will also save a lot of time for
developers. Lastly, if an official system is put in place for creating good
automatic logs, it can be updated and further refined through pull requests from
the community.

## Current situation

Currently, Snakemake offers two conveniences for creating logs and reports: the
`log:` directive and the `report:` keyword.
The `log:` directive specifies a log file for a specific rule.
The `report:` keyword allows the user to send in reStructuredText to create a
more refined HTML report.
These two can be combined, so that a nice HTML log is created.

## Limitations of the current situation

The log directive merely creates a log file for the user, and the onus is on
the user to generate, format and write the appropriate data to the log file.
This log-writing must be done in the very same rule that produces the normal
results from the pipeline, adding more code to a perhaps already crowded rule,
making the code harder to understand and maintain.
The report keyword requires the user to fetch the data she wants to include
herself and write a lot of boilerplate to format it nicely. Therefore, it is
better suited to creating reports at the end of a pipeline, tying together lots
of different data.
Having separate rules for creating reports is undesirable, especially if you
want to create a report for each rule, as this SEP suggests. This is not just
because having to write, debug and maintain more code is a bad thing, but also
because it clutters your --dag and --rulegraph output with unnecessary
information.

## Proposal: enhanced and automated logging

One might write a log for different reasons; often it is to report warnings from
the software used, how many reads were aligned more than once, which contigs
were removed from a fasta file and similar mundane things. These are likely
things that cannot be done much more efficiently than they already are in
Snakemake.
There is a different reason to write a log, and that is to record exactly which
steps were performed to produce your results and what results were produced.
This is where there is an orchard full of low-hanging fruit for Snakemake.
As a Snakefile typically describes the whole pipeline used to produce a result,
has access to the environment in which it is run, all input, output and
temporary files, and the code used to generate them, including the docstrings,
it should be possible to both greatly improve and automate Snakemake's logging
and reporting.
For each rule, there are several standard elements that would be helpful to have
in a log/report.

- The docstring of a rule. This can be used as a readable explanation of a rule,
  and is perhaps the most interesting part for non-technical people trying to
  understand what you have done.
- The input- and output-files of a rule.
- A small portion of the files. Viewing a small subset of a few input and
  output files is often more helpful than reading the code or docstrings to
  immediately understand what a rule does.
- The timestamps of the input and output files.
- The code used in a rule. There is nothing that will better help you understand
  exactly how your results were produced than reading the code. This can be looked
  up later, but is often arduous work.
- The git status and commit hash of your project.
- The time the report was created.
- The versions of the Python libraries, R libraries and command line apps used.
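
To make the list above concrete, here is a minimal sketch in plain Python of how most of these elements could be gathered for one rule. Everything in it is an assumption for illustration: `collect_rule_report`, `head`, `git_commit_hash` and `package_version` are hypothetical helpers, not part of Snakemake, and the sketch covers only the commit hash (not the git status) and only Python package versions (not R libraries or command line apps).

```python
import datetime
import itertools
import os
import subprocess
from importlib import metadata


def head(path, n_lines=5):
    """Return the first n_lines of a text file, without trailing newlines."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n_lines)]


def git_commit_hash(repo_dir="."):
    """Return the current commit hash, or None if git or a repository is missing."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def package_version(name):
    """Return the installed version of a Python package, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_rule_report(docstring, code, input_files, output_files, packages=()):
    """Gather the per-rule report elements into a plain dictionary (hypothetical layout)."""

    def describe(path):
        # Timestamp and a small sample of each file.
        modified = datetime.datetime.fromtimestamp(os.path.getmtime(path))
        return {"path": path, "modified": modified.isoformat(), "sample": head(path)}

    return {
        "docstring": docstring,
        "code": code,
        "input": [describe(path) for path in input_files],
        "output": [describe(path) for path in output_files],
        "git_commit": git_commit_hash(),
        "created": datetime.datetime.now().isoformat(),
        "versions": {name: package_version(name) for name in packages},
    }
```

A dictionary like this could then be handed to whatever template renders the log, which keeps data collection separate from formatting.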

There should be support for easily creating a log/report of the whole pipeline
run by concatenating the reports from each rule. The report could include a
table of contents and the rulegraph or DAG for your pipeline.
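
As a rough illustration of the concatenation described above, the sketch below joins per-rule reports into one plain-text document with a table of contents. The function name `combine_reports`, its arguments and the text layout are assumptions made for this example only; a real implementation would presumably use a template and embed the rulegraph as an image.

```python
def combine_reports(rule_reports, rulegraph=None):
    """Join per-rule reports into one text document.

    rule_reports: list of (rule_name, report_text) tuples in execution order.
    rulegraph: optional text placeholder for the rulegraph or DAG.
    """
    lines = ["Table of contents", ""]
    for number, (name, _) in enumerate(rule_reports, start=1):
        lines.append(f"{number}. {name}")
    lines.append("")

    if rulegraph is not None:
        lines += ["Rulegraph", "", rulegraph, ""]

    # One numbered section per rule, in the same order as the table of contents.
    for number, (name, text) in enumerate(rule_reports, start=1):
        lines += [f"{number}. {name}", "", text.rstrip("\n"), ""]

    return "\n".join(lines)
```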

## Example

Below follows a very rough example of what I am trying to achieve.
As you can see, the file starts with the graph produced by the pipeline and a
report for the whole pipeline. The report for the whole pipeline merely consists
of the reports for each individual rule file.
https://github.com/endrebak/git-lfs/blob/master/annotated_H3K4me3.pdf
(To create it, I had to update the pipeline with a lot of boilerplate. As you can
see, there is nothing there snakemake could not do automatically.)
Notice that by having rules to create reports, the rulegraph becomes much less readable: https://github.com/endrebak/git-lfs/blob/master/tss_rulegraph.pdf

## Benefits

### Easier debugging

Since the code and example data are shown close together, you can read a report
linearly to find where (and why) your pipeline starts to produce wonky data.

### Easier sharing of results and collaboration

If results are delivered with a report, it is much easier for both technical and
non-technical collaborators to understand how your results were produced. (This
was actually the use-case that inspired us to create such reports.)
And if the result of a pipeline is a graph, the whole report can be included in
the produced graph. By adding the report as additional pages in a PDF, the
metadata and graph will never be out of sync. When it is time to include the
graph in a publication, the report part can easily be cropped. If the graphs are
intended to be released as supplemental information, including the reports is
likely helpful.

### Easier writing up your results for publication

By including the correct information in your report, writing up your results for
publication later will be much easier.

### Make Snakemake more attractive and well known

Automatic generation of logs is a killer feature, and as much as I love
Snakemake, I would consider switching to a framework that included such a
feature.
Furthermore, if Snakemake logs were well-designed, informative and easy to
generate, they would likely be widely used to share information about how results
were generated, spreading the gospel of Snakemake to a much wider audience.

## Implementation

Here I consider one way the user-interfacing parts of Snakemake could be modified
to accommodate automatic reports. Only directly relevant implementation details
are discussed.

### Use the existing log directive

As Snakemake already contains a log directive, this could be used. To toggle
automatic log generation, a command line argument (possibly pointing to the
template for the log files) could be used.
One problem with this approach is that existing rules using the log directive
write directly to the log files. One solution would be to wait until the rule
has finished and its log has been written. The contents of this custom log
(the log written to in the rule) could then be read into memory, after which
Snakemake could write its automatic log and include the custom log in a
"custom" section of the automatically generated log.
This is the cleanest and least invasive solution I can see, as it does not break
compatibility with existing code and even handles the case where the log message
is written in a shell directive or from R code.
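
The merge step described above could look roughly like the following sketch. It is only an illustration under stated assumptions: `finalize_log` is a hypothetical helper, not an existing Snakemake function, and the plain-text "Custom" section stands in for whatever layout the log template would define.

```python
from pathlib import Path


def finalize_log(log_path, automatic_text):
    """Rewrite a rule's log file as: automatic log, then the custom log.

    Meant to be called after the rule has finished, so that anything the rule
    (shell command, R code, Python code) wrote to log_path is already on disk.
    """
    path = Path(log_path)
    # Read the custom log into memory; a rule may not have written one at all.
    custom_text = path.read_text() if path.exists() else ""

    sections = [
        automatic_text.rstrip("\n"),
        "",
        "Custom",
        "",
        custom_text.rstrip("\n"),
    ]
    path.write_text("\n".join(sections) + "\n")
```

Because the custom log is only read after the rule has completed, it does not matter which tool wrote it, which is what keeps existing rules working unchanged.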

## Q&A

### If a rule uses (or produces) very many files, how can you possibly write a sample from them all?

You cannot, but we can make it so that from each named group of files, you write
a sample from at most X of the files.
If you have a rule like so:

```snakemake
rule bla:
    input:
        list_of_thousand_homogenous_files,
        one_important_file
    ....
```

Only the first X files from the input directive are sampled in the log.

You can change it to

```snakemake
rule bla:
    input:
        named_list=list_of_thousand_homogenous_files,
        named_file=one_important_file
    ....
```

And then a sample of the first X files of `named_list` is written, together with
a sample from `named_file`.
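
The selection rule above (at most X files per named group) can be sketched in a few lines of Python. The name `files_to_sample`, the `max_files` parameter standing in for X, and the dictionary representation of the named inputs are all assumptions made for illustration.

```python
def files_to_sample(named_groups, max_files=3):
    """Pick at most max_files paths from every named group of files.

    named_groups: dict mapping a group name to a single path or a list of paths.
    """
    selected = {}
    for name, paths in named_groups.items():
        if isinstance(paths, str):
            # A single named file is treated as a group of one.
            paths = [paths]
        selected[name] = list(paths)[:max_files]
    return selected


groups = {
    "named_list": [f"sample_{i}.bed" for i in range(1000)],
    "named_file": "one_important_file.bed",
}
print(files_to_sample(groups, max_files=2))
```

With named groups, the single important file is always sampled, however many files the large list contains.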

## Conclusion

Having Snakemake create automatic reports will not be hard to do.

It would make reproducibility, collaboration, debugging and paper writing much easier for Snakemake users.