@endrebak
Last active November 8, 2016 07:27

SEP 1 -- Improved and automated logging

Abstract

This Snakemake Enhancement Proposal (SEP) suggests several enhancements to Snakemake's logging and report-generating capabilities.

Introduction

Bioinformaticians often have an enormous library of homemade scripts continuously producing results and graphs, often for many different people and groups. Therefore, at a later date, it might be hard to remember how different results were produced. And even if a git hash is printed on a graph or attached as metadata to a file, looking through your git history and tracing the path of execution through a script to find out how your files were made is very time-consuming. And even this level of meticulousness is not enough: the files you use or produce are not version controlled and might be stale, modified or of a different origin than what you intended.

This plight of bioinformaticians is likely a contributor to the "reproducibility crisis" in the life sciences.

Therefore, making it easy to retrace your steps and find out exactly what code and what files were used to produce a result is going to be an important step towards more reproducible science. It will also save a lot of time for developers. Lastly, if an official system is put in place for creating good automatic logs, it can be updated and further refined through pull requests from the community.

Current situation

Currently, Snakemake offers two conveniences for creating logs and reports: the log: directive and the report: keyword.

The log: directive specifies a log file for a specific rule.

The report: keyword allows the user to send in reStructuredText to create a more refined HTML report.

These two directives can be combined, so that a nice HTML log is created.

Limitations of the current situation

The log directive merely creates a log file for the user, and the onus is on the user to generate, format and write the appropriate data to the log file. This log-writing must be done in the very same rule that produces the normal results from the pipeline, adding more code to a perhaps already crowded rule, making the code harder to understand and maintain.

The report keyword requires the user to fetch the data she wants to include and to write a lot of boilerplate to format it nicely. Therefore, it is better suited to creating reports at the end of a pipeline, tying together lots of different data.

Having separate rules for creating reports is undesirable, especially if you want to create a report for each rule, as this SEP suggests. This is not just because having to write, debug and maintain more code is a bad thing, but because it clutters your --dag and --rulegraph output with unnecessary information.

Proposal: enhanced and automated logging

One might write a log for different reasons; often it is to report warnings from the software used, how many reads were aligned more than once, which contigs were removed from a fasta file and similar mundane things. These are likely things that cannot be done much more efficiently than they already are in Snakemake.

There is a different reason to write a log, and that is to record exactly which steps were performed to produce your results and what results were produced. This is where there is an orchard full of low-hanging fruit for Snakemake.

As a Snakefile typically describes the whole pipeline used to produce a result, has access to the environment in which it is run, all input, output and temporary files, and the code used to generate them, including the docstrings, it should be possible to both greatly improve and automate Snakemake's logging and reporting.

For each rule, there are several standard elements that would be helpful to have in a log/report.

  • The docstring of a rule. This can be used as a readable explanation of a rule, and is perhaps the most interesting part for non-technical people trying to understand what you have done.
  • The input and output files of a rule.
  • A small portion of the files. Viewing a small subset of a few input and output files is often more helpful than reading the code or docstrings to immediately understand what a rule does.
  • The timestamps of the input and output files.
  • The code used in a rule. There is nothing that will better help you understand exactly how your results were produced than reading the code. This can be looked up later, but is often arduous work.
  • The git status and commit hash of your project.
  • The time the report was created.
  • The versions of the Python libraries, R libraries and command line apps used.
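The elements listed above could be gathered with very little code. What follows is only a rough sketch in plain Python, not part of Snakemake: the function names (`file_summary`, `git_state`, `package_versions`, `rule_record`) and the shape of the returned dictionary are hypothetical, and it only covers Python package versions, not R libraries or command line apps.

```python
import datetime
import itertools
import os
import subprocess
from importlib import metadata


def file_summary(path, n_lines=5):
    """Return the modification time and the first few lines of a text file."""
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    with open(path, errors="replace") as handle:
        head = [line.rstrip("\n") for line in itertools.islice(handle, n_lines)]
    return {
        "path": path,
        "modified": mtime.isoformat(timespec="seconds"),
        "head": head,
    }


def git_state(repo_dir="."):
    """Return commit hash and short status, or None if git is unavailable."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        status = subprocess.run(
            ["git", "status", "--short"], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return {"commit": commit, "status": status}


def package_versions(names):
    """Look up the installed version of each named Python package."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def rule_record(name, docstring, inputs, outputs, code, packages=(), repo_dir="."):
    """Collect everything a per-rule report would need into one dictionary."""
    return {
        "rule": name,
        "docstring": docstring,
        "inputs": [file_summary(p) for p in inputs],
        "outputs": [file_summary(p) for p in outputs],
        "code": code,
        "git": git_state(repo_dir),
        "created": datetime.datetime.now().isoformat(timespec="seconds"),
        "packages": package_versions(packages),
    }
```

A record like this could then be passed to whatever template renders the log, so the rule itself needs no extra code.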

There should be support for easily creating a log/report of the whole pipeline run by concatenating the reports from each rule. The report could include a table of contents and the rulegraph or DAG for your pipeline.
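Concatenating the per-rule reports could be as simple as the sketch below. It assumes each rule's report is already available as a string and emits reStructuredText-style headings; `combine_reports` is a hypothetical helper, and the rulegraph or DAG image is left out.

```python
def combine_reports(rule_reports, title="Pipeline report"):
    """Join per-rule reports into one document with a table of contents.

    rule_reports is a list of (rule_name, report_text) pairs in the
    order the rules should appear.
    """
    lines = [title, "=" * len(title), "", "Contents", "--------", ""]
    # Table of contents: one numbered entry per rule.
    for number, (name, _) in enumerate(rule_reports, start=1):
        lines.append(f"{number}. {name}")
    # One section per rule, each under its own heading.
    for name, text in rule_reports:
        lines += ["", name, "-" * len(name), "", text.rstrip()]
    return "\n".join(lines) + "\n"
```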

Example

Below follows a very rough example of what I am trying to achieve.

As you can see, the file starts with the graph produced by the pipeline and a report for the whole pipeline. The report for the whole pipeline merely consists of the reports for each individual rule file.

https://github.com/endrebak/git-lfs/blob/master/annotated_H3K4me3.pdf

(To create it, I had to update the pipeline with a lot of boilerplate. As you can see, there is nothing there snakemake could not do automatically.)

Notice that by having rules to create reports, the rulegraph becomes much less readable: https://github.com/endrebak/git-lfs/blob/master/tss_rulegraph.pdf

Benefits

Easier debugging

Since the code and example data are shown close together, you can read a report linearly to find where (and why) your pipeline starts to produce wonky data.

Easier sharing of results and collaboration

If results are delivered with a report, it is much easier for both technical and non-technical collaborators to understand how your results were produced. (This was actually the use-case that inspired us to create such reports.)

And if the result of a pipeline is a graph, the whole report can be included in the produced graph. By adding the report as additional pages in a PDF, the metadata and graph will never be out of sync. When it is time to include the graph in a publication, the report part can easily be cropped. If the graphs are intended to be released as supplemental information, including the reports is likely helpful.

Easier writing up your results for publication

By including the correct information in your report, writing up your results for publication later will be much easier.

Make Snakemake more attractive and well known

Automatic generation of logs is a killer feature, and as much as I love Snakemake I would consider switching to a framework that included such a feature.

Furthermore, if Snakemake logs were well-designed, informative and easy to generate they would likely be widely used to share information about how results were generated, spreading the gospel of Snakemake to a much wider audience.

Implementation

Here I consider one way the user-interfacing parts of Snakemake could be modified to accommodate automatic reports. Only directly relevant implementation details are discussed.

Use the existing log directive

As Snakemake already contains a log directive, this could be used. To toggle automatic log generation, a command line argument (possibly pointing to the template for the log files) could be used.

One problem with this approach is that existing rules using the log directive write directly to the log files. One solution would be to wait until the rule has finished and its log has been written. The contents of this custom log (the log written to in the rule) could then be read into memory, and Snakemake could write its automatic log and include the custom log in a "custom" section of the automatically generated log.

This is the cleanest and least invasive solution I can see, as it does not break compatibility with existing code and even handles the case where the log message is written in a shell directive or from R code.
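The merge step described above might look roughly like this. It is a sketch under the assumption that the rule has already finished and that the automatic part of the log is available as a string; `merge_logs` and the "Custom log" heading are hypothetical names, not existing Snakemake behaviour.

```python
from pathlib import Path


def merge_logs(log_path, automatic_text, custom_heading="Custom log"):
    """Rewrite a rule's log as the automatic log plus a custom section."""
    path = Path(log_path)
    # Whatever the rule itself wrote (from Python, shell or R) is read first.
    custom = path.read_text() if path.exists() else ""
    sections = [automatic_text.rstrip("\n")]
    if custom.strip():
        sections += [
            "",
            custom_heading,
            "-" * len(custom_heading),
            "",
            custom.rstrip("\n"),
        ]
    path.write_text("\n".join(sections) + "\n")
```

Because the custom log is only read after the rule has completed, it does not matter how the rule wrote it.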

Q&A

If a rule uses (or produces) very many files, how can you possibly write a sample from them all?

You cannot, but we can make it so that from each named group of files, you write a sample from at most X of the files.

If you have a rule like so:

```snakemake
rule bla:
    input:
        list_of_thousand_homogenous_files,
        one_important_file
    ....
```

only the first X files from the input directive are sampled in the log.

You can change it to

```snakemake
rule bla:
    input:
        named_list=list_of_thousand_homogenous_files,
        named_file=one_important_file
    ....
```

and then a sample of the first X files of `named_list` is written, together with
a sample from `named_file`.
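The selection rule can be sketched in a few lines of Python. Here `files_to_sample` is a hypothetical helper, unnamed inputs are treated as one group called "input", and `max_files` plays the role of X.

```python
def files_to_sample(positional, named, max_files=3):
    """Pick at most max_files files from each group of input files.

    positional: list of unnamed input files (treated as one group).
    named: dict mapping an input name to one file or a list of files.
    """
    groups = {}
    if positional:
        groups["input"] = list(positional)[:max_files]
    for name, value in named.items():
        # A single file is treated as a group of one.
        files = [value] if isinstance(value, str) else list(value)
        groups[name] = files[:max_files]
    return groups
```

With unnamed inputs, `one_important_file` is lost behind the thousand homogeneous files; once both are named, each group gets its own sample.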

Conclusion

Having Snakemake create automatic reports will not be hard to do.

Doing so would make reproducibility, collaboration, debugging and paper writing much easier for Snakemake users.

@johanneskoester

Hi Endre,
indeed, improving the logging capabilities is a good idea. I have recently thought about a concept going in the same direction: Let Snakemake write an automatic report containing all the log files, the DAG and maybe the rules code or at least some meta info for each rule.

Your proposal seems to be compatible with this to a large extent. The main point where I tend to disagree is about the log directive. From a dev perspective, it is great to be able to see the log file pattern when looking at the rule itself. This is only natural, since input and output are displayed there as well. Further, many tools handle logs in an unusual way (stdout, stderr, files), so it is good to see directly from the shell command how logs are put into the desired place. Therefore, I would propose (at least as a first step) to only take logging output that was written to files defined by the log directive.
To remove clutter, a global log directive with some kind of pattern could be introduced independently of this automatic reporting/logging support.

Integrating with current report functionality:

It would be great if such an automatic report would integrate with the ability to include custom textual descriptions and output files (e.g. plots). There could be a report directive, pointing to a custom jinja/RST template with that text. The template could access output files via the rules.output[] notation.

File format

Did you think about how to present this? I thought about composing a self-contained HTML file, where all the additional information can be collapsed or expanded. The downside is that it potentially can't deal with large workflows. This could be circumvented by not putting every log directly into the HTML (which would in turn make it harder to deliver the report to collaborators). There are also a couple of formats for bundling multi-file HTML pages into a single archive (MHTML, MAFF). Unfortunately, no standard has emerged so far. MHTML has the best browser support as far as I know.
When using PDF instead, one could look into the PDF file attachment support, but log file display could become problematic depending on the platform a collaborator has.
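As a rough illustration of the self-contained HTML idea, the sketch below uses the standard HTML `<details>` and `<summary>` elements for the collapsible parts. The `render_report` function and its input format are hypothetical, and it embeds every log directly, so it does not address the large-workflow concern.

```python
import html


def render_report(sections, title="Workflow report"):
    """Render (heading, text) pairs as one HTML page with collapsible parts."""
    safe_title = html.escape(title)
    parts = [
        "<!DOCTYPE html>",
        f"<html><head><meta charset='utf-8'><title>{safe_title}</title></head>",
        f"<body><h1>{safe_title}</h1>",
    ]
    for heading, text in sections:
        # Each log or rule description starts collapsed and expands on click.
        parts.append(
            f"<details><summary>{html.escape(heading)}</summary>"
            f"<pre>{html.escape(text)}</pre></details>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)
```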

@endrebak

endrebak commented Nov 8, 2016

Moved this to bitbucket issues, like the new guidelines told me to: https://bitbucket.org/snakemake/snakemake/issues/394/automated-logs
