@porterjamesj
Last active August 29, 2015 13:57

the interface
from pipeline.main import pipeline_from_config
from pipeline.tools import async
from genomics_tools import parse_bowtie_output

with pipeline_from_config("/path/to/config.yaml") as p:
    # quality-trim read 1 and read 2 for each sample in parallel
    with async():
        p.run("cutadapt --quality-base=33 --quality-cutoff=20 --format=fastq --minimum-length=0 --output={path1}/{name1}.trimmed.fastq {path1}/{name1}.fastq")
        p.run("cutadapt --quality-base=33 --quality-cutoff=20 --format=fastq --minimum-length=0 --output={path2}/{name2}.trimmed.fastq {path2}/{name2}.fastq")
    p.run("mkdir {workdir}/bowtie")
    # align a subsample with bowtie2 and parse out the mate inner distance and its standard deviation
    inner_dist, std_dev = parse_bowtie_output(p.run("bowtie2 -p 8 -s 100000 -u 250000 -q -x /path/to/bowtie/index -1 {path1}/{name1}.trimmed.fastq -2 {path2}/{name2}.trimmed.fastq -S {workdir}/bowtie/{description}.bam"))
    # full tophat alignment, then cufflinks
    p.run("tophat --num-threads=8 --GTF={gtf} --no-coverage-search --output-dir={workdir}/tophat --mate-std-dev={std_dev} --mate-inner-dist={inner_dist} /path/to/bowtie/index {name1}.trimmed.fastq {name2}.trimmed.fastq")
    p.run("cufflinks . . . ")
@porterjamesj (Author) commented:

things to think about:

  • transactions? doing it in an automagical way would require parsing the command line output of every tool. a better approach might be to have the user specify a "cleanup" step somehow when calling p.run. maybe another context manager? (there's a rough sketch of that idea after this list.) scripts could get fairly thorny. the core problem is that we don't want this to be too magical (i.e. it should feel like writing a bash script that gets applied in parallel), but having the user manually specify the cleanup procedure for each step would make it very unwieldy to write (every step would have to have an extra context manager or whatever wrapping it).
  • how to support different backends? this can happen in whatever function gives you a pipeline object; need to think about the best way though.
  • logging. right now i'm planning on having each "sample" (for lack of a better word) have its own log file into which it just dumps everything, which is pretty straightforward (rough sketch after this list). this is useful when you want to examine the logs for a specific file to figure out what's going wrong there, but not so much when you want a big overview of whether something went wrong across the entire cluster; you don't want to have to sift through a million different logs to figure out where the problem is. for that it's possible we could just read all the logs and try to present them in some sort of intelligent way.
  • dashboard! I would also like to write a little flask app to act as a dashboard that you can reverse tunnel into, or that we could mount on tukey somehow. it would be nice to have the pipeline coordinator run as a daemon and have the dashboard, command line client, etc. just be frontends for it.
  • NAME?????? the hardest part of any OSDC project -_____-
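
Here's the context-manager version of the cleanup idea from the first bullet, just to see how it reads. It also shows exactly the unwieldiness described there: every step that needs compensation gets its own wrapper. The cleanup name and signature are made up for this sketch, not part of the interface above.

    from contextlib import contextmanager

    @contextmanager
    def cleanup(p, command):
        """If the wrapped step raises, run `command` (templated per sample, like
        any other p.run call) to undo whatever partial output the step left behind."""
        try:
            yield
        except Exception:
            p.run(command)  # best-effort compensation, e.g. removing a partial output dir
            raise

    # hypothetical usage inside the pipeline block:
    #
    #   with cleanup(p, "rm -rf {workdir}/tophat"):
    #       p.run("tophat ... --output-dir={workdir}/tophat ...")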
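
And the per-sample logging from the third bullet is about this simple in the straightforward version: append each command and its output to a log file named after the sample. The logs directory and the use of the sample's description as the file name are assumptions for this sketch.

    import os
    import subprocess

    def run_and_log(cmd, sample, workdir):
        """Run one already-templated command for one sample, appending the command
        line plus its stdout/stderr to that sample's own log file."""
        logdir = os.path.join(workdir, "logs")
        if not os.path.isdir(logdir):
            os.makedirs(logdir)
        logpath = os.path.join(logdir, sample["description"] + ".log")
        with open(logpath, "a") as log:
            log.write("$ " + cmd + "\n")
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            log.write(result.stdout)
            log.write(result.stderr)
        return result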
