Skip to content

Instantly share code, notes, and snippets.

@mariusae
Created February 22, 2018 23:50
Show Gist options
  • Save mariusae/24ff59757050076e0263af58edf54e00 to your computer and use it in GitHub Desktop.
Save mariusae/24ff59757050076e0263af58edf54e00 to your computer and use it in GitHub Desktop.
"Many files" example of Reflow (https://github.com/grailbio/reflow/issues/22)
// This is an example modules to demonstrate how to process large
// numbers of files in Reflow, as outlined here:
//
// https://github.com/grailbio/reflow/issues/22
// First, we define a number of parameters that provide the input
// and output data locations as well as the location of the reference
// we'd like to use.
param (
// Ref contains our SNP calling reference.
ref = "s3://myreferencedata/ref.fasta"
// Reads is the location of the paired-end reads
// that we are going to be assembling and calling
// SNPs on.
reads string
// Out is the prefix to which the module's outputs
// are placed. Specifically, assemblies are placed in
// out/assemblies, and calls are placed in out/snps.
out string
)
// Define some values that contain our Docker images.
// We don't have any readily available spades or snippy images,
// so we'll use dummies here.
val spadesImage = "unknown"
val snippyImage = "unknown"
// Now we write some functions for the various computations
// we are going to be performing.
// Assemble takes a paired-end read (forward reverse)
// and uses spades to assemble them into a FASTA file.
func assemble(f, r file) file =
exec(image := spadesImage, cpu := 2, mem := GiB) (out file) {"
spades -pe1 {{f}} -pe2 {{r}} -out {{out}}
"}
// Call calls SNPs from the provided fasta file using
// snippy. and the provided reference fasta.
func call(fasta, ref file) dir =
exec(image := snippyImage, cpu := 2, mem := GiB) (out dir) {"
snippy -in {{fasta}} -ref {{ref}} -outdir {{out}}
"}
// We're now going to use these functions to assemble genomes
// and call SNPs on our input dataset.
// Instantiate some useful utility (system) modules.
val dirs = make("$/dirs")
val files = make("$/files")
val strings = make("$/strings")
// Pairs contains a list of all the paired-end reads
// we found in the reads directory. It contains the
// forward and reverse files together with the name
// (prefix) of the pair.
val pairs [{f, r file, name string}] = {
// Pair extracts a read-pair and a name from a directory.
func pair(d dir) {f, r file, name string} = {
val (f, fname) = dirs.Pick(d, "*F.fastq")
val (r, _) = dirs.Pick(d, "*R.fastq")
val [name, _] = strings.Split(fname, "_")
{f, r, name}
}
// dirs.Groups returns a set of directories grouped by the
// provided regular expression. In this case, we're interested
// in all the read pairs in the reads directory.
dirs := dirs.Groups(dir(reads), "(genome.*)_[RF].fastq")
// We then extract each read (forward, reverse) and the name,
// as encoded in the filename.
[pair(d) | (_, d) <- dirs]
}
// Now we assemble each pair, and then call SNPs on each of the
// assemblies.
val assemblies = [{fasta: assemble(f, r), name} | {f, r, name} <- pairs]
val calls = [{snp: call(fasta, file(ref)), name} | {fasta, name} <- assemblies]
// Finally we copy the results of our computations to our desired
// output locations. Note that since Reflow evaluates lazily, any
// evaluation that isn't reachable from Main is not computed,
// thus we typically put side-effecting code here.
@requires(cpu := 1, mem := GiB)
val Main = (
[files.Copy(fasta, out+"/assemblies/"+name+".fasta") | {fasta, name} <- assemblies],
[dirs.Copy(snp, out+"/snps/"+name+"_snps/") | {snp, name} <- calls],
)
/*
Here's a run where I subsituted the definitions of assemble and call
with dummy versions:
func assemble(f, r file) file = exec(image := "ubuntu", cpu := 1) (out file) {"
cat {{f}} {{r}} >{{out}}
"}
func call(fasta, ref file) dir = exec(image := "ubuntu", cpu := 1) (out dir) {"
cp {{fasta}} {{out}}/somecall.xxx
"}
% reflow doc assembly.rf
Parameters
val ref string = "s3://myreferencedata/ref.fasta"
Ref contains our SNP calling reference.
val reads string (required)
Reads is the location of the paired-end reads that we are going to be assembling
and calling SNPs on.
val out string (required)
Out is the prefix to which the module's outputs are placed. Specifically,
assemblies are placed in out/assemblies, and calls are placed in out/snps.
Declarations
val Main ([unit], [unit])
Finally we copy the results of our computations to our desired output locations.
Note that since Reflow evaluates lazily, any evaluation that isn't reachable
from Main is not computed, thus we typically put side-effecting code here.
% reflow run assembly.rf -reads s3://grail-marius/assembly -out s3://grail-marius/OUT
...
% aws s3 ls --recursive s3://grail-marius/OUT/
2018-02-22 15:41:15 47200 OUT/assemblies/genome01.fasta
2018-02-22 15:41:15 47200 OUT/assemblies/genome02.fasta
2018-02-22 15:41:16 47200 OUT/snps/genome01_snps/somecall.xxx
2018-02-22 15:41:15 47200 OUT/snps/genome02_snps/somecall.xxx
*/
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment