@jbyars
Last active May 24, 2016 21:21
I'm a bit confused about how to achieve rational resume behavior when combining `publishDir` and `when:`. The idea is we have 3 input files and 3 processes. Assume each stage produces some output we want to preserve, which also needs to be used in subsequent stages. No way are we lucky enough to have independent processes :-P
#!/usr/bin/env nextflow
// Baseline example of processes. Everything is supposed to work right
// in this one.

params.srcDir = 'in'
params.dstDir = 'out'

// create the test cases
new File(params.srcDir + "/moe.txt").createNewFile()
new File(params.srcDir + "/curly.txt").createNewFile()
new File(params.srcDir + "/larry.txt").createNewFile()

Channel
    .fromPath( "${params.srcDir}/*.txt" )
    .ifEmpty { error "Cannot find any reads matching: ${params.srcDir}" }
    .set { inputfiles }

process one {
    tag { input }
    publishDir params.dstDir, mode: 'copy'

    input:
    file input from inputfiles

    output:
    file "${input.getBaseName()}_one.txt" into onefiles

    when:
    !file("${params.dstDir}/${input.getBaseName()}_one.txt").exists()

    script:
    """
    cp $input ${input.getBaseName()}_one.txt
    """
}

process two {
    tag { input }
    publishDir params.dstDir, mode: 'copy'

    input:
    file input from onefiles

    output:
    file "${twoInput}_two.txt" into twofiles

    when:
    !file("${params.dstDir}/${twoInput}_two.txt").exists()

    script:
    twoInput = "$input".split("_")[0]
    """
    cp $input ${twoInput}_two.txt
    """
}

process three {
    tag { input }
    publishDir params.dstDir, mode: 'copy'

    input:
    file input from twofiles

    output:
    file "${threeInput}_three.txt" into threefiles

    when:
    !file("${params.dstDir}/${threeInput}_three.txt").exists()

    script:
    threeInput = "$input".split("_")[0]
    """
    cp $input ${threeInput}_three.txt
    """
}

The first time the code runs, the 3 input files are generated. The 3 processes run and the 9 output files are generated. No problem.

N E X T F L O W  ~  version 0.19.1
Launching nextflow-aws-examples/resume_examples/baseline.nf
[warm up] executor > pbs
[d0/fbf18e] Submitted process > one (larry.txt)
[2c/878112] Submitted process > one (curly.txt)
[f1/8b0d29] Submitted process > one (moe.txt)
[23/26319b] Submitted process > two (larry_one.txt)
[95/c4f723] Submitted process > two (curly_one.txt)
[e1/5e73dd] Submitted process > two (moe_one.txt)
[ab/e0e65b] Submitted process > three (larry_two.txt)
[1f/b4f520] Submitted process > three (curly_two.txt)
[ad/78badf] Submitted process > three (moe_two.txt)

in/ 
curly.txt  larry.txt  moe.txt
out/
curly_one.txt    curly_two.txt  larry_three.txt  moe_one.txt    moe_two.txt
curly_three.txt  larry_one.txt  larry_two.txt    moe_three.txt

Run the Nextflow script again and no new jobs are submitted. Great! Now consider common processing problems. A new file shows up. I'll delete the moe results to simulate one.

N E X T F L O W  ~  version 0.19.1
Launching nextflow-aws-examples/resume_examples/baseline.nf
[warm up] executor > pbs
[23/27b31d] Submitted process > one (moe.txt)
[49/38f15f] Submitted process > two (moe_one.txt)
[73/7e20cb] Submitted process > three (moe_two.txt)

out/
curly_one.txt    curly_two.txt  larry_three.txt  moe_one.txt    moe_two.txt
curly_three.txt  larry_one.txt  larry_two.txt    moe_three.txt

Great, only the jobs for moe were submitted. Adding a new file to the dataset is not a problem. What about when things go wrong? Let's say intermediate jobs die in some bizarre and creative way... Let's remove moe_two.txt and moe_three.txt from the output directory. We'll run with the --resume option.

N E X T F L O W  ~  version 0.19.1
Launching nextflow-aws-examples/resume_examples/baseline.nf
[warm up] executor > pbs

out/
curly_one.txt  curly_three.txt  curly_two.txt  larry_one.txt  larry_three.txt  larry_two.txt  moe_one.txt

The jobs for the missing outputs did not queue. Processes two and three don't realize they're missing work for moe, because the output channel from process one never emits moe_one.txt: moe_one.txt still exists in the publish directory, so process one's `when:` condition is false and it never runs. When I use publishDir, I can't rely on the published output being available at the start of the next process. I can't use scratchDir because it doesn't work with S3 buckets.

How do I deal with this? I'm stuck on the idea that `when:` should determine whether the script section runs, but the channel output needs to be emitted either way in order to catch missing downstream jobs. --resume doesn't seem to get me out of this.
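
For illustration, the kind of workaround I can imagine is an untested sketch rather than anything from the script above (`publishedOnes` and `twoInputs` are made-up names): glob the publish directory and mix those files back into the channel that feeds process two, so downstream work still happens even when process one is skipped by its `when:` condition.

// Untested sketch: feed process two from both fresh outputs and files
// already sitting in the publish directory, deduplicated by base name.
// 'publishedOnes' and 'twoInputs' are hypothetical names for illustration.
publishedOnes = Channel.fromPath( "${params.dstDir}/*_one.txt" )

onefiles
    .mix( publishedOnes )
    .unique { it.getBaseName() }   // drop duplicates when both sources provide the same file
    .set { twoInputs }

// process two would then declare:  file input from twoInputs

But doing that dance between every pair of processes feels like fighting the framework, which is why I'm hoping there's a cleaner pattern.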
