@jbyars
Last active May 24, 2016 21:21
I'm a bit confused about how to achieve rational resume behavior when combining `publishDir` and `when:`. The idea is we have 3 input files and 3 processes. Assume each stage produces some output we want to preserve, which also needs to be used in subsequent stages. No way are we lucky enough to have independent processes :-P
#!/usr/bin/env nextflow
// Baseline example of processes. Everything is supposed to work right
// in this one.

params.srcDir = 'in'
params.dstDir = 'out'

// create the test cases
new File(params.srcDir + "/moe.txt").createNewFile()
new File(params.srcDir + "/curly.txt").createNewFile()
new File(params.srcDir + "/larry.txt").createNewFile()

Channel
    .fromPath( "${params.srcDir}/*.txt" )
    .ifEmpty { error "Cannot find any reads matching: ${params.srcDir}" }
    .set { inputfiles }

process one {
    tag { input }
    publishDir params.dstDir, mode: 'copy'

    input:
    file input from inputfiles

    output:
    file "${input.getBaseName()}_one.txt" into onefiles

    when:
    !file("${params.dstDir}/${input.getBaseName()}_one.txt").exists()

    script:
    """
    cp $input ${input.getBaseName()}_one.txt
    """
}

process two {
    tag { input }
    publishDir params.dstDir, mode: 'copy'

    input:
    file input from onefiles

    output:
    file "${twoInput}_two.txt" into twofiles

    when:
    !file("${params.dstDir}/${twoInput}_two.txt").exists()

    script:
    twoInput = "$input".split("_")[0]
    """
    cp $input ${twoInput}_two.txt
    """
}

process three {
    tag { input }
    publishDir params.dstDir, mode: 'copy'

    input:
    file input from twofiles

    output:
    file "${threeInput}_three.txt" into threefiles

    when:
    !file("${params.dstDir}/${threeInput}_three.txt").exists()

    script:
    threeInput = "$input".split("_")[0]
    """
    cp $input ${threeInput}_three.txt
    """
}

The first time the code runs, the 3 input files are generated. The 3 processes run and the 9 output files are generated. No problem.

N E X T F L O W  ~  version 0.19.1
Launching nextflow-aws-examples/resume_examples/baseline.nf
[warm up] executor > pbs
[d0/fbf18e] Submitted process > one (larry.txt)
[2c/878112] Submitted process > one (curly.txt)
[f1/8b0d29] Submitted process > one (moe.txt)
[23/26319b] Submitted process > two (larry_one.txt)
[95/c4f723] Submitted process > two (curly_one.txt)
[e1/5e73dd] Submitted process > two (moe_one.txt)
[ab/e0e65b] Submitted process > three (larry_two.txt)
[1f/b4f520] Submitted process > three (curly_two.txt)
[ad/78badf] Submitted process > three (moe_two.txt)

in/ 
curly.txt  larry.txt  moe.txt
out/
curly_one.txt    curly_two.txt  larry_three.txt  moe_one.txt    moe_two.txt
curly_three.txt  larry_one.txt  larry_two.txt    moe_three.txt

Run the Nextflow script again and no new jobs are submitted. Great! Now consider common processing problems. A new file shows up. I'll delete the moe results to simulate one.

N E X T F L O W  ~  version 0.19.1
Launching nextflow-aws-examples/resume_examples/baseline.nf
[warm up] executor > pbs
[23/27b31d] Submitted process > one (moe.txt)
[49/38f15f] Submitted process > two (moe_one.txt)
[73/7e20cb] Submitted process > three (moe_two.txt)

out/
curly_one.txt    curly_two.txt  larry_three.txt  moe_one.txt    moe_two.txt
curly_three.txt  larry_one.txt  larry_two.txt    moe_three.txt

Great, only the jobs for moe were submitted. Adding a new file to the dataset is not a problem. What about when things go wrong? Let's say intermediate jobs die in some bizarre and creative way... Let's remove moe_two.txt and moe_three.txt from the output directory. We'll run with the --resume option.

N E X T F L O W  ~  version 0.19.1
Launching nextflow-aws-examples/resume_examples/baseline.nf
[warm up] executor > pbs

out/
curly_one.txt  curly_three.txt  curly_two.txt  larry_one.txt  larry_three.txt  larry_two.txt  moe_one.txt

The jobs for the missing outputs did not queue. Processes two and three don't realize they're missing work for moe, because the output channel from process one never emits moe_one.txt: moe_one.txt still exists in the publish directory, so process one's `when:` condition is false and it never runs. When I use publishDir, I can't rely on the published output being available at the start of the next process. I can't use scratchDir because it doesn't work with S3 buckets.

How do I deal with this? I'm stuck on the idea that `when:` should determine whether the script section runs, but the channel output needs to be emitted either way in order to catch missing downstream jobs. --resume doesn't seem to get me out of this.
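
For illustration, the kind of workaround I can imagine is an untested sketch rather than anything from the script above (`publishedOnes` and `twoInputs` are made-up names): glob the publish directory and mix those files back into the channel that feeds process two, so downstream work still happens even when process one is skipped by its `when:` condition.

// Untested sketch: feed process two from both fresh outputs and files
// already sitting in the publish directory, deduplicated by base name.
// 'publishedOnes' and 'twoInputs' are hypothetical names for illustration.
publishedOnes = Channel.fromPath( "${params.dstDir}/*_one.txt" )

onefiles
    .mix( publishedOnes )
    .unique { it.getBaseName() }   // drop duplicates when both sources provide the same file
    .set { twoInputs }

// process two would then declare:  file input from twoInputs

But doing that dance between every pair of processes feels like fighting the framework, which is why I'm hoping there's a cleaner pattern.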
