Skip to content

Instantly share code, notes, and snippets.

@srynobio
Last active August 25, 2021 22:28
Show Gist options
  • Save srynobio/370f0fca4990f84f357761a9c9b18799 to your computer and use it in GitHub Desktop.
Save srynobio/370f0fca4990f84f357761a9c9b18799 to your computer and use it in GitHub Desktop.

I've tried to add notes along to help get you on the right path:

params.fasta = "/kbod2/ReferenceGenomes/GRCh37/human_g1k_v37_decoy_phix.fasta"
params.fai = "/kbod2/ReferenceGenomes/GRCh37/human_g1k_v37_decoy_phix.fai"
params.dbsnp = "/ybod2/cavery/Resources/dbsnp_146.hg38.vcf"
params.reference_mills_1000G = "/ybod2/cavery/Resources/Mills_and_1000G_gold_standard.indels.hg38.vcf"
params.reference_1000G_HighConfSNP = "/ybod2/cavery/Resources/1000G_phase1.snps.high_confidence.hg38.vcf"
params.results = "/ybod2/cavery/gatk_nextflow_files/"


Channel.fromFilePairs("/ybod2/cavery/Emory_Prahal_JIA_Fam/SL88821/*_R{1,2}_001.fastq.gz", flat: true)
  .set { reads }

fasta_file = file(params.fasta)
fai_file = file(params.fai)
dbsnp_file = file(params.dbsnp)
reference_mills_1000G_file = file(params.reference_mills_1000G)
reference_1000G_HighConfSNP_file = file(params.reference_1000G_HighConfSNP)

Here you take values in as params and then set them as files, unless there is a reason for the second step it's not necessary as they are paths to "background" data files. Also keep in mind that when params are set they can be referenced anywhere in your code using the following syntax:

params.fasta = "/kbod2/ReferenceGenomes/GRCh37/human_g1k_v37_decoy_phix.fasta" ${fasta} or !{fasta} depending on what process type you use. If you're using script the '$' version is what you want.

Here I've added some clean up and notes along the way:

process alignment {

	Input:
	file ref_fasta from fasta_file, set val(id), file(read1), file(read2) from reads

	input:
	file ref_fasta from fasta_file <--- not needed as you created a variable (above)
	set val(id), file(read1), file(read2) from reads
	## set is only used for values all coming from the same input channel.

	output:
	## for the flagstat below example:
	set id, file("${id}_sorted_Dedup.bam") into bams_out

	Script:
	"""
	bwa mem -t 16 -k 24 -R "@RG\tID:{id}\tPL:ILLUMINA\tPU:ILLUMINA-1\tSM:${id}" ${reference_fasta} ${read1}.fastq.gz ${read2}.fastq.gz | /ybod/csteely/bin/samblaster/samblaster --addMateTags | 
	sambamba view -f bam -l 0 -S /dev/stdin | sambamba sort --tmpdir=temp_storage/ -m 50G -o ${id_sorted_Dedup.bam /dev/stdin
	"""
}

process flagstat {

	Input:
	file sorted_dedup_bams, set val(id)
	## Here an issue occurs because you are trying to set a variable 'id' but there is no incoming channel that has it.
	## something more like the following (if values were coming through the channel).
	set val(id), file(sorted_dedup_bams) from bams_out

	Output:
	file ("${id}.sorted_Dedup.flagstat") into flagstats_channel

	Script:
	"""
	samtools flagstat ${sorted_dedup_bams} > ${id}.sorted_Dedup.flagstat 2> flagstat.log-1
	## this is correct.
	"""
}

### so now that you have the content of bam_out going to two different places you need to split the channel using into: https://www.nextflow.io/docs/latest/operator.html#into

process base_recal {
	conda '/ybod2/cavery/Bin/gatk-4.0.11.0/gatkcondaenv.yml' (best practice)

	Input (updated):
	Yours: set file ref_fasta from fasta_file, val(id), file(sorted_dedup_bams) from bams_out
	set val(id), file(sorted_dedup_bams) from new_channel_name_from_the_split 
	
	These last three do not need to come in through channels as they have been set as variables, and you are using them correctly below.
	file ref_dbsnp from dbsnp_file
	file ref_mills from reference_mills_1000G_file, 
	file ref_highconf from reference_1000G_HighConfSNP_file

	Output:
	file ("${id}.recal_data.table") into bqsr_apply_table

	Script:

	"""
	gatk BaseRecalibrator -R ${ref_fasta} -I ${sorted_dedup_bams} --known-sites ${ref_dbsnp} --known sites ${ref_mills} --known-sites ${ref_highconf} -O ${id}.recal_data.table
	"""
}

Here is some of my public codebase using Nextflow and some links to specific ideas from above.

Here I use a variable to populate throughout the main nf script (search to see) splitting channels and here

Using set

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment