BAM files generated by Exome sequencing at several sequencing service centers do not comply with the required specs assumed by the GATK tools. Here are steps involved in processing these raw bam files to prepare them for analysis by other tools.
These steps are formatted as a sequence of bocks in a make file. Before getting started, make sure that you have specified the path to the following pre-requisites in the header of the make file.
# general workspace path definitions
SRC=/my/source/directory/with/installed/tools/
temp_folder=/mnt/hn07_disk001/my-project/tmp/
SCRATCH=/mnt/hn07_disk001/my-project/pipeline
### Data
# reference fasta sequence
# substitute x in hgx with correct assembly version.
# For most recent studies, it should be hg19.
# this can be verified by viewing the bam header using
# samtools view my-bam.bam
refSequence=/projects/my-project/resources/fasta/hgx/hgx.fa
# targeted-seq or exome capture target region
Target=/projects/my-project/resources/agilent/Agilent.target.hgx.bed
# gatk analysis data bundle
gatk_bundle=/projects/my-project/resources/gatk_bundle
# substitute the files below with correct organism known snp and indel list in vcf format
commonSNPs=/projects/my-project/resources/knownVariation/mgp.v4.snps.dbSNP.fixed.reordered.vcf
commonIndels=/projects/my-project/resources/knownVariation/mgp.v4.indels.dbSNP.fixed.reordered.vcf
dbsnpAll=/projects/clonal-evolution/my-project/resources/knownVariation/mgp.v4.all.fixed.reordered.vcf
Here is an approximate list of tools we will be using in the pre-processing and mutation calling pipeline.
samtools=/projects/my-project/src/samtools-0.1.19/samtools
picard=/projects/my-project/src/picard/picard-tools-1.110
gatk=/projects/my-project/src/gatk/GenomeAnalysisTK.jar
bamtools=/projects/my-project/src/bamtools/bin/bamtools-2.3.0
MuTect=/projects/my-project/src/MuTect/mutect-src/target/mutect-1.1.7.jar
# downloaded from https://www.broadinstitute.org/cancer/cga/indelocator_download
old_gatk=/projects/my-project/src/old-indelocator/IndelGenotyper.36.3336-GenomeAnalysisTK.jar
VarScan=/projects/clonal-evolution/my-project/src/VarScan.v2.3.7.jar
Scalpel=/projects/clonal-evolution/my-project/src/scalpel-0.3.2/scalpel
snpEff=/projects/clonal-evolution/my-project/src/SNPEff/snpEff/snpEff.jar
# this is a simple wrapper script around bamtools
# you can copy it over from
# /projects/clonal-evolution/Human/src/bamtools/bamtoolsStatRunner.py
bamtools_stats=/projects/my-project/src/bamtools/bamtoolsStatRunner.py
# other parameter:
numThreads=4
heap=56000
When appropriate, please substitute the data bundles specified above with appropriate files for your experiment.
Generate a bam file index. Please look for any 'truncated file' or 'segmentation fault' error, as they are possible indicators of corrupt input files.
index:
${samtools} index ${SCRATCH}/${patientid}/${sample}.bam
One can also count then number of lines in each sample bam file to detect any truncations not reported above by comparison across sample; i.e. under similar circumstances (targeted depth of coverage), one expects to see approximately the same number of lines in each bam file.
countLines:
${samtools} view ${SCRATCH}/${patientid}/${sample}.bam | wc -l
reheader:
${samtools} view -H ${SCRATCH}/${patientid}/${sample}.bam | sed 's/.fa//g' > \
${SCRATCH}/${patientid}/${sample}.header.sam
${samtools} reheader ${SCRATCH}/${patientid}/${sample}.header.sam \
${SCRATCH}/${patientid}/${sample}.bam > \
${SCRATCH}/${patientid}/${sample}.headerFixed.bam
This makes the bam file header consistent with chromosome specifications in the reference file.
reorderBAM:
java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
-jar ${picard}/ReorderSam.jar \
INPUT=${SCRATCH}/${patientid}/${sample}.headerFixed.bam \
OUTPUT=${SCRATCH}/${patientid}/${sample}.reordered.bam \
VALIDATION_STRINGENCY=LENIENT \
REFERENCE=${refSequence}
This command makes chromosomal orders in the bam file consistent with the reference fasta file.
sortBAM:
java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
-jar ${picard}/SortSam.jar \
INPUT= ${SCRATCH}/${patientid}/${sample}.reordered.bam \
OUTPUT=${SCRATCH}/${patientid}/${sample}.sorted.bam \
SORT_ORDER=coordinate \
VALIDATION_STRINGENCY=LENIENT
Sorts reads by position in each chromosome.
If your experiment data is generated using paired-end sequencing technology, you need to fix read mate information at this point.
fixMate:
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/${patient} \
-jar ${picard}/FixMateInformation.jar \
INPUT=${SCRATCH}/${patientid}/${sample}.sorted.bam \
OUTPUT=${SCRATCH}/${patientid}/${sample}.fxmt.bam \
SO\=coordinate \
CREATE_INDEX\=true \
VALIDATION_STRINGENCY\=SILENT
python ${bamtools_stats} \
-i ${SCRATCH}/${patientid}/${sample}.fxmt.bam\
-o ${SCRATCH}/${patientid}/${sample}.fxmt.stats
checkDups:
java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/${patient} \
-jar ${picard}/MarkDuplicates.jar \
I\=${SCRATCH}/${patientid}/${sample}.fxmt.bam \
O\=${SCRATCH}/${patientid}/${sample}.chkdup.bam \
M\=${SCRATCH}/${patientid}/${sample}.duplicate_report.txt \
VALIDATION_STRINGENCY\=SILENT \
REMOVE_DUPLICATES\=false
python ${bamtools_stats}\
-i ${SCRATCH}/${patientid}/${sample}.chkdup.bam \
-o ${SCRATCH}/${patientid}/${sample}.chkdup.stats
You can inspect *.chkdup.stats files to get an estimate of the fraction of duplicate reads (as marked by Picard), as well as other summary information on each bam file.
removeDups:
java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/${patient} \
-jar ${picard}/MarkDuplicates.jar \
I\=${SCRATCH}/${patientid}/${sample}.fxmt.bam \
O\=${SCRATCH}/${patientid}/${sample}.rmdup.bam \
M\=${SCRATCH}/${patientid}/${sample}.duplicate_report.txt \
VALIDATION_STRINGENCY\=SILENT \
REMOVE_DUPLICATES\=true
CAUTION: In targeted sequencing experiments, it is typical to have a very large fraction of reads marked as duplicates. In this case, please skip the removeDups
step and substitute by the following.
skipRmDup:
ln -s ${SCRATCH}/${patientid}/${sample}.fxmt.bam \
${SCRATCH}/${patientid}/${sample}.rmdup.bam
addGrp:
java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/${patient} \
-jar ${picard}/AddOrReplaceReadGroups.jar \
RGLB\=${sample}.fastq \
RGPL\=Illumina \
RGPU\=${sample} \
RGSM\=${sample} \
I\=${SCRATCH}/${patientid}/${sample}.rmdup.bam \
O\=${SCRATCH}/${patientid}/${sample}.rmdup.grp.bam \
SORT_ORDER\=coordinate \
CREATE_INDEX\=true \
VALIDATION_STRINGENCY\=SILENT
python ${bamtools_stats}\
-i ${SCRATCH}/${patientid}/${sample}.rmdup.grp.bam \
-o ${SCRATCH}/${patientid}/${sample}.rmdup.stats
mv -f ${SCRATCH}/${patientid}/${sample}.rmdup.grp.bam ${SCRATCH}/${patientid}/${sample}.rmdup.bam
mv -f ${SCRATCH}/${patientid}/${sample}.rmdup.grp.bai ${SCRATCH}/${patientid}/${sample}.rmdup.bai
Please substitute Illumina with appropriate sequencing platform.
If read duplicates were removed, comparing the stats files ${sample}.chkdup.stats, and ${sample}.rmdup.stats provides an idea of the impact of duplicate removal.
mkIntervals:
mkdir -p ${temp_folder}/tmp_realign_${sample}
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/tmp_realign_${sample} \
-jar ${gatk} \
-T RealignerTargetCreator \
-I ${SCRATCH}/${patientid}/${sample}.rmdup.bam \
-fixMisencodedQuals \
-R ${refSequence} \
-known ${commonSNPs} \
-known ${commonIndels} \
-nt ${numThreads} \
-o ${SCRATCH}/${patientid}/${sample}.forRealign.intervals
rm -rf ${temp_folder}/tmp_realign_${sample}
cleanSam:
java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
-jar ${picard}/CleanSam.jar \
INPUT=${SCRATCH}/${patientid}/${sample}.rmdup.bam \
OUTPUT=${SCRATCH}/${patientid}/${sample}.clean.rmdup.bam \
CREATE_INDEX\=true
The command above soft-clips the reads hanging from chromosome ends, and sets MAPQ to 0 for unmapped reads.
realign:
mkdir -p ${temp_folder}/realign_${sample}
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/realign_${sample}\
-jar ${gatk}\
-T IndelRealigner\
-I ${SCRATCH}/${patientid}/${sample}.clean.rmdup.bam\
-R ${refSequence}\
-targetIntervals ${SCRATCH}/${patientid}/${sample}.forRealign.intervals \
--out ${SCRATCH}/${patientid}/${sample}.realigned.bam \
-known ${commonSNPs}\
-known ${commonIndels} \
--fix_misencoded_quality_scores
python ${bamtools_stats}\
-i ${SCRATCH}/${patientid}/${sample}.realigned.bam\
-o ${SCRATCH}/${patientid}/${sample}.realigned.stats
rm -rf ${temp_folder}/realign_${sample}
BQSR:
mkdir -p ${temp_folder}/tmp_recal_${sample}
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/tmp_recal_${sample}\
-jar ${gatk}\
-T BaseRecalibrator\
-R ${refSequence}\
-I ${SCRATCH}/${patientid}/${sample}.realigned.bam\
-knownSites ${commonSNPs}\
-knownSites ${commonIndels}\
-o ${SCRATCH}/${patientid}/${sample}.recal_data.table\
-nct ${nct}
rm -rf ${temp_folder}/tmp_recal_${sample}
BQSR2:
mkdir -p ${temp_folder}/tmp_recal_${sample}
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/tmp_recal_${sample}\
-jar ${gatk}\
-T BaseRecalibrator\
-R ${refSequence}\
-I ${SCRATCH}/${patientid}/${sample}.realigned.bam\
-knownSites ${commonSNPs}\
-knownSites ${commonIndels}\
-BQSR ${SCRATCH}/${patientid}/${sample}.recal_data.table\
-nct 2\
-o ${SCRATCH}/${patientid}/${sample}.post_recal_data.table
rm -rf ${temp_folder}/tmp_recal_${sample}
recalPlot:
java -Xmx${heap}m -jar ${gatk}\
-T AnalyzeCovariates\
-csv ${SCRATCH}/${patientid}/${sample}.report.csv \
-R ${refSequence}\
-before ${SCRATCH}/${patientid}/${sample}.recal_data.table\
-after ${SCRATCH}/${patientid}/${sample}.post_recal_data.table\
-plots ${SCRATCH}/${patientid}/${sample}.recalibration_plots.pdf
This command requires the R
package gsalib
.
applyRecal:
mkdir -p ${temp_folder}/applyRecal_${sample}
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/applyRecal_${sample}\
-jar ${gatk}\
-T PrintReads\
-R ${refSequence}\
-I ${SCRATCH}/${patientid}/${sample}.realigned.bam\
-BQSR ${SCRATCH}/${patientid}/${sample}.recal_data.table\
-o ${SCRATCH}/${patientid}/${sample}.recalibrated.bam\
-nct 4
rm -rf ${temp_folder}/applyRecal_${sample}
callVariantsUG:
mkdir -p ${temp_folder}/var_UG_${sample}
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/var_UG_${sample}\
-jar ${gatk}\
-T UnifiedGenotyper\
-R ${refSequence}\
-I ${SCRATCH}/${patientid}/${sample}.recalibrated.bam\
-ploidy 2\
-glm BOTH\
-dcov 250\
-stand_emit_conf 10\
-stand_call_conf 30\
-o ${SCRATCH}/${patientid}/${sample}.raw_variants.vcf \
--dbsnp ${dbsnpAll} \
-nt 8 \
-nct 2
rm -rf ${temp_folder}/var_UG_${sample}
germlineSNPs:
java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/var_filter_UG_${sample} \
-jar ${gatk} \
-T SelectVariants \
-R ${refSequence} \
-V ${SCRATCH}/${patientid}/${sample}.raw_variants.vcf \
-selectType SNP \
-o ${SCRATCH}/${patientid}/${sample}.raw_snps.vcf \
java -jar -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/var_filter_UG_${sample} \
-jar ${gatk} \
-T VariantFiltration \
-R ${refSequence} \
-V ${SCRATCH}/${patientid}/${sample}.raw_snps.vcf \
--filterExpression "QD < 2.0" \
--filterName "QD" \
--filterExpression "FS > 60.0" \
--filterName "FS" \
--filterExpression "MQ < 40.0" \
--filterName "MQ" \
--filterExpression "HaplotypeScore > 13.0" \
--filterName "HapScore" \
--filterExpression "MQRankSum < -12.5" \
--filterName "MQRankSum" \
--filterExpression "ReadPosRankSum < -8.0" \
--filterName "ReadPosRankSum" \
--filterExpression "DP < 20" \
--filterName "low_coverage" \
-o ${SCRATCH}/${patientid}/${sample}.filtered_snps.vcf
java -Xmx${heap}m \
-jar ${gatk} \
-T SelectVariants \
-R ${refSequence} \
--variant ${SCRATCH}/${patientid}/${sample}.filtered_snps.vcf \
-select 'vc.isNotFiltered()' \
-o ${SCRATCH}/${patientid}/${sample}.germline_snps.vcf
The above commands create an intial list of germline variant calls, mark the subset failing any of the pre-defined criteria, and finally drop the marked positions.
The output of the process is *.germline_snps.vcf
file.
Germline indel calls can be made using a similar sequence of commands; However since they will not be referenced in the current document, these commanads are not included.
SomaticCalls:
java -Xmx${heap}m \
-jar ${MuTect} \
--analysis_type MuTect \
--reference_sequence ${refSequence} \
--dbsnp ${dbsnpAll} \
--input_file:normal ${SCRATCH}/${patientid}/${germline}.recalibrated.bam \
--input_file:tumor ${SCRATCH}/${patientid}/${tumor}.recalibrated.bam \
--out ${SCRATCH}/${patientid}/${tumor}.call_stats.txt \
--vcf ${SCRATCH}/${patientid}/${tumor}.MuTect.vcf
filterSomaticCalls:
java -Xmx${heap}m \
-jar ${gatk} \
-T SelectVariants \
-R ${refSequence} \
--variant ${SCRATCH}/${patientid}/${tumor}.MuTect.vcf \
-select 'vc.isNotFiltered()' \
-o ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.vcf
Generate an intial list of somatic single base substitutions using MuTect default settings, and filter out the calls that do not pass MuTect's filter.
annotateSomaticCalls:
java -Xmx4g -jar ${snpEff} $assemblyID \
${SCRATCH}/${patientid}/${tumor}.MuTect.pass.vcf > \
${SCRATCH}/${patientid}/${tumor}.MuTect.pass.annotated.vcf
Please set assemblyID
to the assembly version matching your experiment in SnpEff; e.g. GRCh37 for hg19.
A better alternative is passing the above vcf file to CRAVAT and downloading the results to ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.annotated.vcf
.
GATKSomaticIndels:
java -Xmx${heap}m \
-jar ${old_gatk} \
-T IndelGenotyperV2 \
--somatic \
--maxNumberOfReads 100000 \
--input_file:normal ${SCRATCH}/${patientid}/${germline}.recalibrated.bam \
--input_file:tumor ${SCRATCH}/${patientid}/${tumor}.recalibrated.bam \
--verboseOutput ${SCRATCH}/${patientid}/gatk/${tumor}.indel \
-o ${SCRATCH}/${patientid}/gatk/${tumor}.indels.vcf \
--reference_sequence ${refSequence} \
--intervals ${Target} \
--minCoverage 20 \
--minNormalCoverage 20
filterGATKIndels:
grep "#" ${SCRATCH}/${patientid}/gatk/${tumor}.indels.vcf > \
${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.vcf
grep "SOMATIC" ${SCRATCH}/${patientid}/gatk/${tumor}.indels.vcf | \
grep -v '#' >> \
${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.vcf
annotateGATKIndels:
java -Xmx4g -jar ${snpEff} $assemblyID \
${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.vcf > \
${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.annotated.vcf
germlineHetPos:
grep -w 'AC=1' ${SCRATCH}/${patientid}/${sample}.germline_snps.vcf | \
cut -f1,2 | /apps/tawk/tawk '{print $$1, ($$2-1),$$2, NR}' > \
${SCRATCH}/${patientid}/${sample}.germline_het.bed
grep -w 'AC=1' ${SCRATCH}/${patientid}/${sample}.germline_snps.vcf | \
/apps/tawk/tawk '{print $$1, ($$2-1),$$2, "+",$$4"/"$$5 , NR}' > \
${SCRATCH}/${patientid}/${sample}.germline_het.annotated.bed
mpileup:
echo "${SCRATCH}/${patientid}/${sample}.recalibrated.bam" > \
${SCRATCH}/${patientid}/${sample}.bamfiles.path
${samtools} mpileup -f ${refSequence} \
-l ${SCRATCH}/${patientid}/${sample}.germline_het.bed \
-b ${SCRATCH}/${patientid}/${sample}.bamfiles.path >\
${SCRATCH}/${patientid}/${sample}.het.mpileup
# please copy over the mpileup2baf python script from
# /projects/clonal-evolution/Mouse/C2/exprs/baf-by-band/scripts/
python /projects/my-project/src/scripts/mpileup2baf.py \
-m ${SCRATCH}/${patientid}/${sample}.het.mpileup \
-b ${SCRATCH}/${patientid}/${sample}.germline_het.annotated.bed > \
${SCRATCH}/${patientid}/${sample}.baf.tsv
Please monitor the disk usage before and after each step in the pipeline above. In my experience, it is good practice to keep intermediate bam files corresponding to two to three previous steps in the pipeline. This ensures efficient trace back of potential problems, without loss of computation time spent.
In Karchin lab, we use /mnt/hn07_disk001/
to store, and process these raw bam files. Please ask Kyle to create a project specific directory for you under this path, and adjust your pipeline make file accordingly.