
BAM files generated by exome sequencing at several sequencing service centers do not comply with the specifications assumed by the GATK tools. Below are the steps involved in processing these raw BAM files to prepare them for analysis by other tools.

These steps are formatted as a sequence of blocks in a makefile. Before getting started, make sure you have specified the paths to the following prerequisites in the header of the makefile.

Prerequisites:

# general workspace path definitions
SRC=/my/source/directory/with/installed/tools/
temp_folder=/mnt/hn07_disk001/my-project/tmp/
SCRATCH=/mnt/hn07_disk001/my-project/pipeline 

### Data 
# reference fasta sequence
# substitute x in hgx with correct assembly version.
# For most recent studies, it should be hg19.
# this can be verified by viewing the bam header using 
# samtools view my-bam.bam
refSequence=/projects/my-project/resources/fasta/hgx/hgx.fa

# targeted-seq or exome capture target region
Target=/projects/my-project/resources/agilent/Agilent.target.hgx.bed

# gatk analysis data bundle
gatk_bundle=/projects/my-project/resources/gatk_bundle

# substitute the files below with the known SNP and indel lists (in VCF format) for the correct organism
commonSNPs=/projects/my-project/resources/knownVariation/mgp.v4.snps.dbSNP.fixed.reordered.vcf
commonIndels=/projects/my-project/resources/knownVariation/mgp.v4.indels.dbSNP.fixed.reordered.vcf
dbsnpAll=/projects/clonal-evolution/my-project/resources/knownVariation/mgp.v4.all.fixed.reordered.vcf

Tools

Here is an approximate list of tools we will be using in the pre-processing and mutation calling pipeline.

Essential Tools

samtools=/projects/my-project/src/samtools-0.1.19/samtools
picard=/projects/my-project/src/picard/picard-tools-1.110
gatk=/projects/my-project/src/gatk/GenomeAnalysisTK.jar
bamtools=/projects/my-project/src/bamtools/bin/bamtools-2.3.0
MuTect=/projects/my-project/src/MuTect/mutect-src/target/mutect-1.1.7.jar

# downloaded from https://www.broadinstitute.org/cancer/cga/indelocator_download
old_gatk=/projects/my-project/src/old-indelocator/IndelGenotyper.36.3336-GenomeAnalysisTK.jar

Additional Tools

VarScan=/projects/clonal-evolution/my-project/src/VarScan.v2.3.7.jar
Scalpel=/projects/clonal-evolution/my-project/src/scalpel-0.3.2/scalpel
snpEff=/projects/clonal-evolution/my-project/src/SNPEff/snpEff/snpEff.jar

# this is a simple wrapper script around bamtools
# you can copy it over from 
# /projects/clonal-evolution/Human/src/bamtools/bamtoolsStatRunner.py
bamtools_stats=/projects/my-project/src/bamtools/bamtoolsStatRunner.py

# other parameters:
numThreads=4
heap=56000

Where appropriate, please substitute the data bundles specified above with files matching your experiment.
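The make targets below refer to per-sample variables such as ${patientid} and ${sample} (and, in the calling steps, ${tumor} and ${germline}). These are assumed here to be supplied on the make command line; for example (hypothetical identifiers shown):

# sketch of how a target is invoked; substitute your own patient and sample names
make index patientid=Patient01 sample=Patient01_normal
make reheader patientid=Patient01 sample=Patient01_normal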

Pre-processing pipeline:

Index:

Generate a bam file index. Please watch for any 'truncated file' or 'segmentation fault' errors, as these are possible indicators of a corrupt input file.

index:
        ${samtools} index ${SCRATCH}/${patientid}/${sample}.bam

One can also count the number of lines in each sample bam file to detect any truncations not reported above, by comparing across samples; i.e. under similar circumstances (targeted depth of coverage), one expects to see approximately the same number of lines in each bam file.

countLines:
        ${samtools} view ${SCRATCH}/${patientid}/${sample}.bam | wc -l

Correcting the header:

reheader:
        ${samtools} view -H ${SCRATCH}/${patientid}/${sample}.bam | sed 's/\.fa//g' > \
        ${SCRATCH}/${patientid}/${sample}.header.sam

        ${samtools} reheader  ${SCRATCH}/${patientid}/${sample}.header.sam \
        ${SCRATCH}/${patientid}/${sample}.bam > \
        ${SCRATCH}/${patientid}/${sample}.headerFixed.bam

This makes the bam file header consistent with chromosome specifications in the reference file.

Reordering chromosomes:

reorderBAM:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
        -jar ${picard}/ReorderSam.jar \
        INPUT=${SCRATCH}/${patientid}/${sample}.headerFixed.bam \
        OUTPUT=${SCRATCH}/${patientid}/${sample}.reordered.bam \
        VALIDATION_STRINGENCY=LENIENT \
        REFERENCE=${refSequence}

This command makes chromosomal orders in the bam file consistent with the reference fasta file.
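To verify the result, one can compare the @SQ order in the new header against the sequence order in the reference index. Below is a minimal sketch (the target name is hypothetical); it assumes the reference has been indexed with samtools faidx so that ${refSequence}.fai exists, and that SN: is the second tab-delimited field of each @SQ line, as samtools prints it.

checkOrder:
        # sequence names, in header order, from the reordered bam
        ${samtools} view -H ${SCRATCH}/${patientid}/${sample}.reordered.bam | \
        grep '^@SQ' | cut -f2 | sed 's/^SN://' > \
        ${SCRATCH}/${patientid}/${sample}.sq_order.txt

        # compare against the reference index; empty diff output means the orders match
        cut -f1 ${refSequence}.fai | \
        diff - ${SCRATCH}/${patientid}/${sample}.sq_order.txt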

Sorting positions:

sortBAM:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
         -jar ${picard}/SortSam.jar \
        INPUT=${SCRATCH}/${patientid}/${sample}.reordered.bam \
        OUTPUT=${SCRATCH}/${patientid}/${sample}.sorted.bam \
        SORT_ORDER=coordinate \
        VALIDATION_STRINGENCY=LENIENT

Sorts reads by position in each chromosome.

Fix read mate information in paired-end sequencing:

If your experiment data is generated using paired-end sequencing technology, you need to fix read mate information at this point.

fixMate:
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/${patientid} \
        -jar ${picard}/FixMateInformation.jar \
        INPUT=${SCRATCH}/${patientid}/${sample}.sorted.bam \
        OUTPUT=${SCRATCH}/${patientid}/${sample}.fxmt.bam \
        SO\=coordinate \
        CREATE_INDEX\=true  \
        VALIDATION_STRINGENCY\=SILENT

        python ${bamtools_stats} \
         -i ${SCRATCH}/${patientid}/${sample}.fxmt.bam\
         -o ${SCRATCH}/${patientid}/${sample}.fxmt.stats

Count duplicate reads:

checkDups:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/${patientid} \
        -jar ${picard}/MarkDuplicates.jar \
         I\=${SCRATCH}/${patientid}/${sample}.fxmt.bam \
         O\=${SCRATCH}/${patientid}/${sample}.chkdup.bam \
         M\=${SCRATCH}/${patientid}/${sample}.duplicate_report.txt \
         VALIDATION_STRINGENCY\=SILENT \
         REMOVE_DUPLICATES\=false

        python ${bamtools_stats}\
         -i ${SCRATCH}/${patientid}/${sample}.chkdup.bam \
         -o ${SCRATCH}/${patientid}/${sample}.chkdup.stats

You can inspect *.chkdup.stats files to get an estimate of the fraction of duplicate reads (as marked by Picard), as well as other summary information on each bam file.
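The duplicate fraction is also reported directly in the metrics file written by MarkDuplicates. A quick way to see it (a sketch; the target name is hypothetical, and the exact column layout can vary slightly between Picard versions, so read the header line printed below):

dupRate:
        # print the metrics header and data lines; look for the PERCENT_DUPLICATION column
        grep -A 2 '^## METRICS CLASS' ${SCRATCH}/${patientid}/${sample}.duplicate_report.txt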

Remove duplicate reads:

removeDups:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/${patientid} \
        -jar ${picard}/MarkDuplicates.jar \
         I\=${SCRATCH}/${patientid}/${sample}.fxmt.bam \
         O\=${SCRATCH}/${patientid}/${sample}.rmdup.bam \
         M\=${SCRATCH}/${patientid}/${sample}.duplicate_report.txt \
         VALIDATION_STRINGENCY\=SILENT \
         REMOVE_DUPLICATES\=true

CAUTION: In targeted sequencing experiments, it is typical to have a very large fraction of reads marked as duplicates. In this case, please skip the removeDups step and substitute by the following.

skipRmDup:
        ln -s ${SCRATCH}/${patientid}/${sample}.fxmt.bam \
         ${SCRATCH}/${patientid}/${sample}.rmdup.bam 
        

Add read group information:

addGrp:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/${patientid} \
         -jar ${picard}/AddOrReplaceReadGroups.jar \
          RGLB\=${sample}.fastq \
          RGPL\=Illumina \
          RGPU\=${sample} \
          RGSM\=${sample} \
          I\=${SCRATCH}/${patientid}/${sample}.rmdup.bam \
          O\=${SCRATCH}/${patientid}/${sample}.rmdup.grp.bam \
          SORT_ORDER\=coordinate \
          CREATE_INDEX\=true \
          VALIDATION_STRINGENCY\=SILENT

        python ${bamtools_stats}\
         -i ${SCRATCH}/${patientid}/${sample}.rmdup.grp.bam \
         -o ${SCRATCH}/${patientid}/${sample}.rmdup.stats

        mv -f ${SCRATCH}/${patientid}/${sample}.rmdup.grp.bam    ${SCRATCH}/${patientid}/${sample}.rmdup.bam
        mv -f ${SCRATCH}/${patientid}/${sample}.rmdup.grp.bai     ${SCRATCH}/${patientid}/${sample}.rmdup.bai

Please substitute Illumina with the appropriate sequencing platform.

If read duplicates were removed, comparing the stats files ${sample}.chkdup.stats and ${sample}.rmdup.stats gives an idea of the impact of duplicate removal.
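For example, a side-by-side diff of the two stats files makes the change in read counts easy to spot (a sketch; target name hypothetical):

compareStats:
        # diff exits non-zero when the files differ, so ignore its exit status
        diff -y ${SCRATCH}/${patientid}/${sample}.chkdup.stats \
         ${SCRATCH}/${patientid}/${sample}.rmdup.stats || true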

Generate realignment intervals around indels:

mkIntervals:
        mkdir -p ${temp_folder}/tmp_realign_${sample}

        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/tmp_realign_${sample} \
         -jar ${gatk} \
         -T RealignerTargetCreator \
         -I ${SCRATCH}/${patientid}/${sample}.rmdup.bam \
         -fixMisencodedQuals \
         -R ${refSequence} \
         -known ${commonSNPs} \
         -known ${commonIndels} \
         -nt ${numThreads} \
         -o ${SCRATCH}/${patientid}/${sample}.forRealign.intervals

        rm -rf ${temp_folder}/tmp_realign_${sample}

Cleaning bam files:

cleanSam:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
        -jar ${picard}/CleanSam.jar  \
        INPUT=${SCRATCH}/${patientid}/${sample}.rmdup.bam \
        OUTPUT=${SCRATCH}/${patientid}/${sample}.clean.rmdup.bam \
        CREATE_INDEX\=true

The command above soft-clips the reads hanging from chromosome ends, and sets MAPQ to 0 for unmapped reads.
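As an optional sanity check at this point, Picard's ValidateSamFile can summarize any remaining formatting problems. A sketch (target name hypothetical; ValidateSamFile.jar ships with the Picard release referenced above):

validateBAM:
        java -Xmx${heap}m -Djava.io.tmpdir\=${temp_folder}/ \
        -jar ${picard}/ValidateSamFile.jar \
        INPUT=${SCRATCH}/${patientid}/${sample}.clean.rmdup.bam \
        MODE=SUMMARY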

Realign reads in the neighbourhood of indels:

realign:
        mkdir -p ${temp_folder}/realign_${sample}
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/realign_${sample}\
         -jar ${gatk}\
         -T IndelRealigner\
         -I ${SCRATCH}/${patientid}/${sample}.clean.rmdup.bam\
         -R ${refSequence}\
         -targetIntervals ${SCRATCH}/${patientid}/${sample}.forRealign.intervals \
         --out ${SCRATCH}/${patientid}/${sample}.realigned.bam \
         -known ${commonSNPs}\
         -known ${commonIndels} \
        --fix_misencoded_quality_scores

         python ${bamtools_stats}\
          -i ${SCRATCH}/${patientid}/${sample}.realigned.bam\
          -o ${SCRATCH}/${patientid}/${sample}.realigned.stats
         
         rm -rf ${temp_folder}/realign_${sample}

Base Quality Score Recalibration:

BQSR:
        mkdir -p ${temp_folder}/tmp_recal_${sample}
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/tmp_recal_${sample}\
         -jar ${gatk}\
         -T BaseRecalibrator\
         -R ${refSequence}\
         -I ${SCRATCH}/${patientid}/${sample}.realigned.bam\
         -knownSites ${commonSNPs}\
         -knownSites ${commonIndels}\
         -o ${SCRATCH}/${patientid}/${sample}.recal_data.table\
         -nct ${numThreads}
        rm -rf ${temp_folder}/tmp_recal_${sample}

Generate adjustment of Base Quality Scores:

BQSR2:
        mkdir -p ${temp_folder}/tmp_recal_${sample}
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/tmp_recal_${sample}\
         -jar ${gatk}\
         -T BaseRecalibrator\
         -R ${refSequence}\
         -I ${SCRATCH}/${patientid}/${sample}.realigned.bam\
         -knownSites ${commonSNPs}\
         -knownSites ${commonIndels}\
         -BQSR ${SCRATCH}/${patientid}/${sample}.recal_data.table\
         -nct 2\
         -o ${SCRATCH}/${patientid}/${sample}.post_recal_data.table
        rm -rf ${temp_folder}/tmp_recal_${sample}

Generate recalibration plots:

recalPlot:
         java -Xmx${heap}m -jar ${gatk}\
         -T AnalyzeCovariates\
         -csv ${SCRATCH}/${patientid}/${sample}.report.csv \
         -R ${refSequence}\
         -before ${SCRATCH}/${patientid}/${sample}.recal_data.table\
         -after ${SCRATCH}/${patientid}/${sample}.post_recal_data.table\
         -plots ${SCRATCH}/${patientid}/${sample}.recalibration_plots.pdf

This command requires the R package gsalib.
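If gsalib is not already installed, a one-time setup from within R is usually sufficient (a sketch; gsalib is available on CRAN):

R -e 'install.packages("gsalib", repos="https://cran.r-project.org")'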

Apply base quality recalibration process:

applyRecal:
        mkdir -p ${temp_folder}/applyRecal_${sample}
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/applyRecal_${sample}\
         -jar ${gatk}\
         -T PrintReads\
         -R ${refSequence}\
         -I ${SCRATCH}/${patientid}/${sample}.realigned.bam\
         -BQSR ${SCRATCH}/${patientid}/${sample}.recal_data.table\
         -o ${SCRATCH}/${patientid}/${sample}.recalibrated.bam\
         -nct 4
        rm -rf ${temp_folder}/applyRecal_${sample}

Mutation/Variant calling pipeline:

Germline variants

callVariantsUG:
        mkdir -p ${temp_folder}/var_UG_${sample}
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/var_UG_${sample}\
         -jar ${gatk}\
         -T UnifiedGenotyper\
         -R ${refSequence}\
         -I ${SCRATCH}/${patientid}/${sample}.recalibrated.bam\
         -ploidy 2\
         -glm BOTH\
         -dcov 250\
         -stand_emit_conf 10\
         -stand_call_conf 30\
         -o ${SCRATCH}/${patientid}/${sample}.raw_variants.vcf \
         --dbsnp ${dbsnpAll} \
         -nt 8 \
         -nct 2

        rm -rf ${temp_folder}/var_UG_${sample}

germlineSNPs:
        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/var_filter_UG_${sample} \
         -jar ${gatk} \
         -T SelectVariants \
         -R ${refSequence} \
         -V ${SCRATCH}/${patientid}/${sample}.raw_variants.vcf \
         -selectType SNP \
         -o ${SCRATCH}/${patientid}/${sample}.raw_snps.vcf

        java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}/var_filter_UG_${sample} \
        -jar ${gatk} \
        -T VariantFiltration \
        -R ${refSequence} \
        -V ${SCRATCH}/${patientid}/${sample}.raw_snps.vcf \
        --filterExpression "QD < 2.0" \
        --filterName "QD" \
        --filterExpression "FS > 60.0" \
        --filterName "FS" \
        --filterExpression "MQ < 40.0" \
        --filterName "MQ" \
        --filterExpression "HaplotypeScore > 13.0" \
        --filterName "HapScore" \
        --filterExpression "MQRankSum < -12.5" \
        --filterName "MQRankSum" \
        --filterExpression "ReadPosRankSum < -8.0" \
        --filterName "ReadPosRankSum" \
        --filterExpression "DP < 20" \
        --filterName "low_coverage" \
        -o ${SCRATCH}/${patientid}/${sample}.filtered_snps.vcf

        java -Xmx${heap}m \
        -jar ${gatk} \
        -T SelectVariants \
        -R ${refSequence} \
        --variant ${SCRATCH}/${patientid}/${sample}.filtered_snps.vcf \
        -select 'vc.isNotFiltered()' \
        -o ${SCRATCH}/${patientid}/${sample}.germline_snps.vcf

The above commands create an initial list of germline variant calls, mark the subset failing any of the pre-defined criteria, and finally drop the marked positions.

The output of the process is the *.germline_snps.vcf file.
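A quick tally of the FILTER column (column 7) of the intermediate *.filtered_snps.vcf shows how many calls each criterion removed and how many passed (a sketch; target name hypothetical):

filterSummary:
        grep -v '^#' ${SCRATCH}/${patientid}/${sample}.filtered_snps.vcf | \
        cut -f7 | sort | uniq -c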

Germline indel calls can be made using a similar sequence of commands; however, since they will not be referenced in the current document, these commands are not included.

Somatic single base substitutions (SBS)

SomaticCalls:
        java -Xmx${heap}m \
        -jar ${MuTect} \
        --analysis_type MuTect \
        --reference_sequence ${refSequence} \
        --dbsnp ${dbsnpAll} \
        --input_file:normal ${SCRATCH}/${patientid}/${germline}.recalibrated.bam \
        --input_file:tumor ${SCRATCH}/${patientid}/${tumor}.recalibrated.bam \
        --out ${SCRATCH}/${patientid}/${tumor}.call_stats.txt \
        --vcf ${SCRATCH}/${patientid}/${tumor}.MuTect.vcf

filterSomaticCalls:
        java -Xmx${heap}m \
        -jar ${gatk} \
        -T SelectVariants \
        -R ${refSequence} \
        --variant ${SCRATCH}/${patientid}/${tumor}.MuTect.vcf \
        -select 'vc.isNotFiltered()' \
        -o ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.vcf

Generate an initial list of somatic single base substitutions using MuTect default settings, and filter out the calls that do not pass MuTect's filters.
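To get a quick sense of the yield, count the calls that survived filtering (a sketch; target name hypothetical):

countSomatic:
        # grep exits non-zero when no calls remain, so ignore its exit status
        grep -vc '^#' ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.vcf || true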

Annotate SBS for functional consequence

annotateSomaticCalls:
        java -Xmx4g -jar ${snpEff} ${assemblyID} \
        ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.vcf > \
        ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.annotated.vcf

Please set assemblyID to the assembly version matching your experiment in SnpEff; e.g. GRCh37 for hg19.
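SnpEff also needs the corresponding annotation database downloaded before it can run. A one-time setup sketch (target name hypothetical; the exact database name, e.g. GRCh37.75, depends on your SnpEff version):

snpEffSetup:
        # list databases matching your assembly, then pre-download the chosen one
        java -jar ${snpEff} databases | grep -i GRCh37
        java -jar ${snpEff} download GRCh37.75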

A better alternative is to submit the above VCF file to CRAVAT and download the results to ${SCRATCH}/${patientid}/${tumor}.MuTect.pass.annotated.vcf.

Somatic Insertion/Deletions:

GATKSomaticIndels:
        java -Xmx${heap}m \
        -jar ${old_gatk} \
        -T IndelGenotyperV2 \
        --somatic \
        --maxNumberOfReads 100000 \
        --input_file:normal ${SCRATCH}/${patientid}/${germline}.recalibrated.bam \
        --input_file:tumor ${SCRATCH}/${patientid}/${tumor}.recalibrated.bam \
        --verboseOutput ${SCRATCH}/${patientid}/gatk/${tumor}.indel \
        -o ${SCRATCH}/${patientid}/gatk/${tumor}.indels.vcf \
        --reference_sequence ${refSequence} \
        --intervals ${Target} \
         --minCoverage 20 \
        --minNormalCoverage 20
        
filterGATKIndels:
        grep "#" ${SCRATCH}/${patientid}/gatk/${tumor}.indels.vcf > \
        ${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.vcf

        grep "SOMATIC" ${SCRATCH}/${patientid}/gatk/${tumor}.indels.vcf | \
        grep -v '#' >> \
        ${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.vcf
        
annotateGATKIndels:
        java -Xmx4g -jar ${snpEff} ${assemblyID} \
        ${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.vcf > \
        ${SCRATCH}/${patientid}/gatk/${tumor}.indels.somatic.annotated.vcf

Germline heterozygous positions for allelic imbalance analysis

germlineHetPos:
        grep -w 'AC=1' ${SCRATCH}/${patientid}/${sample}.germline_snps.vcf | \
        cut -f1,2 | /apps/tawk/tawk '{print $$1, ($$2-1),$$2, NR}' > \
        ${SCRATCH}/${patientid}/${sample}.germline_het.bed

        grep -w 'AC=1' ${SCRATCH}/${patientid}/${sample}.germline_snps.vcf | \
        /apps/tawk/tawk '{print $$1, ($$2-1),$$2, "+",$$4"/"$$5 , NR}' > \
        ${SCRATCH}/${patientid}/${sample}.germline_het.annotated.bed

Minor allele frequency tables

mpileup:
        echo "${SCRATCH}/${patientid}/${sample}.recalibrated.bam" > \
        ${SCRATCH}/${patientid}/${sample}.bamfiles.path
        
        ${samtools} mpileup -f ${refSequence} \
        -l ${SCRATCH}/${patientid}/${sample}.germline_het.bed \
        -b ${SCRATCH}/${patientid}/${sample}.bamfiles.path >\
        ${SCRATCH}/${patientid}/${sample}.het.mpileup

        # please copy over the mpileup2baf python script from
        # /projects/clonal-evolution/Mouse/C2/exprs/baf-by-band/scripts/
        
        python /projects/my-project/src/scripts/mpileup2baf.py \
        -m ${SCRATCH}/${patientid}/${sample}.het.mpileup \
        -b ${SCRATCH}/${patientid}/${sample}.germline_het.annotated.bed > \
        ${SCRATCH}/${patientid}/${sample}.baf.tsv
        

Final Comments

Please monitor disk usage before and after each step in the pipeline above. In my experience, it is good practice to keep the intermediate bam files from the previous two to three steps of the pipeline. This allows efficient tracing of potential problems without losing the computation time already spent.
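A quick way to keep an eye on this (a sketch; run with the paths substituted, or add it as a make target):

# per-patient footprint of intermediate files, and free space on the work disk
du -sh ${SCRATCH}/${patientid}/
df -h /mnt/hn07_disk001/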

In the Karchin lab, we use /mnt/hn07_disk001/ to store and process these raw bam files. Please ask Kyle to create a project-specific directory for you under this path, and adjust your pipeline makefile accordingly.
