Start by gathering some data. Navigate to a somatic
folder and pull down a set of input data (human build 38) from this location:
wget https://storage.googleapis.com/bfx_workshop_tmp/inputs.tar.gz
Unpack the inputs tarball with tar -xzvf inputs.tar.gz
and look at the files:
- There are two bams, containing reads from chromosome 17 of a human cell line tumor/normal pair.
- A small reference fasta for chromosome 17 is also included. This is to save time and memory (as opposed to using the whole human reference)
- You might wonder why we have duplicate indices, tumor.bai and tumor.bam.bai, here. The short answer is that they contain the same information. Some tools strongly prefer one format and some the other, so to avoid complications in our pipelines, we just keep both around.
We're going to run two different somatic variant calling algorithms today - Mutect and Varscan. Let's try running the core callers by hand first.
You may remember using the GATK toolkit in the germline variant calling module. It also contains a somatic variant caller, called Mutect.
Let's run it, using our normal and tumor bam files as inputs, and send it to the background, where it can chug away while we talk:
gatk Mutect2 --java-options "-Xmx4g" -O mutect.vcf.gz -R inputs/chr17.fa -I inputs/tumor.bam -tumor "Exome_Tumor" -I inputs/normal.bam -normal "Exome_Normal" >mutect.log 2>&1
Note the use of redirects to send STDOUT to a file, and STDERR to STDOUT, meaning that all the output ends up in one file.
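As a minimal illustration of this redirection pattern (using a toy command group in place of gatk):

```shell
# Toy command that writes one line to STDOUT and one to STDERR;
# ">run.log" captures STDOUT, and "2>&1" then points STDERR at the same place.
{ echo "a normal message"; echo "an error message" >&2; } > run.log 2>&1
cat run.log   # both lines end up in the one file
```

The order matters: `2>&1 > run.log` would redirect STDERR to the terminal (where STDOUT pointed at that moment) before STDOUT was moved to the file.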
This will take a couple of minutes, even on this very small data set. When running on whole genomes, we often parallelize it, by launching many Mutect jobs that each run on a small portion of the genome.
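A scatter step for that parallelization might look like the dry-run sketch below. The regions here are made up for illustration (real pipelines typically generate interval lists with a tool such as GATK's SplitIntervals); this loop just prints the per-interval commands it would launch.

```shell
# Dry-run sketch: print one Mutect2 command per interval (regions are
# illustrative, not a real interval list)
for region in chr17:1-40000000 chr17:40000001-83257441; do
  name=$(echo "$region" | tr ':-' '._')
  echo "gatk Mutect2 -R inputs/chr17.fa -I inputs/tumor.bam -I inputs/normal.bam -L $region -O mutect.$name.vcf.gz"
done
```

The per-interval VCFs would then be merged back together once all jobs finish.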
Next, let's run Varscan, a variant caller with a fundamentally different approach. Varscan is also installed in our AMI and can be run like this:
java -jar ~/bin/VarScan.v2.4.2.jar somatic <(samtools mpileup --no-baq -f inputs/chr17.fa inputs/normal.bam inputs/tumor.bam) varscan.vcf --strand-filter 0 --min-coverage 8 --min-var-freq 0.1 --p-value 0.99 --mpileup 1 --output-vcf
- Note the use of process substitution with <() on the command line, which takes the output of samtools mpileup and feeds it to Varscan as if it were a file.
- Also notice that there are specific parameters, including one that sets the minimum variant frequency of returned variants to 0.1. Can you think of situations where this wouldn't be what you want?
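To see what such a cutoff does mechanically, here's a toy illustration (made-up sites and frequencies, not real Varscan output). Anything below the 10% threshold is dropped, which could discard genuine low-frequency subclones in an impure or heterogeneous tumor:

```shell
# Columns: chrom, pos, variant allele frequency (toy data)
printf 'chr17\t100\t0.05\nchr17\t200\t0.30\nchr17\t300\t0.12\n' |
  awk -F'\t' '$3 >= 0.1'   # mimics a --min-var-freq 0.1 cutoff: the 5% site is dropped
```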
Both callers typically require post-processing to give only high-quality variant calls, but let's take a quick look at the raw outputs. Mutect outputs a gzipped file, so you'll have to either run gunzip -c mutect.vcf.gz | less or, as a shortcut, just use zless mutect.vcf.gz
- Which caller produced more variants in the VCF? (don't include the header!)
- How could we use bash set operations to do a quick and dirty look at number of identical variants called?
hint
Example:
#extract just the variant information from the VCFs
zgrep -v "^#" mutect.vcf.gz | cut -f 1,2,4,5 >tmp.mut
zgrep -v "^#" varscan.vcf.snp | cut -f 1,2,4,5 >tmp.var
zgrep -v "^#" varscan.vcf.indel | cut -f 1,2,4,5 >>tmp.var
#use bash utilities to see how many appear twice
cat tmp.mut tmp.var | sort | uniq -d | wc -l
#cleanup
rm tmp.mut tmp.var
Filtering is generally needed to get these "raw" calls down to something with a better specificity. There are multiple layers of filtering in our somatic pipeline, but here we'll just run the first-pass, built-in Mutect filter:
gatk FilterMutectCalls -R inputs/chr17.fa -V mutect.vcf.gz -O mutect.filtered.vcf.gz
If you run FilterMutectCalls without inputs, you can see the many parameters that can be tweaked and fine-tuned as needed.
After filtering, you'll find some detailed stats in mutect.filtered.vcf.gz.filteringStats.tsv
- Compare the number of mutations in these VCFs before and after filtering using the same zgrep command as above.
- Why does that value differ from what you expect?
- Using the command line, construct a command that tells you the number of passing variants in the filtered file.
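One common idiom is to test the FILTER column (column 7 of a VCF). Sketched here on a tiny made-up VCF rather than the real file, so you can see exactly what it does:

```shell
# Build a toy filtered VCF: one PASS record, one record failing a filter
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr17\t100\t.\tA\tT\t.\tPASS\t.\nchr17\t200\t.\tG\tC\t.\tweak_evidence\t.\n' | gzip > toy.filtered.vcf.gz
# Skip header lines, keep records whose FILTER column is PASS, and count them
zgrep -v "^#" toy.filtered.vcf.gz | awk -F'\t' '$7 == "PASS"' | wc -l
```

The same pipeline, pointed at mutect.filtered.vcf.gz, answers the question above.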
Varscan is a bit more complicated because we have separate files for SNVs and Indels. Let's merge them to make things easier. First, zip up and index our varscan variant calls:
bgzip varscan.vcf.snp
tabix -p vcf varscan.vcf.snp.gz
bgzip varscan.vcf.indel
tabix -p vcf varscan.vcf.indel.gz
Then use this BCFtools command to merge them:
bcftools concat --allow-overlaps --remove-duplicates --output-type z -o varscan_merged.vcf.gz varscan.vcf.snp.gz varscan.vcf.indel.gz
and finally, let's index our resulting merged varscan VCF:
tabix varscan_merged.vcf.gz
Note that we didn't do any filtering on these Varscan outputs. That isn't because filtering isn't needed (the number of variants is ludicrously high in the raw outputs!); we're just omitting it in the interest of keeping this tutorial brief.
To complete this exercise, let's merge these VCFs from two different callers into one combined output file. We're going to accomplish this using the CombineVariants tool bundled with older versions of GATK. Drop into a docker container that contains the tool:
docker run -v $(pwd -P):/data -it mgibio/gatk-cwl:3.6.0
Then navigate to where you mounted your data folder and run the combine variants command:
cd /data
/usr/bin/java -Xmx2g -jar /opt/GenomeAnalysisTK.jar -T CombineVariants -genotypeMergeOptions PRIORITIZE --rod_priority_list mutect,varscan -o combined.vcf.gz -R inputs/chr17.fa --variant:mutect mutect.filtered.vcf.gz --variant:varscan varscan_merged.vcf.gz
Notice that we have to tell the tool what to do with overlaps: if both VCFs contain the same variant call, which line do we keep? Different tools output different types of auxiliary information, so we have to choose one. The tool also adds a "set" attribute, which records which VCF each variant came from, or holds a value like "Intersection" (called by all algorithms) or "FilteredInAll" (meaning it didn't pass filters anywhere).
Exit the docker container by typing exit
- Take a look through this file - how many variants (if any) were called by both callers?
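A quick way to answer that is to pull the set attribute out of the INFO field and tally each value. Shown here on a toy merged VCF (made-up records) so the idiom is clear before you point it at combined.vcf.gz:

```shell
# Toy merged VCF with "set" annotations like those CombineVariants writes
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr17\t100\t.\tA\tT\t.\tPASS\tset=mutect\nchr17\t200\t.\tG\tC\t.\tPASS\tset=Intersection\nchr17\t300\t.\tT\tG\t.\tPASS\tset=varscan\nchr17\t400\t.\tC\tA\t.\tPASS\tset=Intersection\n' | gzip > toy.combined.vcf.gz
# Extract the set= value from each INFO field and count how often each appears
zgrep -v "^#" toy.combined.vcf.gz | grep -o 'set=[^;[:space:]]*' | sort | uniq -c
```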
The differences in variants called reflect different underlying statistical models and assumptions. In this case, it also reflects that we didn't run through the whole Mutect filtering process, and didn't filter the VarScan calls at all! Still, it's important to realize that different callers have different strengths and weaknesses!
If you'd like to see the results of your hard work, you can take a look at some of these variants in IGV.
- Open IGV, make sure you have genome build hg38 set, then load your normal.bam, tumor.bam, and combined.vcf.gz using "Open File".
- Go to the position of the first variant in your VCF. Do you have high confidence that this variant is real? (look at the mapping quality of the reads)
- Now jump to position chr17:7674764-7674830 in the TP53 gene. Does this look like a somatic (tumor-specific) variant to you?
- Click on the name of your VCF file in the lefthand panel, then hit CTRL-F. Note how it jumps to the next variant in the file. Does this one look more convincingly somatic?
- Click on the coverage track to view the readcounts and VAF of this variant. What could the VAF indicate about this tumor's purity or ploidy?
- Finally, zoom out until you can see several exons of TP53, but still see the coverage track. Notice the "bumps" in coverage over each exon that indicate this was Exome sequencing, not whole genome.
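For reference, VAF is just alt-supporting reads divided by total reads at the site. In a 100%-pure diploid tumor, a heterozygous somatic variant should sit near 0.5, so a markedly lower VAF can hint at impurity, subclonality, or copy-number changes. A toy calculation with made-up read counts:

```shell
# Toy read counts at one site (not from the workshop data)
ref=55; alt=45
awk -v a="$alt" -v r="$ref" 'BEGIN { printf "VAF = %.2f\n", a / (a + r) }'
```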
If all of this calling and merging and filtering seems a little complicated, well, that's because it is! Variant calling, filtering, and prioritization is a difficult problem, and doubly so for somatic variants - the issues of purity, ploidy, and heterogeneity that we talked about in the lecture are challenging to overcome.
Thankfully, solid pipelines exist for running variant calling. Pipelines help you in two ways: 1) they encapsulate many of these steps, and allow you to specify the needed inputs and just say "Go". 2) They benefit from years of fine-tuning by folks who have thought deeply about the problem. Many such pipelines exist: for one example, you could look at the somatic CWL pipelines in the analysis-workflows repository or comparable WDL workflows in the analysis-wdls repository.
You still need to be aware of some of the underlying issues and relevant parameters, as we discussed above, but using a pre-built pipeline can save you a lot of time!