Start by gathering some data. Navigate to a somatic
folder and pull down a set of input data (human build 38) from this location:
wget https://storage.googleapis.com/bfx_workshop_tmp/inputs.tar.gz
Unpack the inputs tarball with tar -xzvf inputs.tar.gz
and look at the files:
- There are two bams, containing reads from chromosome 17 of a human cell line tumor/normal pair.
- A small reference fasta for chromosome 17 is also included. This is to save time and memory (as opposed to using the whole human reference)
- You might wonder why we have duplicate indices, tumor.bai and tumor.bam.bai, here. The short answer is that they contain the same information. Some tools strongly prefer one format and some the other, so to avoid complications in our pipelines, we just keep both around.
We're going to run two different somatic variant calling algorithms today - Mutect and Varscan. Let's try running the core callers by hand first.
You may remember using the GATK toolkit in the germline variant calling module. It also contains a somatic variant caller, called Mutect.
Let's run it, using our normal and tumor bam files as inputs, and send it to the background, where it can chug away while we talk:
gatk Mutect2 --java-options "-Xmx4g" -O mutect.vcf.gz -R inputs/chr17.fa -I inputs/tumor.bam -tumor "Exome_Tumor" -I inputs/normal.bam -normal "Exome_Normal" >mutect.log 2>&1
Note the use of redirects to send STDOUT to a file, and STDERR to STDOUT, meaning that all the output ends up in one file.
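As a minimal illustration of this redirection pattern (using a toy command group in place of gatk):

```shell
# Toy command that writes one line to STDOUT and one to STDERR;
# ">run.log" captures STDOUT, and "2>&1" then points STDERR at the same place.
{ echo "a normal message"; echo "an error message" >&2; } > run.log 2>&1
cat run.log   # both lines end up in the one file
```

The order matters: `2>&1 > run.log` would redirect STDERR to the terminal (where STDOUT pointed at that moment) before STDOUT was moved to the file.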
This will take a couple of minutes, even on this very small data set. When running on whole genomes, we often parallelize it, by launching many Mutect jobs that each run on a small portion of the genome.
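A scatter step for that parallelization might look like the dry-run sketch below. The regions here are made up for illustration (real pipelines typically generate interval lists with a tool such as GATK's SplitIntervals); this loop just prints the per-interval commands it would launch.

```shell
# Dry-run sketch: print one Mutect2 command per interval (regions are
# illustrative, not a real interval list)
for region in chr17:1-40000000 chr17:40000001-83257441; do
  name=$(echo "$region" | tr ':-' '._')
  echo "gatk Mutect2 -R inputs/chr17.fa -I inputs/tumor.bam -I inputs/normal.bam -L $region -O mutect.$name.vcf.gz"
done
```

The per-interval VCFs would then be merged back together once all jobs finish.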
Next, let's run Varscan, a variant caller with a fundamentally different approach. Varscan is also installed in our AMI and can be run like this:
java -jar ~/bin/VarScan.v2.4.2.jar somatic <(samtools mpileup --no-baq -f inputs/chr17.fa inputs/normal.bam inputs/tumor.bam) varscan.vcf --strand-filter 0 --min-coverage 8 --min-var-freq 0.1 --p-value 0.99 --mpileup 1 --output-vcf
- Note the use of process substitution with <() on the command line, which takes the output of samtools mpileup and feeds it to Varscan as if it were a file.
- Also notice that there are specific parameters, including one that sets the minimum variant frequency of returned variants to 0.1. Can you think of situations where this wouldn't be what you want?
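To see what such a cutoff does mechanically, here's a toy illustration (made-up sites and frequencies, not real Varscan output). Anything below the 10% threshold is dropped, which could discard genuine low-frequency subclones in an impure or heterogeneous tumor:

```shell
# Columns: chrom, pos, variant allele frequency (toy data)
printf 'chr17\t100\t0.05\nchr17\t200\t0.30\nchr17\t300\t0.12\n' |
  awk -F'\t' '$3 >= 0.1'   # mimics a --min-var-freq 0.1 cutoff: the 5% site is dropped
```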
Both callers typically require post-processing to give only high-quality variant calls, but let's take a quick look at the raw outputs. Mutect outputs a gzipped file, so you'll have to either run gunzip -c mutect.vcf.gz | less or, as a shortcut, just use zless mutect.vcf.gz
- Which caller produced more variants in the VCF? (don't include the header!)
- How could we use bash set operations to do a quick and dirty look at number of identical variants called?
hint
Example:
#extract just the variant information from the VCFs
zgrep -v "^#" mutect.vcf.gz | cut -f 1,2,4,5 >tmp.mut
zgrep -v "^#" varscan.vcf.snp | cut -f 1,2,4,5 >tmp.var
zgrep -v "^#" varscan.vcf.indel | cut -f 1,2,4,5 >>tmp.var
#use bash utilities to see how many appear twice
cat tmp.mut tmp.var | sort | uniq -d | wc -l
#cleanup
rm tmp.mut tmp.var
Filtering is generally needed to get these "raw" calls down to something with a better specificity. There are multiple layers of filtering in our somatic pipeline, but here we'll just run the first-pass, built-in Mutect filter:
gatk FilterMutectCalls -R inputs/chr17.fa -V mutect.vcf.gz -O mutect.filtered.vcf.gz
If you run FilterMutectCalls without inputs, you can see the many parameters that can be tweaked and fine-tuned as needed.
After filtering, you'll find some detailed stats in mutect.filtered.vcf.gz.filteringStats.tsv
- Compare the number of mutations in these VCFs before and after filtering using the same zgrep command as above.
- Why does that value differ from what you expect?
- Using the command line, construct a command that tells you the number of passing variants in the filtered file.
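One common idiom is to test the FILTER column (column 7 of a VCF). Sketched here on a tiny made-up VCF rather than the real file, so you can see exactly what it does:

```shell
# Build a toy filtered VCF: one PASS record, one record failing a filter
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr17\t100\t.\tA\tT\t.\tPASS\t.\nchr17\t200\t.\tG\tC\t.\tweak_evidence\t.\n' | gzip > toy.filtered.vcf.gz
# Skip header lines, keep records whose FILTER column is PASS, and count them
zgrep -v "^#" toy.filtered.vcf.gz | awk -F'\t' '$7 == "PASS"' | wc -l
```

The same pipeline, pointed at mutect.filtered.vcf.gz, answers the question above.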
Varscan is a bit more complicated because we have separate files for SNVs and Indels. Let's merge them to make things easier. First, zip up and index our varscan variant calls:
bgzip varscan.vcf.snp
tabix -p vcf varscan.vcf.snp.gz
bgzip varscan.vcf.indel
tabix -p vcf varscan.vcf.indel.gz
Then use this BCFtools command to merge them:
bcftools concat --allow-overlaps --remove-duplicates --output-type z -o varscan_merged.vcf.gz varscan.vcf.snp.gz varscan.vcf.indel.gz
and finally, let's index our resulting merged varscan VCF:
tabix varscan_merged.vcf.gz
Note that we didn't do any filtering on these Varscan outputs. That isn't because filtering isn't needed (the number of variants is ludicrously high in the raw outputs!); we're just omitting it in the interest of keeping this tutorial brief.
To complete this exercise, let's merge these VCFs from two different callers into one combined output file. We're going to accomplish this using the CombineVariants tool bundled with older versions of GATK. Drop into a docker container that contains the tool:
docker run -v $(pwd -P):/data -it mgibio/gatk-cwl:3.6.0
Then navigate to where you mounted your data folder and run the combine variants command:
cd /data
/usr/bin/java -Xmx2g -jar /opt/GenomeAnalysisTK.jar -T CombineVariants -genotypeMergeOptions PRIORITIZE --rod_priority_list mutect,varscan -o combined.vcf.gz -R inputs/chr17.fa --variant:mutect mutect.filtered.vcf.gz --variant:varscan varscan_merged.vcf.gz
Notice that we have to tell the tool what to do with overlaps: if both VCFs contain the same variant call, which line do we keep? Different tools output different types of auxiliary information, so we have to choose one. The tool also adds a "set" attribute, which records which VCF each variant came from, or holds a value like "Intersection" (called by all algorithms) or "FilteredInAll" (meaning it didn't pass filters anywhere).
Exit the docker container by typing exit
- Take a look through this file - how many variants (if any) were called by both callers?
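A quick way to answer that is to pull the set attribute out of the INFO field and tally each value. Shown here on a toy merged VCF (made-up records) so the idiom is clear before you point it at combined.vcf.gz:

```shell
# Toy merged VCF with "set" annotations like those CombineVariants writes
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr17\t100\t.\tA\tT\t.\tPASS\tset=mutect\nchr17\t200\t.\tG\tC\t.\tPASS\tset=Intersection\nchr17\t300\t.\tT\tG\t.\tPASS\tset=varscan\nchr17\t400\t.\tC\tA\t.\tPASS\tset=Intersection\n' | gzip > toy.combined.vcf.gz
# Extract the set= value from each INFO field and count how often each appears
zgrep -v "^#" toy.combined.vcf.gz | grep -o 'set=[^;[:space:]]*' | sort | uniq -c
```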
The differences in variants called reflect different underlying statistical models and assumptions. In this case, it also reflects that we didn't run through the whole Mutect filtering process, and didn't filter the VarScan calls at all! Still, it's important to realize that different callers have different strengths and weaknesses!
If you'd like to see the results of your hard work, you can take a look at some of these variants in IGV.
- Open IGV, make sure you have genome build hg38 set, then load your normal.bam, tumor.bam, and combined.vcf.gz using "Open File".
- Go to the position of the first variant in your VCF. Do you have high confidence that this variant is real? (look at the mapping quality of the reads)
- Now jump to position chr17:7674764-7674830 in the TP53 gene. Does this look like a somatic (tumor-specific) variant to you?
- Click on the name of your VCF file in the lefthand panel, then hit CTRL-F. Note how it jumps to the next variant in the file. Does this one look more convincingly somatic?
- Click on the coverage track to view the readcounts and VAF of this variant. What could the VAF indicate about this tumor's purity or ploidy?
- Finally, zoom out until you can see several exons of TP53, but still see the coverage track. Notice the "bumps" in coverage over each exon that indicate this was Exome sequencing, not whole genome.
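For reference, VAF is just alt-supporting reads divided by total reads at the site. In a 100%-pure diploid tumor, a heterozygous somatic variant should sit near 0.5, so a markedly lower VAF can hint at impurity, subclonality, or copy-number changes. A toy calculation with made-up read counts:

```shell
# Toy read counts at one site (not from the workshop data)
ref=55; alt=45
awk -v a="$alt" -v r="$ref" 'BEGIN { printf "VAF = %.2f\n", a / (a + r) }'
```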
If all of this calling and merging and filtering seems a little complicated, well, that's because it is! Variant calling, filtering, and prioritization is a difficult problem, and doubly so for somatic variants - the issues of purity, ploidy, and heterogeneity that we talked about in the lecture are challenging to overcome.
Thankfully, solid pipelines exist for running variant calling. Pipelines help you in two ways: 1) they encapsulate many of these steps, and allow you to specify the needed inputs and just say "Go". 2) They benefit from years of fine-tuning by folks who have thought deeply about the problem. Many such pipelines exist: for one example, you could look at the somatic CWL pipelines in the analysis-workflows repository or comparable WDL workflows in the analysis-wdls repository.
You still need to be aware of some of the underlying issues and relevant parameters, as we discussed above, but using a pre-built pipeline can save you a lot of time!