At this point, we've taken 100Gb of raw sequence data, aligned it to the genome, called variants and filtered them, and now you have a VCF file with around 3 million germline SNVs and indels.1 Now what?
The next step is generally to do variant annotation. This means layering on extra information that tells you something about the context and consequences of these variants. There are thousands of pieces of evidence that you can choose to add, but here are some of them:
- gene/transcript annotation
- which genes and transcripts are affected by your variant?
- which bases in the sequence are altered?
- protein consequences
- how does it alter the protein structure?
- is it predicted to be deleterious/functionally relevant?
- population frequency
- how often is this variant observed in healthy patients?
- how often is this variant observed in cancer patients?
- intersections with other tracks
- does it hit promoters, enhancers, TF binding sites?
- is it evolutionarily conserved?
- Disease relevance
- Is it present in ClinVar?
- is it in a known cancer-related gene or hotspot?
This is all information we could, in theory, add with something like bedtools, but would be extremely tedious.
Programs like Ensembl VEP, Annovar and OpenCRAVAT provide one-stop shops for adding these kinds of useful annotations.
There are pros and cons to each (play around with them!), but we're going to use VEP (Variant Effect Predictor) today, as it is a highly curated and standardized database of information and can handle data from dozens of species.
Start by downloading this VCF of somatic variants to your desktop. It contains variants we identified from the HCC1395 cell line, subset to just those from chromosome 17.
First, look at the contents of the VCF:
- How many variants are there?
- What samples are represented in the VCF?
- What information do we already have about these variants?
There are two ways to annotate variants with VEP:
- The first is to download the tool, a large (~20Gb) annotation database and run the tool locally.
- Pros: Easy to automate and parallelize to lots of samples or large VCFs, easy to guarantee a stable annotation version.
- Cons: big download, installation required, less user-friendly
- The second is to upload your variants to the web tool.
- Pros: easy to use.
- Cons: The ensembl website is often overloaded/slow, and there are limits on how large your file can be.
The web tool is what we're going to try today, but see the notes at the bottom of this exercise for some info on running VEP locally.
Load up the VEP webpage. On that page, do the following:
- Upload the VCF you downloaded to the service, using the "Upload this file" section.
- Make sure
Gene Symbol,Transcript Version, andHGVSare all checked - Under "Variants and frequency data", make sure that
gnomAD (genomes) allele frequencyis checked. - In the "Transcript Annotation" section, check "Identify canonical transcripts"
- Under "Pathogenicity predictions", check
AlphaMissense - Check
Right align variants prior to consequence calculationat the bottom
Hit "Run". It should move fairly quickly
Look through the summary statistics that VEP provides.2
- What explains the difference between the number of variants and the number of genes and transcripts?
Under "Download", pull down both the VCF file and the TXT file.3
Open up the VCF and compare it to the VCF prior to annotation. That VCF was a little ugly before, but it just got even uglier. At moments like this, it's important to remember that VCF is really meant to be a machine-readable format first and foremost.
To see what kinds of new information are in our VCF, we can look at the header, and specifically, the CSQ field:
grep "^#" HCC1395_somatic_annotated.vcf | grep CSQ | less
So VEP is stuffing all of this information into the CSQ field. Let's make it a little easier to read by replacing all of the pipes with newlines, numbering them, and looking at it with less:
grep "^#" HCC1395_somatic_annotated.vcf | grep CSQ | sed 's/|/\n/g' | cat -n | less
We can see a lot of field names related to the information we asked to be added. This is still awfully hard to read, though, so let's move over to look at the tabular output:
less HCC1395_somatic_annotated.txt
That's a lot more readable (and can be generated from the VCF if needed!). The columns are all messy, though, especially as we scroll right. Let's use a utility called column to help line them up
cat HCC1395_somatic_annotated.txt | column -t | less
There's still a lot to digest here, though. Let's just focus on one gene for a moment: What does the annotation for the variant at chr17 80086143 look like?
grep "17:80086143-80086143" HCC1395_somatic_annotated.txt | column -t | less
Why do we have so many different annotation lines for that single variant?
Answer
Look at the column of ENST ids - there are many different transcripts of this gene
How many different possible amino acid changes are represented by these transcripts?
Hint
Look at the "Protein_position" and "Amino_acids" columns
Answer
One possible way to answer this question would be something like: ``` grep CCDC40 HCC1395_somatic_annotated.txt | cut -f 17,18 | sort | uniq | cat -n ```
So that's all fairly confusing. How do we know which annotation to believe?!
There is almost never one answer to what effect a mutation has. We can however, take an informed guess and boil each variant down to one consequence, and VEP provides a method for doing so, using the "--pick" option on the command line, or by using the "Restrict results" section of the online tool. The criteria by which it chooses is summarized here:
It's really important to remember that while useful, this isn't a foolproof method, and you need to be aware of it's limitations! For example, in your tissue of interest, you may know that one transcript is preferred over another, and if so, you should bring that knowledge to bear.
In the interest of time, go ahead and download a version of this txt file that was run with the --pick option.
- How many lines are in this file, compared to the original?
We have all this annotation, now let's imagine that we're on the search for potentially pathogenic variants that might be causing this person's disease. We obviously can't inspect all of these by hand, so let's narrow it down with some filters:
- We'll start by excluding common variants. gnomAD adds a lot of fields to our table, related to different populations, but here, we'll focus on the
gnomADe_AFfield, which is the allele frequency in the overall gnomAD exome cohort. It turns out that is in the 36th column, so we can use trusty awk to filter it:
First, print all lines that have an AF greater than 0.01, and sanity check that it makes sense:
awk '$36>=0.001' HCC1395_somatic_annotated_pick.txt | column -t | less
Since those look good, let's go ahead and make a filtered file. First, grab the header line:
head -n 1 HCC1395_somatic_annotated_pick.txt > HCC1395_somatic_annotated_pick.filt_af.txt
Then append the variants, being sure to reverse the sign - we want to keep the low AF variants not the high ones:
awk '$36<0.001' HCC1395_somatic_annotated_pick.txt >> HCC1395_somatic_annotated_pick.filt_af.txt
- Next, let's consider only variants that are predicted to alter the protein. Look at all the consequence descriptions in our VCF like this:
cut -f 4 HCC1395_somatic_annotated_pick.txt | sort | uniq
Which of these indicate a change that would potentially alter the protein sequence?
You can also look at the VEP-provided table of all possible consequences.
Let's limit our table to just the header line and these protein-altering variants. Note the use of grep -E which allows grep to match multiple things.
grep -E "Consequence|inframe_deletion|missense_variant|stop_gained" HCC1395_somatic_annotated_pick.filt_af.txt | column -t | less
(what happens if you don't quote that search string?)
Then let's add a filter on the end, requiring that alphamissense also predicts the mutation to be damaging:
grep -E "Consequence|inframe_deletion|missense_variant|stop_gained" HCC1395_somatic_annotated_pick.txt | awk '$66=="am_class" || $66=="likely_pathogenic"' | column -t | less
We've arrived at a total of 9 candidate genes, which is short enough that we could inspect them manually.
There are lots of additional ways that we could filter these: Intersections with known cancer genes or hotspots, genes in pathways of interest, presence in ClinVar or other databases of interest, etc
Well, maybe...
There are a few issues to talk about before doing so:
- Excel can easily mangle specific gene names: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
- There is no record of what changed. Hard to reproduce, easy to fat-finger something and overwrite data in a cell
One possible solution is to treat Excel as "read-only". Open a file, browse around, filter and sort, but then convert those commands to a script for reproducibility.
Different annotation packages give you access to other databases and this can be useful. They also give you different interfaces which can let you filter. Try running the same data through the OpenCRAVAT webapp.
- Click the button to log in as a guest
- upload your original VCF (HCC1395_somatic.vcf.gz)
- choose some annotation fields:
- Variant Effect Prediction
- AlphaMissense
- SIFT
- Cancer
- CIVIC, COSMIC, CIVIC gene, Cancer Hotspots
- Allele Frequency
- gnomAD4
- Variants
- ClinVar
- Drugs
- DGIdb
What are some differences that you notice?
Answer
Interactive design, easy filters, more data sources, especially for Cancer/ClinVar. Also only returns one transcript (which again, has pros and cons!).
Apply some filters to your results to search for our pathogenic variant of interest:
- Coding - check
- AlphaMissense class > likely_pathogenic
- COSMIC - check
Hit "Apply Filter", then switch to the "Variant" tab to peruse the results.
Looking at the additional information, from sources like ClinVar and CIVIC pretty clearly lead us to TP53 as a strong candidate for being relevant in this tumor.
Instructions on installing VEP locally
An example VEP command might look something like (for mouse, ensembl version 115):
/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl -i mydata.vcf -o mydata.annotated.vcf --offline --species mus_musculus --symbol --terms SO --dir /path/to/VEP_cache --vcf --terms SO --pick --assembly GRCm39 --cache_version 115 --everything
Footnotes
-
Or alternately you've run somatic variant calling, called 5,000 somatic variants, filtered it, and now you want to figure out which of these variants is contributing to the cancer ↩
-
If VEP was down, you can look at a screenshot of the results ↩
-
If VEP was down, you can access these files here: VCF | TXT ↩

