Skip to content

Instantly share code, notes, and snippets.

@chrisamiller
Last active November 24, 2025 05:06
Show Gist options
  • Select an option

  • Save chrisamiller/37803b13ae1255fb1dd48410ec5d54a7 to your computer and use it in GitHub Desktop.

Select an option

Save chrisamiller/37803b13ae1255fb1dd48410ec5d54a7 to your computer and use it in GitHub Desktop.
Tutorial on doing variant annotation and filtering with VEP and OpenCRAVAT

Variant annotation

At this point, we've taken 100Gb of raw sequence data, aligned it to the genome, called variants and filtered them, and now you have a VCF file with around 3 million germline SNVs and indels.1 Now what?

The next step is generally to do variant annotation. This means layering on extra information that tells you something about the context and consequences of these variants. There are thousands of pieces of evidence that you can choose to add, but here are some of them:

  • gene/transcript annotation
    • which genes and transcripts are affected by your variant?
    • which bases in the sequence are altered?
  • protein consequences
    • how does it alter the protein structure?
    • is it predicted to be deleterious/functionally relevant?
  • population frequency
    • how often is this variant observed in healthy patients?
    • how often is this variant observed in cancer patients?
  • intersections with other tracks
    • does it hit promoters, enhancers, TF binding sites?
    • is it evolutionarily conserved?
  • Disease relevance
    • Is it present in ClinVar?
    • is it in a known cancer-related gene or hotspot?

This is all information we could, in theory, add with something like bedtools, but would be extremely tedious.

Programs like Ensembl VEP, Annovar and OpenCRAVAT provide one-stop shops for adding these kinds of useful annotations.

There are pros and cons to each (play around with them!), but we're going to use VEP (Variant Effect Predictor) today, as it is a highly curated and standardized database of information and can handle data from dozens of species.

Annotating a VCF with VEP

Start by downloading this VCF of somatic variants to your desktop. It contains variants we identified from the HCC1395 cell line, subset to just those from chromosome 17.

First, look at the contents of the VCF:

  • How many variants are there?
  • What samples are represented in the VCF?
  • What information do we already have about these variants?

There are two ways to annotate variants with VEP:

  1. The first is to download the tool, a large (~20Gb) annotation database and run the tool locally.
  • Pros: Easy to automate and parallelize to lots of samples or large VCFs, easy to guarantee a stable annotation version.
  • Cons: big download, installation required, less user-friendly
  1. The second is to upload your variants to the web tool.
  • Pros: easy to use.
  • Cons: The ensembl website is often overloaded/slow, and there are limits on how large your file can be.

The web tool is what we're going to try today, but see the notes at the bottom of this exercise for some info on running VEP locally.

Load up the VEP webpage. On that page, do the following:

  • Upload the VCF you downloaded to the service, using the "Upload this file" section.
  • Make sure Gene Symbol, Transcript Version, and HGVS are all checked
  • Under "Variants and frequency data", make sure that gnomAD (genomes) allele frequency is checked.
  • In the "Transcript Annotation" section, check "Identify canonical transcripts"
  • Under "Pathogenicity predictions", check AlphaMissense
  • Check Right align variants prior to consequence calculation at the bottom

Hit "Run". It should move fairly quickly

Exploring the annotations in the browser

Look through the summary statistics that VEP provides.2

  • What explains the difference between the number of variants and the number of genes and transcripts?

Exploring the output files

Under "Download", pull down both the VCF file and the TXT file.3

Open up the VCF and compare it to the VCF prior to annotation. That VCF was a little ugly before, but it just got even uglier. At moments like this, it's important to remember that VCF is really meant to be a machine-readable format first and foremost.

To see what kinds of new information are in our VCF, we can look at the header, and specifically, the CSQ field:

grep "^#" HCC1395_somatic_annotated.vcf  | grep CSQ | less

So VEP is stuffing all of this information into the CSQ field. Let's make it a little easier to read by replacing all of the pipes with newlines, numbering them, and looking at it with less:

grep "^#" HCC1395_somatic_annotated.vcf  | grep CSQ | sed 's/|/\n/g' | cat -n | less

We can see a lot of field names related to the information we asked to be added. This is still awfully hard to read, though, so let's move over to look at the tabular output:

 less HCC1395_somatic_annotated.txt

That's a lot more readable (and can be generated from the VCF if needed!). The columns are all messy, though, especially as we scroll right. Let's use a utility called column to help line them up

cat HCC1395_somatic_annotated.txt | column -t | less

There's still a lot to digest here, though. Let's just focus on one gene for a moment: What does the annotation for the variant at chr17 80086143 look like?

grep "17:80086143-80086143" HCC1395_somatic_annotated.txt | column -t | less

Why do we have so many different annotation lines for that single variant?

Answer

Look at the column of ENST ids - there are many different transcripts of this gene

How many different possible amino acid changes are represented by these transcripts?

Hint

Look at the "Protein_position" and "Amino_acids" columns

Answer

One possible way to answer this question would be something like: ``` grep CCDC40 HCC1395_somatic_annotated.txt | cut -f 17,18 | sort | uniq | cat -n ```

So that's all fairly confusing. How do we know which annotation to believe?!

Picking a transcript

There is almost never one answer to what effect a mutation has. We can however, take an informed guess and boil each variant down to one consequence, and VEP provides a method for doing so, using the "--pick" option on the command line, or by using the "Restrict results" section of the online tool. The criteria by which it chooses is summarized here:

image text

It's really important to remember that while useful, this isn't a foolproof method, and you need to be aware of it's limitations! For example, in your tissue of interest, you may know that one transcript is preferred over another, and if so, you should bring that knowledge to bear.

In the interest of time, go ahead and download a version of this txt file that was run with the --pick option.

  • How many lines are in this file, compared to the original?

Prioritizing variants

We have all this annotation, now let's imagine that we're on the search for potentially pathogenic variants that might be causing this person's disease. We obviously can't inspect all of these by hand, so let's narrow it down with some filters:

  • We'll start by excluding common variants. gnomAD adds a lot of fields to our table, related to different populations, but here, we'll focus on the gnomADe_AF field, which is the allele frequency in the overall gnomAD exome cohort. It turns out that is in the 36th column, so we can use trusty awk to filter it:

Filtering on Allele Frequency

First, print all lines that have an AF greater than 0.01, and sanity check that it makes sense:

awk '$36>=0.001' HCC1395_somatic_annotated_pick.txt | column -t | less

Since those look good, let's go ahead and make a filtered file. First, grab the header line:

head -n 1 HCC1395_somatic_annotated_pick.txt > HCC1395_somatic_annotated_pick.filt_af.txt

Then append the variants, being sure to reverse the sign - we want to keep the low AF variants not the high ones:

awk '$36<0.001' HCC1395_somatic_annotated_pick.txt >> HCC1395_somatic_annotated_pick.filt_af.txt

Filtering on consequence

  • Next, let's consider only variants that are predicted to alter the protein. Look at all the consequence descriptions in our VCF like this:
cut -f 4 HCC1395_somatic_annotated_pick.txt | sort | uniq

Which of these indicate a change that would potentially alter the protein sequence?

You can also look at the VEP-provided table of all possible consequences.

Let's limit our table to just the header line and these protein-altering variants. Note the use of grep -E which allows grep to match multiple things.

grep -E "Consequence|inframe_deletion|missense_variant|stop_gained" HCC1395_somatic_annotated_pick.filt_af.txt | column -t | less

(what happens if you don't quote that search string?)

Then let's add a filter on the end, requiring that alphamissense also predicts the mutation to be damaging:

grep -E "Consequence|inframe_deletion|missense_variant|stop_gained" HCC1395_somatic_annotated_pick.txt | awk '$66=="am_class" || $66=="likely_pathogenic"' | column -t | less

We've arrived at a total of 9 candidate genes, which is short enough that we could inspect them manually.

There are lots of additional ways that we could filter these: Intersections with known cancer genes or hotspots, genes in pathways of interest, presence in ClinVar or other databases of interest, etc

Could I just do this filtering in Excel?

Well, maybe...

There are a few issues to talk about before doing so:

  1. Excel can easily mangle specific gene names: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

excel figure

  1. There is no record of what changed. Hard to reproduce, easy to fat-finger something and overwrite data in a cell

One possible solution is to treat Excel as "read-only". Open a file, browse around, filter and sort, but then convert those commands to a script for reproducibility.

Another option: OpenCRAVAT

Different annotation packages give you access to other databases and this can be useful. They also give you different interfaces which can let you filter. Try running the same data through the OpenCRAVAT webapp.

  1. Click the button to log in as a guest
  2. upload your original VCF (HCC1395_somatic.vcf.gz)
  3. choose some annotation fields:
  • Variant Effect Prediction
    • AlphaMissense
    • SIFT
  • Cancer
    • CIVIC, COSMIC, CIVIC gene, Cancer Hotspots
  • Allele Frequency
    • gnomAD4
  • Variants
    • ClinVar
  • Drugs
    • DGIdb

What are some differences that you notice?

Answer

Interactive design, easy filters, more data sources, especially for Cancer/ClinVar. Also only returns one transcript (which again, has pros and cons!).

Apply some filters to your results to search for our pathogenic variant of interest:

  • Coding - check
  • AlphaMissense class > likely_pathogenic
  • COSMIC - check

Hit "Apply Filter", then switch to the "Variant" tab to peruse the results.

Looking at the additional information, from sources like ClinVar and CIVIC pretty clearly lead us to TP53 as a strong candidate for being relevant in this tumor.

Notes

Instructions on installing VEP locally

Docker images containing VEP

An example VEP command might look something like (for mouse, ensembl version 115):

/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl -i mydata.vcf -o mydata.annotated.vcf --offline --species mus_musculus --symbol --terms SO --dir /path/to/VEP_cache --vcf --terms SO --pick --assembly GRCm39 --cache_version 115 --everything

Footnotes

Footnotes

  1. Or alternately you've run somatic variant calling, called 5,000 somatic variants, filtered it, and now you want to figure out which of these variants is contributing to the cancer

  2. If VEP was down, you can look at a screenshot of the results

  3. If VEP was down, you can access these files here: VCF | TXT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment