Skip to content

Instantly share code, notes, and snippets.

@ckandoth
Last active October 4, 2022 21:49
Show Gist options
  • Star 12 You must be signed in to star a gist
  • Fork 6 You must be signed in to fork a gist
  • Save ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 to your computer and use it in GitHub Desktop.
Save ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 to your computer and use it in GitHub Desktop.
Install Ensembl's VEP v95 with various caches for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

To follow these instructions, we'll assume you have these packaged essentials installed:

## For Debian/Ubuntu system admins ##
sudo apt-get install -y build-essential git libncurses-dev

## For RHEL/CentOS system admins ##
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y git ncurses-devel

VEP has been tested on Perl 5.10 and newer. If you do not have admin rights to upgrade your system Perl, try Perlbrew. I recommend Perl 5.26, if you don't know which one to choose.

Create temporary shell variables pointing to where we'll store VEP and its cache data. The paths below are the default for vcf2maf and maf2maf, but different paths can be used. You'll just need to specify --vep-path and --vep-data when running vcf2maf or maf2maf:

export VEP_PATH=$HOME/vep
export VEP_DATA=$HOME/.vep
export VER=95

Download VEP:

curl -LO https://github.com/Ensembl/ensembl-vep/archive/release/95.3.tar.gz
tar -zxf 95.3.tar.gz; rm -f 95.3.tar.gz; mv ensembl-vep-release-95.3 $VEP_PATH
cd $VEP_PATH

Add that path to PERL5LIB, and the htslib subfolder to PATH where tabix will be installed:

export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH

Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/homo_sapiens_vep_$VER\_GRCh37.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/homo_sapiens_vep_$VER\_GRCh38.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-$VER/variation/VEP/mus_musculus_vep_$VER\_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_$VER\_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA

Install the Ensembl API, the reference FASTAs for GRCh37/GRCh38/GRCm38:

perl INSTALL.pl --AUTO a --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA --NO_HTSLIB
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO f --SPECIES homo_sapiens --ASSEMBLY GRCh38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO f --SPECIES mus_musculus --ASSEMBLY GRCm38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

Convert the offline cache for use with tabix, that significantly speeds up the lookup of known variants:

perl convert_cache.pl --species homo_sapiens --version $VER\_GRCh37 --dir $VEP_DATA
perl convert_cache.pl --species homo_sapiens --version $VER\_GRCh38 --dir $VEP_DATA
perl convert_cache.pl --species mus_musculus --version $VER\_GRCm38 --dir $VEP_DATA

Download and build samtools and bcftools, which we'll need for steps below, and when running vcf2maf/maf2maf:

mkdir $VEP_PATH/samtools && cd $VEP_PATH/samtools
curl -LOOO https://github.com/samtools/{samtools/releases/download/1.4.1/samtools-1.4.1,bcftools/releases/download/1.4.1/bcftools-1.4.1,htslib/releases/download/1.4.1/htslib-1.4.1}.tar.bz2
cat *tar.bz2 | tar -ijxf -
cd htslib-1.4.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd samtools-1.4.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd bcftools-1.4.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd ..

Download the liftOver binary down the same path, and make it executable:

curl -L http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver > bin/liftOver
chmod a+x bin/liftOver

Set $PATH to find all those tools, and also add this line to your ~/.bashrc to make it persistent. Be sure to edit the path below, if you didn't do this in your $HOME:

export PATH=$HOME/vep/samtools/bin:$PATH

Download the ExAC r0.3.1 VCF with germline variants called across thousands of normal samples excluding TCGA:

curl -L ftp://ftp.broadinstitute.org:/pub/ExAC_release/release0.3.1/subsets/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz > $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

We'll make some fixes to this VCF, so it's easier to work with:

  • Fix the header with a line for AC_Adj0_Filter, so that bcftools won't complain about it
  • Remove header lines describing FORMAT, because that data is not in the VCF
  • Remove all INFO fields except the allele counts/numbers, to reduce file size
  • Remove calls in known_somatic_sites.bed, likely somatic events related to clonal hematopoiesis
echo "##FILTER=<ID=AC_Adj0_Filter,Description=\"Only low quality genotype calls containing alternate alleles are present\">" > header_line.tmp
curl -LO https://raw.githubusercontent.com/mskcc/vcf2maf/v1.6.16/data/known_somatic_sites.bed
bcftools annotate --header-lines header_line.tmp --remove FMT,^INF/AF,INF/AC,INF/AN,INF/AC_Adj,INF/AN_Adj,INF/AC_AFR,INF/AC_AMR,INF/AC_EAS,INF/AC_FIN,INF/AC_NFE,INF/AC_OTH,INF/AC_SAS,INF/AN_AFR,INF/AN_AMR,INF/AN_EAS,INF/AN_FIN,INF/AN_NFE,INF/AN_OTH,INF/AN_SAS $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz | bcftools filter --targets-file ^known_somatic_sites.bed --output-type z --output $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz

Replace the original to save space, and tabix index for efficient lookup by VEP:

mv -f $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
tabix -p vcf $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

Test running VEP in offline mode the provided sample VCFs:

./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory
./vep --species homo_sapiens --assembly GRCh38 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --input_file examples/homo_sapiens_GRCh38.vcf --output_file examples/homo_sapiens_GRCh38.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory
./vep --species mus_musculus --assembly GRCm38 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/mus_musculus/$VER\_GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa.gz --input_file examples/mus_musculus_GRCm38.vcf --output_file examples/mus_musculus_GRCm38.vep.vcf --regulatory
@kurt-mueller-osumc
Copy link

Hello,

My apologies if this is not the appropriate place to ask technical questions. Let me also say I know that this is OSS work that you've made publicly available for everybody to use, free of charge, and based off your hard work. I appreciate that and want to acknowledge it.

The bcftools annotate command is throwing an error:

$ bcftools annotate --header-lines header_line.tmp --remove FMT,^INF/AF,INF/AC,INF/AN,INF/AC_Adj,INF/AN_Adj,INF/AC_AFR,INF/AC_AMR,INF/AC_EAS,INF/AC_FIN,INF/AC_NFE,INF/AC_OTH,INF/AC_SAS,INF/AN_AFR,INF/AN_AMR,INF/AN_EAS,INF/AN_FIN,INF/AN_NFE,INF/AN_OTH,INF/AN_SAS ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz | bcftools filter --targets-file ^known_somatic_sites.bed --output-type z --output ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz
Failed to open header_line.tmp: unknown file type

As a sanity check, list the contents of the current directory and then the contents of header_line.tmp

$ ls
ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz header_line.tmp                         known_somatic_sites.bed

$ cat header_line.tmp
##FILTER=<ID=AC_Adj0_Filter,Description="Only low quality genotype calls containing alternate alleles are present">

The current bcftools version is 1.9 and was installed through miniconda:

$ conda list | grep bcftools
bcftools                  1.9                  h16e57c4_9    bioconda

Thank you for any and all help/guidance.

@kurt-mueller-osumc
Copy link

I think the problem is with this command:

bcftools annotate --header-lines header_line.tmp --remove FMT,^INF/AF,INF/AC,INF/AN,INF/AC_Adj,INF/AN_Adj,INF/AC_AFR,INF/AC_AMR,INF/AC_EAS,INF/AC_FIN,INF/AC_NFE,INF/AC_OTH,INF/AC_SAS,INF/AN_AFR,INF/AN_AMR,INF/AN_EAS,INF/AN_FIN,INF/AN_NFE,INF/AN_OTH,INF/AN_SAS $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz | bcftools filter --targets-file ^known_somatic_sites.bed --output-type z --output $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz

Specifically, after the pipe command:

# The targets file is prefixed with a caret 
bcftools filter --targets-file ^known_somatic_sites.bed --output-type z --output $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz

Instead, use this command:

bcftools annotate --header-lines header_line.tmp --remove FMT,^INF/AF,INF/AC,INF/AN,INF/AC_Adj,INF/AN_Adj,INF/AC_AFR,INF/AC_AMR,INF/AC_EAS,INF/AC_FIN,INF/AC_NFE,INF/AC_OTH,INF/AC_SAS,INF/AN_AFR,INF/AN_AMR,INF/AN_EAS,INF/AN_FIN,INF/AN_NFE,INF/AN_OTH,INF/AN_SAS $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz | bcftools filter --targets-file known_somatic_sites.bed --output-type z --output $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz

@ckandoth
Copy link
Author

@kurt-mueller-osumc Thanks for reporting. But I was unable to reproduce your error with bcftools 1.9 installed using miniconda. The caret ^ before the targets file is required to instruct bcftools to do a logical complement i.e. to exclude those regions from the ExAC VCF. The Failed to open header_line.tmp: unknown file type error can happen if header_line.tmp was not fully written to disk before your bcftools command. Delete the file, and try running those commands again, one at a time.

Note that I have a newer gist for VEP 100 that uses miniconda to simplify VEP installation. It no longer includes instructions for preparing the ExAC VCF, in favor of gnomAD data that VEP already reports.

@kurt-mueller-osumc
Copy link

Thank you for your help Cyriac. I've added the --ncbi-build GRCh38 flag and argument and it seems to be working well :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment