@ckandoth
Last active May 15, 2018 08:50
Install Ensembl's VEP v85 with various caches for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

To follow these instructions, we'll assume you have these packaged essentials installed:

## For Debian/Ubuntu system admins ##
sudo apt-get install -y build-essential git libncurses-dev

## For RHEL/CentOS system admins ##
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y git ncurses-devel
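
Before moving on, it can help to confirm that the tools used in the later steps are actually on your $PATH. A minimal sketch, assuming a POSIX-compatible shell; `missing_tools` is a hypothetical helper name, and the tool list in the example comment is simply the set of commands this guide calls:

```shell
# Print the name of every command passed as an argument that is not found on PATH.
# missing_tools is an illustrative helper, not part of VEP or vcf2maf.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: missing_tools git make gcc curl rsync tar perl
```

If the example call prints nothing, everything was found; otherwise install whatever it lists before continuing.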

Follow this gist to set up Perl 5.22 in a folder somewhere and install the libraries that VEP needs. Be sure to follow the steps that update $PERL5LIB to find those libraries, and set $PATH to use that new Perl instead of the system Perl.

Create temporary shell variables pointing to where we'll store VEP and its cache data. The paths below are the default for vcf2maf and maf2maf, but different paths can be used. You'll just need to specify --vep-path and --vep-data when running vcf2maf or maf2maf:

export VEP_PATH=$HOME/vep
export VEP_DATA=$HOME/.vep

Download the v85 release of VEP:

mkdir -p $VEP_PATH $VEP_DATA && cd $VEP_PATH
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/85.tar.gz
tar -zxf 85.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'
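
The tar command above flattens the archive paths, so the VEP scripts should land directly in $VEP_PATH. A small sanity check, as a sketch: `check_vep_scripts` is a hypothetical helper, and the three file names are the scripts that the later steps in this guide invoke:

```shell
# Report which of the VEP v85 scripts used in this guide are missing from a folder.
# check_vep_scripts is an illustrative helper; it returns non-zero if any is absent.
check_vep_scripts() {
    vep_dir=$1
    rc=0
    for script in INSTALL.pl convert_cache.pl variant_effect_predictor.pl; do
        if [ ! -f "$vep_dir/$script" ]; then
            echo "missing: $vep_dir/$script" >&2
            rc=1
        fi
    done
    return $rc
}
```

Running `check_vep_scripts "$VEP_PATH"` should print nothing and return success if the extraction worked as intended.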

Add that path to PERL5LIB, and add the htslib subfolder, where tabix will be installed, to PATH:

export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH
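
To confirm the two exports took effect in the current shell, you can test whether a folder is one of the colon-separated entries of a variable. A minimal sketch; `path_contains` is a hypothetical helper name:

```shell
# Succeed if directory $2 is one of the colon-separated entries of the list $1.
# Wrapping both sides in colons avoids partial matches such as /a/htslib2.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `path_contains "$PERL5LIB" "$VEP_PATH" && path_contains "$PATH" "$VEP_PATH/htslib"` should succeed after the exports above.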

Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-85/variation/VEP/homo_sapiens_vep_85_GRCh{37,38}.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-85/variation/VEP/mus_musculus_vep_85_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_85_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
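
The cache tarballs are large, and an interrupted rsync can leave a truncated file behind. One way to catch that before unpacking is to test each archive with `gzip -t`. This is a sketch, not part of the original steps, and `verify_gz` is a hypothetical helper name:

```shell
# Succeed only if every file given is an intact gzip stream; report the bad ones.
# verify_gz is an illustrative helper built on `gzip -t`.
verify_gz() {
    rc=0
    for archive in "$@"; do
        if ! gzip -t "$archive" 2>/dev/null; then
            echo "corrupt or incomplete: $archive" >&2
            rc=1
        fi
    done
    return $rc
}
```

You could run `verify_gz $VEP_DATA/*_vep_85_GRC{h37,h38,m38}.tar.gz` and re-run rsync for any file it reports. Note that the `-i` passed to tar above makes it ignore the end-of-archive zero blocks, which is what allows several concatenated tarballs to be unpacked in one pass.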

Install the Ensembl API, the reference FASTAs for GRCh37/GRCh38/GRCm38, and some neat VEP plugins:

perl INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh37 --PLUGINS ExAC --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh38 --PLUGINS ExAC --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO af --SPECIES mus_musculus --ASSEMBLY GRCm38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:

perl convert_cache.pl --species homo_sapiens --version 85_GRCh37 --dir $VEP_DATA
perl convert_cache.pl --species homo_sapiens --version 85_GRCh38 --dir $VEP_DATA
perl convert_cache.pl --species mus_musculus --version 85_GRCm38 --dir $VEP_DATA

Download and build samtools and bcftools, which we'll need for steps below, and when running vcf2maf/maf2maf:

mkdir $VEP_PATH/samtools && cd $VEP_PATH/samtools
curl -LOOO https://github.com/samtools/{samtools/releases/download/1.3.1/samtools-1.3.1,bcftools/releases/download/1.3.1/bcftools-1.3.1,htslib/releases/download/1.3.2/htslib-1.3.2}.tar.bz2
cat *tar.bz2 | tar -ijxf -
cd htslib-1.3.2 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd samtools-1.3.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd bcftools-1.3.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd ..

Set $PATH so that all those tools can be found, and add the same line to your ~/.bashrc to make it persistent. Be sure to edit the path below if you didn't install under your $HOME:

export PATH=$HOME/vep/samtools/bin:$PATH
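
If you re-run these instructions, appending the export line blindly will duplicate it in ~/.bashrc. A minimal sketch of an idempotent append, where `append_once` is a hypothetical helper name:

```shell
# Append a line to a file only if that exact line is not already present.
# grep -x matches whole lines and -F treats the line as a fixed string, not a regex.
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example: `append_once 'export PATH=$HOME/vep/samtools/bin:$PATH' ~/.bashrc`. The single quotes keep $HOME and $PATH unexpanded, so they are evaluated each time a new shell starts.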

Download the ExAC r0.3.1 VCF with germline variants called across thousands of normal samples excluding TCGA:

curl -L ftp://ftp.broadinstitute.org:/pub/ExAC_release/release0.3.1/subsets/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz > $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

Remove ACADS:R330 and DNMT3A:R882 variants, which are likely somatic events related to clonal hematopoietic expansion:

bcftools filter --targets ^2:25457242-25457243,12:121176677-121176678 --output-type z --output $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.minus_somatic.vep.vcf.gz $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
mv -f $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.minus_somatic.vep.vcf.gz $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
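
The `^` prefix on `--targets` tells bcftools to exclude the listed regions rather than select them. To illustrate what that exclusion amounts to, here is a rough awk equivalent that works on uncompressed VCF text. It is only a sketch for understanding and for spot checks on small files; `drop_hotspots` is a hypothetical name, and bcftools remains the tool to use on the real file:

```shell
# Drop VCF records whose POS falls in either excluded interval; keep header lines.
# Reads uncompressed VCF on stdin and writes the filtered VCF to stdout.
drop_hotspots() {
    awk -F '\t' '
        /^#/ { print; next }
        $1 == "2"  && $2 >= 25457242  && $2 <= 25457243  { next }
        $1 == "12" && $2 >= 121176677 && $2 <= 121176678 { next }
        { print }
    '
}
```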

Tabix index the VCF for efficient lookup by VEP:

tabix -p vcf $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
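
Because the filtered VCF replaces the original file under the same name, an index left over from an earlier run would no longer match it. A small check, as a sketch: `index_is_fresh` is a hypothetical helper, and it relies on the `-nt` file test, which bash and most other common shells support:

```shell
# Succeed if the tabix index exists and is not older than the VCF it indexes.
index_is_fresh() {
    vcf=$1
    [ -f "$vcf.tbi" ] && [ ! "$vcf" -nt "$vcf.tbi" ]
}
```

If `index_is_fresh $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz` fails, re-run the tabix command above.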

Test running VEP in offline mode with the ExAC plugin, on the provided sample GRCh37 VCF:

perl variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/85_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.vcf --polyphen b --gmaf --maf_1kg --maf_esp --regulatory --plugin ExAC,$VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
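
A quick way to confirm the test run produced annotated output is to look for the CSQ INFO definition that VEP writes into the header when run with --vcf. A minimal sketch; `has_csq_header` is a hypothetical helper name:

```shell
# Succeed if the VCF header declares a CSQ INFO field, as VEP's --vcf output does.
has_csq_header() {
    grep -q '^##INFO=<ID=CSQ,' "$1"
}
```

For example, `has_csq_header example_GRCh37.vep.vcf && echo "VEP annotations present"`. This only checks the header, so it is worth also glancing at a few records to see that their INFO column carries a `CSQ=` entry.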