@ckandoth
Last active March 29, 2019 18:07
Install Ensembl's VEP v82 with various caches for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

To follow these instructions, we'll assume you have these packaged essentials installed:

sudo yum install -y curl rsync tar make perl perl-core
## OR ##
sudo apt-get install -y curl rsync tar make perl perl-base

You'll also need samtools and tabix, both available from htslib.org, in your $PATH.
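
Before going further, it can help to confirm that these tools are actually on your $PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name introduced here for illustration, not part of VEP or htslib:

```shell
# Hypothetical helper: print each named command that is not found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report anything from the prerequisites above that still needs installing.
missing=$(missing_tools curl rsync tar make perl samtools tabix)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All prerequisites found"
fi
```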

Set PERL_PATH to where you want to install additional perl libraries. Change this as needed:

export PERL_PATH=~/perl5

Handle VEP's Perl dependencies using cpanminus to install them under $PERL_PATH:

curl -L https://cpanmin.us | perl - --notest -l $PERL_PATH LWP::Simple LWP::Protocol::https Archive::Extract Archive::Tar Archive::Zip CGI DBI Time::HiRes

Set PERL5LIB to find those libraries. Add this to the end of your ~/.bashrc to make it persistent:

export PERL5LIB=$PERL_PATH/lib/perl5:$PERL_PATH/lib/perl5/x86_64-linux
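
If you rerun these instructions, the same export line can pile up in ~/.bashrc. One way to avoid that is to append a line only when it is not already present. This is a sketch under that assumption; `append_once` is a hypothetical helper, not something VEP provides:

```shell
# Hypothetical helper: append a line to a file only if that exact line is absent.
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -x matches whole lines, -F treats the line as a fixed string, -q stays silent.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (commented out so that pasting this block changes nothing):
# append_once "export PERL5LIB=$PERL_PATH/lib/perl5:$PERL_PATH/lib/perl5/x86_64-linux" ~/.bashrc
```

Because the example line is double-quoted, $PERL_PATH is expanded when the helper is called, so the expanded path is what ends up in ~/.bashrc.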

Create temporary shell variables pointing to where we'll store VEP and its cache data. Non-default paths can be used, but you must then specify --vep-path and --vep-data when running vcf2maf or maf2maf:

export VEP_PATH=~/vep
export VEP_DATA=~/.vep

Download the v82 release of VEP:

mkdir -p $VEP_PATH $VEP_DATA; cd $VEP_PATH
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/82.tar.gz
tar -zxf 82.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'

Add that path to PERL5LIB, and add the htslib subfolder, where tabix will be installed, to PATH:

export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH

Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-82/variation/VEP/homo_sapiens_vep_82_GRCh{37,38}.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-82/variation/VEP/mus_musculus_vep_82_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_82_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
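
After unpacking, a quick check that each cache folder exists can catch a failed download early. The sketch below uses a hypothetical helper, `missing_caches`. The folder names are assumptions inferred from the FASTA path in the test command later in these instructions and from the species/version naming, so adjust them if your layout differs:

```shell
# Hypothetical helper: print each expected cache folder missing under the given directory.
# Folder names are assumed from the species/version naming used in these instructions.
missing_caches() {
    data_dir=$1
    for sub in homo_sapiens/82_GRCh37 homo_sapiens/82_GRCh38 mus_musculus/82_GRCm38; do
        [ -d "$data_dir/$sub" ] || printf '%s\n' "$sub"
    done
}

# Prints nothing when all three caches are in place.
missing_caches "${VEP_DATA:-$HOME/.vep}"
```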

Install the Ensembl API, the reference FASTAs for GRCh37/GRCh38/GRCm38, and some neat VEP plugins:

perl INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh37 --PLUGINS ExAC,UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh38 --PLUGINS ExAC,UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO afp --SPECIES mus_musculus --ASSEMBLY GRCm38 --PLUGINS UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:

perl convert_cache.pl --species homo_sapiens,mus_musculus --version 82_GRCh37,82_GRCh38,82_GRCm38 --dir $VEP_DATA

Download and index a custom ExAC r0.3 VCF, which excludes variants overlapping known somatic hotspots:

curl -L https://googledrive.com/host/0B6o74flPT8FAYnBJTk9aTF9WVnM > $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
tabix -p vcf $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
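
If tabix rejects the file, one thing worth checking is whether the download is really compressed data, since a stale link can leave curl saving an error page instead. A minimal sketch, assuming only that gzip and bgzip files begin with the magic bytes 1f 8b; `looks_gzipped` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the file starts with the gzip magic bytes 1f 8b.
looks_gzipped() {
    magic=$(od -An -tx1 -N2 "$1" 2>/dev/null | tr -d '[:space:]')
    [ "$magic" = "1f8b" ]
}

# Example usage; ~/.vep is the default VEP_DATA used earlier in these instructions.
vcf="${VEP_DATA:-$HOME/.vep}/ExAC.r0.3.sites.minus_somatic.vcf.gz"
if looks_gzipped "$vcf"; then
    echo "$vcf looks like gzip data"
else
    echo "$vcf is missing or not gzip data; check the download"
fi
```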

Test running VEP in offline mode with the ExAC plugin, on the provided sample GRCh37 VCF:

perl variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --everything --shift_hgvs 1 --check_existing --check_alleles --total_length --allele_number --no_escape --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/82_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --plugin ExAC,$VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt
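
To confirm that the test run produced annotations, you can count the data lines in the output. This sketch assumes VEP's default tab-delimited output, in which header and metadata lines start with `#`; `count_vep_records` is a hypothetical helper name:

```shell
# Hypothetical helper: count lines that do not start with '#' in a VEP output file.
count_vep_records() {
    # grep -c exits non-zero when the count is 0, so keep the function's status clean.
    grep -vc '^#' "$1" || true
}

# Only report if the test output from the command above exists in the current folder.
if [ -f example_GRCh37.vep.txt ]; then
    echo "Annotated lines: $(count_vep_records example_GRCh37.vep.txt)"
fi
```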