Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.
To follow these instructions, we'll assume you have these packaged essentials installed:
sudo yum install -y curl rsync tar make perl perl-core
## OR ##
sudo apt-get install -y curl rsync tar make perl perl-base
You'll also need samtools
and tabix
in your $PATH
, which can be found at htslib.org
Set PERL_PATH to where you want to install additional perl libraries. Change this as needed:
export PERL_PATH=~/perl5
Handle VEP's Perl dependencies using cpanminus to install them under $PERL_PATH:
curl -L http://cpanmin.us | perl - --notest -l $PERL_PATH LWP::Simple LWP::Protocol::https Archive::Extract Archive::Tar Archive::Zip CGI DBI Time::HiRes
Set PERL5LIB to find those libraries. Add this to the end of your ~/.bashrc
to make it persistent:
export PERL5LIB=$PERL_PATH/lib/perl5:$PERL_PATH/lib/perl5/x86_64-linux
Create temporary shell variables pointing to where we'll store VEP and its cache data (non default paths can be used, but specify --vep-path
and --vep-data
when running vcf2maf or maf2maf):
export VEP_PATH=~/vep
export VEP_DATA=~/.vep
Download the v82 release of VEP:
mkdir $VEP_PATH $VEP_DATA; cd $VEP_PATH
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/82.tar.gz
tar -zxf 82.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'
Add that path to PERL5LIB
, and the htslib subfolder to PATH
where tabix
will be installed:
export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH
Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-82/variation/VEP/homo_sapiens_vep_82_GRCh{37,38}.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-82/variation/VEP/mus_musculus_vep_82_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_82_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
Install the Ensembl API, the reference FASTAs for GRCh37/GRCh38/GRCm38, and some neat VEP plugins:
perl INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh37 --PLUGINS ExAC,UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO afp --SPECIES homo_sapiens --ASSEMBLY GRCh38 --PLUGINS ExAC,UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO afp --SPECIES mus_musculus --ASSEMBLY GRCm38 --PLUGINS UpDownDistance --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
Convert the offline cache for use with tabix, that significantly speeds up the lookup of known variants:
perl convert_cache.pl --species homo_sapiens,mus_musculus --version 82_GRCh37,82_GRCh38,82_GRCm38 --dir $VEP_DATA
Download and index a custom ExAC r0.3 VCF, that skips variants overlapping known somatic hotspots:
curl -L https://googledrive.com/host/0B6o74flPT8FAYnBJTk9aTF9WVnM > $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
tabix -p vcf $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
Test running VEP in offline mode with the ExAC plugin, on the provided sample GRCh37 VCF:
perl variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --everything --shift_hgvs 1 --check_existing --check_alleles --total_length --allele_number --no_escape --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/82_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --plugin ExAC,$VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt