@ckandoth
Last active July 15, 2021 16:26
Install Ensembl's VEP v86 with various caches for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

To follow these instructions, we'll assume you have these packaged essentials installed:

## For Debian/Ubuntu system admins ##
sudo apt-get install -y build-essential git libncurses-dev

## For RHEL/CentOS system admins ##
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y git ncurses-devel

Follow this gist to set up Perl 5.22 in a folder somewhere and install the libraries that VEP needs. Be sure to follow the steps that update $PERL5LIB to find those libraries, and set $PATH to use that new Perl instead of the system Perl.

Create temporary shell variables pointing to where we'll store VEP and its cache data. The paths below are the default for vcf2maf and maf2maf, but different paths can be used. You'll just need to specify --vep-path and --vep-data when running vcf2maf or maf2maf:

export VEP_PATH=$HOME/vep
export VEP_DATA=$HOME/.vep
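
As a rough illustration of the non-default case, the sketch below only prints a vcf2maf command line instead of running it, since vcf2maf itself is not installed by this guide. The function name `vcf2maf_cmd`, the folders `/opt/vep` and `/data/vep_cache`, and the file names are hypothetical; other vcf2maf options are omitted.

```shell
# Print (not run) a vcf2maf command line that points at a custom VEP install.
# $1 = VEP install folder, $2 = VEP cache folder, $3 = input VCF, $4 = output MAF
vcf2maf_cmd() {
    echo "perl vcf2maf.pl --input-vcf $3 --output-maf $4 --vep-path $1 --vep-data $2"
}

# Hypothetical paths, for illustration only
vcf2maf_cmd /opt/vep /data/vep_cache input.vcf output.maf
```

With the default paths above, both `--vep-path` and `--vep-data` can be left out.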

Download the v86 release of VEP:

mkdir $VEP_PATH $VEP_DATA; cd $VEP_PATH
curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/86.tar.gz
tar -zxf 86.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'

Add that path to PERL5LIB, and the htslib subfolder to PATH where tabix will be installed:

export PERL5LIB=$VEP_PATH:$PERL5LIB
export PATH=$VEP_PATH/htslib:$PATH

Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:

rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh38.tar.gz $VEP_DATA
rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/mus_musculus_vep_86_GRCm38.tar.gz $VEP_DATA
cat $VEP_DATA/*_vep_86_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA
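
Before moving on, it may help to confirm that the tarballs unpacked where later steps expect them. This is a minimal sketch, assuming the caches unpack into `species/version` subfolders such as `homo_sapiens/86_GRCh37` (the layout implied by the FASTA path used in the test command further down). The helper name `check_cache_dirs` is made up for this example.

```shell
# Check that each expected cache folder exists under the given cache root.
# $1 = cache root (e.g. $VEP_DATA); remaining arguments = species/version subfolders.
# Prints one line per subfolder and returns non-zero if any is missing.
check_cache_dirs() {
    root=$1
    shift
    rc=0
    for sub in "$@"; do
        if [ -d "$root/$sub" ]; then
            echo "found: $sub"
        else
            echo "missing: $sub"
            rc=1
        fi
    done
    return $rc
}

# Usage:
# check_cache_dirs "$VEP_DATA" homo_sapiens/86_GRCh37 homo_sapiens/86_GRCh38 mus_musculus/86_GRCm38
```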

Install the Ensembl API and the reference FASTAs for GRCh37/GRCh38/GRCm38:

perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
perl INSTALL.pl --AUTO af --SPECIES mus_musculus --ASSEMBLY GRCm38 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

Convert the offline cache for use with tabix, which significantly speeds up the lookup of known variants:

perl convert_cache.pl --species homo_sapiens --version 86_GRCh37 --dir $VEP_DATA
perl convert_cache.pl --species homo_sapiens --version 86_GRCh38 --dir $VEP_DATA
perl convert_cache.pl --species mus_musculus --version 86_GRCm38 --dir $VEP_DATA

Download and build samtools and bcftools, which we'll need for steps below, and when running vcf2maf/maf2maf:

mkdir $VEP_PATH/samtools && cd $VEP_PATH/samtools
curl -LOOO https://github.com/samtools/{samtools/releases/download/1.3.1/samtools-1.3.1,bcftools/releases/download/1.3.1/bcftools-1.3.1,htslib/releases/download/1.3.2/htslib-1.3.2}.tar.bz2
cat *tar.bz2 | tar -ijxf -
cd htslib-1.3.2 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd samtools-1.3.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd bcftools-1.3.1 && make && make prefix=$VEP_PATH/samtools install && cd ..
cd ..

Download the liftOver binary into the same bin folder as samtools and bcftools, and make it executable:

curl -L http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver > $VEP_PATH/samtools/bin/liftOver
chmod a+x $VEP_PATH/samtools/bin/liftOver

Set $PATH to find all those tools, and also add this line to your ~/.bashrc to make it persistent. Be sure to edit the path below if you installed VEP somewhere other than $HOME/vep:

export PATH=$HOME/vep/samtools/bin:$PATH
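
A quick way to confirm that $PATH now resolves the tools is sketched below. The helper name `check_tools` is made up for this example; it relies only on the shell built-in `command -v`.

```shell
# Report which of the named tools cannot be found on $PATH.
# Prints one "missing: NAME" line per absent tool; returns non-zero if any are missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}

# Usage, for the tools installed in the steps above:
# check_tools samtools bcftools tabix liftOver
```

No output and a zero exit status mean every listed tool was found.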

Download the ExAC r0.3.1 VCF with germline variants called across thousands of normal samples excluding TCGA:

curl -L ftp://ftp.broadinstitute.org:/pub/ExAC_release/release0.3.1/subsets/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz > $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

We'll make some fixes to this VCF, so it's easier to work with:

  • Add a header line for AC_Adj0_Filter, so that bcftools won't complain about it
  • Remove header lines describing FORMAT, because that data is not in the VCF
  • Remove all INFO fields except the allele counts/numbers, to reduce file size
  • Remove calls listed in known_somatic_sites.bed, which are likely somatic events related to clonal hematopoiesis

echo "##FILTER=<ID=AC_Adj0_Filter,Description=\"Only low quality genotype calls containing alternate alleles are present\">" > header_line.tmp
curl -LO https://raw.githubusercontent.com/mskcc/vcf2maf/v1.6.14/data/known_somatic_sites.bed
bcftools annotate --header-lines header_line.tmp --remove FMT,^INF/AF,INF/AC,INF/AN,INF/AC_Adj,INF/AN_Adj,INF/AC_AFR,INF/AC_AMR,INF/AC_EAS,INF/AC_FIN,INF/AC_NFE,INF/AC_OTH,INF/AC_SAS,INF/AN_AFR,INF/AN_AMR,INF/AN_EAS,INF/AN_FIN,INF/AN_NFE,INF/AN_OTH,INF/AN_SAS $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz | bcftools filter --targets-file ^known_somatic_sites.bed --output-type z --output $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz

Replace the original to save space, and tabix index for efficient lookup by VEP:

mv -f $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.fixed.vcf.gz $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
tabix -p vcf $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
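
To check that the header fixes took effect, the header can be piped through a small filter. This is a sketch only: the function name `check_exac_header` is made up, and it tests just the two header-related fixes listed above (the AC_Adj0_Filter line is present, and no FORMAT lines remain).

```shell
# Read a VCF header on stdin and check the two header fixes described above:
# the AC_Adj0_Filter line must be present, and no FORMAT lines may remain.
check_exac_header() {
    header=$(cat)
    if ! printf '%s\n' "$header" | grep -q '^##FILTER=<ID=AC_Adj0_Filter,'; then
        echo "AC_Adj0_Filter header line is missing"
        return 1
    fi
    if printf '%s\n' "$header" | grep -q '^##FORMAT='; then
        echo "FORMAT header lines are still present"
        return 1
    fi
    echo "header looks as expected"
}

# Usage (bcftools view -h prints only the header):
# bcftools view -h $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz | check_exac_header
```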

Test running VEP in offline mode with ExAC custom annotation, on the provided sample GRCh37 VCF:

perl variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/86_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.vcf --polyphen b --gmaf --maf_1kg --maf_esp --regulatory --custom $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz,ExAC,vcf,exact,1,AC,AN
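
One simple sanity check on the result is to count how many records received a CSQ annotation, which is the INFO key VEP writes in --vcf mode. The sketch below assumes a plain-text, tab-delimited VCF; the function name `count_csq` is made up for this example.

```shell
# Count data records in a VEP-annotated VCF, and how many carry a CSQ annotation.
count_csq() {
    awk -F'\t' '
        /^#/ { next }                        # skip header lines
        { total++ }
        $8 ~ /(^|;)CSQ=/ { annotated++ }     # INFO is the 8th VCF column
        END { printf "%d of %d records have CSQ\n", annotated, total }
    ' "$1"
}

# Usage:
# count_csq example_GRCh37.vep.vcf
```
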
@covingto

Hey @ckandoth,
I think you should point out that the ExAC VCF you are downloading is for GRCh37, while the instructions install both GRCh37 and GRCh38. The file name does not make this obvious; you have to check the header.
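
One way to do that header check is sketched below. It assumes the VCF header carries a `##reference` line and/or `##contig` lines, which many but not all VCFs do; the function name `vcf_reference_lines` is made up for this example.

```shell
# Print the ##reference line and the first ##contig line of a gzipped VCF header.
vcf_reference_lines() {
    gzip -dc "$1" | awk '
        /^##reference=/ { print }
        /^##contig=/ && !seen_contig { print; seen_contig = 1 }
        /^#CHROM/ { exit }    # stop at the column header; records are not needed
    '
}

# Usage:
# vcf_reference_lines $VEP_DATA/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
```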

@leiendeckerlu

@covingto do you know whether a GRCh38 version is available?

@jp3117

jp3117 commented Aug 22, 2017

Hi @ckandoth,

We are on a private network, so connections to ftp://ftp.ensembl.org/pub/release-$API_VERSION/variation/VEP and similar addresses won't work properly. Is there an alternative way to install "the Ensembl API, the reference FASTAs for GRCh37"? We are able to download the packages.

Thanks.

@AteeqMKhaliq

Hi,
While downloading and unpacking VEP's offline cache for GRCh37, I used the following command, but the file never gets downloaded. I left it running for almost 22 hours, but nothing happened. Please have a look.

[root@localhost vep]# rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz $VEP_DATA
homo_sapiens_vep_86_GRCh37.tar.gz
Thanks,
Ateeq

@pieterlukasse

pieterlukasse commented Feb 14, 2018

VEP's cache FTP seems to have moved

from 
ftp://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz
to
ftp://ftp.ensembl.org/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz

@yiwenhe

yiwenhe commented Mar 23, 2018

Hi, do you have a page for installation of the latest VEP version v91? Thank you.

@JingCows

Help with VEP command:
$ vep -database -o sequence_analysis_project_effect.txt -species bos_taurus -i sequence_analysis_project.vcf.txt

I'm running the above command. It gives no error message, but it's not producing anything in the output file. I've interrupted the command a few times after about 10 minutes. My data is about 330MB. Kindly suggest a fix, if any. Thank you.
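
A possible first debugging step, offered as a sketch rather than a confirmed fix: run VEP on a small subset of the input to see whether it produces output at all. The `-database` option queries Ensembl's servers over the network, which is generally much slower than the offline cache used in this gist. The function name `vcf_head` and the output file name `subset.vcf` are made up for this example.

```shell
# Print the VCF header plus the first N data records.
# $1 = number of records to keep, $2 = input VCF (plain text)
vcf_head() {
    awk -v n="$1" '
        /^#/ { print; next }     # always keep header lines
        seen >= n { exit }       # stop once N records have been printed
        { print; seen++ }
    ' "$2"
}

# Usage:
# vcf_head 1000 sequence_analysis_project.vcf.txt > subset.vcf
```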

ghost commented Aug 8, 2018

Hi, when I run the test it shows me this error. For the installation I followed all the steps that are in this tutorial.

-------------------- EXCEPTION --------------------
MSG: ERROR: Specified FASTA file/directory /home/usuariohi/.vep/homo_sapiens/86_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz not found
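
One possibility, which is an assumption and not something confirmed in this thread, is that the installer left the FASTA under a slightly different name, for example uncompressed as `.fa`. A sketch for listing what is actually in the cache folder, so that `--fasta` can be pointed at the right file (the function name `list_fasta` is made up):

```shell
# List FASTA files (.fa or .fa.gz) directly inside a cache folder, sorted by name.
# Index files such as .fai or .gzi do not match these patterns and are skipped.
list_fasta() {
    find "$1" -maxdepth 1 -type f \( -name '*.fa' -o -name '*.fa.gz' \) | sort
}

# Usage:
# list_fasta $VEP_DATA/homo_sapiens/86_GRCh37
```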

@dullahan8

When I try to use rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz $VEP_DATA, I get the following error:

opening tcp connection to ftp.ensembl.org port 873
sending daemon args: --server --sender -vvlogDtpre.iLsfxC --new-compress . ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz  (6 args)
receiving incremental file list
rsync: on remote machine: --new-compress: unknown option
rsync error: requested action not supported (code 4) at clientserver.c(849) [sender=3.0.9]
rsync: read error: Connection reset by peer (104)
rsync error: error in socket IO (code 10) at io.c(785) [Receiver=3.1.2]

Looks like --new-compress was a problem so I removed -z from rsync and replaced it with --old-compress and the download worked.
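
That workaround can be wrapped so the original command is tried first. This is a sketch; the function name `fetch_with_fallback` is made up, and it assumes a local rsync recent enough to accept `--old-compress`, as in the log above.

```shell
# Try the gist's original rsync options first; if that fails, retry with
# --old-compress in place of -z, as described in the comment above.
# $1 = rsync source URL, $2 = destination folder
fetch_with_fallback() {
    rsync -zvh "$1" "$2" || rsync --old-compress -vh "$1" "$2"
}

# Usage:
# fetch_with_fallback rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz $VEP_DATA
```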

@HaseebYounis1

I am trying to run this command:
perl INSTALL.pl --AUTO af --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA

But getting this error:
Can't locate Archive/Extract.pm in @INC (you may need to install the Archive::Extract module) (@INC contains: /home/brl/vep /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at INSTALL.pl line 46.
BEGIN failed--compilation aborted at INSTALL.pl line 46.
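
The @INC paths in this error point at the system Perl 5.26 rather than the separately installed Perl that the gist recommends. Whichever Perl is in use, a sketch like the following can show which modules it is able to load before INSTALL.pl is run. The function name `check_perl_modules` is made up, and the module list in the usage line covers only the one named in the error.

```shell
# Report which of the named Perl modules the current `perl` can load.
# Prints "ok: NAME" or "missing: NAME" per module; returns non-zero if any are missing.
check_perl_modules() {
    rc=0
    for mod in "$@"; do
        if perl -M"$mod" -e 1 >/dev/null 2>&1; then
            echo "ok: $mod"
        else
            echo "missing: $mod"
            rc=1
        fi
    done
    return $rc
}

# Usage:
# check_perl_modules Archive::Extract
```

A missing module can then be installed with whatever CPAN client is available, for example `cpanm Archive::Extract` if cpanminus is set up.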
