@ckandoth
Created May 30, 2024 19:27
Install Ensembl's VEP v112 with local cache for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene (detailed in its documentation), its CLIA-compliant HGVS variant format, and its Sequence Ontology nomenclature for variant effects.

The official instructions to install VEP have never worked well from the United States because of the flaky network connection to their FTP servers in the UK. So, we will instead use conda to install VEP and its dependencies and then manually download VEP caches and reference genomes using rsync.

If you don't already have conda, download and install it into $HOME/miniconda3:

curl -sL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
bash miniconda.sh -bup $HOME/miniconda3 && rm -f miniconda.sh

Add the following lines to your ~/.bashrc, then log out and log back in so that conda is added to your $PATH:

# Add conda to PATH if found
if [ -f "$HOME/miniconda3/etc/profile.d/conda.sh" ]; then
    . "$HOME/miniconda3/etc/profile.d/conda.sh"
fi

Update conda to the latest version and configure it to use libmamba, a faster dependency solver:

conda update -y -n base -c defaults conda
conda config --set solver libmamba

Create and activate a conda environment with VEP, its dependencies, and other related tools:

conda create -y -n vep && conda activate vep
conda install -y -c conda-forge -c bioconda -c defaults ensembl-vep==112.0 htslib==1.20 bcftools==1.20 samtools==1.20 ucsc-liftover==447
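Before downloading several gigabytes of cache, it can be worth confirming that the environment actually exposes the expected commands. The helper below is a minimal sketch, not part of VEP or conda: `missing_tools` is a hypothetical name, and the command names in the usage comment are the binaries these packages are assumed to provide (`vep`, `bgzip` and `tabix` from htslib, `liftOver` from ucsc-liftover).

```shell
# Print every command from the argument list that is not found on PATH.
# Returns 0 if all were found, 1 if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example, run inside the activated "vep" environment:
#   missing_tools vep bgzip tabix bcftools samtools liftOver
```

No output and a zero exit status mean every listed command was found.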

Download VEP's offline cache for GRCh38, and the reference FASTA:

mkdir -p $HOME/.vep/homo_sapiens/112_GRCh38/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-112/variation/indexed_vep_cache/homo_sapiens_vep_112_GRCh38.tar.gz $HOME/.vep/
tar -zxf $HOME/.vep/homo_sapiens_vep_112_GRCh38.tar.gz -C $HOME/.vep/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-112/fasta/homo_sapiens/dna_index/ $HOME/.vep/homo_sapiens/112_GRCh38/
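Since the connection to the Ensembl servers can drop mid-transfer, one option is to wrap the rsync calls in a small retry loop. This is a sketch under assumptions: `retry` is a hypothetical helper defined here, and the attempt count and delay in the usage comment are arbitrary values to adjust.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds or MAX_TRIES attempts have been made.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example: up to 5 attempts, 30 seconds apart. --partial keeps partially
# transferred files so a later attempt can pick up where the last one stopped.
#   retry 5 30 rsync -av --partial --progress \
#       rsync://ftp.ensembl.org/ensembl/pub/release-112/variation/indexed_vep_cache/homo_sapiens_vep_112_GRCh38.tar.gz \
#       $HOME/.vep/
```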

(Optional) Download VEP's offline cache for GRCh37, and the reference FASTA, which is distributed as plain gzip and must be recompressed with bgzip before it can be indexed:

mkdir -p $HOME/.vep/homo_sapiens/112_GRCh37/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-112/variation/indexed_vep_cache/homo_sapiens_vep_112_GRCh37.tar.gz $HOME/.vep/
tar -zxf $HOME/.vep/homo_sapiens_vep_112_GRCh37.tar.gz -C $HOME/.vep/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/grch37/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz $HOME/.vep/homo_sapiens/112_GRCh37/
gzip -d $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
bgzip -i $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa
samtools faidx $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
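To check whether a given .fa.gz is already bgzip-compressed, you can inspect its first bytes. The sketch below relies on the BGZF block header layout (gzip magic, the FEXTRA flag, and a "BC" extra subfield) and assumes that "BC" is the first extra subfield, as in files written by bgzip. `is_bgzf` is a hypothetical helper name, not an htslib command.

```shell
# Return 0 if FILE starts with a BGZF block header, 1 otherwise.
is_bgzf() {
    # First 14 bytes as lowercase hex with whitespace removed (28 characters)
    header=$(od -An -tx1 -N14 "$1" | tr -d ' \n')
    case "$header" in
        # bytes 0-3: 1f 8b 08 04 (gzip magic, deflate, FEXTRA flag set)
        # bytes 4-11: mtime, xfl, os, xlen (ignored here)
        # bytes 12-13: 42 43 ("BC" subfield identifier)
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   is_bgzf $HOME/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz \
#       && echo "bgzip-compressed" || echo "plain gzip or not compressed"
```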

Test running VEP in offline mode on a GRCh38 VCF:

curl -sLO https://raw.githubusercontent.com/Ensembl/ensembl-vep/release/112/examples/homo_sapiens_GRCh38.vcf
vep --species homo_sapiens --assembly GRCh38 --offline --no_progress --no_stats \
    --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype \
    --canonical --protein --biotype --tsl --pubmed --variant_class --shift_hgvs 1 \
    --check_existing --total_length --allele_number --no_escape --xref_refseq \
    --failed 1 --vcf --minimal --flag_pick_allele \
    --pick_order canonical,tsl,biotype,rank,ccds,length \
    --dir $HOME/.vep \
    --fasta $HOME/.vep/homo_sapiens/112_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
    --input_file homo_sapiens_GRCh38.vcf \
    --output_file homo_sapiens_GRCh38.vep.vcf \
    --polyphen b --af --af_1kg --regulatory
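A quick sanity check on the result is to list the annotation subfields that VEP declared in the output header. With --vcf, VEP writes a CSQ INFO header line whose description ends in "Format: " followed by a pipe-separated list of field names. The helper below is a rough sketch that parses that line with sed; `csq_fields` is a hypothetical name, not a VEP or bcftools command.

```shell
# Print the CSQ subfield names declared in a VEP-annotated VCF header,
# one per line. Prints nothing if the file has no CSQ header line.
csq_fields() {
    sed -n 's/^##INFO=<ID=CSQ,.*Format: \([^"]*\)".*$/\1/p' "$1" | tr '|' '\n'
}

# Example:
#   csq_fields homo_sapiens_GRCh38.vep.vcf | head
```

If the list comes back empty, the run most likely did not write annotations, and the VEP warnings file next to the output is a good place to look first.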