Skip to content

Instantly share code, notes, and snippets.

@darencard
Created October 10, 2018 20:47
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save darencard/d3ff1399f1fa05b35429eb148383aea4 to your computer and use it in GitHub Desktop.
Save darencard/d3ff1399f1fa05b35429eb148383aea4 to your computer and use it in GitHub Desktop.
Notes on installing and running Provean

Notes from work installing and running Provean to predict protein impact of variants. Provean input files were produced based on VEP output using commands below. Some trial runs were completed using a computer to understand how quickly Provean can be run in parallel to work through all annotated genes.

# running PROVEAN

# installation & dependencies
# 1. checked that blast was installed and also reinstalled cd-hit to avoid issue with certain version
# 2. installed the NCBI nr database
sudo mkdir /opt/ncbi_blast_nr_db_2018-01-29
sudo chmod 775 /opt/ncbi_blast_nr_db_2018-01-29
sudo chown administrator:administrator /opt/ncbi_blast_nr_db_2018-01-29
cd /opt/ncbi_blast_nr_db_2018-01-29
update_blastdb --passive --decompress --timeout 300 --force --verbose nr
# 3. install provean
tar xvf provean-1.1.5.tar.gz
cd provean-1.1.5
./configure BLAST_DB=/opt/ncbi_blast_nr_db_2018-01-29/nr
make
sudo make install

# prepare PROVEAN inputs using protein sequence fasta and the results of a VEP analysis
# create bed summary of output that we can easily do random searches from
cat Boco_wgs_BZ_HN_joint_variants_VEP_effects | \
awk '{ if ($11 ~ "/") print $0 }' | cut -f 5,10,11 | \
awk -v OFS="\t" '{ print $1, $2+0, $2+1, $3 }' | \
sort -k1,1 -k2,2n | bgzip -c \
> Boco_wgs_BZ_HN_joint_variants_VEP_effects.coding.bed.gz

# index this bed and the protein fasta
tabix -p bed Boco_wgs_BZ_HN_joint_variants_VEP_effects.coding.bed.gz
samtools faidx Bcon_rnd3.all.maker.proteins.fasta

# pull out protein variants and format properly, and gather protein fasta sequences
# variants with an X (any amino acid) are excluded
cat Boco_wgs_BZ_HN_joint_variants_VEP_effects | \
awk '{ if ($11 ~ "/") print $0 }' | cut -f5 | sort | uniq | \
while read protein; \
do \
echo ${protein}; \
tabix Boco_wgs_BZ_HN_joint_variants_VEP_effects.coding.bed.gz ${protein} | \
awk -v OFS="\t" '{ if ($4 !~ "X") print $0 }' | \
awk -F "\t|/" -v OFS="," '{ print $2, $4, $5 }' | sed 's/-/./g' \
> input_files_per_protein/${protein}.var.txt; \
samtools faidx Bcon_rnd3.all.maker.proteins.fasta ${protein} \
> input_files_per_protein/${protein}.seq.fasta; \
done

# I am able to run PROVEAN just fine, but it takes N minutes on 1 core for a protein 
# of average length, mostly due to the BLAST. Given we need to run almost 15k proteins,
# we need to parallelize this a lot. I plan to run 4 sets of jobs in parallel on 
# moonunits 0-2 and javelina (2000 proteins each, running on 10 cores, total run time
# of 8-10 days). I will also run 1 large set of jobs using jetstream (8000 proteins, 
# running on 43 cores, total run time of 8-9 days). Math assumes 1 core per job and that
# jobs last approximately N minutes on average.
@AlisaGU
Copy link

AlisaGU commented Feb 24, 2021

Hi, I have problems when running the example from PROVEAN.

BLAST 2.10.1, CDHIT 4.8.1, and I got the following segmentation error:
[18:46:17] searching related sequences...
[18:46:41] clustering subject sequences...
path/provean.sh: line 188: 51582 Segmentation fault $COMMAND

Here is the code:

$path_provean/provean -q P04637.fasta -v P04637.var --save_supporting_set P04637.sss --num_threads 5 -d $path_blastdb --psiblast $path_psiblast --cdhit $path_cdhit --blastdbcmd $path_blastdbcmd

Could you give me some tips?
Best regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment