Skip to content

Instantly share code, notes, and snippets.

@darencard
Created October 10, 2018 20:47
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save darencard/d3ff1399f1fa05b35429eb148383aea4 to your computer and use it in GitHub Desktop.
Save darencard/d3ff1399f1fa05b35429eb148383aea4 to your computer and use it in GitHub Desktop.
Notes on installing and running Provean

Notes from work installing and running Provean to predict protein impact of variants. Provean input files were produced based on VEP output using commands below. Some trial runs were completed using a computer to understand how quickly Provean can be run in parallel to work through all annotated genes.

# running PROVEAN

# installation & dependencies
# 1. checked that blast was installed and also reinstalled cd-hit to avoid issue with certain version
# 2. installed the NCBI nr database
sudo mkdir /opt/ncbi_blast_nr_db_2018-01-29
sudo chmod 775 /opt/ncbi_blast_nr_db_2018-01-29
sudo chown administrator:administrator /opt/ncbi_blast_nr_db_2018-01-29
cd /opt/ncbi_blast_nr_db_2018-01-29
update_blastdb --passive --decompress --timeout 300 --force --verbose nr
# 3. install provean
tar xvf provean-1.1.5.tar.gz
cd provean-1.1.5
./configure BLAST_DB=/opt/ncbi_blast_nr_db_2018-01-29/nr
make
sudo make install

# prepare PROVEAN inputs using protein sequence fasta and the results of a VEP analysis
# create bed summary of output that we can easily do random searches from
cat Boco_wgs_BZ_HN_joint_variants_VEP_effects | \
awk '{ if ($11 ~ "/") print $0 }' | cut -f 5,10,11 | \
awk -v OFS="\t" '{ print $1, $2+0, $2+1, $3 }' | \
sort -k1,1 -k2,2n | bgzip -c \
> Boco_wgs_BZ_HN_joint_variants_VEP_effects.coding.bed.gz

# index this bed and the protein fasta
tabix -p bed Boco_wgs_BZ_HN_joint_variants_VEP_effects.coding.bed.gz
samtools faidx Bcon_rnd3.all.maker.proteins.fasta

# pull out protein variants and format properly, and gather protein fasta sequences
# variants with an X (any amino acid) are excluded
cat Boco_wgs_BZ_HN_joint_variants_VEP_effects | \
awk '{ if ($11 ~ "/") print $0 }' | cut -f5 | sort | uniq | \
while read protein; \
do \
echo ${protein}; \
tabix Boco_wgs_BZ_HN_joint_variants_VEP_effects.coding.bed.gz ${protein} | \
awk -v OFS="\t" '{ if ($4 !~ "X") print $0 }' | \
awk -F "\t|/" -v OFS="," '{ print $2, $4, $5 }' | sed 's/-/./g' \
> input_files_per_protein/${protein}.var.txt; \
samtools faidx Bcon_rnd3.all.maker.proteins.fasta ${protein} \
> input_files_per_protein/${protein}.seq.fasta; \
done

# I am able to run PROVEAN just fine, but it takes N minutes on 1 core for a protein 
# of average length, mostly due to the BLAST. Given we need to run almost 15k proteins,
# we need to parallelize this a lot. I plan to run 4 sets of jobs in parallel on 
# moonunits 0-2 and javelina (2000 proteins each, running on 10 cores, total run time
# of 8-10 days). I will also run 1 large set of jobs using jetstream (8000 proteins, 
# running on 43 cores, total run time of 8-9 days). Math assumes 1 core per job and that
# jobs last approximately N minutes on average.
@abofinal
Copy link

abofinal commented Jun 3, 2020

Hello, Dr. Card. I'm Ramón Pouso, a PhD student (working with Drosophila melanogaster at University of Vigo) trying to use PROVEAN to complement the deleteriousness predictions I've obtained using SIFT. I was wondering if, instead of performing an additional VEP analysis, I could use SIFT's output to create PROVEAN's input. For example:

Format: SIFTINFO=Allele|Transcript|GeneId|GeneName|Region|VariantType|Ref_Amino_Acid/Alt_AminoAcid|Amino_position|SIFT_score|SIFT_median|SIFT_num_seqs|Allele_Type|SIFT_prediction">

Example:
SIFTINFO=A|FBtr0332271|FBgn0031306|CG4577|CDS|NONSYNONYMOUS|E/K|339|0.000|3.42|9|novel|DELETERIOUS

In your script, you get three fields from VEP's results: 'Feature ID', 'Position in protein', and 'Amino acid change'.

My question is, is it your understanding that I could create a similar PROVEAN input using the SIFT fields 'GeneName', 'Amino_position', and 'Ref_Amino_Acid/Alt_AminoAcid' from SIFT? I've already got the protein FASTA sequences for the release of D. melanogaster I'm using.

Thank you very much in advance, I know this is kind of a niche question, but any input from your point of view would be greatly appreciated!
~Ramón

@darencard
Copy link
Author

Hi Ramón. It looks like you could make this work given what you are telling me, but I've never worked with SIFT so I cannot comment further. But using Awk like I did, you can probably parse the fields accordingly as PROVEAN input.

@AlisaGU
Copy link

AlisaGU commented Feb 24, 2021

Hi, I have problems when running the example from PROVEAN.

BLAST 2.10.1, CDHIT 4.8.1, and I got the following segmentation error:
[18:46:17] searching related sequences...
[18:46:41] clustering subject sequences...
path/provean.sh: line 188: 51582 Segmentation fault $COMMAND

Here is the code:

$path_provean/provean -q P04637.fasta -v P04637.var --save_supporting_set P04637.sss --num_threads 5 -d $path_blastdb --psiblast $path_psiblast --cdhit $path_cdhit --blastdbcmd $path_blastdbcmd

Could you give me some tips?
Best regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment