IsmailM/genevalidator_blast_db.md

## genevalidator_blast_db.md

      
    Raw
  

              genevalidator_blast_db.md
            
          
    GeneValidator - Setting Up BLAST Databases

GeneValidator requires a protein BLAST database in order to fully analyse all sequences. The GeneValidator package includes the Swissprot BLAST database.
Choosing the right BLAST database is essential in order to obtain the best results with GeneValidator. For example, SwissProt contains high quality, manually curated genes, but is a lot smaller than a larger database such as Non-Redundant database. On the hand, the Non-Redundant database contains lower-quality genes. An alternative would be creating BLAST database from a database of a closely related species.
Download BLAST databases from NCBI

GeneValidator contains a script (ncbi-blast-dbs that downloads pre-formatted BLAST databases from NCBI.
# To see available BLAST databases
genevalidator ncbi-blast-dbs

# To download BLAST database(s)
genevalidator ncbi-blast-dbs nr refseq_protein

Features include:

Database files (volumes) are downloaded in parallel - number of threads to use is determined automatically.
MD5 checksum is verified and the database volume extracted upon download.
Database volumes are not downloaded in a particular order.
The volumes are updated if a newer version is available on the server, or re-downloaded if corrupt.
Aborted downloads are safely resumed.

However, this script requires wget and md5sum to work. For macOS users, md5sum can be installed using Homebrew
brew install md5sha1sum
Manually producing a BLAST db from a FASTA file

It is also possible to create a custom BLAST database from a Fasta file by using the included makeblastdb command from BLAST+. In particular, it is important to note that GeneValidator requires that blast databases are set up with the -parse_seqids argument.
makeblastdb -in input_db -dbtype prot -parse_seqids
For example, if we were to download the swissprot database as a fasta file, we could format it as a BLAST database ourselves as follows.
cd genevalidator/blast_db/


# Delete existing version of SwissProt (which is included in the GeneValidator package)
rm swissprot*

# download the latest version in fasta format - from http://www.uniprot.org/downloads
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

# unzip the fasta file
gunzip uniprot_sprot.fasta.gz

# use the packaged makeblastdb tool to create the BLAST DB
makeblastdb -in uniprot_sprot.fasta -dbtype prot -parse_seqids

Passing BLAST output files to GV

It is possible to pass BLAST output files (XML or tabular files) to GV. This can be particularly useful if BLAST results have already been produced, when BLAST can be run on a cluster, or when BLAST results are also required for other analyses.
GeneValidator supports the XML and tabular BLAST output formats.
# Run BLAST (XML output)
blast(p/x) -db DATABASE_PATH -num_threads NUM_THREADS -outfmt 5 -out BLAST_XML_FILE -query INPUT_FASTA_FILE

# Run GeneValidator
genevalidator -d DATABASE_PATH -n NUM_THREADS -x BLAST_XML_FILE INPUT_FASTA_FILE
This is the same, but using the BLAST tabular output.
# Run BLAST (tabular output)
blast(p/x) -db DATABASE_PATH -num_threads NUM_THREADS -outfmt '7 qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' -out BLAST_TAB_FILE -query INPUT_FASTA_FILE

# Run GeneValidator
genevalidator -n NUM_THREADS -t BLAST_TAB_FILE -o 'qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' INPUT_FASTA_FILE
Using DIAMOND with GV

Diamond is a faster alternative to BLAST and as such can be used to speed up a GeneValidator analysis.
Below is an example illustrating how Diamond output files can be passed to GeneValidator
# Install GV 
sh -c "$(curl -fsSL https://install-genevalidator.wurmlab.com)"

cd genevalidator

# Download SwissProt 
curl -L -o blast_db/uniprot_sprot.fasta.gz ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

gunzip blast_db/uniprot_sprot.fasta.gz

# Requires prior installation of Diamond 
# Make Diamond BLAST database 
diamond makedb --in blast_db/uniprot_sprot.fasta -d blast_db/swissprot

# Run BLASTP with Diamond
diamond blastp --db blast_db/swissprot --outfmt 5 --query exemplar_data/protein_data.fa --out exemplar_data/protein_data.xml

# Extract the FASTA Sequence using the Hit IDs
# Requires prior installation of Samtools
./lib/ruby/bin.real/nokogiri -e 'puts @doc.css("Hit_id").map(&:content)' exemplar_data/protein_data.xml | xargs samtools faidx blast_db/uniprot_sprot.fasta > exemplar_data/protein_data.raw_sequences.fa

genevalidator -r exemplar_data/protein_data.raw_sequences.fa -x exemplar_data/protein_data.xml exemplar_data/protein_data.fa