taylorreiter/kaijudb_builds.md

## kaijudb_builds.md

      
Download Kaiju software
cd ~
git clone https://github.com/bioinformatics-centre/kaiju.git
cd kaiju/src
make
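
A quick check that the build succeeded can save a failed database build later. The sketch below assumes `make` places the `kaiju` binary in `~/kaiju/bin` next to `makeDB.sh`; `check_kaiju_build` is a hypothetical helper name, not part of Kaiju.

```shell
# Hypothetical helper: report whether the expected Kaiju files exist
# and are executable under a bin directory (default: ~/kaiju/bin).
check_kaiju_build() {
    ckb_bin_dir="${1:-$HOME/kaiju/bin}"
    ckb_status=0
    # Assumption: the build produces "kaiju" and the repo ships "makeDB.sh" here.
    for ckb_file in kaiju makeDB.sh; do
        if [ -x "$ckb_bin_dir/$ckb_file" ]; then
            echo "found: $ckb_bin_dir/$ckb_file"
        else
            echo "missing: $ckb_bin_dir/$ckb_file" >&2
            ckb_status=1
        fi
    done
    return "$ckb_status"
}
```

Calling `check_kaiju_build` with no argument checks `~/kaiju/bin`; a non-zero return status means at least one file was not found.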

From the Kaiju GitHub README:
There are several options for creating the reference database with protein sequences from different source databases:

Complete Reference Genomes from NCBI RefSeq

makeDB.sh -r
Download only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database.
Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.
As of October 2016, this database contains ca. 20M protein sequences, which amounts to a requirement of 14GB RAM for running Kaiju.

Representative genomes from proGenomes

makeDB.sh -p
Download the protein sequences belonging to the representative set of genomes from the proGenomes database. This dataset generally covers a broader phylogenetic range compared to the RefSeq dataset, and is therefore recommended, especially for environmental samples.
Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.
As of October 2016, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.

Non-redundant protein database nr

makeDB.sh -n
Download the nr database that is used by NCBI BLAST and extract proteins belonging to Archaea, Bacteria and Viruses.
makeDB.sh -e
Download the nr database as above, but additionally include proteins from fungi and microbial eukaryotes. The complete taxon list for this option is in the file bin/taxonlist.tsv.
Because the nr database contains more proteins, more RAM is needed for index construction and for running Kaiju. As of October 2016, the nr database with option -e contains ca. 80M protein sequences, which amounts to a requirement of 43GB RAM for running Kaiju.
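
The RAM figures quoted above can be turned into a simple pre-flight check before choosing a database. This is only a sketch: the numbers are the October 2016 values from the text (they are likely larger today and exclude the `-v` viral addition), no figure is given for `-n`, and the function names are made up for illustration.

```shell
# RAM needed (GB) per makeDB.sh option, using the October 2016 figures quoted above.
# -n is left out because the text gives no number for it.
kaiju_ram_needed_gb() {
    case "$1" in
        -r) echo 14 ;;
        -p) echo 13 ;;
        -e) echo 43 ;;
        *)  return 1 ;;
    esac
}

# Return 0 if the machine's RAM (in whole GB) covers the quoted requirement,
# 1 if it does not, and 2 if the option has no quoted figure.
kaiju_db_fits() {
    kdf_needed=$(kaiju_ram_needed_gb "$1") || return 2
    [ "$2" -ge "$kdf_needed" ]
}
```

For example, `kaiju_db_fits -e 32` returns a failure status because 32 GB is below the 43 GB quoted for the `-e` database, while `kaiju_db_fits -p 16` succeeds.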

Build kaiju database -e

cd ~
mkdir kaijudb_e
cd kaijudb_e
~/kaiju/bin/makeDB.sh -e

Remove scratch files from build process
cd ~/kaijudb_e/
rm -rf genomes/ kaiju_db_nr_euk.bwt kaiju_db_nr_euk.sa kaiju_db_nr_euk.faa taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz

Compress kaijudb_e (the compressed file was uploaded to Google Docs and then deleted from the instance; the uncompressed directory was kept)
cd ~
tar -zcvf kaijudb_e.tar.gz kaijudb_e/
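
Because the archive is deleted from the instance after upload, it may be worth confirming it is readable first. A minimal sketch, using only standard `gzip` and `tar` options; `verify_tarball` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the .tar.gz passes a gzip integrity
# test and its file listing can be read end to end.
verify_tarball() {
    vt_archive="$1"
    gzip -t "$vt_archive" 2>/dev/null && tar -tzf "$vt_archive" > /dev/null 2>&1
}
```

Running `verify_tarball kaijudb_e.tar.gz && echo "archive OK"` before uploading gives a basic sanity check; it does not compare the archive contents against the source directory.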

Build kaiju db -p

cd ~
mkdir kaijudb_p
cd kaijudb_p
~/kaiju/bin/makeDB.sh -p

Remove scratch files from build process
cd ~/kaijudb_p/
rm -rf genomes/ kaiju_db.bwt kaiju_db.sa kaiju_db.faa taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz
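
The two cleanup steps differ only in the directory and the index file prefix (`kaiju_db_nr_euk` for `-e`, `kaiju_db` for `-p`), so they could be wrapped in one function. This is a sketch with a hypothetical name; it assumes the files Kaiju needs at run time (the `.fmi` index and the `nodes.dmp`/`names.dmp` taxonomy files) are not in the removal list, which matches the commands above.

```shell
# Hypothetical helper: remove build scratch files from a Kaiju database directory.
#   $1 = database directory, $2 = index file prefix (e.g. kaiju_db or kaiju_db_nr_euk)
clean_kaiju_scratch() {
    cks_dir="$1"
    cks_prefix="$2"
    (
        # Work in a subshell so the caller's working directory is unchanged.
        cd "$cks_dir" || exit 1
        rm -rf genomes/ "$cks_prefix.bwt" "$cks_prefix.sa" "$cks_prefix.faa" \
            taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz
        # List what is left so the remaining files can be checked by eye.
        ls
    )
}
```

For the builds above this would be called as `clean_kaiju_scratch ~/kaijudb_e kaiju_db_nr_euk` or `clean_kaiju_scratch ~/kaijudb_p kaiju_db`.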

Build kaiju db -n

cd ~
mkdir kaijudb_n
cd kaijudb_n
~/kaiju/bin/makeDB.sh -n

Build kaiju db -r

cd ~
mkdir kaijudb_r
cd kaijudb_r
~/kaiju/bin/makeDB.sh -r
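
All four builds follow the same pattern: make a directory named after the option, change into it, and run `makeDB.sh` with that option. A sketch of a wrapper is below; `build_kaiju_db` and the `MAKEDB` variable are hypothetical names introduced here, and the script path is the one used in the commands above.

```shell
# Path to makeDB.sh as used above; can be overridden by setting MAKEDB beforehand.
MAKEDB="${MAKEDB:-$HOME/kaiju/bin/makeDB.sh}"

# Hypothetical wrapper: build one database in <base>/kaijudb_<letter>.
#   $1 = makeDB.sh option (-r, -p, -n or -e), $2 = base directory (default: $HOME)
build_kaiju_db() {
    bkd_opt="$1"
    bkd_base="${2:-$HOME}"
    case "$bkd_opt" in
        -r|-p|-n|-e) ;;
        *) echo "unknown option: $bkd_opt" >&2; return 1 ;;
    esac
    # ${bkd_opt#-} strips the leading dash, e.g. -p becomes p.
    bkd_dir="$bkd_base/kaijudb_${bkd_opt#-}"
    mkdir -p "$bkd_dir" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$bkd_dir" && "$MAKEDB" "$bkd_opt" )
}
```

With this in place, `build_kaiju_db -n` would be equivalent to the `-n` steps above. Each build downloads a large amount of data, so running the options one at a time is probably safer than looping over all four.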