Download and compile the Kaiju software
cd ~
git clone https://github.com/bioinformatics-centre/kaiju.git
cd kaiju/src
make
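Before building any database, it can help to confirm that the compile step produced its executables. This is a minimal sketch, assuming the build places them in ~/kaiju/bin (the directory the makeDB.sh calls in these notes use). The helper name check_kaiju_bin and the program names in the example call are illustrative assumptions, not part of Kaiju.

```shell
# Report whether each named program exists and is executable in a directory.
# check_kaiju_bin is a hypothetical helper, not part of Kaiju.
check_kaiju_bin() {
    bindir="$1"
    shift
    missing=0
    for prog in "$@"; do
        if [ -x "$bindir/$prog" ]; then
            echo "ok: $bindir/$prog"
        else
            echo "missing: $bindir/$prog" >&2
            missing=1
        fi
    done
    # Exit status 0 if everything was found, 1 otherwise.
    return "$missing"
}

# Example call (program names assumed from these notes):
# check_kaiju_bin ~/kaiju/bin kaiju makeDB.sh
```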
From the Kaiju GitHub README:
There are several options for creating the reference database with protein sequences from different source databases:
- Complete Reference Genomes from NCBI RefSeq
makeDB.sh -r: Download only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database.
Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.
As of October 2016, this database contains ca. 20M protein sequences, which amounts to a requirement of 14GB RAM for running Kaiju.
- Representative genomes from proGenomes
makeDB.sh -p: Download the protein sequences belonging to the representative set of genomes from the proGenomes database. This dataset generally covers a broader phylogenetic range compared to the RefSeq dataset, and is therefore recommended, especially for environmental samples.
Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.
As of October 2016, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.
- Non-redundant protein database nr
makeDB.sh -n: Download the nr database that is used by NCBI BLAST and extract proteins belonging to Archaea, Bacteria and Viruses.
makeDB.sh -e: Download the nr database as above, but additionally include proteins from fungi and microbial eukaryotes. The complete taxon list for this option is in the file bin/taxonlist.tsv.
Because the nr database contains more proteins, more RAM is needed for index construction and for running Kaiju. As of October 2016, the nr database with option -e contains ca. 80M protein sequences, which amounts to a requirement of 43GB RAM for running Kaiju.
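The RAM figures quoted above can be turned into a quick pre-flight check before choosing a database. This is a rough sketch only: it hard-codes the October 2016 numbers from the text, treats -n as unknown because no figure is quoted for it, and uses hypothetical function names (kaiju_db_ram_gb, kaiju_db_fits). Current databases are likely larger.

```shell
# RAM needed to run Kaiju, in GB, per makeDB.sh option, using the
# October 2016 figures quoted above. No figure is quoted for -n, so
# any other option is reported as unknown (exit status 1).
kaiju_db_ram_gb() {
    case "$1" in
        r) echo 14 ;;
        p) echo 13 ;;
        e) echo 43 ;;
        *) return 1 ;;
    esac
}

# Succeeds if a machine with the given RAM (whole GB) meets the quoted
# requirement; returns 1 if it does not, 2 if the option has no figure.
kaiju_db_fits() {
    need=$(kaiju_db_ram_gb "$1") || return 2
    [ "$2" -ge "$need" ]
}

# Example (Linux only; assumes /proc/meminfo reports MemTotal in kB):
# ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
# for opt in r p e; do kaiju_db_fits "$opt" "$ram_gb" && echo "-$opt fits"; done
```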
Build the nr database including fungi and microbial eukaryotes (option -e)
cd ~
mkdir kaijudb_e
cd kaijudb_e
~/kaiju/bin/makeDB.sh -e
Remove scratch files from build process
cd ~/kaijudb_e/
rm -rf genomes/ kaiju_db_nr_euk.bwt kaiju_db_nr_euk.sa kaiju_db_nr_euk.faa taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz
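Because rm -rf cannot be undone, one option is to delete the scratch files only after confirming that the finished index is present. This is a sketch under assumptions: remove_scratch is a hypothetical helper, and the index file name used in the example (kaiju_db_nr_euk.fmi) is an assumption about what the build leaves behind, so check the directory listing first.

```shell
# Remove build leftovers from a database directory, but only if the file
# named as the finished index exists and is non-empty.
# remove_scratch is a hypothetical helper; the index file name passed in
# is the caller's assumption about what the build produces.
remove_scratch() {
    dbdir="$1"
    index="$2"
    shift 2
    if [ ! -s "$dbdir/$index" ]; then
        echo "refusing to clean $dbdir: $index missing or empty" >&2
        return 1
    fi
    for f in "$@"; do
        # Skip empty names so the database directory itself is never removed.
        [ -n "$f" ] || continue
        rm -rf -- "${dbdir:?}/$f"
    done
}

# Example, run from inside the database directory (index name assumed):
# remove_scratch . kaiju_db_nr_euk.fmi genomes/ kaiju_db_nr_euk.bwt \
#     kaiju_db_nr_euk.sa kaiju_db_nr_euk.faa taxdump.tar.gz nr.gz \
#     prot.accession2taxid prot.accession2taxid.gz
```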
Compress kaijudb_e (the compressed file was uploaded to Google Docs and then deleted from the instance; the original database was kept)
cd ~
tar -zcvf kaijudb_e.tar.gz kaijudb_e/
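Since the archive is uploaded and the local copy is then deleted, it may be worth confirming that the archive can be read back first. This is a minimal sketch using only standard tar options; archive_dir is a hypothetical helper name.

```shell
# Create a gzip-compressed tar archive of a directory and confirm the
# archive can be listed before the copy is uploaded anywhere.
# archive_dir is a hypothetical helper name.
archive_dir() {
    parent=$(dirname "$1")
    name=$(basename "$1")
    out="$2"
    # -C makes the stored paths relative to the parent directory.
    tar -czf "$out" -C "$parent" "$name" || return 1
    # Listing decompresses the whole archive, so a truncated file fails here.
    tar -tzf "$out" > /dev/null
}

# Example: archive_dir ~/kaijudb_e ~/kaijudb_e.tar.gz
```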
Build the proGenomes database (option -p)
cd ~
mkdir kaijudb_p
cd kaijudb_p
~/kaiju/bin/makeDB.sh -p
Remove scratch files from build process
cd ~/kaijudb_p/
rm -rf genomes/ kaiju_db.bwt kaiju_db.sa kaiju_db.faa taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz
Build the nr database (option -n)
cd ~
mkdir kaijudb_n
cd kaijudb_n
~/kaiju/bin/makeDB.sh -n
Build the RefSeq database (option -r)
cd ~
mkdir kaijudb_r
cd kaijudb_r
~/kaiju/bin/makeDB.sh -r
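The four builds follow the same pattern (make a directory, enter it, run makeDB.sh with one option), so they could be wrapped in a small helper. This is a sketch, assuming makeDB.sh writes its output into the current working directory as in the steps above. build_db, MAKEDB and DB_ROOT are hypothetical names introduced here.

```shell
# Location of makeDB.sh as used in these notes; override MAKEDB to test.
MAKEDB="${MAKEDB:-$HOME/kaiju/bin/makeDB.sh}"
# Parent directory for the kaijudb_* directories (the home directory here).
DB_ROOT="${DB_ROOT:-$HOME}"

# Build one database in its own directory, e.g. build_db p uses
# $DB_ROOT/kaijudb_p. build_db is a hypothetical helper, not part of Kaiju.
build_db() {
    opt="$1"
    dir="$DB_ROOT/kaijudb_$opt"
    mkdir -p "$dir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && "$MAKEDB" "-$opt" )
}

# Example: build all four, stopping at the first failure.
# for opt in e p n r; do build_db "$opt" || break; done
```

Running each build in a subshell avoids the need to cd back to the home directory between databases.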