@taylorreiter
Last active May 21, 2017 22:03

Download and compile Kaiju

cd ~
git clone https://github.com/bioinformatics-centre/kaiju.git
cd kaiju/src
make
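
A quick sanity check after the build, as a minimal sketch. It assumes the build leaves its executables in ~/kaiju/bin (the same directory makeDB.sh is run from further down). The function name check_kaiju_build and the KAIJU_BIN variable are made up for this example.

```shell
# Directory where the build is expected to place executables (assumption).
KAIJU_BIN="${KAIJU_BIN:-$HOME/kaiju/bin}"

# Return 0 only if every named tool exists in KAIJU_BIN and is executable.
check_kaiju_build() {
    for tool in "$@"; do
        if [ ! -x "$KAIJU_BIN/$tool" ]; then
            echo "missing or not executable: $KAIJU_BIN/$tool" >&2
            return 1
        fi
    done
    return 0
}
```

For example, check_kaiju_build makeDB.sh should succeed silently once the clone and make steps have finished.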

From the Kaiju GitHub README:

There are several options for creating the reference database with protein sequences from different source databases:

  1. Complete Reference Genomes from NCBI RefSeq

makeDB.sh -r: Download only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database.

Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.

As of October 2016, this database contains ca. 20M protein sequences, which amounts to a requirement of 14GB RAM for running Kaiju.

  2. Representative genomes from proGenomes

makeDB.sh -p: Download the protein sequences belonging to the representative set of genomes from the proGenomes database. This dataset generally covers a broader phylogenetic range than the RefSeq dataset and is therefore recommended, especially for environmental samples.

Additionally, viral genomes from NCBI RefSeq can be added by using the option -v.

As of October 2016, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.

  3. Non-redundant protein database nr

makeDB.sh -n: Download the nr database that is used by NCBI BLAST and extract proteins belonging to Archaea, Bacteria and Viruses.

makeDB.sh -e: Download the nr database as above, but additionally include proteins from fungi and microbial eukaryotes. The complete taxon list for this option is in the file bin/taxonlist.tsv.

Because the nr database contains more proteins, more RAM is needed for index construction and for running Kaiju. As of October 2016, the nr database with option -e contains ca. 80M protein sequences, which amounts to a requirement of 43GB RAM for running Kaiju.
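
The RAM figures quoted above can be turned into a rough pre-flight check before picking a database. This is only a sketch: the numbers are the October 2016 values from the text (no figure is quoted for -n, so it is left out), the function names are invented here, and reading /proc/meminfo assumes a Linux machine.

```shell
# Approximate RAM (GB) needed to run Kaiju, per the October 2016 figures above.
required_ram_gb() {
    case "$1" in
        -r) echo 14 ;;
        -p) echo 13 ;;
        -e) echo 43 ;;
        *)  echo "no RAM figure quoted for option '$1'" >&2; return 1 ;;
    esac
}

# Total physical memory in whole GB (rounded down), from /proc/meminfo (Linux only).
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Succeed if this machine has at least the quoted amount of RAM for the option.
has_enough_ram() {
    ram_need=$(required_ram_gb "$1") || return 2
    ram_have=$(total_ram_gb)
    echo "option $1 needs about ${ram_need} GB; this machine has ${ram_have} GB"
    [ "$ram_have" -ge "$ram_need" ]
}
```

For example, has_enough_ram -e && ~/kaiju/bin/makeDB.sh -e would only start the nr build on a machine with enough memory. Index construction may need more than the run-time figure, so treat a pass as a lower bound.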

Build kaiju database -e

cd ~
mkdir kaijudb_e
cd kaijudb_e
~/kaiju/bin/makeDB.sh -e

Remove scratch files from build process

cd ~/kaijudb_e/
rm -rf genomes/ kaiju_db_nr_euk.bwt kaiju_db_nr_euk.sa kaiju_db_nr_euk.faa taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz

Compress kaijudb_e (the compressed file was uploaded to google docs and then deleted from the instance; the original directory was kept)

cd ~
tar -zcvf kaijudb_e.tar.gz kaijudb_e/
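
Before deleting the tarball from the instance, it may be worth confirming that the archive is readable and recording a checksum to compare after upload. A minimal sketch, assuming GNU tar, gzip and sha256sum are available; verify_archive is a made-up helper name.

```shell
# Check that a .tar.gz is intact, then print its SHA-256 checksum.
verify_archive() {
    archive="$1"
    # gzip -t tests the compressed stream without extracting it.
    gzip -t "$archive" || return 1
    # Listing the contents confirms the tar structure itself is readable.
    tar -tzf "$archive" > /dev/null || return 1
    sha256sum "$archive"
}
```

For example, verify_archive ~/kaijudb_e.tar.gz prints the checksum only if both checks pass.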

Build kaiju db -p

cd ~
mkdir kaijudb_p
cd kaijudb_p
~/kaiju/bin/makeDB.sh -p

Remove scratch files from build process

cd ~/kaijudb_p/
rm -rf genomes/ kaiju_db.bwt kaiju_db.sa kaiju_db.faa taxdump.tar.gz nr.gz prot.accession2taxid prot.accession2taxid.gz
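
After cleanup, the directory should still hold what Kaiju needs at run time. The sketch below assumes, as I recall from the Kaiju README, that this is the .fmi index plus nodes.dmp and names.dmp; double-check against the README for your version. check_kaiju_db is a made-up helper name.

```shell
# Report whether a database directory still contains the files assumed
# to be required at run time: one *.fmi index, nodes.dmp and names.dmp.
check_kaiju_db() {
    db_dir="$1"
    db_status=0
    for f in nodes.dmp names.dmp; do
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            db_status=1
        fi
    done
    # The index name differs per build option, so match any .fmi file.
    if ! ls "$db_dir"/*.fmi > /dev/null 2>&1; then
        echo "no .fmi index found in $db_dir" >&2
        db_status=1
    fi
    return "$db_status"
}
```

For example, run check_kaiju_db ~/kaijudb_p after the rm step.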

Build kaiju db -n

cd ~
mkdir kaijudb_n
cd kaijudb_n
~/kaiju/bin/makeDB.sh -n

Build kaiju db -r

cd ~
mkdir kaijudb_r
cd kaijudb_r
~/kaiju/bin/makeDB.sh -r
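
The four builds above follow the same pattern (make a directory, cd into it, run makeDB.sh with one flag), so they can be wrapped in one helper. A sketch only: build_kaiju_db, MAKEDB and KAIJU_DB_ROOT are names invented for this example, and the directory naming simply mirrors the kaijudb_e / kaijudb_p / kaijudb_n / kaijudb_r convention used here.

```shell
# Path to makeDB.sh; defaults to the location used in these notes.
MAKEDB="${MAKEDB:-$HOME/kaiju/bin/makeDB.sh}"

# Build one database variant in its own directory, e.g. -p -> ~/kaijudb_p.
build_kaiju_db() {
    flag="$1"
    case "$flag" in
        -r|-p|-n|-e) ;;
        *) echo "unknown makeDB.sh option: $flag" >&2; return 1 ;;
    esac
    db_dir="${KAIJU_DB_ROOT:-$HOME}/kaijudb_${flag#-}"
    mkdir -p "$db_dir" || return 1
    # Subshell so the caller's working directory is left unchanged.
    ( cd "$db_dir" && "$MAKEDB" "$flag" )
}
```

For example, build_kaiju_db -r builds the RefSeq database in ~/kaijudb_r regardless of the current directory.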