kallisto is a new method for processing RNA-seq data. Because it pseudoaligns reads to a transcriptome instead of aligning them to a genome, its quantification step is much faster. While the computational speedup will be huge for projects with many samples and/or organisms with large genomes, I was curious how much time kallisto would save on a small RNA-seq project for an organism with a smaller genome. To perform this comparison, I downloaded 6 fastq files from a recent yeast RNA-seq study on GEO. I chose Subread as the comparison method because it performs read alignment but is optimized for quickly obtaining gene counts (it soft clips reads instead of trying to map exact exon-exon boundaries).
kallisto took less than 5 minutes to index the transcriptome and pseudoalign 6 RNA-seq samples.
$ time bash run-kallisto.sh 1
# Program output omitted
real 4m51.378s
user 4m30.056s
sys 0m6.044s
Utilizing all four cores of my laptop, kallisto took less than 3 minutes.
$ time bash run-kallisto.sh 4
# Program output omitted
real 2m34.972s
user 5m3.268s
sys 0m7.787s
Subread took about 45 minutes to index the genome, align 6 RNA-seq samples, and then count the number of reads per gene.
$ time bash run-subread.sh 1
real 46m16.760s
user 42m46.264s
sys 0m17.379s
Utilizing all 4 cores of my laptop reduced the time to under a half hour.
$ time bash run-subread.sh 4
real 27m19.005s
user 57m0.761s
sys 2m47.445s
Even for a small-scale RNA-seq study with only 6 yeast samples, kallisto is roughly 10x faster than the alignment-based Subread method (about 9.5x on one core and 10.6x on four). Building the index takes negligible time for either method, so Subread's time disadvantage comes from the alignment step. Thus even for projects with a small sample size and an organism with a small genome, the time saved by using kallisto is substantial.
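The speedup can be checked directly from the `real` times reported above. The following is a minimal sketch; the helper names `to_seconds` and `speedup` are made up for illustration and are not part of the original scripts.

```shell
# Convert a `time`-style duration such as 46m16.760s into seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, parts, "m")        # parts[1] = minutes, parts[2] = seconds plus trailing "s"
    sub(/s$/, "", parts[2])     # drop the trailing "s"
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# Ratio of a slow duration to a fast one, rounded to one decimal place.
speedup() {
  awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", slow / fast }'
}

# Wall-clock ("real") times copied from the runs above: Subread first, kallisto second.
echo "1 core:  $(speedup 46m16.760s 4m51.378s)x"
echo "4 cores: $(speedup 27m19.005s 2m34.972s)x"
```

With the timings from this post, this prints a speedup of 9.5x on one core and 10.6x on four cores.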
download-data.R downloads all the data needed for the comparison. It takes a long time because it downloads 6 separate fastq files.
run-subread.sh runs a typical Subread analysis in which reads from the 6 fastq files are aligned to the genome with subread-align and reads per gene are counted with featureCounts.
run-kallisto.sh pseudoaligns the 6 fastq files.
Both scripts take exactly one argument, which is the number of cores to use for multithreading.
The timings were measured with the UNIX time command.
- RAM: 3.7 GB
- processor: Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4
- OS: Ubuntu 14.04
- R: 3.2.3
- biomaRt: 2.24.1
- Subread: 1.5.0
- kallisto: 0.42.4
The complexity of the transcriptome is what drives the choice of k. The k-mers in the transcriptome are more random than those in the genome, i.e. they contain fewer short repeats. However, when you include several transcripts from the same gene, i.e. the same genomic location, you enrich for ambiguity.
On the other hand, because kallisto uses exact k-mer matches, any sequencing error means that the information in the affected k-mer is lost: it cannot be pseudoaligned (it could also match another location, but that's rare). So, depending on your read length, you might not want to go as high as 31. In our analysis we didn't see a big difference between k=23 and k=31, but accuracy dropped at k=19, so anything from 21 to 31 should work well.
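To make the read-length trade-off concrete, it helps to count how many k-mers a single error removes. A read of L bases contains L - k + 1 k-mers, and one erroneous base corrupts every k-mer that overlaps it. The sketch below is a simplified counting model only, assuming a single substitution error per read; it is not how kallisto itself is implemented, and the function names are hypothetical.

```shell
# Number of k-mers in a read: read_len - k + 1.
# Arguments: read length, k.
kmers_in_read() {
  echo $(( $1 - $2 + 1 ))
}

# Number of k-mers that overlap a single erroneous base.
# Arguments: read length, k, 1-based position of the error.
# A k-mer starting at s covers pos when
#   max(1, pos - k + 1) <= s <= min(pos, read_len - k + 1).
kmers_hit_by_error() {
  read_len=$1; k=$2; pos=$3
  first=$(( pos - k + 1 ))
  last=$pos
  total=$(( read_len - k + 1 ))
  if [ "$first" -lt 1 ]; then first=1; fi
  if [ "$last" -gt "$total" ]; then last=$total; fi
  echo $(( last - first + 1 ))
}

echo "35 bp read, k=31, error at base 10: $(kmers_hit_by_error 35 31 10) of $(kmers_in_read 35 31) k-mers lost"
echo "35 bp read, k=21, error at base 10: $(kmers_hit_by_error 35 21 10) of $(kmers_in_read 35 21) k-mers lost"
echo "75 bp read, k=31, error at base 38: $(kmers_hit_by_error 75 31 38) of $(kmers_in_read 75 31) k-mers lost"
```

Under this model, a 35 bp read with an error at base 10 loses all 5 of its 31-mers but only 10 of its 15 21-mers, whereas a 75 bp read with an error in the middle still keeps 14 of its 45 31-mers.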
If your reads are 75 bp or longer, we recommend k=31; if they are 35 bp or shorter, use k=21. For anything in between you can use either one, depending on the error rate.
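This rule of thumb can be written as a small helper. It is only a sketch of the recommendation above: `suggest_k` is a hypothetical name, and for intermediate read lengths it returns both options because the better choice depends on the error rate of your data.

```shell
# Suggest a k-mer size from the read length in bp, following the
# rule of thumb above. The result is meant to be passed to
# `kallisto index` through its -k option.
suggest_k() {
  read_len=$1
  if [ "$read_len" -ge 75 ]; then
    echo 31
  elif [ "$read_len" -le 35 ]; then
    echo 21
  else
    echo "21 or 31"   # either works; decide based on error rate
  fi
}

suggest_k 100
suggest_k 35
suggest_k 50
```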