kallisto is a new method for processing RNA-seq data. By pseudoaligning reads to a transcriptome instead of aligning reads to a genome, the quantification step is much faster. While the computational speedup will be huge for projects with many samples and/or with organisms with large genomes, I was curious how much time would be saved using kallisto on a small RNA-seq project for an organism with a smaller genome. To perform this comparison, I downloaded 6 fastq files from a recent yeast RNA-seq study on GEO. I chose Subread as the comparison method because it performs read alignment but is optimized for quickly obtaining gene counts (it soft clips reads instead of trying to map exact exon-exon boundaries).
Using a single core, kallisto took just under 5 minutes to index the transcriptome and pseudoalign the 6 RNA-seq samples.
$ time bash run-kallisto.sh 1
# Program output omitted
real 4m51.378s
user 4m30.056s
sys 0m6.044s
Utilizing all four cores of my laptop, kallisto took less than 3 minutes.
$ time bash run-kallisto.sh 4
# Program output omitted
real 2m34.972s
user 5m3.268s
sys 0m7.787s
Using a single core, Subread took about 46 minutes to index the genome, align the 6 RNA-seq samples, and then count the number of reads per gene.
$ time bash run-subread.sh 1
real 46m16.760s
user 42m46.264s
sys 0m17.379s
Utilizing all 4 cores of my laptop reduced the time to under a half hour.
$ time bash run-subread.sh 4
real 27m19.005s
user 57m0.761s
sys 2m47.445s
Even for a small-scale RNA-seq study with only 6 yeast samples, kallisto is roughly 10x faster than the alignment-based Subread method (about 9.5x on one core and 10.6x on four cores, going by the wall-clock times above). The time to build the index for either method is negligible, so the real time disadvantage for Subread occurs during the alignment step. Thus even for projects with a small sample size and an organism with a small genome, the time saved by using kallisto is substantial.
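The speedup can be checked directly from the `real` times reported above. The following is a minimal sketch, assuming a POSIX shell with `awk` available; the helper names `to_seconds` and `speedup` are made up for this illustration and are not part of the benchmark scripts.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not part of the original scripts).

# Convert a `time`-style duration such as 46m16.760s to seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, parts, "m")        # parts[1] = minutes, parts[2] = seconds plus trailing "s"
    sub(/s$/, "", parts[2])     # drop the trailing "s"
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# Ratio of a slower duration to a faster one, rounded to one decimal place.
speedup() {
  awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", slow / fast }'
}

# Wall-clock ("real") times copied from the runs above.
echo "1 core:  Subread / kallisto = $(speedup 46m16.760s 4m51.378s)x"
echo "4 cores: Subread / kallisto = $(speedup 27m19.005s 2m34.972s)x"
```

With the timings above this prints a ratio of 9.5x for the single-core runs and 10.6x for the four-core runs. Only wall-clock time is compared here; the `user` times would give a different picture, since they sum CPU time across threads.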
- download-data.R downloads all the data needed for the comparison. It takes a long time because it downloads 6 separate fastq files.
- run-subread.sh runs a typical Subread analysis in which reads from the 6 fastq files are aligned to the genome with subread-align and reads per gene are counted with featureCounts.
- run-kallisto.sh pseudoaligns the 6 fastq files.
Both scripts take exactly one argument, which is the number of cores to use for multithreading.
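For readers who do not have the scripts at hand, the two workflows described above can be sketched roughly as follows. This is not a copy of run-kallisto.sh or run-subread.sh: every file name, the k-mer size, and the fragment-length values are hypothetical placeholders, and the sketch assumes single-end reads in uncompressed fastq files. By default it only prints the commands (a dry run), so it can be read and tried without kallisto or Subread installed.

```shell
#!/usr/bin/env bash
# Sketch only: all file names and parameter values are hypothetical placeholders.
set -eu

threads="${1:-1}"         # the single argument: number of threads
dry_run="${DRY_RUN:-1}"   # 1 = print commands (default), 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$dry_run" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Placeholder sample names; the real project has 6 fastq files.
samples="sample1 sample2"

kallisto_pipeline() {
  # Index the transcriptome once, then pseudoalign each sample.
  run kallisto index -i transcripts.idx -k 31 transcripts.fa
  for s in $samples; do
    # --single requires a fragment length (-l) and standard deviation (-s);
    # 200 and 20 are assumed values, not taken from the original analysis.
    run kallisto quant -i transcripts.idx -o "kallisto/$s" \
      --single -l 200 -s 20 -t "$1" "$s.fastq"
  done
}

subread_pipeline() {
  # Index the genome, align each sample, then count reads per gene.
  run subread-buildindex -o genome_index genome.fa
  for s in $samples; do
    # -t 0 marks the input as RNA-seq; -T sets the number of threads.
    run subread-align -t 0 -T "$1" -i genome_index -r "$s.fastq" -o "$s.bam"
  done
  run featureCounts -T "$1" -a genes.gtf -o counts.txt sample1.bam sample2.bam
}

kallisto_pipeline "$threads"
subread_pipeline "$threads"
```

Note that kallisto takes its thread count via `-t`, while subread-align and featureCounts use `-T`. In this sketch the single script argument is forwarded to all three tools, so both methods get the same number of threads.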
The timings were measured with the Unix time command.
- RAM: 3.7 GB
- processor: Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4
- OS: Ubuntu 14.04
- R: 3.2.3
- biomaRt: 2.24.1
- Subread: 1.5.0
- kallisto: 0.42.4
Very nice comparison. I would recommend using a higher value of k: either the default of 31, or at least 21. These are 50 bp reads, so either should be fine. Also, kallisto quant has a thread parameter, so to get a fair comparison you should set it to -t 4.