Skip to content

Instantly share code, notes, and snippets.

@jdblischak
Last active March 29, 2019 02:40
Show Gist options
  • Save jdblischak/30456f14918efc2618d2 to your computer and use it in GitHub Desktop.
Save jdblischak/30456f14918efc2618d2 to your computer and use it in GitHub Desktop.
kallisto vs. Subread for yeast RNA-seq analysis

Comparing speed for yeast RNA-seq analysis - kallisto vs. Subread

Introduction

kallisto is a new method for processing RNA-seq data. By pseudoaligning reads to a transcriptome instead of aligning reads to a genome, the quantification step is much faster. While the computational speedup will be huge for projects with many samples and/or with organisms with large genomes, I was curious how much time would be saved using kallisto on a small RNA-seq project for an organism with a smaller genome. To perform this comparison, I downloaded 6 fastq files from a recent yeast RNA-seq study on GEO. I chose Subread as the comparison method because it performs read alignment but is optimized for quickly obtaining gene counts (it soft clips reads instead of trying to map exact exon-exon boundaries).

Results

kallisto took less than 5 minutes to index the transcriptome and pseudoalign 6 RNA-seq samples.

$ time bash run-kallisto.sh 1
# Program output omitted
real	4m51.378s
user	4m30.056s
sys	0m6.044s

Utilizing all four cores of my laptop, kallisto took less than 3 minutes.

$ time bash run-kallisto.sh 4
# Program output omitted
real	2m34.972s
user	5m3.268s
sys	0m7.787s

Subread took about 45 minutes to index the genome, align 6 RNA-seq samples, and then count the number of reads per gene.

$ time bash run-subread.sh 1
real	46m16.760s
user	42m46.264s
sys	0m17.379s

Utilizing all 4 cores of my laptop reduced the time to under a half hour.

$ time bash run-subread.sh 4
real	27m19.005s
user	57m0.761s
sys	2m47.445s

Conclusions

Even for a small scale RNA-seq study with only 6 yeast samples, kallisto is ~9x faster than the alignment-based Subread method. The time to build the index for either method is negligible, so the real time disadvantage for Subread occurs during the alignment step. Thus even for projects with a small sample size and an organism with a small genome, the time saved by using kallisto is substantial.

Methods

download-data.R downloads all the data needed for the comparison. It takes a long time because it downloads 6 separate fastq files. run-subread.sh runs a typical Subread analysis in which reads from the 6 fastq files are aligned to the genome with subread-align and reads per gene are counted with featureCounts. run-kallisto.sh pseudoaligns the 6 fastq files. Both scripts take exactly one argument, which is the number of cores to use for multithreading. The time estimates were calculated using the UNIX function time.

  • RAM: 3.7 GB
  • processor: Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4
  • OS: Ubuntu 14.04
  • R: 3.2.3
  • biomaRt: 2.24.1
  • Subread: 1.5.0
  • kallisto: 0.42.4
#!/usr/bin/env Rscript
# Download the necessary data files.
#
# yeast transcriptome: for pseudoaligning with kallisto
# yeast genome: for aligning with subread
# yeast exons: for counting reads per gene with featureCounts
# yeast RNA-seq: for testing quantification speed
# Download yeast transcriptome
transcriptome_fname <- "transcriptome.fa.gz"
transcriptome_url <- "http://bio.math.berkeley.edu/kallisto/transcriptomes/Saccharomyces_cerevisiae.R64-1-1.rel81.cdna.all.fa.gz"
if (!file.exists(transcriptome_fname)) {
download.file(url = transcriptome_url,
destfile = transcriptome_fname)
}
# Download yeast genome
genome_fname <- "genome.fa.gz"
genome_url <- "ftp://ftp.ensemblgenomes.org/pub/release-28/fungi/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.28.dna.genome.fa.gz"
if (!file.exists(genome_fname)) {
download.file(url = genome_url,
destfile = genome_fname)
}
# Download yeast exons
suppressPackageStartupMessages(library("biomaRt"))
ensembl <- useMart(host = "sep2015.archive.ensembl.org",
biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "scerevisiae_gene_ensembl")
exons_fname <- "exons.saf"
if (!file.exists(exons_fname)) {
exons_all <- getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id",
"chromosome_name", "exon_chrom_start",
"exon_chrom_end", "strand"),
mart = ensembl)
exons_final <- exons_all[, c("ensembl_gene_id", "chromosome_name", "exon_chrom_start",
"exon_chrom_end", "strand")]
colnames(exons_final) <- c("GeneID", "Chr", "Start", "End", "Strand")
# Sort by chromosome and position
exons_final <- exons_final[order(exons_final$Chr,
exons_final$Start,
exons_final$End), ]
# Fix strand
exons_final$Strand <- ifelse(exons_final$Strand == 1, "+", "-")
write.table(exons_final, exons_fname, quote = FALSE, sep = "\t",
row.names = FALSE)
}
# Download 6 yeast RNA-seq files from SRA
# Used the following search and took the first result:
# search: https://www.ncbi.nlm.nih.gov/gds?term=%28saccharomyces%20cerevisiae[Organism]%29%20AND%20%22expression%20profiling%20by%20high%20throughput%20sequencing%22[DataSet%20Type]
# GEO entry: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77596
fastq_fname <- paste0("sample", 1:6, ".fastq.gz")
sra_fname <- sub("fastq.gz", "sra", fastq_fname)
base_url <- "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX156%2FSRX15634"
fastq_url <- paste0(base_url, c("56/SRR3148171/SRR3148171",
"57/SRR3148172/SRR3148172",
"58/SRR3148173/SRR3148173",
"59/SRR3148174/SRR3148174",
"60/SRR3148175/SRR3148175",
"61/SRR3148176/SRR3148176"), ".sra")
for (i in seq_along(fastq_fname)) {
if (!file.exists(sra_fname[i])) {
download.file(url = fastq_url[i], destfile = sra_fname[i], method = "wget")
}
if (!file.exists(fastq_fname[i])) {
# http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump
cmd <- paste("fastq-dump", sra_fname[i], "--gzip --stdout >", fastq_fname[i])
print(cmd)
system(cmd)
}
}
#!/bin/bash
threads=$1
# Index transcriptome
kallisto index -i transcriptome.idx transcriptome.fa.gz
# Pseudoalign reads to transcripts
for i in {1..6}
do
kallisto quant -i transcriptome.idx --single -l 180 -s 10 -t $threads -o sample$i sample$i.fastq.gz
done
#!/bin/bash
threads=$1
# Index genome
zcat genome.fa.gz > tmp.fa
subread-buildindex -M 3000 -o genome-index tmp.fa
rm tmp.fa
# Align reads to genome
for i in {1..6}
do
subread-align -i genome-index -r sample$i.fastq.gz -t 0 -u -T $threads > sample$i.bam
done
# Count reads per gene
featureCounts -a exons.saf -F SAF -o genecounts.txt sample[1-6].bam
@pmelsted
Copy link

Very nice comparison. I would recommend using a higher value of k. Either the default 31 or at least 21. These are 50bp reads so either should be fine.

Also kallisto quant has a thread parameter, so to get a fair comparison you should set it to -t 4

@jdblischak
Copy link
Author

Thanks for the feedback, @pmelsted! I'm sorry I didn't see your comment earlier. I didn't realize that GitHub does not send email notifications for comments on Gists. Apparently this has been a long requested feature. The Issue has been open since 2013, and recently a petition was started. I like having the discussion next to the code, so I'll send you an email to let you know I replied, and please email me as well if you leave another comment.

I updated the code to use the default value of 31 for k. I was motivated to use a shorter k based on the following excerpt from your paper:

The k-mer length must be large enough that random sequences of length k do not match to the transcriptome and short enough to ensure robustness to errors.

My reasoning was that since the yeast genome is much smaller than the human genome, there would be a lower chance that a random sequence of length k would match the transcriptome. Thus I could decrease the k-mer size and increase the robustness to errors. In retrospect the genome size is probably not the best metric for choosing k. I presume the size and complexity of the transcriptome would be more important. What are the most important factors that affect the choice of k-mer size, i.e. why 31?

I ran kallisto utilizing all 4 cores, and this did improve the speed. Overall for this small analysis, kallisto was ~9x faster!

@pmelsted
Copy link

The complexity of the transcriptome is what affects the choice of k. The k-mers in the transcriptome are more random, i.e. less short repeats, than the genome, however when you include transcripts from the same gene, i.e. same genomic location you are enriching for ambiguity.

On the other hand because kallisto uses exact k-mer matches any sequencing error means that the information in the k-mer is lost, cannot be pseudoaligned (also they could map to other locations, but that's rare). So depending on your read length you might not want to go as high as 31. In our analysis we didn't see a big difference between k=23 and k=31, but the accuracy dropped with k=19, so anything from 21 to 31 should work well.

If your reads are 75bp or more we recommend k=31, if they are 35 or less use k=21, anything in between you can use either one (depends on error rate).

@jdblischak
Copy link
Author

Thanks for the explanation, @pmelsted!

@w-shi
Copy link

w-shi commented Jan 18, 2017

Hi @jdblischak, you can run Subread more efficiently by directly mapping reads to the transcriptome (instead of the genome) and then obtaining gene counts using reads mapping to the transcripts.

When using Subread to map reads to transcriptome, please remove the '-u' option from your command since this will allow Subread to map a read to one of the transcripts belonging to the same gene if the read overlaps more than one transcript. This will also facilitate the assignment of the read to the gene.

When runing featureCounts to count reads for genes, you will need to generate an annotation in which features are transcripts (ie. each row in your annotation is a transcript instead of an exon) and transcripts should be grouped into genes. For those reads that overlap more than one transcript, featureCounts will only count them once since they were reported to map to only one the transcripts in the mapping results. This should give you almost the same counting results as you got from your current Subread analysis.

featureCounts supports multi-threading so you may add '-T' option to your command to enable it to run in a multi-threading mode.

With this approach, I would expect a significant reduction in Subread's running time. This also makes Subread results more comparable with your kallisto results since your kallisto analysis also used the transcriptome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment