jdblischak/README.md

## README.md

      
Comparing speed for yeast RNA-seq analysis - kallisto vs. Subread

Introduction

kallisto is a new method for processing RNA-seq data.
By pseudoaligning reads to a transcriptome instead of aligning reads to a genome, the quantification step is much faster.
While the computational speedup will be huge for projects with many samples and/or with organisms with large genomes, I was curious how much time would be saved using kallisto on a small RNA-seq project for an organism with a smaller genome.
To perform this comparison, I downloaded 6 fastq files from a recent yeast RNA-seq study on GEO.
I chose Subread as the comparison method because it performs read alignment but is optimized for quickly obtaining gene counts (it soft clips reads instead of trying to map exact exon-exon boundaries).

Results

kallisto took less than 5 minutes to index the transcriptome and pseudoalign 6 RNA-seq samples.
$ time bash run-kallisto.sh 1
# Program output omitted
real	4m51.378s
user	4m30.056s
sys	0m6.044s

Utilizing all four cores of my laptop, kallisto took less than 3 minutes.
$ time bash run-kallisto.sh 4
# Program output omitted
real	2m34.972s
user	5m3.268s
sys	0m7.787s

Subread took about 46 minutes to index the genome, align 6 RNA-seq samples, and then count the number of reads per gene.
$ time bash run-subread.sh 1
real	46m16.760s
user	42m46.264s
sys	0m17.379s

Utilizing all 4 cores of my laptop reduced the time to under a half hour.
$ time bash run-subread.sh 4
real	27m19.005s
user	57m0.761s
sys	2m47.445s
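To see how much each tool gains from multithreading, the `real` values above can be converted to seconds and compared. The following is a minimal sketch and not part of the original scripts: `to_seconds` and `scaling` are hypothetical helper names, and it assumes the `<minutes>m<seconds>s` format that bash's `time` prints above and that `awk` is available.

```shell
#!/bin/bash

# Convert a bash `time` value such as "4m51.378s" to seconds.
# Hypothetical helper; assumes the "<minutes>m<seconds>s" format shown above.
to_seconds() {
  echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of the first wall-clock time to the second (e.g. 1 thread vs. 4 threads).
scaling() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}

echo "kallisto 1 vs. 4 threads: $(scaling 4m51.378s 2m34.972s)x"
echo "Subread  1 vs. 4 threads: $(scaling 46m16.760s 27m19.005s)x"
```

With the timings reported here, this gives roughly 1.9x for kallisto and 1.7x for Subread when going from one thread to four.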

Conclusions

Even for a small-scale RNA-seq study with only 6 yeast samples, kallisto is roughly 10x faster than the alignment-based Subread method (about 9.5x on one core and 10.6x on four).
The time to build the index for either method is negligible, so the real time disadvantage for Subread occurs during the alignment step.
Thus even for projects with a small sample size and an organism with a small genome, the time saved by using kallisto is substantial.
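The speedup can be checked directly from the `real` times reported in the Results. This is a small sketch rather than part of the original analysis; `speedup` is a hypothetical helper that takes minutes and seconds for Subread followed by minutes and seconds for kallisto, and it assumes `awk` is available.

```shell
#!/bin/bash

# Speedup = Subread wall-clock time / kallisto wall-clock time.
# Arguments: Subread minutes, Subread seconds, kallisto minutes, kallisto seconds.
# Hypothetical helper; the values passed below are the `real` times from the Results.
speedup() {
  awk -v sm="$1" -v ss="$2" -v km="$3" -v ks="$4" \
    'BEGIN { printf "%.1f\n", (sm * 60 + ss) / (km * 60 + ks) }'
}

echo "1 thread:  $(speedup 46 16.760 4 51.378)x"
echo "4 threads: $(speedup 27 19.005 2 34.972)x"
```

This prints 9.5x for the single-threaded runs and 10.6x for the four-threaded runs.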

Methods

download-data.R downloads all the data needed for the comparison.
It takes a long time because it downloads 6 separate SRA files and converts each one to a gzipped fastq file with fastq-dump.
run-subread.sh runs a typical Subread analysis in which reads from the 6 fastq files are aligned to the genome with subread-align and reads per gene are counted with featureCounts.
run-kallisto.sh pseudoaligns the 6 fastq files.
Both scripts take exactly one argument, which is the number of cores to use for multithreading.
The timings were measured with the UNIX time command.
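Because both scripts read the thread count from `$1` without checking it, a guard at the top of each script could catch a missing or malformed argument before any indexing starts. The sketch below is an assumption about how such a check might look; `valid_threads` is a hypothetical helper and does not appear in the original scripts.

```shell
#!/bin/bash

# Return 0 if called with exactly one positive integer, 1 otherwise.
# Hypothetical helper, not part of the original run-*.sh scripts.
valid_threads() {
  [ "$#" -eq 1 ] || return 1
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty or contains a non-digit
  esac
  [ "$1" -gt 0 ]
}

# Example guard for the top of a script (commented out so the sketch has no side effects):
# valid_threads "$@" || { echo "Usage: $0 <threads>" >&2; exit 1; }
```

With the guard in place, `bash run-kallisto.sh` with no argument would exit with a usage message instead of passing an empty value to `-t`.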

RAM: 3.7 GB
processor: Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4
OS: Ubuntu 14.04
R: 3.2.3
biomaRt: 2.24.1
Subread: 1.5.0
kallisto: 0.42.4


## download-data.R
#!/usr/bin/env Rscript

# Download the necessary data files.
#
# yeast transcriptome: for pseudoaligning with kallisto
# yeast genome: for aligning with subread
# yeast exons: for counting reads per gene with featureCounts
# yeast RNA-seq: for testing quantification speed

# Download yeast transcriptome
transcriptome_fname <- "transcriptome.fa.gz"
transcriptome_url <- "http://bio.math.berkeley.edu/kallisto/transcriptomes/Saccharomyces_cerevisiae.R64-1-1.rel81.cdna.all.fa.gz"
if (!file.exists(transcriptome_fname)) {
  download.file(url = transcriptome_url,
                destfile = transcriptome_fname)
}

# Download yeast genome
genome_fname <- "genome.fa.gz"
genome_url <- "ftp://ftp.ensemblgenomes.org/pub/release-28/fungi/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.28.dna.genome.fa.gz"
if (!file.exists(genome_fname)) {
  download.file(url = genome_url,
                destfile = genome_fname)
}

# Download yeast exons
suppressPackageStartupMessages(library("biomaRt"))
ensembl <- useMart(host = "sep2015.archive.ensembl.org",
                   biomart = "ENSEMBL_MART_ENSEMBL",
                   dataset = "scerevisiae_gene_ensembl")

exons_fname <- "exons.saf"
if (!file.exists(exons_fname)) {
  exons_all <- getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id",
                                    "chromosome_name", "exon_chrom_start",
                                    "exon_chrom_end", "strand"),
                     mart = ensembl)
  exons_final <- exons_all[, c("ensembl_gene_id", "chromosome_name", "exon_chrom_start",
                             "exon_chrom_end", "strand")]
  colnames(exons_final) <- c("GeneID", "Chr", "Start", "End", "Strand")
  # Sort by chromosome and position
  exons_final <- exons_final[order(exons_final$Chr,
                                   exons_final$Start,
                                   exons_final$End), ]
  # Fix strand
  exons_final$Strand <- ifelse(exons_final$Strand == 1, "+", "-")
  write.table(exons_final, exons_fname, quote = FALSE, sep = "\t",
              row.names = FALSE)
}

# Download 6 yeast RNA-seq files from SRA
# Used the following search and took the first result:
# search: https://www.ncbi.nlm.nih.gov/gds?term=%28saccharomyces%20cerevisiae[Organism]%29%20AND%20%22expression%20profiling%20by%20high%20throughput%20sequencing%22[DataSet%20Type]
# GEO entry: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77596

fastq_fname <- paste0("sample", 1:6, ".fastq.gz")
# fixed = TRUE so that "." is matched literally rather than as a regex wildcard
sra_fname <- sub("fastq.gz", "sra", fastq_fname, fixed = TRUE)
base_url <- "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX%2FSRX156%2FSRX15634"
fastq_url <- paste0(base_url, c("56/SRR3148171/SRR3148171",
                                "57/SRR3148172/SRR3148172",
                                "58/SRR3148173/SRR3148173",
                                "59/SRR3148174/SRR3148174",
                                "60/SRR3148175/SRR3148175",
                                "61/SRR3148176/SRR3148176"), ".sra")

for (i in seq_along(fastq_fname)) {
  if (!file.exists(sra_fname[i])) {
    download.file(url = fastq_url[i], destfile = sra_fname[i], method = "wget")
  }
  if (!file.exists(fastq_fname[i])) {
    # http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump
    cmd <- paste("fastq-dump", sra_fname[i], "--gzip --stdout >", fastq_fname[i])
    print(cmd)
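    # Stop on failure so that a partial fastq file is not silently reused on rerun
    if (system(cmd) != 0) stop("fastq-dump failed for ", sra_fname[i])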
  }
}

## run-kallisto.sh
#!/bin/bash

threads=$1

# Index transcriptome
kallisto index -i transcriptome.idx transcriptome.fa.gz

# Pseudoalign reads to transcripts
for i in {1..6}
do
  kallisto quant -i transcriptome.idx --single -l 180 -s 10 -t "$threads" -o "sample$i" "sample$i.fastq.gz"
done

## run-subread.sh
#!/bin/bash

threads=$1

# Index genome
zcat genome.fa.gz > tmp.fa
subread-buildindex -M 3000 -o genome-index tmp.fa
rm tmp.fa

# Align reads to genome
for i in {1..6}
do
  subread-align -i genome-index -r "sample$i.fastq.gz" -t 0 -u -T "$threads" > "sample$i.bam"
done

# Count reads per gene
featureCounts -a exons.saf -F SAF -o genecounts.txt sample[1-6].bam