Skip to content

Instantly share code, notes, and snippets.

What would you like to do?
Convert read counts to transcripts per million (TPM).
#' Convert counts to transcripts per million (TPM).
#' Convert a numeric matrix of features (rows) and conditions (columns) with
#' raw feature counts to transcripts per million.
#' Lior Pachter. Models for transcript quantification from RNA-Seq.
#' arXiv:1104.3889v2
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data:
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012.
#' doi:10.1007/s12064-012-0162-3
#' @param counts A numeric matrix of raw feature counts i.e.
#' fragments assigned to each gene.
#' @param featureLength A numeric vector with feature lengths.
#' @param meanFragmentLength A numeric vector with mean fragment lengths.
#' @return tpm A numeric matrix normalized by library size and feature length.
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))
# Compute effective lengths of features in each library.
effLen <-, lapply(1:ncol(counts), function(i) {
featureLength - meanFragmentLength[i] + 1
# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
# Process one column at a time.
tpm <-, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
# Copy the row and column names from the original matrix.
colnames(tpm) <- colnames(counts)
rownames(tpm) <- rownames(counts)
Copy link

myh9811 commented Feb 1, 2021

hi @slowkow just wonder if I can use expected (estimated) counts (RSEM result) for this script?
(or is its use restricted to raw counts only?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment