Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Convert read counts to transcripts per million (TPM).
#' Convert counts to transcripts per million (TPM).
#'
#' Convert a numeric matrix of features (rows) and conditions (columns) with
#' raw feature counts to transcripts per million.
#'
#' Lior Pachter. Models for transcript quantification from RNA-Seq.
#' arXiv:1104.3889v2
#'
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data:
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012.
#' doi:10.1007/s12064-012-0162-3
#'
#' @param counts A numeric matrix of raw feature counts i.e.
#' fragments assigned to each gene.
#' @param featureLength A numeric vector with feature lengths.
#' @param meanFragmentLength A numeric vector with mean fragment lengths.
#' @return tpm A numeric matrix normalized by library size and feature length.
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))
# Compute effective lengths of features in each library.
effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
featureLength - meanFragmentLength[i] + 1
}))
# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
# Process one column at a time.
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}))
# Copy the row and column names from the original matrix.
colnames(tpm) <- colnames(counts)
rownames(tpm) <- rownames(counts)
return(tpm)
}
@flyboyleo
Copy link

flyboyleo commented Apr 19, 2018

Hi, I tried your scripts to convert featureCounts results from counts to TPM. Thanks.

But I wonder why the TPM results are not the same as RSEM generated..as the raw counts are same for both RSEM and featureCounts generated.

Is it because in the script you used the "effLen" instead of gene length?

p.s: Why don't you just wrote the equation without "log" and "exp", it might be easier to read and understand :)

@andysaurin
Copy link

andysaurin commented Apr 20, 2018

flyboyleo - I tried the featureCounts script also and found it didn't produce columns that summed to 1e6 (which is what is required imo to be able to compare transcripts across samples)
fwiw, the rpkms didn't compute correctly either
So I wrote an Rscript based on the calculations shown in http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

https://github.com/andysaurin/tpm_rpkm

@maromato
Copy link

maromato commented May 24, 2019

Hi @slowkow

This code is very useful!!!

By the way, I try to calculate mean fragment length with Picard and it is turn to be around 617.
I think 617 is relatively large length to exclude the genes.
Do you know any good idea to adjust this situation?
(ex. using original gene length for effective length?)

Thanks.

@slowkow
Copy link
Author

slowkow commented May 24, 2019

@maromato Could I ask you to please consider asking for additional help at Biostars? That is a better place to ask for help than GitHub Gist comments.

@OnlyBelter
Copy link

OnlyBelter commented Oct 10, 2019

This is the Python version:

import pandas as pd
import numpy as np

def read_counts2tpm(df, sample_name):
    """
    convert read counts to TPM (transcripts per million)
    :param df: a dataFrame contains the result coming from featureCounts
    :param sample_name: a list, all sample names, same as the result of featureCounts
    :return: TPM
    """
    result = df
    sample_reads = result.loc[:, sample_name].copy()
    gene_len = result.loc[:, ['Length']]
    rate = sample_reads.values / gene_len.values
    tpm = rate / np.sum(rate, axis=0).reshape(1, -1) * 1e6
    return pd.DataFrame(data=tpm, columns=sample_name)

a = pd.DataFrame(data = {
    'Gene': ("A","B","C","D","E"),
    'Length': (100, 50, 25, 5, 1),
     'S1': (80, 10,  6,  3,   1),
     'S2': (20, 20, 10, 50, 400)
})

read_counts2tpm(a, ['S1', 'S2'])

@siya-kk
Copy link

siya-kk commented Jan 26, 2021

The sad thing is i was using your "tpm_rpkm_from_featureCount.R" code years ago, and could not find the place how to cite it. cry......

@myh9811
Copy link

myh9811 commented Feb 1, 2021

hi @slowkow just wonder if I can use expected (estimated) counts (RSEM result) for this script?
(or is its use restricted to raw counts only?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment