Last active
October 8, 2024 02:00
-
Star
(102)
You must be signed in to star a gist -
Fork
(20)
You must be signed in to fork a gist
-
-
Save slowkow/c6ab0348747f86e2748b to your computer and use it in GitHub Desktop.
Convert read counts to transcripts per million (TPM).
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#' Convert counts to transcripts per million (TPM). | |
#' | |
#' Convert a numeric matrix of features (rows) and conditions (columns) with | |
#' raw feature counts to transcripts per million. | |
#' | |
#' Lior Pachter. Models for transcript quantification from RNA-Seq. | |
#' arXiv:1104.3889v2 | |
#' | |
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data: | |
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012. | |
#' doi:10.1007/s12064-012-0162-3 | |
#' | |
#' @param counts A numeric matrix of raw feature counts i.e. | |
#' fragments assigned to each gene. | |
#' @param featureLength A numeric vector with feature lengths. | |
#' @param meanFragmentLength A numeric vector with mean fragment lengths. | |
#' @return tpm A numeric matrix normalized by library size and feature length. | |
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) { | |
# Ensure valid arguments. | |
stopifnot(length(featureLength) == nrow(counts)) | |
stopifnot(length(meanFragmentLength) == ncol(counts)) | |
# Compute effective lengths of features in each library. | |
effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) { | |
featureLength - meanFragmentLength[i] + 1 | |
})) | |
# Exclude genes with length less than the mean fragment length. | |
idx <- apply(effLen, 1, function(x) min(x) > 1) | |
counts <- counts[idx,] | |
effLen <- effLen[idx,] | |
featureLength <- featureLength[idx] | |
# Process one column at a time. | |
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) { | |
rate = log(counts[,i]) - log(effLen[,i]) | |
denom = log(sum(exp(rate))) | |
exp(rate - denom + log(1e6)) | |
})) | |
# Copy the row and column names from the original matrix. | |
colnames(tpm) <- colnames(counts) | |
rownames(tpm) <- rownames(counts) | |
return(tpm) | |
} |
This is the Python version:
import pandas as pd
import numpy as np
def read_counts2tpm(df, sample_name):
"""
convert read counts to TPM (transcripts per million)
:param df: a dataFrame contains the result coming from featureCounts
:param sample_name: a list, all sample names, same as the result of featureCounts
:return: TPM
"""
result = df
sample_reads = result.loc[:, sample_name].copy()
gene_len = result.loc[:, ['Length']]
rate = sample_reads.values / gene_len.values
tpm = rate / np.sum(rate, axis=0).reshape(1, -1) * 1e6
return pd.DataFrame(data=tpm, columns=sample_name)
a = pd.DataFrame(data = {
'Gene': ("A","B","C","D","E"),
'Length': (100, 50, 25, 5, 1),
'S1': (80, 10, 6, 3, 1),
'S2': (20, 20, 10, 50, 400)
})
read_counts2tpm(a, ['S1', 'S2'])
The sad thing is i was using your "tpm_rpkm_from_featureCount.R" code years ago, and could not find the place how to cite it. cry......
hi @slowkow just wonder if I can use expected (estimated) counts (RSEM result) for this script?
(or is its use restricted to raw counts only?)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@maromato Could I ask you to please consider asking for additional help at Biostars? That is a better place to ask for help than GitHub Gist comments.