Instantly share code, notes, and snippets.

Embed
What would you like to do?
Convert read counts to transcripts per million (TPM).
#' Convert counts to transcripts per million (TPM).
#'
#' Convert a numeric matrix of features (rows) and conditions (columns) with
#' raw feature counts to transcripts per million.
#'
#' Lior Pachter. Models for transcript quantification from RNA-Seq.
#' arXiv:1104.3889v2
#'
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data:
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012.
#' doi:10.1007/s12064-012-0162-3
#'
#' @param counts A numeric matrix of raw feature counts i.e.
#' fragments assigned to each gene.
#' @param featureLength A numeric vector with feature lengths.
#' @param meanFragmentLength A numeric vector with mean fragment lengths.
#' @return tpm A numeric matrix normalized by library size and feature length.
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))
# Compute effective lengths of features in each library.
effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
featureLength - meanFragmentLength[i] + 1
}))
# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
# Process one column at a time.
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}))
# Copy the row and column names from the original matrix.
colnames(tpm) <- colnames(counts)
rownames(tpm) <- rownames(counts)
return(tpm)
}
@mostafaBio

This comment has been minimized.

Show comment
Hide comment
@mostafaBio

mostafaBio Sep 22, 2016

Dear Slowkow,
Greetings! Please, Is it possible to convert from tpm matrix to count matrix?
Kind Regards.

mostafaBio commented Sep 22, 2016

Dear Slowkow,
Greetings! Please, Is it possible to convert from tpm matrix to count matrix?
Kind Regards.

@slowkow

This comment has been minimized.

Show comment
Hide comment
@slowkow

slowkow Nov 6, 2016

@mostafaBio In order to convert TPM to counts, you need the total number of assigned reads in each sample.

Owner

slowkow commented Nov 6, 2016

@mostafaBio In order to convert TPM to counts, you need the total number of assigned reads in each sample.

@sdash-github

This comment has been minimized.

Show comment
Hide comment
@sdash-github

sdash-github Nov 7, 2016

Hi,
I found this a useful script while looking for a tool to calculate TPM from read counts. However, I fail to understand how to get the 'meanFragmentLength A numeric vector with mean fragment lengths'. All I have is gene_id and its read_counts from HTSeq-count. So, how do I go about getting a 'meanFragmentLength'-- my data set is single end reads from illumina.
Thanks.
SD

sdash-github commented Nov 7, 2016

Hi,
I found this a useful script while looking for a tool to calculate TPM from read counts. However, I fail to understand how to get the 'meanFragmentLength A numeric vector with mean fragment lengths'. All I have is gene_id and its read_counts from HTSeq-count. So, how do I go about getting a 'meanFragmentLength'-- my data set is single end reads from illumina.
Thanks.
SD

@slowkow

This comment has been minimized.

Show comment
Hide comment
@slowkow

slowkow Nov 7, 2016

@sdash-github For paired-end sequencing data, the mean fragment length can be obtained from Picard using InsertSizeMetrics.

It is not possible to estimate fragment length from single-end sequencing data.

Here's a fragment (molecule of cDNA):

fragment      -----------------------
read1         ====
read2                            ====
insert            ===============

              |---------------------|

     fragment length = read1 + insert + read2
Owner

slowkow commented Nov 7, 2016

@sdash-github For paired-end sequencing data, the mean fragment length can be obtained from Picard using InsertSizeMetrics.

It is not possible to estimate fragment length from single-end sequencing data.

Here's a fragment (molecule of cDNA):

fragment      -----------------------
read1         ====
read2                            ====
insert            ===============

              |---------------------|

     fragment length = read1 + insert + read2
@slowkow

This comment has been minimized.

Show comment
Hide comment
@slowkow

slowkow Nov 7, 2016

Here are simpler functions for RPKM and TPM:

rpkm <- function(counts, lengths) {
  rate <- counts / lengths 
  rate / sum(counts) * 1e6
}

tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

See here for a demonstration of how they work: https://gist.github.com/slowkow/6e34ccb4d1311b8fe62e

For more details, see: http://blog.nextgenetics.net/?e=51

Owner

slowkow commented Nov 7, 2016

Here are simpler functions for RPKM and TPM:

rpkm <- function(counts, lengths) {
  rate <- counts / lengths 
  rate / sum(counts) * 1e6
}

tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

See here for a demonstration of how they work: https://gist.github.com/slowkow/6e34ccb4d1311b8fe62e

For more details, see: http://blog.nextgenetics.net/?e=51

@sdash-github

This comment has been minimized.

Show comment
Hide comment
@sdash-github

sdash-github Nov 10, 2016

Thank you very much. Almost a life saver for me.
Being not an expert in R:
The two functions along with their use in https://gist.github.com/slowkow/6e34ccb4d1311b8fe62e helped me understand how to put things together to make it work.
You already are or would make a great teacher.
SD

sdash-github commented Nov 10, 2016

Thank you very much. Almost a life saver for me.
Being not an expert in R:
The two functions along with their use in https://gist.github.com/slowkow/6e34ccb4d1311b8fe62e helped me understand how to put things together to make it work.
You already are or would make a great teacher.
SD

@saketkc

This comment has been minimized.

Show comment
Hide comment
@saketkc

saketkc Dec 7, 2016

@slowkow Shouldn't Line 44 be:

 rownames(tpm) <- rownames(counts)

and not

  rownames(tpm) <- rownames(counts)[idx]

saketkc commented Dec 7, 2016

@slowkow Shouldn't Line 44 be:

 rownames(tpm) <- rownames(counts)

and not

  rownames(tpm) <- rownames(counts)[idx]
@slowkow

This comment has been minimized.

Show comment
Hide comment
@slowkow

slowkow Dec 7, 2016

Thanks for catching that! My mistake.

Owner

slowkow commented Dec 7, 2016

Thanks for catching that! My mistake.

@ldutoit

This comment has been minimized.

Show comment
Hide comment
@ldutoit

ldutoit Mar 25, 2017

Thank you for this function!

I have a question, I am trying to estimate TPM from HTSEQ counts. I just cannot estimate fragment length. I was wondering if you I am confused as the bam outputted by HTSEQ are unusable for picard ( negative fragment length). Did anyone have this probem?

ldutoit commented Mar 25, 2017

Thank you for this function!

I have a question, I am trying to estimate TPM from HTSEQ counts. I just cannot estimate fragment length. I was wondering if you I am confused as the bam outputted by HTSEQ are unusable for picard ( negative fragment length). Did anyone have this probem?

@paulgradie

This comment has been minimized.

Show comment
Hide comment
@paulgradie

paulgradie Apr 14, 2017

@Idutoit - Your problem is actually pretty unclear to me. You calculated negative fragment lengths from your alignment files? Maybe don't use HTSEQ for alignments. I've used HTSEQ for counts, but it is SO SLOW. Use Subread featureCount instead maybe. There are lots of good aligners available. STAR, HISAT2, etc...

@slowkow - Thanks a lot for posting this code. And... good job on that name. 'slowkow'. haha.

What advantage does this have over ignoring meanfragmentlenth estimates as I've seen in some TPM implementations:
https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

Cheers,
Paul

paulgradie commented Apr 14, 2017

@Idutoit - Your problem is actually pretty unclear to me. You calculated negative fragment lengths from your alignment files? Maybe don't use HTSEQ for alignments. I've used HTSEQ for counts, but it is SO SLOW. Use Subread featureCount instead maybe. There are lots of good aligners available. STAR, HISAT2, etc...

@slowkow - Thanks a lot for posting this code. And... good job on that name. 'slowkow'. haha.

What advantage does this have over ignoring meanfragmentlenth estimates as I've seen in some TPM implementations:
https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

Cheers,
Paul

@nlapalu

This comment has been minimized.

Show comment
Hide comment
@nlapalu

nlapalu May 31, 2017

Hi,

Why do you exclude genes with negative effective length ? RSEM implements a model that always find a positive effective length. In my case, I prefer set the effective length to 1. Negative effective length is a quite common for genome of pathogens with small genes as effectors. The bias of negative effective length is largely due to missing UTR in annotation files that reduce transcript to the CDS part. I think that it is important to keep these genes. The way you count the reads and estimate the effective length influences the TPM value. So, if you want to compare libraries with TPM metrics, you must compute your TPM in the same way. Finally, I am not sure that TPM is the most reliable metric to compare libraries, especially if different tools were used for computation.
+
nico

nlapalu commented May 31, 2017

Hi,

Why do you exclude genes with negative effective length ? RSEM implements a model that always find a positive effective length. In my case, I prefer set the effective length to 1. Negative effective length is a quite common for genome of pathogens with small genes as effectors. The bias of negative effective length is largely due to missing UTR in annotation files that reduce transcript to the CDS part. I think that it is important to keep these genes. The way you count the reads and estimate the effective length influences the TPM value. So, if you want to compare libraries with TPM metrics, you must compute your TPM in the same way. Finally, I am not sure that TPM is the most reliable metric to compare libraries, especially if different tools were used for computation.
+
nico

@slowkow

This comment has been minimized.

Show comment
Hide comment
@slowkow

slowkow May 31, 2017

@nlapalu Thanks for the comments! I think you're probably right about the effective length.

Indeed, TPM may not be reliable when comparing multiple different libraries. One should check to see if libraries are comparable and perform normalization that is appropriate for the research question.

Owner

slowkow commented May 31, 2017

@nlapalu Thanks for the comments! I think you're probably right about the effective length.

Indeed, TPM may not be reliable when comparing multiple different libraries. One should check to see if libraries are comparable and perform normalization that is appropriate for the research question.

@flyboyleo

This comment has been minimized.

Show comment
Hide comment
@flyboyleo

flyboyleo Apr 19, 2018

Hi, I tried your scripts to convert featureCounts results from counts to TPM. Thanks.

But I wonder why the TPM results are not the same as RSEM generated..as the raw counts are same for both RSEM and featureCounts generated.

Is it because in the script you used the "effLen" instead of gene length?

p.s: Why don't you just wrote the equation without "log" and "exp", it might be easier to read and understand :)

flyboyleo commented Apr 19, 2018

Hi, I tried your scripts to convert featureCounts results from counts to TPM. Thanks.

But I wonder why the TPM results are not the same as RSEM generated..as the raw counts are same for both RSEM and featureCounts generated.

Is it because in the script you used the "effLen" instead of gene length?

p.s: Why don't you just wrote the equation without "log" and "exp", it might be easier to read and understand :)

@andysaurin

This comment has been minimized.

Show comment
Hide comment
@andysaurin

andysaurin Apr 20, 2018

flyboyleo - I tried the featureCounts script also and found it didn't produce columns that summed to 1e6 (which is what is required imo to be able to compare transcripts across samples)
fwiw, the rpkms didn't compute correctly either
So I wrote an Rscript based on the calculations shown in http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

https://github.com/andysaurin/tpm_rpkm

andysaurin commented Apr 20, 2018

flyboyleo - I tried the featureCounts script also and found it didn't produce columns that summed to 1e6 (which is what is required imo to be able to compare transcripts across samples)
fwiw, the rpkms didn't compute correctly either
So I wrote an Rscript based on the calculations shown in http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

https://github.com/andysaurin/tpm_rpkm

@venkan

This comment has been minimized.

Show comment
Hide comment
@venkan

venkan Oct 18, 2018

@slowkow Could you please tell how to convert featureCounts to FPKM? I know that rpkm is for single-end RNAseq. My data is paired-end RNAseq, so can I use rpkm function for that which give fpkm?

venkan commented Oct 18, 2018

@slowkow Could you please tell how to convert featureCounts to FPKM? I know that rpkm is for single-end RNAseq. My data is paired-end RNAseq, so can I use rpkm function for that which give fpkm?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment