Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Comparison of RPKM (reads per kilobase per million) and TPM (transcripts per million).
# RPKM versus TPM
#
# RPKM and TPM are both normalized for library size and gene length.
#
# RPKM is not comparable across different samples.
#
# For more details, see: http://blog.nextgenetics.net/?e=51
rpkm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(counts) * 1e6
}
tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate) * 1e6
}
genes <- data.frame(
Gene = c("A","B","C","D","E"),
Length = c(100, 50, 25, 5, 1)
)
counts <- data.frame(
S1 = c(80, 10, 6, 3, 1),
S2 = c(20, 20, 10, 50, 400)
)
rpkms <- apply(counts, 2, function(x) rpkm(x, genes$Length))
tpms <- apply(counts, 2, function(x) tpm(x, genes$Length))
genes
# Gene Length
# 1 A 100
# 2 B 50
# 3 C 25
# 4 D 5
# 5 E 1
counts
# S1 S2
# 1 80 20
# 2 10 20
# 3 6 10
# 4 3 50
# 5 1 400
rpkms
# S1 S2
# [1,] 8000 4e+02
# [2,] 2000 8e+02
# [3,] 2400 8e+02
# [4,] 6000 2e+04
# [5,] 10000 8e+05
tpms
# S1 S2
# [1,] 281690.14 486.618
# [2,] 70422.54 973.236
# [3,] 84507.04 973.236
# [4,] 211267.61 24330.900
# [5,] 352112.68 973236.010
# Sample means should be equal.
colSums(rpkms)
# S1 S2
# 28400 822000
colSums(tpms)
# S1 S2
# 1e+06 1e+06
colMeans(rpkms)
# S1 S2
# 5680 164400
colMeans(tpms)
# S1 S2
# 2e+05 2e+05
@muhammed-ali

This comment has been minimized.

Copy link

muhammed-ali commented Jul 25, 2017

@slowkow Thanks for this post. So in this case we are ignoring the mean fragment length parameter. Right?
I have obtained raw counts from htseq-count and I used the exon length/gene (obtained by following this biostar post) as gene length. The .gtf file used here was same as used for getting the counts from htseq-count. Then I apply your function tpm to get TPM from raw counts and exon length per gene. Can you please guide me if I am doing it correct? Thank you.

@slowkow

This comment has been minimized.

Copy link
Owner Author

slowkow commented Jul 25, 2017

Consider using the length estimates produced by your read counter (HTSeq, STAR, Subread, Kallisto, Salmon, etc.).

Here are some tutorials I found useful:

@AimSchina

This comment has been minimized.

Copy link

AimSchina commented Jul 21, 2018

@slowkow Thank you very much for this post. I have tried to apply the tpm function on a count matrix of raw htseq counts ( 58307 genes, 28 samples) and I end up with a matrix of NAs. Could you possibly have an idea why this is happening? Thank you!

@slowkow

This comment has been minimized.

Copy link
Owner Author

slowkow commented Jul 22, 2018

@AimSchina Please check for NA in your count matrix.

Here is an example for you.

Let's take the counts from above, and change one value to NA:

> counts <- data.frame(
+   S1 = c(80, 10,  6,  3,   1),
+   S2 = c(20, 20, 10, 50, 400)
+ )
> counts
  S1  S2
1 80  20
2 10  20
3  6  10
4  3  50
5  1 400
> counts[1,1] <- NA

Now let's compute the TPM:

> tpm <- function(counts, lengths) {
+   rate <- counts / lengths
+   rate / sum(rate) * 1e6
+ }
> genes <- data.frame(
+   Gene = c("A","B","C","D","E"),
+   Length = c(100, 50, 25, 5, 1)
+ )
> tpms <- apply(counts, 2, function(x) tpm(x, genes$Length))
> tpms
     S1         S2
[1,] NA    486.618
[2,] NA    973.236
[3,] NA    973.236
[4,] NA  24330.900
[5,] NA 973236.010

Now let's use sum(rate, na.rm = TRUE) instead of sum(rate):

> tpm <- function(counts, lengths) {
+   rate <- counts / lengths
+   rate / sum(rate, na.rm = TRUE) * 1e6
+ }
> tpms <- apply(counts, 2, function(x) tpm(x, genes$Length))
> tpms
            S1         S2
[1,]        NA    486.618
[2,]  98039.22    973.236
[3,] 117647.06    973.236
[4,] 294117.65  24330.900
[5,] 490196.08 973236.010
@AimSchina

This comment has been minimized.

Copy link

AimSchina commented Jul 24, 2018

@slowkow Thank you very much for your reply. There are no NAs in my matrix. However, when I used the tpm function with sum(rate, na.rm = TRUE) I got a matrix as you did. Looking into this matrix with the TPM values, some genes had NAs. Looking in the count matrix for these genes, there were no NAs, they have counts in all samples. There must be something I cannot figure out right now. Again, thank you!

@rtasakis

This comment has been minimized.

Copy link

rtasakis commented Nov 29, 2018

@slowkow thanks a lot for this useful function!
Just a question:
I get this warnings after I apply the tpm function on my readcounts
Warning messages:
1: In counts/lengths :
longer object length is not a multiple of shorter object length
2: In counts/lengths :
longer object length is not a multiple of shorter object length
3: In counts/lengths :
longer object length is not a multiple of shorter object length

But the result seems to be correct. Should I worry?

@slowkow

This comment has been minimized.

Copy link
Owner Author

slowkow commented Nov 29, 2018

@rtasakis

Yes, you should worry. Your results are not correct.

Please double-check all of your variables. Look at the contents inside the variables, count the rows, count the columns, check the length, etc. Please confirm that everything matches exactly as it should.

For your data, this is probably false:

nrow(counts) == length(lengths)

What you are doing is incorrect. Here is an example of how R will "recycle" values when you give objects with mismatched sizes:

> 1:5
[1] 1 2 3 4 5
> 1:5 / 1:3
[1] 1.0 1.0 1.0 4.0 2.5
Warning message:
In 1:5/1:3 :
  longer object length is not a multiple of shorter object length

Notice that we got 1/1 = 1.0, 2/2 = 1.0, 3/3 = 1.0, 4/1 = 4.0, 5/2 = 2.5.

When we ran out of values after using 1, 2, and 3 in the denominator, R will "recycle" the values to get 4/1 and 5/2.

Could I ask you to please consider asking for additional help at Biostars? That is a much better place to ask for help than GitHub Gist comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.