-
-
Save slowkow/6e34ccb4d1311b8fe62e to your computer and use it in GitHub Desktop.
# RPKM versus TPM | |
# | |
# RPKM and TPM are both normalized for library size and gene length. | |
# | |
# RPKM is not comparable across different samples. | |
# | |
# For more details, see: http://blog.nextgenetics.net/?e=51 | |
rpkm <- function(counts, lengths) { | |
rate <- counts / lengths | |
rate / sum(counts) * 1e6 | |
} | |
tpm <- function(counts, lengths) { | |
rate <- counts / lengths | |
rate / sum(rate) * 1e6 | |
} | |
genes <- data.frame( | |
Gene = c("A","B","C","D","E"), | |
Length = c(100, 50, 25, 5, 1) | |
) | |
counts <- data.frame( | |
S1 = c(80, 10, 6, 3, 1), | |
S2 = c(20, 20, 10, 50, 400) | |
) | |
rpkms <- apply(counts, 2, function(x) rpkm(x, genes$Length)) | |
tpms <- apply(counts, 2, function(x) tpm(x, genes$Length)) | |
genes | |
# Gene Length | |
# 1 A 100 | |
# 2 B 50 | |
# 3 C 25 | |
# 4 D 5 | |
# 5 E 1 | |
counts | |
# S1 S2 | |
# 1 80 20 | |
# 2 10 20 | |
# 3 6 10 | |
# 4 3 50 | |
# 5 1 400 | |
rpkms | |
# S1 S2 | |
# [1,] 8000 4e+02 | |
# [2,] 2000 8e+02 | |
# [3,] 2400 8e+02 | |
# [4,] 6000 2e+04 | |
# [5,] 10000 8e+05 | |
tpms | |
# S1 S2 | |
# [1,] 281690.14 486.618 | |
# [2,] 70422.54 973.236 | |
# [3,] 84507.04 973.236 | |
# [4,] 211267.61 24330.900 | |
# [5,] 352112.68 973236.010 | |
# Sample means should be equal. | |
colSums(rpkms) | |
# S1 S2 | |
# 28400 822000 | |
colSums(tpms) | |
# S1 S2 | |
# 1e+06 1e+06 | |
colMeans(rpkms) | |
# S1 S2 | |
# 5680 164400 | |
colMeans(tpms) | |
# S1 S2 | |
# 2e+05 2e+05 |
@slowkow Thank you very much for your reply. There are no NAs in my matrix. However, when I used the tpm function with sum(rate, na.rm = TRUE) I got a matrix as you did. Looking into this matrix with the TPM values, some genes had NAs. Looking in the count matrix for these genes, there were no NAs, they have counts in all samples. There must be something I cannot figure out right now. Again, thank you!
@slowkow thanks a lot for this useful function!
Just a question:
I get this warnings after I apply the tpm function on my readcounts
Warning messages:
1: In counts/lengths :
longer object length is not a multiple of shorter object length
2: In counts/lengths :
longer object length is not a multiple of shorter object length
3: In counts/lengths :
longer object length is not a multiple of shorter object length
But the result seems to be correct. Should I worry?
Yes, you should worry. Your results are not correct.
Please double-check all of your variables. Look at the contents inside the variables, count the rows, count the columns, check the length, etc. Please confirm that everything matches exactly as it should.
For your data, this is probably false:
nrow(counts) == length(lengths)
What you are doing is incorrect. Here is an example of how R will "recycle" values when you give objects with mismatched sizes:
> 1:5
[1] 1 2 3 4 5
> 1:5 / 1:3
[1] 1.0 1.0 1.0 4.0 2.5
Warning message:
In 1:5/1:3 :
longer object length is not a multiple of shorter object length
Notice that we got 1/1 = 1.0
, 2/2 = 1.0
, 3/3 = 1.0
, 4/1 = 4.0
, 5/2 = 2.5
.
When we ran out of values after using 1, 2, and 3 in the denominator, R will "recycle" the values to get 4/1
and 5/2
.
Could I ask you to please consider asking for additional help at Biostars? That is a much better place to ask for help than GitHub Gist comments.
What is the difference between "lengths" and "gene$Lengths". Are those both the feature length that are obtained from the featureCounts?
@kxb38 Please consider copying the code into your R session and running it. I find that running code helps me to become more comfortable with it.
Also consider asking for help at Biostars, where many people are ready to answer your questions immediately. (In contrast, no one will see your questions here on this GitHub Gist.)
Consider using the length estimates produced by your read counter (HTSeq, STAR, Subread, Kallisto, Salmon, etc.).
Here are some tutorials I found useful:
Thank you so much!
hi @slowkow , gene length is in base pairs, correct?
@AimSchina Please check for
NA
in your count matrix.Here is an example for you.
Let's take the counts from above, and change one value to
NA
:Now let's compute the TPM:
Now let's use
sum(rate, na.rm = TRUE)
instead ofsum(rate)
: