Created
October 16, 2023 14:34
-
-
Save mikelove/c9164063b2edd78acb7c8f6d42a3e87a to your computer and use it in GitHub Desktop.
Toy tree example for collapsing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Toy tree example for collapsing" | |
author: "Michael Love" | |
--- | |
Example data with 20 inferential replicates, here we just have 1 | |
sample per condition and we calculate the LFC at each level of the | |
tree. | |
From the below simulation setup (see first chunk), the true DE signal | |
is from `c`, `e`, and `g`. In my opinion the optimal aggregation would | |
be `abc`, `e` and `g`. This is just intuitive, we need to encode this | |
as a metric and a rule for optimization. | |
Note the following: | |
1. there is shared sequence among `a`, `b`, and `c` (e.g. `a` and `b` | |
are not truly expressed, just misquantified). So `abc` minimizes | |
this with little loss to the DE signal. | |
2. `g` and `fg` have similar LFC. So we need to use inferential | |
variance to prioritize `g` over `fg`. | |
```{r} | |
n <- 20 | |
set.seed(1) | |
de1 <- rnorm(2*n, rep(c(150,100),each=n)) # true LFC = -.58 | |
de2 <- rnorm(2*n, rep(c(400,200),each=n)) # -1 | |
de3 <- rnorm(2*n, rep(c(200,400),each=n)) # +1 | |
a <- rnorm(2*n, 30, 10) # misquant | |
b <- rnorm(2*n, 30, 10) # misquant | |
c <- de1 - a - b | |
d <- rnorm(2*n, 800, 200) # high expression, no misquant | |
e <- de2 | |
f <- rnorm(2*n, 10, 3) # low expression, no misquant | |
g <- de3 # de in other direction to 'e' | |
``` | |
The tree on the features is as follows: | |
```{r} | |
bc <- b + c | |
abc <- a + bc | |
de <- d + e | |
fg <- f + g | |
defg <- de + fg | |
abcdefg <- abc + defg | |
``` | |
Heatmaps of the data at every level: | |
```{r} | |
mat <- rbind(a,b,c,d,e,f,g,bc,abc,de,fg,defg,abcdefg) | |
# first, no scaling on rows: | |
pheatmap::pheatmap(mat, cluster_rows=FALSE, cluster_cols=FALSE) | |
# second, scaling within row to help see DE: | |
pheatmap::pheatmap(mat, cluster_rows=FALSE, cluster_cols=FALSE, scale="row") | |
``` | |
Define functions to calculate LFC and the InfSD of LFC: | |
```{r} | |
lfc <- function(x) { | |
log2(mean(x[(n+1):(2*n)])) - | |
log2(mean(x[1:n])) | |
} | |
lfcInfSD <- function(x) { | |
sd(log2(x[(n+1):(2*n)]) - | |
log2(x[1:n])) | |
} | |
``` | |
Do we care only about LFC or also about the LFC scaled by inferential | |
SD: | |
```{r} | |
barplot(rev(apply(mat, 1, function(x) lfc(x))), | |
horiz=TRUE, las=1, xlab="LFC") | |
``` | |
```{r} | |
abs_lfc_over_infsd <- apply(mat, 1, function(x) abs(lfc(x))/lfcInfSD(x)) | |
barplot(rev(abs_lfc_over_infsd), | |
horiz=TRUE, las=1, xlab="abs LFC / InfSD") | |
``` | |
What if you follow from a leaf to root: | |
```{r} | |
plot(abs_lfc_over_infsd[c("c","abc","abcdefg")], | |
xlab="leaf -> root", ylab="abs lfc / InfSD", type="b") | |
``` | |
Questions: | |
1. How to make decisions? When we divide by InfSD we can get an | |
arbitrarily large value (if InfSD is close to 0). So I'm worried | |
about the stability of using the global sum of LFC / InfSD across | |
all potential cuts of the tree. It will prefer solutions that | |
incorporate features where InfSD $\rightarrow$ 0. | |
2. The above has nothing to do with significance, how to incorporate | |
significance and treeclimbR into this? Do we say, among solutions | |
that are similar by treeclimbR FDR, we prefer those which maximize | |
this sum of LFC / infSD? |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment