---
title: "Toy tree example for collapsing"
author: "Michael Love"
date: 2023-10-16
---
Example data with 20 inferential replicates. Here we have just 1
sample per condition, and we calculate the LFC at each level of the
tree.
From the simulation setup below (see the first chunk), the true DE
signal comes from `c`, `e`, and `g`. In my opinion the optimal
aggregation would be `abc`, `e`, and `g`. This is just intuition; we
need to encode it as a metric and a rule for optimization.
Note the following:
1. There is shared sequence among `a`, `b`, and `c` (e.g. `a` and `b`
are not truly expressed, just misquantified), so aggregating to `abc`
minimizes the misquantification with little loss of the DE signal
(this is checked right after the simulation chunk below).
2. `g` and `fg` have similar LFC, so we need to use the inferential
variance to prioritize `g` over `fg`.
```{r}
n <- 20
set.seed(1)
de1 <- rnorm(2*n, rep(c(150,100),each=n)) # true LFC = -.58
de2 <- rnorm(2*n, rep(c(400,200),each=n)) # -1
de3 <- rnorm(2*n, rep(c(200,400),each=n)) # +1
a <- rnorm(2*n, 30, 10) # misquant
b <- rnorm(2*n, 30, 10) # misquant
c <- de1 - a - b
d <- rnorm(2*n, 800, 200) # high expression, no misquant
e <- de2
f <- rnorm(2*n, 10, 3) # low expression, no misquant
g <- de3 # de in other direction to 'e'
```
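To make note 1 above concrete: by construction `a + b + c` equals the
true signal `de1` exactly, so aggregating to `abc` recovers the DE
feature without the misquantification. A quick check (not part of the
original):
```{r}
# a, b, and c sum back to de1 because c was defined as de1 - a - b
all.equal(a + b + c, de1)
```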
The tree on the features is as follows:
```{r}
bc <- b + c
abc <- a + bc
de <- d + e
fg <- f + g
defg <- de + fg
abcdefg <- abc + defg
```
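To visualize the aggregation structure, here is a small sketch that
draws the tree with the `ape` package; the Newick string is my
encoding of the sums above and is an assumption, not part of the
original:
```{r}
library(ape)
# Newick encoding of the aggregation tree, with internal node labels
# matching the aggregated features defined above
tree <- read.tree(text="((a,(b,c)bc)abc,((d,e)de,(f,g)fg)defg)abcdefg;")
plot(tree, show.node.label=TRUE)
```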
Heatmaps of the data at every level:
```{r}
mat <- rbind(a,b,c,d,e,f,g,bc,abc,de,fg,defg,abcdefg)
# first, no scaling on rows:
pheatmap::pheatmap(mat, cluster_rows=FALSE, cluster_cols=FALSE)
# second, scaling within row to help see DE:
pheatmap::pheatmap(mat, cluster_rows=FALSE, cluster_cols=FALSE, scale="row")
```
Define functions to calculate LFC and the InfSD of LFC:
```{r}
# log2 fold change: condition 2 (x[(n+1):(2*n)]) over condition 1 (x[1:n])
lfc <- function(x) {
  log2(mean(x[(n+1):(2*n)])) -
    log2(mean(x[1:n]))
}
# inferential SD of the LFC: SD of the per-replicate log2 ratios
lfcInfSD <- function(x) {
  sd(log2(x[(n+1):(2*n)]) -
       log2(x[1:n]))
}
```
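As a quick sanity check (not in the original), the estimated LFCs at
`abc`, `e`, and `g` should land near the true values noted in the
simulation comments (about -0.58, -1, and +1), up to simulation noise:
```{r}
# estimated LFCs; expect roughly -0.58, -1, +1 given the simulation setup
round(sapply(list(abc=abc, e=e, g=g), lfc), 2)
```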
Do we care only about the LFC, or also about the LFC scaled by its
inferential SD?
```{r}
barplot(rev(apply(mat, 1, function(x) lfc(x))),
        horiz=TRUE, las=1, xlab="LFC")
```
```{r}
abs_lfc_over_infsd <- apply(mat, 1, function(x) abs(lfc(x))/lfcInfSD(x))
barplot(rev(abs_lfc_over_infsd),
        horiz=TRUE, las=1, xlab="abs LFC / InfSD")
```
What if we follow a path from a leaf to the root:
```{r}
plot(abs_lfc_over_infsd[c("c","abc","abcdefg")],
     xlab="leaf -> root", ylab="abs lfc / InfSD", type="b")
```
Questions:
1. How do we make decisions? When we divide by InfSD we can get an
arbitrarily large value (if InfSD is close to 0), so I'm worried
about the stability of using the global sum of LFC / InfSD across
all potential cuts of the tree. It will prefer solutions that
incorporate features where InfSD $\rightarrow$ 0. (A small sketch
scoring a few candidate cuts follows after this list.)
2. The above has nothing to do with significance; how do we
incorporate significance and treeclimbR into this? Do we say that,
among solutions that are similar by treeclimbR FDR, we prefer those
which maximize this sum of LFC / InfSD?
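As a naive starting point for Question 1, here is a small sketch (not
in the original) that scores a few hand-picked cuts of the tree by the
global sum of abs LFC / InfSD; the set of cuts is illustrative rather
than an exhaustive enumeration, and the score inherits the instability
noted above when InfSD is near 0:
```{r}
# each cut is a set of nodes that covers every leaf exactly once
cuts <- list(
  leaves   = c("a","b","c","d","e","f","g"),
  proposed = c("abc","d","e","f","g"),
  mid      = c("abc","de","fg"),
  root     = c("abcdefg")
)
# score each cut by the sum of abs LFC / InfSD over its nodes
sapply(cuts, function(nodes) sum(abs_lfc_over_infsd[nodes]))
```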