---
title: "Toy tree example for collapsing"
author: "Michael Love"
date: 2023-10-16
---
Example data with 20 inferential replicates. Here we have just 1
sample per condition, and we calculate the LFC at each level of the
tree.
From the simulation setup below (see the first chunk), the true DE
signal comes from `c`, `e`, and `g`. In my opinion the optimal
aggregation would be `abc`, `e`, and `g`. This is just intuition; we
need to encode it as a metric and a rule for optimization.
Note the following:
1. There is shared sequence among `a`, `b`, and `c` (e.g. `a` and `b`
are not truly expressed, just misquantified), so aggregating to `abc`
minimizes the misquantification with little loss of the DE signal
(this is checked right after the simulation chunk below).
2. `g` and `fg` have similar LFC, so we need to use the inferential
variance to prioritize `g` over `fg`.
```{r}
n <- 20
set.seed(1)
de1 <- rnorm(2*n, rep(c(150,100),each=n)) # true LFC = -.58
de2 <- rnorm(2*n, rep(c(400,200),each=n)) # -1
de3 <- rnorm(2*n, rep(c(200,400),each=n)) # +1
a <- rnorm(2*n, 30, 10) # misquant
b <- rnorm(2*n, 30, 10) # misquant
c <- de1 - a - b
d <- rnorm(2*n, 800, 200) # high expression, no misquant
e <- de2
f <- rnorm(2*n, 10, 3) # low expression, no misquant
g <- de3 # de in other direction to 'e'
```
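To make note 1 above concrete: by construction `a + b + c` equals the
true signal `de1` exactly, so aggregating to `abc` recovers the DE
feature without the misquantification. A quick check (not part of the
original):
```{r}
# a, b, and c sum back to de1 because c was defined as de1 - a - b
all.equal(a + b + c, de1)
```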
The tree on the features is as follows:
```{r}
bc <- b + c
abc <- a + bc
de <- d + e
fg <- f + g
defg <- de + fg
abcdefg <- abc + defg
```
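To visualize the aggregation structure, here is a small sketch that
draws the tree with the `ape` package; the Newick string is my
encoding of the sums above and is an assumption, not part of the
original:
```{r}
library(ape)
# Newick encoding of the aggregation tree, with internal node labels
# matching the aggregated features defined above
tree <- read.tree(text="((a,(b,c)bc)abc,((d,e)de,(f,g)fg)defg)abcdefg;")
plot(tree, show.node.label=TRUE)
```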
Heatmaps of the data at every level:
```{r}
mat <- rbind(a,b,c,d,e,f,g,bc,abc,de,fg,defg,abcdefg)
# first, no scaling on rows:
pheatmap::pheatmap(mat, cluster_rows=FALSE, cluster_cols=FALSE)
# second, scaling within row to help see DE:
pheatmap::pheatmap(mat, cluster_rows=FALSE, cluster_cols=FALSE, scale="row")
```
Define functions to calculate LFC and the InfSD of LFC:
```{r}
# log2 fold change: condition 2 (x[(n+1):(2*n)]) over condition 1 (x[1:n])
lfc <- function(x) {
  log2(mean(x[(n+1):(2*n)])) -
    log2(mean(x[1:n]))
}
# inferential SD of the LFC: SD of the per-replicate log2 ratios
lfcInfSD <- function(x) {
  sd(log2(x[(n+1):(2*n)]) -
       log2(x[1:n]))
}
```
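As a quick sanity check (not in the original), the estimated LFCs at
`abc`, `e`, and `g` should land near the true values noted in the
simulation comments (about -0.58, -1, and +1), up to simulation noise:
```{r}
# estimated LFCs; expect roughly -0.58, -1, +1 given the simulation setup
round(sapply(list(abc=abc, e=e, g=g), lfc), 2)
```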
Do we care only about the LFC, or also about the LFC scaled by its
inferential SD?
```{r}
barplot(rev(apply(mat, 1, function(x) lfc(x))),
        horiz=TRUE, las=1, xlab="LFC")
```
```{r}
abs_lfc_over_infsd <- apply(mat, 1, function(x) abs(lfc(x))/lfcInfSD(x))
barplot(rev(abs_lfc_over_infsd),
        horiz=TRUE, las=1, xlab="abs LFC / InfSD")
```
What if we follow a path from a leaf to the root:
```{r}
plot(abs_lfc_over_infsd[c("c","abc","abcdefg")],
     xlab="leaf -> root", ylab="abs lfc / InfSD", type="b")
```
Questions:
1. How do we make decisions? When we divide by InfSD we can get an
arbitrarily large value (if InfSD is close to 0), so I'm worried
about the stability of using the global sum of LFC / InfSD across
all potential cuts of the tree. It will prefer solutions that
incorporate features where InfSD $\rightarrow$ 0. (A small sketch
scoring a few candidate cuts follows after this list.)
2. The above has nothing to do with significance; how do we
incorporate significance and treeclimbR into this? Do we say that,
among solutions that are similar by treeclimbR FDR, we prefer those
which maximize this sum of LFC / InfSD?
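As a naive starting point for Question 1, here is a small sketch (not
in the original) that scores a few hand-picked cuts of the tree by the
global sum of abs LFC / InfSD; the set of cuts is illustrative rather
than an exhaustive enumeration, and the score inherits the instability
noted above when InfSD is near 0:
```{r}
# each cut is a set of nodes that covers every leaf exactly once
cuts <- list(
  leaves   = c("a","b","c","d","e","f","g"),
  proposed = c("abc","d","e","f","g"),
  mid      = c("abc","de","fg"),
  root     = c("abcdefg")
)
# score each cut by the sum of abs LFC / InfSD over its nodes
sapply(cuts, function(nodes) sum(abs_lfc_over_infsd[nodes]))
```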