---
title: Variable importance in ANNs
output:
  html_document:
    keep_md: yes
---
```{r setup, include=FALSE}
source(file.path(PROJHOME, "src/r/setup.R"))
setup()
knitr::opts_chunk$set(echo = TRUE)
```
Examples are from the following:
> Ibrahim, OM. 2013. [A comparison of methods for assessing the relative importance of input variables in artificial neural networks](http://www.palisade.com/downloads/pdf/academic/DTSpaper110915.pdf). Journal of Applied Sciences Research, 9(11): 5692-5700.
Table 4: Final connection weights
```{r network-weights, echo=FALSE}
w <- NULL

w$ih <-
  data.frame(x1 = c(-0.8604, 1.0651, -0.1392, 1.0666),
             x2 = c(-6.6167, 1.5142, -5.3051, 1.3358),
             x3 = c(-1.7338, 1.4107, -1.4207, 0.7845)) %>%
  t() %>%
  `colnames<-`(c("h1", "h2", "h3", "h4"))

w$oh <-
  data.frame(gy = c(-5.2075, 2.0158, -4.4617, 1.5858)) %>%
  t() %>%
  `colnames<-`(c("h1", "h2", "h3", "h4"))

w
```
<br>
<br>
### Connection weights algorithm (CW)
$RI_x = \sum_{y=1}^m{w_{xy}w_{yz}}$
- They state the above equation in the paper, but I think they mean: $RI_x = \frac{\sum_{y=1}^m{w_{xy}w_{yz}}}{\sum_{x=1}^n{\sum_{y=1}^m{w_{xy}w_{yz}}}}$
----
Here we show the calculations explicitly for input `x1`
```{r}
# We first calculate the connection weight product of eg. input x1
# w_x1,h1 * w_h1,gy
# w_x1,h2 * w_h2,gy
# w_x1,h3 * w_h3,gy
# w_x1,h4 * w_h4,gy
(cw_prd_x1 <- w$ih[1,] * w$oh)
# & then take the sum
sum(cw_prd_x1)
```
----
We do the calculation for all inputs and get the following connection weight product sums (`cw_prd_sums`):
```{r echo=FALSE}
(cw_prd_sums <-
   sapply(1:3, function(i){sum(w$ih[i,] * w$oh)}) %>%
   setNames(., rownames(w$ih))
)

# You can do the same calculation by matrix multiplication:
# (cw_prd_sums <-
#    (w$ih %*% t(w$oh)) %>%   # Matrix multiplication shapes: (3,4) %*% (4,1)
#    {setNames(as.numeric(.), rownames(.))}
# )
```
To calculate relative importance, we normalize and sort to find the most important feature
```{r}
# Helper to normalize scores so they sum to 100
normalize_to_100 <- function(x, sort=TRUE){
  (x / sum(x) * 100) %>%
    sort(decreasing = sort)
}

normalize_to_100(cw_prd_sums)
```
<br>
<br>
### The first proposed algorithm (MCW) - Modified Connection Weights
$RI_x = \frac{|\sum_{y=1}^m{w_{xy}w_{yz} \times r_{ij.k}}|}{\sum_{x=1}^n{\sum_{y=1}^m{w_{xy}w_{yz}}}}$
*A correction term (partial correlation) is multiplied by this sum and the absolute value is taken; this is called the corrected sum. Then, to calculate the relative importance of each input, the corrected sum of each input is divided by the total corrected sum.*
<br>
#### Partial correlation
$r_{ij.k}=\frac{r_{ij} - r_{ki} \times r_{kj}}{\sqrt{(1-r_{ki}^2) \times (1-r_{kj}^2)}}$
*[Partial correlation](https://en.wikipedia.org/wiki/Partial_correlation) measures the degree of association between two random variables, with the effect of a set of controlling random variables removed.* -- wikipedia
----
The following is from [SlideShare: What is a partial correlation?](https://www.slideshare.net/plummer48/what-is-a-partial-correlation) (Ken Plummer, 2014)
```{r echo=FALSE}
(dat <-
   data.frame(
     individual = letters[1:8],
     height = c(73, 70, 69, 68, 70, 68, 67, 62),
     weight = c(240, 210, 180, 160, 150, 140, 135, 120),
     gender = c(rep(1, 4), rep(2, 4)),
     stringsAsFactors = FALSE
   )
)
```
The correlation between height and weight is 0.825
```{r}
cor(dat$height, dat$weight, method="pearson") #default is pearson
```
However, when controlling for gender (male=1; female=2), the correlation between height and weight is 0.770
```{r results='hold'}
(r_ij <- cor(dat$height, dat$weight))
(r_ki <- cor(dat$gender, dat$height))
(r_kj <- cor(dat$gender, dat$weight))
numer <- r_ij-(r_ki*r_kj)
denom <- sqrt((1-r_ki^2) * (1-r_kj^2))
sprintf("Partial correlation: %0.3f", numer/denom)
```
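As a quick sanity check, the same partial correlation can be recovered in base R by correlating the residuals of `height` and `weight` after each has been regressed on `gender` (a standard equivalence, not a calculation from the paper):
```{r}
# Partial correlation via residuals: regress gender out of both variables,
# then correlate what is left over
cor(resid(lm(height ~ gender, data = dat)),
    resid(lm(weight ~ gender, data = dat)))
```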
> A partial correlation makes it possible to eliminate the effect of a third variable
----
```{r}
# Partial correlations from Table 6
(r_ijk <- c(x1=0.5266, x2=0.9584, x3=0.7275))

# Calculate the absolute corrected sum
(abs_cor_sum <- abs(cw_prd_sums * r_ijk))

# Normalize & sort for relative importance
normalize_to_100(abs_cor_sum)
```
- One possible issue is that we don't know the identities of the controlling random variables
<br>
<br>
### The second proposed algorithm (MS) - Most Squares
$RI_x = \frac{\sum_{y=1}^m{(w_{xy}^i - w_{xy}^f)^2}}{\sum_{x=1}^n{\sum_{y=1}^m{(w_{xy}^i - w_{xy}^f)^2}}}$
The score is calculated by comparing the input-to-hidden connection weights of the initial (untrained) network with those of the trained network. Note: the connection weights between the hidden and output layer are not used.
Table 7: Input-hidden initial connection weights.
```{r echo=FALSE}
w$ih_i <-
  data.frame(x1 = c(0.0236, 0.1943, 0.1478, 0.2054),
             x2 = c(0.0562, -0.3437, -0.3215, -0.0894),
             x3 = c(-0.1488, 0.1844, -0.0608, 0.3267)) %>%
  t() %>%
  `colnames<-`(c("h1", "h2", "h3", "h4"))

w$ih_i
```
```{r}
# Sum of squared differences between initial and trained input-hidden weights
(sum_sq_diff <-
   (w$ih_i - w$ih)^2 %>%
   rowSums()
)

# Normalize & sort
normalize_to_100(sum_sq_diff)
```
- Possible issues include (1) the scores depend on how the weights were initialized (sketched below) & (2) the hidden-to-output layer is ignored (and what about additional hidden layers?)
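To illustrate the first issue, here is a minimal sketch that swaps in a different, hypothetical random initialization (not from the paper) while keeping the trained weights fixed; the resulting MS scores will generally differ from those above:
```{r}
# Hypothetical alternative initialization (uniform, roughly the same range as
# Table 7); the trained weights w$ih are unchanged
set.seed(1)
w_ih_i_alt <- matrix(runif(12, -0.35, 0.35), nrow = 3,
                     dimnames = dimnames(w$ih))

# MS relative importance under the alternative initialization
normalize_to_100(rowSums((w_ih_i_alt - w$ih)^2))
```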
<br>
<br>
### Multiple Linear Regression (MLR)
(Fancy way of calling linear regression with more than one explanatory variable; different from multivariate linear regression which refers to the case of multiple response variables.)
- Issue: I want to infer variable importance from ANNs, not remodel the data with a linear model.
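For reference, one common way to get MLR-based importance (not necessarily the paper's exact procedure) is to compare standardized coefficients; a toy sketch on `dat` (not the paper's data):
```{r}
# Standardized-coefficient importance from a multiple linear regression of
# height on weight and gender (illustration only)
fit <- lm(scale(height) ~ scale(weight) + scale(gender), data = dat)
normalize_to_100(abs(coef(fit)[-1]))
```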
### Dominance Analysis (DA)
*DA determines the dominance of one input over another by comparing their additional contributions across all subset models... approaches the problem of relative importance by examining the change in R^2 resulting from adding an input to all possible subset regression models... averaging all the possible models*
- Issue: I want to infer variable importance from ANNs, not remodel the data with many linear models.
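As a rough illustration of the idea (again on the toy `dat`, and only a sketch of general dominance with two predictors, not necessarily the paper's exact procedure), the incremental R² contribution of each predictor can be averaged over all subset models:
```{r}
# R^2 of a subset model
r2 <- function(f) summary(lm(f, data = dat))$r.squared

# Average each predictor's incremental R^2 over all subsets of the other
# predictor ({} and {other}); R^2 of the empty model is 0
da_weight <- mean(c(r2(height ~ weight),
                    r2(height ~ weight + gender) - r2(height ~ gender)))
da_gender <- mean(c(r2(height ~ gender),
                    r2(height ~ weight + gender) - r2(height ~ weight)))

normalize_to_100(c(weight = da_weight, gender = da_gender))
```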
<br>
<br>
### Garson's algorithm
$RI_x = \sum_{y=1}^m{\frac{|w_{xy}w_{yz}|}{\sum_{x=1}^n{|w_{xy}w_{yz}|}}}$
```{r}
# If i = input i; j = hidden j; and o = the single output,
# compute w_ij * w_jo for each value of i and j
#
# Matrix of weights (4,3) multiplied by vector of hidden-output weights;
# save the transposed result (3,4)
(numer <-
   (t(w$ih) * as.numeric(w$oh)) %>%
   t()
)

# Sum of products over all inputs
# Think: contribution from each hidden node
# Eg. contribution from h1 = w_x1,h1*w_h1,o +
#                            w_x2,h1*w_h1,o +
#                            w_x3,h1*w_h1,o
# This gives a contribution score for h1, h2, h3, and h4
#
# input-hidden weights (3,4) %*% (4,4) diagonal matrix of hidden-output weights
(denom <-
   w$ih %*% diag(as.numeric(w$oh)) %>%
   colSums()
)

# Divide to produce a (3,4) matrix,
# then take row sums to score the importance of each input
(ga <-
   (numer %*% diag(denom^-1)) %>%
   rowSums()
)

# Normalize & sort
normalize_to_100(ga)
```
<br>
<br>
### Partial derivative
(Not enough information given for replication)
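For completeness, here is a minimal sketch of the partial-derivatives (PaD) idea under stated assumptions: logistic hidden units, a linear output, no bias terms, and a hypothetical input vector `x` (the paper does not provide the data or activation functions, so this is not a replication):
```{r}
sigmoid <- function(z) 1 / (1 + exp(-z))

# d(output)/d(x_i) = sum_j w_i,j * h_j * (1 - h_j) * w_j,o
pad_derivative <- function(x, w_ih, w_oh) {
  h <- sigmoid(as.numeric(x %*% w_ih))  # hidden-unit activations
  setNames(as.numeric(w_ih %*% (h * (1 - h) * as.numeric(w_oh))),
           rownames(w_ih))
}

# Hypothetical input, for illustration only
pad_derivative(c(0.5, 0.5, 0.5), w$ih, w$oh)
```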