---
title: "MIDA: Multiple Imputation using Denoising Autoencoders"
author: "Lovedeep Gondara"
date: "February 13, 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Getting started
The [h2o](https://cran.r-project.org/web/packages/h2o/h2o.pdf) package offers an easy-to-use function for implementing autoencoders; more information is available in the [H2O Deep Learning booklet](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf). Before we implement our model, let us first load a dataset and introduce missingness. For this we will use the [mlbench](https://cran.r-project.org/web/packages/mlbench/mlbench.pdf) package, which provides several datasets that can be used for benchmarking machine learning models.
Let us load the dataset and convert it to numeric. We will be using the 'Shuttle' dataset, which has 58,000 observations and 10 variables.
```{r dataset, eval=FALSE}
require(mlbench)
data("Shuttle")
data_use <- Shuttle
# Convert every column (including the factor class label) to numeric
data_use <- as.data.frame(lapply(data_use, as.numeric))
```
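As a quick check (an optional addition, not part of the original workflow), we can confirm the dimensions and that every column is now numeric:
```{r dataset-check, eval=FALSE}
dim(data_use)             # 58000 rows, 10 columns
sapply(data_use, class)   # every column should now be "numeric"
```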
## Inducing missingness
Now that we have our dataset loaded, we can start inducing missingness. Let us introduce a simple random missing pattern (Missing Completely At Random): we sample half of the variables and set an observation in those variables to missing whenever an appended random uniform vector has a value below a certain threshold. With a threshold of 0.2 this should introduce about 20% missingness in the sampled variables.
```{r missing, eval=FALSE}
x <- data_use                   # copy that will receive missing values
f <- data_use                   # untouched copy, kept for evaluation
prop.m <- 0.2                   # target missingness proportion
x$mcar <- runif(nrow(x), 0, 1)  # appended uniform vector driving the MCAR mechanism

# Sample half of the original variables (excluding the appended mcar column) and
# set their values to NA wherever the uniform draw falls below the threshold
for (j in sample(1:(ncol(x) - 1), round((ncol(x) - 1) / 2))) {
  x[, j] <- ifelse(x$mcar < prop.m, NA, x[, j])
}
MCAR_unif <- data.frame(x[, -ncol(x)])  # drop the mcar column

# 70/30 train/test split; keep the fully observed test rows for evaluation
y <- MCAR_unif
smp_size <- floor(0.70 * nrow(y))
train_ind <- sample(seq_len(nrow(y)), size = smp_size)
train <- y[train_ind, ]
test <- y[-train_ind, ]
test_full <- f[-train_ind, ]
```
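As another quick check (again an optional addition, not part of the original code), we can verify the induced missingness rate and the split sizes:
```{r missing-check, eval=FALSE}
colMeans(is.na(MCAR_unif))   # ~20% missing in the sampled columns, 0% in the rest
mean(is.na(MCAR_unif))       # overall fraction of missing cells
nrow(train)                  # ~70% of the 58000 rows
nrow(test)                   # ~30% of the 58000 rows
```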
Now we have 70% training data and 30% test data with missingness, plus a copy of the test data without missingness so we can calculate performance. We can now proceed to modelling. We start by initializing the 'h2o' package and reading the training and test datasets into h2o's supported frame format. We then run our imputation model multiple times, as each new run initializes the weights with different values.
```{r model, eval=FALSE}
require(h2o)
h2o.init()
# Run the imputation model several times; each run starts from different random weights
for (l in 1:5) {
  train.hex <- as.h2o(train)
  test.hex <- as.h2o(test)
  n <- ncol(test)
  predictors <- 1:n

  # Overcomplete denoising autoencoder: every hidden layer is at least as wide as the input
  ae_model <- h2o.deeplearning(x = predictors,
                               training_frame = train.hex,
                               hidden = c(n, n + 7, n + 14, n + 21, n + 14, n + 7),
                               epochs = 500,
                               activation = "Tanh",
                               autoencoder = TRUE,
                               input_dropout_ratio = 0.5,
                               ignore_const_cols = FALSE)

  # Reconstruct the test set and store the result of this imputation run
  pred <- h2o.predict(ae_model, test.hex)
  dae_imp <- as.data.frame(pred)
  assign(paste0("dae_rand", l), dae_imp)   # keep this run's imputation in the workspace
  file3 <- paste0("U:/Imp results2/Imp results6/dae_mcar_rmd", l, ".csv")
  write.table(dae_imp, file3, sep = ",", row.names = FALSE, col.names = FALSE)
}
```
Let us look in detail at the function call for the imputation model and break it down for easy understanding:
- x= Specifies the vector of predictors used as input.
- training_frame= Defines the training dataset.
- hidden= Details the hidden layers; here we use an overcomplete representation, that is, there are more hidden nodes than input nodes (see the sketch after this list for the concrete sizes used in this example).
- epochs= Number of training epochs.
- activation= We use Tanh as our activation function, as we found it works better than ReLU when datasets are small and many observations are close to zero.
- autoencoder= Sets up an autoencoder model.
- input_dropout_ratio= Mimics a denoising autoencoder by setting the defined proportion of features to missing in each training row; 0.5 means half of the features are set to missing for each row.
- ignore_const_cols= Whether to ignore constant training columns; it shouldn't make much of a difference either way.
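To make the architecture concrete, here is a small illustrative sketch (an addition, not part of the original code) of the layer widths the hidden= argument produces for the 10-column Shuttle data:
```{r hidden-layers, eval=FALSE}
n <- 10                                     # number of input columns in 'test'
c(n, n + 7, n + 14, n + 21, n + 14, n + 7)  # hidden layer widths
# [1] 10 17 24 31 24 17
# Every hidden layer is at least as wide as the 10 inputs,
# which is what makes this an overcomplete representation.
```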
After running the model, we predict the reconstruction of our test set and store the results from the multiple imputation runs as CSV files for analysis. Now let us see how well our model did on the imputations. For evaluation, we will use the rmse function from the [hydroGOF](https://cran.r-project.org/web/packages/hydroGOF/hydroGOF.pdf) package.
```{r evaluation, eval=FALSE}
require(hydroGOF)
daeres <- NULL
for (i in 1:5) {
  # Read the imputed reconstruction from run i
  dae <- read.table(paste0("U:/Imp results2/Imp results6/dae_mcar_rmd", i, ".csv"), sep = ",")
  full <- test_full
  miss <- test

  # Fill only the cells that were missing with the autoencoder's reconstruction
  naloc <- as.matrix(is.na(miss))
  miss[naloc] <- dae[naloc]

  # Sum of per-column RMSE between the imputed and the fully observed test set
  daeres[i] <- sum(rmse(scale(miss, center = FALSE), scale(full, center = FALSE)))
}
```
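The loop above scores each of the five imputations separately. One simple way to pool them, sketched below under the assumption that the five CSV files written earlier are available, is to average the five reconstructions and fill the missing cells with the pooled values; for a full multiple-imputation analysis you would instead keep the five completed datasets and pool the downstream estimates.
```{r pooling, eval=FALSE}
# Average the five reconstructions into one pooled imputation
runs <- lapply(1:5, function(i)
  read.table(paste0("U:/Imp results2/Imp results6/dae_mcar_rmd", i, ".csv"), sep = ","))
pooled <- Reduce(`+`, runs) / length(runs)

# Fill the missing cells with the pooled values and score as before
miss <- test
naloc <- as.matrix(is.na(miss))
miss[naloc] <- pooled[naloc]
sum(rmse(scale(miss, center = FALSE), scale(test_full, center = FALSE)))
```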
@ambareeshsrja16 commented Jun 23, 2019

https://arxiv.org/abs/1705.02737 The paper mentions " ... as DAEs require complete data at initialization, we initially use the respective column average in case of continuous variables and most frequent label in case of categorical variables as placeholders for missing data at initialization"
Where is this step taking place in the code?
Also, instead of replacing with the column average, will there be a performance drop if zeros are used instead?

@lgondara (Author)

This is a code "template" for MIDA, not the exact code used for the paper. In this version, h2o.deeplearning defaults to mean imputation if there are any NAs in the dataset.

There might be some performance impact, since the placeholder then carries no information at all; you can try other approaches (median imputation, etc.).
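If you want that placeholder step to be explicit rather than relying on h2o's default, a rough sketch (not the exact code from the paper) for numeric columns would be:

```r
# Replace NAs with column means before converting to h2o frames;
# swap the mean for 0 to try the zero-placeholder variant
fill_na <- function(df, value_fun = function(col) mean(col, na.rm = TRUE)) {
  as.data.frame(lapply(df, function(col) {
    col[is.na(col)] <- value_fun(col)
    col
  }))
}
train_filled <- fill_na(train)
test_filled  <- fill_na(test)
```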


@ambareeshsrja16

Thank you!
