---
title: "MIDA: Multiple Imputation using Denoising Autoencoders"
author: "Lovedeep Gondara"
date: "February 13, 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Getting started
The [h2o](https://cran.r-project.org/web/packages/h2o/h2o.pdf) package offers an easy-to-use function for implementing autoencoders; more information is available in the [H2O Deep Learning booklet](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf). Before we implement our model, let us first load a dataset and introduce missingness. For this we will use the [mlbench](https://cran.r-project.org/web/packages/mlbench/mlbench.pdf) package, which provides several datasets that can be used for benchmarking machine learning models.
Let us load the dataset and convert it to numeric. We will be using the 'Shuttle' dataset, which has 58,000 observations and 10 variables.
```{r dataset, eval=FALSE}
require(mlbench)
data("Shuttle")
data_use <- Shuttle
# Convert every column (including the factor class label) to numeric
data_use <- as.data.frame(lapply(data_use, as.numeric))
```
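As a quick check (an optional addition, not part of the original workflow), we can confirm the dimensions and that every column is now numeric:
```{r dataset-check, eval=FALSE}
dim(data_use)             # 58000 rows, 10 columns
sapply(data_use, class)   # every column should now be "numeric"
```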
## Inducing missingness
Now that we have our dataset loaded, we can start inducing missingness. Let us introduce a simple random missing pattern (Missing Completely At Random): we sample half of the variables and set an observation in those variables to missing whenever an appended random uniform vector has a value below a certain threshold. With a threshold of 0.2 this should introduce about 20% missingness in the sampled variables.
```{r missing, eval=FALSE}
x <- data_use                   # copy that will receive missing values
f <- data_use                   # untouched copy, kept for evaluation
prop.m <- 0.2                   # target missingness proportion
x$mcar <- runif(nrow(x), 0, 1)  # appended uniform vector driving the MCAR mechanism

# Sample half of the original variables (excluding the appended mcar column) and
# set their values to NA wherever the uniform draw falls below the threshold
for (j in sample(1:(ncol(x) - 1), round((ncol(x) - 1) / 2))) {
  x[, j] <- ifelse(x$mcar < prop.m, NA, x[, j])
}
MCAR_unif <- data.frame(x[, -ncol(x)])  # drop the mcar column

# 70/30 train/test split; keep the fully observed test rows for evaluation
y <- MCAR_unif
smp_size <- floor(0.70 * nrow(y))
train_ind <- sample(seq_len(nrow(y)), size = smp_size)
train <- y[train_ind, ]
test <- y[-train_ind, ]
test_full <- f[-train_ind, ]
```
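As another quick check (again an optional addition, not part of the original code), we can verify the induced missingness rate and the split sizes:
```{r missing-check, eval=FALSE}
colMeans(is.na(MCAR_unif))   # ~20% missing in the sampled columns, 0% in the rest
mean(is.na(MCAR_unif))       # overall fraction of missing cells
nrow(train)                  # ~70% of the 58000 rows
nrow(test)                   # ~30% of the 58000 rows
```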
Now we have 70% training data and 30% test data with missingness, plus a copy of the test data without missingness so we can calculate performance. We can now proceed to modelling. We start by initializing the 'h2o' package and reading the training and test datasets into h2o's supported frame format. We then run our imputation model multiple times, as each new run initializes the weights with different values.
```{r model, eval=FALSE}
require(h2o)
h2o.init()
# Run the imputation model several times; each run starts from different random weights
for (l in 1:5) {
  train.hex <- as.h2o(train)
  test.hex <- as.h2o(test)
  n <- ncol(test)
  predictors <- 1:n

  # Overcomplete denoising autoencoder: every hidden layer is at least as wide as the input
  ae_model <- h2o.deeplearning(x = predictors,
                               training_frame = train.hex,
                               hidden = c(n, n + 7, n + 14, n + 21, n + 14, n + 7),
                               epochs = 500,
                               activation = "Tanh",
                               autoencoder = TRUE,
                               input_dropout_ratio = 0.5,
                               ignore_const_cols = FALSE)

  # Reconstruct the test set and store the result of this imputation run
  pred <- h2o.predict(ae_model, test.hex)
  dae_imp <- as.data.frame(pred)
  assign(paste0("dae_rand", l), dae_imp)   # keep this run's imputation in the workspace
  file3 <- paste0("U:/Imp results2/Imp results6/dae_mcar_rmd", l, ".csv")
  write.table(dae_imp, file3, sep = ",", row.names = FALSE, col.names = FALSE)
}
```
Let us look in detail at the function call for the imputation model and break it down for easy understanding:
- x= Specifies the vector of predictors used as input.
- training_frame= Defines the training dataset.
- hidden= Details the hidden layers; here we use an overcomplete representation, that is, there are more hidden nodes than input nodes (see the sketch after this list for the concrete sizes used in this example).
- epochs= Number of training epochs.
- activation= We use Tanh as our activation function, as we found it works better than ReLU when datasets are small and many observations are close to zero.
- autoencoder= Sets up an autoencoder model.
- input_dropout_ratio= Mimics a denoising autoencoder by setting the defined proportion of features to missing in each training row; 0.5 means half of the features are set to missing for each row.
- ignore_const_cols= Whether to ignore constant training columns; it shouldn't make much of a difference either way.
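To make the architecture concrete, here is a small illustrative sketch (an addition, not part of the original code) of the layer widths the hidden= argument produces for the 10-column Shuttle data:
```{r hidden-layers, eval=FALSE}
n <- 10                                     # number of input columns in 'test'
c(n, n + 7, n + 14, n + 21, n + 14, n + 7)  # hidden layer widths
# [1] 10 17 24 31 24 17
# Every hidden layer is at least as wide as the 10 inputs,
# which is what makes this an overcomplete representation.
```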
After running the model, we predict the reconstruction of our test set and store the results from the multiple imputation runs as CSV files for analysis. Now let us see how well our model did on the imputations. For evaluation, we will use the rmse function from the [hydroGOF](https://cran.r-project.org/web/packages/hydroGOF/hydroGOF.pdf) package.
```{r evaluation, eval=FALSE}
require(hydroGOF)
daeres <- NULL
for (i in 1:5) {
  # Read the imputed reconstruction from run i
  dae <- read.table(paste0("U:/Imp results2/Imp results6/dae_mcar_rmd", i, ".csv"), sep = ",")
  full <- test_full
  miss <- test

  # Fill only the cells that were missing with the autoencoder's reconstruction
  naloc <- as.matrix(is.na(miss))
  miss[naloc] <- dae[naloc]

  # Sum of per-column RMSE between the imputed and the fully observed test set
  daeres[i] <- sum(rmse(scale(miss, center = FALSE), scale(full, center = FALSE)))
}
```
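The loop above scores each of the five imputations separately. One simple way to pool them, sketched below under the assumption that the five CSV files written earlier are available, is to average the five reconstructions and fill the missing cells with the pooled values; for a full multiple-imputation analysis you would instead keep the five completed datasets and pool the downstream estimates.
```{r pooling, eval=FALSE}
# Average the five reconstructions into one pooled imputation
runs <- lapply(1:5, function(i)
  read.table(paste0("U:/Imp results2/Imp results6/dae_mcar_rmd", i, ".csv"), sep = ","))
pooled <- Reduce(`+`, runs) / length(runs)

# Fill the missing cells with the pooled values and score as before
miss <- test
naloc <- as.matrix(is.na(miss))
miss[naloc] <- pooled[naloc]
sum(rmse(scale(miss, center = FALSE), scale(test_full, center = FALSE)))
```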
@ambareeshsrja16 commented Jun 23, 2019

https://arxiv.org/abs/1705.02737 The paper mentions " ... as DAEs require complete data at initialization, we initially use the respective column average in case of continuous variables and most frequent label in case of categorical variables as placeholders for missing data at initialization"
Where is this step taking place in the code?
Also, instead of replacing with the column average, will there be a performance drop if zeros are used instead?

@lgondara (Author)

This is a code "template" for MIDA, not the exact code used for the paper. In this version, h2o.deeplearning defaults to mean imputation if there are any NAs in the dataset.

There might be some performance impact, since the placeholder then carries no information at all; you can try other approaches (median imputation, etc.).
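If you want that placeholder step to be explicit rather than relying on h2o's default, a rough sketch (not the exact code from the paper) for numeric columns would be:

```r
# Replace NAs with column means before converting to h2o frames;
# swap the mean for 0 to try the zero-placeholder variant
fill_na <- function(df, value_fun = function(col) mean(col, na.rm = TRUE)) {
  as.data.frame(lapply(df, function(col) {
    col[is.na(col)] <- value_fun(col)
    col
  }))
}
train_filled <- fill_na(train)
test_filled  <- fill_na(test)
```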


@ambareeshsrja16

Thank you!
