Last active
August 29, 2015 13:57
-
-
Save mittenchops/9866880 to your computer and use it in GitHub Desktop.
Testing Amelia for multiple imputation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# got some help here: http://stackoverflow.com/questions/22739070/function-of-a-list-of-data-frames-to-return-also-a-data-frame-in-r/ | |
library(Amelia) | |
set.seed(1234) | |
myN <- 50 | |
var1 <- runif(myN) | |
var2 <- 2*var1 + rnorm(myN, 0, 1) | |
var3 <- var1^2 + rnorm(myN, 0, 1) | |
df <- data.frame(cbind(var1,var2, var3)) | |
df$Y <- rowSums(df, na.rm=T) + rnorm(myN, 0, 0.1) | |
df[seq(10,25),1] <- NA | |
df[seq(20,30), 2] <- NA | |
df[seq(26,40),3] <- NA | |
print(df) | |
#plot(df) | |
fit <- lm(Y ~ ., data=df) | |
summary(fit) | |
IMP <- amelia(df, m=100) | |
IMP$imputations$imp1 | |
IMP$imputations$imp1 | |
#lapply(IMP$imputations, mean) | |
#bestimp <- rowMeans(as.data.frame(IMP$imputations)) | |
imputer <- function(mylist){ | |
df <- Reduce(rbind, lapply(mylist, function(df) { | |
df$id <- seq_len(nrow(df)) | |
df | |
})) | |
df <- aggregate(. ~ id, df, mean)[, -1] | |
df$id <- NULL | |
return(df) | |
} | |
x1 <- imputer(IMP$imputations) | |
#trueN[IMP$missMatrix] | |
#IMP$imputations$imp1[IMP$missMatrix] | |
imputer2 <- function(mylist) { | |
return(Reduce("+", mylist)/length(mylist)) | |
} | |
x <- imputer(IMP$imputations) | |
y <- imputer2(IMP$imputations) | |
x[!(x == y)] | |
y[!(x == y)] | |
head(x[!(x == y)]) == head(y[!(x == y)]) | |
head(x[!(x == y)]);head(y[!(x == y)]) | |
################################################## | |
Observations: | |
1. Interesting that there are differences between imputer1 and imputer2. I think it's a discrete math roundoff thing, not sure which is better, if either is. To the level of precision R shows me, they look the same. | |
2. Obviously, not the right way to do the mean on the monte carlo variations---you should apply your function then take the mean. f(g(x)) != g(f(x)). I started this way because I wanted to do some testing of how much variation we're getting by Rubin's Rules; what I learned is that doing apply statements across lists of lists of data.frames requires some syntax backflips in R, and not something easy for a quick check. I'll look into it later. | |
3. imputer2 clearly wins for speed. When m=1000, the results are not even close. (Because it doesn't copy.) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment