
###### Chart 1. 2016 Hawaii primary election results (generalized additive model).

"Silver bullet" and "predictive modeling" paired together in the same sentence? Do tell. The title sounded like the perfect link bait. Nonetheless, I delved into the article – and I am glad I did.

Mr. Kim Larsen's *GAM: The Predictive Modeling Silver Bullet* was well worth the read, as it introduced both the Generalized Additive Model and a feature selection technique based on information value (IV).
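
For intuition, here is a minimal, self-contained sketch of how weight of evidence (WOE) and IV are computed for a single binned predictor. The counts are invented purely for illustration; the actual workflow below uses `Information::create_infotables()` rather than hand-rolled code.

```r
# Toy WOE/IV calculation for one binned predictor (hypothetical counts).
toy <- data.frame(
  bin    = c("low", "mid", "high"),
  events = c( 20,   60,  120),   # purchasers per bin (made up)
  nons   = c(180,  140,   80)    # non-purchasers per bin (made up)
)
toy$pct_events <- toy$events / sum(toy$events)          # P(bin | y = 1)
toy$pct_nons   <- toy$nons   / sum(toy$nons)            # P(bin | y = 0)
toy$woe        <- log(toy$pct_events / toy$pct_nons)    # weight of evidence per bin
toy$iv_contrib <- (toy$pct_events - toy$pct_nons) * toy$woe
sum(toy$iv_contrib)  # total IV for the variable; small values signal a weak predictor
```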

I quickly came to appreciate Mr. Larsen's choice of data for demonstrating his feature selection method. GAM choked when trying to fit the full data set, yet on the IV-reduced feature set it still performed nearly as well as flavor-of-the-month XGBoost (AUC of roughly 0.81 to 0.82 versus 0.83 on the validation set below). The variables selected by IV also lined up well with XGBoost's variable importance ranking.
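
For the GAM side, a minimal sketch of the kind of model fit below: each predictor enters through a penalized smooth term `s()`, with the binary response modeled on the logit scale. The data here are simulated; the real script uses Larsen's `CreateGAMFormula()` helper to build the formula over the selected predictors.

```r
# Simulated data and a two-smooth logistic GAM, for illustration only.
library(mgcv)
set.seed(1)
toy_df <- data.frame(x1 = runif(500), x2 = runif(500))
toy_df$y <- rbinom(500, 1, plogis(2 * sin(3 * toy_df$x1) - toy_df$x2))
toy_gam <- gam(y ~ s(x1) + s(x2), family = binomial(link = "logit"), data = toy_df)
summary(toy_gam)   # approximate significance of the smooth terms
```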

# hawaiikernel.com
# generalized additive modeling
# http://multithreaded.stitchfix.com/blog/2015/07/30/gam/
# INTRODUCTION -----------------------------------------------------------------
# assessing inclusion of GAM and IV feature selection method into workflow
# DEPENDENCIES -----------------------------------------------------------------
options(scipen=10)
pacman::p_load(plyr, dplyr, tidyr, stringr, reshape2)
pacman::p_load(Information, ClustOfVar)
pacman::p_load(caret, Matrix, xgboost, mgcv, pROC)
setwd("D:/scratch")
# DATA -------------------------------------------------------------------------
# note: sourced from Information pkg
data(train, package="Information")
data(valid, package="Information")
train <- subset(train, TREATMENT==1)
valid <- subset(valid, TREATMENT==1)
train.id <- train$UNIQUE_ID
valid.id <- valid$UNIQUE_ID
train.Y <- train$PURCHASE
valid.Y <- valid$PURCHASE
train_pp <- train %>% subset(., select = -c(TREATMENT, UNIQUE_ID, PURCHASE))
valid_pp <- valid %>% subset(., select = -c(TREATMENT, UNIQUE_ID, PURCHASE))
# note: split training set for validation
set.seed(96707)
train.ind <- createDataPartition(train.Y, p=.70, list=F, times=1)
# FIT --------------------------------------------------------------------------
# note: xgboost
# note: score the validation set with the same predictor columns used for training
xgtest <- xgb.DMatrix(data=data.matrix(valid_pp))
xgtrain <- xgb.DMatrix(data=data.matrix(train_pp[train.ind, ]), label=train.Y[train.ind])
xg.grid <- expand.grid(objective = "binary:logistic",
                       eval_metric = "auc",
                       max_depth = c(4, 6, 8, 10),
                       min_child_weight = 1,
                       gamma = 0,
                       max_delta_step = 0,
                       nround = 200,
                       eta = 0.05,
                       subsample = c(0.5, 0.75, 1),
                       colsample_bytree = c(0.4, 0.6, 0.8, 1),
                       score = 0,
                       maximize = T,
                       stringsAsFactors = F)
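# note: 4 max_depth x 3 subsample x 4 colsample_bytree = 48 combinations, each scored by 5-fold CV AUC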
for (i in 1:nrow(xg.grid)) {
  message(paste0("\nRunning ", i, " out of ", nrow(xg.grid), ":\n"))
  set.seed(96707)
  mdl <- xgb.cv(data = xgtrain,
                nfold = 5,
                early.stop.round = 10,
                nthread = 4,
                objective = xg.grid[i,]$objective,
                eval_metric = xg.grid[i,]$eval_metric,
                max_depth = xg.grid[i,]$max_depth,
                min_child_weight = xg.grid[i,]$min_child_weight,
                gamma = xg.grid[i,]$gamma,
                max_delta_step = xg.grid[i,]$max_delta_step,
                nround = xg.grid[i,]$nround,
                eta = xg.grid[i,]$eta,
                subsample = xg.grid[i,]$subsample,
                colsample_bytree = xg.grid[i,]$colsample_bytree,
                prediction = T)
  xg.grid[i,]$score <- max(mdl$dt[, test.auc.mean])
}
xg.grid[which.max(xg.grid$score), ]
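# note: best-scoring combination from the grid; its settings feed the fixed parameter list below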
param.pp.lm.bal <- list("objective" = "binary:logistic",
                        "eval_metric" = "auc",
                        "max_depth" = 8,
                        "min_child_weight" = 1,
                        "gamma" = 0,
                        "max_delta_step" = 0,
                        "subsample" = 0.5,
                        "colsample_bytree" = 0.4)
set.seed(96707)
bst.cv <- xgb.cv(data = xgtrain, param = param.pp.lm.bal,
                 nfold = 5,
                 early.stop.round = 200,
                 eta = 0.01,
                 nround = 5000,
                 prediction = T)
(id.bstll <- which.max(bst.cv$dt[, test.auc.mean]))
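# note: the round with the highest cross-validated AUC; reuse it as nround for the final fit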
set.seed(96707)
bst <- xgboost(data = xgtrain, param = param.pp.lm.bal,
               eta = 0.01,
               nround = id.bstll)
importance.names <- dimnames(data.matrix(train_pp))[[2]]
importance.matrix <- xgb.importance(importance.names, model=bst)
xgb.plot.importance(importance.matrix[1:10,])
pred <- predict(bst, xgtest)
pROC::auc(valid.Y, pred) # Area under the curve: 0.8278
# note: GAM
train2 <- train[train.ind, ]
train2_pp <- train2 %>% subset(., select = -c(TREATMENT, UNIQUE_ID))
valid_pp <- valid %>% subset(., select = -c(TREATMENT, UNIQUE_ID))
IV <- Information::create_infotables(data=train2_pp, valid=NULL, y="PURCHASE", bins=10)
View(IV$Summary)
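# note: keep only predictors with IV above 0.05; weaker variables carry little information about PURCHASE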
train2_pp <- train2_pp[,c(subset(IV$Summary, IV>0.05)$Variable, "PURCHASE")]
valid_pp <- valid_pp[,c(subset(IV$Summary, IV>0.05)$Variable, "PURCHASE")]
nvars <- 20
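# note: ClustOfVar groups the surviving predictors into 20 correlation-based clusters; the top-IV variable per cluster is kept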
tree <- hclustvar(train2_pp[,!(names(train2_pp)=="PURCHASE")])
part_init <- cutreevar(tree, nvars)$cluster
kmeans <- kmeansvar(X.quanti=train2_pp[,!(names(train2_pp)=="PURCHASE")], init=part_init)
clusters <- cbind.data.frame(melt(kmeans$cluster), row.names(melt(kmeans$cluster)))
names(clusters) <- c("Cluster", "Variable")
clusters <- join(clusters, IV$Summary, by="Variable", type="left")
clusters <- clusters[order(clusters$Cluster),]
clusters$Rank <- ave(-clusters$IV, clusters$Cluster, FUN=rank)
View(clusters)
variables <- as.character(subset(clusters, Rank==1)$Variable)
train2_pp <- train2_pp[, c(variables, "PURCHASE")]
valid_pp <- valid_pp[, variables]
train2_pp$PURCHASE <- train.Y[train.ind]
source('https://raw.githubusercontent.com/klarsen1/gampost/master/miscfunctions.r')
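# note: CreateGAMFormula() is a helper from Larsen's post; it builds an mgcv formula with a spline term (here "regspline") for each predictor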
f <- CreateGAMFormula(data=train2_pp, y="PURCHASE", -1, "regspline")
gam2.model <- mgcv::gam(f, data=train2_pp, family=binomial(link="logit"))
gam2.predict <- 1/(1+exp(-predict(gam2.model, newdata=valid_pp)))
pROC::auc(valid.Y, gam2.predict) # Area under the curve: 0.8136
f <- CreateGAMFormula(data=train2_pp, y="PURCHASE", type="none")
gam3.model <- mgcv::gam(f, data=train2_pp, family=binomial(link="logit"), method="REML", select=TRUE)
gam3.predict <- 1/(1+exp(-predict(gam3.model, newdata=valid_pp)))
pROC::auc(valid.Y, gam3.predict) # Area under the curve: 0.8174