```{r first separate data into folds before choosing features}
# Train the model on the training folds only, leaving out one fold per iteration
accuracy.rates <- c()
for (itr in seq_len(folds)) {
  # Note: since the input features are already random, there is no need to
  # shuffle the data before creating folds. Ideally, though, examples should be
  # shuffled before creating folds to remove any recording/data-collection bias.
  test_ind <- seq(from = fold.size * (itr - 1) + 1, to = fold.size * itr, by = 1)
  train <- data[-test_ind, ]
  # After setting aside the held-out fold, select the best subset of features
  # using the training folds only
  selected.data.train <- best.subset(train)
  # Pick the same features in the test set so that the model can predict the output
  test <- data[test_ind, colnames(selected.data.train)]
  model <- train(y ~ ., data = selected.data.train, method = 'naive_bayes')
  # Note: no cross-validation while training; cross-validation is performed
  # by the outer for loop.
  test$yhat <- predict(model, newdata = test[, -which(names(test) == "y")])
  accuracy <- mean(test$y == test$yhat)
  accuracy.rates <- c(accuracy.rates, accuracy)
}
sprintf('Classification accuracy when CV is performed before subset selection = %0.0f %%',
        100 * mean(accuracy.rates))
```
[1] "Classification accuracy when CV is performed before subset selection = 46 %"
This accuracy is much closer to what we would expect: after all, our chosen
response variable (predicting a coin flip) is a random process completely
unrelated to the predictor variables.
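
For context, the chunk above relies on objects defined earlier in the gist: `data`, `folds`, `fold.size`, and a `best.subset()` feature-selection helper, plus the `caret` package for `train()`. The sketch below is one hypothetical way those pieces could be set up, with a simple correlation-based stand-in for `best.subset()`; the actual definitions in the full gist may differ.

```{r hypothetical setup sketch, eval=FALSE}
# Hypothetical setup matching the chunk above: random predictors, a coin-flip
# response, fold bookkeeping, and a filter-style stand-in for best.subset().
library(caret)  # provides train(); method = 'naive_bayes' also needs the naivebayes package

set.seed(1)
n <- 100
p <- 50
data <- as.data.frame(matrix(rnorm(n * p), nrow = n))     # random, uninformative features
data$y <- factor(sample(c("H", "T"), n, replace = TRUE))  # coin-flip response

folds <- 5
fold.size <- nrow(data) / folds

# Stand-in for best.subset(): keep the k predictors most correlated with y,
# computed on the supplied (training) data only.
best.subset <- function(df, k = 10) {
  x <- df[, setdiff(names(df), "y"), drop = FALSE]
  score <- sapply(x, function(col) abs(cor(col, as.numeric(df$y))))
  keep <- names(sort(score, decreasing = TRUE))[seq_len(k)]
  df[, c(keep, "y")]
}
```

Because `best.subset()` here only ever sees the training folds inside the loop, the held-out fold plays no part in choosing features, which is exactly why the reported accuracy stays near chance level.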