Wrong and right way to do CV
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Cross validation Example: The wrong way and the right way

```{r make dataset}
n.features = 1e5 # total features in data (genes)
n.selected = 1e2 # features to be selected after screening (variable name assumed)
n.examples = 50  # number of examples (or patients)
# create completely random labels for the occurrence of heart disease in patients
labels = round(runif(n.examples, min = 1, max = 2))
data = data.frame(round(matrix(runif(n.examples * n.features, min = 1, max = 2),
                               n.examples, n.features)))
data$y <- as.factor(labels)
```

```{r subset features}
# create a function to select the best features as per their correlation with the disease
# (n.selected and data.subset are assumed names)
best.subset <- function(data, n.selected = 50){
  data$y <- as.numeric(data$y)
  correlations <- apply(data[, -which(names(data) == "y")], 2, cor, y = data$y)
  # indices of the n.selected features with the largest correlations
  selected.features <- order(correlations, decreasing = TRUE)[1:n.selected]
  selected.features <- names(correlations[selected.features])
  data.subset <- data[, c(selected.features, 'y')]
  data.subset$y <- as.factor(data.subset$y)
  data.subset
}
data.subset <- best.subset(data, n.selected)
```

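```{r fit model to selected features. WRONG way of doing it}
library(caret) # trainControl() and train(); method 'naive_bayes' also needs the naivebayes package
# define training control
folds <- 5
fold.size <- dim(data)[1]/folds
train_control <- trainControl(method = "cv", number = folds)
# train the model on the already-screened features (data.subset is an assumed name)
model <- train(y ~ ., data = data.subset,
               trControl = train_control, method = 'naive_bayes')
# print(model)
sprintf('Classification accuracy when CV is performed after subset selection= %0.0f %%',
        100 * max(model$results$Accuracy)) # assumed argument: best cross-validated accuracy
#> [1] "Classification accuracy when CV is performed after subset selection= 99%"
```
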
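The labels are pure noise, so no classifier can genuinely do better than about
50%. The 99% figure arises because the features were screened using the labels
of every example, including the ones later held out for validation. This is
called cherry-picking, and it badly misrepresents the model's error. For an
honest estimate of accuracy, the model must not "peek" at the validation set at
all, which means feature selection must be repeated inside each fold, using
only the examples left after that fold has been held out.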
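
The difference between the two procedures can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the original analysis: the dataset is shrunk so it runs quickly, a simple nearest-centroid classifier stands in for the naive Bayes model so that no packages are needed, and the names `screen`, `nearest.centroid`, `cv.accuracy`, `n.keep` and the seed are all hypothetical.

```r
set.seed(1)  # assumed seed, only for reproducibility

# Assumed sizes, smaller than the dataset above so the sketch runs quickly
n.examples <- 50
n.features <- 2000
n.keep     <- 20  # hypothetical: number of features kept after screening
folds      <- 5

x <- matrix(round(runif(n.examples * n.features, min = 1, max = 2)),
            n.examples, n.features)
y <- round(runif(n.examples, min = 1, max = 2))  # random labels, no real signal

# Indices of the n.keep features most correlated with the labels
screen <- function(x, y, n.keep) {
  correlations <- suppressWarnings(cor(x, y))[, 1]
  order(correlations, decreasing = TRUE)[1:n.keep]
}

# Stand-in classifier: assign each test row to the class with the closest mean
nearest.centroid <- function(x.train, y.train, x.test) {
  classes <- sort(unique(y.train))
  centroids <- sapply(classes, function(k) {
    colMeans(x.train[y.train == k, , drop = FALSE])
  })
  dists <- sapply(seq_along(classes), function(j) {
    rowSums(sweep(x.test, 2, centroids[, j])^2)
  })
  dists <- matrix(dists, nrow = nrow(x.test))
  classes[apply(dists, 1, which.min)]
}

# Cross-validated accuracy; screen.inside controls where screening happens
cv.accuracy <- function(x, y, fold.id, n.keep, screen.inside) {
  # Wrong way: screen once on all examples, so held-out labels leak in
  if (!screen.inside) keep.all <- screen(x, y, n.keep)
  correct <- 0
  for (f in sort(unique(fold.id))) {
    test <- fold.id == f
    # Right way: screen again using only this fold's training examples
    keep <- if (screen.inside) screen(x[!test, ], y[!test], n.keep) else keep.all
    pred <- nearest.centroid(x[!test, keep], y[!test],
                             x[test, keep, drop = FALSE])
    correct <- correct + sum(pred == y[test])
  }
  correct / length(y)
}

fold.id <- sample(rep(1:folds, length.out = n.examples))
acc.wrong <- cv.accuracy(x, y, fold.id, n.keep, screen.inside = FALSE)
acc.right <- cv.accuracy(x, y, fold.id, n.keep, screen.inside = TRUE)
sprintf('Screening before CV: %0.0f %%; screening inside each fold: %0.0f %%',
        100 * acc.wrong, 100 * acc.right)
```

With random labels, the second figure should typically land near chance level while the first stays far above it. The only difference between the two runs is whether `screen` ever sees the held-out examples.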
