```{r first separate data into folds before choosing features}
# Train the model on the training folds only, leaving out one fold per iteration
accuracy.rates <- c()
for (itr in seq_len(folds)) {
  # Note: since the input features are already random, there is no need to
  # shuffle the data before creating folds. Ideally, though, examples should be
  # shuffled before creating folds to remove any recording/data-collection bias.
  test_ind <- seq(from = fold.size * (itr - 1) + 1, to = fold.size * itr, by = 1)
  train <- data[-test_ind, ]
  # After setting aside the held-out fold, select the best subset of features
  # using the training folds only
  selected.data.train <- best.subset(train)
  # Pick the same features in the test set so that the model can predict the output
  test <- data[test_ind, colnames(selected.data.train)]
  model <- train(y ~ ., data = selected.data.train, method = 'naive_bayes')
  # Note: no cross-validation while training; cross-validation is performed
  # by the outer for loop.
  test$yhat <- predict(model, newdata = test[, -which(names(test) == "y")])
  accuracy <- mean(test$y == test$yhat)
  accuracy.rates <- c(accuracy.rates, accuracy)
}
sprintf('Classification accuracy when CV is performed before subset selection = %0.0f %%',
        100 * mean(accuracy.rates))
```
[1] "Classification accuracy when CV is performed before subset selection = 46 %"
This accuracy is much closer to what we would expect: after all, our chosen
response variable (predicting a coin flip) is a random process completely
unrelated to the predictor variables.
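
For context, the chunk above relies on objects defined earlier in the gist: `data`, `folds`, `fold.size`, and a `best.subset()` feature-selection helper, plus the `caret` package for `train()`. The sketch below is one hypothetical way those pieces could be set up, with a simple correlation-based stand-in for `best.subset()`; the actual definitions in the full gist may differ.

```{r hypothetical setup sketch, eval=FALSE}
# Hypothetical setup matching the chunk above: random predictors, a coin-flip
# response, fold bookkeeping, and a filter-style stand-in for best.subset().
library(caret)  # provides train(); method = 'naive_bayes' also needs the naivebayes package

set.seed(1)
n <- 100
p <- 50
data <- as.data.frame(matrix(rnorm(n * p), nrow = n))     # random, uninformative features
data$y <- factor(sample(c("H", "T"), n, replace = TRUE))  # coin-flip response

folds <- 5
fold.size <- nrow(data) / folds

# Stand-in for best.subset(): keep the k predictors most correlated with y,
# computed on the supplied (training) data only.
best.subset <- function(df, k = 10) {
  x <- df[, setdiff(names(df), "y"), drop = FALSE]
  score <- sapply(x, function(col) abs(cor(col, as.numeric(df$y))))
  keep <- names(sort(score, decreasing = TRUE))[seq_len(k)]
  df[, c(keep, "y")]
}
```

Because `best.subset()` here only ever sees the training folds inside the loop, the held-out fold plays no part in choosing features, which is exactly why the reported accuracy stays near chance level.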