Stratified sampling: a training / test split that preserves the class distribution (caret functions), standardization of the data using the training-set scaling parameters, and stratified folds for cross-validation.
library(caret)
## select training indices preserving class distribution
in.train <- createDataPartition(yclass, p=0.8, list=FALSE)
summary(factor(yclass))
ytra <- yclass[in.train]; summary(factor(ytra))
ytst <- yclass[-in.train]; summary(factor(ytst))
## standardize features: scale the test part with the training-set parameters
Xtra <- scale(X[in.train,])
Xtest <- scale(X[-in.train,],
               center = attr(Xtra,"scaled:center"),
               scale = attr(Xtra,"scaled:scale"))
## stratified folds for cross-validation: say Y is a factor
table(Y)
foldInds <- createFolds(Y, k=10, list=TRUE, returnTrain=FALSE)
lapply(foldInds, function(ii) table(Y[ii])) ## verify stratification
## set returnTrain=TRUE if supplying these indices to the train function,
## see https://stat.ethz.ch/pipermail/r-help/2011-May/277722.html
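The snippet above assumes that `X`, `yclass` and `Y` already exist. As a minimal end-to-end sketch, the same steps can be run on the built-in `iris` data. The choice of `iris`, the fixed seed, `k = 5`, and the `ctrl` object are illustrative assumptions, not part of the original gist. Handing the fold indices to `trainControl(index = ...)` is one way to supply them to caret's `train` function, which is why `returnTrain = TRUE` is used here.

```r
library(caret)

set.seed(42)                       # assumption: fixed seed so the split is reproducible
X <- as.matrix(iris[, 1:4])        # illustrative feature matrix
Y <- iris$Species                  # illustrative factor of class labels

## stratified 80/20 split: class proportions are preserved in both parts
in.train <- createDataPartition(Y, p = 0.8, list = FALSE)
print(table(Y[in.train]))
print(table(Y[-in.train]))

## standardize: estimate center and scale on the training part only,
## then apply those same parameters to the test part
Xtra  <- scale(X[in.train, ])
Xtest <- scale(X[-in.train, ],
               center = attr(Xtra, "scaled:center"),
               scale  = attr(Xtra, "scaled:scale"))

## stratified CV folds on the training labels, returned as training-row indices
ytra <- Y[in.train]
foldInds <- createFolds(ytra, k = 5, list = TRUE, returnTrain = TRUE)

## resampling specification that reuses these folds (hypothetical name `ctrl`)
ctrl <- trainControl(method = "cv", index = foldInds)
```

Only the training columns end up with exactly zero mean and unit standard deviation; the test columns are close but not exact, which is expected because no test-set statistics enter the scaling. `ctrl` could then be passed to `train(..., trControl = ctrl)` with a model of your choice.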