
3.2 Validation strategies

The way of splitting our dataset that we just saw is called the HOLD-OUT strategy:

it divides the data into two parts: the training part is used to fit our model, and the validation part to evaluate its performance. Using the scores from this evaluation step, we can choose the best model and the best hyperparameters for that model.

If we have enough data, hold-out is usually a good choice. In particular, we can rely on this scheme if we get similar scores for the same model when we try different hold-out splits.

But what is the optimal size of our sets?

A common split uses 80% of the data for training and the remaining 20% for validation.
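
As a minimal sketch, such an 80/20 split can be done with sklearn.model_selection.train_test_split (the generated dataset and classifier below are placeholders, not part of this lesson):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# placeholder data: substitute your own X and y
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 80% training, 20% validation; repeating this with different random_state
# values lets you check whether the hold-out score is stable across splits
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_valid, y_valid))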

Another useful command is sklearn.model_selection.ShuffleSplit, which generates random train/validation splits (with n_splits=1 it produces a single training set and a single validation set).
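
A minimal sketch of ShuffleSplit (the placeholder data, test_size and random_state values are illustrative choices, not from this lesson):

from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # placeholder data

# with n_splits=1 this behaves like a single random hold-out split;
# a larger n_splits would give a repeated hold-out
ss = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, valid_index in ss.split(X):
    print("TRAIN size:", len(train_index), "VALID size:", len(valid_index))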

Let's now introduce some other useful strategies to split data, and focus on the reasons to use each one of them.

Leave-One-Out cross validation

If your dataset is small, it makes sense for the validation set to be small too, in order to save enough examples for training. But this means that the validation error will be a bad estimate of the test error, because such a small validation set cannot capture the behaviour of the test set.

Measuring the performance of your model on only one example at a time is far too unreliable on its own.

The solution is the leave-one-out strategy: in this scheme we iterate through every sample in our data, each time using one example for validation and all the remaining examples for training. You will need to retrain the model N times (where N is the number of samples in the dataset).

In the end you will get predictions for every sample in the dataset and can compute the loss by averaging the single losses (this averaging process is called cross-validation).

[Figure loo.png: leave-one-out cross-validation scheme]

This method can be helpful if we have too little data and only a few models to choose from.

In scikit-learn there is a class that does this: sklearn.model_selection.LeaveOneOut

from sklearn.model_selection import LeaveOneOut

For example, let's consider a very small dataset, with only 10 examples:

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# define the dataset
X_clf, y_clf = make_classification(n_samples=10, n_features=4, random_state=3)

# plot the first two features, coloured by class
plt.scatter(X_clf[:, 0], X_clf[:, 1], marker='o', c=y_clf, s=25, edgecolor='k')
plt.title("Data")

[Image: scatter plot of the 10 generated samples]

loo = LeaveOneOut()

# try with a naive Bayes classifier
from numpy import mean, std
from sklearn.naive_bayes import GaussianNB

score = []
NB2 = GaussianNB()

for train_index, valid_index in loo.split(X_clf):  # generate indices to split data into training and validation sets
    print("TRAIN:", train_index, "VALID:", valid_index)
    X_train, X_valid = X_clf[train_index], X_clf[valid_index]
    y_train, y_valid = y_clf[train_index], y_clf[valid_index]
    NB2.fit(X_train, y_train)
    print(NB2.score(X_valid, y_valid))
    score.append(NB2.score(X_valid, y_valid))

print("Accuracy: %0.2f (+/- %0.2f)" % (mean(score), std(score) * 2))  # average accuracy with cross-validation

TRAIN: [1 2 3 4 5 6 7 8 9] VALID: [0]
1.0
TRAIN: [0 2 3 4 5 6 7 8 9] VALID: [1]
0.0
TRAIN: [0 1 3 4 5 6 7 8 9] VALID: [2]
1.0
TRAIN: [0 1 2 4 5 6 7 8 9] VALID: [3]
1.0
TRAIN: [0 1 2 3 5 6 7 8 9] VALID: [4]
1.0
TRAIN: [0 1 2 3 4 6 7 8 9] VALID: [5]
1.0
TRAIN: [0 1 2 3 4 5 7 8 9] VALID: [6]
0.0
TRAIN: [0 1 2 3 4 5 6 8 9] VALID: [7]
1.0
TRAIN: [0 1 2 3 4 5 6 7 9] VALID: [8]
1.0
TRAIN: [0 1 2 3 4 5 6 7 8] VALID: [9]
1.0
Accuracy: 0.80 (+/- 0.80)

K-fold cross validation

K-Fold divides the data into K groups of samples of equal size, called folds (if K equals the number of training samples, this is equivalent to the leave-one-out strategy). The prediction function is learned using K-1 folds, and the fold left out is used as the validation set; each fold serves as the validation set exactly once. After this procedure, we average the scores over the K folds (the cross-validation step).

This method is a good choice when we have enough data but may get different scores and optimal parameters for different splits.

[Figure kfold.png: K-fold cross-validation scheme]

You can also estimate the mean and variance of the loss, which is very helpful for judging whether an improvement is significant.

Here it is important to understand the difference between K-fold and a K-repeated hold-out. In the first case, we can average the scores to obtain a mean score that indicates the quality of the model. In the second case we cannot: some samples may never end up in the validation set, while others may appear there multiple times, so an average of the scores is no longer informative.
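
To make this concrete, here is a small sketch (not part of the original notebook) that counts how many times each sample ends up in the validation set under K-fold versus a K-repeated hold-out built with ShuffleSplit:

import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_demo = np.arange(10).reshape(-1, 1)  # hypothetical 10-sample dataset

splitters = [("KFold", KFold(n_splits=5)),
             ("Repeated hold-out", ShuffleSplit(n_splits=5, test_size=0.2, random_state=0))]

for name, splitter in splitters:
    counts = np.zeros(len(X_demo), dtype=int)
    for _, valid_index in splitter.split(X_demo):
        counts[valid_index] += 1
    # KFold puts every sample in validation exactly once;
    # ShuffleSplit may skip some samples and repeat others
    print(name, "validation counts per sample:", counts)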

Let's apply this strategy to our breast cancer classification problem and select the best classifier in this way.

Useful commands are sklearn.model_selection.KFold, which provides train/validation indices to split the data, and sklearn.model_selection.cross_val_score, which evaluates scores by cross-validation. We will use the second one.
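
For reference, a minimal sketch of the first command, which only generates the train/validation indices (the toy array below is a placeholder, not the breast cancer data):

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # placeholder data
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, valid_index in kf.split(X_toy):
    print("TRAIN:", train_index, "VALID:", valid_index)

Now let's score each candidate classifier on the breast cancer training data with cross_val_score: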

from sklearn.model_selection import cross_val_score

# the classifiers and the Xtrain, ytrain arrays come from the earlier
# sections on the breast cancer classification problem
for clf in [SVC(random_state=0), RandomForestClassifier(random_state=0), GaussianNB(),
            DecisionTreeClassifier(random_state=0), LogisticRegression(random_state=0),
            MLPClassifier(random_state=0)]:
    model = clf
    scores = cross_val_score(model, Xtrain, ytrain, cv=5)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.63 (+/- 0.01)
Accuracy: 0.95 (+/- 0.05)
Accuracy: 0.93 (+/- 0.06)
Accuracy: 0.92 (+/- 0.06)
Accuracy: 0.94 (+/- 0.10)
Accuracy: 0.94 (+/- 0.06)

This time the winner is the random forest classifier, and this is a more informed and reliable choice.
