@kiwidamien
Created May 9, 2019 07:00
Example of cross-validation with unbalanced data
@pracapaulo

Thanks for the best unbalanced-data Jupyter notebook that I have seen so far.
My question: in the last cell [36] you got 0.8392857142857143. That is
exactly the same value as in cell [34] from recall_score(y_test, y_test_predict). Was this a coincidence or not, given that the model used in cell [34] was trained in the CV folds?

Thanks,
Paulo Praça

@tomgt8

tomgt8 commented Jul 23, 2020

Thanks for a great guide.
Quick question, in:
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
Is it correct that we can use X, y instead of X_train and y_train, since imba_pipeline will take care of not oversampling the test data?
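
The mechanism behind this question can be sketched in plain Python. This is a toy illustration under stated assumptions, not the notebook's code and not imbalanced-learn's implementation: `oversample_minority` and `resampled_splits` are hypothetical helpers, and simple random duplication stands in for whatever sampler `imba_pipeline` uses. The point it shows is that when resampling lives inside the pipeline, each split is resampled only on its training part, and the held-out fold is scored as it is.

```python
import random

def oversample_minority(train_idx, labels, rng):
    """Duplicate random minority-class indices until both classes are the same size."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # nothing to duplicate; leave the split unchanged
        return list(train_idx)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(train_idx) + extra

def resampled_splits(labels, n_splits, seed=0):
    """Yield (train, validation) index lists; only the train part is oversampled."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    folds = [indices[k::n_splits] for k in range(n_splits)]  # interleaved folds
    for k, valid_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield oversample_minority(train_idx, labels, rng), valid_idx

labels = [0] * 90 + [1] * 10  # toy data: 10% positives
splits = list(resampled_splits(labels, n_splits=5))
for train_idx, valid_idx in splits:
    train_pos = sum(labels[i] for i in train_idx)
    valid_pos = sum(labels[i] for i in valid_idx)
    # columns: train size, train positives, validation size, validation positives
    print(len(train_idx), train_pos, len(valid_idx), valid_pos)  # 144 72 20 2
```

Every training split ends up balanced (72 of each class), while every validation fold keeps the original 10% positives. Whether to pass X, y or X_train, y_train then comes down to whether a separate test set should stay untouched for a final check.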

@Vini-002

Vini-002 commented Feb 6, 2023

Thanks for the great guide!
Is there a reason for not using StratifiedKFold?

@kiwidamien
Author

@Vini-002 StratifiedKFold doesn’t solve the imbalance problem. What it does is make sure that each of the folds is equally unbalanced. If the thyroid cases are only 10% of the population, then 10% of each fold will be thyroid cases. This will still lead some ML classifiers to just predict the majority class.

There is no reason you could not use stratified sampling in addition to one of the techniques used here.

@Vini-002

Vini-002 commented Feb 6, 2023

I understand that it doesn't solve the imbalance, but I thought it could be bad not to use StratifiedKFold, as you could end up with a fold that has no samples of the minority class.
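
A small sketch of that concern, using plain Python rather than scikit-learn so it runs anywhere. The helpers `contiguous_folds` and `stratified_folds` are hypothetical stand-ins for the idea behind unshuffled KFold and StratifiedKFold, not their actual implementations, and the sorted labels are a deliberately bad case chosen to make the effect obvious.

```python
def contiguous_folds(n_samples, n_splits):
    """Cut indices 0..n-1 into equal contiguous blocks (assumes n divides evenly)."""
    size = n_samples // n_splits
    return [list(range(k * size, (k + 1) * size)) for k in range(n_splits)]

def stratified_folds(labels, n_splits):
    """Deal each class's indices round-robin so every fold gets its share."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_splits)]
    for class_indices in by_class.values():
        for position, idx in enumerate(class_indices):
            folds[position % n_splits].append(idx)
    return folds

def minority_counts(folds, labels, minority=1):
    """Count minority-class samples in each fold."""
    return [sum(1 for i in fold if labels[i] == minority) for fold in folds]

labels = [0] * 90 + [1] * 10  # 10% minority class, sorted by label
plain = minority_counts(contiguous_folds(len(labels), 5), labels)
strat = minority_counts(stratified_folds(labels, 5), labels)
print(plain)  # [0, 0, 0, 0, 10]
print(strat)  # [2, 2, 2, 2, 2]
```

With unstratified folds, four of the five validation folds contain no minority samples at all, so recall cannot be computed meaningfully on them. Shuffling makes this less likely but does not rule it out when the minority class is small; stratifying keeps the 10% share in every fold.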

@kiwidamien
Author

That is absolutely fair. I probably wouldn’t add it into the main article (I think it is useful to focus on one problem) but can definitely see a section at the bottom for “going further” or “future improvements”.

I’ll take the advice on board and add a section. Thank you!!
