Thanks for a great guide.
Quick question, in:
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
Is it correct that we can use X and y instead of X_train and y_train, since imba_pipeline will take care of not applying the oversampling to the validation folds?
Thanks for the great guide!
Is there a reason for not using StratifiedKFold?
@Vini-002 StratifiedKFold doesn't solve the imbalance problem. What it does is make sure that each fold is equally imbalanced. If thyroid cases are only 10% of the population, each fold will contain 10% thyroid cases. This can still lead some ML classifiers to simply predict the majority class.
There is no reason you could not use stratified sampling in addition to one of the techniques used here.
I understand that it doesn't solve the imbalance, but I thought it could be bad not to use StratifiedKFold as you could end up with some fold with no samples of the minority class.
That is absolutely fair. I probably wouldn’t add it into the main article (I think it is useful to focus on one problem) but can definitely see a section at the bottom for “going further” or “future improvements”.
I’ll take the advice on board and add a section, thank you!!
Thanks for the best imbalanced-data Jupyter notebook I have seen so far.
My question: in the last cell [36] you got 0.8392857142857143, which is the exact same value as in cell [34] --> recall_score(y_test, y_test_predict). Was this a coincidence, given that the model used in cell [34] was trained on the CV folds?
Thanks,
Paulo Praça
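A likely explanation, assuming the guide's search uses scikit-learn's GridSearchCV with the default refit=True: after the search, the best hyperparameters are refit on the whole training set, so scoring the grid object on the test set and scoring the refit best estimator give exactly the same number. A minimal sketch with synthetic data and a hypothetical parameter grid (not the guide's actual cells):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# refit=True (the default) retrains the best model on ALL of X_train
# after cross-validation picks the hyperparameters.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    scoring='recall', cv=5)
grid.fit(X_train, y_train)

# These two ways of scoring the test set therefore agree exactly.
r_via_grid = recall_score(y_test, grid.predict(X_test))
r_via_best = recall_score(y_test, grid.best_estimator_.predict(X_test))
print(r_via_grid, r_via_best)
```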