Last active October 13, 2021 21:37
XGBoost hyperparameter search using scikit-learn RandomizedSearchCV
import time

import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

x_train, y_train, x_valid, y_valid, x_test, y_test = ...  # load datasets

clf = xgb.XGBClassifier()

param_grid = {
    'silent': [False],
    'max_depth': [6, 10, 15, 20],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
    'gamma': [0, 0.25, 0.5, 1.0],
    'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
    'n_estimators': [100]}

fit_params = {'eval_metric': 'mlogloss',
              'early_stopping_rounds': 10,
              'eval_set': [(x_valid, y_valid)]}

rs_clf = RandomizedSearchCV(clf, param_grid, n_iter=20,
                            n_jobs=1, verbose=2, cv=2,
                            fit_params=fit_params,
                            scoring='neg_log_loss', refit=False,
                            random_state=42)

print("Randomized search..")
search_time_start = time.time()
rs_clf.fit(x_train, y_train)
print("Randomized search time:", time.time() - search_time_start)

best_score = rs_clf.best_score_
best_params = rs_clf.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
    print('%s: %r' % (param_name, best_params[param_name]))
According to the docs, the fit_params constructor argument has been deprecated (and later removed); fit-time parameters are now passed directly to fit().
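A minimal sketch of the newer calling convention, with LogisticRegression and sample_weight standing in for XGBClassifier and its eval_set parameters (in recent XGBoost versions, eval_metric and early_stopping_rounds have themselves moved into the XGBClassifier constructor):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=42)

search = RandomizedSearchCV(LogisticRegression(max_iter=500),
                            {'C': [0.01, 0.1, 1.0, 10.0]},
                            n_iter=4, cv=2, scoring='neg_log_loss',
                            refit=False, random_state=42)

# Fit-time parameters go straight into fit() as keyword arguments,
# instead of a fit_params= argument on the constructor:
fit_params = {'sample_weight': np.ones(len(y))}
search.fit(X, y, **fit_params)
print(search.best_params_)
```

The same `search.fit(x_train, y_train, **fit_params)` shape applies to the gist's XGBoost fit params on scikit-learn versions where fit() still accepts them.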
- x_test and y_test are declared but never used. Where are we supposed to use them?
- RandomizedSearchCV sets cv=2. What does that mean? Are we doing k-fold cross-validation with 2 splits? Or does the XGBoost classifier ignore that and use (x_valid, y_valid) instead, regardless of the cv value?
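On the cv=2 question: RandomizedSearchCV always runs k-fold cross-validation on the data passed to its fit() (here, 2 folds of x_train), and scoring uses those folds; the eval_set forwarded through the fit params is used by XGBoost only for early stopping inside each fold's fit. A scikit-learn-only sketch showing that cv=2 really produces two scored splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=100, random_state=0)
search = RandomizedSearchCV(LogisticRegression(max_iter=500),
                            {'C': [0.1, 1.0]}, n_iter=2, cv=2,
                            random_state=0)
search.fit(X, y)

# cv=2 -> two folds, reflected in per-split columns of cv_results_:
split_keys = [k for k in search.cv_results_ if k.startswith('split')]
print(split_keys)  # ['split0_test_score', 'split1_test_score']
```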
Thank you for sharing. I just want to highlight a statement from the scikit-learn docs on continuous parameters (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html): "It is highly recommended to use continuous distributions for continuous parameters."
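Following that recommendation, the list-valued entries above could be replaced with scipy.stats distributions. An illustrative sketch, with ranges chosen to mirror the lists in the gist:

```python
from scipy.stats import loguniform, uniform

# Continuous distributions instead of fixed candidate lists:
param_distributions = {
    'learning_rate': loguniform(1e-3, 0.3),   # log-uniform over [0.001, 0.3]
    'subsample': uniform(0.5, 0.5),           # uniform over [0.5, 1.0]
    'colsample_bytree': uniform(0.4, 0.6),    # uniform over [0.4, 1.0]
    'min_child_weight': uniform(0.5, 9.5),    # uniform over [0.5, 10.0]
    'reg_lambda': loguniform(0.1, 100.0),     # log-uniform over [0.1, 100.0]
}

# RandomizedSearchCV calls .rvs() on anything with that method;
# here we draw one sample by hand just to show the mechanism:
sample = {k: v.rvs(random_state=42) for k, v in param_distributions.items()}
print(sample)
```

This dict can be passed to RandomizedSearchCV in place of param_grid; lists and distributions can also be mixed in the same dict.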
I think that you have to set refit=True in order to be able to extract best_score = rs_clf.best_score_.