@fmnobar
Created March 11, 2023 17:05
HPO - Grid Search, Random Search and Bayesian Optimization
{
"cells": [
{
"cell_type": "markdown",
"id": "3a36787a-fab1-4c3a-9ace-c754d52cf455",
"metadata": {},
"source": [
"## 1. Grid Search"
]
},
{
"cell_type": "markdown",
"id": "87f4c18c-0c3d-45cb-9a92-a65eab8e0756",
"metadata": {},
"source": [
"## Comparison Table - Methodologies\n",
"| # | Methodology | Definition | Advantages | Disadvantages | Python Package |\n",
"| - | ----------- | ---------- | ---------- | ------------- | -------------- |\n",
"| 1 | Grid Search | - Exhaustively tries every combination of hyperparameters | - Easy to understand and implement <br> - Easy to parallelize <br> - Works well for discrete and continuous spaces | - Computationally expensive in large search spaces <br> - Memoryless (does not learn from previous attempts) | `sklearn.model_selection.GridSearchCV` |\n",
"| 2 | Random Search | - Randomly tries a pre-defined number of combinations of <br> hyperparameters | - Easy to understand and implement <br> - Easy to parallelize <br> - Works for discrete and continuous spaces <br> - More efficient than Grid Search | - Memoryless (does not learn from previous attempts) <br> - May miss important hyperparameter values, given the random selection | `sklearn.model_selection.RandomizedSearchCV` |\n",
"| 3 | Bayesian Optimization | - Uses Bayes' Theorem to try a pre-defined number of<br> combinations of hyperparameters | - Learns from previous attempts <br> - More efficient than Grid Search and Random Search | - Difficult to parallelize <br> - Computationally heavier than Grid Search and Random Search per attempt <br> - Sensitive to the choice of assumptions (e.g. the prior) | `skopt.BayesSearchCV` |\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "78cebe6f-a8c7-418d-920c-6a96c0e8ed25",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"selected hyperparameters:\n",
"{'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}\n",
"\n",
"best_score: 0.9666666666666668\n",
"elapsed_time: 352.0\n"
]
}
],
"source": [
"# Import libraries\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import cross_val_score\n",
"import time\n",
"\n",
"# Load the Iris dataset\n",
"iris = load_iris()\n",
"X, y = iris.data, iris.target\n",
"\n",
"# Define the hyperparameter search space\n",
"search_space = {'n_estimators': [10, 100, 500, 1000],\n",
" 'max_depth': [2, 10, 25, 50, 100],\n",
" 'min_samples_split': [2, 5, 10],\n",
" 'min_samples_leaf': [1, 5, 10]}\n",
"\n",
"# Define the random forest classifier\n",
"clf = RandomForestClassifier(random_state=1234)\n",
"\n",
"# Create a GridSearchCV object\n",
"optimizer = GridSearchCV(clf, search_space, cv=5, scoring='accuracy')\n",
"\n",
"# Store start time to calculate total elapsed time\n",
"start_time = time.time()\n",
"\n",
"# Fit the optimizer on the data\n",
"optimizer.fit(X, y)\n",
"\n",
"# Store end time to calculate total elapsed time\n",
"end_time = time.time()\n",
"\n",
"# Print the best set of hyperparameters and corresponding score\n",
"print(\"selected hyperparameters:\")\n",
"print(optimizer.best_params_)\n",
"print(\"\")\n",
"print(f\"best_score: {optimizer.best_score_}\")\n",
"print(f\"elapsed_time: {round(end_time-start_time, 1)}\")"
]
},
{
"cell_type": "markdown",
"id": "060d0034-744f-4707-95fa-2006e6d4311e",
"metadata": {},
"source": [
"## 2. Random Search"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ea99586a-1ce6-4205-b3cf-6293be0366a2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"selected hyperparameters:\n",
"{'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10}\n",
"\n",
"best_score: 0.9666666666666668\n",
"elapsed_time: 75.5\n"
]
}
],
"source": [
"# Import libraries\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import randint\n",
"\n",
"# Create a RandomizedSearchCV object\n",
"optimizer = RandomizedSearchCV(clf, param_distributions=search_space,\n",
" n_iter=50, cv=5, scoring='accuracy',\n",
" random_state=42)\n",
"\n",
"# Store start time to calculate total elapsed time\n",
"start_time = time.time()\n",
"\n",
"# Fit the optimizer on the data\n",
"optimizer.fit(X, y)\n",
"\n",
"# Store end time to calculate total elapsed time\n",
"end_time = time.time()\n",
"\n",
"# Print the best set of hyperparameters and corresponding score\n",
"print(\"selected hyperparameters:\")\n",
"print(optimizer.best_params_)\n",
"print(\"\")\n",
"print(f\"best_score: {optimizer.best_score_}\")\n",
"print(f\"elapsed_time: {round(end_time-start_time, 1)}\")"
]
},
{
"cell_type": "markdown",
"id": "f9b814b6-8d1e-4d3b-ae9a-348f0afbb1e3",
"metadata": {},
"source": [
"## 3. Bayesian Optimization"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1e63388d-d8f9-412e-9d4b-5cb5372beab8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"selected hyperparameters:\n",
"OrderedDict([('max_depth', 25), ('min_samples_leaf', 5), ('min_samples_split', 10), ('n_estimators', 100)])\n",
"\n",
"best_score: 0.9666666666666668\n",
"elapsed_time: 23.1\n"
]
}
],
"source": [
"# Import libraries\n",
"from skopt import BayesSearchCV\n",
"\n",
"# Note: no separate objective function is needed here, because\n",
"# BayesSearchCV builds and cross-validates its own objective internally\n",
"\n",
"# Perform Bayesian Optimization\n",
"optimizer = BayesSearchCV(estimator=RandomForestClassifier(random_state=1234),\n",
" search_spaces=search_space,\n",
" n_iter=10,\n",
" cv=5,\n",
" scoring='accuracy',\n",
" random_state=42)\n",
"\n",
"# Store start time to calculate total elapsed time\n",
"start_time = time.time()\n",
"\n",
"optimizer.fit(X, y)\n",
"\n",
"# Store end time to calculate total elapsed time\n",
"end_time = time.time()\n",
"\n",
"# Print the best set of hyperparameters and corresponding score\n",
"print(\"selected hyperparameters:\")\n",
"print(optimizer.best_params_)\n",
"print(\"\")\n",
"print(f\"best_score: {optimizer.best_score_}\")\n",
"print(f\"elapsed_time: {round(end_time-start_time, 1)}\")"
]
},
{
"cell_type": "markdown",
"id": "93212236-35e2-445c-bda5-cbb44f42dbf2",
"metadata": {},
"source": [
"## Comparison Table - Results\n",
"\n",
"| # | Methodology | Selected `max_depth` | Selected `min_samples_leaf` | Selected `min_samples_split` | Selected `n_estimators` | Best Score | Elapsed Time (s) | Time Saved vs. Grid Search (%) |\n",
"| - | ----------- | -------------------- | --------------------------- | ---------------------------- | ----------------------- | ---------- | ---------------- | ------------------------------ |\n",
"|1|Grid Search|10|1|2|500|0.9666666666666668|352.0|0.0%|\n",
"|2|Random Search|10|1|2|500|0.9666666666666668|75.5|78.5%|\n",
"|3|Bayesian Optimization|25|5|10|100|0.9666666666666668|23.1|93.4%|\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}