fmnobar/HPO.ipynb

## HPO.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3a36787a-fab1-4c3a-9ace-c754d52cf455",
   "metadata": {},
   "source": [
    "## 1. Grid Search"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87f4c18c-0c3d-45cb-9a92-a65eab8e0756",
   "metadata": {},
   "source": [
    "## Comparison Table - Methodologies\n",
    "| # | Methodology | Definition | Advantages | Disadvantages | Python Package |\n",
    "| - | ----------- | ---------- | ---------- | ------------- | -------------- |\n",
    "| 1 | Grid Search | - Exhaustively tries every combination of hyperparameters | - Easy to understand and implement <br> - Easy to parallelize <br> - Works well for discrete and continuous spaces | - Computationally expensive in large search spaces <br> - Memoryless (does not learn from previous attempts) | `sklearn.model_selection.GridSearchCV` |\n",
    "| 2 | Random Search | - Randomly tries a pre-defined number of combinations of <br> hyperparameters | - Easy to understand and implement <br> - Easy to parallelize <br> - Works for discrete and continuous spaces <br> - More efficient than Grid Search | - Memoryless (does not learn from previous attempts) <br> - May miss important hyperparameter values, given the random selection | `sklearn.model_selection.RandomizedSearchCV` |\n",
    "| 3 | Bayesian Optimization | - Uses Bayes' Theorem to try a pre-defined number of <br> combinations of hyperparameters | - Learns from previous attempts <br> - More efficient than Grid Search and Random Search | - Difficult to parallelize <br> - Computationally heavier than Grid Search and Random Search per attempt <br> - Sensitive to choice of assumptions (e.g. prior) | `skopt.BayesSearchCV` |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "78cebe6f-a8c7-418d-920c-6a96c0e8ed25",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "selected hyperparameters:\n",
      "{'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}\n",
      "\n",
      "best_score: 0.9666666666666668\n",
      "elapsed_time: 352.0\n"
     ]
    }
   ],
   "source": [
    "# Import libraries\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import cross_val_score\n",
    "import time\n",
    "\n",
    "# Load the Iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# Define the hyperparameter search space\n",
    "search_space = {'n_estimators': [10, 100, 500, 1000],\n",
    "                'max_depth': [2, 10, 25, 50, 100],\n",
    "                'min_samples_split': [2, 5, 10],\n",
    "                'min_samples_leaf': [1, 5, 10]}\n",
    "\n",
    "# Define the random forest classifier\n",
    "clf = RandomForestClassifier(random_state=1234)\n",
    "\n",
    "# Create a GridSearchCV object\n",
    "optimizer = GridSearchCV(clf, search_space, cv=5, scoring='accuracy')\n",
    "\n",
    "# Store start time to calculate total elapsed time\n",
    "start_time = time.time()\n",
    "\n",
    "# Fit the optimizer on the data\n",
    "optimizer.fit(X, y)\n",
    "\n",
    "# Store end time to calculate total elapsed time\n",
    "end_time = time.time()\n",
    "\n",
    "# Print the best set of hyperparameters and corresponding score\n",
    "print(\"selected hyperparameters:\")\n",
    "print(optimizer.best_params_)\n",
    "print(\"\")\n",
    "print(f\"best_score: {optimizer.best_score_}\")\n",
    "print(f\"elapsed_time: {round(end_time-start_time, 1)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "060d0034-744f-4707-95fa-2006e6d4311e",
   "metadata": {},
   "source": [
    "## 2. Random Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ea99586a-1ce6-4205-b3cf-6293be0366a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "selected hyperparameters:\n",
      "{'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10}\n",
      "\n",
      "best_score: 0.9666666666666668\n",
      "elapsed_time: 75.5\n"
     ]
    }
   ],
   "source": [
    "# Import libraries\n",
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "\n",
    "# Create a RandomizedSearchCV object\n",
    "optimizer = RandomizedSearchCV(clf, param_distributions=search_space,\n",
    "                               n_iter=50, cv=5, scoring='accuracy',\n",
    "                               random_state=42)\n",
    "\n",
    "# Store start time to calculate total elapsed time\n",
    "start_time = time.time()\n",
    "\n",
    "# Fit the optimizer on the data\n",
    "optimizer.fit(X, y)\n",
    "\n",
    "# Store end time to calculate total elapsed time\n",
    "end_time = time.time()\n",
    "\n",
    "# Print the best set of hyperparameters and corresponding score\n",
    "print(\"selected hyperparameters:\")\n",
    "print(optimizer.best_params_)\n",
    "print(\"\")\n",
    "print(f\"best_score: {optimizer.best_score_}\")\n",
    "print(f\"elapsed_time: {round(end_time-start_time, 1)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9b814b6-8d1e-4d3b-ae9a-348f0afbb1e3",
   "metadata": {},
   "source": [
    "## 3. Bayesian Optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1e63388d-d8f9-412e-9d4b-5cb5372beab8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "selected hyperparameters:\n",
      "OrderedDict([('max_depth', 25), ('min_samples_leaf', 5), ('min_samples_split', 10), ('n_estimators', 100)])\n",
      "\n",
      "best_score: 0.9666666666666668\n",
      "elapsed_time: 23.1\n"
     ]
    }
   ],
   "source": [
    "# Import libraries\n",
    "from skopt import BayesSearchCV\n",
    "\n",
    "# Illustrative objective: mean 5-fold CV accuracy for one hyperparameter set.\n",
    "# BayesSearchCV computes this score internally, so the function is never called.\n",
    "def objective_function(params):\n",
    "    clf = RandomForestClassifier(**params, random_state=1234)\n",
    "    score = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()\n",
    "    return score\n",
    "\n",
    "# Perform Bayesian Optimization\n",
    "# Note: this estimator is created without random_state, unlike clf above\n",
    "optimizer = BayesSearchCV(estimator=RandomForestClassifier(),\n",
    "                          search_spaces=search_space,\n",
    "                          n_iter=10,\n",
    "                          cv=5,\n",
    "                          scoring='accuracy',\n",
    "                          random_state=42)\n",
    "\n",
    "# Store start time to calculate total elapsed time\n",
    "start_time = time.time()\n",
    "\n",
    "# Fit the optimizer on the data\n",
    "optimizer.fit(X, y)\n",
    "\n",
    "# Store end time to calculate total elapsed time\n",
    "end_time = time.time()\n",
    "\n",
    "# Print the best set of hyperparameters and corresponding score\n",
    "print(\"selected hyperparameters:\")\n",
    "print(optimizer.best_params_)\n",
    "print(\"\")\n",
    "print(f\"best_score: {optimizer.best_score_}\")\n",
    "print(f\"elapsed_time: {round(end_time-start_time, 1)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93212236-35e2-445c-bda5-cbb44f42dbf2",
   "metadata": {},
   "source": [
    "## Comparison Table - Results\n",
    "\n",
    "| # | Methodology | Selected `max_depth` | Selected `min_samples_leaf` | Selected `min_samples_split` | Selected `n_estimators` | Best Score | Elapsed Time (s) | Gained Efficiency (%) |\n",
    "| - | ----------- | -------------------- | --------------------------- | ---------------------------- | ----------------------- | ---------- | ------------ | --------------------- |\n",
    "|1|Grid Search|10|1|2|500|0.9666666666666668|352.0|0.0%|\n",
    "|2|Random Search|10|1|2|500|0.9666666666666668|75.5|78.5%|\n",
    "|3|Bayesian Optimization|25|5|10|100|0.9666666666666668|23.1|93.4%|\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
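
The elapsed times in the results table depend largely on how many models each search trains. The sketch below is a rough count of those fits, assuming one fit per candidate per fold and ignoring the final refit on the full data. The names `n_candidates`, `cv_folds`, and `cv_fits` are illustrative and do not appear in the notebook.

```python
from math import prod

# Search space copied from the notebook's grid search cell.
search_space = {
    "n_estimators": [10, 100, 500, 1000],
    "max_depth": [2, 10, 25, 50, 100],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}
cv_folds = 5  # cv=5 is used in all three searches

# Grid search evaluates every combination; the other two stop at n_iter.
n_candidates = {
    "Grid Search": prod(len(values) for values in search_space.values()),
    "Random Search": 50,          # n_iter=50
    "Bayesian Optimization": 10,  # n_iter=10
}

# Assumption: each candidate is fitted once per fold (final refit not counted).
cv_fits = {name: n * cv_folds for name, n in n_candidates.items()}

for name, fits in cv_fits.items():
    print(f"{name}: {n_candidates[name]} candidates, {fits} cross-validation fits")
```

Under these assumptions the three searches run 900, 250, and 50 cross-validation fits respectively, so the gap in elapsed time reflects the different evaluation budgets as well as the search strategy.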