@nami3373
Last active October 5, 2022 17:35
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from scipy.optimize import minimize\n",
"from sklearn.metrics import mean_squared_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Random Numbers as a set of 5 predictions\n",
"Each prediction has 100,000 items."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Set random seed for reproduction\n",
"np.random.seed(2019)\n",
"pred_df = pd.DataFrame(np.random.rand(100000, 5), columns=['pred1', 'pred2', 'pred3', 'pred4', 'pred5'])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pred1</th>\n",
" <th>pred2</th>\n",
" <th>pred3</th>\n",
" <th>pred4</th>\n",
" <th>pred5</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.903482</td>\n",
" <td>0.393081</td>\n",
" <td>0.623970</td>\n",
" <td>0.637877</td>\n",
" <td>0.880499</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.299172</td>\n",
" <td>0.702198</td>\n",
" <td>0.903206</td>\n",
" <td>0.881382</td>\n",
" <td>0.405750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.452447</td>\n",
" <td>0.267070</td>\n",
" <td>0.162865</td>\n",
" <td>0.889215</td>\n",
" <td>0.148476</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.984723</td>\n",
" <td>0.032361</td>\n",
" <td>0.515351</td>\n",
" <td>0.201129</td>\n",
" <td>0.886011</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.513620</td>\n",
" <td>0.578302</td>\n",
" <td>0.299283</td>\n",
" <td>0.837197</td>\n",
" <td>0.526650</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pred1 pred2 pred3 pred4 pred5\n",
"0 0.903482 0.393081 0.623970 0.637877 0.880499\n",
"1 0.299172 0.702198 0.903206 0.881382 0.405750\n",
"2 0.452447 0.267070 0.162865 0.889215 0.148476\n",
"3 0.984723 0.032361 0.515351 0.201129 0.886011\n",
"4 0.513620 0.578302 0.299283 0.837197 0.526650"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pred_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Target Variable\n",
"Target is either 0 or 1. As an experiment, 100,000 values are randomly generated."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(2019)\n",
"y_train = pd.Series(np.random.randint(2, size=100000), name='target')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 1\n",
"3 1\n",
"4 0\n",
"Name: target, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.head()"
]
},
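{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate Mean Squared Error for each prediction\n",
"Each single prediction has an MSE of about 0.33, close to the 1/3 expected for uniform random values scored against a 0/1 target."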
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSE for pred1: 0.3339381\n",
"MSE for pred2: 0.3341939\n",
"MSE for pred3: 0.3320243\n",
"MSE for pred4: 0.3330230\n",
"MSE for pred5: 0.3343232\n"
]
}
],
"source": [
"for i in range(pred_df.shape[1]):\n",
"    print('MSE for {}: {:.7f}'.format(pred_df.columns[i],\n",
"                                      mean_squared_error(y_train, pred_df.iloc[:, i])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use Minimize Function of SciPy's Optimize to get Optimized Ensemble Weights for predictions"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"Y_values = y_train.values\n",
"predictions = []\n",
"lls = []\n",
"wghts = []"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Define the objective function: the MSE of the weighted sum of predictions, to be minimized by SciPy.\n",
"def mse_func(weights):\n",
"    ''' scipy minimize will pass the weights as a numpy array '''\n",
"    final_prediction = 0\n",
"    for weight, prediction in zip(weights, predictions):\n",
"        final_prediction += weight*prediction\n",
"\n",
"    return mean_squared_error(Y_values, final_prediction)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"for i in range(pred_df.shape[1]):\n",
"    predictions.append(np.array(pred_df.iloc[:, i]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use SLSQP as a solver\n",
"Inspired by the Kaggle kernels below:  \n",
"https://www.kaggle.com/hamzaben/tuned-random-forest-lasso-and-xgboost-regressors\n",
"https://www.kaggle.com/rishiarora/finding-ensemble-weights"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.7 s, sys: 0 ns, total: 2.7 s\n",
"Wall time: 2.7 s\n"
]
}
],
"source": [
"%%time\n",
"# Run the optimization 100 times, each from random starting weights.\n",
"for i in range(100):\n",
"    starting_values = np.random.uniform(size=pred_df.shape[1])\n",
"    # Equality constraint: the weights must sum to 1.\n",
"    cons = ({'type':'eq','fun':lambda w: 1-sum(w)})\n",
"    # Each weight is bounded to [0, 1].\n",
"    bounds = [(0,1)]*len(predictions)\n",
"\n",
"    res = minimize(mse_func, starting_values, constraints=cons,\n",
"                   bounds=bounds, method='SLSQP')\n",
"\n",
"    lls.append(res['fun'])\n",
"    wghts.append(res['x'])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Ensemble Score: 0.2668502\n",
"\n",
" Best Weights: [0.19684903 0.19570723 0.20853669 0.20412792 0.19477913]\n"
]
}
],
"source": [
"bestSC = np.min(lls)\n",
"bestWght = wghts[np.argmin(lls)]\n",
"\n",
"print('\\n Ensemble Score: {best_score:.7f}'.format(best_score=bestSC))\n",
"print('\\n Best Weights: {weights:}'.format(weights=bestWght))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSE: 0.2668502\n"
]
}
],
"source": [
"print('MSE: {:.7f}'.format(mean_squared_error(y_train, np.sum(bestWght * pred_df, axis=1))))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use L-BFGS-B as a solver\n",
"Inspired by the Kaggle kernel below:  \n",
"https://www.kaggle.com/tilii7/ensemble-weights-minimization-vs-mcmc"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.76 s, sys: 0 ns, total: 2.76 s\n",
"Wall time: 2.76 s\n"
]
}
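],
"source": [
"%%time\n",
"# Reset the result lists so the earlier SLSQP runs are not mixed in.\n",
"lls, wghts = [], []\n",
"# Run the optimization 100 times, each from random starting weights.\n",
"# L-BFGS-B takes bounds only, so the weights are not forced to sum to 1.\n",
"for i in range(100):\n",
"    starting_values = np.random.uniform(size=pred_df.shape[1])\n",
"\n",
"    bounds = [(0,1)]*len(predictions)\n",
"\n",
"    res = minimize(mse_func, starting_values, method='L-BFGS-B',\n",
"                   bounds=bounds, options={'disp': False, 'maxiter': 100000})\n",
"\n",
"    lls.append(res['fun'])\n",
"    wghts.append(res['x'])"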
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Ensemble Score: 0.2658374\n",
"\n",
" Best Weights: [0.18476566 0.18353229 0.19558398 0.19190282 0.18256912]\n"
]
}
],
"source": [
"bestSC = np.min(lls)\n",
"bestWght = wghts[np.argmin(lls)]\n",
"\n",
"print('\\n Ensemble Score: {best_score:.7f}'.format(best_score=bestSC))\n",
"print('\\n Best Weights: {weights:}'.format(weights=bestWght))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSE: 0.2658374\n"
]
}
],
"source": [
"print('MSE: {:.7f}'.format(mean_squared_error(y_train, np.sum(bestWght * pred_df, axis=1))))"
]
},
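{
"cell_type": "markdown",
"metadata": {},
"source": [
"MSE improves from about 0.33 for a single prediction to about 0.27 for the ensemble. 'L-BFGS-B' scores slightly better than 'SLSQP' (0.2658 vs. 0.2669), but it runs without the sum-to-one constraint, so its best weights sum to about 0.94 rather than 1."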
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
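
The notebook's numbers can be checked with a short standalone sketch. It is not part of the original notebook: the generator, seed, sample size, and the names `preds`, `target`, and `mse` are assumptions made for illustration. If the five predictions are independent uniform noise and the target is a fair 0/1 draw, an equal-weight average should score about 1/4 + 1/60 ≈ 0.2667, and the bounds-only optimum should sit near 0.1875 per weight (sum ≈ 0.94) with an MSE of about 0.2656, which is in line with the outputs recorded above.

```python
import numpy as np
from scipy.optimize import minimize

# Assumption: a fresh generator, seed, and sample size chosen for this sketch only.
rng = np.random.default_rng(0)
n_samples, n_models = 20_000, 5
preds = rng.random((n_samples, n_models))     # stand-ins for pred1..pred5
target = rng.integers(0, 2, size=n_samples)   # random 0/1 target


def mse(weights):
    """MSE of the weighted sum of the prediction columns."""
    return float(np.mean((preds @ weights - target) ** 2))


equal_weights = np.full(n_models, 1.0 / n_models)
single_mse = [mse(np.eye(n_models)[i]) for i in range(n_models)]
equal_mse = mse(equal_weights)

bounds = [(0.0, 1.0)] * n_models
sum_to_one = {'type': 'eq', 'fun': lambda w: 1.0 - np.sum(w)}

# SLSQP handles both the bounds and the equality constraint.
res_slsqp = minimize(mse, equal_weights, method='SLSQP',
                     bounds=bounds, constraints=sum_to_one)
# L-BFGS-B handles bounds only, so the weights are free to sum to less than 1.
res_lbfgsb = minimize(mse, equal_weights, method='L-BFGS-B', bounds=bounds)

print('single-model MSEs:', np.round(single_mse, 4))
print('equal-weight MSE :', round(equal_mse, 4))
print('SLSQP    MSE, weight sum:', round(res_slsqp.fun, 4), round(res_slsqp.x.sum(), 4))
print('L-BFGS-B MSE, weight sum:', round(res_lbfgsb.fun, 4), round(res_lbfgsb.x.sum(), 4))
```

Because the inputs are pure noise, the gain over a single prediction comes from averaging alone, and the small extra gain from L-BFGS-B comes from shrinking the weights below a sum of 1, not from a better solver.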