@esjacobs
Last active July 31, 2018 19:31
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modeling: The Movie\n",
"\n",
"(Go to the READ.ME of this repository for the entire write-up.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For modeling, I took the practice of throwing everything at the wall and seeing what worked. I imported many different models, including linear regression, lasso, SGD regressor, bagging regressor, random forrest regressor, SVR, and adaboost regressor, as well as classifiers including logistic regression, random forest classifier, adaboost classifier, k-nearest neighbors classifier, decision tree classifier, and even a neural network. "
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"import imdb\n",
"import warnings\n",
"warnings.simplefilter(\"ignore\")\n",
"import re\n",
"import pandas as pd\n",
"import numpy as np\n",
"import ast\n",
"from datetime import datetime, timedelta\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.linear_model import LinearRegression, Lasso, LassoCV, SGDRegressor\n",
"from sklearn.feature_selection import RFE\n",
"from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor\n",
"from sklearn.metrics import mean_squared_error, f1_score\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense, Dropout, Activation\n",
"from keras.utils import np_utils\n",
"from sklearn.linear_model import LogisticRegression, LogisticRegressionCV\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier\n",
"import matplotlib.pyplot as plt \n",
"import seaborn as sns\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I brought in my six DataFrames:\n",
"1. 1 df 2 = directors and actors weighted, , deleted columns with 1 or fewer terms\n",
"2. 2 df 2 = directors and actors weighted, deleted columns with 1 or fewer terms\n",
"3. 3 df 2 = directors and actors and writers weighted, deleted columns with 1 or fewer terms\n",
"4. 1 df 3 = directors and actors weighted, , deleted columns with 2 or fewer terms\n",
"5. 2 df 2 = directors and actors weighted, deleted columns with 2 or fewer terms\n",
"6. 3 df 2 = directors and actors and writers weighted, deleted columns with 2 or fewer terms"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"# Pre-made DataFrames with directors weighted\n",
"\n",
"# X_train = pd.read_csv('train_everything_director_weights_df2.csv') # 1\n",
"# X_test = pd.read_csv('test_everything_director_weights_df2.csv') # 1\n",
"# X_train = pd.read_csv('train_everything_director_actor_weights_df2.csv') # 2\n",
"# X_test = pd.read_csv('test_everything_director_actor_weights_df2.csv') # 2 \n",
"X_train = pd.read_csv('train_everything_director_actor_writer_weights_df2.csv') # 3\n",
"X_test = pd.read_csv('test_everything_director_actor_writer_weights_df2.csv') # 3\n",
"# X_train = pd.read_csv('train_everything_director_weights_df3.csv') # 4\n",
"# X_test = pd.read_csv('test_everything_director_weights_df3.csv') # 4\n",
"# X_train = pd.read_csv('train_everything_director_actor_weights_df3.csv') # 5\n",
"# X_test = pd.read_csv('test_everything_director_actor_weights_df3.csv') # 5\n",
"# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df3.csv') # 6\n",
"# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df3.csv') # 6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I then fed the DataFrames through the following cell, which gave us three regressor scores, then transformed my y variable for classification (based on median Metacritic score) and fed that through three classifiers. Throughout this process many models were attempted and thrown out. DataFrames were changed and had to be saved again and reloaded. At the end of the day I decided on the following models:\n",
"\n",
"- Regression\n",
" - Bagging Regressor\n",
" - Random Forest Regressor\n",
" - LASSO\n",
"- Classification\n",
" - Logistic Regression\n",
" - Bagging Classifier\n",
" - Random Forest Classifier\n",
" \n",
"Except for LASSO and logistic regression, there wasn't much rhyme or reason for modeling choices. These just gave us the best relative scores (of the ones I tried), and also didn't take a huge amount of time. Also, the bagging regressor and classifier, which didn't seem to ever give us scores that were as good as the other models, still worked quickly and served as a veritable canary in a coal mine, warning us if something had gone wrong with the models. "
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"y_train = X_train.Metascore\n",
"y_test = X_test.Metascore\n",
"\n",
"X_train.drop(['Metascore'], axis=1, inplace=True)\n",
"X_test.drop(['Metascore'], axis=1, inplace=True)\n",
"\n",
"ss = StandardScaler()\n",
"X_train = ss.fit_transform(X_train)\n",
"X_test = ss.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"br test score: \n",
"0.03739589734130877\n",
"\n",
"rf test score: \n",
"-0.02765041735558893\n",
"\n",
"lasso test score: \n",
"0.21460320119560503\n",
"\n",
"logreg test score: \n",
"0.6854026845637584\n",
"\n",
"br test score: \n",
"0.6593959731543624\n",
"\n",
"rf test score: \n",
"0.6753355704697986\n",
"\n"
]
}
],
"source": [
"br = BaggingRegressor()\n",
"br.fit(X_train, y_train)\n",
"# print('br train score: ')\n",
"# print(br.score(X_train, y_train))\n",
"print('br test score: ')\n",
"print(br.score(X_test, y_test))\n",
"print()\n",
"\n",
"rf = RandomForestRegressor()\n",
"rf.fit(X_train, y_train)\n",
"# print('rf train score: ')\n",
"# print(rf.score(X_train, y_train))\n",
"print('rf test score: ')\n",
"print(rf.score(X_test, y_test))\n",
"print()\n",
"\n",
"lasso = Lasso(.15)\n",
"lasso.fit(X_train, y_train)\n",
"# print('rf train score: ')\n",
"# print(rf.score(X_train, y_train))\n",
"print('lasso test score: ')\n",
"print(lasso.score(X_test, y_test))\n",
"print()\n",
"\n",
"median = np.median(y_train)\n",
"\n",
"new_y = []\n",
"for n in y_train:\n",
" if n > median:\n",
" new_y.append(1)\n",
" else:\n",
" new_y.append(0)\n",
"y_train = new_y\n",
"\n",
"new_y = []\n",
"for n in y_test:\n",
" if n > median:\n",
" new_y.append(1)\n",
" else:\n",
" new_y.append(0)\n",
"y_test = new_y\n",
"\n",
"logreg = LogisticRegression() \n",
"logreg.fit(X_train, y_train)\n",
"# print('logreg train score: ')\n",
"# print(logreg.score(X_train, y_train))\n",
"print('logreg test score: ')\n",
"print(logreg.score(X_test, y_test))\n",
"print()\n",
"\n",
"br = BaggingClassifier()\n",
"br.fit(X_train, y_train)\n",
"# print('br train score: ')\n",
"# print(br.score(X_train, y_train))\n",
"print('br test score: ')\n",
"print(br.score(X_test, y_test))\n",
"print()\n",
"\n",
"rf = RandomForestClassifier()\n",
"rf.fit(X_train, y_train)\n",
"# print('rf train score: ')\n",
"# print(rf.score(X_train, y_train))\n",
"print('rf test score: ')\n",
"print(rf.score(X_test, y_test))\n",
"print()\n",
"\n",
"# 1 reg \n",
"\n",
"# br test score: \n",
"# 0.038342711196289625\n",
"\n",
"# rf test score: \n",
"# 0.11832620794676674\n",
"\n",
"# lasso test score: \n",
"# 0.19244316790430385\n",
"\n",
"# 1 class \n",
"\n",
"# logreg test score: \n",
"# 0.6736577181208053\n",
"\n",
"# br test score: \n",
"# 0.662751677852349\n",
"\n",
"# rf test score: \n",
"# 0.6753355704697986\n",
"\n",
"# 2 reg \n",
"\n",
"# br test score: \n",
"# 0.006896130293002622\n",
"\n",
"# rf test score: \n",
"# 0.07139091002869702\n",
"\n",
"# lasso test score: \n",
"# 0.1924431679043039\n",
"\n",
"# 2 class\n",
"\n",
"# logreg test score: \n",
"# 0.6736577181208053\n",
"\n",
"# br test score: \n",
"# 0.6375838926174496\n",
"\n",
"# rf test score: \n",
"# 0.6585570469798657\n",
"\n",
"# 3 reg\n",
"\n",
"# br test score: \n",
"# 0.05994540328342234\n",
"\n",
"# rf test score: \n",
"# -0.03186605837286138\n",
"\n",
"# lasso test score: \n",
"# 0.1924431679043039\n",
"\n",
"# 3 class\n",
"\n",
"# logreg test score: \n",
"# 0.6736577181208053\n",
"\n",
"# br test score: \n",
"# 0.6384228187919463\n",
"\n",
"# rf test score: \n",
"# 0.6719798657718121\n",
"\n",
"# 4 reg\n",
"\n",
"# br test score: \n",
"# 0.023266042810753954\n",
"\n",
"# rf test score: \n",
"# 0.07619378931494514\n",
"\n",
"# lasso test score: \n",
"# 0.21460320119560472\n",
"\n",
"# 4 class \n",
"\n",
"# logreg test score: \n",
"# 0.6854026845637584\n",
"\n",
"# br test score: \n",
"# 0.6434563758389261\n",
"\n",
"# rf test score: \n",
"# 0.6375838926174496\n",
"\n",
"# 5 reg\n",
"\n",
"# br test score: \n",
"# 0.005276011558945859\n",
"\n",
"# rf test score: \n",
"# 0.03497975713168888\n",
"\n",
"# lasso test score: \n",
"# 0.21460320119560497\n",
"\n",
"# 5 class \n",
"\n",
"# logreg test score: \n",
"# 0.6854026845637584\n",
"\n",
"# br test score: \n",
"# 0.6518456375838926\n",
"\n",
"# rf test score: \n",
"# 0.6610738255033557\n",
"\n",
"# 6 reg\n",
"\n",
"# br test score: \n",
"# 0.03739589734130877\n",
"\n",
"# rf test score: \n",
"# -0.02765041735558893\n",
"\n",
"# lasso test score: \n",
"# 0.21460320119560503\n",
"\n",
"# 6 class\n",
"\n",
"# logreg test score: \n",
"# 0.6854026845637584\n",
"\n",
"# br test score: \n",
"# 0.6593959731543624\n",
"\n",
"# rf test score: \n",
"# 0.6753355704697986"
]
},
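{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the scoring cell above could be condensed: loop over a dictionary of estimators and collect train and test scores in one table. It runs on synthetic data from `make_regression` as a stand-in for the movie features, and the names `score_models`, `models`, `X_demo`, and `y_demo` are hypothetical, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.ensemble import BaggingRegressor, RandomForestRegressor\n",
"from sklearn.linear_model import Lasso\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"def score_models(models, X_tr, X_te, y_tr, y_te):\n",
"    # Fit each estimator and record its .score() on the train and test splits\n",
"    rows = {}\n",
"    for name, model in models.items():\n",
"        model.fit(X_tr, y_tr)\n",
"        rows[name] = {\n",
"            'train': model.score(X_tr, y_tr),\n",
"            'test': model.score(X_te, y_te),\n",
"        }\n",
"    # One row per model, one column per split\n",
"    return pd.DataFrame(rows).T\n",
"\n",
"\n",
"# Synthetic stand-in data (an assumption; the notebook itself uses the movie CSVs)\n",
"X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)\n",
"Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)\n",
"\n",
"models = {\n",
"    'br': BaggingRegressor(random_state=0),\n",
"    'rf': RandomForestRegressor(n_estimators=50, random_state=0),\n",
"    'lasso': Lasso(alpha=.15),\n",
"}\n",
"score_models(models, Xtr, Xte, ytr, yte)"
]
},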
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cap_mods = pd.read_csv('capstone_models_1.csv')\n",
"ap_mods.columns = ['', '1 df 3', '2 df 3', '3 df 3', '1 df 2', '2 df2 ',\n",
" '3 df 2 ']\n",
"cap_mods = cap_mods.set_index('')\n",
"cap_mods_class = cap_mods.iloc[3:,:].copy()\n",
"cap_mods_reg = cap_mods.iloc[:3,:].copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.set_style(\"darkgrid\",{\"xtick.color\":\"black\", \"ytick.color\":\"black\"})\n",
"plt.figure(figsize=(10,5))\n",
"sns.heatmap(cap_mods_reg, annot = True, cmap=\"Greens\")\n",
"# plt.tick_params(color='white', labelcolor='white');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sns.set_style(\"dark\",{\"xtick.color\":\"white\", \"ytick.color\":\"white\"})\n",
"plt.figure(figsize=(10,5))\n",
"sns.heatmap(cap_mods_class, annot = True, cmap = \"Blues\")\n",
"# plt.tick_params(color='white', labelcolor='white');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After analyzing the output from my models, I decided to use the 3 df 2 DataFrame, aka, # 3, to tune hyperparameters on. Similarly, I tuned classifers on random forest, logreg, and LASSO, omitting the others for time. Frankly, the differences between performance is largely negligible, but I had might as well take the .02 bump provided by my best models. "
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>1 df 3</th>\n",
" <th>2 df 3</th>\n",
" <th>3 df 3</th>\n",
" <th>1 df 2</th>\n",
" <th>2 df2</th>\n",
" <th>3 df 2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>br reg</td>\n",
" <td>0.038343</td>\n",
" <td>0.006896</td>\n",
" <td>0.059945</td>\n",
" <td>0.023266</td>\n",
" <td>0.005276</td>\n",
" <td>0.037396</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>rf reg</td>\n",
" <td>0.118326</td>\n",
" <td>0.071391</td>\n",
" <td>-0.031866</td>\n",
" <td>0.076194</td>\n",
" <td>0.034980</td>\n",
" <td>-0.027650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>lasso reg</td>\n",
" <td>0.192443</td>\n",
" <td>0.192443</td>\n",
" <td>0.192443</td>\n",
" <td>0.214603</td>\n",
" <td>0.214603</td>\n",
" <td>0.214603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>logreg class</td>\n",
" <td>0.673658</td>\n",
" <td>0.673658</td>\n",
" <td>0.673658</td>\n",
" <td>0.685403</td>\n",
" <td>0.685403</td>\n",
" <td>0.685403</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>br class</td>\n",
" <td>0.662752</td>\n",
" <td>0.637584</td>\n",
" <td>0.638423</td>\n",
" <td>0.643456</td>\n",
" <td>0.651846</td>\n",
" <td>0.659396</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>rf class</td>\n",
" <td>0.675336</td>\n",
" <td>0.658557</td>\n",
" <td>0.671980</td>\n",
" <td>0.637584</td>\n",
" <td>0.661074</td>\n",
" <td>0.675336</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 1 df 3 2 df 3 3 df 3 1 df 2 2 df2 3 df 2 \n",
"0 br reg 0.038343 0.006896 0.059945 0.023266 0.005276 0.037396\n",
"1 rf reg 0.118326 0.071391 -0.031866 0.076194 0.034980 -0.027650\n",
"2 lasso reg 0.192443 0.192443 0.192443 0.214603 0.214603 0.214603\n",
"3 logreg class 0.673658 0.673658 0.673658 0.685403 0.685403 0.685403\n",
"4 br class 0.662752 0.637584 0.638423 0.643456 0.651846 0.659396\n",
"5 rf class 0.675336 0.658557 0.671980 0.637584 0.661074 0.675336"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cap_mods"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.697986577181208\n",
"0.6808888888888889\n",
"{'max_depth': 1000, 'max_features': 2, 'n_estimators': 200}\n"
]
}
],
"source": [
"# y_train = X_train.Metascore\n",
"# y_test = X_test.Metascore\n",
"\n",
"# X_train.drop(['Metascore'], axis=1, inplace=True)\n",
"# X_test.drop(['Metascore'], axis=1, inplace=True)\n",
"\n",
"# ss = StandardScaler()\n",
"# X_train = ss.fit_transform(X_train)\n",
"# X_test = ss.transform(X_test)\n",
"\n",
"# median = np.median(y_train)\n",
"\n",
"# new_y = []\n",
"# for n in y_train:\n",
"# if n > median:\n",
"# new_y.append(1)\n",
"# else:\n",
"# new_y.append(0)\n",
"# y_train = new_y\n",
"\n",
"# new_y = []\n",
"# for n in y_test:\n",
"# if n > median:\n",
"# new_y.append(1)\n",
"# else:\n",
"# new_y.append(0)\n",
"# y_test = new_y\n",
"\n",
"rf_params = {\n",
" 'max_depth': [None],\n",
" 'n_estimators': [200],\n",
" 'max_features': [2, 10],\n",
"}\n",
"\n",
"gs = GridSearchCV(rf, param_grid=rf_params)\n",
"gs.fit(X_train, y_train)\n",
"print(gs.score(X_test, y_test))\n",
"print(gs.best_score_)\n",
"print(gs.best_params_)qwa\n",
"\n",
"# 0.7030201342281879\n",
"# 0.667\n",
"# {'max_depth': None, 'max_features': 10, 'n_estimators': 200}\n",
"\n",
"# 0.697986577181208\n",
"# 0.6808888888888889\n",
"# {'max_depth': 1000, 'max_features': 2, 'n_estimators': 200}"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.21850824761335874"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lasso = LassoCV()\n",
"lasso.fit(X_train, y_train)\n",
"lasso.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lasso = Lasso()\n",
"lasso_params = {\n",
" 'alphas': [None, .15],\n",
"}\n",
"\n",
"gs = GridSearchCV(lasso, param_grid=lasso_params)\n",
"gs.fit(X_train, y_train)\n",
"print(gs.score(X_test, y_test))\n",
"print(gs.best_score_)\n",
"print(gs.best_params_)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"logreg_params = {\n",
" 'penalty': ['l1'],\n",
" 'C': [10, 100],\n",
"}\n",
"\n",
"gs = GridSearchCV(logreg, param_grid=logreg_params)\n",
"gs.fit(X_train, y_train)\n",
"print(gs.score(X_test, y_test))\n",
"print(gs.best_score_)\n",
"print(gs.best_params_)\n",
"\n",
"# 0.6971476510067114\n",
"# 0.6945555555555556\n",
"# {'C': 10, 'penalty': 'l1'}\n",
"\n",
"# 0.7055369127516778\n",
"# 0.6921111111111111\n",
"# {'C': 10, 'penalty': 'l1'}"
]
},
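{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of an alternative tuning setup, not what was run above: wrapping the scaler and the classifier in a `Pipeline` so that scaling is refit inside every cross-validation fold, and naming the `liblinear` solver explicitly because not every solver supports the l1 penalty. The synthetic data from `make_classification`, the wider `C` grid, and the names `pipe`, `demo_params`, and `demo_gs` are assumptions for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV, train_test_split\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic stand-in data (an assumption; not the movie features)\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)\n",
"Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)\n",
"\n",
"pipe = Pipeline([\n",
"    ('ss', StandardScaler()),\n",
"    ('logreg', LogisticRegression(penalty='l1', solver='liblinear')),\n",
"])\n",
"\n",
"# Step-prefixed names route each parameter to the right pipeline step\n",
"demo_params = {'logreg__C': [.1, 1, 10, 100]}\n",
"\n",
"demo_gs = GridSearchCV(pipe, param_grid=demo_params, cv=5)\n",
"demo_gs.fit(Xtr, ytr)\n",
"print(demo_gs.score(Xte, yte))  # accuracy on the held-out split\n",
"print(demo_gs.best_score_)      # mean cross-validated accuracy\n",
"print(demo_gs.best_params_)"
]
},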
{
"cell_type": "markdown",
"metadata": {},
"source": [
"my best classifier (logreg) accuracy was \n",
"\n",
"0.6945555555555556 \n",
"\n",
"using C = 10 with an l1 penalty. \n",
"\n",
"And my best regression R$^2$ score was \n",
"\n",
"0.21460320119560503\n",
"\n",
"with an $\\alpha$ = .15\n",
"\n",
"There is no reason I shouldn't be able to achieve better than this given more time in the future. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Future recommendations are numerous. There are many different ways possible to make this score better, the only constraint being time. \n",
"\n",
"In terms of data collection, there are several other large databases to access, including imdb's itself as well as Metacritic's. It is entirely possible I have all the Metacritic scores, but I could always use more. Plus, Metacritic has statistics such as whether the movie is part of a franchise and how well the previous film did. I can, of course, make that data myself, but again, time is a factor here.\n",
"\n",
"I would also like access to more of the cast and crew including producers, cinematographers, composers, editors, and more of the cast. After all, the theory underlying this entire endeavous is that people make movies and people are consistent in their product. \n",
"\n",
"I could impute null values, especially with things like box office revenue, opening weekend box office revenue, Rotten Tomatoes scores, which could all replace Metacritic scores as the target variable. It would then be a simple mapping from one to the other. There could easily be more Rotten Tomatoes scores than Metacritic.\n",
"\n",
"In terms of feature engineering, there are always more columns to make. I could use polynomial features on my numerical data. I could just use directors and writers. I could run more n-grams on the titles. I could change my min_dfs per column. I could sift down out list of actor weights. I could go back and try to get the actors averages like before. \n",
"\n",
"Finally, there are more models for me to use. Several will allow me to tune hyperparameters to eek out better scores. There are models that work better with NLP. I can try a neural network for both classification and regression. I can try a passive aggressive classifer. And I'll do all that and I'll predict movie scores and eventually they'll make a movie about me. \n",
"\n",
"And that's my capstone! Wasn't it great? "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}