@georgeodsc
Created January 18, 2018 17:59
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Import libraries. Peyton is Throne AI's Python library, so much you run `pip install peyton` before using it."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"import peyton\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.cross_validation import cross_val_score\n",
"from sklearn.metrics import log_loss"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Assign API token to a variable. You'll be assigned a token when you sign up for an account."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"API_TOKEN = \"\""
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Connect to Throne AI using your username and API token."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"throne = peyton.Throne(username='GeorgeMcIntire', token=API_TOKEN)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Download both the historical data, which you'll use to train a model, and the competition data, which you'll use to make predictions."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# Get historical data for the NFL\n",
"throne.competition('NFL').get_historical_data()\n",
"my_historical_data = throne.competition.historical_data\n",
"\n",
"# Get competition data for the upcoming conference championship games this weekend\n",
"throne.competition('NFL').get_competition_data()\n",
"my_competition_data = throne.competition.competition_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"View the data and some basic information about it."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>team_1_name</th>\n",
" <th>team_1_score</th>\n",
" <th>team_2_name</th>\n",
" <th>team_2_score</th>\n",
" <th>d_ability_1</th>\n",
" <th>is_november</th>\n",
" <th>d_schedule_1</th>\n",
" <th>d_form_3</th>\n",
" <th>p_ability_4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>596a03ab48250836899582ca</td>\n",
" <td>2009-09-11 19:30:00</td>\n",
" <td>Pittsburgh Steelers</td>\n",
" <td>13.0</td>\n",
" <td>Tennessee Titans</td>\n",
" <td>10.0</td>\n",
" <td>-0.005721</td>\n",
" <td>-0.551079</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.005131</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>596a03ab48250836899582c9</td>\n",
" <td>2009-09-14 12:00:00</td>\n",
" <td>Tampa Bay Buccaneers</td>\n",
" <td>21.0</td>\n",
" <td>Dallas Cowboys</td>\n",
" <td>34.0</td>\n",
" <td>-0.005721</td>\n",
" <td>-0.551079</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.005131</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>596a03ab48250836899582c1</td>\n",
" <td>2009-09-14 12:00:00</td>\n",
" <td>Atlanta Falcons</td>\n",
" <td>19.0</td>\n",
" <td>Miami Dolphins</td>\n",
" <td>7.0</td>\n",
" <td>-0.005721</td>\n",
" <td>-0.551079</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.005131</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>596a03ab48250836899582c2</td>\n",
" <td>2009-09-14 12:00:00</td>\n",
" <td>Baltimore Ravens</td>\n",
" <td>38.0</td>\n",
" <td>Kansas City Chiefs</td>\n",
" <td>24.0</td>\n",
" <td>-0.005721</td>\n",
" <td>-0.551079</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.005131</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>596a03ab48250836899582c3</td>\n",
" <td>2009-09-14 12:00:00</td>\n",
" <td>Carolina Panthers</td>\n",
" <td>10.0</td>\n",
" <td>Philadelphia Eagles</td>\n",
" <td>38.0</td>\n",
" <td>-0.005721</td>\n",
" <td>-0.551079</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.005131</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id date team_1_name \\\n",
"0 596a03ab48250836899582ca 2009-09-11 19:30:00 Pittsburgh Steelers \n",
"1 596a03ab48250836899582c9 2009-09-14 12:00:00 Tampa Bay Buccaneers \n",
"2 596a03ab48250836899582c1 2009-09-14 12:00:00 Atlanta Falcons \n",
"3 596a03ab48250836899582c2 2009-09-14 12:00:00 Baltimore Ravens \n",
"4 596a03ab48250836899582c3 2009-09-14 12:00:00 Carolina Panthers \n",
"\n",
" team_1_score team_2_name team_2_score d_ability_1 is_november \\\n",
"0 13.0 Tennessee Titans 10.0 -0.005721 -0.551079 \n",
"1 21.0 Dallas Cowboys 34.0 -0.005721 -0.551079 \n",
"2 19.0 Miami Dolphins 7.0 -0.005721 -0.551079 \n",
"3 38.0 Kansas City Chiefs 24.0 -0.005721 -0.551079 \n",
"4 10.0 Philadelphia Eagles 38.0 -0.005721 -0.551079 \n",
"\n",
" d_schedule_1 d_form_3 p_ability_4 \n",
"0 NaN NaN 0.005131 \n",
"1 NaN NaN 0.005131 \n",
"2 NaN NaN 0.005131 \n",
"3 NaN NaN 0.005131 \n",
"4 NaN NaN 0.005131 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_historical_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2401 entries, 0 to 2400\n",
"Data columns (total 11 columns):\n",
"id 2401 non-null object\n",
"date 2401 non-null object\n",
"team_1_name 2401 non-null object\n",
"team_1_score 2401 non-null float64\n",
"team_2_name 2401 non-null object\n",
"team_2_score 2401 non-null float64\n",
"d_ability_1 2401 non-null float64\n",
"is_november 2401 non-null float64\n",
"d_schedule_1 2384 non-null float64\n",
"d_form_3 2365 non-null float64\n",
"p_ability_4 2401 non-null float64\n",
"dtypes: float64(7), object(4)\n",
"memory usage: 206.4+ KB\n"
]
}
],
"source": [
"my_historical_data.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Throne provides access to data on 2400 NFL games going back to 2009."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Drop the null values from the historical data."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"my_historical_data.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Create an outcome variable by subtracting the score of team 2 (away team) and team 1 (home team) followed by assigning a \"W\" to positive score differences, a \"L\" to negative differences, and a \"T\" to ties."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"df = my_historical_data.copy()\n",
"\n",
"df[\"score\"] = df.team_1_score - df.team_2_score\n",
"\n",
"def decider(x):\n",
" if x > 0:\n",
" return \"W\"\n",
" elif x == 0:\n",
" return \"T\"\n",
" else:\n",
" return \"L\"\n",
" \n",
"df[\"outcome\"] = df.score.apply(decider)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Drop all matches that ended in ties."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"df = df[df.outcome!= \"T\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Assign the features to a variable X and the outcome column to a variable y."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['d_ability_1', 'is_november', 'd_schedule_1', 'd_form_3',\n",
" 'p_ability_4'],\n",
" dtype='object')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Feature variables\n",
"df.columns[6:-2]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"feature_cols = df.columns[6:-2]\n",
"\n",
"X = df[feature_cols]\n",
"\n",
"y = df.outcome"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"We're ready to start modeling, but first let's check the null accuracy."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.57330508474576269"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.value_counts(normalize=True).max()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Fit a logistic regression model on the whole dataset and then score it."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.6580508474576271"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lr = LogisticRegression()\n",
"lr.fit(X, y)\n",
"lr.score(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Cross validate with 5-folds and accuracy as your the metric."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.65636009753962143"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(LogisticRegression(), X, y, cv = 5, scoring=\"accuracy\").mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Given that the cross-validated accuracy score is statistically the same the training score, this indicates that we didn't build an overfit model. However since log loss is Throne AI's preferred metric, we have to evaluate our model using that."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.61467008189480055"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lr = LogisticRegression()\n",
"lr.fit(X, y)\n",
"preds = lr.predict_proba(X)[:, 1]\n",
"log_loss(y, preds)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.61720688532237999"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"-cross_val_score(LogisticRegression(), X, y, cv = 5, scoring=\"neg_log_loss\").mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"With log loss as our metric, our model still does not overfit. Now let's transform the testing or competition data and make predictions with it."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.23622915, 0.76377085],\n",
" [ 0.47868876, 0.52131124]])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test = my_competition_data[feature_cols]\n",
"\n",
"preds = lr.predict_proba(X_test)\n",
"preds"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"It's important to remember here that the second column represents the probabilites for the \"W\" class and the first is for class \"L\", they're ordered alphabetically. Now let's input the predictions into the original competition dataset."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"button": false,
"collapsed": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>team_1_name</th>\n",
" <th>team_2_name</th>\n",
" <th>d_ability_1</th>\n",
" <th>is_november</th>\n",
" <th>d_schedule_1</th>\n",
" <th>d_form_3</th>\n",
" <th>p_ability_4</th>\n",
" <th>team_1_prob</th>\n",
" <th>team_2_prob</th>\n",
" <th>confidence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5a5c606d9772221738ebcfd1</td>\n",
" <td>2018-01-22 15:05:00</td>\n",
" <td>New England Patriots</td>\n",
" <td>Jacksonville Jaguars</td>\n",
" <td>1.217135</td>\n",
" <td>-0.551079</td>\n",
" <td>1.038469</td>\n",
" <td>2.261745</td>\n",
" <td>0.922463</td>\n",
" <td>0.236229</td>\n",
" <td>0.763771</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5a5c606d9772221738ebcfd2</td>\n",
" <td>2018-01-22 18:40:00</td>\n",
" <td>Philadelphia Eagles</td>\n",
" <td>Minnesota Vikings</td>\n",
" <td>0.028317</td>\n",
" <td>-0.551079</td>\n",
" <td>-0.346406</td>\n",
" <td>0.291636</td>\n",
" <td>-0.417000</td>\n",
" <td>0.478689</td>\n",
" <td>0.521311</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id date team_1_name \\\n",
"0 5a5c606d9772221738ebcfd1 2018-01-22 15:05:00 New England Patriots \n",
"1 5a5c606d9772221738ebcfd2 2018-01-22 18:40:00 Philadelphia Eagles \n",
"\n",
" team_2_name d_ability_1 is_november d_schedule_1 d_form_3 \\\n",
"0 Jacksonville Jaguars 1.217135 -0.551079 1.038469 2.261745 \n",
"1 Minnesota Vikings 0.028317 -0.551079 -0.346406 0.291636 \n",
"\n",
" p_ability_4 team_1_prob team_2_prob confidence \n",
"0 0.922463 0.236229 0.763771 1.0 \n",
"1 -0.417000 0.478689 0.521311 1.0 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_competition_data.team_1_prob = preds[: , 0]\n",
"my_competition_data.team_2_prob = preds[: , 1]\n",
"my_competition_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"The probability values should no longer be null values and should be the probabilities you made with your model. Once you're ready, you can submit your predictions using the Throne API as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"throne.competition('NFL').submit(my_competition_data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"In addition I highly recommend saving your predictions and your model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"my_competition_data.to_csv(\"predictions_submissions.csv\")\n",
"from sklearn.externals import joblib\n",
"filename = 'nfl_model.sav'\n",
"joblib.dump(model, filename)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}