{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Logistic Regression - Python Demonstration\n",
"**This script will run a logistic regression on our demo dataset. The goal is to predict whether a customer will buy a subscription to a magazine based on a number of factors.**\n",
"\n",
"###Steps:\n",
"1. Import necessary packages\n",
" - We need the entire `pandas` and 'statsmodels.api' packages, but only specific *names* from the `math` and `sklearn` modules\n",
" - Imports can be confusing to python beginners, for some additional details [see here](http://stackoverflow.com/a/21547572)\n",
"2. Read in the data sets\n",
" - Also generate some summary statistics for info and data validation\n",
" - Add in an 'intercept' column as a constant\n",
"3. Split our data into training and testing data sets using the `StratifiedShuffleSplit` from `sklearn`\n",
"4. Perform a logistic regression\n",
" - Review results\n",
" - Structure results into a readable format to view significant variables and interpret coefficients\n",
"5. Predict the outcome in the test data set and validate the model.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###1. Import necessary packages"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# import relevant packages\n",
"\n",
"import pandas as pd # for basic data frame functionality\n",
"import statsmodels.api as sm # for statistical modeling functionality with R-like syntax \n",
"\n",
"from math import exp # for exponentiating (not natively implemented in python)\n",
"from sklearn.cross_validation import StratifiedShuffleSplit # to form stratified train and test data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###2. Read in the data sets"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = pd.read_excel('LogRegData.xlsx') # using the read_excel function from pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Generate summary statistics: What proportion of our dataset bought a subscription? This will give us a baseline for our model.\n",
"\n",
"**What's the code doing?** Using the 'Buy' column from `data`, perform the `value_counts` function on it. `value_counts` has a `normalize` parameter which outputs the percentages rather than raw counts of the aggregation."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.814264\n",
"1 0.185736\n",
"dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Buy'].value_counts(normalize=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###2b. Add an intercept to our data set\n",
"For some reason, the models in the `statsmodels` package don't automatically add an intercept. So we have to add a column with a constant value to account for the intercept in the model."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data['(Intercept)'] = 1.0"
]
},
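{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aside:** `statsmodels` ships a helper for exactly this step, `sm.add_constant`, which appends a column named `const`. A minimal sketch (the `dataAlt` name is just for illustration; we stick with our manual `(Intercept)` column in the rest of this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Equivalent approach: let statsmodels add the constant column ('const').\n",
"# Done on a copy without our manual column, to avoid two constant columns.\n",
"dataAlt = sm.add_constant(data.drop('(Intercept)', axis=1))\n",
"dataAlt.columns"
]
},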
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Split our data into training and testing data sets using the `StratifiedShuffleSplit` from `sklearn`\n",
"**Note:** Declaring objects before using them is an odd concept for python beginners. What we're going to first do is define `sss` as a `StratifiedShuffleSplit` class. A class is an object that has associated functions or objects within it. The `StratifiedShuffleSplit` class performs a split that retains the stratified proportions of the target variable. It returns an index of values from the data frame that we then have to subselect. Essentially it's telling us how to cut the data set, not doing the actual cutting."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sss = StratifiedShuffleSplit(data['Buy'], n_iter=1, test_size=0.25, random_state=333)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to split the dataset using the results from our `StratifiedShuffleSplit`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for train_index, test_index in sss:\n",
" dataTrain, dataTest = data.ix[train_index], data.ix[test_index]"
]
},
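{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (output not shown here): if the stratification worked, the training set's `Buy` proportions should be close to the full data set's (roughly 81%/19% above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The stratified split should preserve the target's class proportions\n",
"dataTrain['Buy'].value_counts(normalize=True)"
]
},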
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###4. Perform logistic regression\n",
"First we need to define our dependent and independent variables. We're going to define a list of our predictor columns by using python's [awesome list indexing](http://effbot.org/zone/python-list.htm). If we have a list of items (like columns in our dataset), we can select all columns from the first column to the end by appending it with `[1:]`."
]
},
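{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick slicing sketch with a throwaway list (these names are just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# [1:] keeps everything from index 1 (the second item) onward\n",
"exampleCols = ['Buy', 'Income', 'Female']\n",
"exampleCols[1:] # -> ['Income', 'Female']"
]
},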
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predictorCols = data.columns[1:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we declare a logit object as a Logit model from the statsmodels package. Since we called in the entire statsmodels package as `sm`, we then ust the Logit class from the `sm` package. The Logit class takes two required arguments: your target variable, and your predictor variables."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model = sm.Logit(dataTrain['Buy'], dataTrain[predictorCols])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After declaring the model, we need to also apply the `fit` method to the model since we haven't told the Logit class to *do* anything yet, and store the fitted model as a new object."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.145078\n",
" Iterations 10\n"
]
}
],
"source": [
"logitResults = model.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output tells us that the model converged after 10 iterations. We can now print out a summary of the model statistics.\n",
"\n",
"**Note**: We can also use the .summary() function, but summary2() provides the AIC and BIC in the statistics table for us. These are available otherwise (we would just write `logitResults.aic()`, but it's nice to include them automatically)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td>Model:</td> <td>Logit</td> <td>Pseudo R-squared:</td> <td>0.698</td> \n",
"</tr>\n",
"<tr>\n",
" <td>Dependent Variable:</td> <td>Buy</td> <td>AIC:</td> <td>180.2388</td> \n",
"</tr>\n",
"<tr>\n",
" <td>Date:</td> <td>2015-07-13 13:34</td> <td>BIC:</td> <td>252.0226</td> \n",
"</tr>\n",
"<tr>\n",
" <td>No. Observations:</td> <td>504</td> <td>Log-Likelihood:</td> <td>-73.119</td> \n",
"</tr>\n",
"<tr>\n",
" <td>Df Model:</td> <td>16</td> <td>LL-Null:</td> <td>-242.48</td> \n",
"</tr>\n",
"<tr>\n",
" <td>Df Residuals:</td> <td>487</td> <td>LLR p-value:</td> <td>2.3085e-62</td>\n",
"</tr>\n",
"<tr>\n",
" <td>Converged:</td> <td>1.0000</td> <td>Scale:</td> <td>1.0000</td> \n",
"</tr>\n",
"<tr>\n",
" <td>No. Iterations:</td> <td>10.0000</td> <td></td> <td></td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>Coef.</th> <th>Std.Err.</th> <th>z</th> <th>P>|z|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Income</th> <td>0.0002</td> <td>0.0000</td> <td>7.6556</td> <td>0.0000</td> <td>0.0001</td> <td>0.0002</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Female</th> <td>1.6206</td> <td>0.4977</td> <td>3.2561</td> <td>0.0011</td> <td>0.6451</td> <td>2.5961</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Married</th> <td>0.6135</td> <td>0.6431</td> <td>0.9539</td> <td>0.3401</td> <td>-0.6470</td> <td>1.8739</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Student</th> <td>-0.1021</td> <td>0.5019</td> <td>-0.2033</td> <td>0.8389</td> <td>-1.0859</td> <td>0.8817</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Professional</th> <td>0.0328</td> <td>0.5373</td> <td>0.0610</td> <td>0.9514</td> <td>-1.0203</td> <td>1.0858</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Retired</th> <td>-1.2590</td> <td>1.0369</td> <td>-1.2142</td> <td>0.2247</td> <td>-3.2912</td> <td>0.7732</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Unemployed</th> <td>0.9117</td> <td>3.8095</td> <td>0.2393</td> <td>0.8109</td> <td>-6.5548</td> <td>8.3782</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Residence Length</th> <td>0.0211</td> <td>0.0156</td> <td>1.3530</td> <td>0.1761</td> <td>-0.0095</td> <td>0.0516</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Dual Income</th> <td>0.3512</td> <td>0.5880</td> <td>0.5973</td> <td>0.5503</td> <td>-0.8013</td> <td>1.5036</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Minors</th> <td>0.6606</td> <td>0.5094</td> <td>1.2970</td> <td>0.1946</td> <td>-0.3377</td> <td>1.6590</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Apartment</th> <td>1.0709</td> <td>0.6101</td> <td>1.7553</td> <td>0.0792</td> <td>-0.1249</td> <td>2.2667</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Rent</th> <td>-0.7703</td> <td>0.6792</td> <td>-1.1343</td> <td>0.2567</td> <td>-2.1015</td> <td>0.5608</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Own</th> <td>1.5089</td> <td>0.5944</td> <td>2.5386</td> <td>0.0111</td> <td>0.3439</td> <td>2.6739</td> \n",
"</tr>\n",
"<tr>\n",
" <th>English</th> <td>1.7863</td> <td>0.9208</td> <td>1.9400</td> <td>0.0524</td> <td>-0.0184</td> <td>3.5911</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prev Child Mag</th> <td>1.8793</td> <td>0.7817</td> <td>2.4042</td> <td>0.0162</td> <td>0.3472</td> <td>3.4114</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prev Parent Mag</th> <td>-0.1584</td> <td>0.7067</td> <td>-0.2242</td> <td>0.8226</td> <td>-1.5434</td> <td>1.2266</td> \n",
"</tr>\n",
"<tr>\n",
" <th>(Intercept)</th> <td>-16.6282</td> <td>2.2535</td> <td>-7.3789</td> <td>0.0000</td> <td>-21.0449</td> <td>-12.2114</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<class 'statsmodels.iolib.summary2.Summary'>\n",
"\"\"\"\n",
" Results: Logit\n",
"===================================================================\n",
"Model: Logit Pseudo R-squared: 0.698 \n",
"Dependent Variable: Buy AIC: 180.2388 \n",
"Date: 2015-07-13 13:34 BIC: 252.0226 \n",
"No. Observations: 504 Log-Likelihood: -73.119 \n",
"Df Model: 16 LL-Null: -242.48 \n",
"Df Residuals: 487 LLR p-value: 2.3085e-62\n",
"Converged: 1.0000 Scale: 1.0000 \n",
"No. Iterations: 10.0000 \n",
"-------------------------------------------------------------------\n",
" Coef. Std.Err. z P>|z| [0.025 0.975] \n",
"-------------------------------------------------------------------\n",
"Income 0.0002 0.0000 7.6556 0.0000 0.0001 0.0002\n",
"Female 1.6206 0.4977 3.2561 0.0011 0.6451 2.5961\n",
"Married 0.6135 0.6431 0.9539 0.3401 -0.6470 1.8739\n",
"Student -0.1021 0.5019 -0.2033 0.8389 -1.0859 0.8817\n",
"Professional 0.0328 0.5373 0.0610 0.9514 -1.0203 1.0858\n",
"Retired -1.2590 1.0369 -1.2142 0.2247 -3.2912 0.7732\n",
"Unemployed 0.9117 3.8095 0.2393 0.8109 -6.5548 8.3782\n",
"Residence Length 0.0211 0.0156 1.3530 0.1761 -0.0095 0.0516\n",
"Dual Income 0.3512 0.5880 0.5973 0.5503 -0.8013 1.5036\n",
"Minors 0.6606 0.5094 1.2970 0.1946 -0.3377 1.6590\n",
"Apartment 1.0709 0.6101 1.7553 0.0792 -0.1249 2.2667\n",
"Rent -0.7703 0.6792 -1.1343 0.2567 -2.1015 0.5608\n",
"Own 1.5089 0.5944 2.5386 0.0111 0.3439 2.6739\n",
"English 1.7863 0.9208 1.9400 0.0524 -0.0184 3.5911\n",
"Prev Child Mag 1.8793 0.7817 2.4042 0.0162 0.3472 3.4114\n",
"Prev Parent Mag -0.1584 0.7067 -0.2242 0.8226 -1.5434 1.2266\n",
"(Intercept) -16.6282 2.2535 -7.3789 0.0000 -21.0449 -12.2114\n",
"===================================================================\n",
"\n",
"\"\"\""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logitResults.summary2()"
]
},
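{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, the fit statistics are also available directly as attributes on the results object (output not shown):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# AIC and BIC are plain attributes (not methods) of the fitted results\n",
"logitResults.aic, logitResults.bic"
]
},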
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Refine results**\n",
"\n",
"We're going to use the `zip` function to tie all of our data together. The zip function takes group of equal length lists (vectors) and lines them up so they can be put into a DataFrame. The colums we're going to create are:\n",
"- **(Variable Names)** `predictorCols` -- We defined this up above, it's just the column names of our predictors\n",
"- **(Parameter Estimates)** `logitResults.params` -- This is the `params` attribute (parameters) of our logitResults object.\n",
"- **(Odds Ratios)** `(logitResults.params).apply(lambda x: round(exp(x), 4))` -- This one is complicated. We're going to take the parameters and apply a function to each of them. That function exponetiates the value to give us the odds ratio. Then we're going to round that number to 4 decimal digits.\n",
"- **(p-values)** `logitResults.pvalues.round(4)` -- the p-value of each of our variables\n",
"- **(significance)** `logitResults.pvalues < 0.05` -- a binary indicator of whether our variable is significant at alpha = 0.05"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"resultColumns = zip(predictorCols, # gather out all of our predictor columns\n",
" logitResults.params, # gather the parameters \n",
" (logitResults.params).apply(lambda x: round(exp(x), 4)), # gather the parameters, exponentiate, and round them\n",
" logitResults.pvalues.round(4), # gather the p-values, round them\n",
" logitResults.pvalues < 0.05) # print whether each variable is significant at alpha < 0.05"
]
},
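{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked example of the odds-ratio step: the `Female` coefficient of 1.6206 exponentiates to exp(1.6206) ≈ 5.06, meaning that, holding the other variables constant, the odds of buying are about 5 times higher when `Female` = 1."
]
},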
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we put our data into a DataFrame for formatting."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Variable</th>\n",
" <th>Paremeter</th>\n",
" <th>Odds Ratio</th>\n",
" <th>Pr(&gt;|z|)</th>\n",
" <th>Sig</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Income</td>\n",
" <td>0.000184</td>\n",
" <td>1.0002</td>\n",
" <td>0.0000</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Female</td>\n",
" <td>1.620632</td>\n",
" <td>5.0563</td>\n",
" <td>0.0011</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Married</td>\n",
" <td>0.613478</td>\n",
" <td>1.8468</td>\n",
" <td>0.3401</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Student</td>\n",
" <td>-0.102066</td>\n",
" <td>0.9030</td>\n",
" <td>0.8389</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Professional</td>\n",
" <td>0.032768</td>\n",
" <td>1.0333</td>\n",
" <td>0.9514</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Retired</td>\n",
" <td>-1.258998</td>\n",
" <td>0.2839</td>\n",
" <td>0.2247</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Unemployed</td>\n",
" <td>0.911673</td>\n",
" <td>2.4885</td>\n",
" <td>0.8109</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Residence Length</td>\n",
" <td>0.021071</td>\n",
" <td>1.0213</td>\n",
" <td>0.1761</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Dual Income</td>\n",
" <td>0.351180</td>\n",
" <td>1.4207</td>\n",
" <td>0.5503</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Minors</td>\n",
" <td>0.660632</td>\n",
" <td>1.9360</td>\n",
" <td>0.1946</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Apartment</td>\n",
" <td>1.070915</td>\n",
" <td>2.9180</td>\n",
" <td>0.0792</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Rent</td>\n",
" <td>-0.770346</td>\n",
" <td>0.4629</td>\n",
" <td>0.2567</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Own</td>\n",
" <td>1.508918</td>\n",
" <td>4.5218</td>\n",
" <td>0.0111</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>English</td>\n",
" <td>1.786333</td>\n",
" <td>5.9675</td>\n",
" <td>0.0524</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Prev Child Mag</td>\n",
" <td>1.879325</td>\n",
" <td>6.5491</td>\n",
" <td>0.0162</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Prev Parent Mag</td>\n",
" <td>-0.158417</td>\n",
" <td>0.8535</td>\n",
" <td>0.8226</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>(Intercept)</td>\n",
" <td>-16.628185</td>\n",
" <td>0.0000</td>\n",
" <td>0.0000</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Variable Paremeter Odds Ratio Pr(>|z|) Sig\n",
"0 Income 0.000184 1.0002 0.0000 True\n",
"1 Female 1.620632 5.0563 0.0011 True\n",
"2 Married 0.613478 1.8468 0.3401 False\n",
"3 Student -0.102066 0.9030 0.8389 False\n",
"4 Professional 0.032768 1.0333 0.9514 False\n",
"5 Retired -1.258998 0.2839 0.2247 False\n",
"6 Unemployed 0.911673 2.4885 0.8109 False\n",
"7 Residence Length 0.021071 1.0213 0.1761 False\n",
"8 Dual Income 0.351180 1.4207 0.5503 False\n",
"9 Minors 0.660632 1.9360 0.1946 False\n",
"10 Apartment 1.070915 2.9180 0.0792 False\n",
"11 Rent -0.770346 0.4629 0.2567 False\n",
"12 Own 1.508918 4.5218 0.0111 True\n",
"13 English 1.786333 5.9675 0.0524 False\n",
"14 Prev Child Mag 1.879325 6.5491 0.0162 True\n",
"15 Prev Parent Mag -0.158417 0.8535 0.8226 False\n",
"16 (Intercept) -16.628185 0.0000 0.0000 True"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resultsData = pd.DataFrame(resultColumns, columns=['Variable', 'Paremeter', 'Odds Ratio', 'Pr(>|z|)', 'Sig'])\n",
"resultsData"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###5. Predict the outcome in the test data set and validate the model.\n",
"Now that we have a model we're going to validate it's accuracy against a test data set. Our model object `logitResults` has a 'predict' function that generates predicted probabilities for a set of observations with the same independent variables. We're going to store these probabilities, then add them as a column to our `dataTest` data set."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predictions = logitResults.predict(dataTest[predictorCols])"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dataTest['Prediction Prob'] = predictions"
]
},
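{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's worth eyeballing a few predicted probabilities next to the actual outcomes as a sanity check (output not shown):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Peek at actual outcomes alongside predicted probabilities\n",
"dataTest[['Buy', 'Prediction Prob']].head()"
]
},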
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Since we have predicted probabilities we need to determine a cutoff. Before this step, you would need to perform some cutoff calibration using something like [ROC Curves](http://blog.yhathq.com/posts/roc-curves.html) or [other methods](http://scikit-learn.org/stable/modules/calibration.html). For our purposes, we'll assume a probability above 0.5 is a \"True\"."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dataTest['Prediction'] = dataTest['Prediction Prob'] > 0.5"
]
},
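{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the cutoff-calibration idea mentioned above, using `sklearn.metrics` (assumed available alongside the `sklearn` install used earlier; output not shown):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.metrics import roc_curve, roc_auc_score\n",
"\n",
"# False/true positive rates at each candidate threshold, plus the AUC,\n",
"# which summarizes how well the probabilities rank buyers over non-buyers\n",
"fpr, tpr, thresholds = roc_curve(dataTest['Buy'], dataTest['Prediction Prob'])\n",
"roc_auc_score(dataTest['Buy'], dataTest['Prediction Prob'])"
]
},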
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A common method of verifying model accuracy is using a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). We'll use `pandas' crosstab` function to create a nice one. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Predicted</th>\n",
" <th>False</th>\n",
" <th>True</th>\n",
" <th>All</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Actual</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>136</td>\n",
" <td>2</td>\n",
" <td>138</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>6</td>\n",
" <td>25</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>All</th>\n",
" <td>142</td>\n",
" <td>27</td>\n",
" <td>169</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Predicted False True All\n",
"Actual \n",
"0 136 2 138\n",
"1 6 25 31\n",
"All 142 27 169"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(dataTest['Buy'], dataTest['Prediction'], rownames=['Actual'], colnames=['Predicted'], margins=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many different metrics used to evaluate a model, but True Positives, True Negatives, and Accuracy are good basic ones. We can calculate them all using the confusion matrix numbers and some basic calculations."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from __future__ import division"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"######re: above\n",
"We're importing division as it's fixed in Python 3. Python 2 uses integer division by default. Don't worry about it, it's complicated. If you ever get weird division results, this is why. https://www.python.org/dev/peps/pep-0238/"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True Positives: 0.8065\n",
"True Negatives: 0.9855\n",
"Accuracy: 0.9527\n"
]
}
],
"source": [
"true_pos = 25/31\n",
"true_neg = 136/138\n",
"accuracy = (136 + 25) / 169\n",
"\n",
"print \"True Positives: \" + str(round(true_pos, 4))\n",
"print \"True Negatives: \" + str(round(true_neg, 4))\n",
"print \"Accuracy: \" + str(round(accuracy, 4))"
]
},
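{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than copying numbers out of the confusion matrix by hand, the same figures can be computed directly. A sketch using `sklearn.metrics` (output not shown):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, recall_score\n",
"\n",
"y_true = dataTest['Buy']\n",
"y_pred = dataTest['Prediction'].astype(int) # booleans -> 0/1\n",
"\n",
"# recall (with the positive class = 1) is the true positive rate\n",
"print 'True positive rate: ' + str(round(recall_score(y_true, y_pred), 4))\n",
"print 'Accuracy: ' + str(round(accuracy_score(y_true, y_pred), 4))"
]
},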
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Recap\n",
"We've created and fit a predictive model, developed a readable output for the coefficients, and used a model to predict additional observations for model validation.\n",
"\n",
"**Extra Credit** (some ideas for improvement):\n",
"- Run a different random seed into the shuffle to verify model results. Do significant variables stay consistent despite the seed?\n",
"- Check out some other [model evaluation metrics](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).\n",
"- Our model includes all of our original variables, despite their significance. Testing whether a simplified model would give us similar results without a huge loss in accuracy would be beneficial.\n",
"- Repeat the modeling using [`sklearn's LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). What's different?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}