paulgb/Affairs.ipynb

## Affairs.ipynb
{
 "metadata": {
  "name": "Affairs"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note: this is meant as a demonstration of [sklearn-pandas](https://github.com/paulgb/sklearn-pandas) and in particular the new grid search capabilities. In a real use-case you would want to hold out a test set to determine the performance of the algorithm."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Predicting Extramarital Affairs with Sklearn\n============================================\n\nLoad the Dataset\n----------------\n\nThe dataset comes from the [`statsmodels.api`](http://statsmodels.sourceforge.net/stable/datasets/index.html) package."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "import statsmodels.api as sm\naffair_meta = sm.datasets.fair\naffair = affair_meta.load_pandas()",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "About the Data\n--------------"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print affair_meta.DESCRLONG\nprint affair_meta.SOURCE\nprint affair_meta.NOTE",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Extramarital affair data used to explain the allocation\nof an individual's time among work, time spent with a spouse, and time\nspent with a paramour. The data is used as an example of regression\nwith censored data.\n\nFair, Ray. 1978. \"A Theory of Extramarital Affairs,\" `Journal of Political\n    Economy`, February, 45-61.\n\nThe data is available at http://fairmodel.econ.yale.edu/rayfair/pdf/2011b.htm\n\n\nNumber of observations: 6366\nNumber of variables: 9\nVariable name definitions:\n\n    rate_marriage   : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,\n                      4 = good, 5 = very good\n    age             : Age\n    yrs_married     : No. years married. Interval approximations. See\n                      original paper for detailed explanation.\n    children        : No. children\n    religious       : How relgious, 1 = not, 2 = mildly, 3 = fairly,\n                      4 = strongly\n    educ            : Level of education, 9 = grade school, 12 = high school,\n                      14 = some college, 16 = college graduate, 17 = some\n                      graduate school, 20 = advanced degree\n    occupation      : 1 = student, 2 = farming, agriculture; semi-skilled,\n                      or unskilled worker; 3 = white-colloar; 4 = teacher\n                      counselor social worker, nurse; artist, writers;\n                      technician, skilled worker, 5 = managerial,\n                      administrative, business, 6 = professional with\n                      advanced degree\n    occupation_husb : Husband's occupation. Same as occupation.\n    affairs         : measure of time spent in extramarital affairs\n\nSee the original paper for more details.\n\n"
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Set up data\n-----------\n\nThe original dataset has a continuous field for \"time spent in extramarital affairs\". Rather than predicting the amount of time spent in affairs, let's focus on predicting whether an affair exists at all."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "data = affair.exog\ntarget = affair.endog > 0",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Set up pipeline\n---------------"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from sklearn_pandas import cross_val_score, DataFrameMapper, GridSearchCV\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.ensemble import RandomForestClassifier",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "pipeline_stages = list()",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "The first stage of the pipeline is an sklearn-pandas `DataFrameMapper` class. Some of the features are categoical, some are quantative. We want to apply a one-hot encoder to the categorical features, and copy the other features verbatim."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "mapper = DataFrameMapper([\n    (['occupation', 'occupation_husb'], OneHotEncoder()),\n    (['rate_marriage', 'age', 'yrs_married', 'educ', 'children', 'religious'], None)\n])\npipeline_stages.append(('mapper', mapper))",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "The second stage of the pipeline is a Support Vector Classification."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "rf = RandomForestClassifier()\npipeline_stages.append(('rf', rf))",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Build a pipeline from the stages"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "pipeline = Pipeline(pipeline_stages)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Grid Search\n-----------"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print list(pipeline.get_params())",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "['mapper', 'rf__max_depth', 'mapper__features', 'rf__n_estimators', 'rf__verbose', 'rf__criterion', 'rf__min_density', 'rf__min_samples_split', 'rf__compute_importances', 'rf__bootstrap', 'rf', 'rf__max_features', 'rf__n_jobs', 'rf__random_state', 'rf__oob_score', 'rf__min_samples_leaf']\n"
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "gs = GridSearchCV(pipeline, ({\n    'rf__n_estimators': [20, 40, 60],\n    'rf__min_samples_split': [100, 150, 200],\n    'rf__max_features': ['auto', 2, 4, 8],\n    'rf__criterion': ['gini', 'entropy']\n}), 'roc_auc')",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "gs.fit(data, target)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "gs.best_params_",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": "{'rf__criterion': 'gini',\n 'rf__max_features': 4,\n 'rf__min_samples_split': 150,\n 'rf__n_estimators': 60}"
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "gs.best_score_",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": "0.74808104422072164"
      }
     ],
     "prompt_number": 18
    }
   ],
   "metadata": {}
  }
 ]
}