Last active December 23, 2015 16:29
"metadata": {
"name": "Affairs"
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: this is meant as a demonstration of sklearn-pandas, and in particular its new grid search capabilities. In a real use case you would hold out a test set to measure the performance of the algorithm."
"cell_type": "markdown",
"metadata": {},
Predicting Extramarital Affairs with Sklearn
============================================

Load the Dataset
----------------

The dataset comes from the `statsmodels.api` package.
"cell_type": "code",
"collapsed": false,
import statsmodels.api as sm
affair_meta = sm.datasets.fair
affair = affair_meta.load_pandas()
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
"cell_type": "markdown",
"metadata": {},
About the Data
--------------
"cell_type": "code",
"collapsed": false,
print(affair_meta.DESCRLONG)
print(affair_meta.SOURCE)
print(affair_meta.NOTE)
"language": "python",
"metadata": {},
"outputs": [
"output_type": "stream",
"stream": "stdout",
Extramarital affair data used to explain the allocation
of an individual's time among work, time spent with a spouse, and time
spent with a paramour. The data is used as an example of regression
with censored data.

Fair, Ray. 1978. "A Theory of Extramarital Affairs," `Journal of Political
Economy`, February, 45-61.

The data is available at


Number of observations: 6366
Number of variables: 9
Variable name definitions:

rate_marriage : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
4 = good, 5 = very good
age : Age
yrs_married : No. years married. Interval approximations. See
original paper for detailed explanation.
children : No. children
religious : How relgious, 1 = not, 2 = mildly, 3 = fairly,
4 = strongly
educ : Level of education, 9 = grade school, 12 = high school,
14 = some college, 16 = college graduate, 17 = some
graduate school, 20 = advanced degree
occupation : 1 = student, 2 = farming, agriculture; semi-skilled,
or unskilled worker; 3 = white-colloar; 4 = teacher
counselor social worker, nurse; artist, writers;
technician, skilled worker, 5 = managerial,
administrative, business, 6 = professional with
advanced degree
occupation_husb : Husband's occupation. Same as occupation.
affairs : measure of time spent in extramarital affairs

See the original paper for more details.
"prompt_number": 2
"cell_type": "markdown",
"metadata": {},
Set up data
-----------

The original dataset has a continuous field for "time spent in extramarital affairs". Rather than predicting the amount of time spent in affairs, let's focus on predicting whether an affair exists at all.
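
As a minimal sketch of that binarization, assume a handful of made-up values for the time-spent measure (these numbers are hypothetical, not taken from the dataset). Comparing each value with zero turns the continuous measurement into a boolean label; the notebook applies the same comparison to the whole `affair.endog` column at once.

```python
# Hypothetical affair-time measurements; not taken from the dataset.
time_in_affairs = [0.0, 0.111, 3.23, 0.0, 4.67]

# Any positive amount of time counts as "an affair exists".
has_affair = [t > 0 for t in time_in_affairs]

print(has_affair)
```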
"cell_type": "code",
"collapsed": false,
data = affair.exog
target = affair.endog > 0
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
"cell_type": "markdown",
"metadata": {},
Set up pipeline
---------------
"cell_type": "code",
"collapsed": false,
from sklearn_pandas import cross_val_score, DataFrameMapper, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
"cell_type": "code",
"collapsed": false,
pipeline_stages = list()
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
"cell_type": "markdown",
"metadata": {},
The first stage of the pipeline is an sklearn-pandas `DataFrameMapper`. Some of the features are categorical and some are quantitative. We want to apply a one-hot encoder to the categorical features and copy the other features verbatim.
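
To make the idea concrete, here is a plain-Python sketch of what "one-hot encode some columns, copy the rest" produces. It does not use `DataFrameMapper` or `OneHotEncoder`; the `one_hot` helper and the sample rows are hypothetical and exist only for illustration.

```python
def one_hot(values):
    """Map each value to a 0/1 indicator vector over the sorted distinct values."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical rows of (occupation code, age); not taken from the dataset.
rows = [(3, 27.0), (5, 32.0), (3, 22.0), (4, 37.0)]

# The categorical column is expanded into one indicator column per category.
occupation_encoded = one_hot([occ for occ, _ in rows])

# The quantitative column is appended unchanged after the indicator columns.
features = [enc + [age] for enc, (_, age) in zip(occupation_encoded, rows)]

for row in features:
    print(row)
```

Each occupation code becomes its own 0/1 column, so a tree-based model never treats the codes as an ordered quantity.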
"cell_type": "code",
"collapsed": false,
mapper = DataFrameMapper([
(['occupation', 'occupation_husb'], OneHotEncoder()),
(['rate_marriage', 'age', 'yrs_married', 'educ', 'children', 'religious'], None)
])
pipeline_stages.append(('mapper', mapper))
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
"cell_type": "markdown",
"metadata": {},
The second stage of the pipeline is a random forest classifier.
"cell_type": "code",
"collapsed": false,
rf = RandomForestClassifier()
pipeline_stages.append(('rf', rf))
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
"cell_type": "markdown",
"metadata": {},
Build a pipeline from the stages
"cell_type": "code",
"collapsed": false,
pipeline = Pipeline(pipeline_stages)
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
"cell_type": "markdown",
"metadata": {},
Grid Search
-----------
"cell_type": "code",
"collapsed": false,
print(list(pipeline.get_params()))
"language": "python",
"metadata": {},
"outputs": [
"output_type": "stream",
"stream": "stdout",
['mapper', 'rf__max_depth', 'mapper__features', 'rf__n_estimators', 'rf__verbose', 'rf__criterion', 'rf__min_density', 'rf__min_samples_split', 'rf__compute_importances', 'rf__bootstrap', 'rf', 'rf__max_features', 'rf__n_jobs', 'rf__random_state', 'rf__oob_score', 'rf__min_samples_leaf']
"prompt_number": 9
"cell_type": "code",
"collapsed": false,
gs = GridSearchCV(pipeline, ({
'rf__n_estimators': [20, 40, 60],
'rf__min_samples_split': [100, 150, 200],
'rf__max_features': ['auto', 2, 4, 8],
'rf__criterion': ['gini', 'entropy']
}), 'roc_auc')
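
A grid search of this kind is exhaustive: every combination of the listed values is a candidate, and each candidate is fitted once per cross-validation fold. The sketch below uses only the standard library to count the candidates. The dictionary repeats the values from the grid above, and the variable names are just for illustration.

```python
from itertools import product

# Same values as the grid passed to GridSearchCV above.
param_grid = {
    'rf__n_estimators': [20, 40, 60],
    'rf__min_samples_split': [100, 150, 200],
    'rf__max_features': ['auto', 2, 4, 8],
    'rf__criterion': ['gini', 'entropy'],
}

# Enumerate the Cartesian product of all parameter values.
names = sorted(param_grid)
candidates = [dict(zip(names, combo))
              for combo in product(*(param_grid[n] for n in names))]

print(len(candidates))  # 3 * 3 * 4 * 2 = 72
```

Adding one more value to any list multiplies the total, which is worth keeping in mind before widening the grid.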
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
"cell_type": "code",
"collapsed": false,
gs.fit(data, target)
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
"cell_type": "code",
"collapsed": false,
gs.best_params_
"language": "python",
"metadata": {},
"outputs": [
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
{'rf__criterion': 'gini',
'rf__max_features': 4,
'rf__min_samples_split': 150,
'rf__n_estimators': 60}
"prompt_number": 17
"cell_type": "code",
"collapsed": false,
gs.best_score_
"language": "python",
"metadata": {},
"outputs": [
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
0.74808104422072164
"prompt_number": 18
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
"metadata": {}
