Last active
December 23, 2015 16:29
{ | |
"metadata": { | |
"name": "Affairs" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note: this notebook demonstrates [sklearn-pandas](https://github.com/paulgb/sklearn-pandas), in particular its new grid search capabilities. In a real use case you would hold out a test set to estimate the performance of the selected model." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Predicting Extramarital Affairs with Sklearn\n============================================\n\nLoad the Dataset\n----------------\n\nThe dataset comes from the [`statsmodels.api`](http://statsmodels.sourceforge.net/stable/datasets/index.html) package." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "import statsmodels.api as sm\naffair_meta = sm.datasets.fair\naffair = affair_meta.load_pandas()", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "About the Data\n--------------" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "print(affair_meta.DESCRLONG)\nprint(affair_meta.SOURCE)\nprint(affair_meta.NOTE)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Extramarital affair data used to explain the allocation\nof an individual's time among work, time spent with a spouse, and time\nspent with a paramour. The data is used as an example of regression\nwith censored data.\n\nFair, Ray. 1978. \"A Theory of Extramarital Affairs,\" `Journal of Political\n Economy`, February, 45-61.\n\nThe data is available at http://fairmodel.econ.yale.edu/rayfair/pdf/2011b.htm\n\n\nNumber of observations: 6366\nNumber of variables: 9\nVariable name definitions:\n\n rate_marriage : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,\n 4 = good, 5 = very good\n age : Age\n yrs_married : No. years married. Interval approximations. See\n original paper for detailed explanation.\n children : No. children\n religious : How relgious, 1 = not, 2 = mildly, 3 = fairly,\n 4 = strongly\n educ : Level of education, 9 = grade school, 12 = high school,\n 14 = some college, 16 = college graduate, 17 = some\n graduate school, 20 = advanced degree\n occupation : 1 = student, 2 = farming, agriculture; semi-skilled,\n or unskilled worker; 3 = white-colloar; 4 = teacher\n counselor social worker, nurse; artist, writers;\n technician, skilled worker, 5 = managerial,\n administrative, business, 6 = professional with\n advanced degree\n occupation_husb : Husband's occupation. Same as occupation.\n affairs : measure of time spent in extramarital affairs\n\nSee the original paper for more details.\n\n" | |
} | |
], | |
"prompt_number": 2 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Set up data\n-----------\n\nThe original dataset has a continuous field for \"time spent in extramarital affairs\". Rather than predicting the amount of time spent in affairs, let's focus on predicting whether an affair exists at all." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "data = affair.exog\ntarget = affair.endog > 0", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 3 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Set up pipeline\n---------------" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "from sklearn_pandas import cross_val_score, DataFrameMapper, GridSearchCV\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.ensemble import RandomForestClassifier", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "pipeline_stages = list()", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "The first stage of the pipeline is an sklearn-pandas `DataFrameMapper` instance. Some of the features are categorical and some are quantitative. We want to apply a one-hot encoder to the categorical features (`occupation` and `occupation_husb`) and copy the other features verbatim." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "mapper = DataFrameMapper([\n (['occupation', 'occupation_husb'], OneHotEncoder()),\n (['rate_marriage', 'age', 'yrs_married', 'educ', 'children', 'religious'], None)\n])\npipeline_stages.append(('mapper', mapper))", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "The second stage of the pipeline is a random forest classifier." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "rf = RandomForestClassifier()\npipeline_stages.append(('rf', rf))", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Build the pipeline from the stages." | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "pipeline = Pipeline(pipeline_stages)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 8 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Grid Search\n-----------" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "print(list(pipeline.get_params()))", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "['mapper', 'rf__max_depth', 'mapper__features', 'rf__n_estimators', 'rf__verbose', 'rf__criterion', 'rf__min_density', 'rf__min_samples_split', 'rf__compute_importances', 'rf__bootstrap', 'rf', 'rf__max_features', 'rf__n_jobs', 'rf__random_state', 'rf__oob_score', 'rf__min_samples_leaf']\n" | |
} | |
], | |
"prompt_number": 9 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "gs = GridSearchCV(pipeline, {\n 'rf__n_estimators': [20, 40, 60],\n 'rf__min_samples_split': [100, 150, 200],\n 'rf__max_features': ['auto', 2, 4, 8],\n 'rf__criterion': ['gini', 'entropy']\n}, scoring='roc_auc')", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 15 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "gs.fit(data, target)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 16 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "gs.best_params_", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 17, | |
"text": "{'rf__criterion': 'gini',\n 'rf__max_features': 4,\n 'rf__min_samples_split': 150,\n 'rf__n_estimators': 60}" | |
} | |
], | |
"prompt_number": 17 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "gs.best_score_", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 18, | |
"text": "0.74808104422072164" | |
} | |
], | |
"prompt_number": 18 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
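The grid search in the notebook above tries every combination of the values listed for the four `rf__` parameters. As a rough illustration of how many candidate settings that is, the sketch below expands the same grid using only the Python standard library. `expand_grid` is a hypothetical helper written for this note; it is not part of sklearn or sklearn-pandas, and it does not fit any models.

```python
from itertools import product

# Parameter grid copied from the notebook's GridSearchCV call.
param_grid = {
    'rf__n_estimators': [20, 40, 60],
    'rf__min_samples_split': [100, 150, 200],
    'rf__max_features': ['auto', 2, 4, 8],
    'rf__criterion': ['gini', 'entropy'],
}


def expand_grid(grid):
    """Yield one dict per combination of parameter values.

    Hypothetical helper for illustration only.
    """
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(param_grid))
print(len(candidates))  # 3 * 3 * 4 * 2 = 72 candidate settings
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is 72 times the fold count. The notebook does not state how many folds were used, so that total is left open here.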