Titanic Kaggle competition - Feature EDA - The Guarantee Group and other special passengers
{
"cells": [
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import division\n",
"import operator\n",
"\n",
"import pandas as pd\n",
"from pandas import Series, DataFrame\n",
"import numpy as np\n",
"\n",
"from sklearn.cross_validation import cross_val_score\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.svm import SVC, LinearSVC\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.naive_bayes import GaussianNB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Titanic EDA - The Gaurantee Group and other special passengers\n",
"Several men traveled as passengers but were considered crew ([the Guarantee Group](https://en.wikipedia.org/wiki/Crew_of_the_RMS_Titanic#Guarantee_group)) or employees of other cruise lines. All of these passengers traveled on complimentary tickets (Fare = 0.0). We can find them in our training set with a simple query."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train = pd.read_csv('data/train.csv', dtype={'Age': np.float64})\n",
"y_train = X_train['Survived']"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>179</th>\n",
" <td>180</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Leonard, Mr. Lionel</td>\n",
" <td>male</td>\n",
" <td>36.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LINE</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>263</th>\n",
" <td>264</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Harrison, Mr. William</td>\n",
" <td>male</td>\n",
" <td>40.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112059</td>\n",
" <td>0.0</td>\n",
" <td>B94</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>271</th>\n",
" <td>272</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Tornquist, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>25.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LINE</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>277</th>\n",
" <td>278</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Parkes, Mr. Francis \"Frank\"</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>239853</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>302</th>\n",
" <td>303</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mr. William Cahoone Jr</td>\n",
" <td>male</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LINE</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>413</th>\n",
" <td>414</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Cunningham, Mr. Alfred Fleming</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>239853</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>466</th>\n",
" <td>467</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Campbell, Mr. William</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>239853</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>481</th>\n",
" <td>482</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Frost, Mr. Anthony Wood \"Archie\"</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>239854</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>597</th>\n",
" <td>598</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mr. Alfred</td>\n",
" <td>male</td>\n",
" <td>49.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LINE</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>633</th>\n",
" <td>634</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Parr, Mr. William Henry Marsh</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112052</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>674</th>\n",
" <td>675</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Watson, Mr. Ennis Hastings</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>239856</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>732</th>\n",
" <td>733</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Knight, Mr. Robert J</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>239855</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>806</th>\n",
" <td>807</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Andrews, Mr. Thomas Jr</td>\n",
" <td>male</td>\n",
" <td>39.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112050</td>\n",
" <td>0.0</td>\n",
" <td>A36</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>815</th>\n",
" <td>816</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Fry, Mr. Richard</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112058</td>\n",
" <td>0.0</td>\n",
" <td>B102</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>822</th>\n",
" <td>823</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Reuchlin, Jonkheer. John George</td>\n",
" <td>male</td>\n",
" <td>38.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>19972</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name Sex \\\n",
"179 180 0 3 Leonard, Mr. Lionel male \n",
"263 264 0 1 Harrison, Mr. William male \n",
"271 272 1 3 Tornquist, Mr. William Henry male \n",
"277 278 0 2 Parkes, Mr. Francis \"Frank\" male \n",
"302 303 0 3 Johnson, Mr. William Cahoone Jr male \n",
"413 414 0 2 Cunningham, Mr. Alfred Fleming male \n",
"466 467 0 2 Campbell, Mr. William male \n",
"481 482 0 2 Frost, Mr. Anthony Wood \"Archie\" male \n",
"597 598 0 3 Johnson, Mr. Alfred male \n",
"633 634 0 1 Parr, Mr. William Henry Marsh male \n",
"674 675 0 2 Watson, Mr. Ennis Hastings male \n",
"732 733 0 2 Knight, Mr. Robert J male \n",
"806 807 0 1 Andrews, Mr. Thomas Jr male \n",
"815 816 0 1 Fry, Mr. Richard male \n",
"822 823 0 1 Reuchlin, Jonkheer. John George male \n",
"\n",
" Age SibSp Parch Ticket Fare Cabin Embarked \n",
"179 36.0 0 0 LINE 0.0 NaN S \n",
"263 40.0 0 0 112059 0.0 B94 S \n",
"271 25.0 0 0 LINE 0.0 NaN S \n",
"277 NaN 0 0 239853 0.0 NaN S \n",
"302 19.0 0 0 LINE 0.0 NaN S \n",
"413 NaN 0 0 239853 0.0 NaN S \n",
"466 NaN 0 0 239853 0.0 NaN S \n",
"481 NaN 0 0 239854 0.0 NaN S \n",
"597 49.0 0 0 LINE 0.0 NaN S \n",
"633 NaN 0 0 112052 0.0 NaN S \n",
"674 NaN 0 0 239856 0.0 NaN S \n",
"732 NaN 0 0 239855 0.0 NaN S \n",
"806 39.0 0 0 112050 0.0 A36 S \n",
"815 NaN 0 0 112058 0.0 B102 S \n",
"822 38.0 0 0 19972 0.0 NaN S "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train[X_train.Fare < 1.0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see members of the Guarantee Group as well other special passengers. Notice some have a 'LINE' value for their ticket. These are employees of the American cruise line who had to take the Titanic as a solution to some scheduling problems.\n",
"\n",
"We also see employees traveling in first class. Mr. Richard Fry and Mr. William Harrison were employees of Joseph Bruce Ismay, chairman and managing director of the White Star Line. Ismay survived. Fry and Harrison did not.\n",
"\n",
"Also traveling in first class was Mr. Thomas Andrews, the Titanic's chief designer.\n",
"\n",
"Only one of these men, Mr. William Henry Tornquist traveling in 3rd class, survived the sinking."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another interesting group was the reverands."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>150</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Byles, Rev. Thomas Roussel Davids</td>\n",
" <td>male</td>\n",
" <td>42.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>244310</td>\n",
" <td>13.000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>150</th>\n",
" <td>151</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Bateman, Rev. Robert James</td>\n",
" <td>male</td>\n",
" <td>51.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>S.O.P. 1166</td>\n",
" <td>12.525</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>249</th>\n",
" <td>250</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Carter, Rev. Ernest Courtenay</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>244252</td>\n",
" <td>26.000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>626</th>\n",
" <td>627</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Kirkland, Rev. Charles Leonard</td>\n",
" <td>male</td>\n",
" <td>57.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>219533</td>\n",
" <td>12.350</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>848</th>\n",
" <td>849</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Harper, Rev. John</td>\n",
" <td>male</td>\n",
" <td>28.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>248727</td>\n",
" <td>33.000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name Sex \\\n",
"149 150 0 2 Byles, Rev. Thomas Roussel Davids male \n",
"150 151 0 2 Bateman, Rev. Robert James male \n",
"249 250 0 2 Carter, Rev. Ernest Courtenay male \n",
"626 627 0 2 Kirkland, Rev. Charles Leonard male \n",
"848 849 0 2 Harper, Rev. John male \n",
"886 887 0 2 Montvila, Rev. Juozas male \n",
"\n",
" Age SibSp Parch Ticket Fare Cabin Embarked \n",
"149 42.0 0 0 244310 13.000 NaN S \n",
"150 51.0 0 0 S.O.P. 1166 12.525 NaN S \n",
"249 54.0 1 0 244252 26.000 NaN S \n",
"626 57.0 0 0 219533 12.350 NaN Q \n",
"848 28.0 0 1 248727 33.000 NaN S \n",
"886 27.0 0 0 211536 13.000 NaN S "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train[X_train['Name'].apply(lambda x: x.find('Rev.') != -1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of them traveled in class 2 and all died."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just for fun, let's make an 'employee' feature that includes the Guarantee Group, cruise line employees and reverands and see how it does."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def check_classifiers(X_train, Y_train):\n",
" \n",
" _cv = 5\n",
" classifier_score = {}\n",
" \n",
" scores = cross_val_score(LogisticRegression(), X, y, cv=_cv)\n",
" classifier_score['LogisticRegression'] = scores.mean()\n",
" \n",
" scores = cross_val_score(KNeighborsClassifier(), X, y, cv=_cv)\n",
" classifier_score['KNeighborsClassifier'] = scores.mean()\n",
" \n",
" scores = cross_val_score(RandomForestClassifier(), X, y, cv=_cv)\n",
" classifier_score['RandomForestClassifier'] = scores.mean()\n",
" \n",
" scores = cross_val_score(SVC(), X, y, cv=_cv)\n",
" classifier_score['SVC'] = scores.mean()\n",
" \n",
" scores = cross_val_score(GaussianNB(), X, y, cv=_cv)\n",
" classifier_score['GaussianNB'] = scores.mean()\n",
"\n",
" return sorted(classifier_score.items(), key=operator.itemgetter(1), reverse=True)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def check_employee(passenger):\n",
" name, fare = passenger\n",
" if fare < 1 or name.find('Rev.') != -1:\n",
" return 1.0\n",
" else:\n",
" return 0.0\n",
"\n",
"X_train['employee'] = X_train[['Name', 'Fare']].apply(check_employee, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"features = ['employee']"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('RandomForestClassifier', 0.61616490890978648),\n",
" ('LogisticRegression', 0.61616490890978648),\n",
" ('SVC', 0.61616490890978648),\n",
" ('KNeighborsClassifier', 0.57482412678688144),\n",
" ('GaussianNB', 0.4051838312610137)]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = DataFrame(X_train[features])\n",
"y = y_train\n",
"scores = check_classifiers(X, y)\n",
"scores"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Features</th>\n",
" <th>Coefficient Estimate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>employee</td>\n",
" <td>-1.539345</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Features Coefficient Estimate\n",
"0 employee -1.539345"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get Correlation Coefficient for each feature using Logistic Regression\n",
"coeff_df = DataFrame(X.columns)\n",
"coeff_df.columns = ['Features']\n",
"classifier = LogisticRegression()\n",
"coeff_df[\"Coefficient Estimate\"] = pd.Series(classifier.fit(X, y).coef_[0])\n",
"\n",
"# preview\n",
"coeff_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Being an employee looks like a strong predictor for not surviving... no surprise there."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.6161616161616161"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"null_hypothesis = 1 - X_train.Survived.mean()\n",
"null_hypothesis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, as a model it does not perform better than the null hypothesis. Again, not a surprise since the null hypothesis asserts everyone died."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
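
As a rough check on the final comparison in the notebook, the counts visible in its output tables are enough to reproduce the null accuracy by hand. The sketch below is not part of the notebook: it uses plain Python 3 rather than pandas or scikit-learn, the helper names `majority_baseline_accuracy` and `single_feature_accuracy` are made up for illustration, and it measures accuracy on the training data itself rather than by cross-validation. It assumes the standard 891-row Kaggle training set, which is consistent with the printed null accuracy of 0.6161... (549 of 891).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def single_feature_accuracy(feature, labels):
    """Accuracy when each feature value predicts its own group's most common label."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, Counter())[label] += 1
    correct = sum(c.most_common(1)[0][1] for c in groups.values())
    return correct / len(labels)


# Counts read off the notebook's tables (assumption: 891 training rows):
#   employee == 1: 15 zero-fare passengers + 6 reverends = 21 people, 1 survivor
#   employee == 0: the remaining 870 people, 341 survivors
employee = [1] * 21 + [0] * 870
survived = [1] * 1 + [0] * 20 + [1] * 341 + [0] * 529

baseline = majority_baseline_accuracy(survived)
rule = single_feature_accuracy(employee, survived)
print(baseline, rule)
```

Both values come out to 549/891, about 0.616. The majority outcome is "died" in the flagged group and in the unflagged group alike, so a rule built on this one feature never predicts a survivor and cannot beat the baseline.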