Titanic Kaggle Challenge
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Kaggle Competition | Titanic: Machine Learning from Disaster\n",
"The competition's website is located on [Kaggle.com](https://www.kaggle.com/c/titanic)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Decription from Kaggle:\n",
"\n",
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n",
"\n",
"One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n",
"\n",
"In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Introduction\n",
"Hi! This is my fourth practice notebook, and this is second month of my [Python + Machine Learning + Data Science](https://plus.google.com/b/112453425797644541537/112453425797644541537/posts) learning path. Previous practice notebooks are [here](http://nbviewer.ipython.org/gist/9bf22088e5b940e4d6d5) and [here](http://nbviewer.ipython.org/gist/subpath/7b52120610dbd7eab857).\n",
"\n",
"\n",
"In this practise notebook I will show my solution for the Titanic challenge.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#imports we will need\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"import sklearn.ensemble as ske\n",
"from patsy import dmatrices\n",
"import seaborn as sns\n",
"sns.set_style('whitegrid')\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
"PassengerId 891 non-null int64\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Name 891 non-null object\n",
"Sex 891 non-null object\n",
"Age 714 non-null float64\n",
"SibSp 891 non-null int64\n",
"Parch 891 non-null int64\n",
"Ticket 891 non-null object\n",
"Fare 891 non-null float64\n",
"Cabin 204 non-null object\n",
"Embarked 889 non-null object\n",
"dtypes: float64(2), int64(5), object(5)\n",
"memory usage: 90.5+ KB\n"
]
}
],
"source": [
"#Reading the dataset in a dataframe using Pandas\n",
"df = pd.read_csv('Train.csv') \n",
"#check our data\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n",
"2 Heikkinen, Miss. Laina female 26 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n",
"4 Allen, Mr. William Henry male 35 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54 0 \n",
"7 Palsson, Master. Gosta Leonard male 2 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So looks like data have missing values and in the column \"Cabin\" almost half is missing.\n",
"Also columns 'Cabin' and 'Ticket' are not intresting for machine learning model.\n",
"So we will drops that values"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#remove columns\n",
"df = df.drop(['Ticket','Cabin'], axis=1)\n",
"\n",
"#remove all rows with missing values\n",
"df = df.dropna() "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 712 entries, 0 to 890\n",
"Data columns (total 10 columns):\n",
"PassengerId 712 non-null int64\n",
"Survived 712 non-null int64\n",
"Pclass 712 non-null int64\n",
"Name 712 non-null object\n",
"Sex 712 non-null object\n",
"Age 712 non-null float64\n",
"SibSp 712 non-null int64\n",
"Parch 712 non-null int64\n",
"Fare 712 non-null float64\n",
"Embarked 712 non-null object\n",
"dtypes: float64(2), int64(5), object(3)\n",
"memory usage: 61.2+ KB\n"
]
}
],
"source": [
"#check our data again\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now data looks fine, we've removed all rows with empty values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Machine Learning part"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to make our data understandable for machine learning algorithm we need to create an arrays like that: [x1, x2...xn] and [y1, y2..yn].\n",
"It's mean that for every set of features Xi we will have unique label Yi.\n",
"\n",
"In order to do that lets create model formula:\n",
"\n",
"###Survived ~ C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked),\n",
"\n",
"##### here the ~ sign is an = sign, and the features of our dataset are written as a formula to predict survived. The C() lets our algorithm know that those variables are categorical."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Create an acceptable formula for our machine learning algorithms\n",
"formula_ml = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'\n",
"\n",
"#assign the variables\n",
"y_train, x_train = dmatrices(formula_ml, data=df, return_type='dataframe')\n",
"\n",
"y_train = np.asarray(y_train).ravel()"
]
},
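{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick optional check (a sketch, not part of the original solution): patsy expands each C() term into dummy columns and adds an Intercept column, so it can be useful to look at the resulting column names and shapes before fitting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#inspect the design matrix that patsy built from the formula\n",
"print(x_train.columns.tolist())\n",
"print(x_train.shape)\n",
"print(y_train.shape)"
]
},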
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we can apply machine learning algorithms!\n",
"I will be using following classification algorithms:\n",
"\n",
"[Random Forest](https://en.wikipedia.org/wiki/Random_forest),\n",
" [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm),\n",
" [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Finally Machine Learning part"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.9452247191011236, 0.8553370786516854, 0.8019662921348315]\n"
]
}
],
"source": [
"#instantiate and fit our model\n",
"rf = ske.RandomForestClassifier(n_estimators=200).fit(x_train, y_train)\n",
"\n",
"# Score the results\n",
"score_rf = rf.score(x_train, y_train)\n",
"\n",
"#instantiate and fit our model\n",
"kN = KNeighborsClassifier(n_neighbors = 2).fit(x_train, y_train)\n",
"\n",
"# Score the results\n",
"score_kN = kN.score(x_train, y_train)\n",
"\n",
"#instantiate and fit our model\n",
"logreg = LogisticRegression().fit(x_train, y_train)\n",
"\n",
"# Score the results\n",
"score_logreg = logreg.score(x_train, y_train)\n",
"\n",
"#print results\n",
"overall_results = [score_rf,score_kN,score_logreg]\n",
"\n",
"print(overall_results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So looks like Random Forest best of three algorithms for this modeling."
]
},
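{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch (not part of the original submission) of how the three models could be compared with 5-fold cross-validation instead of training accuracy. It reuses the x_train and y_train matrices built above; the import path assumes a recent scikit-learn (older versions exposed cross_val_score from sklearn.cross_validation)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#compare the three classifiers with cross-validation instead of training accuracy\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"models = [('Random Forest', ske.RandomForestClassifier(n_estimators=200)),\n",
"          ('k-nearest neighbors', KNeighborsClassifier(n_neighbors=2)),\n",
"          ('Logistic Regression', LogisticRegression())]\n",
"\n",
"for name, model in models:\n",
"    scores = cross_val_score(model, x_train, y_train, cv=5)\n",
"    print(name, scores.mean())"
]
},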
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Uploading results on Kaggle"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 418 entries, 0 to 417\n",
"Data columns (total 9 columns):\n",
"PassengerId 418 non-null int64\n",
"Pclass 418 non-null int64\n",
"Name 418 non-null object\n",
"Sex 418 non-null object\n",
"Age 332 non-null float64\n",
"SibSp 418 non-null int64\n",
"Parch 418 non-null int64\n",
"Fare 417 non-null float64\n",
"Embarked 418 non-null object\n",
"dtypes: float64(2), int64(4), object(3)\n",
"memory usage: 32.7+ KB\n"
]
}
],
"source": [
"#upload features for testing\n",
"test_data = pd.read_csv(\"Test.csv\")\n",
"#Drop some columns\n",
"test_data = test_data .drop(['Ticket','Cabin'], axis=1)\n",
"\n",
"test_data.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to Kaggle requirements:\n",
"\n",
"####\"We expect the solution file to have 418 predictions. The file should have a header row.\"\n",
"\n",
"We need to submit 418 predictions, so here we will not removes rows with empty Age, insted we will write in average age in empty cells. And same think with Fare column\n",
"\t\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 418 entries, 0 to 417\n",
"Data columns (total 9 columns):\n",
"PassengerId 418 non-null int64\n",
"Pclass 418 non-null int64\n",
"Name 418 non-null object\n",
"Sex 418 non-null object\n",
"Age 418 non-null float64\n",
"SibSp 418 non-null int64\n",
"Parch 418 non-null int64\n",
"Fare 418 non-null float64\n",
"Embarked 418 non-null object\n",
"dtypes: float64(2), int64(4), object(3)\n",
"memory usage: 32.7+ KB\n"
]
}
],
"source": [
"meanAge=np.mean(test_data.Age)\n",
"\n",
"test_data.Age=test_data.Age.fillna(meanAge)\n",
"\n",
"meanFare=np.mean(test_data.Fare)\n",
"\n",
"test_data.Fare=test_data.Fare.fillna(meanFare)\n",
"\n",
"#check that everything fine\n",
"test_data.info()"
]
},
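{
"cell_type": "markdown",
"metadata": {},
"source": [
"Side note (a sketch of an alternative, not what was done above): a slightly cleaner variant is to fill the test set's missing values with statistics computed from the training data, so the test set is not used to estimate its own fill values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#alternative imputation: use the training set's means for the test set\n",
"train_mean_age = df['Age'].mean()\n",
"train_mean_fare = df['Fare'].mean()\n",
"test_data['Age'] = test_data['Age'].fillna(train_mean_age)\n",
"test_data['Fare'] = test_data['Fare'].fillna(train_mean_fare)"
]
},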
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"formula_ml_test = 'Name ~ C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'\n",
"# here we will need only x_test, I wasn't sure how to do it other way\n",
"y_test, x_test = dmatrices(formula_ml_test, data=test_data, return_type='dataframe')\n"
]
},
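{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference (a sketch, not what this notebook does): patsy also provides dmatrix, which builds only the right-hand-side design matrix, so the dummy 'Name ~' left-hand side is not needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from patsy import dmatrix\n",
"\n",
"#build only the feature matrix for the test set (no response variable needed)\n",
"x_test_alt = dmatrix('C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)',\n",
"                     data=test_data, return_type='dataframe')"
]
},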
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# make a prediction\n",
"y_pred = rf.predict(x_test)\n",
"\n",
"PassengerId = np.array(test_data['PassengerId'])\n",
"\n",
"#make a data frame\n",
"y_pred= pd.DataFrame(y_pred,PassengerId,columns=['Survived'])\n",
"\n",
"# saves the results into csv\n",
"\n",
"y_pred.to_csv(\"MyPredictionTitanic.csv\")\n",
"\n"
]
},
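{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat (an assumption about the expected submission format, not verified here): with the code above the index column has no name and the predictions are written as floats, while Kaggle's sample submission has a PassengerId header and an integer Survived column. A sketch of a version that matches that format:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#sketch: write the submission with explicit column names and integer labels\n",
"submission = pd.DataFrame({\n",
"    'PassengerId': test_data['PassengerId'],\n",
"    'Survived': rf.predict(x_test).astype(int)\n",
"})\n",
"submission.to_csv('MyPredictionTitanic.csv', index=False)"
]
},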
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"##Now you can upload your CSV with prediction on [Kaggle](https://www.kaggle.com/c/titanic/submissions/attach)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}