@antoniszczepanik
Last active January 29, 2018 04:55
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# First, I'm importing all the modules I'll use in this project. The list is very limited, as are my skills, but that is only for now (:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#data wrangling\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt \n",
"import numpy as np\n",
"\n",
"#machine learning\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now I'm reading the data. The split between the training and testing data is predefined by Kaggle. If you want to run this code, remember to change the paths below to wherever your copies of train.csv and test.csv are stored."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.read_csv('C:\\\\Users\\\\user\\\\Desktop\\\\KaggleProblems\\\\Titanic\\\\train.csv')\n",
"test_df = pd.read_csv('C:\\\\Users\\\\user\\\\Desktop\\\\KaggleProblems\\\\Titanic\\\\test.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now let's take a look at what we have:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Our columns are:\n",
"\n",
"\n",
"\n",
"- PassengerId - unique ID of each passenger\n",
"\n",
"- Survived - binary feature specifying whether the person survived (1) or not (0)\n",
"\n",
"- Pclass - class of the ticket, values 1, 2, 3\n",
"\n",
"- Name - passenger's name\n",
"\n",
"- Sex - male or female\n",
"\n",
"- Age - person's age as a float value\n",
"\n",
"- SibSp - number of siblings / spouses aboard the Titanic\n",
"\n",
"- Parch - number of parents / children aboard the Titanic\n",
"\n",
"- Ticket - ticket number\n",
"\n",
"- Fare - amount paid for the ticket (float)\n",
"\n",
"- Cabin - passenger's cabin number\n",
"\n",
"- Embarked - port where the passenger embarked (S = Southampton, C = Cherbourg, Q = Queenstown)\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"# I decided to drop Ticket, Cabin and Name, plus PassengerId from the training data (the test set keeps it, since a Kaggle submission needs it). I don't expect these columns to add much to a simple first model.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>71.2833</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7.9250</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>53.1000</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Age SibSp Parch Fare Embarked\n",
"0 0 3 male 22.0 1 0 7.2500 S\n",
"1 1 1 female 38.0 1 0 71.2833 C\n",
"2 1 3 female 26.0 0 0 7.9250 S\n",
"3 1 1 female 35.0 1 0 53.1000 S\n",
"4 0 3 male 35.0 0 0 8.0500 S"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df = train_df.drop([\"Ticket\", \"Cabin\", \"PassengerId\", \"Name\"], axis = 1)\n",
"test_df = test_df.drop([\"Ticket\", \"Cabin\", \"Name\"], axis = 1)\n",
"\n",
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now I need to check whether any values are missing."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 891 entries, 0 to 890\n",
"Data columns (total 8 columns):\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Sex 891 non-null object\n",
"Age 714 non-null float64\n",
"SibSp 891 non-null int64\n",
"Parch 891 non-null int64\n",
"Fare 891 non-null float64\n",
"Embarked 889 non-null object\n",
"dtypes: float64(2), int64(4), object(2)\n",
"memory usage: 55.8+ KB\n",
"++++++++++++++++++++++++++++++++++++++\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 418 entries, 0 to 417\n",
"Data columns (total 8 columns):\n",
"PassengerId 418 non-null int64\n",
"Pclass 418 non-null int64\n",
"Sex 418 non-null object\n",
"Age 332 non-null float64\n",
"SibSp 418 non-null int64\n",
"Parch 418 non-null int64\n",
"Fare 417 non-null float64\n",
"Embarked 418 non-null object\n",
"dtypes: float64(2), int64(4), object(2)\n",
"memory usage: 26.2+ KB\n"
]
}
],
"source": [
"train_df.info()\n",
"print(\"++++++++++++++++++++++++++++++++++++++\")\n",
"test_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# We can see that Age has missing values in both sets, Embarked is missing twice in the training data, and Fare is missing once in the test data. I will start with Embarked."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>71.2833</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7.9250</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>53.1000</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Age SibSp Parch Fare Embarked\n",
"0 0 3 male 22.0 1 0 7.2500 S\n",
"1 1 1 female 38.0 1 0 71.2833 C\n",
"2 1 3 female 26.0 0 0 7.9250 S\n",
"3 1 1 female 35.0 1 0 53.1000 S\n",
"4 0 3 male 35.0 0 0 8.0500 S"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
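],
"source": [
"# \"S\" is the most common port in the training data, so use it for the two missing values\n",
"train_df[\"Embarked\"] = train_df[\"Embarked\"].fillna(\"S\")\n",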
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The next thing to deal with is the missing Age data. For now I will fill it in with the mean of each set."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_df[\"Age\"] = train_df[\"Age\"].fillna(train_df[\"Age\"].mean())\n",
"test_df[\"Age\"] = test_df[\"Age\"].fillna(test_df[\"Age\"].mean())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The test data also lacks one Fare value, which I'll fill with the most frequent value."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# mode()[0] is the most frequent fare itself; value_counts().max() would be how many times it occurs\n",
"test_df[\"Fare\"] = test_df[\"Fare\"].fillna(test_df[\"Fare\"].mode()[0])"
]
},
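{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note on \"most frequent value\": in pandas the value itself comes from `mode()` (or `value_counts().idxmax()`), while `value_counts().max()` is the number of times it occurs. Here is a minimal sketch on made-up fares; the `fares`, `most_frequent`, `times_seen` and `filled` names are mine and only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up fares for illustration, not taken from the Titanic data\n",
"fares = pd.Series([7.75, 8.05, 8.05, 26.0, None])\n",
"\n",
"most_frequent = fares.mode()[0]          # the fare that occurs most often\n",
"times_seen = fares.value_counts().max()  # how many times that fare occurs\n",
"\n",
"filled = fares.fillna(most_frequent)\n",
"print(most_frequent, times_seen)\n",
"print(filled.tolist())"
]
},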
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now that our data is complete, I will look at the summary statistics of each numeric variable."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.383838</td>\n",
" <td>2.308642</td>\n",
" <td>29.699118</td>\n",
" <td>0.523008</td>\n",
" <td>0.381594</td>\n",
" <td>32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.486592</td>\n",
" <td>0.836071</td>\n",
" <td>13.002015</td>\n",
" <td>1.102743</td>\n",
" <td>0.806057</td>\n",
" <td>49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.420000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>22.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>29.699118</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>35.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>80.000000</td>\n",
" <td>8.000000</td>\n",
" <td>6.000000</td>\n",
" <td>512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Age SibSp Parch Fare\n",
"count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000\n",
"mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208\n",
"std 0.486592 0.836071 13.002015 1.102743 0.806057 49.693429\n",
"min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000\n",
"25% 0.000000 2.000000 22.000000 0.000000 0.000000 7.910400\n",
"50% 0.000000 3.000000 29.699118 0.000000 0.000000 14.454200\n",
"75% 1.000000 3.000000 35.000000 1.000000 0.000000 31.000000\n",
"max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What information here might give us some insight?\n",
"\n",
"1. 38% of the passengers in the training dataset survived.\n",
"2. The average passenger age is about 30 years (29.7).\n",
"3. More than half of the passengers travelled without siblings or spouses.\n",
"4. More than half of the passengers travelled without parents or children.\n",
"5. The average fare was about 32 (the data does not state the currency)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I believe it does not matter whether someone was travelling with parents or with siblings. What will make a difference is whether someone travelled alone or with family. That is why I am creating a new binary feature called Family (0 - the person travelled alone, 1 - the person travelled with at least one family member)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Family</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>7.2500</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>71.2833</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>7.9250</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>53.1000</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>8.0500</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Age Fare Embarked Family\n",
"0 0 3 male 22.0 7.2500 S 1\n",
"1 1 1 female 38.0 71.2833 C 1\n",
"2 1 3 female 26.0 7.9250 S 0\n",
"3 1 1 female 35.0 53.1000 S 1\n",
"4 0 3 male 35.0 8.0500 S 0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df[\"Family\"] = np.where(train_df[\"SibSp\"] + train_df[\"Parch\"] == 0, 0, 1)\n",
"test_df[\"Family\"] = np.where(test_df[\"SibSp\"] + test_df[\"Parch\"] == 0, 0, 1)\n",
"\n",
"train_df = train_df.drop([\"SibSp\",\"Parch\"], axis = 1)\n",
"test_df = test_df.drop([\"SibSp\",\"Parch\"], axis = 1)\n",
"\n",
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 891 entries, 0 to 890\n",
"Data columns (total 7 columns):\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Sex 891 non-null object\n",
"Age 891 non-null float64\n",
"Fare 891 non-null float64\n",
"Embarked 891 non-null object\n",
"Family 891 non-null int32\n",
"dtypes: float64(2), int32(1), int64(2), object(2)\n",
"memory usage: 45.3+ KB\n"
]
}
],
"source": [
"train_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now I will convert the Embarked and Sex values to numbers, because scikit-learn's LogisticRegression needs numeric input."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Family</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>22.0</td>\n",
" <td>7.2500</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>38.0</td>\n",
" <td>71.2833</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>26.0</td>\n",
" <td>7.9250</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>35.0</td>\n",
" <td>53.1000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>35.0</td>\n",
" <td>8.0500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Age Fare Embarked Family\n",
"0 0 3 0 22.0 7.2500 0 1\n",
"1 1 1 1 38.0 71.2833 1 1\n",
"2 1 3 1 26.0 7.9250 0 0\n",
"3 1 1 1 35.0 53.1000 0 1\n",
"4 0 3 0 35.0 8.0500 0 0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embarked_num = {\"S\": 0, \"C\": 1, \"Q\": 2}\n",
"\n",
"train_df[\"Embarked\"] = train_df[\"Embarked\"].map(embarked_num).astype(int)\n",
"test_df[\"Embarked\"] = test_df[\"Embarked\"].map(embarked_num).astype(int)\n",
"\n",
"sex_num = {\"male\": 0, \"female\": 1}\n",
"\n",
"train_df[\"Sex\"] = train_df[\"Sex\"].map(sex_num).astype(int)\n",
"test_df[\"Sex\"] = test_df[\"Sex\"].map(sex_num).astype(int)\n",
"\n",
"train_df.head()"
]
},
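{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note: the codes 0, 1, 2 put the ports on a numeric scale, so a linear model such as logistic regression treats them as ordered (S < C < Q), which the ports are not. One common alternative is one-hot encoding with `pd.get_dummies`. This is only a sketch on a toy frame, not a change to the pipeline above; the `toy` and `encoded` names are mine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy frame for illustration; only the column name matches the notebook\n",
"toy = pd.DataFrame({\"Embarked\": [\"S\", \"C\", \"Q\", \"S\"]})\n",
"\n",
"# One indicator column per port, so no port counts as larger than another\n",
"encoded = pd.get_dummies(toy, columns=[\"Embarked\"], prefix=\"Embarked\")\n",
"encoded"
]
},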
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# I will bucket the continuous values. I would prefer a more meaningful split, for example infants, kids and adults, but I don't know the method yet, so I will research it later. For now I am just creating five equal-width bins."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Family</th>\n",
" </tr>\n",
" <tr>\n",
" <th>AgeGroup</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>100</td>\n",
" <td>100</td>\n",
" <td>100</td>\n",
" <td>100</td>\n",
" <td>100</td>\n",
" <td>100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>523</td>\n",
" <td>523</td>\n",
" <td>523</td>\n",
" <td>523</td>\n",
" <td>523</td>\n",
" <td>523</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>188</td>\n",
" <td>188</td>\n",
" <td>188</td>\n",
" <td>188</td>\n",
" <td>188</td>\n",
" <td>188</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>69</td>\n",
" <td>69</td>\n",
" <td>69</td>\n",
" <td>69</td>\n",
" <td>69</td>\n",
" <td>69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" <td>11</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Fare Embarked Family\n",
"AgeGroup \n",
"0 100 100 100 100 100 100\n",
"1 523 523 523 523 523 523\n",
"2 188 188 188 188 188 188\n",
"3 69 69 69 69 69 69\n",
"4 11 11 11 11 11 11"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df[\"AgeGroup\"] = pd.cut(train_df.Age, bins=5, labels = [0, 1, 2, 3, 4])\n",
"\n",
"\n",
"test_df[\"AgeGroup\"] = pd.cut(test_df.Age, bins=5, labels = [0, 1, 2, 3, 4])\n",
"\n",
"\n",
"train_df = train_df.drop([\"Age\"], axis = 1)\n",
"test_df = test_df.drop([\"Age\"], axis = 1)\n",
"\n",
"train_df.groupby([\"AgeGroup\"]).count()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Embarked</th>\n",
" <th>Family</th>\n",
" <th>AgeGroup</th>\n",
" </tr>\n",
" <tr>\n",
" <th>FareGroup</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>838</td>\n",
" <td>838</td>\n",
" <td>838</td>\n",
" <td>838</td>\n",
" <td>838</td>\n",
" <td>838</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>33</td>\n",
" <td>33</td>\n",
" <td>33</td>\n",
" <td>33</td>\n",
" <td>33</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>17</td>\n",
" <td>17</td>\n",
" <td>17</td>\n",
" <td>17</td>\n",
" <td>17</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Embarked Family AgeGroup\n",
"FareGroup \n",
"0 838 838 838 838 838 838\n",
"1 33 33 33 33 33 33\n",
"2 17 17 17 17 17 17\n",
"3 0 0 0 0 0 0\n",
"4 3 3 3 3 3 3"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df[\"FareGroup\"] = pd.cut(train_df.Fare, bins=5, labels = [0, 1, 2, 3, 4])\n",
"\n",
"\n",
"test_df[\"FareGroup\"] = pd.cut(test_df.Fare, bins=5, labels = [0, 1, 2, 3, 4])\n",
"\n",
"\n",
"train_df = train_df.drop([\"Fare\"], axis = 1)\n",
"test_df = test_df.drop([\"Fare\"], axis = 1)\n",
"\n",
"train_df.head()\n",
"train_df.groupby([\"FareGroup\"]).count()"
]
},
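{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible way to get named groups such as infants, kids and adults is to pass explicit bin edges to `pd.cut` instead of `bins=5`. With an integer, pandas derives the edges from the range of whatever column it is given, so the training and test sets can end up with different boundaries; a fixed list of edges avoids that. Below is a minimal sketch. The cut-off ages and group names are my own assumptions, chosen only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up ages and cut-offs, chosen only for illustration\n",
"ages = pd.Series([0.5, 4, 15, 22, 35, 61, 80])\n",
"edges = [0, 2, 12, 18, 60, 120]\n",
"names = [\"infant\", \"child\", \"teen\", \"adult\", \"senior\"]\n",
"\n",
"# Each bin is (left, right], so an age of exactly 12 falls into \"child\"\n",
"age_group = pd.cut(ages, bins=edges, labels=names)\n",
"age_group.value_counts(sort=False)"
]
},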
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# It seems that our data is clean and ready for training. I picked logistic regression because it is a simple solution. Moreover, it's one of the few algorithms I know the math behind.\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"78.900000000000006"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train = train_df.drop(\"Survived\", axis=1)\n",
"Y_train = train_df[\"Survived\"]\n",
"X_test = test_df.drop(\"PassengerId\", axis=1).copy()\n",
"\n",
"logreg = LogisticRegression()\n",
"logreg.fit(X_train, Y_train)\n",
"Y_pred = logreg.predict(X_test)\n",
"# Accuracy on the training data itself, in percent\n",
"acc_log = round(logreg.score(X_train, Y_train) * 100, 2)\n",
"acc_log"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# I think that 78.9% is not a bad score, considering that this is the first time I have actually used scikit-learn! Keep in mind that it is accuracy on the training data, not on Kaggle's test set. If you've read this and have any comments, feel free to post them :)"
]
},
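{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 78.9% above comes from `logreg.score(X_train, Y_train)`, so it is measured on the same rows the model was fitted on and is probably a bit optimistic. One possible way to get a fairer estimate without Kaggle's labels is k-fold cross-validation. This is a minimal sketch: the synthetic `X` and `y` are stand-ins I made up so the cell runs on its own, and the number it prints says nothing about the Titanic data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Synthetic stand-in for X_train / Y_train, only so this cell is self-contained\n",
"X, y = make_classification(n_samples=200, n_features=6, random_state=0)\n",
"\n",
"# Accuracy on 5 held-out folds instead of on the rows the model was fitted on\n",
"scores = cross_val_score(LogisticRegression(), X, y, cv=5)\n",
"print(round(scores.mean() * 100, 2))"
]
},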
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}