Skip to content

Instantly share code, notes, and snippets.

@codistwa
Created May 25, 2023 14:10
Show Gist options
  • Save codistwa/e651d45f6414e59c4643c114629baf51 to your computer and use it in GitHub Desktop.
Save codistwa/e651d45f6414e59c4643c114629baf51 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "legitimate-candy",
"metadata": {},
"source": [
"## Problem statement"
]
},
{
"cell_type": "markdown",
"id": "hourly-species",
"metadata": {},
"source": [
"Whether this passenger will survive or not"
]
},
{
"cell_type": "markdown",
"id": "apart-designer",
"metadata": {},
"source": [
"## 1. Importing Libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "atlantic-strike",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np #for algebraic operations on arrays\n",
"import pandas as pd #for data exploration and manipulation\n",
"\n",
"# plotting libraries\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"from mlxtend.plotting import plot_learning_curves"
]
},
{
"cell_type": "markdown",
"id": "fixed-juvenile",
"metadata": {},
"source": [
"## 2. Loading the dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "opened-replication",
"metadata": {},
"outputs": [],
"source": [
"train = './titanic/train.csv'\n",
"test = './titanic/test.csv'\n",
"df_train = pd.read_csv(train)\n",
"df_test = pd.read_csv(test)"
]
},
{
"cell_type": "markdown",
"id": "generic-artist",
"metadata": {},
"source": [
"## 3. Exploratory data analysis"
]
},
{
"cell_type": "markdown",
"id": "opening-finding",
"metadata": {},
"source": [
"**Topics** (think like a stakeholder)\n",
"\n",
"- What question(s) are you trying to solve (or prove wrong)?\n",
"- What kind of data do you have and how do you treat different types?\n",
"- What’s missing from the data and how do you deal with it?\n",
"- Where are the outliers and why should you care about them?\n",
"- How can you add, change or remove features to get more out of your data?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "banned-union",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# View the first five lines\n",
"df_train.head()"
]
},
{
"cell_type": "markdown",
"id": "expensive-communist",
"metadata": {},
"source": [
"Our response label here is “Survived” which represents an answer to the classification question “Whether this passenger will survive or not”"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "hazardous-playback",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(891, 12)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.shape"
]
},
{
"cell_type": "markdown",
"id": "intimate-knowing",
"metadata": {},
"source": [
"There are 891 rows and 12 columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "directed-security",
"metadata": {},
"outputs": [],
"source": [
"# View the first five lines transposed (useful when the dataset is huge)\n",
"# df_train.head().T"
]
},
{
"cell_type": "markdown",
"id": "creative-premises",
"metadata": {},
"source": [
"**Survived** is the target variable for our ML model which we want to predict (1 or 0)\n",
"\n",
"It has 2 values:\n",
"\n",
"1 - Survived.\n",
"\n",
"0 - No survived."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "accepted-christmas",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 PassengerId 891 non-null int64 \n",
" 1 Survived 891 non-null int64 \n",
" 2 Pclass 891 non-null int64 \n",
" 3 Name 891 non-null object \n",
" 4 Sex 891 non-null object \n",
" 5 Age 714 non-null float64\n",
" 6 SibSp 891 non-null int64 \n",
" 7 Parch 891 non-null int64 \n",
" 8 Ticket 891 non-null object \n",
" 9 Fare 891 non-null float64\n",
" 10 Cabin 204 non-null object \n",
" 11 Embarked 889 non-null object \n",
"dtypes: float64(2), int64(5), object(5)\n",
"memory usage: 83.7+ KB\n"
]
}
],
"source": [
"# checking data information\n",
"df_train.info()"
]
},
{
"cell_type": "markdown",
"id": "ongoing-orleans",
"metadata": {},
"source": [
"**Interpreting Data Information**\n",
"\n",
"- We have 891 rows, any column that contains lesser number of rows has missing values.\n",
"- We have 12 columns.\n",
"- There are categorical features that have data type float64.\n",
"- There are numerical features that have data type object.\n",
"\n",
"We can see some columns with missing values (less than the total) : Age, Cabin"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "backed-framing",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>714.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>446.000000</td>\n",
" <td>0.383838</td>\n",
" <td>2.308642</td>\n",
" <td>29.699118</td>\n",
" <td>0.523008</td>\n",
" <td>0.381594</td>\n",
" <td>32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>257.353842</td>\n",
" <td>0.486592</td>\n",
" <td>0.836071</td>\n",
" <td>14.526497</td>\n",
" <td>1.102743</td>\n",
" <td>0.806057</td>\n",
" <td>49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.420000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>223.500000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>20.125000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>446.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>28.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>668.500000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>38.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>891.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>80.000000</td>\n",
" <td>8.000000</td>\n",
" <td>6.000000</td>\n",
" <td>512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# statistical summary of numerical variables\n",
"df_train.describe()"
]
},
{
"cell_type": "markdown",
"id": "diverse-proxy",
"metadata": {},
"source": [
"We can compare the mean of each column with the min/max value, to check if we might have outliers as there's a considerable difference between average value and max value.\n",
"\n",
"We can compare the general mean and standard deviation to see if we need to normalize the data as there's a considerable difference between all of them."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "false-canon",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Ticket</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891</td>\n",
" <td>891</td>\n",
" <td>891</td>\n",
" <td>204</td>\n",
" <td>889</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>891</td>\n",
" <td>2</td>\n",
" <td>681</td>\n",
" <td>147</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>Young, Miss. Marie Grice</td>\n",
" <td>male</td>\n",
" <td>347082</td>\n",
" <td>B96 B98</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>1</td>\n",
" <td>577</td>\n",
" <td>7</td>\n",
" <td>4</td>\n",
" <td>644</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Sex Ticket Cabin Embarked\n",
"count 891 891 891 204 889\n",
"unique 891 2 681 147 3\n",
"top Young, Miss. Marie Grice male 347082 B96 B98 S\n",
"freq 1 577 7 4 644"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# summary statistics for categorical columns\n",
"df_train.describe(include=['object'])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "weekly-proposal",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PassengerId 891\n",
"Survived 2\n",
"Pclass 3\n",
"Name 891\n",
"Sex 2\n",
"Age 88\n",
"SibSp 7\n",
"Parch 7\n",
"Ticket 681\n",
"Fare 248\n",
"Cabin 147\n",
"Embarked 3\n"
]
}
],
"source": [
"# We compute the number of unique elements one columns at a time as it's faster\n",
"for column in df_train.columns:\n",
" print(column, df_train[column].nunique())"
]
},
{
"cell_type": "markdown",
"id": "commercial-dream",
"metadata": {},
"source": [
"**The number of unique values in each columns is also relevant. Columns with less than two unique values can be discarded**"
]
},
{
"cell_type": "markdown",
"id": "corrected-reception",
"metadata": {},
"source": [
"### Univariate Analysis"
]
},
{
"cell_type": "markdown",
"id": "attempted-calculation",
"metadata": {},
"source": [
"**Analyze the target variable**\n",
"\n",
"For numerical continue variables, we can use a histogram or scatter plot, for categorical data, we commonly preferred bar plots or pie charts."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "stock-founder",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# checking for missing values\n",
"df_train['Survived'].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "brown-accuracy",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of unique values\n",
"df_train['Survived'].nunique()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "emerging-adaptation",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 549\n",
"1 342\n",
"Name: Survived, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# frequency distribution\n",
"df_train['Survived'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "manual-religion",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.616162\n",
"1 0.383838\n",
"Name: Survived, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Percent breakdown of response variable (ratio of frequency distribution of values)\n",
"df_train['Survived'].value_counts(normalize=True)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "imposed-intellectual",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAHgCAYAAABKGnGhAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAASjklEQVR4nO3dfbCmd33X8c83LGkdqCQh25hmEzeWTDtxLCldMS2dsYJVoJVkKkQYkIVmZn1Ah07bwah/SJ3qtGpBsC1jxlCSjEJSKiYy2JYJxGoHKBtLA0msrEgm2QYSnkIpgga//nGu/XIIG/Ys2fvcJzmv18w993X9rofzO5md8851P1Z3BwCS5LR1TwCAnUMUABiiAMAQBQCGKAAwRAGAsWfdE3g0zj777N6/f/+6pwHwmHLbbbd9qrv3Hm/bYzoK+/fvz+HDh9c9DYDHlKq6+5G2efgIgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGA8pr+j+VT4qf903bqnwA70C897+bqnAGvhSgGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwVhqFqvp4VX24qj5UVYeXsbOq6t1V9dHl/sxlvKrqjVV1pKpur6pnrHJuAHy97bhS+AvdfUl3H1jWr0pyS3dflOSWZT1JnpfkouV2KMmbtmFuAGyyjoePLkty7bJ8bZLLN41f1xven+SMqjp3DfMD2LVWHYVO8ptVdVtVHVrGzunu+5blTyQ5Z1k+L8k9m469dxkDYJvsWfH5f7C7j1bVtyd5d1X9980bu7urqk/mhEtcDiXJBRdccOpmCsBqrxS6++hyf3+SdyR5ZpJPHntYaLm/f9n9aJLzNx2+bxl7+Dmv7u4D3X1g7969q5w+wK6zsihU1ZOq6tuOLSf5S0k+kuTmJAeX3Q4muWlZvjnJy5dXIV2a5MFNDzMBsA1W+fDROUneUVXHfs6/6+5fr6oPJrmxqq5McneSK5b935Xk+UmOJPlikleucG4AHMfKotDdH0vy9OOMfzrJc44z3kletar5AHBi3tEMwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAMbKo1BVT6iq362qdy7rF1bVB6rqSFXdUFWnL+PfsqwfWbbvX/XcAPha23Gl8Ookd21a//kkr+/upyX5bJIrl/Erk3x2GX/9sh8A22ilUaiqfUl+JMm/WdYrybOTvH3Z5dokly/Lly3rWbY/Z9kfgG2y6iuFf5nkNUn+37L+1CSf6+6HlvV7k5y3LJ+X5J4kWbY/uOwPwDZZWRSq6keT3N/dt53i8x6qqsNVdfiBBx44lacG2PVWeaXwrCQvqKqPJ3lbNh42ekOSM6pqz7LPviRHl+WjSc5PkmX7U5J8+uEn7e6ru/tAdx/Yu3fvCqcPsPusLArd/fe7e19370/y4iTv6e6XJnlvkhcuux1MctOyfPOynmX7e7q7VzU/AL7eOt6n8PeS/GRVHcnGcwbXLOPXJHnqMv6TSa5aw9wAdrU9J97l0evuW5Pcuix/LMkzj7PPl5K8aDvmA8DxeUczAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoADFEAYIgCAEMUABiiAMAQBQCGKAAwRAGAIQoAjD3rngBwfPe/6TXrngI70Lf/rX+20vO7UgBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgAMUQBgiAIAQxQAGCuLQlV9a1X9TlX9XlXdUVU/s4xfWFUfqKojVXVDVZ2+jH/Lsn5k2b5/VXMD4Pi2FIWqumUrYw/z5STP7u6nJ7kkyXOr6tIkP5/k9d39tCSfTXLlsv+VST67jL9+2Q+AbfQNo7D83/5ZSc6uqjOr6qzltj/Jed/o2N7whWX1icutkzw7yduX8WuTXL4sX7asZ9n+nKqqk/x9AHgU9pxg+99I8hNJviPJbUmO/ZH+fJJfPNHJq+oJy3FPS/JLSf5nks9190PLLvfmq3E5L8k9SdLdD1XVg0memuRTDzvnoSSHkuSCCy440RQAOAnf8Eqhu9/Q3Rcm+enu/lPdfeFye3p3nzAK3f2V7r4kyb4kz0zy3Y92wt19dXcf6O4De/fufbSnA2CTE10pJEm6+19V1Q8k2b/5mO6+bovHf66q3pvk+5OcUVV7lquFfUmOLrsdTXJ+knurak+SpyT59FZ/EQAeva0+0Xx9kn+R5AeT/NnlduAEx+ytqjOW5T+W5IeT3JXkvUleuOx2MMlNy/LNy3qW7e/p7t7qLwLAo7elK4VsBODik/wjfW6Sa5fnFU5LcmN3v7Oq7kzytqr62SS/m+SaZf9rklxfVUeSfCbJi0/iZwFwCmw1Ch9J8ieS3LfVE3f37Um+9zjjH8vG8wsPH/9Skhdt9fwAnHpbjcLZSe6sqt/JxvsPkiTd/YKVzAqAtdhqFF67ykkAsDNs9dVH/3nVEwFg/bYUhar6w2y8GzlJTs/Gu5P/qLv/+KomBsD22+qVwrcdW14+euKyJJeualIArMdJf0rq8plG/yHJXz710wFgnbb68NGPbVo9LRvvW/jSSmYEwNps9dVHf2XT8kNJPp6Nh5AAeBzZ6nMKr1z1RABYv61+9tG+qnpHVd2/3H6tqvatenIAbK+tPtH8K9n4wLrvWG7/cRkD4HFkq1HY292/0t0PLbe3JPFlBgCPM1uNwqer6mVV9YTl9rL4rgOAx52tRuHHk1yR5BPZ+KTUFyZ5xYrmBMCabPUlqf84ycHu/mySVNVZ2fjSnR9f1cQA2H5bvVL4nmNBSJLu/kyO810JADy2bTUKp1XVmcdWliuFrV5lAPAYsdU/7L+Q5H1V9avL+ouS/JPVTAmAddnqO5qvq6rDSZ69DP1Yd9+5umkBsA5bfghoiYAQADyOnfRHZwPw+CUKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAWFkUqur8qnpvVd1ZVXdU1auX8bOq6t1V9dHl/sxlvKrqjVV1pKpur6pnrGpuABzfKq8UHkryU919cZJLk7yqqi5OclWSW7r7oiS3LOtJ8rwkFy23Q0netMK5AXAcK4tCd9/X3f9tWf7DJHclOS/JZUmuXXa7Nsnly/JlSa7rDe9PckZVnbuq+QHw9bblOYWq2p/ke5N8IMk53X3fsukTSc5Zls9Lcs+mw+5dxgDYJiuPQlU9OcmvJfmJ7v785m3d3Un6JM93qKoOV9XhBx544BTOFICVRqGqnpiNIPzb7v73y/Anjz0stNzfv4wfTXL+psP3LWNfo7uv7u4D3X1g7969q5s8wC60ylcfVZJrktzV3a/btOnmJAeX5YNJbto0/vLlVUiXJnlw08NMAGyDPSs897OS/PUkH66qDy1j/yDJzyW5saquTHJ3kiuWbe9K8vwkR5J8MckrVzg3AI5jZVHo7v+apB5h83OOs38nedWq5gPAiXlHMwBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAWFkUqurNVXV/VX1k09hZVfXuqvrocn/mMl5V9caqOlJVt1fVM1Y1LwAe2SqvFN6S5LkPG7sqyS3dfVGSW5b1JHlekouW26Ekb1rhvAB4BCuLQnf/VpLPPGz4siTXLsvXJrl80/h1veH9Sc6oqnNXNTcAjm+7n1M4p7vvW5Y/keScZfm8JPds2u/eZezrVNWhqjpcVYcfeOCB1c0UYBda2xPN3d1J+ps47uruPtDdB/bu3buCmQHsXtsdhU8ee1houb9/GT+a5PxN++1bxgDYRtsdhZuTHFyWDya5adP4y5dXIV2a5MFNDzMBsE32rOrEVfXWJD+U5OyqujfJP0ryc0lurKork9yd5Ipl93cleX6SI0m+mOSVq5oXAI9sZVHo7pc8wqbnHGffTvKqVc0FgK3xjmYAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMEQBgCEKAAxRAGCIAgBDFAAYogDAEAUAhigAMHZUFKrquVX1+1V1pKquWvd8AHabHROFqnpCkl9K8rwkFyd5SVVdvN5ZAewuOyYKSZ6Z5Eh3f6y7/0+StyW5bM1zAthVdlIUzktyz6b1e5cxALbJnnVP4GRV1aEkh5bVL1TV769zPo8zZyf51LonsRO8LgfXPQW+ln+bx/ztf34qzvInH2nDTorC0STnb1rft4x9je6+OsnV2zWp3aSqDnf3gXXPAx7Ov83ts5MePvpgkouq6sKqOj3Ji5PcvOY5AewqO+ZKobsfqqq/k+Q3kjwhyZu7+441TwtgV9kxUUiS7n5Xknetex67mIfl2Kn829wm1d3rngMAO8ROek4BgDUTBXy8CDtWVb25qu6vqo+sey67hSjscj5ehB3uLUmeu+5J7CaigI8XYcfq7t9K8pl1z2M3EQV8vAgwRAGAIQps6eNFgN1BFPDxIsAQhV2uux9KcuzjRe5KcqOPF2GnqKq3Jnlfku+qqnur6sp1z+nxzjuaARiuFAAYogDAEAUAhigAMEQBgCEKkKSq/mFV3VFVt1fVh6rqz52Cc77gVH3qbFV94VScB07ES1LZ9arq+5O8LskPdfeXq+rsJKd39x9s4dg9y3s9Vj3HL3T3k1f9c8CVAiTnJvlUd385Sbr7U939B1X18SUQqaoDVXXrsvzaqrq+qn47yfVV9f6q+tPHTlZVty77v6KqfrGqnlJVd1fVacv2J1XVPVX1xKr6zqr69aq6rar+S1V997LPhVX1vqr6cFX97Db/92AXEwVIfjPJ+VX1P6rql6vqz2/hmIuT/MXufkmSG5JckSRVdW6Sc7v78LEdu/vBJB9Kcuy8P5rkN7r7/2bju4f/bnd/X5KfTvLLyz5vSPKm7v4zSe57tL8gbJUosOt19xeSfF+SQ0keSHJDVb3iBIfd3N3/e1m+MckLl+Urkrz9OPvfkOSvLcsvXn7Gk5P8QJJfraoPJfnX2bhqSZJnJXnrsnz9yfw+8GjsWfcEYCfo7q8kuTXJrVX14SQHkzyUr/6P07c+7JA/2nTs0ar6dFV9Tzb+8P/N4/yIm5P806o6KxsBek+SJyX5XHdf8kjT+uZ+G/jmuVJg16uq76qqizYNXZLk7iQfz8Yf8CT5qyc4zQ1JXpPkKd19+8M3LlcjH8zGw0Lv7O6vdPfnk/yvqnrRMo+qqqcvh/x2Nq4okuSlJ/1LwTdJFCB5cpJrq+rOqro9G88XvDbJzyR5Q1UdTvKVE5zj7dn4I37jN9jnhiQvW+6PeWmSK6vq95Lcka9+Feqrk7xquWrxTXhsGy9JBWC4UgBgiAIAQxQAGKIAwBAFAIYoADBEAYAhCgCM/w9hHcZCrJXs8gAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x576 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# visualizing the frequency distribution\n",
"f, ax = plt.subplots(figsize=(6, 8))\n",
"ax = sns.countplot(x=\"Survived\", data=df_train, palette=\"Set2\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "mathematical-sixth",
"metadata": {},
"source": [
"The dataset is a little imbalanced (60/40) but we still use the accuracy because is not a huge imbalance."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "played-brief",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"342 passengers survived out of 891\n"
]
}
],
"source": [
"# Examining how many passengers survived\n",
"print(sum(df_train['Survived']),'passengers survived out of',len(df_train))"
]
},
{
"cell_type": "markdown",
"id": "thorough-baseball",
"metadata": {},
"source": [
"**Analyze independant variables**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "conservative-tablet",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 NaN\n",
"1 C85\n",
"2 NaN\n",
"3 C123\n",
"4 NaN\n",
" ... \n",
"886 NaN\n",
"887 B42\n",
"888 NaN\n",
"889 C148\n",
"890 NaN\n",
"Name: Cabin, Length: 891, dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['Cabin']"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "junior-plenty",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Cabin column is missing 687 values out of 891\n"
]
}
],
"source": [
"print(\"The Cabin column is missing\", sum(df_train['Cabin'].isna()), \"values out of\",len(df_train['Cabin']))"
]
},
{
"cell_type": "markdown",
"id": "medieval-cartoon",
"metadata": {},
"source": [
"Takeaways from Univariate Analysis :\n",
"\n",
"1. We have 2 categories of values, 1 and 0.\n",
"2. We have far more negative(0) values than positive(1) values.\n",
"3. 0 appears 61.6% of time, 1 appears 38.8% of time.\n",
"4. The Cabin column is missing 687 values out of 891, we should drop it.\n",
"\n",
"61.6% is our null accuracy - the accuracy of a classification model that just guesses the most common category over and over again. Our absolute baseline for our machine learning pipeline will have to be beating the null accuracy."
]
},
{
"cell_type": "markdown",
"id": "offshore-tonight",
"metadata": {},
"source": [
"### Bivariate Analysis\n",
"\n",
"Bivariate Analysis requires you to learn about relationships between pairs of variables. We can use a scatter plot, a pair plot or a correlation matrix."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "gothic-adjustment",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>22.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>38.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>26.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>35.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>35.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>27.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>26.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>32.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" Name Age\n",
"0 Braund, Mr. Owen Harris 22.0\n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0\n",
"2 Heikkinen, Miss. Laina 26.0\n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0\n",
"4 Allen, Mr. William Henry 35.0\n",
".. ... ...\n",
"886 Montvila, Rev. Juozas 27.0\n",
"887 Graham, Miss. Margaret Edith 19.0\n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" NaN\n",
"889 Behr, Mr. Karl Howell 26.0\n",
"890 Dooley, Mr. Patrick 32.0\n",
"\n",
"[891 rows x 2 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One can look at several columns together\n",
"df_train[[\"Name\", \"Age\"]]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "exciting-latvia",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In first class 62.96296296296296 % of passengers survived\n",
"In second class 47.28260869565217 % of passengers survived\n",
"In third class 24.236252545824847 % of passengers survived\n"
]
}
],
"source": [
"class_survived = df_train[['Pclass', 'Survived']]\n",
"\n",
"first_class = class_survived[class_survived['Pclass'] == 1]\n",
"second_class = class_survived[class_survived['Pclass'] == 2]\n",
"third_class = class_survived[class_survived['Pclass'] == 3]\n",
"\n",
"print(\"In first class\", sum(first_class['Survived'])/len(first_class)*100, \"% of passengers survived\")\n",
"print(\"In second class\", sum(second_class['Survived'])/len(second_class)*100, \"% of passengers survived\")\n",
"print(\"In third class\", sum(third_class['Survived'])/len(third_class)*100, \"% of passengers survived\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "religious-relations",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Age by Passenger Class, Titanic')"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(8,5))\n",
"sns.boxplot(x='Pclass',y='Age',data=df_train, palette='rainbow')\n",
"plt.title(\"Age by Passenger Class, Titanic\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "periodic-result",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"74.2"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"female_mean = df_train[df_train.Sex == 'female'].Survived.mean()\n",
"female_mean = round(female_mean * 100, 2)\n",
"female_mean"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "south-click",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"18.89"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"male_mean = df_train[df_train.Sex == 'male'].Survived.mean()\n",
"male_mean = round(male_mean * 100, 2)\n",
"male_mean"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "collected-complaint",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"38.38"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"global_mean = df_train.Survived.mean()\n",
"global_mean = round(global_mean * 100, 2)\n",
"global_mean"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "amazing-davis",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Percent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>female</th>\n",
" <td>74.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>18.89</td>\n",
" </tr>\n",
" <tr>\n",
" <th>global mean</th>\n",
" <td>38.38</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Percent\n",
"female 74.20\n",
"male 18.89\n",
"global mean 38.38"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compare_mean = pd.DataFrame(\n",
" data = [female_mean, male_mean, global_mean],\n",
" index = ['female', 'male', 'global mean'],\n",
" columns = ['Percent'],\n",
")\n",
" \n",
"compare_mean"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "dated-jason",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAATMElEQVR4nO3dfbBc9X3f8fcnkrBsiQACVcNYaaXGGEMRCFBlBKXjIhTjOjyUYscMTQVWq/GUuKaUiUljDB3jFjIUkhI7jAZTqTMu2DG2pcgDNigmsYmLLR4MAfGgYGGL4eFCwQZiHhS+/eMewY1YcVe6T/pJ79fMzp7zO7+z57u7Zz/37O+e3U1VIUlqz69MdAGSpJ1jgEtSowxwSWqUAS5JjTLAJalRk8dzYwcccEDNmTNnPDcpSc278847n6mqmdu2j2uAz5kzh/Xr14/nJiWpeUke69XuEIokNcoAl6RGGeCS1KhxHQOXtOd67bXX2Lx5My+//PJEl7LLmjp1KrNnz2bKlCl99TfAJY2LzZs3s/feezNnzhySTHQ5u5yq4tlnn2Xz5s3MnTu3r3UcQpE0Ll5++WX2339/w3s7krD//vvv0DsUA1zSuDG8396OPj4GuCQ1yjFwSRNizoXfGtXb23TZh4ftM2nSJObNm8eWLVs45JBDWLVqFe9617tGtY7h3Hbbbey1114ce+yxI76t3TrAR3sH2dX0s8NKetM73/lO7rnnHgDOOussrrnmGs4///xh19uyZQuTJ49OXN52221Mnz59VALcIRRJe6Tjjz+ejRs38tJLL/Hxj3+chQsXcuSRR7J69WoAVq5cySmnnMIJJ5zA4sWLefHFFznnnHOYN28ehx9+ODfeeCMA3/nOd1i0aBFHHXUUH/nIR3jxxReBwa8OufjiiznqqKOYN28eDz74IJs2beKaa67hqquuYv78+Xzve98b0X3YrY/AJamXLVu2cNNNN3HSSSfx+c9/nhNOOIHrrruO559/noULF3LiiScCcNddd3HvvfcyY8YMPv3pT7PPPvtw3333AfDcc8/xzDPPcOmll3Lrrbcybdo0Lr/8cq688ko++9nPAnDAAQdw11138cUvfpErrriCa6+9lk984hNMnz6dCy64YMT3wwCXtMf45S9/yfz584HBI/Bly5Zx7LHHsmbNGq644gpg8HTHn/70pwAsWbKEGTNmAHDrrbdyww03vHFb++23H2vXruWBBx7guOOOA+DVV19l0aJFb/Q5/fTTATj66KP5+te/Pur3xwCXtMcYOga+VVVx4403cvDBB/+99jvuuINp06a97e1VFUuWLOH666/vufwd73gHMPjP0y1btux84dvhGLikPdoHP/hBrr76aqoKgLvvvrtnvyVLlvCFL3zhjfnnnnuOY445httvv52NGzcC8NJLL/Hwww+/7fb23ntvXnjhhVGp3SNwSRNiVzmL6qKLLuK8887j8MMP5/XXX2fu3LmsXbv2Lf0+85nPcO6553LYYYcxadIkLr74Yk4//XRWrlzJmWeeySuvvALApZdeynvf+97tbu/kk0/mjDPOYPXq1Vx99dUcf/zxO117tv7VGQ8LFiyo8fxBB08jlHYdGzZs4JBDDpnoMnZ5vR6nJHdW1YJt+w47hJLk4CT3DLn8Isl5SWYkuSXJI931fqN4HyRJwxg2wKvqoaqaX1XzgaOBvwW+AVwIrKuqg4B13bwkaZzs6D8xFwN/U1WPAacCq7r2VcBpo1iXpN3QeA7ZtmhHH58dDfCPAVvPl5lVVU90008Cs3qtkGR5kvVJ1g8MDOzg5iTtLqZOncqzzz5riG/H1u8Dnzp1at/r9H0WSpK9gFOA3+ux4UrS81mpqhXAChj8J2bflUnarcyePZvNmzfjgdz2bf1Fnn7tyGmEHwLuqqqnuvmnkhxYVU8kORB4egduS9IeZsqUKX3/0oz6syNDKGfy5vAJwBpgaTe9FFg9WkVJkobXV4AnmQYsAYZ+mP8yYEmSR4ATu3lJ0jjpawilql4C9t+m7VkGz0qRJE0AvwtFkhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJalS/P2q8b5KvJXkwyYYki5LMSHJLkke66/3GulhJ0pv6PQL/I+DmqnofcASwAbgQWFdVBwHrunlJ0jgZNsCT7AP8c+BLAFX1alU9D5wKrOq6rQJOG5sSJUm99HMEPhcYAP5XkruTXJtkGjCrqp7o+jwJzOq1cpLlSdYnWT8wMDA6VUuS+grwycBRwJ9U1ZHAS2wzXFJVBVSvlatqRVUtqKoFM2fOHGm9kqROPwG+GdhcVXd0819jMNCfSnIgQHf99NiUKEnqZdgAr6ongZ8lObhrWgw8AKwBlnZtS4HVY1KhJKmnyX32+yTw5SR7AY8C5zAY/l9Nsgx4DPjo2JQoSeqlrwCvqnuABT0WLR7VaiRJffOTmJLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1Kj+vpNzCSbgBeAvwO2VNWCJDOArwBzgE3AR6vqubEpU5K0rR05Av8XVTW/qrb+uPGFwLqqOghY181LksbJSIZQTgVWddOrgNNGXI0kqW/9BngB30lyZ5LlXdusqnqim34SmNVrxSTLk6xPsn5gYGCE5UqStuprDBz4Z1X1eJJ/ANyS5MGhC6uqklSvFatqBbACYMGCBT37SJJ2XF9H4FX1eHf9NPANYCHwVJIDAbrrp8eqSEnSWw0b4EmmJdl76zTwG8BfA2uApV23pcDqsSpSkvRW/QyhzAK+kWRr//9TVTcn+RHw1STLgMeAj45dmZKkbQ0b4FX1KHBEj/ZngcVjUZQkaXh+ElOSGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqVN8BnmRSkruTrO3m5ya5I8nGJF9JstfYlSlJ2taOHIF/CtgwZP5y4Kqqeg/wHLBsNAuTJL29vgI8yWzgw8C13XyAE4CvdV1WAaeNQX2SpO3o9wj8D4HfBV7v5vcHnq+qLd38ZuDdvVZMsjzJ+iTrBwYGRlKrJGmIYQM8yW8CT1fVnTuzgapaUVULqmrBzJkzd+YmJEk9TO6jz3HAKUn+JTAV+FXgj4B9k0zujsJnA4+PXZmSpG0NewReVb9XVbOrag7wMeDPq+os4LvAGV23pcDqMatSkvQWIzkP/NPA+Uk2Mjgm/qXRKUmS1I9+hlDeUFW3Abd1048CC0e/JEmtm3Phtya6hDG16bIPT3QJgJ/ElKRmGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckho1bIAnmZrkh0l+nOT+JP+1a5+b5I4kG5N8JcleY1+uJGmrfo7AXwFOqKojgPnASUmOAS4Hrqqq9wDPAcvGrEpJ0lsMG+A16MVudkp3KeAE4Gtd+yrgtLEoUJLUW19j4EkmJbkHeBq4Bfgb4Pmq2tJ12Qy8ezvrLk+yPsn6gYGBUShZkgR9BnhV/V1VzQdmAwuB9/W7gapaUVULqmrBzJkzd65KSdJb7NBZKFX1PPBdYBGwb5LJ3aLZwOOjW5ok6e30cxbKzCT7dtPvBJYAGxgM8jO6bkuB1WNUoySph8nDd+FAYFWSSQwG/leram2SB4AbklwK3A18aQzrlCRtY9gAr6p7gSN7tD/K4Hi4JGkC+ElMSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVH9/Cr9ryX5bpIHktyf5FNd+4wktyR5pLveb+zLlSRt1c8R+BbgP1fVocAxwLlJDgUuBNZV1UHAum5ekjROhg3wqnqiqu7qpl8ANgDvBk4FVnXdVgGnjVGNkqQedmgMPMkc4EjgDmBWVT3RLXoSmLWddZYnWZ9k/cDAwEhqlSQN0XeAJ5kO3AicV1W/GLqsqgqoXutV1YqqWlBVC2bOnDmiYiVJb+orwJNMYTC8v1xVX++an0pyYLf8QODpsSlRktTL5OE6JAnwJWBDVV05ZNEaYClwWXe9ekwq1B5pzoXfmugSxtSmyz480SVoNzBsgAPHAb8N3Jfknq7tvzAY3F9Nsgx4DPjomFQoSepp2ACvqu8D2c7ixaNbjiSpX34SU5IaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWrUsAGe5LokTyf56yFtM5LckuSR7nq/sS1TkrStfo7AVwInbdN2IbCuqg4C1nXzkqRxNGyAV9VfAv9vm+ZTgVXd9CrgtNEtS5I0nJ0dA59VVU90008Cs7bXMcnyJOuTrB8YGNjJzUmStjXif2JWVQH1NstXVNWCqlowc+bMkW5OktTZ2QB/KsmBAN3106NXkiSpHzsb4GuApd30UmD16JQjSepXP6cRXg/8ADg4yeYky4DLgCVJHgFO7OYlSeNo8nAdqurM7SxaPMq1SJJ2gJ/ElKRGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckho1ogBPclKSh5JsTHLhaBUlSRreTgd4kknAF4APAYcCZyY5dLQKkyS9vZEcgS8ENlbVo1X1KnADcOrolCVJGs7kEaz7buBnQ+Y3A+/ftlOS5cDybvbFJA+NYJu7ugOAZ8ZrY7l8vLa0R/C5a9vu/vz9o16NIwnwvlTVCmDFWG9nV5BkfVUtmOg6tON87tq2pz5/IxlCeRz4tSHzs7s2SdI4GEmA/wg4KMncJHsBHwPWjE5ZkqTh7PQQSlVtSfI7wLeBScB1VXX/qFXWpj1iqGg35XPXtj3y+UtVTXQNkqSd4CcxJalRBrgkNcoAHyLJf0yyIcmXx+j2L0lywVjctkZXkg8kWTvRdewukqxMcsYwfTYlOWAHbvPsJH888uraNebngTfmPwAnVtXmiS5EkobjEXgnyTXAPwZuSvL7Sa5L8sMkdyc5tetzdpJvJrmlO1r4nSTnd33+b5IZXb9/n+RHSX6c5MYk7+qxvV9PcnOSO5N8L8n7xvce7/6SzEnyYHf093CSLyc5McntSR5JsrC7/KB7Dv8qycE9bmdar/1Bb5Xkou4L7r6f5Ppe7ziTLO4ex/u6x/UdQxb/btf+wyTv6fqfnOSObp1bk8wapoZLkqzqXlePJTk9yR90t3tzkildv6OT/EX3Gvx2kgO79p6v324/+p/dfvLocO8oxkVVeekuwCYGP5L734B/07XtCzwMTAPOBjYCewMzgZ8Dn+j6XQWc103vP+Q2LwU+2U1fAlzQTa8DDuqm3w/8+UTf/93tAswBtgDzGDxYuRO4DgiD39vzTeBXgcld/xOBG7vpDwBru+me+8NE379d7QL8U+AeYGr3GnlkyP6+EjijW/Yz4L1d+/8e8rrZBPx+N/1vhzz++/HmGXP/Dvgf3fTZwB/3qOMS4PvAFOAI4G+BD3XLvgGc1i37K2Bm1/5bDJ4K/Xav35XAn3b70qEMfhfUhD7mDqH09hvAKUOOHqYC/7Cb/m5VvQC8kOTnwJ917fcBh3fThyW5lMEX+3QGz5V/Q5LpwLHAnybZ2jz0KESj5ydVdR9AkvuBdVVVSe5jMOD3AVYlOQgoBl/Y29re/rBhrItvzHHA6qp6GXg5yZ/16HMwg8/Jw938KuBc4A+7+euHXF/VTc8GvtIdIe8F/KSPWm6qqte653kScHPXvvV5Pxg4DLilew1OAp7o+rzd6/ebVfU68MBw7wTGgwHeW4B/XVV/74u3krwfeGVI0+tD5l/nzcdzJXBaVf04ydkMHs0N9SvA81U1f1SrVi/DPV+fY/CP8r9KMge4rcdt9NwfNCaqx/TVwJVVtSbJBxg8wh7OKwBV9XqS16o7hObN5z3A/VW1qMe6K9n+63fo/hQmmGPgvX0b+GS6P81JjtzB9fcGnujG2s7admFV/QL4SZKPdLefJEeMsGbtnH148zt8zt5On5HuD3uK24GTk0zt3mX+Zo8+DwFzto5vA78N/MWQ5b815PoH3fTQ52jpKNX6EDAzySKAJFOS/JNu2du+fnclBnhvn2PwrfS93dvuz+3g+hcBdzC4Qz+4nT5nAcuS/Bi4H79LfaL8AfDfk9zN9t+RjnR/2CNU1Y8Y/D6ke4GbGByu+Pk2fV4GzmFw+PA+Bo+IrxnSZb8k9wKfAv5T13ZJ1/9ORukrY2vwNwzOAC7vXoP3MDisCf29fncJfpRe0qhJMr2qXuzO3PhLYHlV3TXRde2uHAOXNJpWZPCnFacCqwzvseURuCQ1yjFwSWqUAS5JjTLAJalRBrgkNcoAl6RG/X+lbBx96OA19AAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ax = compare_mean.plot.bar(y='Percent', rot=0)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "gothic-illustration",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1584x1440 with 9 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# distribution \n",
"df_train.hist(figsize=(22, 20), bins=30, edgecolor=\"black\")\n",
"plt.subplots_adjust(hspace=0.7, wspace=0.4)"
]
},
{
"cell_type": "markdown",
"id": "essential-serum",
"metadata": {},
"source": [
"### Check missing values and outliers"
]
},
{
"cell_type": "markdown",
"id": "royal-seller",
"metadata": {},
"source": [
"#### Categorial Variables Analysis"
]
},
{
"cell_type": "code",
"execution_count": 707,
"id": "brilliant-newton",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Ticket</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>A/5 21171</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>PC 17599</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>113803</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>373450</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Sex \\\n",
"0 Braund, Mr. Owen Harris male \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female \n",
"2 Heikkinen, Miss. Laina female \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female \n",
"4 Allen, Mr. William Henry male \n",
"\n",
" Ticket Cabin Embarked \n",
"0 A/5 21171 NaN S \n",
"1 PC 17599 C85 C \n",
"2 STON/O2. 3101282 NaN S \n",
"3 113803 C123 S \n",
"4 373450 NaN S "
]
},
"execution_count": 707,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# First, segregating the categorical from the dataframe.\n",
"cat_vars = ['object']\n",
"cat_df = df_train.select_dtypes(include=cat_vars)\n",
"cat_df.head()"
]
},
{
"cell_type": "markdown",
"id": "solid-writer",
"metadata": {},
"source": [
"We have 5 categorical columns out of which:\n",
"\n",
"Sex is a binary categorical features."
]
},
{
"cell_type": "markdown",
"id": "verbal-extreme",
"metadata": {},
"source": [
"Next, we need to check for the number of labels each of these variables has. The number of labels a variable has defines its cardinality. Each categorical variable consists of unique values. A categorical feature is said to possess high cardinality when there are too many of these unique values. One-Hot Encoding becomes a big problem in such a case since we have a separate column for each unique value (indicating its presence or absence) in the categorical variable."
]
},
{
"cell_type": "code",
"execution_count": 708,
"id": "swedish-shelter",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')"
]
},
"execution_count": 708,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_df = df_train.select_dtypes(include=cat_vars)\n",
"cat_df.columns"
]
},
{
"cell_type": "code",
"execution_count": 709,
"id": "remarkable-james",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name 0\n",
"Sex 0\n",
"Ticket 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64 =============\n",
"Name has 891 labels\n",
"Sex has 2 labels\n",
"Ticket has 681 labels\n",
"Cabin has 148 labels\n",
"Embarked has 4 labels\n"
]
}
],
"source": [
"# printing missing value and labels in each column\n",
"print(cat_df.isnull().sum(), \"=============\")\n",
"for var in list(cat_df.columns):\n",
" print(var, 'has', len(cat_df[var].unique()), 'labels')"
]
},
{
"cell_type": "markdown",
"id": "younger-machinery",
"metadata": {},
"source": [
"Name, Ticket and Cabin has a high cardinality."
]
},
{
"cell_type": "markdown",
"id": "elementary-overall",
"metadata": {},
"source": [
"#### Numerical Variables Analysis"
]
},
{
"cell_type": "code",
"execution_count": 710,
"id": "invisible-apache",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <td>1.00</td>\n",
" <td>2.0000</td>\n",
" <td>3.000</td>\n",
" <td>4.0</td>\n",
" <td>5.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Survived</th>\n",
" <td>0.00</td>\n",
" <td>1.0000</td>\n",
" <td>1.000</td>\n",
" <td>1.0</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <td>3.00</td>\n",
" <td>1.0000</td>\n",
" <td>3.000</td>\n",
" <td>1.0</td>\n",
" <td>3.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Age</th>\n",
" <td>22.00</td>\n",
" <td>38.0000</td>\n",
" <td>26.000</td>\n",
" <td>35.0</td>\n",
" <td>35.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SibSp</th>\n",
" <td>1.00</td>\n",
" <td>1.0000</td>\n",
" <td>0.000</td>\n",
" <td>1.0</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Parch</th>\n",
" <td>0.00</td>\n",
" <td>0.0000</td>\n",
" <td>0.000</td>\n",
" <td>0.0</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fare</th>\n",
" <td>7.25</td>\n",
" <td>71.2833</td>\n",
" <td>7.925</td>\n",
" <td>53.1</td>\n",
" <td>8.05</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4\n",
"PassengerId 1.00 2.0000 3.000 4.0 5.00\n",
"Survived 0.00 1.0000 1.000 1.0 0.00\n",
"Pclass 3.00 1.0000 3.000 1.0 3.00\n",
"Age 22.00 38.0000 26.000 35.0 35.00\n",
"SibSp 1.00 1.0000 0.000 1.0 0.00\n",
"Parch 0.00 0.0000 0.000 0.0 0.00\n",
"Fare 7.25 71.2833 7.925 53.1 8.05"
]
},
"execution_count": 710,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# isolating numerical columns in a dataframe\n",
"numerics = ['int64', 'float64']\n",
"num_df = df_train.select_dtypes(include=numerics)\n",
"num_df.head().T"
]
},
{
"cell_type": "markdown",
"id": "otherwise-vector",
"metadata": {},
"source": [
"1 numerical column has missing values"
]
},
{
"cell_type": "code",
"execution_count": 711,
"id": "checked-tooth",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 711,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# simple way to get missing data\n",
"df_train.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 712,
"id": "proper-leadership",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Total</th>\n",
" <th>Percent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Cabin</th>\n",
" <td>687</td>\n",
" <td>0.771044</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Age</th>\n",
" <td>177</td>\n",
" <td>0.198653</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Embarked</th>\n",
" <td>2</td>\n",
" <td>0.002245</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Survived</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Name</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SibSp</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Parch</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Ticket</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fare</th>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Total Percent\n",
"Cabin 687 0.771044\n",
"Age 177 0.198653\n",
"Embarked 2 0.002245\n",
"PassengerId 0 0.000000\n",
"Survived 0 0.000000\n",
"Pclass 0 0.000000\n",
"Name 0 0.000000\n",
"Sex 0 0.000000\n",
"SibSp 0 0.000000\n",
"Parch 0 0.000000\n",
"Ticket 0 0.000000\n",
"Fare 0 0.000000"
]
},
"execution_count": 712,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Make sure to get all missing data\n",
"total = df_train.isnull().sum().sort_values(ascending=False)\n",
"percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)\n",
"missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])\n",
"missing_data.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 713,
"id": "accessible-earthquake",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Percent missing data by feature')"
]
},
"execution_count": 713,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x864 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot missing values\n",
"\n",
"f, ax = plt.subplots(figsize=(15, 12))\n",
"plt.xticks(rotation='90')\n",
"sns.barplot(x=total.index, y=total)\n",
"plt.xlabel('Features', fontsize=15)\n",
"plt.ylabel('Percent of missing values', fontsize=15)\n",
"plt.title('Percent missing data by feature', fontsize=15)"
]
},
{
"cell_type": "markdown",
"id": "permanent-density",
"metadata": {},
"source": [
"- Cabin has 77% of missing data\n",
"- Age has 20% of missing data"
]
},
{
"cell_type": "markdown",
"id": "discrete-latex",
"metadata": {},
"source": [
"### Outlier Analysis"
]
},
{
"cell_type": "code",
"execution_count": 714,
"id": "modern-hydrogen",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>min</th>\n",
" <th>25%</th>\n",
" <th>50%</th>\n",
" <th>75%</th>\n",
" <th>max</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <td>891.0</td>\n",
" <td>446.000000</td>\n",
" <td>257.353842</td>\n",
" <td>1.00</td>\n",
" <td>223.5000</td>\n",
" <td>446.0000</td>\n",
" <td>668.5</td>\n",
" <td>891.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Survived</th>\n",
" <td>891.0</td>\n",
" <td>0.383838</td>\n",
" <td>0.486592</td>\n",
" <td>0.00</td>\n",
" <td>0.0000</td>\n",
" <td>0.0000</td>\n",
" <td>1.0</td>\n",
" <td>1.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <td>891.0</td>\n",
" <td>2.308642</td>\n",
" <td>0.836071</td>\n",
" <td>1.00</td>\n",
" <td>2.0000</td>\n",
" <td>3.0000</td>\n",
" <td>3.0</td>\n",
" <td>3.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Age</th>\n",
" <td>714.0</td>\n",
" <td>29.699118</td>\n",
" <td>14.526497</td>\n",
" <td>0.42</td>\n",
" <td>20.1250</td>\n",
" <td>28.0000</td>\n",
" <td>38.0</td>\n",
" <td>80.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SibSp</th>\n",
" <td>891.0</td>\n",
" <td>0.523008</td>\n",
" <td>1.102743</td>\n",
" <td>0.00</td>\n",
" <td>0.0000</td>\n",
" <td>0.0000</td>\n",
" <td>1.0</td>\n",
" <td>8.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Parch</th>\n",
" <td>891.0</td>\n",
" <td>0.381594</td>\n",
" <td>0.806057</td>\n",
" <td>0.00</td>\n",
" <td>0.0000</td>\n",
" <td>0.0000</td>\n",
" <td>0.0</td>\n",
" <td>6.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fare</th>\n",
" <td>891.0</td>\n",
" <td>32.204208</td>\n",
" <td>49.693429</td>\n",
" <td>0.00</td>\n",
" <td>7.9104</td>\n",
" <td>14.4542</td>\n",
" <td>31.0</td>\n",
" <td>512.3292</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count mean std min 25% 50% 75% \\\n",
"PassengerId 891.0 446.000000 257.353842 1.00 223.5000 446.0000 668.5 \n",
"Survived 891.0 0.383838 0.486592 0.00 0.0000 0.0000 1.0 \n",
"Pclass 891.0 2.308642 0.836071 1.00 2.0000 3.0000 3.0 \n",
"Age 714.0 29.699118 14.526497 0.42 20.1250 28.0000 38.0 \n",
"SibSp 891.0 0.523008 1.102743 0.00 0.0000 0.0000 1.0 \n",
"Parch 891.0 0.381594 0.806057 0.00 0.0000 0.0000 0.0 \n",
"Fare 891.0 32.204208 49.693429 0.00 7.9104 14.4542 31.0 \n",
"\n",
" max \n",
"PassengerId 891.0000 \n",
"Survived 1.0000 \n",
"Pclass 3.0000 \n",
"Age 80.0000 \n",
"SibSp 8.0000 \n",
"Parch 6.0000 \n",
"Fare 512.3292 "
]
},
"execution_count": 714,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# summary statistics of all the columns\n",
"num_df.describe().T"
]
},
{
"cell_type": "markdown",
"id": "willing-interview",
"metadata": {},
"source": [
"We check the mean column in relation to the min and the max."
]
},
{
"cell_type": "code",
"execution_count": 715,
"id": "measured-corps",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAawAAAEfCAYAAAAdlvJ3AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAASQUlEQVR4nO3de5CddX3H8fduAqgkIXRdMJBA8JJvEZlSlWKHW8FSr6jUeuFOM06VOiLFiSI4OtWigre0QqtWUyMXEdqKIuMNqQi0MFrBDlS/ChJIJJRlgZCAIGRP/zhn4ewSNudAzvOc38n7NXNmz3M73+8OZD/ze57feZ6hRqOBJEn9brjuBiRJ6oSBJUkqgoElSSqCgSVJKsLsuhvowHbAvsBaYGPNvUiSemcWsAD4MfDw9I0lBNa+wFV1NyFJqsyBwNXTV5YQWGsB7r33ASYmnIIvSYNqeHiIHXfcHlp/96crIbA2AkxMNAwsSdo6bPLyj5MuJElFMLAkSUUwsCRJRTCwJElFMLCkPnTxxReydOlRfP3rF9XditQ3KgusiHhtRFwfETdExM8i4s+rqi2V5tvf/iYAl156Sb2NSH2kksCKiCHgXODYzNwHOBZYGRGO8KRpLr74winLjrKkpioDYwLYofV+PrA2MycqrC8VYXJ0NclRltRUyReHM7MREW8GvhERDwBzgVd38xkjI3N60ptUgtHRuXW3INWuksCKiNnA+4HXZ+Y1EbE/cFFEvDAzN3TyGePjG7zThbZaY2Pr625B6rnh4aEZBydVnRLcB9glM68BaP18ANizovpSMV71qtdNWT788DfU04jUZ6oKrDXAwogIgIjYE9gZuKWi+lIx3vSmt05ZPuKIN9fUidRfKgmszLwTOBH414j4GXAhsDQz76mivlSayVGWoyvpcUONRt9fF1oM3Oo1LEkabG3XsPYAVj1he9UNSZL0VBhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQizK6iSEQsBi5pWzUfmJeZv1dFfUlS+SoJrMxcBewzuRwRy6uqLUkaDJWHRkRsCxwNvKLq2pKkctUxynkd8JvM/Gk3B42MzOlRO5KkEtQRWEuBFd0eND6+gYmJRg/akST1g+HhoRkHJ5XOEoyIXYGDgfOrrCtJKl/V09qPBy7LzPGK60qSCld1YJ3AUzgdKElSpdewMnNJlfUkSYPDO11IkopgYEmSimBgSZKKYGBJkopgYEmSimBgSZKKYGBJkopgYEmSimBgSZKKYGBJkopgYEmSimBgSZKKYGBJkopgYEmSimBgSZKKYGBJkopgYEmSimBgSZKKYGBJkoowu6pCEfEM4DPAnwIPAf+VmX9VVX1JUtkqCyzgLJpBtSQzGxGxc4W1JUmFqySwImIOcBywMDMbAJn5f1XUliQNhqpGWM8DxoEPRcQhwAbgA5l5dacfMDIyp1e9SZIKUFVgzQKeC1yfmcsiYj/g0oh4fmbe38kHjI9vYGKi0dMmJUn1GR4emnFwUtUswduBR4GvAmTmdcDdwJKK6kuSCldJYGXm3cB/AIcBRMQSYCfg5irqS5LKV+UswXcAKyLiU8AjwLGZeV+F9SVJBasssDLz18CfVFVPkjRYvNOFJKkIBpYkqQgGliSpCAaWJKkIBpYkqQgGliSpCAaWJKkIBpYkqQgGliSpCAaWJKkIBpYkqQgGliSpCAaWJKkIBpYkqQgGliSpCAaWJKkIBpYkqQgGliSpCAaW1IeuuOL7LF16FFde+YO6W5H6xuyqCkXEKuCh1gvgfZn53arqSyU577wvA7By5QoOPvjl9TYj9YnKAqvlLzLzxoprSkW54orvA43WUoMrr/yBoSXhKUGp70yOriatXLminkakPlP1COv8iBgCrgZOy8z7Oj1wZGROz5qS+kvjCcujo3Nr6UTqJ1UG1oGZuToitgOWA2cDx3R68Pj4BiYmpv9DlgbREFNDa4ixsfV1NSNVZnh4aMbBSWWnBDNzdevnw8A/AvtXVVsqyTHHnDBl+fjjl9bTiNRnKgmsiNg+InZovR8C3grcUEVtqTSHHnoYzVEWwJATLqSWqkZYOwM/jIj/AW4ElgB/XVFtqTiToyxHV9LjhhqNvr8utBi41WtYkjTY2q5h7QGsesL2qhuSJOmpMLAkSUUwsCRJRTCwJElFMLAkSUUwsCRJRTCwJElFMLAkSUUwsCRJRej4bu0RsS3wAeBIYBfgDuBC4IzMfGimYyVJerq6ebzIPwEBnATcBuwOnAbsCnjDM0lST3UTWG8Antf20MX/jYjrgJsxsCRJPdbNNaw7gWdNW/dMYO2Wa0eSpE3rZoR1LvCdiPgssAZYBLwT+EpEHDq5U2ZesWVblCSpu8B6e+vnadPWv6P1guZzvZ/7dJuSJGm6jgMrM/foZSOSJM2kmxHWYyIigBcCP83M27ZsS5IkPdFmJ11ExKcj4pi25eOAm4AvAL+IiFf1sD9JkoDOZgm+AfhR2/JHgZMyc5TmtasP9aAvaau2bNnJLF16FKeeekrdrUh9o5PAenZm3g4QES8CRoAvtbadByzppmBEfCgiGq3PkrQJ4+N3AXDXXXfW3InUPzoJrHURsXPr/YHATzLz4dbyNsBQp8Ui4sXAy2jeKUPSJixbdvKUZUdZUlMngXURcGFEnAScClzQtm0/4JZOCkXEdsA5wIndNiltTSZHV5McZUlNncwSPJXmd68OoznR4vNt2/aZtjyTDwPnZeaq5iTD7oyMzOn6GGlQjI7OrbsFqXabDazMfAT42yfZ9vedFImIPwZeSjP8npLx8Q1MTDSe6uFS0cbG1tfdgtRzw8NDMw5OuvoeVkQs5YmPF1mRmZtLkoOBPYFbW6OrhcB3I+IvM/N73fQgDbqRkZ2mnBbcaafn1NiN1D+GGo3ORi0RcRbwemA5zUkTuwHvBi7NzPd2UzQiVgGvzcwbO9h9MXCrIyxtTZYuPeqx9ytWXDDDntLgaBth7QGsmr69mxHWCcCLM3PN5IqIuAz4KdBVYEma2eQoy9GV9LhuAmt96zV93f3dFs3Mxd0eI21NPvGJ5XW3IPWdGQMrItrvvL4c+PeI+DiPP15kGfCZnnUnSVLL5kZYN9N8ZEj7l4MPmbbPocDZW7IpSZKmmzGwMrObJxJLktQzBpIkqQibu4b1ncx8Zev9VTRPDz5BZh7Ug94kSXrM5q5hfaXt/Rd72YgkSTPZ3DWsCyLiJcDDmbkSICJ2ojljcC/gWuA9vW5SkqROrmEtB9q/vfgF4AWtn3sBZ235tiRJmqqTwNoTuAogIuYDrwGOzsxzaN5X8PCedSdJUksngTUb+F3r/cuAtZn5S4DMXA3M701rkiQ9rpPAugl4U+v9W4HLJzdExK7Auh70JUnSFJ3cS/B9wKUR8TlgI3BA27a3ANf0ojFJktptdoSVmVfTfJTIYcBzMzPbNl8G/E2PepMk6TEdPw+rRovxeViSNPA29zwsb80kSSqCgSVJKoKBJUkqgoElSSpCJ9PaJVXs9NPfy9q1a1i4cDc+/OGP192O1BcqC6yIuITmzI8JYAPwrsy8oar6UknWrl0DwJo1t9fcidQ/qjwleHxm/kFm/iHwSWBFhbWlYpx++nunLH/wg6fW1InUXyoLrMxsv4XTDjRHWpKmmRxdTXKUJTVVeg0rIr4I/BkwBLyym2NbXyaTtkqjo3PrbkGqXaWBlZlvA4iIY4FPAK/u9FjvdKGt2djY+rpbkHqu7U4Xm95eYS+PycxzgUMiYqSO+lI/W7Bg4ZTlhQt3q6kTqb9UElgRMSciFrUtHw7c03pJanPGGVMf4u20dqmpqlOC2wMXR8T2NB9Rcg9weGZ6jk/ahAULFj72PSxJTd6tXZLUF7xbuyRpIBhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIhhYkqQiGFiSpCIYWJKkIsyuokhEjADnAs8Dfgf8Cnh7Zo5VUV+SVL6qRlgN4KzMjMzcG7gF+HhFtSVJA6CSEVZm3gP8sG3VtcCJVdRWf7jmmh9x9dVX1t1GMdatuw+AHXaYX2sfJTnggIPZf/+D6m5DPVRJYLWLiGGaYfXNbo4bGZnTm4ZUiXnznsk228yqu41i3H//OgCe/eyRmjspx7x5z2R0dG7dbaiHKg8s4LPABuDsbg4aH9/AxESjNx2p5/bee1/23nvfutsoxplnfgSAU045reZOyjI2tr7uFvQ0DA8PzTg4qTSwIuKTwAuAwzNzosrakqSyVRZYEfFR4CXAazLz4arqSpIGQ1XT2vcC3g/8EvjPiAC4NTOPqKK+JKl8Vc0SvAkYqqKWJGkweacLSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRZldRJCI+CbwRWAzsnZk3VlFXkjQ4qhphXQIcBNxWUT1J0oCpZISVmVcDREQV5SpxwQVfYfVq81e9cfvtzf+3zjzzIzV3okG1aNHuHHXUcXW30ZVKAmtLGBmZU3cLU9x55xryVzcz6xnz625FA2hi4ywAbl59d82daBBtfOg+ttlmFqOjc+tupSvFBNb4+AYmJhp1t/GYRx7ZyKxnzOdZu7+87lYkqSsP3vYDHnlkI2Nj6+tuZYrh4aEZByfOEpQkFcHAkiQVoZLAioh/iIg1wELg8oi4qYq6kqTBUdUswZOAk6qoJUkaTJ4SlCQVwcCSJBXBwJIkFcHAkiQVwcCSJBXBwJIkFcHAkiQVwcCSJBXBwJIkFcHAkiQVwcCSJBXBwJIkFcHAkiQVwcCSJBXBwJIkFcHAkiQVwcCSJBXBwJIkFcHAkiQVYXZVhSJiCbASGAHGgeMy81dV1Zckla3KEdbngHMycwlwDvD5CmtLkgpXyQgrInYCXgwc1lr1VeDsiBjNzLEqetjS1q27j40P3s36/Le6WylDY6L5knplaLj50uZNPMq6dZWdYNtiqup4EfCbzNwIkJkbI+KO1vqOAmtkZE4P2+veggXP4f7719XdRjEeffRRHn20UXcbGmCzZ89i9uzy/gjXYxsWLHgOo6Nz626kK8X81x0f38DERP/8wXvXu5bV3YIkPS1jY+vrbmGK4eGhGQcnVY2fVwO7RsQsgNbPXVrrJUnarEoCKzPvAm4AjmytOhK4vtTrV5Kk6lV5SvAdwMqI+CBwL3BchbUlSYWrLLAy8xfAflXVkyQNFueASpKKYGBJkopgYEmSimBgSZKKUMIXh2dB8wtlkqTB1fZ3ftamtpcQWAsAdtxx+7r7kCRVYwFwy/SVQ41G/9zu6ElsB+wLrAU21tyLJKl3ZtEMqx8DD0/fWEJgSZLkpAtJUhkMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCxJUhEMLElSEQwsSVIRDCypBhFxdER8r225ERHPr7Mnqd+V8DwsqVgRcQBwFrAXzcfj/Bw4OTPPB87v8DO2BT4GvAWYD9wNXJKZJ/egZalvGVhSj0TEPOBbwInARcC2wIFs4jk/m/F+4KXAH9F8LtzuwEFbrlOpDAaW1DtLADLzq63l3wLfA4iIE4C3ZeYBbfu/OiJOBuYB/wK8LzMnaD7A9OuZeUdrv1WtF63PWgV8HjiW5sPvLgFOzMyHtvyvJNXHwJJ655fAxohYCVwIXJuZ986w/xE0R1JzgMuBBL4IXAucEhG/A64CbszM6U9ePRp4BfAAcCnwgdZLGhhOupB6JDPvBw4AGsA/A2MR8c2I2PlJDjkzM+/JzNuB5cCRrfUfA86kGUo/AX4TEcdPO/bszFydmfcAZ7QdKw0MA0vqocz8eWaekJkLgRcBu9AMo01Z3fb+tta+ZObGzDwnM/enOeniDGBFROy5uWOlQWJgSRXJzF8AX6YZXJuyqO39bsAd03fIzN9m5jnAvcALuzlWKp3XsKQeiYjfB14DfC0z10TEIpqn6q59kkOWRcR1NK9hvRv4dOtzTgZuAK4DHqF5anAucH3bse+MiG8BDwKnA1/b0r+PVDdHWFLvrAf2A66LiAdoBtWNwHueZP9vAP9NM5wuA77UWv8g8CngTprfwXon8MbM/HXbsRfQnIH4a+AW4O+25C8i9YOhRmP6ZCNJJWlNa39bZl5edy9SLznCkiQVwcCSJBXBU4KSpCI4wpIkFcHAkiQVwcCSJBXBwJIkFcHAkiQV4f8BVAklWoUJZsoAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(15,10))\n",
"\n",
"plt.subplot(2, 2, 1)\n",
"ax = sns.boxplot(y=df_train[\"SibSp\"])\n",
"ax.set_xlabel(\"SibSp\")\n",
"sns.set(style=\"darkgrid\")"
]
},
{
"cell_type": "code",
"execution_count": 716,
"id": "affecting-pierce",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(15,10))\n",
"\n",
"plt.subplot(2, 2, 1)\n",
"ax = sns.boxplot(y=df_train[\"Age\"])\n",
"ax.set_xlabel(\"Age\")\n",
"sns.set(style=\"darkgrid\")"
]
},
{
"cell_type": "code",
"execution_count": 717,
"id": "fatty-university",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAboAAAEfCAYAAAA3JgPYAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAV5UlEQVR4nO3df3DU9Z3H8dfuIsEAYU0gsAkgQ6fCCmrHhh8qtEwYJ7QTEKQtHKc4Db9OwNHrwJhaTSSUXiMZOZEgUsEeLco5YzVnxMRWHH+OKBMzTAytuhMgNWuAxJBA+Lm79wey556wWYT9fnc/+3zMOPLZz8Z9rwM85/vd3e86QqFQSAAAGMpp9wAAAMQToQMAGI3QAQCMRugAAEYjdAAAoxE6AIDRetk9QLx99dVxBYN8ggIATOV0OnTNNX0vum986ILBEKEDgBTGqUsAgNEIHQDAaIQOAGA0QgcAMBqhAwzyySd7tWDBv6qxscHuUYCEQegAgzz11JMKhULauPEJu0cBEgahAwzxySd71d19XJLU3X2cozrga4QOMMRTTz0ZseaoDjiH0AGGOH80d7E1kKoIHWCI9PS+UddAqiJ0gCHuvfe+iPXSpffbNAmQWCy71mV+fr569+6ttLQ0SdKKFSs0efJk1dfXq6SkRKdOnVJubq7Wrl2rrKwsSYq6ByDSmDE3Kj29r7q7jys9va+uv36s3SMBCcHSI7r169erqqpKVVVVmjx5soLBoFauXKmSkhLV1tYqLy9PFRUVkhR1D8CF3XvvfXI4HBzNAd9g66nLhoYGpaWlKS8vT5I0d+5c1dTU9LgH4MLGjLlRW7Zs52gO+AZLv6ZnxYoVCoVC+uEPf6hf/epX8vv9ysnJCe9nZmYqGAyqo6Mj6p7b7Y75MbOy+l3JpwAASDKWhW779u3yeDw6ffq01qxZo7KyMt1+++1xf9y2tmN8Hx0AGMzpdEQ9qLHs1KXH45Ek9e7dW/PmzVNdXZ08Ho9aWlrC92lvb5fT6ZTb7Y66BwBArCwJXXd3t7q6uiRJoVBIO3fulNfr1dixY3Xy5Ent2bNHkrRjxw5NmzZNkqLuAQAQK0coFIr7eb3m5mbdd999CgQCCgaD+t73vqeHH35Y2dnZqqurU2lpacRHCAYOHChJUfdixalLADBbT6cuLQmdnQgdAJgtYV6jAwDADoQOAGA0QgcAMBqhAwAYjdABAIxG6AAARiN0AACjEToAgNEIHQDAaIQOAGA0QgcAMBqhAwAYjdABAIxG6AAARiN0AACjEToAgNEIHQDAaIQOAGA0QgcAMBqhAwAYjdABAIxG6AAARiN0AACjEToAgNEIHQDAaIQOAGA0QgcAMBqhAwAYjdABAIxG6AAARiN0AACjEToAgNEIHQDAaIQOAGA0QgcAMJrloduwYYNGjRqlTz/9VJJUX1+vGTNmqKCgQEVFRWprawvfN9oeAACxsDR0n3zyierr65WbmytJCgaDWrlypUpKSlRbW6u8vDxVVFT0uAcAQKwsC93p06dVVlamRx99NHxbQ0OD0tLSlJeXJ0maO3euampqetwDACBWvax6oCeeeEIzZszQ0KFDw7f5/X7l5OSE15mZmQoGg+ro6Ii653a7Y37crKx+V2R+AEBysiR0H3/8sRoaGrRixQorHi5CW9sxBYMhyx8XAGANp9MR9aDGktB99NFH8vl8mjp1qiTpyy+/1IIFC3T33XerpaUlfL/29nY5nU653W55PJ6L7gEAECtLXqNbvHix3n33Xe3atUu7du3SkCFDtGXLFi1cuFAnT57Unj17JEk7duzQtGnTJEljx4696B4AALGy7DW6C3E6nXrsscdUWlqqU6dOKTc3V2vXru1xDwCAWDlCoZDRL2DxGh0AmK2n1+i4MgoAwGiEDgBgNEIHADAaoQMAGI3QAQCMRugAAEYjdAAAoxE6AIDRCB0AwGiEDgBgNEIHADAaoQMAGI3QAQCMRugAAEYjdAAAoxE6AIDRCB0AwGiEDgBgNEIHADAaoQMAGI3QAQCMRugAAEYjdAAAoxE6AIDRCB0AwGiEDgBgNEIHADAaoQMAGI3QAQCMRugAAEYjdAAAoxE6AIDRCB0AwGiEDgBgNEIHADBaL6seaOnSpfrnP/8pp9Op9PR0PfLII/J6vWpqalJxcbE6OjrkdrtVXl6uESNGSFLUPQAAYuEIhUIhKx6oq6tL/fv3lyT97W9/U2VlpV566SXNnz9fs2fP1h133KGqqiq9+OKL2rZtmyRF3YtVW9sxBYOWPEUAgA2cToeysvpdfN+qQc5HTpKOHTsmh8OhtrY2NTY2qrCwUJJUWFioxsZGtbe3R90DACBWlp26lKTf/OY3eu+99xQKhfTMM8/I7/dr8ODBcrlckiSXy6Xs7Gz5/X6FQqGL7mVmZsb8mNEqDwAwn6WhW7NmjSTp5Zdf1mOPPab7778/7o/JqUsAMFvCnLr8ppkzZ2r37t0aMmSIWltbFQgEJEmBQECHDh2Sx+ORx+O56B4AALGyJHTHjx+X3+8Pr3ft2qUBAwYoKytLXq9X1dXVkqTq6mp5vV5lZmZG3QMAIFaWvOvyyJEjWrp0qU6cOCGn06kBAwbowQcf1JgxY+Tz+VRcXKzOzk5lZGSovLxcI0eOlKSoe7Hi1CUAmK2nU5eWfbzALoQOAMyWkK/RAQBgFUIHADAaoQMAGI3QAQCMdkkfGPf5fKqpqdGRI0dUWloqn8+nM2fOaPTo0fGaDwCAyxLzEd1rr72mu+66S62traqqqpIkdXd36/e//33chgMA4HLFHLr169fr2WefVVlZWfj6k6NHj9bf//73uA0H4NJUV1epqGieampesXsUIGHEHLr29naNGjVKkuRwOML/Pv9rAPb7y1/+W5L0wgvP2zwJkDhiDt2YMWPCpyzPe/XVV3XjjTde8aEAXLrq6sg/nxzVAefEfGUUn8+nBQsWaOjQoaqvr9eECRPU1NSkrVu3JvS3fnNlFKSKoqJ537pt69bnbJgEsFZPV0aJ6V2XoVBIvXv3VnV1td5++21NmTJFHo9HU6ZMUd++fa/YsAAAXGkxhc7hcGj69Omqq6vTT3/603jPBADAFRPza3Rer1dNTU3xnAXAZbjzzjkR61/84l9smgRILDF/YHz8+PFatGiRZs2apSFDhkS82/JnP/tZXIYDELvCwjvC77qUpGnTpts4DZA4Yg5dXV2dcnNz9eGHH0bc7nA4CB2QIPr1669jx7rUv39/u0cBEkbMofvTn/4UzzkAXKaDB/fr2LEuSVJXV5eamw9o2LBrbZ4KsN93uqhzKBRSMBgM/wPAfps3V0asn356g02TAIkl5iO61tZWlZWVac+ePers7IzY27dv3xUfDMClaWn5IuoaSFUxH9GVlpbqqquu0h//+Eelp6frpZdeUn5+vlatWhXP+QDEKCcnN+oaSFUxh+7jjz/W7373O3m9XjkcDo0ePVpr1qzR1q1b4zkfgBgtXrwsYr1kyXKbJgESS8yhczqd6tXr3JnOjIwMtbe3Kz09Xa2trXEbDkDshg8fET6Ky8nJ5Y0owNd6DN3hw4clSTfddJPeeustSdKkSZP0wAMPaPny5Ro7dmx8JwQQs4kTJ0mSJk36kc2TAImjx4s633zzzaqrq1NnZ6eCwaAefvhhVVRUaMuWLeru7tY999yj7Oxsq+a9ZFzUGalk0aL5CgTOyuXqpT/8YZvd4wCWuOyLOp/vYEZGhiTpww8/VJ8+fbRs2bJoPwbAYrt3v69A4KwkKRA4q48++kDjxk20eSrAfj2euuSLVYHk8MwzmyLWmzdvtGkSILH0eEQXCAT0wQcfhI/szp49G7GWpFtuuSV+EwKIyfmjuYutgVTVY+iysrL00EMPhddutzti7XA49MYbb8RnOgAxc7lcCgQCEWsAMYRu165dVswB4DJdd51X+/Y1hNejR19v4zRA4vhO17oEkHiamnwRa5/vc5smARILoQMMccstt0VdA6mK0AGGyM0dFrEePpwrowASoQOM8fzzkd8Z+ec//5dNkwCJhdABhuDjBcCFETrAEC5Xr6hrIFVZErqvvvpKixYtUkFBgaZPn67ly5ervb1dklRfX68ZM2aooKBARUVFamtrC/9ctD0AkRYu/LeI9eLFS22aBEgsloTO4XBo4cKFqq2t1SuvvKJhw4apoqJCwWBQK1euVElJiWpra5WXl6eKigpJiroH4NsmTLg1fBTncvXiOpfA1ywJndvt1oQJE8LrH/zgB2ppaVFDQ4PS0tKUl5cnSZo7d65qamokKeoegAs7f1TH0Rzwfyw/iR8MBvX8888rPz9ffr9fOTk54b3MzEwFg0F1dHRE3XO73TE/XrSvbgBMU1hYoMLCArvHABKK5aFbvXq10tPTddddd+mvf/1r3B+P76MDALNd9vfRXUnl5eU6cOCANm3aJKfTKY/Ho5aWlvB+e3u7nE6n3G531D0AAGJl2ccLHn/8cTU0NKiyslK9e/eWJI0dO1YnT57Unj17JEk7duzQtGnTetwDACBWjtA3v1guTj777DMVFhZqxIgR6tOnjyRp6NChqqysVF1dnUpLS3Xq1Cnl5uZq7dq1GjhwoCRF3YsVpy4BwGw9nbq0JHR2InQAYLaeQseVUQAARiN0AACjEToAgNEIHQDAaIQOAGA0QgcAMBqhAwxy8OB+LVu2QM3NB+weBUgYhA4wyObNlTpx4oSefnqD3aMACYPQAYY4eHC/Wlq+kCS1tHzBUR3wNUIHGGLz5sqINUd1wDmEDjDE+aO5i62BVEXoAEPk5ORGXQOpitABhli8eFnEesmS5TZNAiQWQgcYoqur8/+tu2yaBEgshA4wxJNPPh6xXr++wqZJgMRC6ABDnD59OuoaSFWEDgBgNEIHADAaoQMAGI3QAYbgc3TAhRE6wBB8jg64MEIHGGL37g8i1nv27LZpEiCxEDrAEK+99j8R61deedmeQYAEQ+gAAEYjdAAAoxE6wBA/+cmMiPX06TPtGQRIMIQOMMTPfz43Yj1r1i9smgRILIQOMMTu3e9HrD/66IOL3BNILYQOMMQzz2yKWG/evNGmSYDEQugAQwQCZ6OugVRF6ABDOJ3OqGsgVfEnATCGI3LlcFzkfkBqIXSAIYLBQMQ6EAhc5J5AaiF0AACjEToAgNEsCV15ebny8/M1atQoffrpp+Hbm5qaNGfOHBUUFGjOnDnav39/THsAAMTKktBNnTpV27dvV25u5BdBlpaWat68eaqtrdW8efNUUlIS0x6Abxs0KDvqGkhVloQuLy9PHo8n4ra2tjY1NjaqsLBQklRYWKjGxka1t7dH3QNwYcuWPRCxXr783+0ZBEgwvex6YL/fr8GDB8vlckmSXC6XsrOz5ff7FQqFLrqXmZl5SY+TldXvis8OJKJBg26Q0+lUMBiU0+nUzTePtXskICHYFjqrtLUdUzAYsnsMIO4OHtyvYDAoSQoGg6qra9CwYdfaPBUQf06nI+pBjW3vuvR4PGptbQ1/1icQCOjQoUPyeDxR9wBcWGXlf0asN2xYZ88gQIKxLXRZWVnyer2qrq6WJFVXV8vr9SozMzPqHoALO3z4UNQ1kKocoVAo7uf1fvvb3+r111/XkSNHdM0118jtduvVV1+Vz+dTcXGxOjs7lZGRofLyco0cOVKSou5dCk5dIlUUFc371m1btz5nwySAtXo6dWlJ6OxE6JAqCB1SVcK+RgcAgBUIHQDAaIQOAGA0QgcAMBqhAwAYjdABAIxG6AAARiN0AACjEToAgNG4MgoS2nvvva13333L7jGSwj/+se9bt40a5bVhkuQyadKPddttP7J7DFwGrowCAEhpHNEBhjh4cL8effSh8HrVqv/g++iQEjiiA1LE8OEjwr/u0+dqIgd8jdABBrn22hFyOBz69a9L7B4FSBiEDjBInz5X67rrRnM0B3wDoQMAGI3QAQCMRugAAEYjdAAAoxE6AIDRCB0AwGiEDgBgNEIHADAa17q02HPPbVNz8wG7x4ChDh4893tr+HA+MI74GDbsWs2bN9/uMSL0dK3LXhbOAknNzQf0j88+l6uP2+5RYKBgwCVJ+rz5iM2TwESBkx12j/CdEDobuPq4lX7tVLvHAIBL0n3gDbtH+E54jQ4AYDRCBwAwGqEDABiN0AEAjEboAABG412XFjt6tEOBkx1J++4lAKkrcLJDR48mXzY4ogMAGC350pzkBgxw63DnWT5HByDpdB94QwMGuO0e45IROhtw6hLxEjx7UpLk7NXH5klgonNXRhlo9xiXLOFD19TUpOLiYnV0dMjtdqu8vFwjRoywe6zvbNgwrkGI+Alf63JY8v1lhGQwMCn/Dkv4izrPnz9fs2fP1h133KGqqiq9+OKL2rZtW8w/n2gXdQbiqbx8tSTpwQcfsXkSwDo9XdQ5oUPX1tamgoIC7d69Wy6XS4FAQBMmTNDrr7+uzMzMGP8bhC6Zvffe23r33bfsHiNp8O0Fl27SpB/rttt+ZPcYuAxJ/e0Ffr9fgwcPlst17orsLpdL2dnZ8vv9MYcu2pNH4svIuFpXXeWye4ykkZV17s8F/89il5FxtQYN6m/3GIijhA7dlcARXXK74YZxuuGGcXaPAcMdPtxl9wi4DD0d0SX05+g8Ho9aW1sVCAQkSYFAQIcOHZLH47F5MgBAskjo0GVlZcnr9aq6ulqSVF1dLa/XG/NpSwAAEvrNKJLk8/lUXFyszs5OZWRkqLy8XCNHjoz55zl1CQBmS+p3XV4JhA4AzJbUr9EBAHC5CB0AwGiEDgBgNEIHADCa8R8Ydzoddo8AAIijnv6eN/5dlwCA1MapSwCA0QgdAMBohA4AYDRCBwAwGqEDABiN0AEAjEboAABGI3QAAKMROgCA0QgdAMBohA4AYDRCBwAwGqEDABiN0AEAjEboAABGM/6LV4Fkl5+fryNHjsjlcoVvq6mp0eDBg22cCkgehA5IAps2bdKtt956yT8XCoUUCoXkdHLyBqmL3/1Akjl69KiWLFmiiRMnaty4cVqyZIm+/PLL8P7dd9+tdevWae7cubrpppvU3Nwsn8+nX/7ylxo/frwKCgq0c+dOG58BYC1CBySZYDCoO++8U2+++abefPNNpaWlqaysLOI+VVVVWr16terq6pSZmamioiIVFhbq/fff17p167Rq1Sp9/vnnNj0DwFqcugSSwLJly8Kv0Y0fP14bN24M7917772aP39+xP1nzZql73//+5Kkd955R7m5uZo9e7Yk6frrr1dBQYFqamq0fPlyi54BYB9CBySBysrK8Gt0J06cUElJid555x0dPXpUknT8+HEFAoFwDD0eT/hnv/jiC+3du1d5eXnh2wKBgGbMmGHhMwDsQ+iAJLN161Y1NTXphRde0KBBg7Rv3z7NnDlToVAofB+HwxH+tcfj0bhx4/Tss8/aMS5gO16jA5LM8ePHlZaWpoyMDHV0dGjDhg1R7z9lyhTt379fL7/8ss6cOaMzZ85o79698vl8Fk0M2IvQAUnmnnvu0alTpzRx4kTNmTNHkydPjnr/fv36acuWLdq5c6cmT56sSZMmqaKiQqdPn7ZoYsBejtA3z3cAAGAYjugAAEYjdAAAoxE6AIDRCB0AwGiEDgBgNEIHADAaoQMAGI3QAQCM9r+CEzw/sD2L+AAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(15,10))\n",
"\n",
"plt.subplot(2, 2, 1)\n",
"ax = sns.boxplot(y=df_train[\"Fare\"])\n",
"ax.set_xlabel(\"Fare\")\n",
"sns.set(style=\"darkgrid\")"
]
},
{
"cell_type": "code",
"execution_count": 718,
"id": "loved-mountain",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Outliers for the Fare are < -61.358399999999996 or > 100.2688\n"
]
}
],
"source": [
"# calculating outlier space for stays_in_weekend_nights\n",
"\n",
"IQR = df_train[\"Fare\"].quantile(0.75) - df_train[\"Fare\"].quantile(0.25)\n",
"lf = df_train[\"Fare\"].quantile(0.25) - (IQR * 3)\n",
"uf = df_train[\"Fare\"].quantile(0.75) + (IQR * 3)\n",
"print('Outliers for the Fare are < {lbound} or > {ubound}'.format(\n",
" lbound=lf, \n",
" ubound=uf)\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "burning-fever",
"metadata": {},
"source": [
"## Baseline model"
]
},
{
"cell_type": "code",
"execution_count": 719,
"id": "existing-young",
"metadata": {},
"outputs": [],
"source": [
"# Remove the target variable\n",
"X = df_train.drop([\"Survived\"], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 720,
"id": "fuzzy-favor",
"metadata": {},
"outputs": [],
"source": [
"# Get the target variable\n",
"y = df_train[\"Survived\"]"
]
},
{
"cell_type": "markdown",
"id": "acknowledged-promotion",
"metadata": {},
"source": [
"On va utiliser Fare, SibSp et Parch car ce sont des valeurs numériques continues et utiliser la Logistic regression. Validation avec la cross-validation"
]
},
{
"cell_type": "code",
"execution_count": 721,
"id": "domestic-blocking",
"metadata": {},
"outputs": [],
"source": [
"base = ['Fare', 'SibSp', 'Parch']"
]
},
{
"cell_type": "code",
"execution_count": 722,
"id": "occasional-candy",
"metadata": {},
"outputs": [],
"source": [
"# isolate the target and filter the features we want to use\n",
"def baseline_model(X):\n",
" target = X[\"Survived\"]\n",
" X = X[base]\n",
" return X, target"
]
},
{
"cell_type": "code",
"execution_count": 723,
"id": "continuous-cocktail",
"metadata": {},
"outputs": [],
"source": [
"X, y = baseline_model(df_train.copy())"
]
},
{
"cell_type": "markdown",
"id": "brown-executive",
"metadata": {},
"source": [
" works by taking an estimator (machine learning model) along with data and labels. It then evaluates the machine learning model on the data and labels using cross-validation and a defined scoring parameter."
]
},
{
"cell_type": "code",
"execution_count": 724,
"id": "complete-rescue",
"metadata": {},
"outputs": [],
"source": [
"# cross_validation\n",
"def compute_score(clf, X, y):\n",
" from sklearn.model_selection import cross_val_score\n",
" xval = cross_val_score(clf, X, y, cv=5)\n",
" return np.mean(xval)"
]
},
{
"cell_type": "code",
"execution_count": 725,
"id": "postal-brush",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6746092524009792\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"lr_model = LogisticRegression(max_iter=1000)\n",
"lr_model.fit(X, y)\n",
"\n",
"print(compute_score(lr_model, X, y))"
]
},
{
"cell_type": "markdown",
"id": "corrected-vessel",
"metadata": {},
"source": [
"Nous avons un premier benchmark de 67% qui bat la null accuracy de 61.6%. "
]
},
{
"cell_type": "code",
"execution_count": 726,
"id": "thorough-subject",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.7374049 , 0.2625951 ],\n",
" [0.49607529, 0.50392471],\n",
" [0.68499508, 0.31500492],\n",
" ...,\n",
" [0.61158623, 0.38841377],\n",
" [0.60239549, 0.39760451],\n",
" [0.6856129 , 0.3143871 ]])"
]
},
"execution_count": 726,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# make predictions on training data \n",
"lr_predicted_labels = lr_model.predict_proba(X)\n",
"lr_predicted_labels"
]
},
{
"cell_type": "markdown",
"id": "complicated-milton",
"metadata": {},
"source": [
"## 4. Feature engineering"
]
},
{
"cell_type": "markdown",
"id": "spatial-amazon",
"metadata": {},
"source": [
"### Feature Improvement"
]
},
{
"cell_type": "markdown",
"id": "meaning-organizer",
"metadata": {},
"source": [
"Using mathematical formulas to augment the predictiveness of a particular feature"
]
},
{
"cell_type": "markdown",
"id": "imported-sunrise",
"metadata": {},
"source": [
"### Imputing"
]
},
{
"cell_type": "code",
"execution_count": 729,
"id": "threatened-whole",
"metadata": {},
"outputs": [],
"source": [
"# Imputing Missing Quantitative Data\n",
"from sklearn.impute import SimpleImputer # sklearn class to impute missing data\n",
"\n",
"# could be mean or median for numerical values\n",
"numerical_imputer = SimpleImputer(strategy='median')\n",
"\n",
"df_train['Age'] = numerical_imputer.fit_transform(df_train[['Age']])"
]
},
{
"cell_type": "code",
"execution_count": 730,
"id": "early-plaza",
"metadata": {},
"outputs": [],
"source": [
"# Imputing Missing Qualititative Data\n",
"categorical_imputer = SimpleImputer(strategy='most_frequent')\n",
"\n",
"df_train['Embarked'] = categorical_imputer.fit_transform(df_train[['Embarked']])"
]
},
{
"cell_type": "markdown",
"id": "failing-origin",
"metadata": {},
"source": [
"### Normalize"
]
},
{
"cell_type": "code",
"execution_count": 731,
"id": "martial-tablet",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>min</th>\n",
" <th>25%</th>\n",
" <th>50%</th>\n",
" <th>75%</th>\n",
" <th>max</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <td>891.0</td>\n",
" <td>0.500000</td>\n",
" <td>0.289162</td>\n",
" <td>0.0</td>\n",
" <td>0.250000</td>\n",
" <td>0.500000</td>\n",
" <td>0.750000</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Survived</th>\n",
" <td>891.0</td>\n",
" <td>0.383838</td>\n",
" <td>0.486592</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <td>891.0</td>\n",
" <td>0.654321</td>\n",
" <td>0.418036</td>\n",
" <td>0.0</td>\n",
" <td>0.500000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Age</th>\n",
" <td>891.0</td>\n",
" <td>0.363679</td>\n",
" <td>0.163605</td>\n",
" <td>0.0</td>\n",
" <td>0.271174</td>\n",
" <td>0.346569</td>\n",
" <td>0.434531</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SibSp</th>\n",
" <td>891.0</td>\n",
" <td>0.065376</td>\n",
" <td>0.137843</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.125000</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Parch</th>\n",
" <td>891.0</td>\n",
" <td>0.063599</td>\n",
" <td>0.134343</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fare</th>\n",
" <td>891.0</td>\n",
" <td>0.062858</td>\n",
" <td>0.096995</td>\n",
" <td>0.0</td>\n",
" <td>0.015440</td>\n",
" <td>0.028213</td>\n",
" <td>0.060508</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count mean std min 25% 50% 75% max\n",
"PassengerId 891.0 0.500000 0.289162 0.0 0.250000 0.500000 0.750000 1.0\n",
"Survived 891.0 0.383838 0.486592 0.0 0.000000 0.000000 1.000000 1.0\n",
"Pclass 891.0 0.654321 0.418036 0.0 0.500000 1.000000 1.000000 1.0\n",
"Age 891.0 0.363679 0.163605 0.0 0.271174 0.346569 0.434531 1.0\n",
"SibSp 891.0 0.065376 0.137843 0.0 0.000000 0.000000 0.125000 1.0\n",
"Parch 891.0 0.063599 0.134343 0.0 0.000000 0.000000 0.000000 1.0\n",
"Fare 891.0 0.062858 0.096995 0.0 0.015440 0.028213 0.060508 1.0"
]
},
"execution_count": 731,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"numeric_types = ['float16', 'float32', 'float64', 'int16', 'int32', 'int64'] # the numeric types in Pandas\n",
"\n",
"numerical_columns = df_train.select_dtypes(include=numeric_types).columns.tolist()\n",
"\n",
"pd.DataFrame( \n",
" MinMaxScaler().fit_transform(df_train[numerical_columns]),\n",
" columns=numerical_columns\n",
").describe().T"
]
},
{
"cell_type": "markdown",
"id": "superior-employment",
"metadata": {},
"source": [
"### Feature Selection"
]
},
{
"cell_type": "markdown",
"id": "personal-stake",
"metadata": {},
"source": [
"Eliminating a subset of existing features to isolate the most useful subset of features(Hypothesis testing, Recursive feature elimination)"
]
},
{
"cell_type": "markdown",
"id": "initial-millennium",
"metadata": {},
"source": [
"### Feature extraction"
]
},
{
"cell_type": "markdown",
"id": "comparative-brown",
"metadata": {},
"source": [
"Applying parametric mathematical transformations to a subset of features to create a new set of features(Principal component analysis, Singular value decomposition)"
]
},
{
"cell_type": "markdown",
"id": "italic-label",
"metadata": {},
"source": [
"### Feature Construction"
]
},
{
"cell_type": "markdown",
"id": "wired-romania",
"metadata": {},
"source": [
"Creating new features from existing features or a new data source(Multiplying/dividing features together, Joining with a new dataset)"
]
},
{
"cell_type": "code",
"execution_count": 732,
"id": "matched-thunder",
"metadata": {},
"outputs": [],
"source": [
"# Adding Family_Size\n",
"df_train['Family_Size'] = df_train['Parch'] + df_train['SibSp']\n",
"df_test['Family_Size'] = df_test['Parch'] + df_test['SibSp']"
]
},
{
"cell_type": "markdown",
"id": "endless-platinum",
"metadata": {},
"source": [
"### Binning"
]
},
{
"cell_type": "markdown",
"id": "regular-tunisia",
"metadata": {},
"source": [
"\n",
"\n",
" Binning refers to the act of creating a new categorical (usually ordinal) feature from a numerical or categorical feature. The most common way to bin data is to group numerical data into bins based on threshold cutoffs, similar to how a histogram is created.\n",
"\n",
" The main goal of binning is to decrease our model’s chance of overfitting the data. Usually, this will come at the cost of performance, as we are losing granularity in the feature that we are binning."
]
},
{
"cell_type": "code",
"execution_count": 733,
"id": "cleared-huntington",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Family_Size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female 28.0 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Family_Size \n",
"0 0 A/5 21171 7.2500 NaN S 1 \n",
"1 0 PC 17599 71.2833 C85 C 1 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 \n",
"3 0 113803 53.1000 C123 S 1 \n",
"4 0 373450 8.0500 NaN S 0 \n",
".. ... ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S 0 \n",
"887 0 112053 30.0000 B42 S 0 \n",
"888 2 W./C. 6607 23.4500 NaN S 3 \n",
"889 0 111369 30.0000 C148 C 0 \n",
"890 0 370376 7.7500 NaN Q 0 \n",
"\n",
"[891 rows x 13 columns]"
]
},
"execution_count": 733,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train"
]
},
{
"cell_type": "code",
"execution_count": 734,
"id": "separate-consent",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:title={'center':'Age (Uniform Binning)'}, ylabel='Frequency'>"
]
},
"execution_count": 734,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.preprocessing import KBinsDiscretizer # we will use this module for binning our data\n",
"\n",
"# uniform will create bins of equal width\n",
"binner = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')\n",
"binned_age_data = binner.fit_transform(df_train[['Age']].dropna())\n",
"pd.Series(binned_age_data.reshape(-1,)).plot(\n",
" title='Age (Uniform Binning)', kind='hist', xlabel='Age'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 743,
"id": "sized-album",
"metadata": {},
"outputs": [],
"source": [
"df_train['Age'] = pd.Series(binned_age_data.reshape(-1,))\n",
"df_test['Age'] = pd.Series(binned_age_data.reshape(-1,))"
]
},
{
"cell_type": "code",
"execution_count": 744,
"id": "favorite-marine",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Family_Size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>3.0</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>3.0</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>1</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>1.0</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>2.0</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris 0 2.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 3.0 1 \n",
"2 Heikkinen, Miss. Laina 1 2.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 3.0 1 \n",
"4 Allen, Mr. William Henry 0 3.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas 0 2.0 0 \n",
"887 Graham, Miss. Margaret Edith 1 1.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" 1 2.0 1 \n",
"889 Behr, Mr. Karl Howell 0 2.0 0 \n",
"890 Dooley, Mr. Patrick 0 3.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Family_Size \n",
"0 0 A/5 21171 2.0 NaN S 1 \n",
"1 0 PC 17599 3.0 C85 C 1 \n",
"2 0 STON/O2. 3101282 2.0 NaN S 0 \n",
"3 0 113803 3.0 C123 S 1 \n",
"4 0 373450 3.0 NaN S 0 \n",
".. ... ... ... ... ... ... \n",
"886 0 211536 2.0 NaN S 0 \n",
"887 0 112053 1.0 B42 S 0 \n",
"888 2 W./C. 6607 2.0 NaN S 3 \n",
"889 0 111369 2.0 C148 C 0 \n",
"890 0 370376 3.0 NaN Q 0 \n",
"\n",
"[891 rows x 13 columns]"
]
},
"execution_count": 744,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train"
]
},
{
"cell_type": "code",
"execution_count": 798,
"id": "convinced-bachelor",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:title={'center':'Fare (Uniform Binning)'}, ylabel='Frequency'>"
]
},
"execution_count": 798,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.preprocessing import KBinsDiscretizer # we will use this module for binning our data\n",
"\n",
"# uniform will create bins of equal width\n",
"binner = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')\n",
"binned_fare_data = binner.fit_transform(df_train[['Fare']].dropna())\n",
"pd.Series(binned_fare_data.reshape(-1,)).plot(\n",
" title='Fare (Uniform Binning)', kind='hist', xlabel='Fare'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 746,
"id": "behind-disclaimer",
"metadata": {},
"outputs": [],
"source": [
"df_train['Fare'] = pd.Series(binned_age_data.reshape(-1,))\n",
"df_test['Fare'] = pd.Series(binned_age_data.reshape(-1,))"
]
},
{
"cell_type": "code",
"execution_count": 747,
"id": "premier-undergraduate",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Family_Size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>3.0</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>3.0</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>1</td>\n",
" <td>1.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>1.0</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>2.0</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris 0 2.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 3.0 1 \n",
"2 Heikkinen, Miss. Laina 1 2.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 3.0 1 \n",
"4 Allen, Mr. William Henry 0 3.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas 0 2.0 0 \n",
"887 Graham, Miss. Margaret Edith 1 1.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" 1 2.0 1 \n",
"889 Behr, Mr. Karl Howell 0 2.0 0 \n",
"890 Dooley, Mr. Patrick 0 3.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Family_Size \n",
"0 0 A/5 21171 2.0 NaN S 1 \n",
"1 0 PC 17599 3.0 C85 C 1 \n",
"2 0 STON/O2. 3101282 2.0 NaN S 0 \n",
"3 0 113803 3.0 C123 S 1 \n",
"4 0 373450 3.0 NaN S 0 \n",
".. ... ... ... ... ... ... \n",
"886 0 211536 2.0 NaN S 0 \n",
"887 0 112053 1.0 B42 S 0 \n",
"888 2 W./C. 6607 2.0 NaN S 3 \n",
"889 0 111369 2.0 C148 C 0 \n",
"890 0 370376 3.0 NaN Q 0 \n",
"\n",
"[891 rows x 13 columns]"
]
},
"execution_count": 747,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train"
]
},
{
"cell_type": "markdown",
"id": "cardiac-blackberry",
"metadata": {},
"source": [
"### One-hot encodings"
]
},
{
"cell_type": "markdown",
"id": "pregnant-december",
"metadata": {},
"source": [
"Our goal is to transform a feature on the nominal level and create a one-hot encoding matrix, where each feature represents a distinct category, and the value is either 1 or 0, representing the presence of that value in the original observation."
]
},
{
"cell_type": "code",
"execution_count": 748,
"id": "diagnostic-honey",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Family_Size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>3.0</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>3.0</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp Parch \\\n",
"0 Braund, Mr. Owen Harris 0 2.0 1 0 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 3.0 1 0 \n",
"2 Heikkinen, Miss. Laina 1 2.0 0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 3.0 1 0 \n",
"4 Allen, Mr. William Henry 0 3.0 0 0 \n",
"\n",
" Ticket Fare Cabin Embarked Family_Size \n",
"0 A/5 21171 2.0 NaN S 1 \n",
"1 PC 17599 3.0 C85 C 1 \n",
"2 STON/O2. 3101282 2.0 NaN S 0 \n",
"3 113803 3.0 C123 S 1 \n",
"4 373450 3.0 NaN S 0 "
]
},
"execution_count": 748,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 749,
"id": "confidential-airplane",
"metadata": {},
"outputs": [],
"source": [
"df_train['Sex'].replace(['male','female'],[0,1],inplace=True)\n",
"df_test['Sex'].replace(['male','female'],[0,1],inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "motivated-chorus",
"metadata": {},
"source": [
"The Cabin column is missing too many values to be useful (687). We have to drop it."
]
},
{
"cell_type": "markdown",
"id": "employed-viewer",
"metadata": {},
"source": [
"Drop less useful features"
]
},
{
"cell_type": "code",
"execution_count": 750,
"id": "appreciated-present",
"metadata": {},
"outputs": [],
"source": [
"df_train = df_train.drop(['PassengerId', 'Cabin', 'Embarked', 'Ticket', 'Name', 'Parch', 'SibSp'], axis=1)\n",
"df_test = df_test.drop(['PassengerId', 'Cabin', 'Embarked', 'Ticket', 'Name', 'Parch', 'SibSp'], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 751,
"id": "angry-watson",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Fare</th>\n",
" <th>Family_Size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Age Fare Family_Size\n",
"0 0 3 0 2.0 2.0 1\n",
"1 1 1 1 3.0 3.0 1\n",
"2 1 3 1 2.0 2.0 0\n",
"3 1 1 1 3.0 3.0 1\n",
"4 0 3 0 3.0 3.0 0"
]
},
"execution_count": 751,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 752,
"id": "south-summer",
"metadata": {},
"outputs": [],
"source": [
"def update_model(X):\n",
" target = X[\"Survived\"]\n",
" X = X.drop([\"Survived\"], axis=1)\n",
" return X, target"
]
},
{
"cell_type": "code",
"execution_count": 753,
"id": "editorial-decimal",
"metadata": {},
"outputs": [],
"source": [
"X, y = update_model(df_train.copy())"
]
},
{
"cell_type": "code",
"execution_count": 754,
"id": "broke-eight",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7901136149645346\n"
]
}
],
"source": [
"lr_model.fit(X, y)\n",
"\n",
"print(compute_score(lr_model, X, y))"
]
},
{
"cell_type": "markdown",
"id": "retired-genome",
"metadata": {},
"source": [
"Sex male and sex female are among the most important features."
]
},
{
"cell_type": "markdown",
"id": "compound-magic",
"metadata": {},
"source": [
"**Poids des variables**"
]
},
{
"cell_type": "code",
"execution_count": 755,
"id": "floral-atlanta",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Coefficients</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <td>-1.117657</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Family_Size</th>\n",
" <td>-0.189451</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Age</th>\n",
" <td>-0.168342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fare</th>\n",
" <td>-0.168342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <td>2.663014</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Coefficients\n",
"Pclass -1.117657\n",
"Family_Size -0.189451\n",
"Age -0.168342\n",
"Fare -0.168342\n",
"Sex 2.663014"
]
},
"execution_count": 755,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_names = lr_model.feature_names_in_\n",
"feature_names\n",
"\n",
"coefficients = pd.DataFrame(\n",
" lr_model.coef_[0],\n",
" columns=[\"Coefficients\"],\n",
" index=feature_names,\n",
")\n",
"\n",
"coefficients.sort_values(by=['Coefficients'])"
]
},
{
"cell_type": "code",
"execution_count": 756,
"id": "friendly-education",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAExCAYAAACakx5RAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAed0lEQVR4nO3deXBUZb7G8ae7CV4hQUiqyXRCShAcQMFBhxHUkluGIIuEBMdAWEYHZRNKRsCS4Ch7qaAgqyxuaKkYEWJMiBARBy1wcGZwNAIiUrJmATosCSCQ7tw/KDrkJkJIDjlN3u+nyqru9EmfX/9o85zzvmdxlJaWlgoAYCSn3QUAAOxDCACAwQgBADAYIQAABiMEAMBghAAAGIwQAACD1bO7gCt19OhJ+f32ntoQEREqr7fY1hqCBb0oQy/K0IsydvfC6XSoSZOGv/n6NRcCfn+p7SFwoQ6cRy/K0Isy9KJMMPeC4SAAMBghAAAGIwQAwGCEAAAYjBAAAIMRAgBgMEIAAAx2zZ0nAOuENbpe/3Ndzb8CbndYjX7/1zMlKjpxusZ1ALhyhIDB/ue6eoofn253GcqYnaAiu4sADMVwEAAYjBAAAIMRAgBgMMvmBI4ePaqnn35a+/btU/369XXjjTdq2rRpCg8PL7dcSkqKNm/erCZNmkiSevTooccff9yqMgAAV8CyEHA4HBo6dKg6deokSZo5c6ZefvllPf/88xWWHT58uAYPHmzVqgEA1WTZcFDjxo0DASBJHTp0UG5urlVvDwC4Cq7KIaJ+v18rVqxQbGxspa+/9dZbSk1NVUxMjMaPH6+WLVtW+b0jIkKtKrNGanpsPMqrK/2sK5/DCvSiTDD34qqEwPTp09WgQYNKh3zGjh0rt9stp9Opjz/+WEOHDtX69evlcrmq9N5eb7HtN2hwu8N0+PC1f2R7MH0x60o/68LnsAK9KGN3L5xOxyU3ni0/OmjmzJnau3ev5s6dK6ez4ttHRkYGfp6YmKhTp04pPz/f6jIAAFVgaQjMmTNHP/zwgxYtWqT69etXukxBQUHg8VdffSWn06nIyEgrywAAVJFlw0G7du3S0qVL1bx5cyUnJ0uSmjVrpkWLFikhIUHLli1TZGSkJkyYIK/XK4fDodDQUC1evFj16nH1CgCwg2V/fW+++Wbt3Lmz0tfS08uuT7N8+XKrVgkAqCHOGAYAgxECAGAwQgAADEYIAIDBCAEAMBghAAAGIwQAwGCEAAAYjBAAAIMRAgBgMEIAAAxGCACAwQgBADAYIQAABiMEAMBghAAAGIwQAACDWXJnsaNHj+rpp5/Wvn37VL9+fd14442aNm2awsPDyy13+vRpTZw4Udu2bZPL5dKECRN03333WVECAKAaLNkTcDgcGjp0qNatW6eMjAzFxMTo5ZdfrrDcG2+8odDQUH322WdasmSJnn32WZ08edKKEgAA1WBJCDRu3FidOnUKPO/QoYNyc3MrLPfpp5+qf//+kqTmzZurXbt2+vLLL60oAQBQDZbPCfj9fq1YsUKxsbEVXsvNzVV0dHTgucfjUX5+vtUlAACqyJI5gYtNnz5dDRo00ODBg61+a0lSREToVXnfK+V2h9ldQp1SV/pZVz6HFehFmWDuhaUhMHPmTO3du1dLliyR01lxJyMqKkoHDx4MTBjn5eWVG0aqCq+3WH5/qSX1VpfbHabDh4tsrcEKwfTFrCv9rAufwwr0oozdvXA6HZfceLZsOGjOnDn64YcftGjRItWvX7/SZXr06KHU1FRJ0p49e5STk6N7773XqhIAAFfIkhDYtWuXli5dqkOHDik5OVkJCQkaPXq0JCkhIUEFBQWSpMcee0wnTpxQt27dNGLECE2bNk2hocExvAMAJrJkOOjmm2/Wzp07K30tPT098LhBgwaaP3++FasEAFiAM4YBwGCEAAAYjBAAAIMRAgBgMEIAAAxGCACAwQgBADAYIQAABiMEAMBghAAAGIwQAACDEQIAYDBCAAAMRggAgMEIAQAwGCEAAAYjBADAYJaFwMyZMxUbG6vWrVvrp59+qnSZBQsW6K677lJCQoISEhI0depUq1YPAKgGS24vKUldu3bVww8/rEGDBl1yucTERE2YMMGq1QIAasCyEOjYsaNVbwUAqCW1PiewZs0axcfH69FHH9W3335b26sHAFzEsj2BqkhOTtbIkSMVEhKiTZs2adSoUcrKylKTJk2q/B4REaFXscKqc7vD7C6hTqkr/awrn8MK9KJMMPeiVkPA7XYHHt9zzz3yeDzatWuX7rzzziq/h9dbLL+/9GqUV2Vud5gOHy6ytQYrBNMXs670sy58DivQizJ298LpdFxy47lWh4MKCgoCj3fs2KGDBw+qRYsWtVkCAOAilu0JzJgxQ9nZ2Tpy5IiGDBmixo0ba82aNRo2bJjGjBmj9u3ba86cOdq2bZucTqdCQkI0a9ascnsHAIDa5SgtLbV3bOUKMRxkHbc7TPHj0+0uQxmzE+pMP+vC57ACvShjdy+CajgIABBcCAEAMBghAAAGIwQAwGCEAAAYjBAAAIMRAgBgMEIAAAxGCACAwQgBADAYIQAABiMEAMBghAAAGIwQAACDEQIAYDBCAAAMRggAgMEsC4GZM2cqNjZWrVu31k8//VTpMj6fT1OnTlVcXJy6deumlStXWrV6AEA1WBYCXbt21Xvvvafo6OjfXCYjI0P79u1Tdna2UlNTtWDBAh04cMCqEgAAV8iyEOjYsaM8Hs8ll8nKylJSUpKcTqfCw8MVFxentWvXWlUCAOAK1eqcQF5enqKiogLPPR6P8vPza7MEAMBF6tldwJWKiAi1uwRJktsdZncJdUpd6Wdd+RxWoBdlgrkXtRoCHo9Hubm5uu222yRV3DOoCq+3WH5/6dUor8rc7jAdPlxkaw1WCKYvZl3pZ134HFagF2Xs7oXT6bjkxnOtDgf16NFDK1eulN/vV2FhodavX6/u3bvXZgkAgItYFgIzZsxQly5dlJ+fryFDhuiBBx6QJA0bNkw5OTmSpISEBDVr1kz333+/+vXrp9GjRysmJsaqEgAAV8hRWlpq79jKFWI4yDpud5jix6fbXYYyZifUmX7Whc9hBXpRxu5eBNVwEAAguBACAGAwQgAADEYIAIDBCAEAMBghAAAGIwQAwGCEAAAYjBAAAIMRAgBgMEIAAAxGCACAwQgBADAYIQAABiMEAMBghAAAGIwQAACDWXaj+V9++UUpKSk6duyYGjdurJkzZ6p58+blllmwYIHef/99NW3aVJJ0xx13aPLkyVaVAAC4QpaFwOTJkzVw4EAlJCQoPT1dkyZN0jvvvFNhucTERE2YMMGq1QIAasCS4SCv16vt27erd+/ekqTevXtr+/btKiwstOLtAQBXiSUhkJeXp8jISLlcLkmSy+VS06ZNlZeXV2HZNWvWKD4+Xo8++qi+/fZbK1YPAKgmy4aDqiI5OVkjR45USEiINm3apFGjRikrK0tNmjSp8ntERIRexQqrzu0Os7uEOqWu9LOufA4r0IsywdwLS0LA4/GooKBAPp9PLpdLPp9Phw4dksfjKbec2+0OPL7nnnvk8Xi0a9cu3XnnnVVel9dbLL+/1Iqyq83tDtPhw0W21mCFYPpi1pV+1oXPYQV6UcbuXjidjktuPFsyHBQREaG2bdsqMzNTkpSZmam2bdsqPDy83HIFBQWBxzt27NDBgwfVokULK0oAAFSDZcNBU6ZMUUpKil599VU1atRIM2fOlCQNGzZMY8aMUfv27TVnzhxt27ZNTqdTISEhmjVrVrm9AwBA7bIsBFq2bKmVK1dW+Plrr70WeHwhGAAAwYEzhgHAYIQAABiMEAAAgxECAGAwQgAADEYIAIDBCAEAMBghAAAGIwQAwGCEAAAYjBAAAIMRAgBgMEIAAAxGCACAwQgBADAYIQAABiMEAMBgloXAL7/8ov79+6t79+7q37+/9uzZU2EZn8+nqVOnKi4uTt26dav0TmQAgNpjWQhMnjxZAwcO1Lp16zRw4EBNmjSpwjIZGRnat2+fsrOzlZqaqgULFujAgQNWlQAAuEKWhIDX69X27dvVu3dvSVLv3r21fft2FRYWllsuKytLSUlJcjqdCg8PV1xcnNauXWtFCQCAarAkBPLy8hQZGSmXyyVJcrlcatq0qfLy8iosFxUVFXju8XiUn59vRQkAgGqoZ3cBVyoiIrRGv3/2nE/1Q1w1rsPtDguKOmpaQ8bsBFtruFBHTftpRQ12fy+C4TthZR30okww98KSEPB4PCooKJDP55PL5ZLP59OhQ4fk8XgqLJebm6vbbrtNUsU9g6rweovl95dWu1a3O0zx49Or/ftWyZidoMOHi+wuo8bc7rA68zns/l4Ey3eCXpSpC71wOh2X3Hi2ZDgoIiJCbdu2VWZmpiQpMzNTbdu2VXh4eLnlevTooZUrV8rv96uwsFDr169X9+7drSgBAFANlh0dNGXKFL377rvq3r273n33XU2dOlWSNGzYMOXk5EiSEhIS1KxZM91///3q16+fRo8erZiYGKtKAABcIcvmBFq2bFnpcf+vvfZa4LHL5QqEAwDAfpwxDAAGIwQAwGCEAAAYjBAAAINdcyeLAVfDr2dKbD9x7tczJbau/wJ6YRZCAJBUdOK0anpqUl05cY5elDEhEAkBAPgNJgQicwIAYDBCAAAMRggAgMEIAQAwGCEAAAYjBADAYIQAABiMEAAAgxECAGAwQgAADFbjy0acPn1aEydO1LZt2+RyuTRhwgTdd999FZbbsmWLhg8frubNm0uS6tevX+mdyAAAtafGIfDGG28oNDRUn332mfbs2aNBgwYpOztbDRs2rLBsy5YttXr16pquEgBgkRoPB3366afq37+/JKl58+Zq166dvvzyyxoXBgC4+mocArm5uYqOjg4893g8ys/Pr3TZPXv2qG/fvkpKSlJaWlpNVw0AqKHLDgf17dtXubm5lb62efPmKq/o1ltv1caNGxUWFqb9+/dryJAhioyM1N133131aiVFRIRe0fLBzO0Os7sES9SVz2EFelGGXpQJ5l5cNgQut8UeFRWlgwcPKjw8XJKUl5enTp06VVguNLTsj3dMTIzi4uK0devWKw4Br7dYfn/pFf3OxYLpHyOYrzFeVcF+rfTaRC/K0IsydvfC6XRccuO5xsNBPXr0UGpqqqTzwz05OTm69957Kyx36NAhlZae/+N97Ngxbdq0SW3atKnp6gEANVDjo4Mee+wxpaSkqFu3bnI6nZo2bVpgq3/evHlq2rSpBgwYoOzsbK1YsUL16tWTz+dTYmKi4uLiavwBAADV5yi9sHl+jbBiOCh+fLqFFVVPxuyEOrG7bPeubjChF2XoRRm7e3HVh4MAANcuQgAADEYIAIDBCAEAMBghAAAGIwQAwGCEAAAYjBAAAIMRAgBgMEIAAAxGCACAwQgBADAYIQAABiMEAMBghAAAGIwQAACDEQIAYDBCAAAMVuMQSE9PV3x8vG655Ra9++67l1z2ww8/VLdu3RQXF6dp06bJ7/fXdPUAgBqocQi0bdtWr7zyinr37n3J5fbv36+FCxcqNTVV2dnZ2rt3rz755JOarh4AUAM1DoHf//73atWqlZzOS7/VunXrFBcXp/DwcDmdTiUlJSkrK6umqwcA1EC92lpRXl6eoqKiAs+joqKUl5d3xe8TERFqZVm2crvD7C7BEnXlc1iBXpShF2WCuReXDYG+ffsqNze30tc2b94sl8tleVGX4vUWy+8vrfbvB9M/xuHDRXaXUGNud1id+BxWoBdl6EUZu3vhdDouufF82RBIS0uzpBCPx1MuTHJzc+XxeCx5bwBA9dTaIaLdu3fX+vXrVVhYKL/fr5UrV6pnz561tXoAQCVqHAKZmZnq0qWL1q5dq3nz5qlLly76+eefJUnz5s3TihUrJEkxMTEaNWqU+vXrp/vvv1/NmjVTnz59arp6AEANOEpLS6s/wG4DK+YE4senW1hR9WTMTqgTY6Z2j3cGE3pRhl6UsbsXl5sT4IxhADBYrR0iGix+PVOijNkJdpehX8+U2F0CAJgXAkUnTqumO2Z2794BgFUYDgIAgxECAGAwQgAADEYIAIDBCAEAMBghAAAGu+YOEXU6HXaXICl46ggG9KIMvShDL8rY2YvLrfuau2wEAMA6DAcBgMEIAQAwGCEAAAYjBADAYIQAABiMEAAAgxECAGAwQgAADEYIAIDBCAEAMBghAAAGIwRwxb788ssKP0tNTbWhEgA1RQhU0S+//KIzZ85Ikr766istW7ZMx48ft7kqe7z00kuaPXu2/H6/Tp06pXHjxmnNmjV2l1XrTp8+rVdeeUXjx4+XJO3evVvr16+3uSoEi+LiYm3bts3uMi6LEKiiJ598Uk6nU/v379fkyZO1f/9+TZgwwe6ybPHhhx/K6/Vq4MCBeuihh3TTTTdp+fLldpdV66ZMmSKfz6cff/xRkvS73/1OCxcutLkq+xCKZTZu3KgHHnhATzzxhCQpJydHI0eOtLmqyhECVeR0OhUSEqKNGzdqwIABmj59uvLy8uwuyxbXX3+9brnlFh08eFAnT57U3XffLafTvK/Szp079dRTTykkJESS1LBhQ/n9fpursg+hWGb+/Pn66KOP1KhRI0lS+/bttW/fPpurqpx5/+dW05kzZ3TkyBF98cUX6ty5syTJ1FsxPPHEE/riiy+Unp6u+fPnKyUlRcuWLbO7rFpXv379cs/PnDlj7HdCIhT/P7fbXe75//++BAtCoIoeeeQR9ejRQw0aNFD79u21f/9+hYWF2V2WLW655Ra9/vrrCg8P1x/+8AetXLlS33//vd1l1bqOHTtqyZIlOnv2rLZs2aK//e1vio2Ntbss2xCKZRo2bKgjR47I4Th/V68tW7YE7d8L7ixWTX6/XyUlJUGb7ldbcXGx9u7dq1tvvVXS+b2iC194U5w7d06vv/66NmzYoNLSUsXGxmr48OGqV++au2urJWbNmqVGjRrpk08+0eTJk/XWW2+pdevWGjt2rN2l1brvvvtOU6ZM0YEDB9SmTRvt2bNHixcvVrt27ewurQJCoIqysrLUpUsXhYaGau7cucrJydG4ceMCfwRNsnHjRk2aNEkul0sbNmxQTk6OFi1apCVLlthdGmxEKJZXVFSkrVu3SpJuv/32wPxAsDHzX6caFi9erF69eun777/Xpk2b9PDDD2v69On64IMP7C6t1l2Y9Bo2bJik4J70uppmzZpV4WdhYWHq0KGD7rrrLhsqso/P59PUqVM1Y8YMPf7443aXY7sXX3xRgwcP1v/+7/8Gfvbmm2/q0UcftbGqyjEnUEUXtmY2bdqkpKQkxcfHB84bMNG1Mul1NXm9Xq1bt04+n08+n0/Z2dn66aef9MILL2jx4sV2l1erXC6Xdu7caXcZQSMtLU1DhgwpN1eWkZFhY0W/jRCoIofDoaysLGVlZQW28s6dO2dzVfa4lia9rqZDhw5p9erVmjhxoiZOnKjVq1ersLBQ77//ftD+D381de7cWdOmTdP333+vn3/+OfCfiTwej+bPn6/x48frs88+kxS8RxMyHFRFzz33nF577TU99NBDiomJ0Z49e9SpUye7y7LFU089pWHDhunAgQP6y1/+Epj0Mk1BQYFuuOGGwPNGjRrp8OHDCg0NNXLP6MJZ4//4xz8CP3M4HPr8889tqsg+DodDbdu21TvvvKORI0fqwIEDQXvgBBPDqJaioiJt3LhRktSqVSu1adPG5opq35gxY3TDDTfowQcflHR+CODo0aOaPXu2kpOTtXr1apsrhF0SExP18ccfSzp/JN2YMWO0ZcuWoLyMBCFQRSUlJVq1apV27NhRbi7ghRdesLGq2vXUU09p6NChatOmjY4dO6Y+ffooLCxMR48e1dixY5WUlGR3ibWquLhYCxcu1DfffCNJ6tSpk7p27arbb79dx48fV3h4uM0V2sPr9Zb7fyQqKsrGauxRWFhY7t/f5/Np69at+tOf/mRjVZVjOKiKJk2aJJ/Ppy1btmjAgAHKzMxUx44d7S6rVm3fvj2wxZ+enq5WrVrpzTffVH5+vkaMGGFcCISGhiolJUUFBQVKS0tTWlqaPv/8c2VnZxsZAF9//bVSUlLk9XrldDp17tw5NW7cWF9//bXdpdWa/fv3KyYmRoWFhSosLCz3WpMmTWyq6tIIgSrKyclRRkaG4uPjNWLECA0cOFCjRo2yu6xadd111wUe/+c//1FcXJyk89eICdbxzqulpKREn3/+uVatWqXvvvtOJSUleuONN9ShQwe7S7PNSy+9pOXLl2vs2LFKS0vTRx99pAMHDthdVq2aMWOGli5dquHDh1d4LVjnRwiBKrrwB9Dlcun06dMKCwuT1+u1uarad2Ey9JtvvtGYMWMCPzfpcNnnn39ea9asUevWrdW3b1/Nnz9fvXr1MjoALmjRooVKSkrkcDiUlJSkBx980KgzhpcuXSpJ2rBhg82VVB0hUEU33HCDjh8/rnvvvVfDhg1TkyZNFBkZaXdZtWr48OFKTExUSEiI/vjHP6pVq1aSpP/+979GjfumpqaqQ4cOGj58eOBigqbtCVXmwrk0kZGR2rBhg6Kjo42958YF+/bt04YNGxQTE6OuXbvaXU6lmBiuIp/PJ5fLJb/fr08++UTFxcVKTExUaGio3aXVqsOHD+vIkSNq06ZN4A9fQUGBfD6fMUFw4sQJZWRkaNWqVTp+/LgSExO1atWqcodGmuTFF19USkqKMjMz5XK5FB0drfHjx6uoqEjPPPOM+vTpY3eJteavf/2rUlJS1KZNG+Xn5ys+Pl4dOnTQgQMHlJCQEJT3FCAEgBr48ccftWrVKmVmZuqmm25SfHy8kpOT7S6rVvXt21dpaWkVHpuoV69eysrKkiQtW7ZMP/zwg+bPn68TJ05o0KBBQXkSIcNBl/HnP//5krv6H330US1Wg2DTpk0b/f3vf9fTTz+t9evXa/Xq1caFwMXbkaZvU1588MTWrVsDB080atRILpfLrrIuiRC4DFNvIYkrExISop49e6pnz552l1Lrzp49q927d6u0tLTc4wsuzB2ZICQkRLt27VJERIT+9a9/6dlnnw28FqwHTxACl3HnnXfaXQIQ1H799dfAFWUllXscrIdFXi3jxo3T4MGDderUKfXr10/NmjWTdP7Cky1atLC5usoxJ1BFAwYM0JIlSwLXijl27JhGjx6t9957z+bKAAQTn8+nkydPlrt/wKlTp1RaWqqGDRtKOj+XFCyXWuEqolV06tSpchcLa9y4sU6ePGljRQCCkcvlqnADmQYNGgQCQJImTpxY22X9JkKgivx+v06fPh14fvLkSZWUlNhYEYBrVTANwDAnUEW9e/fWkCFDNGDAAEnSihUrjDr+GYB1gunkQkKgCo4dO6Z77rkncCakJCUnJysxMdHewgCghgiBy8jKytLEiRPVsGFDnT17VgsWLDDu/rEArBVMw0HMCVzG4sWL9cEHH2jz5s1auHChXn31VbtLAhDk8vPzL/n6oEGDaqmSyyMELsPpdKpt27aSzt9DtaioyOaKAAS7hx56SE888cRv3kshmO69QQhcxrlz57R79+7ATbMvnBFp8k20AVzahg0b1LVrV82dO1e9evXSe++9p+LiYrvLqhQni11GbGzsb75m2tmQAK7c1q1bNW7cOJ04cUJ9+/bVqFGjFBERYXdZAYQAAFwFBw8e1AcffKDMzEx17txZSUlJ+uc//6ns7OzATeiDAUcHAYDFRowYoV27dik5OVmrV68O3F/4jjvuCFxqOliwJwAAFlu7dq26desWtJePvhghAAAWufjSMpW5/vrra6mSqiMEAMAiF9929cKfVofDodLSUjkcDu3YscPO8ipFCACAwThPAAAMxtFBAGCRRx55RG+//bY6d+5c7kqhF4aDfusMYjsxHAQAFjl06JCaNm2qgwcPVvp6dHR0LVd0eYQAABiM4SAAsNi///1vzZkzR/v27ZPP52M4CABM0r17dz355JNq166dnM6y42+CcTiIPQEAsFijRo3Us2dPu8uoEvYEAMBib7/9turXr6+ePXvquuuuC/ycM4YBwACZmZl67rnn9Ouvv0oSZwwDgEliY2M1b9483XrrreXmBIIRcwIAYLGmTZuqffv2dpdRJewJAIDF5s6dq3PnzqlXr17l5gRatWplY1WVIwQAwGKV3ZY2WG9HSwgAgMGYEwCAq8Tr9erMmTOB51FRUTZWUzlCAAAs9vXXXyslJUVer1dOp1Pnzp1T48aNg/KyEcF97BIAXINeeuklLV++XK1atdJ3332nadOmqV+/fnaXVSlCAACughYtWqikpEQOh0NJSUn66quv7C6pUgwHAYDF6tU7/6c1MjJSGzZsUHR0tI4fP25zVZXj6CAAsMiLL76olJQUZWZmyuVyKTo6WuPHj1dRUZGeeeYZ9enTx+4SKyAEAMAiffv2VVpaWoXHwYw5AQCwyMXb1NfK9jVzAgBgkbNnz2r37t0qLS0t9/gCLhsBAHVYZZeLuIDLRgAAgg5zAgBgMEIAAAxGCACAwQgBADAYIQAABvs/mUjb2mUAle0AAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"axes = coefficients['Coefficients'].plot.bar()"
]
},
{
"cell_type": "code",
"execution_count": 757,
"id": "durable-cooperation",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.96305218603375"
]
},
"execution_count": 757,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# bias\n",
"lr_model.intercept_[0]"
]
},
{
"cell_type": "markdown",
"id": "electronic-highway",
"metadata": {},
"source": [
"\n",
"Le poids des hommes, des femmes et des enfants de moins de 10 ans sont très discriminants. Etre un homme signifie une forte proba de ne pas survivre contrairement au fait d'être une femme ou un enfant (les femmes et les enfants d'abord) \n",
"\n",
"When we train our model on all features, the bias term is 1.96.\n",
"The reason why the sign before the bias term is positive is the class balance. There are a fewer surviver in the training data than non-survivers ones, meaning the probability of non-surviving on average is a little high."
]
},
{
"cell_type": "markdown",
"id": "optional-warren",
"metadata": {},
"source": [
"## 5. Training other models"
]
},
{
"cell_type": "code",
"execution_count": 758,
"id": "occupational-intervention",
"metadata": {},
"outputs": [],
"source": [
"X = df_train.drop('Survived', 1)\n",
"y = df_train['Survived']"
]
},
{
"cell_type": "code",
"execution_count": 759,
"id": "adult-texture",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": 760,
"id": "answering-lesbian",
"metadata": {},
"outputs": [],
"source": [
"# clf = svm.SVC(kernel='linear', C=1, random_state=42)\n",
"# scores = cross_val_score(clf, X, y, cv=5)\n",
"# scores"
]
},
{
"cell_type": "markdown",
"id": "assured-landing",
"metadata": {},
"source": [
"**Algorithms to use**\n",
"- Decision tree\n",
"- Naive Bayes\n",
"- Support vector machine (SVM)\n",
"- Random Forest\n",
"- Gradient Boosting"
]
},
{
"cell_type": "markdown",
"id": "utility-easter",
"metadata": {},
"source": [
"**Create instances of algorithms for classification**"
]
},
{
"cell_type": "code",
"execution_count": 799,
"id": "collected-strand",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7901136149645346"
]
},
"execution_count": 799,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lr_model = LogisticRegression(max_iter=1000)\n",
"\n",
"lr_model.fit(X, y)\n",
"\n",
"# cross-validation score\n",
"np.mean(cross_val_score(lr_model, X, y, cv=5))"
]
},
{
"cell_type": "code",
"execution_count": 800,
"id": "spoken-vessel",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7980290000627708"
]
},
"execution_count": 800,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"dt_model = DecisionTreeClassifier()\n",
"\n",
"dt_model.fit(X, y)\n",
"\n",
"np.mean(cross_val_score(dt_model, X, y, cv=5))"
]
},
{
"cell_type": "code",
"execution_count": 801,
"id": "economic-crowd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7811813445483649"
]
},
"execution_count": 801,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"nb_model = GaussianNB()\n",
"\n",
"nb_model.fit(X, y)\n",
"\n",
"np.mean(cross_val_score(nb_model, X, y, cv=5))"
]
},
{
"cell_type": "code",
"execution_count": 802,
"id": "departmental-restaurant",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.827154604230745"
]
},
"execution_count": 802,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.svm import SVC\n",
"\n",
"svm_model = SVC()\n",
"\n",
"svm_model.fit(X, y)\n",
"\n",
"np.mean(cross_val_score(svm_model, X, y, cv=5))"
]
},
{
"cell_type": "code",
"execution_count": 803,
"id": "quarterly-option",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8092398468394955"
]
},
"execution_count": 803,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"rf_model = RandomForestClassifier()\n",
"\n",
"rf_model.fit(X, y)\n",
"\n",
"np.mean(cross_val_score(rf_model, X, y, cv=5))"
]
},
{
"cell_type": "code",
"execution_count": 766,
"id": "defined-disability",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8204193082669011"
]
},
"execution_count": 766,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import GradientBoostingClassifier\n",
"\n",
"gb_model = GradientBoostingClassifier()\n",
"\n",
"gb_model.fit(X, y)\n",
"\n",
"np.mean(cross_val_score(gb_model, X, y, cv=5))"
]
},
{
"cell_type": "markdown",
"id": "daily-railway",
"metadata": {},
"source": [
"Gradient Boosting is the winner"
]
},
{
"cell_type": "markdown",
"id": "second-phoenix",
"metadata": {},
"source": [
"## 8. Hyperparameter tuning"
]
},
{
"cell_type": "markdown",
"id": "favorite-still",
"metadata": {},
"source": [
"Let's create a hyperparameter grid (a dictionary of different hyperparameters) for each and then test them out."
]
},
{
"cell_type": "code",
"execution_count": 767,
"id": "vocal-funeral",
"metadata": {},
"outputs": [],
"source": [
"# LogisticRegression hyperparameters\n",
"lr_grid = {\"C\": np.logspace(-4, 4, 20),\n",
" \"solver\": [\"liblinear\"]}\n",
"\n",
"# RandomForestClassifier hyperparameters\n",
"rf_grid = {\"n_estimators\": np.arange(10, 1000, 50),\n",
" \"max_depth\": [None, 3, 5, 10],\n",
" \"min_samples_split\": np.arange(2, 20, 2),\n",
" \"min_samples_leaf\": np.arange(1, 20, 2)}"
]
},
{
"cell_type": "markdown",
"id": "considered-specification",
"metadata": {},
"source": [
"We'll pass it the different hyperparameters from log_reg_grid as well as set n_iter = 20. This means, RandomizedSearchCV will try 20 different combinations of hyperparameters from log_reg_grid and save the best ones."
]
},
{
"cell_type": "code",
"execution_count": 768,
"id": "powered-mortgage",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 20 candidates, totalling 100 fits\n"
]
},
{
"data": {
"text/html": [
"<style>#sk-container-id-4 {color: black;background-color: white;}#sk-container-id-4 pre{padding: 0;}#sk-container-id-4 div.sk-toggleable {background-color: white;}#sk-container-id-4 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-4 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-4 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-4 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-4 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-4 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-4 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-4 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-4 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-4 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-4 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-4 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-4 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-4 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-4 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-4 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-4 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-4 div.sk-item {position: relative;z-index: 1;}#sk-container-id-4 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-4 div.sk-item::before, #sk-container-id-4 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-4 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-4 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-4 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-4 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-4 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-4 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-4 div.sk-label-container {text-align: center;}#sk-container-id-4 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-4 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-4\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20,\n",
" param_distributions={&#x27;C&#x27;: array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,\n",
" 4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,\n",
" 2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,\n",
" 1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,\n",
" 5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),\n",
" &#x27;solver&#x27;: [&#x27;liblinear&#x27;]},\n",
" verbose=True)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-10\" type=\"checkbox\" ><label for=\"sk-estimator-id-10\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">RandomizedSearchCV</label><div class=\"sk-toggleable__content\"><pre>RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20,\n",
" param_distributions={&#x27;C&#x27;: array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,\n",
" 4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,\n",
" 2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,\n",
" 1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,\n",
" 5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),\n",
" &#x27;solver&#x27;: [&#x27;liblinear&#x27;]},\n",
" verbose=True)</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-11\" type=\"checkbox\" ><label for=\"sk-estimator-id-11\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">estimator: LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression()</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-12\" type=\"checkbox\" ><label for=\"sk-estimator-id-12\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression()</pre></div></div></div></div></div></div></div></div></div></div>"
],
"text/plain": [
"RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20,\n",
" param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,\n",
" 4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,\n",
" 2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,\n",
" 1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,\n",
" 5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),\n",
" 'solver': ['liblinear']},\n",
" verbose=True)"
]
},
"execution_count": 768,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"# Setup random seed\n",
"np.random.seed(42)\n",
"\n",
"# Setup random hyperparameter search for LogisticRegression\n",
"rs_lr = RandomizedSearchCV(LogisticRegression(),\n",
" param_distributions=lr_grid,\n",
" cv=5,\n",
" n_iter=20,\n",
" verbose=True)\n",
"\n",
"# Fit random hyperparameter search model\n",
"rs_lr.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 769,
"id": "bright-mercy",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'solver': 'liblinear', 'C': 0.08858667904100823}"
]
},
"execution_count": 769,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rs_lr.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 770,
"id": "pediatric-ethernet",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8024691358024691"
]
},
"execution_count": 770,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rs_lr.score(X, y)\n"
]
},
{
"cell_type": "code",
"execution_count": 771,
"id": "arranged-northern",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 20 candidates, totalling 100 fits\n"
]
},
{
"data": {
"text/html": [
"<style>#sk-container-id-5 {color: black;background-color: white;}#sk-container-id-5 pre{padding: 0;}#sk-container-id-5 div.sk-toggleable {background-color: white;}#sk-container-id-5 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-5 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-5 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-5 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-5 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-5 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-5 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-5 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-5 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-5 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-5 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-5 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-5 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-5 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-5 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-5 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-5 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-5 div.sk-item {position: relative;z-index: 1;}#sk-container-id-5 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-5 div.sk-item::before, #sk-container-id-5 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-5 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-5 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-5 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-5 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-5 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-5 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-5 div.sk-label-container {text-align: center;}#sk-container-id-5 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-5 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-5\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,\n",
" param_distributions={&#x27;max_depth&#x27;: [None, 3, 5, 10],\n",
" &#x27;min_samples_leaf&#x27;: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n",
" &#x27;min_samples_split&#x27;: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n",
" &#x27;n_estimators&#x27;: array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,\n",
" 660, 710, 760, 810, 860, 910, 960])},\n",
" verbose=True)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-13\" type=\"checkbox\" ><label for=\"sk-estimator-id-13\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">RandomizedSearchCV</label><div class=\"sk-toggleable__content\"><pre>RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,\n",
" param_distributions={&#x27;max_depth&#x27;: [None, 3, 5, 10],\n",
" &#x27;min_samples_leaf&#x27;: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n",
" &#x27;min_samples_split&#x27;: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n",
" &#x27;n_estimators&#x27;: array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,\n",
" 660, 710, 760, 810, 860, 910, 960])},\n",
" verbose=True)</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-14\" type=\"checkbox\" ><label for=\"sk-estimator-id-14\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">estimator: RandomForestClassifier</label><div class=\"sk-toggleable__content\"><pre>RandomForestClassifier()</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-15\" type=\"checkbox\" ><label for=\"sk-estimator-id-15\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">RandomForestClassifier</label><div class=\"sk-toggleable__content\"><pre>RandomForestClassifier()</pre></div></div></div></div></div></div></div></div></div></div>"
],
"text/plain": [
"RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,\n",
" param_distributions={'max_depth': [None, 3, 5, 10],\n",
" 'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n",
" 'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n",
" 'n_estimators': array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,\n",
" 660, 710, 760, 810, 860, 910, 960])},\n",
" verbose=True)"
]
},
"execution_count": 771,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Setup random seed\n",
"np.random.seed(42)\n",
"\n",
"# Setup random hyperparameter search for RandomForestClassifier\n",
"rs_rf = RandomizedSearchCV(RandomForestClassifier(),\n",
" param_distributions=rf_grid,\n",
" cv=5,\n",
" n_iter=20,\n",
" verbose=True)\n",
"\n",
"# Fit random hyperparameter search model\n",
"rs_rf.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 772,
"id": "conditional-fitting",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'n_estimators': 610,\n",
" 'min_samples_split': 18,\n",
" 'min_samples_leaf': 1,\n",
" 'max_depth': 5}"
]
},
"execution_count": 772,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Find the best parameters\n",
"rs_rf.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 773,
"id": "personalized-superior",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8361391694725028"
]
},
"execution_count": 773,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Evaluate the randomized search random forest model\n",
"rs_rf.score(X, y)"
]
},
{
"cell_type": "markdown",
"id": "confirmed-audience",
"metadata": {},
"source": [
"Tha'ts way better!"
]
},
{
"cell_type": "code",
"execution_count": 774,
"id": "nasty-mixture",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 20 candidates, totalling 100 fits\n"
]
},
{
"data": {
"text/html": [
"<style>#sk-container-id-6 {color: black;background-color: white;}#sk-container-id-6 pre{padding: 0;}#sk-container-id-6 div.sk-toggleable {background-color: white;}#sk-container-id-6 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-6 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-6 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-6 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-6 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-6 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-6 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-6 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-6 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-6 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-6 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-6 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-6 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-6 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-6 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-6 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-6 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-6 div.sk-item {position: relative;z-index: 1;}#sk-container-id-6 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-6 div.sk-item::before, #sk-container-id-6 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-6 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-6 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-6 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-6 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-6 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-6 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-6 div.sk-label-container {text-align: center;}#sk-container-id-6 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-6 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-6\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=20,\n",
" param_distributions={&#x27;max_depth&#x27;: [None, 3, 5, 10],\n",
" &#x27;min_samples_leaf&#x27;: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n",
" &#x27;min_samples_split&#x27;: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n",
" &#x27;n_estimators&#x27;: array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,\n",
" 660, 710, 760, 810, 860, 910, 960])},\n",
" verbose=True)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-16\" type=\"checkbox\" ><label for=\"sk-estimator-id-16\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">RandomizedSearchCV</label><div class=\"sk-toggleable__content\"><pre>RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=20,\n",
" param_distributions={&#x27;max_depth&#x27;: [None, 3, 5, 10],\n",
" &#x27;min_samples_leaf&#x27;: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n",
" &#x27;min_samples_split&#x27;: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n",
" &#x27;n_estimators&#x27;: array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,\n",
" 660, 710, 760, 810, 860, 910, 960])},\n",
" verbose=True)</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-17\" type=\"checkbox\" ><label for=\"sk-estimator-id-17\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">estimator: GradientBoostingClassifier</label><div class=\"sk-toggleable__content\"><pre>GradientBoostingClassifier()</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-18\" type=\"checkbox\" ><label for=\"sk-estimator-id-18\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">GradientBoostingClassifier</label><div class=\"sk-toggleable__content\"><pre>GradientBoostingClassifier()</pre></div></div></div></div></div></div></div></div></div></div>"
],
"text/plain": [
"RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=20,\n",
" param_distributions={'max_depth': [None, 3, 5, 10],\n",
" 'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n",
" 'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n",
" 'n_estimators': array([ 10, 60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,\n",
" 660, 710, 760, 810, 860, 910, 960])},\n",
" verbose=True)"
]
},
"execution_count": 774,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Setup random seed\n",
"np.random.seed(42)\n",
"\n",
"# Setup random hyperparameter search for RandomForestClassifier\n",
"rs_gb = RandomizedSearchCV(GradientBoostingClassifier(),\n",
" param_distributions=rf_grid,\n",
" cv=5,\n",
" n_iter=20,\n",
" verbose=True)\n",
"\n",
"# Fit random hyperparameter search model\n",
"rs_gb.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 775,
"id": "demographic-assessment",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'n_estimators': 210,\n",
" 'min_samples_split': 4,\n",
" 'min_samples_leaf': 19,\n",
" 'max_depth': 3}"
]
},
"execution_count": 775,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Find the best parameters\n",
"rs_gb.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 776,
"id": "disabled-stream",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8428731762065096"
]
},
"execution_count": 776,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Evaluate the randomized search random forest model\n",
"rs_gb.score(X, y)"
]
},
{
"cell_type": "markdown",
"id": "located-palestine",
"metadata": {},
"source": [
"Still the best"
]
},
{
"cell_type": "markdown",
"id": "allied-produce",
"metadata": {},
"source": [
"## 7. Testing the model"
]
},
{
"cell_type": "markdown",
"id": "distinguished-florence",
"metadata": {},
"source": [
"**Use the model**"
]
},
{
"cell_type": "code",
"execution_count": 779,
"id": "diagnostic-associate",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Fare</th>\n",
" <th>Family_Size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pclass Sex Age Fare Family_Size\n",
"0 3 0 2.0 2.0 0\n",
"1 3 1 3.0 3.0 1\n",
"2 2 0 2.0 2.0 0\n",
"3 3 0 3.0 3.0 0\n",
"4 3 1 3.0 3.0 2"
]
},
"execution_count": 779,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 797,
"id": "configured-conditioning",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Pclass': 3.0, 'Sex': 0.0, 'Age': 0.0, 'Fare': 0.0, 'Family_Size': 0.0}"
]
},
"execution_count": 797,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"passenger = df_test.iloc[10]\n",
"passenger_dict = passenger.to_dict()\n",
"passenger_dict"
]
},
{
"cell_type": "code",
"execution_count": 788,
"id": "provincial-bride",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/assitan/.local/lib/python3.8/site-packages/sklearn/base.py:420: UserWarning: X does not have valid feature names, but GradientBoostingClassifier was fitted with feature names\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 788,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rs_gb.predict([passenger])[0]"
]
},
{
"cell_type": "markdown",
"id": "junior-accommodation",
"metadata": {},
"source": [
"This passenger should survive."
]
},
{
"cell_type": "markdown",
"id": "incorrect-adult",
"metadata": {},
"source": [
"Let's try another one."
]
},
{
"cell_type": "code",
"execution_count": 795,
"id": "designed-harris",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pclass 1.0\n",
"Sex 0.0\n",
"Age 5.0\n",
"Fare 5.0\n",
"Family_Size 0.0\n",
"Name: 11, dtype: float64"
]
},
"execution_count": 795,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"passenger_2 = df_test.iloc[11]\n",
"passenger_2.T"
]
},
{
"cell_type": "code",
"execution_count": 796,
"id": "juvenile-pathology",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/assitan/.local/lib/python3.8/site-packages/sklearn/base.py:420: UserWarning: X does not have valid feature names, but GradientBoostingClassifier was fitted with feature names\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 796,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rs_gb.predict([passenger_2])[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "settled-audience",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment