@HyeongWookKim
Last active June 13, 2020 08:11
Kaggle Titanic Tutorial Part 2
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kaggle - Titanic Tutorial_Part 2\n",
"**<Goal of the Analysis>**\n",
"- Build a model that **predicts** whether each passenger **survived**, using the personal information of the people aboard the Titanic.\n",
"\n",
"**<Tutorial Steps>**\n",
"1. Check the dataset\n",
" - Check whether null values exist and handle them.\n",
"2. Exploratory Data Analysis (EDA)\n",
" - Analyze the features one by one and check the correlations between them.\n",
" - Use visualization libraries to draw insights.\n",
"3. Feature Engineering\n",
" - Work done before choosing a model in order to improve its performance.\n",
" - One-hot encoding, splitting into classes, binning, processing text data, etc.\n",
"4. Build the model\n",
" - Build the model with scikit-learn.\n",
"5. Train the model and predict\n",
" - Train the model on the training dataset, then make predictions on the test dataset.\n",
"6. Evaluate the model\n",
" - Evaluate the predictive performance of the model.\n",
" - Choose an evaluation metric that fits the problem at hand.\n",
" - e.g. RMSE, R-squared, F1-score, ..."
]
},
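{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two of the feature engineering techniques named in step 3, binning and one-hot encoding, can be sketched on a small made-up table. This is only an illustration and not the notebook's own code: the `toy` DataFrame, the bin edges, the band labels, and the `AgeBand` column are all assumptions chosen for the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data; only the column names resemble the Titanic dataset\n",
"toy = pd.DataFrame({\n",
"    'Age': [4, 22, 35, 61],\n",
"    'Embarked': ['S', 'C', 'S', 'Q'],\n",
"})\n",
"\n",
"# Binning: map a continuous value to an ordered category (bin edges are arbitrary here)\n",
"toy['AgeBand'] = pd.cut(toy['Age'], bins=[0, 12, 40, 100],\n",
"                        labels=['child', 'adult', 'senior'])\n",
"\n",
"# One-hot encoding: one indicator column per category of Embarked\n",
"toy = pd.get_dummies(toy, columns=['Embarked'], prefix='Embarked')\n",
"print(toy)"
]
},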
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:37.378244Z",
"start_time": "2020-06-09T06:37:35.860463Z"
}
},
"outputs": [
{
"data": {
"text/html": [
" <script type=\"text/javascript\">\n",
" window.PlotlyConfig = {MathJaxConfig: 'local'};\n",
" if (window.MathJax) {MathJax.Hub.Config({SVG: {font: \"STIX-Web\"}});}\n",
" if (typeof require !== 'undefined') {\n",
" require.undef(\"plotly\");\n",
" requirejs.config({\n",
" paths: {\n",
" 'plotly': ['https://cdn.plot.ly/plotly-latest.min']\n",
" }\n",
" });\n",
" require(['plotly'], function(Plotly) {\n",
" window._Plotly = Plotly;\n",
" });\n",
" }\n",
" </script>\n",
" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import Series\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"# These two lines switch from matplotlib's default scheme to the seaborn scheme\n",
"# Setting seaborn's font_scale is more convenient than specifying the font size of every graph\n",
"plt.style.use('seaborn')\n",
"sns.set(font_scale = 2.5)\n",
"\n",
"import plotly.offline as py\n",
"py.init_notebook_mode(connected = True)\n",
"import plotly.graph_objs as go\n",
"import plotly.tools as tls\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Before starting <Part 2> in earnest, we first repeat the steps carried out in <Part 1>.**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:37.430117Z",
"start_time": "2020-06-09T06:37:37.381236Z"
}
},
"outputs": [],
"source": [
"df_train = pd.read_csv('../titanic/train.csv')\n",
"df_test = pd.read_csv('../titanic/test.csv')\n",
"\n",
"# Combine the SibSp and Parch variables into a single variable (FamilySize)\n",
"# Add 1 so that the passenger is counted too\n",
"df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1\n",
"df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1\n",
"\n",
"# Replace the missing values of Fare with the mean\n",
"df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()\n",
"\n",
"# Apply a log transform to Fare (fares of 0 are left at 0)\n",
"df_train['Fare'] = df_train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)\n",
"df_test['Fare'] = df_test['Fare'].map(lambda x: np.log(x) if x > 0 else 0)"
]
},
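{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `if x > 0 else 0` guard in the cell above matters because the logarithm of 0 is undefined (NumPy returns `-inf`). The sketch below compares the guarded transform with `np.log1p`, which computes log(1 + x) and needs no special case. The `fare` values are a hypothetical sample, and `np.log1p` is shown only as a possible alternative, not as what this notebook uses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample fares, including a zero fare\n",
"fare = pd.Series([0.0, 7.25, 71.2833, 512.3292])\n",
"\n",
"# Guarded log, as in the cell above: zero fares stay at 0 instead of becoming -inf\n",
"guarded = fare.map(lambda x: np.log(x) if x > 0 else 0)\n",
"\n",
"# Alternative: log(1 + x), which is defined at 0 without a special case\n",
"shifted = np.log1p(fare)\n",
"\n",
"print(pd.DataFrame({'fare': fare, 'guarded': guarded, 'log1p': shifted}))"
]
},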
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling Null Values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. Handling null values in Age**\n",
"- Age has as many as 177 null values.\n",
"- To fill them in, we will use **title + statistics**.\n",
" - English names carry titles such as Mr, Mrs, and Miss.\n",
" - Each passenger's name therefore contains one of these titles, and we will make use of it."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:37.447062Z",
"start_time": "2020-06-09T06:37:37.433097Z"
}
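},
"outputs": [],
"source": [
"# Extract the title with a regular expression\n",
"# (raw string so that \\. reaches the regex engine as an escaped period)\n",
"df_train['Initial'] = df_train.Name.str.extract(r'([A-Za-z]+)\\.', expand=False)\n",
"df_test['Initial'] = df_test.Name.str.extract(r'([A-Za-z]+)\\.', expand=False)"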
]
},
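{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the pattern `([A-Za-z]+)\\.` captures, here is a minimal sketch on a few sample strings written in the style of the `Name` column. The `names` Series is made up for illustration. The pattern takes the first run of letters that is immediately followed by a period, so for these three strings it should yield `Mr`, `Miss`, and `Countess`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical sample strings in the style of the Titanic Name column\n",
"names = pd.Series([\n",
"    'Braund, Mr. Owen Harris',\n",
"    'Heikkinen, Miss. Laina',\n",
"    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',\n",
"])\n",
"\n",
"# Capture the first run of letters that is immediately followed by a period;\n",
"# expand=False returns a Series rather than a one-column DataFrame\n",
"titles = names.str.extract(r'([A-Za-z]+)\\.', expand=False)\n",
"print(titles.tolist())"
]
},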
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:37.822594Z",
"start_time": "2020-06-09T06:37:37.450053Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<style type=\"text/css\" >\n",
" #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col0 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col1 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col2 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col3 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col4 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col5 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col6 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col7 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col8 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col9 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col10 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col11 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col12 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col13 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col14 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col15 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col16 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col0 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col1 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col2 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col3 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col4 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col5 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col6 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col7 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col8 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col9 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col10 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col11 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col12 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col13 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col14 {\n",
" background-color: #ffff66;\n",
" color: #000000;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col15 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" } #T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col16 {\n",
" background-color: #008066;\n",
" color: #f1f1f1;\n",
" }</style><table id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60\" ><thead> <tr> <th class=\"index_name level0\" >Initial</th> <th class=\"col_heading level0 col0\" >Capt</th> <th class=\"col_heading level0 col1\" >Col</th> <th class=\"col_heading level0 col2\" >Countess</th> <th class=\"col_heading level0 col3\" >Don</th> <th class=\"col_heading level0 col4\" >Dr</th> <th class=\"col_heading level0 col5\" >Jonkheer</th> <th class=\"col_heading level0 col6\" >Lady</th> <th class=\"col_heading level0 col7\" >Major</th> <th class=\"col_heading level0 col8\" >Master</th> <th class=\"col_heading level0 col9\" >Miss</th> <th class=\"col_heading level0 col10\" >Mlle</th> <th class=\"col_heading level0 col11\" >Mme</th> <th class=\"col_heading level0 col12\" >Mr</th> <th class=\"col_heading level0 col13\" >Mrs</th> <th class=\"col_heading level0 col14\" >Ms</th> <th class=\"col_heading level0 col15\" >Rev</th> <th class=\"col_heading level0 col16\" >Sir</th> </tr> <tr> <th class=\"index_name level0\" >Sex</th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> <th class=\"blank\" ></th> </tr></thead><tbody>\n",
" <tr>\n",
" <th id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60level0_row0\" class=\"row_heading level0 row0\" >female</th>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col0\" class=\"data row0 col0\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col1\" class=\"data row0 col1\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col2\" class=\"data row0 col2\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col3\" class=\"data row0 col3\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col4\" class=\"data row0 col4\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col5\" class=\"data row0 col5\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col6\" class=\"data row0 col6\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col7\" class=\"data row0 col7\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col8\" class=\"data row0 col8\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col9\" class=\"data row0 col9\" >182</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col10\" class=\"data row0 col10\" >2</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col11\" class=\"data row0 col11\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col12\" class=\"data row0 col12\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col13\" class=\"data row0 col13\" >125</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col14\" class=\"data row0 col14\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col15\" class=\"data row0 col15\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row0_col16\" class=\"data row0 col16\" >0</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60level0_row1\" class=\"row_heading level0 row1\" >male</th>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col0\" class=\"data row1 col0\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col1\" class=\"data row1 col1\" >2</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col2\" class=\"data row1 col2\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col3\" class=\"data row1 col3\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col4\" class=\"data row1 col4\" >6</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col5\" class=\"data row1 col5\" >1</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col6\" class=\"data row1 col6\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col7\" class=\"data row1 col7\" >2</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col8\" class=\"data row1 col8\" >40</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col9\" class=\"data row1 col9\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col10\" class=\"data row1 col10\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col11\" class=\"data row1 col11\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col12\" class=\"data row1 col12\" >517</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col13\" class=\"data row1 col13\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col14\" class=\"data row1 col14\" >0</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col15\" class=\"data row1 col15\" >6</td>\n",
" <td id=\"T_b5bd59f6_aa1b_11ea_8a3b_9822ef754c60row1_col16\" class=\"data row1 col16\" >1</td>\n",
" </tr>\n",
" </tbody></table>"
],
"text/plain": [
"<pandas.io.formats.style.Styler at 0x2127ae897c8>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap = 'summer_r')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:37.839966Z",
"start_time": "2020-06-09T06:37:37.824268Z"
}
},
"outputs": [],
"source": [
"# Using the table above as a reference, map the rare titles onto the common ones\n",
"rare_titles = ['Mlle', 'Mme', 'Ms', 'Dr', 'Major',\n",
"               'Lady', 'Countess', 'Jonkheer', 'Col',\n",
"               'Rev', 'Capt', 'Sir', 'Don', 'Dona']\n",
"common_titles = ['Miss', 'Miss', 'Miss', 'Mr', 'Mr',\n",
"                 'Mrs', 'Mrs', 'Other', 'Other',\n",
"                 'Other', 'Mr', 'Mr', 'Mr', 'Mr']\n",
"\n",
"# Assign the result back instead of calling replace(..., inplace = True) on a column selection\n",
"df_train['Initial'] = df_train['Initial'].replace(rare_titles, common_titles)\n",
"df_test['Initial'] = df_test['Initial'].replace(rare_titles, common_titles)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:37.877902Z",
"start_time": "2020-06-09T06:37:37.841961Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>FamilySize</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Initial</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Master</th>\n",
" <td>414.975000</td>\n",
" <td>0.575000</td>\n",
" <td>2.625000</td>\n",
" <td>4.574167</td>\n",
" <td>2.300000</td>\n",
" <td>1.375000</td>\n",
" <td>3.340710</td>\n",
" <td>4.675000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Miss</th>\n",
" <td>411.741935</td>\n",
" <td>0.704301</td>\n",
" <td>2.284946</td>\n",
" <td>21.860000</td>\n",
" <td>0.698925</td>\n",
" <td>0.537634</td>\n",
" <td>3.123713</td>\n",
" <td>2.236559</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mr</th>\n",
" <td>455.880907</td>\n",
" <td>0.162571</td>\n",
" <td>2.381853</td>\n",
" <td>32.739609</td>\n",
" <td>0.293006</td>\n",
" <td>0.151229</td>\n",
" <td>2.651507</td>\n",
" <td>1.444234</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mrs</th>\n",
" <td>456.393701</td>\n",
" <td>0.795276</td>\n",
" <td>1.984252</td>\n",
" <td>35.981818</td>\n",
" <td>0.692913</td>\n",
" <td>0.818898</td>\n",
" <td>3.443751</td>\n",
" <td>2.511811</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Other</th>\n",
" <td>564.444444</td>\n",
" <td>0.111111</td>\n",
" <td>1.666667</td>\n",
" <td>45.888889</td>\n",
" <td>0.111111</td>\n",
" <td>0.111111</td>\n",
" <td>2.641605</td>\n",
" <td>1.222222</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp Parch \\\n",
"Initial \n",
"Master 414.975000 0.575000 2.625000 4.574167 2.300000 1.375000 \n",
"Miss 411.741935 0.704301 2.284946 21.860000 0.698925 0.537634 \n",
"Mr 455.880907 0.162571 2.381853 32.739609 0.293006 0.151229 \n",
"Mrs 456.393701 0.795276 1.984252 35.981818 0.692913 0.818898 \n",
"Other 564.444444 0.111111 1.666667 45.888889 0.111111 0.111111 \n",
"\n",
" Fare FamilySize \n",
"Initial \n",
"Master 3.340710 4.675000 \n",
"Miss 3.123713 2.236559 \n",
"Mr 2.651507 1.444234 \n",
"Mrs 3.443751 2.511811 \n",
"Other 2.641605 1.222222 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.groupby('Initial').mean(numeric_only=True)"
]
},
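{
"cell_type": "markdown",
"metadata": {},
"source": [
"Per-title averages like the ones in the table above are the statistics half of the **title + statistics** idea. The sketch below shows one possible way to fill missing ages with the mean age of the matching title, using `groupby(...).transform('mean')` on a made-up DataFrame. The `toy` data and the `group_mean` name are assumptions for illustration; this is not necessarily how the rest of this notebook performs the imputation. On real data it is usually safer to compute the means on the training set only and reuse them for the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data; the column names mirror the notebook\n",
"toy = pd.DataFrame({\n",
"    'Initial': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss'],\n",
"    'Age': [30.0, 40.0, np.nan, 20.0, np.nan],\n",
"})\n",
"\n",
"# Mean age of each title group, broadcast back to every row (NaN is skipped)\n",
"group_mean = toy.groupby('Initial')['Age'].transform('mean')\n",
"\n",
"# Fill only the missing ages with the mean of the matching title\n",
"toy['Age'] = toy['Age'].fillna(group_mean)\n",
"print(toy)"
]
},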
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.095286Z",
"start_time": "2020-06-09T06:37:37.879860Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x2127c65c148>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAgAAAAG6CAYAAAB+wf50AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXxMZ///8dfJJhFUEHdVpdUWse9yUwRVjWhLq8XtbkRVte67m6Jo77Zo++2itFR3a6u48auW2ksRQhNLREPt8rWLiCWJbDK/P/KdIZVkJpFkTM77+Xh4NGauuc4nx3TOe865znUZFovFgoiIiJiKm7MLEBERkdKnACAiImJCCgAiIiImpAAgIiJiQgoAIiIiJqQAICIiYkIKACIiIibk4ewCSlNSUgrZ2a4z7UHVqhVITEx2dhllmvZx6dB+LnnaxyXP1faxm5uBn59vvs+bKgBkZ1tcKgAALlevK9I+Lh3azyVP+7jklaV9XKgAEBcXx/Tp04mKiuLChQv4+fnRokULwsPDadGiRZGLWL58OYsWLSIuLo6UlBQqV65Ms2bNeOqpp/j73/9e5H5FREQkb4ajUwGvWLGCESNGkJWVdcNzbm5uDB8+nMGDBxdq41evXmXkyJEsW7Ys3zZDhgxh+PDhheo3P4mJyS6V3vz9K5KQcNnZZZRp2selQ/u55GkflzxX28dubgZVq1bI/3lHOomNjeW1114jKyuLVq1aMXfuXLZs2cK8efMICgoiOzubjz/+mPXr1xequM8//9x28O/SpQsLFixg8+bNfPfddzRv3hyAb775hsWLFxeqXxERESmYQ2cAnnnmGTZt2kSdOnX4f//v/1GuXDnbc5mZmQwcOJBt27Zx77338ssvv+DmZj9XZGRk8Pe//52UlBTat2/PtGnTMAzD9nxaWhq9e/fm4MGD3H333axataqIv+I1OgMgf6V9XDq0n0ue9nHJc7V9fNNnAA4dOsSmTZsAGDp0aK6DP4CnpycjR460td2xY4dDhR0+fJiUlBQAevfunevgD+Dt7U3Pnj0BOHr0KBcvXnSoXxEREbHPbgCIiIgAwN3dnY4dO+bZpmnTplStWhWAtWvXOrbh684SXL16Nc82Hh7Xxig6clZBREREHGP3qLp3714AatWqRcWKFfNsYxgGgYGBQM6dAo646667qFSpEgA//fTTDc9nZmbyyy+/AFC3bt18ty0iIiKFZzcAnDhxAoA777yzwHY1atQA4Pjx4w5tuFy5crzwwgsAbNq0iZdeeonY2FgSExPZvn07zz77LHFxcXh6ejJmzBiH+hQRERHH2J0HICkpCcDuN3Dr85cuXXJ44+Hh4ZQvX57JkyezatWqGwb6NWnShDFjxtzUHAMiIiJyI7sBID09HcgZlFcQ6+BAa3tHXL16lbS0NMqXL5/n88eOHSM6OpqmTZvi7u7ucL/5KWg05K3K31+XPkqa9nHp0H4uedrHJa8s7WO7AcB64P3rKP2blZWVxSuvvMKaNWvw9PTk5ZdfplevXvj7+3PixAn++9//MmvWLCZNmsSff/7JxIkTb3ogoG4DlL/SPi4d2s8lT/u45LnaPrZ3G6DdAODj4wPk3JdfEEfPFFgtXryYNWvWAPDpp5/StWtX23N33303o0aN4r777uP1119n+fLlPPDAAzz88MMO9S0icquqWMkH73IlswxLSXw7TUvP4vKlK8Xerzif3Xeh9dr+5csFpx7rtX8/Pz+HNrxgwQIA2rRpk+vgf73evXszd+5c/vjjD+bPn68AICIuz7ucB48M/9nZZThs6cSeuM53XikMu+fUa9euDcCpU6cKbHf69Gng2t0A9hw9ehTANuVvftq0aQPkTBwkIiIixcNuAKhTpw4A8fHxpKam5tnGYrHY5gto0KCBQxvOzMzM9V9H24uIiMjNsxsAgoODgZwDsHVWwL+KiYnh/PnzAHTo0MGhDVvPLG
zdurXAdtu3b8/VXkRERG6e3QAQEBBgO00/ZcoU2/z9VpmZmUycOBHImbGvbdu2Dm24e/fuAOzZs4eFCxfm2WbJkiXs2rULQNf/RUREipFD99WNGTMGwzA4ePAgYWFhbN26laSkJGJiYhg8eDDR0dEYhsGwYcNy3S4YGxtLSEgIISEhzJkzJ1efYWFh3HvvvQC8+eabjB8/nri4OC5cuMCBAweYMGECo0ePBnIuK/Tr16+4fmcRERHTc+helKZNmzJu3DjGjh1LXFwc4eHhN7QZPXo0Xbp0yfXYlStXOHLkCHBtRkErHx8fvv32W4YOHcq+ffv44Ycf+OGHH27ot3HjxnzxxRd4eXk5/EuJiIhIwRy+GbVv3740atSIGTNmEBUVRVJSEr6+vjRv3pzw8HCHT/1fr2bNmixatIgff/yR5cuXs2/fPpKTk6lYsSKBgYE8/PDD9OzZE09Pz0L3LSIiIvkzLBaL60yNd5M0E6D8lfZx6dB+vsbfv6LLzQOgf7scrvY+tjcT4M3NrSsiIiIuSQFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAEREREzIw9kFiDiiYiUfvMuVzNvV379iifSblp7F5UtXSqRvEZGbpQAgLsG7nAePDP/Z2WUUytKJPbns7CJERPKhSwAiIiImpAAgIiJiQgoAIiIiJqQAICIiYkIKACIiIiakACAiImJCCgAiIiImpAAgIiJiQgoAIiIiJqQAICIiYkIKACIiIiakACAiImJCCgAiIiImpAAgIiJiQgoAIiIiJqQAICIiYkIKACIiIiakACAiImJCCgAiIiImpAAgIiJiQgoAIiIiJqQAICIiYkIKACIiIiakACAiImJCCgAiIiIm5FGYxnFxcUyfPp2oqCguXLiAn58fLVq0IDw8nBYtWhS5iLNnzzJz5kw2bNjAyZMnMQyDu+++m+7duzNgwAC8vb2L3LeIiIjcyOEAsGLFCkaMGEFWVpbtsbNnz7Jy5UpWr17N8OHDGTx4cKEL2LBhA8OGDSMlJSXX43v27GHPnj0sXbqU2bNnU6VKlUL3LSIiInlz6BJAbGwsr732GllZWbRq1Yq5c+eyZcsW5s2bR1BQENnZ2Xz88cesX7++UBvfu3cvL7zwAikpKdSsWZNPPvmEiIgIfv75ZwYMGIBhGOzfv58xY8YU5XcTERGRfDgUACZPnkxGRgZ16tRhxowZtGzZkipVqtCiRQumT59Oq1atsFgsfPTRR2RnZzu88bfffpuMjAxq1KjBDz/8QGhoKNWrVycwMJA33niD559/HoD169ezf//+ov2GIiIicgO7AeDQoUNs2rQJgKFDh1KuXLlcz3t6ejJy5Ehb2x07dji04djYWHbt2gXAmDFjqFGjxg1twsLCcHNzw9vbm9jYWIf6FREREfvsjgGIiIgAwN3dnY4dO+bZpmnTplStWpXExETWrl1Lq1at7G54xYoVAAQEBPDQQw/l2aZq1ars3LlTgwBFRESKmd0zAHv37gWgVq1aVKxYMc82hmEQGBgI5Nwp4AjrN/qgoKBcj1ssFq5evWr7uw7+IiIixc/uGYATJ04AcOeddxbYznoK//jx4w5t+MCBAwDcddddWCwWFi1axIIFC/jzzz/JysoiICCARx99lEGDBuHj4+NQnyIiIuIYuwEgKSkJIN9v/1bW5y9dumR3oxkZGVy8eBGA8uXLM3jwYNs4A6ujR48yZcoUVq1axbRp06hevbrdfkVERMQxdi8BpKenA/ZPxVsHB1rbFyQ5Odn285dffsmmTZsIDQ1lyZIl7N69m7Vr1zJ48GAMw2Dfvn289NJLWCwWu/2KiIiIY+yeAXB3dwdyrv
MXl+tDQkJCAn379mX8+PG2x+68805GjhyJn58fEyZMYOfOnaxZs4Zu3brd1HarVq1wU693Bn//gs+8yK1N/37XaF+4Lv3bXVOW9oXdAGC9/p6WllZgO0fPFPy1jZeXFyNGjMiz3cCBA5k5cybnzp1j1apVNx0AEhOTyc52nTMJ/v4VSUi47Owybgmu+j+d/v1y6L18jSu+l/Vvl8PV3sdubkaBX3ztXgKwXtu/fLngX9p67d/Pz89uUb6+vraf69evT6VKlfJs5+HhQbNmzYCcOQZERESkeNgNALVr1wbg1KlTBbY7ffo0QJ4T+vyVl5eXbVDfXycW+qsKFXLSiyNjC0RERMQxdgNAnTp1AIiPjyc1NTXPNhaLxTZfQIMGDRzasHXegGPHjhXYLjExEUB3AYiIiBQjuwEgODgYgMzMTNusgH8VExPD+fPnAejQoYNDG7b2e+rUqXyn+U1NTWXnzp0AtksBIiIicvPsBoCAgACaN28OwJQpU25YtjczM5OJEycCULduXdq2bevQhkNDQ21jAd599908Bxl+9tlnJCcnYxgGjz76qEP9ioiIiH0OrQY4ZswYDMPg4MGDhIWFsXXrVpKSkoiJiWHw4MFER0djGAbDhg3LdbtgbGwsISEhhISEMGfOnFx9VqlShddeew2AXbt20b9/fzZs2MD58+c5ePAg//nPf5gxYwYAAwYM4N577y2u31lERMT07N4GCDmL/YwbN46xY8cSFxdHeHj4DW1Gjx5Nly5dcj125coVjhw5AlybUfB6/fr149KlS3z66afExcUxZMiQG9r06NGD4cOHO/TLiIiIiGMcCgAAffv2pVGjRsyYMYOoqCiSkpLw9fWlefPmhIeHO3zq/6+GDBlCx44dmT17Nr///jsJCQlUqlSJwMBA+vTpk+9KgbeSipV88C7n8K4slJK4ZzgtPYvLl64Ue78iIuI6CnXUatiwoe16vyOCgoLYt2+f3XaBgYG8//77hSnlluJdzoNHhv/s7DIctnRiT1xnKgsRESkJDo0BEBERkbJFAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMqFABIC4ujldffZX27dvTqFEjOnTowMsvv8yOHTuKtagDBw7QuHFj6tWrx7Zt24q1bxERESlEAFixYgV9+vRh2bJlJCQkkJmZydmzZ1m5ciX//Oc/mTZtWrEUlJmZyWuvvUZGRkax9CciIiI3cigAxMbG8tprr5GVlUWrVq2YO3cuW7ZsYd68eQQFBZGdnc3HH3/M+vXrb7qgqVOnsmfPnpvuR0RERPLnUACYPHkyGRkZ1KlThxkzZtCyZUuqVKlCixYtmD59Oq1atcJisfDRRx+RnZ1d5GJiYmL49ttvi/x6ERERcYzdAHDo0CE2bdoEwNChQylXrlyu5z09PRk5cqStbVHHA1y5coVRo0Zx9epVHnvssSL1ISIiIo6xGwAiIiIAcHd3p2PHjnm2adq0KVWrVgVg7dq1RSrkww8/5OjRo3Tu3FkBQEREpITZDQB79+4FoFatWlSsWDHPNoZhEBgYCOTcKVBYERERzJs3j8qVK/POO+8U+vUiIiJSOHYDwIkTJwC48847C2
xXo0YNAI4fP16oAi5evMgbb7wBwNtvv42/v3+hXi8iIiKFZzcAJCUlAeT77d/K+vylS5cKVcC4ceM4c+YMISEhhIaGFuq1IiIiUjR2A0B6ejoA3t7eBbazDg60tnfE8uXLWbZsGdWqVWPs2LEOv05ERERujoe9Bu7u7kDOdf7idPbsWcaNGwfA+PHj8fPzK9b+81K1aoUS34ar8Pcv+IyOFA/t52u0L1yX/u2uKUv7wm4A8PHxASAtLa3Ado6eKbB64403uHDhAo899hgPPPCAQ6+5WYmJyWRnW4q9X1d8QyQkXHZ2CYXiivsYXG8/lxR//4raF//HFd/L+rfL4WrvYzc3o8AvvnYvAViv7V++XPAvbb3278g3+Xnz5rFx40Zuv/122wBAERERKT12zwDUrl2bqKgoTp06VWC706dPA9fuBijI8uXLba9p1apVgW3/+c9/AlCzZk3WrVtnt28RERGxz+4ZgDp16gAQHx9Pampqnm0sFottvoAGDRoUY3kiIiJSEuyeAQgODubdd98lMzOTiIgIHnrooRvaxMTEcP78eQA6dOhgd6PffvstV69ezff5bdu2MWTIEABmzJhBs2bNcHMr1MrFIiIiUgC7R9WAgACaN28OwJQpU0hJScn1fGZmJhMnTgSgbt26tG3b1u5Gvb298fX1zffP9QMJy5Urh6+vr20wooiIiNw8h75WjxkzBsMwOHjwIGFhYWzdupWkpCRiYmIYPHgw0dHRGIbBsGHDct0uGBsbS0hICCEhIcyZM6fEfgkREREpHLuXACBnsZ9x48YxduxY4uLiCA8Pv6HN6NGj6dKlS67Hrly5wpEjR4BrMwqKiIiI8zkUAAD69u1Lo0aNmDFjBlFRUSQlJeHr60vz5s0JDw936NS/iIiI3BocDgAADRs2tF3vd0RQUBD79u0rdFFFfZ2IiIg4RkPrRURETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAEREREzIozCN4+LimD59OlFRUVy4cAE/Pz9atGhBeHg4LVq0KFIB2dnZ/PLLLyxZsoQ9e/Zw6dIlypcvT926denevTtPPvkkXl5eRepbRERE8uZwAFixYgUjRowgKyvL9tjZs2dZuXIlq1evZvjw4QwePLhQG09OTmbo0KFERUXlevzixYtER0cTHR3NokWL+Prrr6levXqh+hYREZH8OXQJIDY2ltdee42srCxatWrF3Llz2bJlC/PmzSMoKIjs7Gw+/vhj1q9fX6iNjx49mqioKAzD4KmnnuLnn39m69atLFy4kP79+2MYBnv27OGFF14gOzu7KL+fiIiI5MGhMwCTJ08mIyODOnXqMGPGDMqVKwdAlSpVmD59OgMHDmTbtm189NFHdOzYETc3+7kiNjaWNWvWADBs2DCee+4523N+fn40adKEunXrMnbsWHbt2sXq1asJCQkpyu8oIiIif2H3SH3o0CE2bdoEwNChQ20HfytPT09Gjhxpa7tjxw6HNrxy5UoAKlWqxNNPP51nm379+vG3v/0NgA0bNjjUr4iIiNhnNwBEREQA4O7uTseOHfNs07RpU6pWrQrA2rVrHdrwuXPn8PT0JDAwMN9BfoZhUK
tWLSBnvIGIiIgUD7sBYO/evQDUqlWLihUr5tnGMAwCAwOBnDsFHPHRRx+xe/duvvzyywLbHTt2DMg5UyAiIiLFw24AOHHiBAB33nlnge1q1KgBwPHjxx3euGEYVKhQId/nf/vtN86cOQNAy5YtHe5XRERECmY3ACQlJQHk++3fyvr8pUuXiqGsnH7ee+89IOfb/8MPP1ws/YqIiIgDdwGkp6cD4O3tXWA76+BAa/ubkZ6ezosvvmg7/T98+HAqV6580/1WrZr/2Qaz8fcvONBJ8dB+vkb7wnXp3+6asrQv7AYAd3d3IOd0fWm4cuUKL7zwAlu3bgWgZ8+e9OvXr1j6TkxMJjvbUix9Xc8V3xAJCZedXUKhuOI+BtfbzyXF37+i9sX/ccX3sv7tcrja+9jNzSjwi6/dAODj4wNAWlpage0cPVNQkKSkJIYOHcrOnTsB6NKli+0ygIiIiBQfu2MArNf2L18uOPVYr/37+fkVqZCjR4/St29f28G/R48eTJkyBU9PzyL1JyIiIvmzGwBq164NwKlTpwpsd/r0aeDa3QCFsWPHDvr27Ut8fDwAAwYM4OOPP9bBX0REpITYvQRQp04dAOLj40lNTaV8+fI3tLFYLLb5Aho0aFCoAjZt2sS//vUv0tPTcXNzY9SoUQwcOLBQfYiIiEjh2D0DEBwcDEBmZqZtVsC/iomJ4fz58wB06NDB4Y3v3LmTF154gfT0dDw9PZk0aZIO/iIiIqXAbgAICAigefPmAEyZMoWUlJRcz2dmZjJx4kQA6tatS9u2bR3a8MWLFxk2bBhXrlzBw8ODL774gu7duxe2fhERESkCh5YDHjNmDIZhcPDgQcLCwti6dStJSUnExMQwePBgoqOjMQyDYcOG5bpdMDY2lpCQEEJCQpgzZ06uPr/66ivbuIIhQ4bQsmVLUlJS8v1j7y4EERERcZxDywE3bdqUcePGMXbsWOLi4ggPD7+hzejRo+nSpUuux65cucKRI0eAazMKQs4tg/Pnz7f9/YsvvuCLL74osIY2bdrw/fffO1KuiIiI2OFQAADo27cvjRo1YsaMGURFRZGUlISvry/NmzcnPDzc4VP/APv37yc1NbVIBYuIiMjNczgAADRs2NB2vd8RQUFB7Nu374bHGzdunOfjIiIiUjocGgMgIiIiZYsCgIiIiAkV6hKAiJRdFSv54F2u5D4SSmIRnLT0LC5fulLs/YqYgQKAiADgXc6DR4b/7OwyCmXpxJ64ztpsIrcWXQIQERExIQUAERERE1IAEBERMSEFABERERNSABARETEhBQARERETUgAQERExIQUAERERE1IAEBERMSEFABERERNSABARETEhBQARERETUgAQERExIQUAERERE1IAEBERMSEFABERERPycHYBIiIixaliJR+8y5XM4c3fv2KJ9JuWnsXlS1dKpO/8KACIiEiZ4l3Og0eG/+zsMgpl6cSeXC7lbeoSgIiIiAkpAIiIiJiQAoCIiIgJKQCIiIiYkAKAiIiICSkAiIiImJACgIiIiAkpAIiIiJiQAoCIiIgJKQCIiIiYkAKAiIiICSkAiIiImJACgIiIiAkpAIiIiJiQAoCIiIgJKQCIiIiYkAKAiIiICSkAiIiImJACgIiIiAkpAIiIiJiQAoCIiIgJKQCIiIiYkAKAiIiICSkAiIiImJACgIiIiAkpAIiIiJiQAoCIiIgJKQCIiIiYkEdhGsfFxTF9+nSioqK4cOECfn5+tGjRgvDwcFq0aFHkIrZu3cqsWbOIiYkhOTmZatWq0bZtW55++mnq1q1b5H5FREQkbw6fAVixYgV9+vRh2bJlJCQkkJmZydmzZ1m5ciX//Oc/mTZtWpEKmDlzJuHh4fz2228kJSWRmZnJqVOn+PHHH+nduzdLly4tUr8iIiKSP4cCQGxsLK+99hpZWVm0atWKuXPnsmXLFubNm0dQUBDZ2dl8/PHHrF+/vlAb//XXX/nwww8BePDBB/nxxx/Zsm
ULM2fOJDAwkIyMDMaMGcOePXsK/YuJiIhI/hwKAJMnTyYjI4M6deowY8YMWrZsSZUqVWjRogXTp0+nVatWWCwWPvroI7Kzsx3asMVi4ZNPPsFisXD//fczZcoUGjZsSJUqVWjXrh0//PADtWvXJjMzk4kTJ97ULykiIiK52Q0Ahw4dYtOmTQAMHTqUcuXK5Xre09OTkSNH2tru2LHDoQ1v3LiRgwcPAvDSSy/h5pa7lAoVKvDiiy8CsHnzZk6dOuVQvyIiImKf3QAQEREBgLu7Ox07dsyzTdOmTalatSoAa9eudWjD1n6rVKlC06ZN82wTHByMu7s7FouFdevWOdSviIiI2Gc3AOzduxeAWrVqUbFixTzbGIZBYGAgkHOngCP+/PNPABo0aIBhGHm2qVChArVq1QLgjz/+cKhfERERsc9uADhx4gQAd955Z4HtatSoAcDx48cd2rC135o1axZrvyIiImKf3QCQlJQEkO+3fyvr85cuXXJowyXVr4iIiNhndyKg9PR0ALy9vQtsZx0caG3vrH4L4uaW96WG4lDdz6fE+i4JJbkvSoqr7WNwvf2sfVw6XG0/ax+XjuLez/b6sxsA3N3dAfK9Tl9U7u7uZGdn2+3XYrEU2zb9/HyLra+/mv6fbiXWd0moWrWCs0soNFfbx+B6+1n7uHS42n7WPi4dpb2f7V4C8PHJSVFpaWkFtnP0G31h+83IyChUvyIiImKf3QBgvQZ/+fLlAttZr9H7+fk5tOGS6ldERETssxsAateuDWB3Ip7Tp08D10btO6tfERERsc9uAKhTpw4A8fHxpKam5tnGYrHY5gto0KCBQxu29mt9XV6Sk5M5duxYofoVERER++wGgODgYAAyMzNts/f9VUxMDOfPnwegQ4cODm3Y2u/Zs2fzneRn/fr1XL16FYD27ds71K+IiIjYZzcABAQE0Lx5cwCmTJlCSkpKruevX6ynbt26tG3b1qENt2nTxnZa/6OPPiIrKyvX88nJyUydOhWATp06cffddzvUr4iIiNjn0GqAY8aMwTAMDh48SFhYGFu3biUpKYmYmBgGDx5MdHQ0hmEwbNiwXLf1xcbGEhISQkhICHPmzMnVp7u7O6NGjQLg999/Z8iQIQHJMCgAACAASURBVOzatYukpCS2bNnCU089xZEjR/Dy8rItCiQiIiLFw+48AJCz2M+4ceMYO3YscXFxhIeH39Bm9OjRdOnSJddjV65c4ciRI8C1mf+u1717d/bt28eXX37J5s2b2bx5c+7iPDyYMGECjRo1cvgXEhEREfscCgAAffv2pVGjRsyYMYOoqCiSkpLw9fWlefPmhIeHO3zq/69eeeUVgoKC+O6779i1axcXL17Ez8+P1q1bM3jwYBo2bFikfkVERCR/hqU4p9oTERERl+DQGAAREREpWxQARERETEgBQERExIQUAERERExIAcDJTp486ewSRIrF7t27SU5OdnYZIuIgBQAnGzZsGJ06dWLdunXOLkXkprz11lu0b9+eZcuWObsUkSJ78cUXGT58OHv27HF2KSXO4XkApGQcOnSIlJQUrXboROfOneOnn37izJkzBAQE8Oijj3Lbbbc5uyyXEx8fT3p6OoGBgc4uxRT2799P7dq18fT0tD22ceNG5syZY3sv9+/fv8hztJjV9u3bSUpK4vHHH3d2KSVOAcDJrIsd+fv7O7mSsu3ChQtMnjyZVatW8d///pdatWoBEBcXx6BBg7h06ZKt7WeffcbUqVNp06aNs8p1SW5uOScUK1So4ORKyrbdu3fz+uuvc+jQIZYsWcJ9990HwNKlSxk1ahQWiwWLxcL+/fv59ddfGTFiBM8884yTq3Yd1stY9evXd3IlJU+XAJysXbt2ACxZssTJlZRdGRkZDBgwgPnz55OUlGRbYhrgjTfe4OLFi1gsFtuB69KlS/z73/8mMTHRWSW7pAcffBCLxcKsWbOcXUqZlZiYyKBBgzhw4ADZ2dm293JmZiYffP
AB2dnZeHp60qVLFwICArBYLEyaNKnAZdclt7p16wI5q9yWde5jx44d6+wizOzvf/87mzZt4pdffuHMmTN4enri6+uLt7d3roWVpOjmz5/P4sWLAXjsscd46KGHqFixIrGxsXz55ZcYhsGQIUOYOXMmPXv2ZOPGjSQkJODh4aHTp4XQrl074uLiWLp0KTt37iQtLY3s7GwyMjJIS0sjJSUl3z86a+CYb775hsjISHx9fXnnnXfo3LkzHh4ebNq0iQULFmAYBhMmTOCVV16hX79+7Nq1i//93//FYrHQuXNnZ5fvEgIDA1m2bBm//fYbNWrUICAgAA+PsnmyXFMBO9mgQYNIS0tjx44dNxzw3d3dC3ztH3/8UZKllRkDBw7k999/Z+DAgbYVKAE++eQTvv76a7y9vdmyZQs+Pj4ArFy5kldeeYXAwEB++uknZ5Xtch566CHbt9LChFfDMEwx4Ko49O7dmz179vD222/Tr18/2+NvvfUWCxYswM/Pj8jISNv+j46OJiwsjICAAFavXu2ssl3KV199xaFDh1i6dCmGYeDu7k5AQAB+fn6UK1cu39cZhsH06dNLsdKbVzZjjQu5/n/Wv2axrKysfF+nswOO279/PwBPPPFErsc3btyIYRi0bt3advAHaNKkCQAnTpwovSLLgPj4eNvP+l5RMqyn/Nu3b5/r8U2bNmEYBu3atcv12XDPPfcAcPbs2dIr0sV9+umnuT6Ts7KyOHz4cL7tDcPAYrG45GeyAoCTvfDCC84uocyzDvCrVq2a7bHExET+/PNP4No4DCtryr9y5UopVVg2vP/++84uocxLTU0FwNfX1/bY4cOHOXnypC0AXC89PR1QICuM1q1bO7uEUqMA4GQKACXP19eXS5cucf78edvtfREREbbUfv/99+dqf+TIEQAqV65c6rW6sscee8zZJZR5VapUISEhgZMnT+Ln5wfknMmy+uuZgbi4OACqV69eekW6uO+//97ZJZQa3QUgZZ71dp41a9bYHrNe269Ro4Zt1K/V9OnTMQyDhg0bll6RIg5o3rw5AN999x0AKSkpzJ8/3/Z+/dvf/mZre+HCBaZMmYJhGLRo0cIp9cqtTWcAbjGxsbFs376d06dPk5yczHvvvQfAqlWraNSoETVr1nRyha4nNDSUrVu3MmXKFI4dO8a5c+fYunUrhmHQs2dPWzvrJCrWsQHXPydyK+jduzerVq1iyZIlxMTEkJqaSkJCAoZh5BoUOHHiRJYsWcKZM2dwc3Ojf//+TqzatWVnZ7N3717bZ7L1c+HYsWO2+URcle4CuEVs27aN8ePHc+DAgVyPW+/fDQkJ4fjx4zz99NMMGzbMNumK2JeVlcXTTz9NdHR0rsE9d911F4sXL6Z8+fIA3H///Zw/fx6LxULXrl2ZOnWqM8u+pQ0YMKDY+jIMg9mzZxdbf2XdpEmT+Oabb3I99uCDD/LZZ5/Z/t61a1eOHz+OYRi8+uqrPPvss6VdpstLSkri888/Z/HixbaxF3DtM7lnz55kZGQwZswYOnbs6Kwyb4oCwC1gxYoVjBw50jbq38vLi4yMDAzDsL3ZmjRpYnusV69eGnBVSJmZmcyYMYM1a9aQkZFB69at+fe//02VKlVsbcLCwjhw4ADh4eE8++yzZfbe3+IQGBiY790rhXX9+1wcExMTw+rVq23v5W7duuUahT5s2DDS09MZNGgQrVq1cmKlrunAgQMMHjyYs2fP5np/X/9ebdWqFcnJyRiGccNtma5CAcDJTpw4QWhoKOnp6bRs2ZJXX32V++67j6CgoFxvtj179vD+++/bvsV+8803dOjQwcnVly2nT5/G39/f7vwLkjsAVK1alTp16txUfzNnziyOssq83bt3U7t2bU2cVIJSU1N5+OGHOXnyJLfddhsDBw6kcePGDB48ONdn8qxZs5g2bRrnzp3Dw8ODn376yTYts6vQVxwnmzFjBunp6TRv3pxZs2bh6emZ63STVYMGDZgxYw
aDBg0iOjqa//73vwoAxez22293dgkuw9/fn4SEBCDnlsrbbruN0NBQQkNDqV27tpOrK7veeustjhw5wnvvvUePHj2cXU6Z9P3333Py5Elq1KjB/Pnz+dvf/pbnZ/LAgQN56KGHCAsL48SJE3z33XeMHz/eCRUXnS4kO5l1Ao+XXnop16peefH09ORf//oXALt27SqN8sqc/fv3k5mZmeuxjRs3MmTIEHr27MmLL77Ili1bnFSd67AOmOzfvz9Vq1bl0KFDTJ06ldDQUB5//HGmTZumiZRKgFZcLHmrV6/GMAxeeeWVXHdV5KVGjRq88sorWCwWtm7dWkoVFh8FACc7ffo04PjKU/Xq1QNyBqiI43bv3s0jjzxCr169cs1Yt3TpUp5//nkiIiLYt28fv/76K4MGDXK5KT1Lm2EYtGrVirfeeouNGzcyc+ZMnnzySSpXrsyePXuYOHEiXbt2pW/fvnz33Xeaia6YaMXFkmf9fHB0HRDrxEFnzpwpsZpKigKAk3l5eQGQlpbmUPuUlBQg90xgUjCtoFay3NzcaNu2LePHj2fTpk1MmzaNXr16UalSJXbt2sX7779Pp06dCAsLY/78+Zw/f97ZJbssrbhY8qyDse2dkbWyhjJXvDNLYwCc7K677iIuLo6IiAiefPJJu+1Xrlxpe5045vvvv+fy5cv4+vry5ptv2pJ9ZGQkiYmJGIbBBx98QGhoKJmZmTz33HNERkYyb948l7um52zu7u60b9+e9u3bk5WVxebNm1m2bBnr1q0jOjqabdu28e677xIUFERoaCjdunWjYsWKzi7bZbz55pskJCQwa9Ys9u/fT7du3ahfvz5+fn62LxP5sXc6W3LcfvvtxMfHExsbS3BwsN321lP/NWrUKOnSip0CgJN17dqVP/74g8mTJ3P//fdzxx135Nv2+uVrtbSn4yIiIjAMg5EjR9KrVy/b42vXrgVypvzt3r07kJP6hw4dSmRkpEte07uVeHh4EBwcTHBwMBkZGWzcuJEVK1awfv16Nm/eTGRkJOPGjaN9+/Z88cUXzi7XJTz22GNkZ2djsViIjIwkMjLSoddpxUXHtW/fnqNHj/LZZ5/Rrl27As8EXLhwgcmTJ+e5DoMrcL1zFmVMWFgY1atXJzExkccff5ypU6fm+p86Pj6eTZs2MX78eJ566ilSU1OpUqUKYWFhTqzatWgFNefz8vKia9euTJw4kfXr19OnTx8AMjIy+O2335xcneuIj4+3vZ8tFkuh/ohjnn76aby8vIiLi2PAgAFER0fbFlWySktLY+XKlfTp04fjx4/j7u5erJNjlRadAXAyX19fvv76a55++mkuXLjA559/Dlxb7jckJMTW1mKx4Ovry2effaZBQIWgFdScLy0tjd9++43Vq1ezYcMGrly5Ytu/Gs/iOE0AVvJq1qzJu+++y6hRo4iJibEd2K2fyR06dCApKYmrV6/a3sOvv/46AQEBTqu5qBQAbgH169dnyZIlfPTRR6xcudI2COV61tP+o0eP1vX/QtIKas6RmprK+vXrWblyJREREaSlpdk+MMuXL0/nzp3p3r27y06j6gxacbF0PProo/j5+fH2229z8uTJXM9Z57+AnEmw3njjDUJDQ0u7xGKhmQBvMZcuXWLnzp0cO3aM5ORkvL29ueOOO2jRokWu9ezFcS+//DKrV6/m0Ucf5cMPPyQlJYXevXsTHx9Pw4YNWbRoka3thQsXCAsL4+DBg7b24riUlBTWrVvHypUr2bx5M+np6baDvo+Pj+2gHxwcbHfQmoizZWdns3HjRqKiojh+/Hiuz+SWLVvywAMPuPT7WAHABZ07d44TJ07QtGlTZ5fiEqwT/RiGQUBAQK4V1N555x2eeOIJ4MYV1ObNm6d97IDk5GR+/fVXVq1axebNm8nMzMx10O/UqZPtoF+uXDknVysiVgoAThYYGIibmxvbt2/Hx8fHbvvz58/Trl07qlevnus0thRMK6gVr0uXLtkO+p
GRkWRlZeU66AcHBxMSEkKnTp3w9vZ2crWuSSsuSknTGIBbQGEymHUiIM0EWDivvvoqXbp0uWEFtes1btyYunXragU1B7Rr146rV68COe9fHx8fOnbsSEhICJ07d9ZBvxhERUUV64qLUnj79u3j2LFjpKamkp2dbbf99bcZuwIFgFKSnZ3N9OnTb7idxOqrr76yO/NUZmam7Vv/9cvYimOaNWtGs2bN8n3+k08+KcVqXFtWVpbtoFK3bl06d+6Mj48P8fHxRZql7vnnny/mCsuWatWq3fSKi+K4qKgo/vOf/9huuXSEdal2V6IAUErc3NxIT09n6tSpN6Rxi8Vyw+np/Fi/CbjqqFMpew4cOMCBAweK/HrDMBQA8qAVF53j0KFDPPfcc7nuWimrNAagFGVkZDBgwIBcE8xY70W3N42kYRh4eHjg5+dH27ZtGTp0qEuPPi0pO3bsAMDb25sGDRrkeqwoWrRoUSx1lTXFvRrdn3/+Waz9lQUWi4Xt27ezfPlyVq9ezblz52xfHurXr09oaCjdu3enZs2aTq60bBkzZgyLFy/Gw8ODgQMH0q1bN6pVq4a7u7vd17radMsKAE4WGBiIYRjs2LHDoUGAUjDr/gwICGDVqlW5HissTZ8qt4rs7Gx+//13VqxYwZo1a0hKSrK9p5s0aUKPHj0ICQnR3BXFoFOnTpw5c4aXXnqJoUOHOrucEqVLAE7Wq1cv27d7KR55ZVrlXHFl1hUX27Zty9tvv82WLVtsiyzt2rWL2NhYPvjgA1q2bEmPHj3o1q2bxgkVkXW1yh49eji5kpKnMwBSpkRFRQE5lwCaNGmS67GiaNOmTbHUJVIS/rriYnJyMoZh4O7urhUXiyg4OJizZ8+yfv16lzulX1gKALeQ/fv3U7t27Vx3A2zcuJE5c+Zw5swZAgIC6N+/v205WxERq7+uuJiSkoJhGHh6emrFxUIYNWoUS5Ys4f3333e5Uf2FpQBwC9i9ezevv/46hw4dYsmSJdx3330ALF26lFGjRtlW87Je8xsxYgTPPPOMM0sWkVvY5cuXmTBhAgsXLrR9duzdu9fZZbmEw4cP07NnT/z8/FiwYAG33367s0sqMQoATpaYmEhISAiXL18G4Msvv6Rz585kZmbSqVMnEhMT8fLyon379hw6dIj4+Hjc3d1ZtGgR9evXd3L1InKrKGjFxQoVKrBt2zYnV3jrye8OoRUrVvD9999TsWJFnnjiCZo1a0blypXtztXiancNaeSZk33//fdcvnwZX19f3nzzTdvp/cjISBITEzEMgw8++IDQ0FAyMzN57rnniIyMZN68eYwfP97J1d96NH2qmIlWXLw5/fv3z/cOIcMwuHz5ssMTW7niXUMKAE4WERGBYRiMHDky1/WmtWvXAlC5cmW6d+8OgKenJ0OHDiUyMpKtW7c6pd5bnaZPlbJOKy4WL3ufE2X5JLkCgJNZp5r865r0mzZtwjAM2rVrl+tAdM899wDkmkxI8qbpU6Ws0IqLJeO7775zdglOpQDgZKmpqQD4+vraHjt8+LBthsB27drlam9dS6Asp9KboelTpazQioslz+y3+SoAOFmVKlVISEjg5MmT+Pn5AeRa5vevZwbi4uIANONXPjZu3Jhr+tRDhw4xdepUpk6dqulTxaVoxUXnsK7X8txzzzk0QVtycjITJkwgOTmZiRMnlkKFxUd3ATjZyy+/zOrVq3n00Uf58MMPSUlJoXfv3sTHx9OwYUMWLVpka3vhwgXCwsI4ePCgrb3kT9Oniiu7fgrrOnXq2FZcLCotuOSYwk7PfvHiRYKCgvD19WX79u2lUGHxUQBwso0bNzJkyBDb/PWpqakkJCRgGAbvvPMOTzzxBAATJ05kyZIlnDlzBjc3N+bNm0fTpk2dXL3ruHr1aq7pUy9evIhhGBiGoelT5ZZU1DUs8uKKI9Sdxbrfd+7cafcsS1ZWFgsXLmTcuHGUL1/+phYecwYFgFvApEmTblgO+MEHH+
Szzz6z/b1r164cP34cwzB49dVXefbZZ0u7zDJD06eKK9CKiyUrOzubJ598stiCUfPmzZk7d26x9FVaFABuETExMaxevZqMjAxat25Nt27dcqX/YcOGkZ6ezqBBg2jVqpUTKy1bNH2qiHnFxcXx5JNPkp2dfVP9VKhQgW+++cblJgJSABD5P5o+VcR8IiIiOHfunO3vY8aMwTAMxo4dW+A8CtZVXCtXrkzjxo257bbbSqPcYqUAIKam6VNF5HqFHQToyhQAbiHx8fGcP3+eq1ev3nCff3Z2NpmZmaSkpBAfH8+aNWtYuHChkyp1bYWZPlUzqYmYy4kTJwCoWbMmGRkZXL58GYvFQqVKlcrc54HmAbgFbNiwgfHjx3Py5Elnl1JmafpUEXHE8ePHWbp0Kb///jsnTpywfU4YhsEdd9xBmzZteOSRR8rEsuw6A+Bkhw4dolevXrlm+bLHx8eH1q1b33DngOSm6VNFxFEHDx7kzTffJCYmBsh/tlXr4OymTZsyfvx46tatW2o1FjcFACd76623WLBgAZ6enoSHh9O6dWt2797N1KlTCQ0NpV+/fiQlJbFlyxYWLVrE1atXeeaZZxgxYoSzS78lafpUESmsiIgIXn75ZdsYIMMwaNSoEbVr16Zq1ap4eHhw8eJF9u/fT1xcHJmZmUDOZ8qnn35KcHCwk3+DolEAcLLu3btz9OhRBg0axMiRIwE4c+YMwcHBBAYG8tNPP9naRkdH8/TTT5Odnc3ChQtp2LChs8q+ZTVq1EjTp4qIww4cOECfPn24cuUKnp6ePPPMM/Tv3z/fGUIvXbrEvHnz+Oqrr7hy5Qo+Pj7Mmzev2OdtKA0KAE7WsmVLUlNTWbx4ca430P33309SUhLbtm2jfPnytsc//PBDZs6cSe/evXnvvfecUfItTdOnikhh9O3bl127duHn58fs2bMdPqV/6NAhBg4cSEJCAk2aNGHBggUlXGnx0yBAJ7Ou7nf77bfnevyee+5h27Zt7N+/n2bNmtkef+SRR5g5c6bLTTnpDAcOHODAgQNFfr1hGAoAImVYZGQku3btwt3dnc8//7xQ1/PvvfdevvjiC/r168fu3bvZsmWLyw0MdHN2AWZXuXJlIGcSmuvdddddQM7AlOvVqFEDgNOnT5dCda7JYrEUy5+bnR1MRG5tv/zyCwAPPPBAkWbxa9y4MY888ggWi4Xly5cXd3klTmcAnOy+++4jMTGRHTt2UKtWLdvjAQEBWCwW4uLibAsCASQlJQHYBqFIbprvXEQctWvXLgzDoHfv3kXuo0ePHvz000/s3LmzGCsrHToD4GTt27fHYrEwadIkdu/ebXvcOsBv9erVuc4OWK8z+fv7l26hIiJljHXuldq1axe5jzp16gDXJhByJQoATtavXz8qV67M2bNn6dOnD++++y4AQUFBVKtWjfPnz/P4448zYcIEnn/+eWbPno1hGPz97393cuUiIq7NesdQpUqVityHdZCxK56VVQBwsgoVKvDVV19RtWrVXBNPeHh4MGbMGCwWC8ePH2fGjBls2LDBdmvbc88958SqRURcX5UqVQA4e/ZskftISEgAcMnFgBQAbgHNmjVj1apVvPXWW9x///22x3v06MEnn3xCrVq1bAPTGjduzOzZs7n77rudV7CISBlg/RzdunVrkfvYsmULcO1SgCvRPAAuIikpCQ8PDypWrOjsUkREyoTp06czYcIEAgICWLFiBe7u7oV6fVZWFqGhoRw7doyRI0cyaNCgEqq0ZOgMgIvw8/PTwV9EpBj17NkTHx8fjh07xscff1zo17///vv87//+L97e3vTq1asEKixZCgAiImJK1apV45lnnsFisTBr1izef/99hwbzpaen85///IcffvgBwzAYOnSobTyBK9ElgFI0YMCAYuvLMAxmz55dbP2JiJiRxWJh6NChrF+/HsMwqFq1Kr1796Z169bUqVOHSpUqUa5cOS5evMjhw4fZvHkzCxYsIDExEY
vFQteuXZk6daqzf40iUQAoRdfPU3+zu90wDPbu3VscZYmImFpGRgbvvPMOCxcuBK4t+Zsf6+d3nz59ePPNN/H09CzxGkuCAkApuj4AVK1a9aZHjc6cObM4yhIREXKWBf7mm2+Ijo4usF2bNm3497//TVBQUClVVjIUAEpRhw4dbPeMGobBPffcQ2hoKKGhoTc1E5WIiBSfhIQEoqOjOXz4MBcuXCA9PZ1KlSpx77330rZtW9uaLK5OAaAUWSwWtm/fzvLly1m9ejXnzp2znRGoX78+oaGhdO/enZo1azq5UhERKesUAJwkOzub33//nRUrVrBmzRqSkpJsYaBJkyb06NGDkJAQqlev7uRKRUSkLFIAuAVcvXqVLVu2sGzZMtatW8fFixcxDAPDMGjZsiU9evSgW7duLnmbiYiI3JoUAG4xWVlZbN682RYGkpOTMQwDd3d3goKCCA0NpVu3bpoUSEREbooCwC0sIyODjRs3smLFCtavX09KSgqGYeDp6Un79u354osvnF2iiIi4KAUAF3H58mUmTJjAwoULsVgsmgdARERuioezC5D8paWl8dtvv7F69Wo2bNjAlStXbBNQ+Pr6Ork6ERFxZQoAt5jU1FTWr1/PypUriYiIIC0tzXbQL1++PJ07d6Z79+507NjRyZWKiIgrUwC4BaSkpLBu3TpWrlzJ5s2bSU9Ptx30fXx8bAf94OBgvLy8nFytiIiUBQoATpKcnMyvv/7KqlWr2Lx5M5mZmbkO+p06dbId9MuVK+fkakVEpKxRAChFly5dsh30IyMjycrKynXQDw4OJiQkhE6dOuHt7e3kakVEpCzTXQClqFGjRly9ehXImRbYx8eHjh07EhISQufOnXXQFxGRUqMAUIquXw2wTp06dO7cGR8fnyL39/zzzxdXaSIiYjIKAKXo+gBwswzDYM+ePcXSl4iImI/GAJSy4spbym0iInIzdAZARETEhNycXYCIiIiUPgUAERERE1IAECnjfv/9d+rVq0e9evUYPXp0sfffpUsX6tWrR0hISL5t/vzzzzwfDwsLo169ejRu3LjY6/rxxx9tv/eyZcuKvX8RV6cAICIlJiEhgREjRvCvf/3L2aWIyF8oAIhIiRkxYgRLly51dhkikgfdBigiN2XdunX5PpednV3ga7///vviLkdEHKQzACIiIiakACAiImJCugQgYmKjR49m8eLFBAYG8vPPP3P06FFmz57Npk2bOHPmDD4+PtStW5devXrx2GOP4eZ243eGLl26cOLECWrXrs3KlStz9Wt14sQJ6tWrB8Bjjz3GBx98AOTcBRAVFYWXlxe7d+/Os8Y///yTH3/8kW3btnHy5EmSk5Px9vamWrVqtGzZkr59+9KkSZPi3jUiZZ4CgIgAsGrVKkaNGsWVK1dsj6WnpxMVFUVUVBS//PILX3/9NV5eXqVSz9WrV/mf//kffvjhhxumvs7MzOTy5cscOXKERYsWMXz4cIYMGVIqdYmUFQoAIsKpU6cYMWIE7u7uDB48mPbt2+Pl5cX27dv5+uuvSU5OJjIyklmzZjl0oH3ppZcIDw/njTfeIC4uDn9/f7799lsAbrvtNodq+vzzz5kzZw4AtWvXJiwsjHvuuYdy5cpx4sQJli5dyoYNGwD45JNP6NKlC/fdd18R94CI+SgAiAgXL16kfPnyzJ07l/r169seb9myJW3atKFfv35YLBYWL17sUAC44447uOOOO/D19QXAy8srV7/2JCcnM23aNADuvPNO5s+fT+XKlW3Pt2jRgkceeYQPP/yQGTNmkJ2dzerVqxUARApBgwBFBIB//OMfeR6kmzVrZrt+f/jwYTIyMkq8lgMHDnDnnXfi4+NDeHh4roP/9R599FHbz2fOnCnxukTKEp0BEBEA2rdvn+9zAQEBtul8U1JSSnwcQPPmzVm+fDlQ8FwC1apVs/1cGsFEpCxRABARIOdUe37Kly9v+/nq1aulUY6N9c6DpKQkjh07xrFjxz
h48CB79uxh+/bttnZa2VykcBQARAQAHx+ffJ8zDMP2c2keaHft2sV3331HZGQk58+fv+H5r8b7fQAAAf5JREFUvG5LFBHHKACIyC3p888/Z8qUKbkeq1atGvfccw/16tWjadOmNGjQgNDQUCdVKOLaFABE5JazYcMG28Hf39+fl19+meDgYKpXr56r3fHjx51RnkiZoAAgIrecuXPn2n7+5JNPaN26dZ7tTp06VVoliZQ5uoAmIiXm+rEDhREfH2/7uWHDhvm2W7Jkie3nrKysIm1LxKwUAESkxFhvF0xJSSnU6/z8/Gw/b9y4Mc82CxcuZOHChba/6zZAkcLRJQARKTH+/v4AXLhwga+//pp27drh4+Njd8a+7t27s2PHDgBef/11Dh48SMuWLfHy8iI+Pp4lS5awZcuWXK9JTk4umV9CpIxSABCREtOtWzd+/PFHACZNmsSkSZNo3bq1bY7//PTv35/Nmzezfv16UlJS+Oyzz25o4+bmxqBBg4iKiiI2NpYDBw6UyO8gUlbpEoCIlJjOnTvz4YcfUr9+fXx8fChfvjzp6el2X+fh4cGXX37J+PHjadWqFRUrVsTd3Z0KFSpQt25d/vGPf7B48WJGjhzJ/fffD8DZs2dzTQwkIgUzLJo+S0RExHR0BkBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAERERExIAUBERMSEFABERERMSAFARETEhBQARERETEgBQERExIQUAEREREzo/wNTH72CvD082wAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 576x396 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_train.groupby('Initial')['Survived'].mean().plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The plot shows that the female titles **Miss** and **Mrs** have **the highest survival rates**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Filling Null values with statistics**\n",
"- The Null values in the test dataset must be filled using statistics computed from the training dataset, because the test set is treated as unseen data.\n",
"- Boolean-array indexing is a very convenient way to work with a pandas DataFrame.\n",
" - Replacing values with ***```loc + boolean + column```*** is a very common pattern, so it is worth learning well."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.119221Z",
"start_time": "2020-06-09T06:37:38.099274Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>FamilySize</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Initial</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Master</th>\n",
" <td>414.975000</td>\n",
" <td>0.575000</td>\n",
" <td>2.625000</td>\n",
" <td>4.574167</td>\n",
" <td>2.300000</td>\n",
" <td>1.375000</td>\n",
" <td>3.340710</td>\n",
" <td>4.675000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Miss</th>\n",
" <td>411.741935</td>\n",
" <td>0.704301</td>\n",
" <td>2.284946</td>\n",
" <td>21.860000</td>\n",
" <td>0.698925</td>\n",
" <td>0.537634</td>\n",
" <td>3.123713</td>\n",
" <td>2.236559</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mr</th>\n",
" <td>455.880907</td>\n",
" <td>0.162571</td>\n",
" <td>2.381853</td>\n",
" <td>32.739609</td>\n",
" <td>0.293006</td>\n",
" <td>0.151229</td>\n",
" <td>2.651507</td>\n",
" <td>1.444234</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mrs</th>\n",
" <td>456.393701</td>\n",
" <td>0.795276</td>\n",
" <td>1.984252</td>\n",
" <td>35.981818</td>\n",
" <td>0.692913</td>\n",
" <td>0.818898</td>\n",
" <td>3.443751</td>\n",
" <td>2.511811</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Other</th>\n",
" <td>564.444444</td>\n",
" <td>0.111111</td>\n",
" <td>1.666667</td>\n",
" <td>45.888889</td>\n",
" <td>0.111111</td>\n",
" <td>0.111111</td>\n",
" <td>2.641605</td>\n",
" <td>1.222222</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp Parch \\\n",
"Initial \n",
"Master 414.975000 0.575000 2.625000 4.574167 2.300000 1.375000 \n",
"Miss 411.741935 0.704301 2.284946 21.860000 0.698925 0.537634 \n",
"Mr 455.880907 0.162571 2.381853 32.739609 0.293006 0.151229 \n",
"Mrs 456.393701 0.795276 1.984252 35.981818 0.692913 0.818898 \n",
"Other 564.444444 0.111111 1.666667 45.888889 0.111111 0.111111 \n",
"\n",
" Fare FamilySize \n",
"Initial \n",
"Master 3.340710 4.675000 \n",
"Miss 3.123713 2.236559 \n",
"Mr 2.651507 1.444234 \n",
"Mrs 3.443751 2.511811 \n",
"Other 2.641605 1.222222 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the mean of every numeric column per Initial group; the Age means are used below to fill the Null values in Age\n",
"df_train.groupby('Initial').mean(numeric_only = True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.179071Z",
"start_time": "2020-06-09T06:37:38.122213Z"
}
},
"outputs": [],
"source": [
"# Fill the Null values in Age with the rounded mean Age of each Initial group (computed on the training set)\n",
"df_train.loc[(df_train.Age.isnull()) & (df_train.Initial == 'Mr'), 'Age'] = 33\n",
"df_train.loc[(df_train.Age.isnull()) & (df_train.Initial == 'Mrs'), 'Age'] = 36\n",
"df_train.loc[(df_train.Age.isnull()) & (df_train.Initial == 'Master'), 'Age'] = 5\n",
"df_train.loc[(df_train.Age.isnull()) & (df_train.Initial == 'Miss'), 'Age'] = 22\n",
"df_train.loc[(df_train.Age.isnull()) & (df_train.Initial == 'Other'), 'Age'] = 46\n",
"\n",
"df_test.loc[(df_test.Age.isnull()) & (df_test.Initial == 'Mr'), 'Age'] = 33\n",
"df_test.loc[(df_test.Age.isnull()) & (df_test.Initial == 'Mrs'), 'Age'] = 36\n",
"df_test.loc[(df_test.Age.isnull()) & (df_test.Initial == 'Master'), 'Age'] = 5\n",
"df_test.loc[(df_test.Age.isnull()) & (df_test.Initial == 'Miss'), 'Age'] = 22\n",
"df_test.loc[(df_test.Age.isnull()) & (df_test.Initial == 'Other'), 'Age'] = 46"
]
},
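{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The hard-coded ages above could also be derived programmatically. The cell below is a minimal sketch on a small made-up DataFrame, not the Titanic data; `toy_train`, `toy_test`, and `age_by_initial` are hypothetical names introduced only for this illustration.\n",
"- It computes the mean Age per Initial on the training data only, then uses that mapping to fill the missing ages in both datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy data for illustration only (assumption: same column names as the tutorial)\n",
"toy_train = pd.DataFrame({'Initial': ['Mr', 'Mr', 'Miss', 'Miss', 'Mr'],\n",
"                          'Age': [30.0, 36.0, 20.0, np.nan, np.nan]})\n",
"toy_test = pd.DataFrame({'Initial': ['Miss', 'Mr'],\n",
"                         'Age': [np.nan, 40.0]})\n",
"\n",
"# The statistics come from the training set only\n",
"age_by_initial = toy_train.groupby('Initial')['Age'].mean().round()\n",
"\n",
"# Fill each missing Age with the training mean of the row's Initial group\n",
"for toy_df in (toy_train, toy_test):\n",
"    toy_df['Age'] = toy_df['Age'].fillna(toy_df['Initial'].map(age_by_initial))\n",
"\n",
"print(toy_train)\n",
"print(toy_test)"
]
},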
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. Filling Null values in Embarked**\n",
"- Embarked has 2 Null values.\n",
"- Port \"S\" had the most passengers, so the **Null values are simply replaced with \"S\"**."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.190032Z",
"start_time": "2020-06-09T06:37:38.182055Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Embarked has 2 Null values\n"
]
}
],
"source": [
"print('Embarked has', sum(df_train['Embarked'].isnull()), 'Null values')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.201035Z",
"start_time": "2020-06-09T06:37:38.194020Z"
}
},
"outputs": [],
"source": [
"df_train['Embarked'] = df_train['Embarked'].fillna('S')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting the continuous variable Age into a categorical variable\n",
"- Converting a continuous variable into a categorical one can cause **information loss**.\n",
"- It is done here anyway, because this tutorial aims to introduce a variety of techniques.\n",
"- There are several ways to do the conversion.\n",
" - Method 1: assign the categories directly with ***```loc```***, the DataFrame indexing accessor\n",
" - Method 2: pass a function to ***```apply```***\n",
" - Method 2 (***```apply```***) is noticeably more concise."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.284777Z",
"start_time": "2020-06-09T06:37:38.204991Z"
}
},
"outputs": [],
"source": [
"# 1. Using loc\n",
"df_train['Age_cat'] = 0\n",
"df_train.loc[df_train['Age'] < 10, 'Age_cat'] = 0\n",
"df_train.loc[(10 <= df_train['Age']) & (df_train['Age'] < 20), 'Age_cat'] = 1\n",
"df_train.loc[(20 <= df_train['Age']) & (df_train['Age'] < 30), 'Age_cat'] = 2\n",
"df_train.loc[(30 <= df_train['Age']) & (df_train['Age'] < 40), 'Age_cat'] = 3\n",
"df_train.loc[(40 <= df_train['Age']) & (df_train['Age'] < 50), 'Age_cat'] = 4\n",
"df_train.loc[(50 <= df_train['Age']) & (df_train['Age'] < 60), 'Age_cat'] = 5\n",
"df_train.loc[(60 <= df_train['Age']) & (df_train['Age'] < 70), 'Age_cat'] = 6\n",
"df_train.loc[df_train['Age'] >= 70, 'Age_cat'] = 7\n",
"\n",
"df_test['Age_cat'] = 0\n",
"df_test.loc[df_test['Age'] < 10, 'Age_cat'] = 0\n",
"df_test.loc[(10 <= df_test['Age']) & (df_test['Age'] < 20), 'Age_cat'] = 1\n",
"df_test.loc[(20 <= df_test['Age']) & (df_test['Age'] < 30), 'Age_cat'] = 2\n",
"df_test.loc[(30 <= df_test['Age']) & (df_test['Age'] < 40), 'Age_cat'] = 3\n",
"df_test.loc[(40 <= df_test['Age']) & (df_test['Age'] < 50), 'Age_cat'] = 4\n",
"df_test.loc[(50 <= df_test['Age']) & (df_test['Age'] < 60), 'Age_cat'] = 5\n",
"df_test.loc[(60 <= df_test['Age']) & (df_test['Age'] < 70), 'Age_cat'] = 6\n",
"df_test.loc[df_test['Age'] >= 70, 'Age_cat'] = 7"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.302786Z",
"start_time": "2020-06-09T06:37:38.287769Z"
}
},
"outputs": [],
"source": [
"# 2. Using apply --> much more concise\n",
"def category_age(x):\n",
" if x < 10:\n",
" return 0\n",
" elif x < 20:\n",
" return 1\n",
" elif x < 30:\n",
" return 2\n",
" elif x < 40:\n",
" return 3\n",
" elif x < 50:\n",
" return 4\n",
" elif x < 60:\n",
" return 5\n",
" elif x < 70:\n",
" return 6\n",
" else:\n",
" return 7\n",
" \n",
"df_train['Age_cat2'] = df_train['Age'].apply(category_age)\n",
"df_test['Age_cat2'] = df_test['Age'].apply(category_age)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.313701Z",
"start_time": "2020-06-09T06:37:38.304723Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Methods 1 and 2 should produce identical categories, so this should print True -> True\n"
]
}
],
"source": [
"# Check that method 1 (loc) and method 2 (apply) give the same result\n",
"print('Methods 1 and 2 should produce identical categories, so this should print True ->',\n",
"      (df_train['Age_cat'] == df_train['Age_cat2']).all())"
]
},
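{
"cell_type": "markdown",
"metadata": {},
"source": [
"- As a third option, pandas offers ```pd.cut``` for binning. The cell below is a minimal sketch on made-up ages; `toy_age`, `bins`, and `toy_age_cat` are hypothetical names that are not part of the original notebook.\n",
"- ```right = False``` makes each bin include its left edge and exclude its right edge, which matches the ```<``` comparisons used above. Note that a value below the first bin edge would become NaN here, whereas the functions above assign it to category 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up ages chosen to sit on and around the bin edges\n",
"toy_age = pd.Series([0.5, 9.9, 10, 29, 30, 69, 70, 80])\n",
"\n",
"# Bins [0, 10), [10, 20), ..., [60, 70), [70, inf)\n",
"bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]\n",
"\n",
"# labels = False returns the integer index of the bin instead of an interval label\n",
"toy_age_cat = pd.cut(toy_age, bins = bins, right = False, labels = False)\n",
"print(toy_age_cat.tolist())"
]
},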
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.329657Z",
"start_time": "2020-06-09T06:37:38.316694Z"
}
},
"outputs": [],
"source": [
"# Drop the duplicate Age_cat2 column and the original Age column\n",
"df_train.drop(['Age', 'Age_cat2'], axis = 1, inplace = True)\n",
"df_test.drop(['Age', 'Age_cat2'], axis = 1, inplace = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting the string variables Initial, Embarked, and Sex into numeric variables\n",
"- The ```map()``` method makes this conversion simple."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Encoding the Initial variable as numbers**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.344618Z",
"start_time": "2020-06-09T06:37:38.331652Z"
}
},
"outputs": [],
"source": [
"df_train['Initial'] = df_train['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})\n",
"df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Encoding the Embarked variable as numbers**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.355588Z",
"start_time": "2020-06-09T06:37:38.347609Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['S', 'C', 'Q'], dtype=object)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['Embarked'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.368555Z",
"start_time": "2020-06-09T06:37:38.358580Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"S 646\n",
"C 168\n",
"Q 77\n",
"Name: Embarked, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['Embarked'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.381518Z",
"start_time": "2020-06-09T06:37:38.371547Z"
}
},
"outputs": [],
"source": [
"df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})\n",
"df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.393486Z",
"start_time": "2020-06-09T06:37:38.384511Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# any() returns True if at least one element is True,\n",
"# so this returns True if at least one Null value remains\n",
"df_train['Embarked'].isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Encoding the Sex variable as numbers**"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:38.405455Z",
"start_time": "2020-06-09T06:37:38.395481Z"
}
},
"outputs": [],
"source": [
"df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})\n",
"df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})"
]
},
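{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Visualizing the correlation between features**\n",
"- A Pearson correlation coefficient **close to 1 indicates a positive correlation**, and one **close to -1 indicates a negative correlation**.\n",
"- A **Pearson correlation coefficient of 0** does not mean that there is no relationship at all; it means that there is **no linear relationship**.\n",
" - In other words, **a nonlinear relationship may still exist!**"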
]
},
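{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The point about nonlinear relationships can be checked with a tiny example. The cell below is an illustrative sketch on made-up values (`x` and `y` are hypothetical and unrelated to the Titanic data): `y` is fully determined by `x`, yet the Pearson coefficient is numerically zero because the relationship is symmetric rather than linear."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"x = np.arange(-5, 6)   # values symmetric around 0\n",
"y = x ** 2             # depends entirely on x, but not linearly\n",
"\n",
"# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is corr(x, y)\n",
"print(np.corrcoef(x, y)[0, 1])"
]
},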
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.011715Z",
"start_time": "2020-06-09T06:37:38.408447Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1008x864 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"heatmap_data = df_train[['Survived', 'Pclass', 'Sex', \n",
" 'Fare', 'Embarked', 'FamilySize',\n",
" 'Initial', 'Age_cat']]\n",
"\n",
"colormap = plt.cm.RdBu\n",
"plt.figure(figsize = (14, 12))\n",
"plt.title('Pearson Correlation of Features', y = 1.05, size = 15)\n",
"sns.heatmap(heatmap_data.astype(float).corr(), linewidths = 0.1, vmax = 1.0, square = True, \n",
" cmap = colormap, linecolor = 'white', annot = True, annot_kws = {'size': 16})\n",
"\n",
"del heatmap_data"
]
},
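{
"cell_type": "markdown",
"metadata": {},
"source": [
"- As seen in the earlier EDA, **Sex and Pclass are correlated with Survived to some degree**.\n",
"- Apart from that, **no pair of features shows a strong correlation with each other**.\n",
" - Pairwise correlations cannot rule out every form of multicollinearity, but they suggest that **multicollinearity is not a serious concern here**.\n",
" - In other words, **none of these features appears redundant** for building the model."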
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One-Hot Encoding\n",
"- The integer-coded categorical features could be fed to the model as they are, but the codes imply an ordering that does not really exist, so **one-hot encoding** is applied to help model performance.\n",
" - Put simply, one-hot encoding means **creating dummy variables**.\n",
" - **Pandas**' ```get_dummies()``` makes this easy.\n",
" - The ```prefix``` option **adds a common prefix to the dummy column names**.\n",
" - Setting ```drop_first = True``` **drops the first dummy column automatically**. The cells below do not set it, so all k columns are kept.\n",
" - This **avoids the dummy variable trap --> (k-1) dummy variables are created in total!!**"
]
},
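{
"cell_type": "markdown",
"metadata": {},
"source": [
"- To see the effect of ```drop_first```, the cell below is a small sketch on a made-up DataFrame (`toy` is a hypothetical name, not part of the original notebook). It prints the dummy column names with and without the option, so the k versus (k-1) difference is visible."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy column with k = 3 categories\n",
"toy = pd.DataFrame({'Embarked': ['C', 'Q', 'S', 'S']})\n",
"\n",
"# Default: one dummy column per category (k columns)\n",
"all_dummies = pd.get_dummies(toy, columns = ['Embarked'], prefix = 'Embarked')\n",
"print(all_dummies.columns.tolist())\n",
"\n",
"# drop_first = True: the first category becomes the implicit baseline (k - 1 columns)\n",
"reduced_dummies = pd.get_dummies(toy, columns = ['Embarked'], prefix = 'Embarked', drop_first = True)\n",
"print(reduced_dummies.columns.tolist())"
]
},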
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**One-hot encoding the Initial variable**"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.036646Z",
"start_time": "2020-06-09T06:37:39.015703Z"
}
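},
"outputs": [],
"source": [
"# dtype = int keeps the dummy columns as 0/1 integers, as shown in the output below\n",
"df_train = pd.get_dummies(df_train, columns = ['Initial'], prefix = 'Initial', dtype = int)\n",
"df_test = pd.get_dummies(df_test, columns = ['Initial'], prefix = 'Initial', dtype = int)"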
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.066568Z",
"start_time": "2020-06-09T06:37:39.040636Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>FamilySize</th>\n",
" <th>Age_cat</th>\n",
" <th>Initial_0</th>\n",
" <th>Initial_1</th>\n",
" <th>Initial_2</th>\n",
" <th>Initial_3</th>\n",
" <th>Initial_4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>1.981001</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>4.266662</td>\n",
" <td>C85</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>2.070022</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>3.972177</td>\n",
" <td>C123</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>2.085672</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex SibSp Parch \\\n",
"0 Braund, Mr. Owen Harris 1 1 0 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 0 \n",
"2 Heikkinen, Miss. Laina 0 0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 0 \n",
"4 Allen, Mr. William Henry 1 0 0 \n",
"\n",
" Ticket Fare Cabin Embarked FamilySize Age_cat Initial_0 \\\n",
"0 A/5 21171 1.981001 NaN 2 2 2 0 \n",
"1 PC 17599 4.266662 C85 0 2 3 0 \n",
"2 STON/O2. 3101282 2.070022 NaN 2 1 2 0 \n",
"3 113803 3.972177 C123 2 2 3 0 \n",
"4 373450 2.085672 NaN 2 1 3 0 \n",
"\n",
" Initial_1 Initial_2 Initial_3 Initial_4 \n",
"0 0 1 0 0 \n",
"1 0 0 1 0 \n",
"2 1 0 0 0 \n",
"3 0 0 1 0 \n",
"4 0 1 0 0 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**One-hot encoding the Embarked variable**"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.093495Z",
"start_time": "2020-06-09T06:37:39.069559Z"
}
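},
"outputs": [],
"source": [
"# dtype = int keeps the dummy columns as 0/1 integers, as shown in the output below\n",
"df_train = pd.get_dummies(df_train, columns = ['Embarked'], prefix = 'Embarked', dtype = int)\n",
"df_test = pd.get_dummies(df_test, columns = ['Embarked'], prefix = 'Embarked', dtype = int)"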
]
},
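{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Because ```get_dummies()``` is applied to the training and test sets separately, a category that is missing from one set would produce a different set of columns. The cell below is a hedged sketch of one way to guard against that, using made-up frames (`toy_train`, `toy_test`, `train_d`, and `test_d` are hypothetical names): the test columns are reindexed to match the training columns, and any missing dummy column is filled with 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy_train = pd.DataFrame({'Initial': [0, 1, 2]})\n",
"toy_test = pd.DataFrame({'Initial': [0, 1]})   # category 2 never appears here\n",
"\n",
"train_d = pd.get_dummies(toy_train, columns = ['Initial'], prefix = 'Initial')\n",
"test_d = pd.get_dummies(toy_test, columns = ['Initial'], prefix = 'Initial')\n",
"\n",
"# Align the test columns to the training columns; missing dummies become all-zero columns\n",
"test_d = test_d.reindex(columns = train_d.columns, fill_value = 0)\n",
"print(test_d.columns.tolist())"
]
},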
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.123415Z",
"start_time": "2020-06-09T06:37:39.096486Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>FamilySize</th>\n",
" <th>Age_cat</th>\n",
" <th>Initial_0</th>\n",
" <th>Initial_1</th>\n",
" <th>Initial_2</th>\n",
" <th>Initial_3</th>\n",
" <th>Initial_4</th>\n",
" <th>Embarked_0</th>\n",
" <th>Embarked_1</th>\n",
" <th>Embarked_2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>1.981001</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>4.266662</td>\n",
" <td>C85</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>2.070022</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>3.972177</td>\n",
" <td>C123</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>2.085672</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex SibSp Parch \\\n",
"0 Braund, Mr. Owen Harris 1 1 0 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 0 \n",
"2 Heikkinen, Miss. Laina 0 0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 0 \n",
"4 Allen, Mr. William Henry 1 0 0 \n",
"\n",
" Ticket Fare Cabin FamilySize Age_cat Initial_0 \\\n",
"0 A/5 21171 1.981001 NaN 2 2 0 \n",
"1 PC 17599 4.266662 C85 2 3 0 \n",
"2 STON/O2. 3101282 2.070022 NaN 1 2 0 \n",
"3 113803 3.972177 C123 2 3 0 \n",
"4 373450 2.085672 NaN 1 3 0 \n",
"\n",
" Initial_1 Initial_2 Initial_3 Initial_4 Embarked_0 Embarked_1 \\\n",
"0 0 1 0 0 0 0 \n",
"1 0 0 1 0 1 0 \n",
"2 1 0 0 0 0 0 \n",
"3 0 0 1 0 0 0 \n",
"4 0 1 0 0 0 0 \n",
"\n",
" Embarked_2 \n",
"0 1 \n",
"1 0 \n",
"2 1 \n",
"3 1 \n",
"4 1 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The **one-hot encoding** above was done easily with **pandas**' ```get_dummies()```.\n",
"- Alternatively, scikit-learn's **```LabelEncoder``` + ```OneHotEncoder```** can also perform one-hot encoding.\n",
"- This tutorial simply sticks with **pandas**' ```get_dummies()```."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dropping unneeded columns"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.139372Z",
"start_time": "2020-06-09T06:37:39.127404Z"
}
},
"outputs": [],
"source": [
"df_train.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis = 1, inplace = True)\n",
"df_test.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis = 1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.164305Z",
"start_time": "2020-06-09T06:37:39.141366Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Fare</th>\n",
" <th>FamilySize</th>\n",
" <th>Age_cat</th>\n",
" <th>Initial_0</th>\n",
" <th>Initial_1</th>\n",
" <th>Initial_2</th>\n",
" <th>Initial_3</th>\n",
" <th>Initial_4</th>\n",
" <th>Embarked_0</th>\n",
" <th>Embarked_1</th>\n",
" <th>Embarked_2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1.981001</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>4.266662</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>2.070022</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3.972177</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2.085672</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Fare FamilySize Age_cat Initial_0 Initial_1 \\\n",
"0 0 3 1 1.981001 2 2 0 0 \n",
"1 1 1 0 4.266662 2 3 0 0 \n",
"2 1 3 0 2.070022 1 2 0 1 \n",
"3 1 1 0 3.972177 2 3 0 0 \n",
"4 0 3 1 2.085672 1 3 0 0 \n",
"\n",
" Initial_2 Initial_3 Initial_4 Embarked_0 Embarked_1 Embarked_2 \n",
"0 1 0 0 0 0 1 \n",
"1 0 1 0 1 0 0 \n",
"2 0 0 0 0 0 1 \n",
"3 0 1 0 0 0 1 \n",
"4 1 0 0 0 0 1 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.189238Z",
"start_time": "2020-06-09T06:37:39.166299Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Fare</th>\n",
" <th>FamilySize</th>\n",
" <th>Age_cat</th>\n",
" <th>Initial_0</th>\n",
" <th>Initial_1</th>\n",
" <th>Initial_2</th>\n",
" <th>Initial_3</th>\n",
" <th>Initial_4</th>\n",
" <th>Embarked_0</th>\n",
" <th>Embarked_1</th>\n",
" <th>Embarked_2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2.057860</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>1.945910</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2.270836</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2.159003</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>2.508582</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pclass Sex Fare FamilySize Age_cat Initial_0 Initial_1 \\\n",
"0 3 1 2.057860 1 3 0 0 \n",
"1 3 0 1.945910 2 4 0 0 \n",
"2 2 1 2.270836 1 6 0 0 \n",
"3 3 1 2.159003 1 2 0 0 \n",
"4 3 0 2.508582 3 2 0 0 \n",
"\n",
" Initial_2 Initial_3 Initial_4 Embarked_0 Embarked_1 Embarked_2 \n",
"0 1 0 0 0 1 0 \n",
"1 0 1 0 0 0 1 \n",
"2 1 0 0 0 1 0 \n",
"3 1 0 0 0 0 1 \n",
"4 0 1 0 0 0 1 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.head()"
]
},
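{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building a machine learning model and making predictions\n",
"- The goal of this tutorial is to build a model that **predicts a target variable consisting of 0s and 1s**.\n",
" - In other words, this is a **binary classification problem**.\n",
"- First, we **fit the model** on the **training dataset**, using **every column except Survived as input**.\n",
"- Then we give the model the **test dataset**, which it has never seen, **as input** and **predict whether each passenger in the test dataset survived**.\n",
"- This tutorial uses a **random forest** model."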
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.460753Z",
"start_time": "2020-06-09T06:37:39.193229Z"
}
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn import metrics\n",
"from sklearn.model_selection import train_test_split"
]
},
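{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting the dataset\n",
"1. ***train dataset*** used for **model training**\n",
"2. ***valid dataset*** used for **model evaluation**\n",
" - To build a good model, we need to set aside a separate ***valid dataset*** and evaluate the model on it.\n",
" - It is like a national football team that does not go straight from **team training (train)** to the **World Cup (test)**: after **team training (train)** it plays **warm-up matches (valid)** to **check how well prepared (trained) it is**, and only then goes to the **World Cup (test)**.\n",
"3. ***test dataset*** used for **model prediction**"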
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- First, we **split** the **training dataset (train dataset)** into its input features and the **target label (Survived)**."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.469733Z",
"start_time": "2020-06-09T06:37:39.462752Z"
}
},
"outputs": [],
"source": [
"X_train = df_train.drop('Survived', axis = 1).values\n",
"target_label = df_train['Survived'].values\n",
"X_test = df_test.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- ```train_test_split()``` makes it easy to split the dataset."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.478709Z",
"start_time": "2020-06-09T06:37:39.471729Z"
}
},
"outputs": [],
"source": [
"X_tr, X_vld, y_tr, y_vld = train_test_split(X_train, target_label, test_size = 0.3, random_state = 2020)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model training and prediction\n",
"- First we create the model object and train it with the ```fit()``` method.\n",
"- Then we pass the **valid dataset** as input to get the predicted values."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.677180Z",
"start_time": "2020-06-09T06:37:39.481702Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 268 passengers, survival was predicted with 81.34% accuracy\n"
]
}
],
"source": [
"model = RandomForestClassifier()\n",
"model.fit(X_tr, y_tr)\n",
"prediction = model.predict(X_vld)\n",
"\n",
"# accuracy_score takes (y_true, y_pred), so the true labels come first\n",
"print('Out of {} passengers, survival was predicted with {:.2f}% accuracy'.format(y_vld.shape[0], 100 * metrics.accuracy_score(y_vld, prediction)))"
]
},
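{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Even without any parameter tuning, we got **about 81% accuracy**! (The exact figure can vary between runs, because no ```random_state``` is set on the model.)"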
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking feature importance\n",
"- Here we check which features the trained model relied on most."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.695166Z",
"start_time": "2020-06-09T06:37:39.679173Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"from pandas import Series\n",
"\n",
"feature_importance = model.feature_importances_\n",
"Series_feat_imp = Series(feature_importance, index = df_test.columns)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:39.704109Z",
"start_time": "2020-06-09T06:37:39.697125Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Pclass', 'Sex', 'Fare', 'FamilySize', 'Age_cat', 'Initial_0',\n",
" 'Initial_1', 'Initial_2', 'Initial_3', 'Initial_4', 'Embarked_0',\n",
" 'Embarked_1', 'Embarked_2'],\n",
" dtype='object')"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.columns"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:40.007331Z",
"start_time": "2020-06-09T06:37:39.706103Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x576 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize = (8, 8))\n",
"Series_feat_imp.sort_values(ascending = True).plot.barh()\n",
"plt.xlabel('Feature importance')\n",
"plt.ylabel('Feature')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The results show that the most important feature is **Fare**.\n",
"- It is followed by **Sex, Initial_2, and Age_cat**.\n",
"- If you want a more accurate model, **feature selection** is one thing to try."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predicting on the test dataset\n",
"- Now we pass the test dataset, which the model did not see during training, as input and predict whether each passenger survived.\n",
"- We read in **gender_submission.csv**, the file provided by Kaggle, to prepare the submission."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:40.021259Z",
"start_time": "2020-06-09T06:37:40.009291Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>893</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>894</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>895</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>896</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived\n",
"0 892 0\n",
"1 893 1\n",
"2 894 0\n",
"3 895 0\n",
"4 896 1"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission = pd.read_csv('../titanic/gender_submission.csv')\n",
"submission.head()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:40.046193Z",
"start_time": "2020-06-09T06:37:40.024255Z"
}
},
"outputs": [],
"source": [
"prediction = model.predict(X_test)\n",
"submission['Survived'] = prediction"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-09T06:37:40.061153Z",
"start_time": "2020-06-09T06:37:40.050199Z"
}
},
"outputs": [],
"source": [
"submission.to_csv('./my_first_submission.csv', index = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**This concludes the Titanic tutorial!**\n",
"\n",
"**Although this tutorial built its predictive model with only basic data preprocessing and a simple model, I plan to keep studying and build machine learning models with more original ideas.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}