@fhiyo
Created January 20, 2018 22:34
Kaggle Titanic tutorial tried with a random forest (with a little preprocessing)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Titanic tutorial problem with a random forest\n",
"\n",
"18/01/21\n",
"This notebook does a little data preprocessing: missing Age values are imputed based on each passenger's title."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Titanic Top 4% with ensemble modeling | Kaggle](https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling)\n",
"- [Titanic data analysis: Feature Engineering](http://rindalog.blogspot.jp/2016/10/feature-engineering.html)\n",
"- [Exploring Survival on the Titanic | Kaggle](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic/code)\n",
"- [Data analysis and impressions on the way to the top 10% of Kaggle's Titanic problem - mirandora.com](http://www.mirandora.com/?p=1804) (this one evaluates accuracy with cross-validation)\n",
"\n",
"This notebook was written with reference to the articles above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import re\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"sns.set(style='white', context='notebook', palette='deep')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load training and test data\n",
"train_df = pd.read_csv(\"../input/train.csv\", header=0)\n",
"test_df = pd.read_csv(\"../input/test.csv\", header=0)"
]
},
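{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocessing comes next, but its direction arguably cannot be settled until the algorithm has been chosen.  \n",
"For example, a decision tree can in principle use categorical data as is,\n",
"whereas methods that cannot handle labeled data explicitly need a step such as conversion to dummy variables."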
]
},
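{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the dummy-variable conversion mentioned above, using `pd.get_dummies`. The `toy` frame and its values are made up for illustration; only the `Embarked` column name is borrowed from the Titanic data, and this is not necessarily how the rest of the notebook encodes its features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame; only the column name mirrors the Titanic data.\n",
"toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})\n",
"\n",
"# One indicator column per category, prefixed with the original column name.\n",
"toy_dummies = pd.get_dummies(toy, columns=['Embarked'])\n",
"print(sorted(toy_dummies.columns))"
]
},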
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Convert \"Sex\" to a dummy variable (female = 0, male = 1)\n",
"train_df[\"Sex\"] = train_df[\"Sex\"].map({\"female\": 0, \"male\": 1}).astype(int)\n",
"test_df[\"Sex\"] = test_df[\"Sex\"].map({\"female\": 0, \"male\": 1}).astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count the missing values in each column\n",
"train_df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Some of the reference articles detect and remove outliers, but that step is skipped here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Name, Ticket, and Cabin are unstructured data. How should they be handled?"
]
},
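{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic/notebook\n",
"shows that a passenger's name reveals quite a lot.  \n",
"For example, the title Mrs denotes a married woman, so the passenger is female and likely to be older than an unmarried one.\n",
"So let's extract the title (apparently called the Title) from each name and make it available as a feature."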
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Palsson, Master. Gosta Leonard\n",
"Rice, Master. Eugene\n",
"Uruchurtu, Don. Manuel E\n",
"Panula, Master. Juha Niilo\n",
"Goodwin, Master. William Frederick\n",
"Skoog, Master. Harald\n",
"Moubarek, Master. Gerios\n",
"Caldwell, Master. Alden Gates\n",
"Nicola-Yarred, Master. Elias\n",
"Byles, Rev. Thomas Roussel Davids\n",
"Bateman, Rev. Robert James\n",
"Sage, Master. Thomas Henry\n",
"Panula, Master. Eino Viljami\n",
"Goldsmith, Master. Frank John William \"Frankie\"\n",
"Rice, Master. Arthur\n",
"Lefebre, Master. Henry Forbes\n",
"Asplund, Master. Clarence Gustaf Hugo\n",
"Becker, Master. Richard F\n",
"Navratil, Master. Michel M\n",
"Minahan, Dr. William Edward\n",
"Carter, Rev. Ernest Courtenay\n",
"Asplund, Master. Edvin Rojj Felix\n",
"Rice, Master. Eric\n",
"Allison, Master. Hudson Trevor\n",
"Moraweck, Dr. Ernest\n",
"Navratil, Master. Edmond Roger\n",
"Coutts, Master. William Loch \"William\"\n",
"Aubart, Mme. Leontine Pauline\n",
"Goodwin, Master. Sidney Leonard\n",
"Pain, Dr. Alfred\n",
"Richards, Master. William Rowe\n",
"Dodge, Master. Washington\n",
"Peuchen, Major. Arthur Godfrey\n",
"Goodwin, Master. Harold Victor\n",
"Coutts, Master. Eden Leslie \"Neville\"\n",
"Butt, Major. Archibald Willingham\n",
"Davies, Master. John Morgan Jr\n",
"Duff Gordon, Sir. Cosmo Edmund (\"Mr Morgan\")\n",
"Kirkland, Rev. Charles Leonard\n",
"Stahelin-Maeglin, Dr. Max\n",
"Sagesser, Mlle. Emma\n",
"Simonius-Blumer, Col. Oberst Alfons\n",
"Frauenthal, Dr. Henry William\n",
"Weir, Col. John\n",
"Moubarek, Master. Halim Gonios (\"William George\")\n",
"Crosby, Capt. Edward Gifford\n",
"Moor, Master. Meier\n",
"Hamalainen, Master. Viljo\n",
"Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)\n",
"Brewe, Dr. Arthur Jackson\n",
"Rice, Master. George Hugh\n",
"Dean, Master. Bertram Vere\n",
"Leader, Dr. Alice (Farnham)\n",
"Carter, Master. William Thornton II\n",
"Thomas, Master. Assad Alexander\n",
"Skoog, Master. Karl Thorsten\n",
"Reuchlin, Jonkheer. John George\n",
"Panula, Master. Urho Abraham\n",
"Mallet, Master. Andre\n",
"Richards, Master. George Sibley\n",
"Harper, Rev. John\n",
"Andersson, Master. Sigvard Harald Elias\n",
"Johnson, Master. Harold Theodor\n",
"Montvila, Rev. Juozas\n"
]
}
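],
"source": [
"# List the names that lack the common titles such as Mr.\n",
"# Note: the trailing \\. applies only to (mr); 'miss', 'mrs' and 'ms' match anywhere in a name.\n",
"for name in train_df['Name']:\n",
"    if not re.search(r'(miss)|(mrs)|(ms)|(mr)\\.', name, re.IGNORECASE):\n",
"        print(name)"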
]
},
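{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of why the pattern above is looser than it may look: in a regular expression `|` has the lowest precedence, so a trailing `\\.` attaches only to the last alternative. The `strict` pattern is an assumed alternative shown for comparison only; the recorded output above comes from the loose pattern. The name in `made_up` is hypothetical, while `real` is taken from the data shown earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Loose: the period belongs to (mr) only, so 'ms' matches inside 'Adams'.\n",
"loose = re.compile(r'(miss)|(mrs)|(ms)|(mr)\\.', re.IGNORECASE)\n",
"# Strict (assumed alternative): a whole word followed by a period.\n",
"strict = re.compile(r'\\b(miss|mrs|ms|mr)\\.', re.IGNORECASE)\n",
"\n",
"made_up = 'Adams, Master. Example'  # hypothetical name\n",
"real = 'Heikkinen, Miss. Laina'\n",
"\n",
"loose_hits = [bool(loose.search(n)) for n in (made_up, real)]\n",
"strict_hits = [bool(strict.search(n)) for n in (made_up, real)]\n",
"print(loose_hits, strict_hits)"
]
},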
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uruchurtu, Don. Manuel E\n",
"Byles, Rev. Thomas Roussel Davids\n",
"Bateman, Rev. Robert James\n",
"Carter, Rev. Ernest Courtenay\n",
"Aubart, Mme. Leontine Pauline\n",
"Peuchen, Major. Arthur Godfrey\n",
"Butt, Major. Archibald Willingham\n",
"Kirkland, Rev. Charles Leonard\n",
"Sagesser, Mlle. Emma\n",
"Simonius-Blumer, Col. Oberst Alfons\n",
"Weir, Col. John\n",
"Crosby, Capt. Edward Gifford\n",
"Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)\n",
"Reuchlin, Jonkheer. John George\n",
"Harper, Rev. John\n",
"Montvila, Rev. Juozas\n"
]
}
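],
"source": [
"# Master and Dr. seem common, so add them to the pattern and print again.\n",
"# Note: here the trailing \\. applies only to (dr); the other alternatives, including 'mr', match anywhere in a name.\n",
"for name in train_df['Name']:\n",
"    if not re.search(r'(miss)|(mrs)|(ms)|(mr)|(master)|(dr)\\.', name, re.IGNORECASE):\n",
"        print(name)"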
]
},
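{
"cell_type": "markdown",
"metadata": {},
"source": [
"Listing the names made it possible to filter out the frequently used titles, but this approach does not show how the titles are distributed.\n",
"So the next step is to make the frequency distribution of the titles visible."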
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Mr\n",
"1 Mrs\n",
"2 Mr\n",
"3 Mr\n",
"4 Mrs\n",
"Name: Title, dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract the title from Name. The data appears to follow the order 'surname, title. given names', so the title is taken according to that rule\n",
"train_df_title = [i.split(\",\")[1].split(\".\")[0].strip() for i in train_df[\"Name\"]]\n",
"train_df[\"Title\"] = pd.Series(train_df_title)\n",
"# train_df[\"Title\"].head()\n",
"\n",
"test_df_title = [i.split(\",\")[1].split(\".\")[0].strip() for i in test_df[\"Name\"]]\n",
"test_df[\"Title\"] = pd.Series(test_df_title)\n",
"test_df[\"Title\"].head()"
]
},
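{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same extraction can also be sketched with pandas string methods, which yields the frequency table directly. The `names` Series holds four names that appear in the outputs above; the regular expression is an assumption that every name follows the `surname, title. given names` layout."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"names = pd.Series([\n",
"    'Braund, Mr. Owen Harris',\n",
"    'Heikkinen, Miss. Laina',\n",
"    'Allen, Mr. William Henry',\n",
"    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',\n",
"])\n",
"\n",
"# Capture the text between the first comma and the first period after it.\n",
"titles = names.str.extract(r',\\s*([^.]+)\\.', expand=False).str.strip()\n",
"title_counts = titles.value_counts()\n",
"print(title_counts)"
]
},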
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10c927f28>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Visualize the frequency distribution of the titles\n",
"g = sns.countplot(x=\"Title\",data=train_df)\n",
"g = plt.setp(g.get_xticklabels(), rotation=45) "
]
},
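{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above shows that titles other than Mr, Mrs, Miss, and Master hardly appear, so they are all grouped into a single 'Rare' category.  \n",
"Note: Mme and Mlle are grouped with Miss here. Strictly speaking, Mme (Madame) corresponds to Mrs and only Mlle (Mademoiselle) to Miss, but both are so rare that the effect should be small. Ms is a title for women regardless of marital status.  \n",
"A married woman is titled Mrs and an unmarried one Miss. It feels like an old-fashioned convention, but it should correlate with age, so it is used here."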
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Convert the titles to categorical codes. A one-hot vector would probably be better\n",
"train_df[\"Title\"] = train_df[\"Title\"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n",
"train_df[\"Title\"] = train_df[\"Title\"].map({\"Master\":0, \"Miss\":1, \"Mme\":1, \"Mlle\":1, \"Ms\" : 2 , \"Mrs\":2, \"Mr\":3, \"Rare\":4})\n",
"train_df[\"Title\"] = train_df[\"Title\"].astype(int)\n",
"\n",
"test_df[\"Title\"] = test_df[\"Title\"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n",
"test_df[\"Title\"] = test_df[\"Title\"].map({\"Master\":0, \"Miss\":1, \"Mme\":1, \"Mlle\":1, \"Ms\" : 2 , \"Mrs\":2, \"Mr\":3, \"Rare\":4})\n",
"test_df[\"Title\"] = test_df[\"Title\"].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x10c9636d8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"g = sns.countplot(train_df[\"Title\"])\n",
"g = g.set_xticklabels([\"Master\",\"Miss/Mme/Mlle\", \"Ms/Mrs\",\"Mr\",\"Rare\"])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAARgAAAEYCAYAAACHjumMAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAFfZJREFUeJzt3X+UVXW5x/H3zDAIgqiIKzU19JpP3hI1TdHSgLFUyiyzm0rX9BpqirrMlWl6EzS1pdmPycofld7U/JW6AkrLRMRIiWtledHHLLWsrPBXKgMDzNw/vnvoMJ45s+fHM3vO5vNay8Xss8/Z59ly5sN377P392no7OxERCRCY9EFiEh5KWBEJIwCRkTCKGBEJIwCRkTCKGBEJMyIqA2bWSPwDWA3YBXwCXd/smL9Z4CjgH8Cl7r7/KhaRKQYkSOYDwKj3H1f4Gzg8q4VZrYrcDQwGXgvcIGZbRxYi4gUIDJg3gXcDeDuDwF7VazbBVjo7ivdfSXwO2BSTxsysxFmNtHMwkZcIjL4In9hxwEvVyyvNbMR7r4G+C1wjpltAowE9gOurrGtbYGn7r333rBiRWRAGqo9GDmC+SewSeV7ZeGCuz8GXEEa4VwBLAGWB9YiIgWIDJjFwHQAM5tMGrWQLW8JbOLu7wROArYDHg2sRUQKEHmIdCfwHjP7OWn4dJyZfQp4EpgH7GJmS4F24NPuvjawFhEpQFjAuHsHaXRS6fGKn0+Mem8RGR50oZ2IhFHAiEgYBYyIhFHAiEgYBYyIhFHAyJBrbW2lpaWF1tbWokuRYAoYGVJtbW3MnTsXgHnz5tHW1lZwRRJJASNDqr29na5OFh0dHbS3txdckURSwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAWMiIQpsrPjmaTmax3Axe5+Z1QtIlKMojo7bgacDuxL6uz4lcA6RKQgRXV2fA14BhiT/dcRWIeIFCQyYKp2dqxY/hOwDPgloIlBREoosi9Sj50dgUOArYEdsuUfm9lid/9FYD0yQD865rgBb6Nt7frtr3568qmMbmrq9/amf/fagZYkgQrp7Ai8CLQBq9x9JfASsFlgLSJSgEI6O7r7XDM7EHjIzDqAnwH3BNYiIgUorLOju58PnB/1/iJSPF1oJyJhFDAiEkYBU8fU/kOGOwVMnVL7D6kHCpg6pfYfUg8UMCISRgEjImEUMCISRgEjImEUMCISRgEjImEUMCISRgEjImE2+IDR5fYicTbogNHl9iKxNuiA0eX2Q6+poWHdzw3dlqV8NuiAkaE3srGR3ceMBWC3MWMZ2aiPYJlFTpkpUlXLZuNp2Wx80WXIECiks6OZ7c76zdYmAx9097uj6hGRoRc5glnX2THrKnA5cBiAu/8amAJgZh8B/qxwESmfyIBZr7Ojme3V/QlmNgaYAxwQWMewc+y1pw94G2tXrVlvedZNn6Vpo/7/dV533FcHWpLI6xTZ2RHgeOA2d18eWIeIFKSozo5dZgBHBNYgIgUqqrMjZrYpsJG7/ymwBhEpUGGdHYGdgacD319EClZkZ8elpG+aRKSkdBmliIRRwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAWMiIRRwIhIGAVMnWporGj30dBtWWSYUMDUqcbmJsbunGbmH/vm8TQ2NxVckcjrqW1JHdt8723YfO9tii5DpEcawYhImF5HMGb2KPA/wPXu/lx8SSJSFnlGMO8DRgH3mdkPzewIM2sOrktESqDXEYy7PwNcCFxoZh8CWoErzewG4EJ3f77a62p1dszWHwKcT5qv92HgFHfvHOD+iMgw0usIxszGmtmxZnYvcAnwTWBv4AngxzVeuq6zI3A2qbNj1zY3AS4D3u/u+5Am/57Q350QkeEpz7dITwHzgTnuvqjrQTP7JvCeGq+r1dlxP1Ibk8vNbE
fgW+7+j74WLyLDW56AOT5rM7KOmR3u7ncAH6rxuqqdHbPmaxOAqcDuwKvAA2b2oLs/0bfyRWQ46zFgzOyjwEbABWa2WcWqZuAc4I5etl2rs+PzwNKub6XMbBEpbBQwIiVSawQzjnQoswlptNFlDXBujm0vBg4Fbq3S2fGXwNvMbALwEjAZuKYPdYtIHegxYNz9GuAaM2tx93v7se2anR3N7Bz+dZL4Vnd/tB/vISLDWK1DpKvd/QTgPDN73YjF3afV2nCOzo43Azf3rVwRqSe1DpGuyv6cPQR1iEgJ1QqYMWZ2ADBsL347+qwbB/T6jjUr11s+cc73aRwxakDb/N6lMwb0epEyqRUwc2qs6wRqHiKJiNQ6yTu1p3UiInn0epLXzO6jymFSbyd5RUR0kldEwvR4s6O7P5z9eT/wIjAJ2AV4LntMRKSmPHdTnwZ8H5gI7AzMM7OPB9clIiWQ52bHmcCe7v4KgJldCCwizXInItKjPDPavQas7ra8sofnioisU+tbpM9lPz4PLDazm0k3Oh4B/G4IahOROlfrEKmrk9cvsj83zv78SVw5IlImtS60q3olr5k1ADuEVSQipZGnbcks4GJgTMXDTwE7RRUlIuWQ5yTvmaTOALcA/wYcDyyJLEpEBk9raystLS20trYO+XvnCZi/u/tTwG+AXd39OsBCqxKRQdHW1sbcuWlK7Xnz5tHW1jak75/ra2ozm0oKmEPNbCtg89iyRGQwtLe309mZbiXs6Oigvb19SN8/T8CcCnyA1IJkC9KsdF+LLEpEyiFPZ8f/A84ws3HADHfPdZFdjs6OXyX1Tnole+gwd3/5dRsSkbqV51ukXUm3BbwJ6DSzx4GPu/vve3npus6OWVeBy4HDKtbvCRzk7sv7V7qIDHd5DpGuBM519y3cfQIpKL6T43XrdXYE1nV2zEY3bwauNrPFZvZffa5cRIa9PAEz2t3v6lpw9ztJPZN6U7WzY/bzGNJ5nI8BBwMnm9mkfCWLSL2odS/S9tmPj5jZ2cC3SfcizQAeyLHtWp0dVwBfdfcV2XstIJ2r+U3fyheR4azWOZj7SVNlNgBTgBMr1nUCp/Wy7VqdHXcGbjGzPUijqHeh6R9ESqfWvUgDvd+ot86O1wMPkaaC+G72bZWIlEieb5G2BK4AWrLnLwA+6e5/q/W6HJ0dLwMu62vBIlI/8pzkvQpYCuxImjbzIdL5GBGRmvJMmbmjux9esXypmf1nVEEiUh55RjCdZrZd10L27dLqGs8XEQHyjWD+G3jQzJaQTtbuA5wQWpWIlEKegPkjsAewN2nEc5K7/z20KhEphTwBc4u77wL8MLoYESmXPAGzLOswsARYN1uNuy8Kq0pESiFPwIwHpmb/dekEpoVUJCKlkWc+mKkAZjYeWKs5W0QkrzxX8u4GfBd4I9BoZo8Bx+SYD0ZENnB5roP5Dmk+mAnuPh74InBdaFUiUgp5AqbB3ed3LWTzwYyNK0lEyiLPSd5FZnYecA1pPpgjgce65otx9z8G1icidSxPwHTNo3t8t8e75ovZcVArEpHSyPMtUnn7UDc0VS50WxaRgcpzDqa0GpuaGb3lLgCM3vItNDY1F1yRSLnkOUQqtXHb78u47fctugyRUtqgRzAiEqtWV4FrSSdxq3L3mr2MeuvsWPGcHwI/cPcr+1C3iNSBWodICwe47d46OwJ8Hth8gO8jIsNUra4C69qIZPchjSFNONUE5Plmab3Ojma2V+VKMzsC6Oh6joiUT6/nYMzsYuApwIGfAU8Cl+TYdo+dHc3sbcDRwOf6WrCI1I88J3mPArYDbiFN2XAg8I8cr6vV2fEY0s2TC4BjgU+Z2cE5axaROpEnYP7q7v8EHgV2c/f7gDfkeN1iYDpA986O7n6Wu+/j7lNIN05+yd11qCRSMnmug3k5a1PyMHCqmf2FfCdma3Z27HfFIlI38gTM8cBR7n69mR1KasR2Xm8v6q2zY8XzZueoQUTqUJ6A+Q/gBg
B3PzO2HBEpkzwB80bgITNzUtDc4e4rYssSkTLo9SSvu386u6P6ImAy8Gszuz68MhGpe7nuRTKzBqAZGEm6OG5VZFEiUg55Jv3+Gumy/18BNwKnufvK6MJEpP7lOQfzBPB2d89zcZ2IyDq17qY+wd2vJjVe+6SZrbfe3S8Irk1E6lytEUxDDz+LiORS627qq7IfXwZucve/DU1JItJl0fzZA3r9irbV6y0/+JNL2Xj0wKaGPeD9s3M/V9fBiEgYXQcjImF0HYyIhMl7HcxhwK9Jh0i6DkZEcslzDuZvwJ66DkZE+irPIdIMhYuI9EeeEcwyM/scsARo63rQ3ReFVSUipZAnYMaT5uKdWvFYJzAtpCIRKY1eA8bdp/b2nGp6a7xmZqeQJvzuBL7o7rf2531EZPjK8y3SfVTp8OjuvY1gemy8ZmYTgE8CewCjSIdht7l7j50kRaT+5DlEml3xczMpJF7M8boeG6+5+3Iz293d15jZRGClwkWkfPIcIt3f7aGfmtkSem+aVrXxWldvpCxcZgFzgNY+1CwidSLPIdL2FYsNwFuBLXJsu1bjNQDc/Qozuxq4y8ymZj2XRKQk8hwiVY5gOoHlwKk5XrcYOBS4tXvjNUuTy1wCfBhYTToJ3JGzZhGpE3kOkfI0uq+mZuM1M3sEeJAUWndVORQTkTqX5xBpb9IJ2yuA+aRvfk5y99trva63xmvuPod0/kVESirPrQKtpLaxR5Cu5N0TODuyKBEphzwB05gdvrwP+L67/5F8525EZAOXJ2BWmNmZpFsD5pvZ6cArsWWJSBnkupsaGAN82N1fBLYBjg6tSkRKIc+3SH8GLqhY/kxoRSJSGrmmzBQR6Q8FjIiEUcCISBgFjIiEUcCISBgFjIiEUcCISBgFjIiEUcCISBgFjIiEUcCISBgFjIiEUcCISJiwiaNydHY8AzgyW/xRNoWmiJRI5AhmXWdH0hSbl3etMLMdSfPM7AdMBt5rZpMCaxGRAkQGzHqdHYG9Ktb9CTjY3ddmHR2bgZWBtYhIASLn1u2xs6O7rwaWm1kDcBnwK3d/IrAWESlA5AimZmdHMxsF3Jg95+TAOkSkIJEBsxiYDlCls2MD8APgEXc/0d3XBtYhMihaW1tpaWmhtVWt1POKPETqsbMj0AS8G9jIzA7Jnn+Ouz8YWI9Iv7W1tTF37lwA5s2bx8yZMxk9enTBVQ1/YQHTW2dHYFTUe4sMtvb2djo7OwHo6Oigvb1dAZODLrQTkTAKGBEJo4ARKbERTf/6FW9oWH95KChgREps5Mgm3rHb1gDsNWlrRo5sGtL3VxN7kZKbPm0npk/bqZD31ghGRMJoBCMbhIvPvW1Ar1+9ev1b5b5y0Vyamwd2pcVnL/rIgF5fDzSCEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChiRHBoaKy+xb+i2LD1RwIjkMKKpmW3f8FYAtn3DvzOiqbngiuqDruQVycl22B/bYf+iy6grGsGISJjCOjtmz9mSNDn4JHdXXySRkimksyOAmR0E/ATYKrAGESlQUZ0dATqAA4EXAmsQkQJFBkzVzo5dC+5+j7s/H/j+IlKwwjo7ikj5FdLZUUQ2DIV0dnT3uYHvKyLDRJGdHbueNzGqBhEpli60E5EwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCaOAEZEwChgRCVNYZ0czmwmcCKwBPu/u86NqEZFiFNLZ0cy2Ak4D3gkcBFxiZhsF1iIiBYjsKrBeZ0czq+zsuDew2N1XAavM7ElgErC0h201ATz33HPrPbhqxUuDXfOAPfvss70+Z+VLK4agkr7JU/cLq4Zf+/A8dQO8+tqLwZX0XZ7al7/w6hBU0jfV6m5paZkIPNu991lkwFTt7JgV0H3dK8CmNba1NcCMGTMGvcjB1nJPa9El9EvLlS1Fl9AvF7
bUZ90At991ee9PGo4umlft0aeAHYCnKx+MDJhanR27r9sEqDUcWQrsD/wVWDuYRYrIoHnd0CYyYBYDhwK3Vuns+AvgIjMbBWwE7AI82tOGskOpnwXWKiIBGjo7O0M2XPEt0iSyzo6kVrJPuvvc7FukE0gnmi9299tDChGRwoQFjIiILrQTkTAKGBEJo4ARkTCR3yINOTObAtwHHOXuN1c8/hvgl+5+bB+2Ncvdrxj0IvO//xRq7Aswzt0PL6i8Xg3m30VR6mEfshpvBZYBnaRrzP4AzHD39gJLA8o5gnkcOLJrwcx2Bcb0YzvnDVpF/dfjvgzncKkwWH8XRaqHfVjg7lPcfaq77wmsBj5QdFFQshFM5hHAzGxTd38Z+BhwI7C9mc0CDid9QJYDHwImAteSbrpsBI4GjgHGm9k3gNOBK4E3Z+vPc/eFZvYo8ATQ7u5HEqPWvjzn7luZ2cnAx4EOYKm7n2ZmhwOfIX3Q/gIc6e4dQTX2t/5nSL+8y4AHhkm91eTaB3c/o8giu5jZSNKV7y+a2beA7bLlue5+npldB2yR/fc+4CzSRaxNwJfc/bbBrKeMIxiA24HDzayBdN/Tz0n7ugVwoLvvQwrXdwDvIV34dyBwPrCpu18EvODuJwOfAJa7+wHAYcDXs/cYC1wYGC619qXSccCs7KbSx8xsBHAUcJm7vwuYTxo2F6Wn+rcDjs5+MYdTvdXk2YciTTOzhWa2jHT4fCfwe+Ahdz+IVPNJFc9f4O77AZOBHbL/71OBc81ss8EsrKwB8z3SsPYA0r+OkP6FbwduMrNvA9sCzcC3Sbcp3A3MIo1kKu0KTDezhaQP2ggzm5Ct88B96FJtXyodB5xiZvcDbyJd1Pgp0ofufmA/0r4Xpaf6l7v789nPw6neavLsQ5EWuPsU0kiknXRf0AvAO8zsRuDLpCvmu3R9bncF9sw+23eTfh8mDmZhpQwYd/8D6TDoNOCG7OFxwAfd/aPAqaR9byCNSh5w9xbgNtJQnWwdpCHwTdlf4CHZc17I1oX/IvSwL5VmAie5+7uBPUi/oCcAs7PHGkiHgoWoUX/l/7thU281OfehcFnYfQz4FnAG8JK7zyBNlbJxNgKDf9X9OHBf9tmeRjpZ/PvBrKmUAZO5BdjO3Z/IltcAr5nZYuAe0o2T2wD/C1xgZgtIw8ivZc9fZmY3AFcBb8n+df058EwB5we670ul3wIPZPX/HVhCOuSbb2b3AluRDjuKVKt+GH71VtPbPgwL7r4MaAXeBhxsZouAbwK/I33eK80DXjWzB4CHgU53f2Uw69GtAiISpswjGBEpmAJGRMIoYEQkjAJGRMIoYEQkTBlvFZACmdnXSe1oRgI7kW4FgPR1f6e7X2lm15Kue3nGzJ4Gprj70wWUK8EUMDKo3P0UADObCCx0992rPG0qMGco65JiKGBkSJjZ7OzHlaQLvn5kZvtXrG8CLgOmkG68u87dvzzEZcog0zkYGVLu/gXSHdPTu93HMzNb/3bSzXmHVQaQ1CeNYGS4OBDY3cymZctjSTfjVbvBU+qEAkaGiybgLHe/AyC7Y/21YkuSgdIhkhRhDa//x20BMNPMms1sLKnR3j5DXpkMKo1gpAjzSSd5D6p4rGvWwF+RPpfXuvvCAmqTQaS7qUUkjA6RRCSMAkZEwihgRCSMAkZEwihgRCSMAkZEwihgRCTM/wOhcWDfLV8lsQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10ca8fc50>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"g = sns.factorplot(x=\"Title\",y=\"Survived\",data=train_df,kind=\"bar\")\n",
"g = g.set_xticklabels([\"Master\",\"Miss\",\"Mrs\",\"Mr\",\"Rare\"])\n",
"g = g.set_ylabels(\"survival probability\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
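"Passengers titled Mr (adult men) have a low survival rate. Master is apparently a title given to young boys.  \n",
"As far as the plot shows, the result agrees with the account that women and children were rescued first.\n",
"Passengers whose title is Rare are probably a mix of many different kinds of people, so a survival rate near the middle of the range seems natural."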
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Drop the Name column\n",
"train_df.drop(labels = [\"Name\"], axis = 1, inplace = True)\n",
"test_df.drop(labels = [\"Name\"], axis = 1, inplace = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
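"Next, missing values. Cabin is missing in most rows, so it is dropped from the data this time.  \n",
"Age appears to affect survival (the previous attempt showed a correlation between Age and Survived),  \n",
"so the missing Age values are filled with a suitable value.  \n",
"Since the title was extracted from each name, it is used to impute the age.\n",
"Something like kNN imputation would probably be better, but for now a simple summary statistic is used (the per-title medians are printed below; the fill code itself calls `mean()`)."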
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.5\n",
"21.0\n",
"35.0\n",
"30.0\n",
"48.5\n"
]
}
],
"source": [
"print(train_df[train_df.Title==0].Age.dropna().median())\n",
"print(train_df[train_df.Title==1].Age.dropna().median())\n",
"print(train_df[train_df.Title==2].Age.dropna().median())\n",
"print(train_df[train_df.Title==3].Age.dropna().median())\n",
"print(train_df[train_df.Title==4].Age.dropna().median())"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# I could not work out how to make fillna pick a different value per row based on Title, so loop over the rows instead\n",
"# Filling with the median lowered accuracy compared with the previous attempt; the reason is unclear.\n",
"for i, train in train_df.iterrows():\n",
"    if math.isnan(train.Age):\n",
"        # Use the mean age of the same title when the row has a title, otherwise the overall mean\n",
"        if not math.isnan(train.Title):\n",
"            train_df.at[i, 'Age'] = train_df[train_df.Title==train.Title].Age.dropna().mean()\n",
"        else:\n",
"            train_df.at[i, 'Age'] = train_df.Age.dropna().mean()\n",
"\n",
"for i, test in test_df.iterrows():\n",
"    if math.isnan(test.Age):\n",
"        # Same rule for the test set, using the test set's own statistics\n",
"        if not math.isnan(test.Title):\n",
"            test_df.at[i, 'Age'] = test_df[test_df.Title==test.Title].Age.dropna().mean()\n",
"        else:\n",
"            test_df.at[i, 'Age'] = test_df.Age.dropna().mean()"
]
},
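{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to the row-by-row loop above, the per-title fill can probably be written with `groupby` and `transform`. This is only a sketch, not what produced the results in this notebook: it assumes `train_df` still has numeric `Title` and `Age` columns, and it falls back to the overall mean for any row the per-title mean cannot fill. Run after the loop it changes nothing, because Age no longer has gaps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (assumption: numeric Title and Age columns exist, as in the cells above)\n",
"# Mean Age of each Title group, broadcast back to one value per row\n",
"title_mean_age = train_df.groupby('Title')['Age'].transform('mean')\n",
"\n",
"# Fill gaps with the per-title mean first, then with the overall mean as a fallback\n",
"train_df['Age'] = train_df['Age'].fillna(title_mean_age).fillna(train_df['Age'].mean())"
]
},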
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Sex 0\n",
"Age 0\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"Title 0\n",
"dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# XXX: for now, just drop lots of columns\n",
"train_df = train_df.drop([\"Ticket\", \"SibSp\", \"Parch\", \"Fare\", \"Cabin\", \"Embarked\"], axis=1)\n",
"test_df = test_df.drop([\"Ticket\", \"SibSp\", \"Parch\", \"Fare\", \"Cabin\", \"Embarked\"], axis=1)\n",
"\n",
"# Prepare data\n",
"X_train = train_df.drop(['PassengerId', 'Survived'], axis=1).values\n",
"y_train = train_df.Survived.values\n",
"X_test = test_df.drop('PassengerId', axis=1).values\n",
"\n",
"model = RandomForestClassifier(n_estimators=100)\n",
"\n",
"# Predict with \"Random Forest\"\n",
"y_pred = model.fit(X_train, y_train).predict(X_test).astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: this article explains how to think about missing values.\n",
"[Python pandas 欠損値/外れ値/離散化の処理 - StatsFragments](http://sinhrks.hatenablog.com/entry/2016/02/01/080859#%E6%AC%A0%E6%90%8D%E5%80%A4)\n",
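"#### Patterns of missingness at a glance\n",
"- MCAR (missing completely at random): values are missing at random (missingness is unrelated to the values of \"IQ\" or \"JobPerformance\")\n",
"- MAR (missing at random): missingness is related to the values of other variables (when \"IQ\" is low, \"JobPerformance\" is missing more often)\n",
"- MNAR (missing not at random): missingness is related to the missing value itself (when the true \"JobPerformance\" is low, \"JobPerformance\" is missing more often)"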
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.802\n"
]
}
],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import KFold\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Estimate performance with cross-validation\n",
"SPLIT_NUM = 4\n",
"sum_ = 0\n",
"kf = KFold(n_splits=SPLIT_NUM, shuffle=True)\n",
"for train, cv in kf.split(X_train):\n",
"    y_cv_pred = model.fit(X_train[train], y_train[train]).predict(X_train[cv]).astype(int)\n",
"    sum_ += accuracy_score(y_train[cv], y_cv_pred, normalize=True)\n",
"\n",
"print('accuracy: {:.3f}'.format(sum_/SPLIT_NUM))"
]
},
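{
"cell_type": "markdown",
"metadata": {},
"source": [
"The manual KFold loop above can probably be condensed with scikit-learn's `cross_val_score`. This is a sketch under assumptions, not the code that produced the accuracy printed above: `random_state=0` is an arbitrary choice added here so the folds and the forest are repeatable, and a fresh classifier is created so the fitted `model` is left alone. Printing the spread across folds alongside the mean gives a rough idea of how noisy the estimate is."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"# Sketch (assumption: X_train and y_train are the arrays prepared above; random_state=0 is arbitrary)\n",
"kf = KFold(n_splits=4, shuffle=True, random_state=0)\n",
"clf = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"\n",
"# One accuracy value per fold\n",
"scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring='accuracy')\n",
"print('accuracy: {:.3f} (std {:.3f})'.format(scores.mean(), scores.std()))"
]
},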
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Save prediction to csv file\n",
"submission = pd.DataFrame({\n",
" \"PassengerId\": test_df[\"PassengerId\"],\n",
" \"Survived\": y_pred\n",
" })\n",
"submission.to_csv('../output/submission.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# score: 0.71"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}