@fhiyo
Created January 20, 2018 22:34
Trying the Kaggle Titanic tutorial with a random forest (with a little preprocessing)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kaggle Titanic tutorial problem with a random forest\n",
"\n",
"18/01/21\n",
"Some light preprocessing is applied: missing Age values are imputed based on each passenger's title (honorific)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Titanic Top 4% with ensemble modeling | Kaggle](https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling)\n",
"- [Titanic data analysis: Feature Engineering](http://rindalog.blogspot.jp/2016/10/feature-engineering.html)\n",
"- [Exploring Survival on the Titanic | Kaggle](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic/code)\n",
"- [Data analysis and impressions on reaching the top 10% in Kaggle's Titanic problem - mirandora.com](http://www.mirandora.com/?p=1804) (this one evaluates accuracy with cross-validation)\n",
"\n",
"This notebook was written with reference to the articles above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import re\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"sns.set(style='white', context='notebook', palette='deep')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load training and test data\n",
"train_df = pd.read_csv(\"../input/train.csv\", header=0)\n",
"test_df = pd.read_csv(\"../input/test.csv\", header=0)"
]
},
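{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocessing comes next, but presumably the algorithm has to be chosen first, or the direction of the preprocessing cannot be decided?  \n",
"For example, with a decision tree the categorical data could in principle be left as-is, but with methods that cannot handle labeled data explicitly,  \n",
"steps such as converting the categories to dummy variables become necessary."
]
},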
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Convert \"Sex\" to a dummy variable (female = 0, male = 1)\n",
"train_df[\"Sex\"] = train_df[\"Sex\"].map({\"female\": 0, \"male\": 1}).astype(int)\n",
"test_df[\"Sex\"] = test_df[\"Sex\"].map({\"female\": 0, \"male\": 1}).astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count the missing values in each column\n",
"train_df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Some of the reference articles detect and remove outliers, but that step is skipped here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Name, Ticket, and Cabin are unstructured data. How should they be handled?"
]
},
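{
"cell_type": "markdown",
"metadata": {},
"source": [
"As https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic/notebook\n",
"shows, quite a lot can be learned from a passenger's name.  \n",
"For example, the title Mrs denotes a married woman, so the passenger is female and likely to be older than unmarried women.\n",
"So let's extract the honorific (apparently called the \"Title\") from each name and make it usable as a feature."
]
},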
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Palsson, Master. Gosta Leonard\n",
"Rice, Master. Eugene\n",
"Uruchurtu, Don. Manuel E\n",
"Panula, Master. Juha Niilo\n",
"Goodwin, Master. William Frederick\n",
"Skoog, Master. Harald\n",
"Moubarek, Master. Gerios\n",
"Caldwell, Master. Alden Gates\n",
"Nicola-Yarred, Master. Elias\n",
"Byles, Rev. Thomas Roussel Davids\n",
"Bateman, Rev. Robert James\n",
"Sage, Master. Thomas Henry\n",
"Panula, Master. Eino Viljami\n",
"Goldsmith, Master. Frank John William \"Frankie\"\n",
"Rice, Master. Arthur\n",
"Lefebre, Master. Henry Forbes\n",
"Asplund, Master. Clarence Gustaf Hugo\n",
"Becker, Master. Richard F\n",
"Navratil, Master. Michel M\n",
"Minahan, Dr. William Edward\n",
"Carter, Rev. Ernest Courtenay\n",
"Asplund, Master. Edvin Rojj Felix\n",
"Rice, Master. Eric\n",
"Allison, Master. Hudson Trevor\n",
"Moraweck, Dr. Ernest\n",
"Navratil, Master. Edmond Roger\n",
"Coutts, Master. William Loch \"William\"\n",
"Aubart, Mme. Leontine Pauline\n",
"Goodwin, Master. Sidney Leonard\n",
"Pain, Dr. Alfred\n",
"Richards, Master. William Rowe\n",
"Dodge, Master. Washington\n",
"Peuchen, Major. Arthur Godfrey\n",
"Goodwin, Master. Harold Victor\n",
"Coutts, Master. Eden Leslie \"Neville\"\n",
"Butt, Major. Archibald Willingham\n",
"Davies, Master. John Morgan Jr\n",
"Duff Gordon, Sir. Cosmo Edmund (\"Mr Morgan\")\n",
"Kirkland, Rev. Charles Leonard\n",
"Stahelin-Maeglin, Dr. Max\n",
"Sagesser, Mlle. Emma\n",
"Simonius-Blumer, Col. Oberst Alfons\n",
"Frauenthal, Dr. Henry William\n",
"Weir, Col. John\n",
"Moubarek, Master. Halim Gonios (\"William George\")\n",
"Crosby, Capt. Edward Gifford\n",
"Moor, Master. Meier\n",
"Hamalainen, Master. Viljo\n",
"Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)\n",
"Brewe, Dr. Arthur Jackson\n",
"Rice, Master. George Hugh\n",
"Dean, Master. Bertram Vere\n",
"Leader, Dr. Alice (Farnham)\n",
"Carter, Master. William Thornton II\n",
"Thomas, Master. Assad Alexander\n",
"Skoog, Master. Karl Thorsten\n",
"Reuchlin, Jonkheer. John George\n",
"Panula, Master. Urho Abraham\n",
"Mallet, Master. Andre\n",
"Richards, Master. George Sibley\n",
"Harper, Rev. John\n",
"Andersson, Master. Sigvard Harald Elias\n",
"Johnson, Master. Harold Theodor\n",
"Montvila, Rev. Juozas\n"
]
}
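],
"source": [
"# Print the names that do not carry one of the common titles such as Mr\n",
"# Note: the trailing \\. binds only to the last alternative, so \"miss\", \"mrs\" and \"ms\" match anywhere in the name\n",
"for name in train_df['Name']:\n",
"    if not re.search(r'(miss)|(mrs)|(ms)|(mr)\\.', name, re.IGNORECASE):\n",
"        print(name)"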
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uruchurtu, Don. Manuel E\n",
"Byles, Rev. Thomas Roussel Davids\n",
"Bateman, Rev. Robert James\n",
"Carter, Rev. Ernest Courtenay\n",
"Aubart, Mme. Leontine Pauline\n",
"Peuchen, Major. Arthur Godfrey\n",
"Butt, Major. Archibald Willingham\n",
"Kirkland, Rev. Charles Leonard\n",
"Sagesser, Mlle. Emma\n",
"Simonius-Blumer, Col. Oberst Alfons\n",
"Weir, Col. John\n",
"Crosby, Capt. Edward Gifford\n",
"Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)\n",
"Reuchlin, Jonkheer. John George\n",
"Harper, Rev. John\n",
"Montvila, Rev. Juozas\n"
]
}
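],
"source": [
"# Master and Dr. seem common? Add them to the pattern and print again\n",
"# Note: here only (dr) requires the trailing dot; the other alternatives match as plain substrings\n",
"for name in train_df['Name']:\n",
"    if not re.search(r'(miss)|(mrs)|(ms)|(mr)|(master)|(dr)\\.', name, re.IGNORECASE):\n",
"        print(name)"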
]
},
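{
"cell_type": "markdown",
"metadata": {},
"source": [
"Listing the names made it possible to filter out the frequently used titles, but this approach says nothing about how the titles are distributed.\n",
"So the next step is to make the frequency distribution of the titles visible."
]
},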
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Mr\n",
"1 Mrs\n",
"2 Mr\n",
"3 Mr\n",
"4 Mrs\n",
"Name: Title, dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract the title from Name. Judging from the data, names follow the pattern \"surname, title. given names\", so the title is taken by that rule\n",
"train_df_title = [i.split(\",\")[1].split(\".\")[0].strip() for i in train_df[\"Name\"]]\n",
"train_df[\"Title\"] = pd.Series(train_df_title)\n",
"# train_df[\"Title\"].head()\n",
"\n",
"test_df_title = [i.split(\",\")[1].split(\".\")[0].strip() for i in test_df[\"Name\"]]\n",
"test_df[\"Title\"] = pd.Series(test_df_title)\n",
"test_df[\"Title\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10c927f28>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Visualize the frequency distribution of the titles\n",
"g = sns.countplot(x=\"Title\",data=train_df)\n",
"g = plt.setp(g.get_xticklabels(), rotation=45) "
]
},
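{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above shows that titles other than Mr, Mrs, Miss, and Master hardly appear, so they are lumped together into a single 'Rare' category.  \n",
"Note: Mlle (Mademoiselle) corresponds to Miss. Mme (Madame) is strictly closer to Mrs, but it is grouped with Miss here; there are so few Mme and Mlle rows that a wrong grouping should have little effect. Ms is a title for women regardless of marital status.  \n",
"Mrs is used for married women and Miss for unmarried women. It feels like an old-fashioned convention, but it probably correlates with age, so it is used here."
]
},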
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Convert the titles to categorical codes. One-hot encoding would probably be better\n",
"train_df[\"Title\"] = train_df[\"Title\"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n",
"train_df[\"Title\"] = train_df[\"Title\"].map({\"Master\":0, \"Miss\":1, \"Mme\":1, \"Mlle\":1, \"Ms\" : 2 , \"Mrs\":2, \"Mr\":3, \"Rare\":4})\n",
"train_df[\"Title\"] = train_df[\"Title\"].astype(int)\n",
"\n",
"test_df[\"Title\"] = test_df[\"Title\"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n",
"test_df[\"Title\"] = test_df[\"Title\"].map({\"Master\":0, \"Miss\":1, \"Mme\":1, \"Mlle\":1, \"Ms\" : 2 , \"Mrs\":2, \"Mr\":3, \"Rare\":4})\n",
"test_df[\"Title\"] = test_df[\"Title\"].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10c9636d8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"g = sns.countplot(x=\"Title\", data=train_df)\n",
"g = g.set_xticklabels([\"Master\", \"Miss/Mme/Mlle\", \"Ms/Mrs\", \"Mr\", \"Rare\"])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10ca8fc50>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# factorplot was renamed to catplot in later seaborn releases\n",
"g = sns.catplot(x=\"Title\", y=\"Survived\", data=train_df, kind=\"bar\")\n",
"g = g.set_xticklabels([\"Master\", \"Miss/Mme/Mlle\", \"Ms/Mrs\", \"Mr\", \"Rare\"])\n",
"g = g.set_ylabels(\"survival probability\")"
]
},
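{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passengers with the title Mr (men) have a low survival rate. Master is apparently a title given to young boys.  \n",
"As far as the chart shows, the result agrees with the account that women and children were rescued first.\n",
"Passengers with a Rare title are presumably a mix of people with various attributes, so it seems natural that their survival rate falls around the middle."
]
},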
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Drop the Name column\n",
"train_df.drop(labels = [\"Name\"], axis = 1, inplace = True)\n",
"test_df.drop(labels = [\"Name\"], axis = 1, inplace = True)"
]
},
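{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, the missing values. Cabin is missing in most rows, so it is dropped from the data this time.  \n",
"Age appears to be related to survival (a previous attempt showed that Age and Survived are correlated),  \n",
"so its missing values are filled with a suitable value.  \n",
"Since the title has now been extracted from the name, it is used to impute the age.\n",
"Something like kNN imputation would probably be better, but here a simple per-title summary value is used; the per-title medians are checked first."
]
},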
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.5\n",
"21.0\n",
"35.0\n",
"30.0\n",
"48.5\n"
]
}
],
"source": [
"print(train_df[train_df.Title==0].Age.dropna().median())\n",
"print(train_df[train_df.Title==1].Age.dropna().median())\n",
"print(train_df[train_df.Title==2].Age.dropna().median())\n",
"print(train_df[train_df.Title==3].Age.dropna().median())\n",
"print(train_df[train_df.Title==4].Age.dropna().median())"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# I could not figure out how to make fillna use a different value per row depending on attributes such as Title, so loop over the rows\n",
"# Switching the imputed value to the median lowered accuracy compared with the previous attempt; the reason is unclear.\n",
"for i, train in train_df.iterrows():\n",
"    if math.isnan(train.Age):\n",
"        # Use the mean age of passengers sharing this Title; fall back to the overall mean if Title is missing\n",
"        if not math.isnan(train.Title):\n",
"            train_df.at[i, 'Age'] = train_df[train_df.Title==train.Title].Age.dropna().mean()\n",
"        else:\n",
"            train_df.at[i, 'Age'] = train_df.Age.dropna().mean()\n",
"\n",
"for i, test in test_df.iterrows():\n",
"    if math.isnan(test.Age):\n",
"        if not math.isnan(test.Title):\n",
"            test_df.at[i, 'Age'] = test_df[test_df.Title==test.Title].Age.dropna().mean()\n",
"        else:\n",
"            test_df.at[i, 'Age'] = test_df.Age.dropna().mean()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Sex 0\n",
"Age 0\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"Title 0\n",
"dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# XXX: for now, simply drop as many features as possible\n",
"train_df = train_df.drop([\"Ticket\", \"SibSp\", \"Parch\", \"Fare\", \"Cabin\", \"Embarked\"], axis=1)\n",
"test_df = test_df.drop([\"Ticket\", \"SibSp\", \"Parch\", \"Fare\", \"Cabin\", \"Embarked\"], axis=1)\n",
"\n",
"# Prepare data\n",
"X_train = train_df.drop(['PassengerId', 'Survived'], axis=1).values\n",
"y_train = train_df.Survived.values\n",
"X_test = test_df.drop('PassengerId', axis=1).values\n",
"\n",
"model = RandomForestClassifier(n_estimators=100)\n",
"\n",
"# Predict with \"Random Forest\"\n",
"y_pred = model.fit(X_train, y_train).predict(X_test).astype(int)"
]
},
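{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: this article explains how to think about missing values.\n",
"[Python pandas: handling missing values, outliers, and discretization - StatsFragments](http://sinhrks.hatenablog.com/entry/2016/02/01/080859#%E6%AC%A0%E6%90%8D%E5%80%A4)\n",
"#### Missingness patterns and overview\n",
"- MCAR: values are missing completely at random (missingness is unrelated to the values of \"IQ\" or \"JobPerformance\")\n",
"- MAR: missingness is related to the values of other variables (\"JobPerformance\" is missing more often when \"IQ\" is low)\n",
"- MNAR: missingness is related to the missing value itself (\"JobPerformance\" is missing more often when its true value is low)"
]
},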
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.802\n"
]
}
],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import KFold\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Estimate performance with cross-validation\n",
"SPLIT_NUM = 4\n",
"sum_ = 0\n",
"kf = KFold(n_splits=SPLIT_NUM, shuffle=True)\n",
"for train, cv in kf.split(X_train):\n",
"    y_cv_pred = model.fit(X_train[train], y_train[train]).predict(X_train[cv]).astype(int)\n",
"    sum_ += accuracy_score(y_train[cv], y_cv_pred, normalize=True)\n",
"\n",
"print('accuracy: {:.3f}'.format(sum_/SPLIT_NUM))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Save prediction to csv file\n",
"submission = pd.DataFrame({\n",
" \"PassengerId\": test_df[\"PassengerId\"],\n",
" \"Survived\": y_pred\n",
" })\n",
"submission.to_csv('../output/submission.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# score: 0.71"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
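
The Age-imputation cell in the notebook loops over rows because a per-group `fillna` was not obvious at the time. Below is a minimal sketch of the same idea using `groupby` plus `transform`. It is an illustration only: the seven rows, the ages, and the small `Mlle`/`Col` mapping are made up for the example and are not the Kaggle data, and the median is used here although the notebook's loop uses the mean.

```python
import pandas as pd

# Hypothetical miniature passenger table (made-up ages; not the Kaggle data).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Allen, Mr. William Henry",
        "Heikkinen, Miss. Laina",
        "Sagesser, Mlle. Emma",
        "Palsson, Master. Gosta Leonard",
        "Rice, Master. Eugene",
        "Weir, Col. John",
    ],
    "Age": [22.0, None, 26.0, None, 2.0, None, None],
})

# Same rule as the notebook: names look like "surname, title. given names".
df["Title"] = (
    df["Name"].str.split(",").str[1].str.split(".").str[0].str.strip()
)

# Fold uncommon titles into coarse groups (illustrative subset of the mapping).
df["Title"] = df["Title"].replace({"Mlle": "Miss", "Col": "Rare"})

# Median age per title, broadcast back to every row of that title.
title_median = df.groupby("Title")["Age"].transform("median")

# Fill with the per-title median first, then with the overall median for
# titles whose ages are all missing (here: "Rare").
overall_median = df["Age"].median()
df["Age"] = df["Age"].fillna(title_median).fillna(overall_median)

print(df[["Title", "Age"]])
```

Swapping `"median"` for `"mean"` in `transform` would follow the notebook's loop more closely; the second `fillna` is there only so that a title with no known ages does not leave gaps.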