
@lasershow
Last active March 19, 2018 04:50
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Porto Seguro Tutorial: *end-to-end* Ensemble"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This environment comes with many helpful analytics libraries for Python 3, and is built from the Kaggle/Python Docker image: https://github.com/kaggle/docker-python\n",
"\n",
"Let's start by loading a few packages."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np # linear algebra\n",
"import pandas as pd # data processing: CSV file I/O (e.g. pd.read_csv)\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import roc_auc_score"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import StratifiedKFold"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"feature_stat.csv\n",
"sample_submission.7z\n",
"sample_submission.csv\n",
"test.7z\n",
"test.csv\n",
"train.7z\n",
"train.csv\n",
"traintest.csv\n",
"traintest_h20000.csv\n",
"\n"
]
}
],
"source": [
"import xgboost as xgb\n",
"import lightgbm as lgb\n",
"import time\n",
"\n",
"pd.set_option('display.max_columns', 500)\n",
"pd.set_option('display.max_colwidth', 500)\n",
"pd.set_option('display.max_rows', 1000)\n",
"\n",
"# Input data files are available in the \"../input/\" directory.\n",
"# Clicking Run (or pressing Shift+Enter) executes this cell and lists\n",
"# the files contained in the input directory.\n",
"\n",
"from subprocess import check_output\n",
"print(check_output([\"ls\", \"../input\"]).decode(\"utf8\"))\n",
"\n",
"# Any results you write to the current directory are saved as output."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To the Reader\n",
"This notebook is written for students who want to build up their skills through Kaggle. While many competitors end up racing to overfit the LB (leaderboard), here we focus on an approach built on solid CV (cross-validation).\n",
"Some students will of course be competing in teams, so please use kernels like this one to share information actively. Feel free to leave comments below. Enjoy!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Porto Seguro - End-to-end Ensemble\n",
"\n",
"In this challenge we build a predictive model that estimates whether a driver will file an insurance claim. After exploring the data beforehand, we performed some useful categorical feature encoding (feature engineering) and created a simple pipeline for model building.\n",
"\n",
"In this kernel we dive deeper into modelling. First we look at the technique of producing out-of-fold (OOF) training and test predictions with several models. Later we combine these into an ensemble model.\n",
"\n",
"The kind of ensembling we perform here is generally known as ensemble learning, and is nicely summarised in the following posts:\n",
"\n",
"* [Kaggle Ensemble Guide](https://mlwave.com/kaggle-ensembling-guide/) (by [Triskelion](https://www.kaggle.com/triskelion))\n",
"* [Stacking Made Easy: An Introduction to StackNet](http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/) (by Grandmaster [Marios Michailidis (KazAnova)](https://www.kaggle.com/kazanova))\n",
"\n",
"\n",
"Let's get started!"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"train=pd.read_csv('../input/train.csv')\n",
"test=pd.read_csv('../input/test.csv')\n",
"sample_submission=pd.read_csv('../input/sample_submission.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Categorical Feature Encoding and Feature Reduction\n",
"This is the same as in the data-preprocessing kernel, so see [here](https://www.kaggle.com/yifanxie/porto-seguro-tutorial-simple-e2e-pipeline) for the details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Frequency Encoding"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Take the columns 'cols' from the train and test data and perform frequency (count) encoding on them.\n",
"def freq_encoding(cols, train_df, test_df):\n",
" # The new features are accumulated in the following result DataFrames.\n",
" result_train_df=pd.DataFrame()\n",
" result_test_df=pd.DataFrame()\n",
" \n",
" # Loop over each feature column.\n",
" for col in cols:\n",
" \n",
" # Capture the frequency of the feature's levels in the training set as a DataFrame.\n",
" col_freq=col+'_freq'\n",
" freq=train_df[col].value_counts()\n",
" freq=pd.DataFrame(freq)\n",
" freq.reset_index(inplace=True)\n",
" freq.columns=[col,col_freq]\n",
"\n",
" # Merge the 'freq' DataFrame with the training data.\n",
" temp_train_df=pd.merge(train_df[[col]], freq, how='left', on=col)\n",
" temp_train_df.drop([col], axis=1, inplace=True)\n",
"\n",
" # Merge the 'freq' DataFrame with the test data.\n",
" temp_test_df=pd.merge(test_df[[col]], freq, how='left', on=col)\n",
" temp_test_df.drop([col], axis=1, inplace=True)\n",
"\n",
" # For levels that appear in the test set but not in the training set, set the frequency to 0.\n",
" temp_test_df.fillna(0, inplace=True)\n",
" temp_test_df[col_freq]=temp_test_df[col_freq].astype(np.int32)\n",
"\n",
" if result_train_df.shape[0]==0:\n",
" result_train_df=temp_train_df\n",
" result_test_df=temp_test_df\n",
" else:\n",
" result_train_df=pd.concat([result_train_df, temp_train_df],axis=1)\n",
" result_test_df=pd.concat([result_test_df, temp_test_df],axis=1)\n",
" \n",
" return result_train_df, result_test_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's run the frequency-encoding function."
]
},
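{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The cell that actually ran it is missing from this gist, so the following is a minimal sketch: it only inspects the encoded output, and the frequency features are not merged back into `train`/`test` in this run.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: frequency-encode two of the categorical columns and peek at the result.\n",
"train_freq, test_freq=freq_encoding(['ps_ind_02_cat', 'ps_car_04_cat'], train, test)\n",
"train_freq.head(5)"
]
},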
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Binary Encoding"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Create a function that converts categorical variables into binary representation.\n",
"# It takes the training set, the test set, and the feature to be encoded, and\n",
"# returns the two datasets with the input feature converted into binary form.\n",
"# The function assumes the feature to be encoded has already been converted into\n",
"# numerical form in the range 0 to n-1, where n is the number of levels of the feature.\n",
"\n",
"def binary_encoding(train_df, test_df, feat):\n",
" # Compute the maximum value used in the numerical conversion.\n",
" train_feat_max = train_df[feat].max()\n",
" test_feat_max = test_df[feat].max()\n",
" if train_feat_max > test_feat_max:\n",
" feat_max = train_feat_max\n",
" else:\n",
" feat_max = test_feat_max\n",
" \n",
" # Use feat_max+1 for missing values.\n",
" train_df.loc[train_df[feat] == -1, feat] = feat_max + 1\n",
" test_df.loc[test_df[feat] == -1, feat] = feat_max + 1\n",
" \n",
" # Create the union of all possible values of the feature.\n",
" union_val = np.union1d(train_df[feat].unique(), test_df[feat].unique())\n",
"\n",
" # Extract the maximum value of the feature in decimal form.\n",
" max_dec = union_val.max()\n",
" \n",
" # Compute the number of digits needed to represent max_dec in binary.\n",
" max_bin_len = len(\"{0:b}\".format(max_dec))\n",
" index = np.arange(len(union_val))\n",
" columns = list([feat])\n",
" \n",
" # Create a DataFrame for the binary-encoded feature that covers every level of the feature.\n",
" bin_df = pd.DataFrame(index=index, columns=columns)\n",
" bin_df[feat] = union_val\n",
" \n",
" # Get the binary representation of each level of the feature.\n",
" feat_bin = bin_df[feat].apply(lambda x: \"{0:b}\".format(x).zfill(max_bin_len))\n",
" \n",
" # Split the binary representation into separate digit columns.\n",
" splitted = feat_bin.apply(lambda x: pd.Series(list(x)).astype(np.uint8))\n",
" splitted.columns = [feat + '_bin_' + str(x) for x in splitted.columns]\n",
" bin_df = bin_df.join(splitted)\n",
" \n",
" # Merge the binary-feature DataFrame into the training and test sets, and we are done!\n",
" train_df = pd.merge(train_df, bin_df, how='left', on=[feat])\n",
" test_df = pd.merge(test_df, bin_df, how='left', on=[feat])\n",
" return train_df, test_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's run the binary-encoding function."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"cat_cols=['ps_ind_02_cat','ps_car_04_cat', 'ps_car_09_cat',\n",
" 'ps_ind_05_cat', 'ps_car_01_cat']\n",
"\n",
"train, test=binary_encoding(train, test, 'ps_ind_02_cat')\n",
"train, test=binary_encoding(train, test, 'ps_car_04_cat')\n",
"train, test=binary_encoding(train, test, 'ps_car_09_cat')\n",
"train, test=binary_encoding(train, test, 'ps_ind_05_cat')\n",
"train, test=binary_encoding(train, test, 'ps_car_01_cat')"
]
},
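{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check (illustrative, not part of the original run): `ps_ind_02_cat` should now sit alongside companion columns `ps_ind_02_cat_bin_0` to `ps_ind_02_cat_bin_2` holding its binary digits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the original column together with its binary-digit companions.\n",
"train[[c for c in train.columns if c.startswith('ps_ind_02_cat')]].head(5)"
]
},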
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, the original categorical features could be dropped at this point. Here we keep them, since LightGBM will use them as categorical features later on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Feature Reduction\n",
"Let's drop all the features whose names start with 'ps_calc'."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]\n",
"train.drop(col_to_drop, axis=1, inplace=True) \n",
"test.drop(col_to_drop, axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a quick look at the dataset after the processing above."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>target</th>\n",
" <th>ps_ind_01</th>\n",
" <th>ps_ind_02_cat</th>\n",
" <th>ps_ind_03</th>\n",
" <th>ps_ind_04_cat</th>\n",
" <th>ps_ind_05_cat</th>\n",
" <th>ps_ind_06_bin</th>\n",
" <th>ps_ind_07_bin</th>\n",
" <th>ps_ind_08_bin</th>\n",
" <th>ps_ind_09_bin</th>\n",
" <th>ps_ind_10_bin</th>\n",
" <th>ps_ind_11_bin</th>\n",
" <th>ps_ind_12_bin</th>\n",
" <th>ps_ind_13_bin</th>\n",
" <th>ps_ind_14</th>\n",
" <th>ps_ind_15</th>\n",
" <th>ps_ind_16_bin</th>\n",
" <th>ps_ind_17_bin</th>\n",
" <th>ps_ind_18_bin</th>\n",
" <th>ps_reg_01</th>\n",
" <th>ps_reg_02</th>\n",
" <th>ps_reg_03</th>\n",
" <th>ps_car_01_cat</th>\n",
" <th>ps_car_02_cat</th>\n",
" <th>ps_car_03_cat</th>\n",
" <th>ps_car_04_cat</th>\n",
" <th>ps_car_05_cat</th>\n",
" <th>ps_car_06_cat</th>\n",
" <th>ps_car_07_cat</th>\n",
" <th>ps_car_08_cat</th>\n",
" <th>ps_car_09_cat</th>\n",
" <th>ps_car_10_cat</th>\n",
" <th>ps_car_11_cat</th>\n",
" <th>ps_car_11</th>\n",
" <th>ps_car_12</th>\n",
" <th>ps_car_13</th>\n",
" <th>ps_car_14</th>\n",
" <th>ps_car_15</th>\n",
" <th>ps_ind_02_cat_bin_0</th>\n",
" <th>ps_ind_02_cat_bin_1</th>\n",
" <th>ps_ind_02_cat_bin_2</th>\n",
" <th>ps_car_04_cat_bin_0</th>\n",
" <th>ps_car_04_cat_bin_1</th>\n",
" <th>ps_car_04_cat_bin_2</th>\n",
" <th>ps_car_04_cat_bin_3</th>\n",
" <th>ps_car_09_cat_bin_0</th>\n",
" <th>ps_car_09_cat_bin_1</th>\n",
" <th>ps_car_09_cat_bin_2</th>\n",
" <th>ps_ind_05_cat_bin_0</th>\n",
" <th>ps_ind_05_cat_bin_1</th>\n",
" <th>ps_ind_05_cat_bin_2</th>\n",
" <th>ps_car_01_cat_bin_0</th>\n",
" <th>ps_car_01_cat_bin_1</th>\n",
" <th>ps_car_01_cat_bin_2</th>\n",
" <th>ps_car_01_cat_bin_3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0.7</td>\n",
" <td>0.2</td>\n",
" <td>0.718070</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>12</td>\n",
" <td>2</td>\n",
" <td>0.400000</td>\n",
" <td>0.883679</td>\n",
" <td>0.370810</td>\n",
" <td>3.605551</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>9</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0.8</td>\n",
" <td>0.4</td>\n",
" <td>0.766078</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>19</td>\n",
" <td>3</td>\n",
" <td>0.316228</td>\n",
" <td>0.618817</td>\n",
" <td>0.388716</td>\n",
" <td>2.449490</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>13</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-1.000000</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>14</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>60</td>\n",
" <td>1</td>\n",
" <td>0.316228</td>\n",
" <td>0.641586</td>\n",
" <td>0.347275</td>\n",
" <td>3.316625</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>16</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.9</td>\n",
" <td>0.2</td>\n",
" <td>0.580948</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>104</td>\n",
" <td>1</td>\n",
" <td>0.374166</td>\n",
" <td>0.542949</td>\n",
" <td>0.294958</td>\n",
" <td>2.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>17</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.7</td>\n",
" <td>0.6</td>\n",
" <td>0.840759</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>14</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>82</td>\n",
" <td>3</td>\n",
" <td>0.316070</td>\n",
" <td>0.565832</td>\n",
" <td>0.365103</td>\n",
" <td>2.000000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat \\\n",
"0 7 0 2 2 5 1 \n",
"1 9 0 1 1 7 0 \n",
"2 13 0 5 4 9 1 \n",
"3 16 0 0 1 2 0 \n",
"4 17 0 0 2 0 1 \n",
"\n",
" ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin \\\n",
"0 0 0 1 0 0 \n",
"1 0 0 0 1 0 \n",
"2 0 0 0 1 0 \n",
"3 0 1 0 0 0 \n",
"4 0 1 0 0 0 \n",
"\n",
" ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 \\\n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 \n",
"\n",
" ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 \\\n",
"0 11 0 1 0 0.7 \n",
"1 3 0 0 1 0.8 \n",
"2 12 1 0 0 0.0 \n",
"3 8 1 0 0 0.9 \n",
"4 9 1 0 0 0.7 \n",
"\n",
" ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat ps_car_03_cat \\\n",
"0 0.2 0.718070 10 1 -1 \n",
"1 0.4 0.766078 11 1 -1 \n",
"2 0.0 -1.000000 7 1 -1 \n",
"3 0.2 0.580948 7 1 0 \n",
"4 0.6 0.840759 11 1 -1 \n",
"\n",
" ps_car_04_cat ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat \\\n",
"0 0 1 4 1 0 \n",
"1 0 -1 11 1 1 \n",
"2 0 -1 14 1 1 \n",
"3 0 1 11 1 1 \n",
"4 0 -1 14 1 1 \n",
"\n",
" ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11 ps_car_12 \\\n",
"0 0 1 12 2 0.400000 \n",
"1 2 1 19 3 0.316228 \n",
"2 2 1 60 1 0.316228 \n",
"3 3 1 104 1 0.374166 \n",
"4 2 1 82 3 0.316070 \n",
"\n",
" ps_car_13 ps_car_14 ps_car_15 ps_ind_02_cat_bin_0 ps_ind_02_cat_bin_1 \\\n",
"0 0.883679 0.370810 3.605551 0 1 \n",
"1 0.618817 0.388716 2.449490 0 0 \n",
"2 0.641586 0.347275 3.316625 1 0 \n",
"3 0.542949 0.294958 2.000000 0 0 \n",
"4 0.565832 0.365103 2.000000 0 1 \n",
"\n",
" ps_ind_02_cat_bin_2 ps_car_04_cat_bin_0 ps_car_04_cat_bin_1 \\\n",
"0 0 0 0 \n",
"1 1 0 0 \n",
"2 0 0 0 \n",
"3 1 0 0 \n",
"4 0 0 0 \n",
"\n",
" ps_car_04_cat_bin_2 ps_car_04_cat_bin_3 ps_car_09_cat_bin_0 \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" ps_car_09_cat_bin_1 ps_car_09_cat_bin_2 ps_ind_05_cat_bin_0 \\\n",
"0 0 0 0 \n",
"1 1 0 0 \n",
"2 1 0 0 \n",
"3 1 1 0 \n",
"4 1 0 0 \n",
"\n",
" ps_ind_05_cat_bin_1 ps_ind_05_cat_bin_2 ps_car_01_cat_bin_0 \\\n",
"0 0 0 1 \n",
"1 0 0 1 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 1 \n",
"\n",
" ps_car_01_cat_bin_1 ps_car_01_cat_bin_2 ps_car_01_cat_bin_3 \n",
"0 0 1 0 \n",
"1 0 1 1 \n",
"2 1 1 1 \n",
"3 1 1 1 \n",
"4 0 1 1 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. K-fold CV (k-fold cross-validation) and Out-of-Fold (OOF) Predictions\n",
"Note: for demonstration purposes the parameters here are reduced to keep the run time short. When doing this for real, experiment with different parameter combinations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Creating Helper Functions for OOF\n",
"First we write a function that computes the normalized Gini coefficient from an AUC score."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def auc_to_gini_norm(auc_score):\n",
" return 2*auc_score-1"
]
},
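{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an illustrative sketch, not in the original run): a random classifier has an AUC of 0.5, which maps to a normalized Gini of 0, while a perfect classifier's AUC of 1.0 maps to a Gini of 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2*0.5-1 == 0.0 and 2*1.0-1 == 1.0\n",
"auc_to_gini_norm(0.5), auc_to_gini_norm(1.0)"
]
},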
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1.1 Sklearn K-fold and OOF Function\n",
"We write a k-fold function that produces OOF predictions for the training and test data."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"def cross_validate_sklearn(clf, x_train, y_train, x_test, kf, scale=False, verbose=True):\n",
" start_time=time.time()\n",
" \n",
" # Initialize the sizes of the out-of-fold train and test predictions.\n",
" train_pred = np.zeros((x_train.shape[0]))\n",
" test_pred = np.zeros((x_test.shape[0]))\n",
"\n",
" # Use the k-fold object to generate the required folds.\n",
" for i, (train_index, test_index) in enumerate(kf.split(x_train, y_train)):\n",
" # Create the training fold and the validation fold.\n",
" x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[test_index, :]\n",
" y_train_kf, y_val_kf = y_train[train_index], y_train[test_index]\n",
"\n",
" # Perform scaling if required (i.e. for linear algorithms).\n",
" if scale:\n",
" scaler = StandardScaler().fit(x_train_kf.values)\n",
" x_train_kf_values = scaler.transform(x_train_kf.values)\n",
" x_val_kf_values = scaler.transform(x_val_kf.values)\n",
" x_test_values = scaler.transform(x_test.values)\n",
" else:\n",
" x_train_kf_values = x_train_kf.values\n",
" x_val_kf_values = x_val_kf.values\n",
" x_test_values = x_test.values\n",
" \n",
" # Fit the input classifier and perform the predictions.\n",
" clf.fit(x_train_kf_values, y_train_kf.values)\n",
" val_pred=clf.predict_proba(x_val_kf_values)[:,1]\n",
" train_pred[test_index] += val_pred\n",
"\n",
" y_test_preds = clf.predict_proba(x_test_values)[:,1]\n",
" test_pred += y_test_preds\n",
"\n",
" fold_auc = roc_auc_score(y_val_kf.values, val_pred)\n",
" fold_gini_norm = auc_to_gini_norm(fold_auc)\n",
"\n",
" if verbose:\n",
" print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))\n",
"\n",
" test_pred /= kf.n_splits\n",
"\n",
" cv_auc = roc_auc_score(y_train, train_pred)\n",
" cv_gini_norm = auc_to_gini_norm(cv_auc)\n",
" cv_score = [cv_auc, cv_gini_norm]\n",
" if verbose:\n",
" print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))\n",
" end_time = time.time()\n",
" print(\"it takes %.3f seconds to perform cross validation\" % (end_time - start_time))\n",
" return cv_score, train_pred,test_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1.2 Xgboost K-fold and OOF Function\n",
"Here we use the native interfaces of XGB and LGB. Using the sklearn API is of course an option, but the native interfaces have a few advantages. As far as I know, for example, XGB's histogram-based 'hist' tree method is only available through the native interface.\n",
"\n",
"For these two OOF functions we also prepare a function that converts probabilities into ranks. The reason for using normalized ranks instead of predicted probabilities will become clear later."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def probability_to_rank(prediction, scaler=1):\n",
" pred_df=pd.DataFrame(columns=['probability'])\n",
" pred_df['probability']=prediction\n",
" pred_df['rank']=pred_df['probability'].rank()/len(prediction)*scaler\n",
" return pred_df['rank'].values"
]
},
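{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `probability_to_rank` does (an illustrative sketch, not in the original notebook): each value is replaced by its rank divided by the array length, so the output lies in (0, 1] and preserves the ordering, and hence the AUC, of the input."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The ranks of [0.9, 0.1, 0.5] are [3, 1, 2]; dividing by 3 gives roughly [1.0, 0.33, 0.67].\n",
"probability_to_rank(np.array([0.9, 0.1, 0.5]))"
]
},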
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we create the k-fold function for XGB that computes the OOF predictions. It is similar to the sklearn version, but we need to use the XGB interface to run the classifier, and building our own version also lets us add the option of converting probabilities into ranks later."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def cross_validate_xgb(params, x_train, y_train, x_test, kf, cat_cols=[], verbose=True, \n",
" verbose_eval=50, num_boost_round=4000, use_rank=True):\n",
" start_time=time.time()\n",
"\n",
" train_pred = np.zeros((x_train.shape[0]))\n",
" test_pred = np.zeros((x_test.shape[0]))\n",
"\n",
" # Use the k-fold object to index the training and validation folds.\n",
" for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)): # folds 1, 2, 3, 4, 5\n",
" # e.g. folds 1-4 for training, fold 5 for validation\n",
" x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]\n",
" y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]\n",
" x_test_kf=x_test.copy()\n",
"\n",
" d_train_kf = xgb.DMatrix(x_train_kf, label=y_train_kf)\n",
" d_val_kf = xgb.DMatrix(x_val_kf, label=y_val_kf)\n",
" d_test = xgb.DMatrix(x_test_kf)\n",
"\n",
" bst = xgb.train(params, d_train_kf, num_boost_round=num_boost_round,\n",
" evals=[(d_train_kf, 'train'), (d_val_kf, 'val')], verbose_eval=verbose_eval,\n",
" early_stopping_rounds=50)\n",
"\n",
" val_pred = bst.predict(d_val_kf, ntree_limit=bst.best_ntree_limit)\n",
" if use_rank:\n",
" train_pred[val_index] += probability_to_rank(val_pred)\n",
" test_pred+=probability_to_rank(bst.predict(d_test))\n",
" else:\n",
" train_pred[val_index] += val_pred\n",
" test_pred+=bst.predict(d_test)\n",
"\n",
" fold_auc = roc_auc_score(y_val_kf.values, val_pred)\n",
" fold_gini_norm = auc_to_gini_norm(fold_auc)\n",
"\n",
" if verbose:\n",
" print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, \n",
" fold_gini_norm))\n",
"\n",
" test_pred /= kf.n_splits\n",
"\n",
" cv_auc = roc_auc_score(y_train, train_pred)\n",
" cv_gini_norm = auc_to_gini_norm(cv_auc)\n",
" cv_score = [cv_auc, cv_gini_norm]\n",
" if verbose:\n",
" print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))\n",
" end_time = time.time()\n",
" print(\"it takes %.3f seconds to perform cross validation\" % (end_time - start_time))\n",
"\n",
" return cv_score, train_pred,test_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1.3 LightGBM K-fold and OOF Function\n",
"We create a similar function for LGB. Apart from the code that calls the LightGBM interface, it is almost identical to the XGB version."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"def cross_validate_lgb(params, x_train, y_train, x_test, kf, cat_cols=[],\n",
" verbose=True, verbose_eval=50, use_cat=True, use_rank=True):\n",
" start_time = time.time()\n",
" train_pred = np.zeros((x_train.shape[0]))\n",
" test_pred = np.zeros((x_test.shape[0]))\n",
"\n",
" if len(cat_cols)==0: use_cat=False\n",
"\n",
" # Use the k-fold object to index the training and validation folds.\n",
" for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train)): # folds 1, 2, 3, 4, 5\n",
" # e.g. folds 1-4 for training, fold 5 for validation\n",
" x_train_kf, x_val_kf = x_train.loc[train_index, :], x_train.loc[val_index, :]\n",
" y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]\n",
"\n",
" if use_cat:\n",
" lgb_train = lgb.Dataset(x_train_kf, y_train_kf, categorical_feature=cat_cols)\n",
" lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train, categorical_feature=cat_cols)\n",
" else:\n",
" lgb_train = lgb.Dataset(x_train_kf, y_train_kf)\n",
" lgb_val = lgb.Dataset(x_val_kf, y_val_kf, reference=lgb_train)\n",
"\n",
" gbm = lgb.train(params,\n",
" lgb_train,\n",
" num_boost_round=4000,\n",
" valid_sets=lgb_val,\n",
" early_stopping_rounds=30,\n",
" verbose_eval=verbose_eval)\n",
"\n",
" val_pred = gbm.predict(x_val_kf)\n",
"\n",
" if use_rank:\n",
" train_pred[val_index] += probability_to_rank(val_pred)\n",
" test_pred += probability_to_rank(gbm.predict(x_test))\n",
" else:\n",
" train_pred[val_index] += val_pred\n",
" test_pred += gbm.predict(x_test)\n",
"\n",
" fold_auc = roc_auc_score(y_val_kf.values, val_pred)\n",
" fold_gini_norm = auc_to_gini_norm(fold_auc)\n",
" if verbose:\n",
" print('fold cv {} AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(i, fold_auc, fold_gini_norm))\n",
"\n",
" test_pred /= kf.n_splits\n",
"\n",
" cv_auc = roc_auc_score(y_train, train_pred)\n",
" cv_gini_norm = auc_to_gini_norm(cv_auc)\n",
" cv_score = [cv_auc, cv_gini_norm]\n",
" if verbose:\n",
" print('cv AUC score is {:.6f}, Gini_Norm score is {:.6f}'.format(cv_auc, cv_gini_norm))\n",
" end_time = time.time()\n",
" print(\"it takes %.3f seconds to perform cross validation\" % (end_time - start_time))\n",
" return cv_score, train_pred,test_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Creating Level-1 OOF Predictions\n",
"Once we prepare the train and test data for the machine learning algorithms and create a StratifiedKFold object, we are almost ready to produce the level-1 OOF outputs."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"drop_cols=['id','target']\n",
"y_train=train['target']\n",
"x_train=train.drop(drop_cols, axis=1)\n",
"x_test=test.drop(['id'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For stacking, make sure the same fold distribution is used across all levels and all models. The technical reasons for this are explained in detail by fellow competitors on the forum for this challenge."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"kf=StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the preparation is finally done, let's move on to the level-1 model outputs!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Random Forest\n",
"Let's try a random forest."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.627738, Gini_Norm score is 0.255476\n",
"fold cv 1 AUC score is 0.628162, Gini_Norm score is 0.256324\n",
"fold cv 2 AUC score is 0.628498, Gini_Norm score is 0.256995\n",
"fold cv 3 AUC score is 0.624774, Gini_Norm score is 0.249548\n",
"fold cv 4 AUC score is 0.635280, Gini_Norm score is 0.270560\n",
"cv AUC score is 0.628841, Gini_Norm score is 0.257682\n",
"it takes 119.565 seconds to perform cross validation\n"
]
}
],
"source": [
"rf=RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,\n",
" criterion='gini', random_state=0)\n",
"\n",
"outcomes =cross_validate_sklearn(rf, x_train, y_train ,x_test, kf, scale=False, verbose=True)\n",
"\n",
"rf_cv=outcomes[0]\n",
"rf_train_pred=outcomes[1]\n",
"rf_test_pred=outcomes[2]\n",
"\n",
"rf_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=rf_train_pred)\n",
"rf_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=rf_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 More Trees\n",
"Let's throw in even more trees!"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.623916, Gini_Norm score is 0.247832\n",
"fold cv 1 AUC score is 0.622989, Gini_Norm score is 0.245978\n",
"fold cv 2 AUC score is 0.624352, Gini_Norm score is 0.248703\n",
"fold cv 3 AUC score is 0.619732, Gini_Norm score is 0.239464\n",
"fold cv 4 AUC score is 0.630961, Gini_Norm score is 0.261922\n",
"cv AUC score is 0.624357, Gini_Norm score is 0.248714\n",
"it takes 47.035 seconds to perform cross validation\n"
]
}
],
"source": [
"et=ExtraTreesClassifier(n_estimators=100, n_jobs=6, min_samples_split=5, max_depth=5,\n",
" criterion='gini', random_state=0)\n",
"\n",
"outcomes =cross_validate_sklearn(et, x_train, y_train ,x_test, kf, scale=False, verbose=True)\n",
"\n",
"et_cv=outcomes[0]\n",
"et_train_pred=outcomes[1]\n",
"et_test_pred=outcomes[2]\n",
"\n",
"et_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=et_train_pred)\n",
"et_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=et_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3 Logistic Regression\n",
"Let's try the good old logistic regression."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.627116, Gini_Norm score is 0.254233\n",
"fold cv 1 AUC score is 0.624198, Gini_Norm score is 0.248397\n",
"fold cv 2 AUC score is 0.626947, Gini_Norm score is 0.253894\n",
"fold cv 3 AUC score is 0.626684, Gini_Norm score is 0.253367\n",
"fold cv 4 AUC score is 0.633531, Gini_Norm score is 0.267061\n",
"cv AUC score is 0.627682, Gini_Norm score is 0.255364\n",
"it takes 36.037 seconds to perform cross validation\n"
]
}
],
"source": [
"logit=LogisticRegression(random_state=0, C=0.5)\n",
"\n",
"outcomes = cross_validate_sklearn(logit, x_train, y_train ,x_test, kf, scale=True, verbose=True)\n",
"\n",
"logit_cv=outcomes[0]\n",
"logit_train_pred=outcomes[1]\n",
"logit_test_pred=outcomes[2]\n",
"\n",
"logit_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=logit_train_pred)\n",
"logit_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=logit_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.4 Bernoulli Naive Bayes\n",
"Naive Bayes may be less familiar, and it does not produce outputs that rival XGB or LGB, but it is useful for improving the overall performance of the stack."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.618216, Gini_Norm score is 0.236431\n",
"fold cv 1 AUC score is 0.617390, Gini_Norm score is 0.234780\n",
"fold cv 2 AUC score is 0.619533, Gini_Norm score is 0.239066\n",
"fold cv 3 AUC score is 0.618452, Gini_Norm score is 0.236905\n",
"fold cv 4 AUC score is 0.628443, Gini_Norm score is 0.256886\n",
"cv AUC score is 0.620379, Gini_Norm score is 0.240758\n",
"it takes 29.043 seconds to perform cross validation\n"
]
}
],
"source": [
"nb=BernoulliNB()\n",
"\n",
"outcomes =cross_validate_sklearn(nb, x_train, y_train ,x_test, kf, scale=True, verbose=True)\n",
"\n",
"nb_cv=outcomes[0]\n",
"nb_train_pred=outcomes[1]\n",
"nb_test_pred=outcomes[2]\n",
"\n",
"nb_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=nb_train_pred)\n",
"nb_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=nb_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.5 XGB\n",
"最強のGBMバズーカを使ってみましょう。"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.642407, Gini_Norm score is 0.284814\n",
"fold cv 1 AUC score is 0.640669, Gini_Norm score is 0.281339\n",
"fold cv 2 AUC score is 0.642415, Gini_Norm score is 0.284829\n",
"fold cv 3 AUC score is 0.640377, Gini_Norm score is 0.280754\n",
"fold cv 4 AUC score is 0.645840, Gini_Norm score is 0.291680\n",
"cv AUC score is 0.642265, Gini_Norm score is 0.284530\n",
"it takes 121.876 seconds to perform cross validation\n"
]
}
],
"source": [
"xgb_params = {\n",
" \"booster\" : \"gbtree\", \n",
" \"objective\" : \"binary:logistic\",\n",
" \"tree_method\": \"hist\",\n",
" \"eval_metric\": \"auc\",\n",
" \"eta\": 0.1,\n",
" \"max_depth\": 5,\n",
" \"min_child_weight\": 10,\n",
" \"gamma\": 0.70,\n",
" \"subsample\": 0.76,\n",
" \"colsample_bytree\": 0.95,\n",
" \"nthread\": 6,\n",
" \"seed\": 0,\n",
" 'silent': 1\n",
"}\n",
"\n",
"outcomes=cross_validate_xgb(xgb_params, x_train, y_train, x_test, kf, use_rank=False, verbose_eval=False)\n",
"\n",
"xgb_cv=outcomes[0]\n",
"xgb_train_pred=outcomes[1]\n",
"xgb_test_pred=outcomes[2]\n",
"\n",
"xgb_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=xgb_train_pred)\n",
"xgb_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=xgb_test_pred)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.6 LightGBM"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/yifan/.conda/envs/dragons/lib/python3.6/site-packages/lightgbm/basic.py:1030: UserWarning: Using categorical_feature in Dataset.\n",
" warnings.warn('Using categorical_feature in Dataset.')\n",
"/home/yifan/.conda/envs/dragons/lib/python3.6/site-packages/lightgbm/basic.py:671: UserWarning: categorical_feature in param dict is overrided.\n",
" warnings.warn('categorical_feature in param dict is overrided.')\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.640417, Gini_Norm score is 0.280834\n",
"fold cv 1 AUC score is 0.640079, Gini_Norm score is 0.280157\n",
"fold cv 2 AUC score is 0.641063, Gini_Norm score is 0.282127\n",
"fold cv 3 AUC score is 0.639120, Gini_Norm score is 0.278241\n",
"fold cv 4 AUC score is 0.646300, Gini_Norm score is 0.292600\n",
"cv AUC score is 0.640832, Gini_Norm score is 0.281665\n",
"it takes 65.087 seconds to perform cross validation\n"
]
}
],
"source": [
"lgb_params = {\n",
" 'task': 'train',\n",
" 'boosting_type': 'dart',\n",
" 'objective': 'binary',\n",
" 'metric': {'auc'},\n",
" 'num_leaves': 22,\n",
" 'min_sum_hessian_in_leaf': 20,\n",
" 'max_depth': 5,\n",
" 'learning_rate': 0.1, # 0.618580\n",
" 'num_threads': 6,\n",
" 'feature_fraction': 0.6894,\n",
" 'bagging_fraction': 0.4218,\n",
" 'max_drop': 5,\n",
" 'drop_rate': 0.0123,\n",
" 'min_data_in_leaf': 10,\n",
" 'bagging_freq': 1,\n",
" 'lambda_l1': 1,\n",
" 'lambda_l2': 0.01,\n",
" 'verbose': 1\n",
"}\n",
"\n",
"\n",
"cat_cols=['ps_ind_02_cat','ps_car_04_cat', 'ps_car_09_cat','ps_ind_05_cat', 'ps_car_01_cat']\n",
"outcomes=cross_validate_lgb(lgb_params,x_train, y_train ,x_test,kf, cat_cols, use_cat=True, \n",
" verbose_eval=False, use_rank=False)\n",
"\n",
"lgb_cv=outcomes[0]\n",
"lgb_train_pred=outcomes[1]\n",
"lgb_test_pred=outcomes[2]\n",
"\n",
"lgb_train_pred_df=pd.DataFrame(columns=['prediction_probability'], data=lgb_train_pred)\n",
"lgb_test_pred_df=pd.DataFrame(columns=['prediction_probability'], data=lgb_test_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"レベル1が準備できたとことで、次に進みましょう!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. レベル2アンサンブル\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 レベル1出力データフレームの作成\n",
"レベル1のOOF予測結果をまとめてレベル2スタッキングの入力データを作りましょう。"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"columns=['rf','et','logit','nb','xgb','lgb']\n",
"train_pred_df_list=[rf_train_pred_df, et_train_pred_df, logit_train_pred_df, nb_train_pred_df,\n",
" xgb_train_pred_df, lgb_train_pred_df]\n",
"\n",
"test_pred_df_list=[rf_test_pred_df, et_test_pred_df, logit_test_pred_df, nb_test_pred_df,\n",
" xgb_test_pred_df, lgb_test_pred_df]\n",
"\n",
"lv1_train_df=pd.DataFrame(columns=columns)\n",
"lv1_test_df=pd.DataFrame(columns=columns)\n",
"\n",
"for i in range(0,len(columns)):\n",
" lv1_train_df[columns[i]]=train_pred_df_list[i]['prediction_probability']\n",
" lv1_test_df[columns[i]]=test_pred_df_list[i]['prediction_probability']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 レベル2 XGB\n",
"レベル2にもブースティングを使ってみましょう!\n",
"\n",
"…やり方は先ほどと全く同じなのでしょうか?"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.642236, Gini_Norm score is 0.284472\n",
"fold cv 1 AUC score is 0.642141, Gini_Norm score is 0.284282\n",
"fold cv 2 AUC score is 0.643130, Gini_Norm score is 0.286260\n",
"fold cv 3 AUC score is 0.640631, Gini_Norm score is 0.281262\n",
"fold cv 4 AUC score is 0.647816, Gini_Norm score is 0.295632\n",
"cv AUC score is 0.581720, Gini_Norm score is 0.163440\n",
"it takes 33.358 seconds to perform cross validation\n"
]
}
],
"source": [
"xgb_lv2_outcomes=cross_validate_xgb(xgb_params, lv1_train_df, y_train, lv1_test_df, kf, \n",
" verbose=True, verbose_eval=False, use_rank=False)\n",
"\n",
"xgb_lv2_cv=xgb_lv2_outcomes[0]\n",
"xgb_lv2_train_pred=xgb_lv2_outcomes[1]\n",
"xgb_lv2_test_pred=xgb_lv2_outcomes[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"何かおかしいようです。CVスコアはなかなか良いですが、全体の学習CVスコアが落ちてしまいました。AUCとジニスコアはランキングにもとづいた指標なので、XGBとLGBをレベル2スタッキングで用いると、各フォルドで計算された予測スコアを合わせたときにランキングがおかしくなってしまうのです。実はこれを防ぐために先ほど確率をランクに変換する関数を書きました。\n",
"\n",
"*use_rank*を使ってみましょう。"
]
},
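  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook's own rank helper is defined earlier; as a minimal illustration of the idea (this toy `probability_to_rank` is a sketch, not the notebook's implementation), ranking makes fold predictions comparable even when their probability scales differ:\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def probability_to_rank(pred):\n",
    "    # 0-based rank of each element, scaled into (0, 1]\n",
    "    order = pred.argsort().argsort()\n",
    "    return (order + 1) / len(pred)\n",
    "\n",
    "# two folds on different probability scales but with the same ordering\n",
    "fold_a = np.array([0.10, 0.40, 0.20])\n",
    "fold_b = np.array([0.55, 0.90, 0.70])\n",
    "print(probability_to_rank(fold_a))  # identical ranks for both folds\n",
    "print(probability_to_rank(fold_b))\n",
    "```"
   ]
  },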
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.642236, Gini_Norm score is 0.284472\n",
"fold cv 1 AUC score is 0.642141, Gini_Norm score is 0.284282\n",
"fold cv 2 AUC score is 0.643130, Gini_Norm score is 0.286260\n",
"fold cv 3 AUC score is 0.640631, Gini_Norm score is 0.281262\n",
"fold cv 4 AUC score is 0.647816, Gini_Norm score is 0.295632\n",
"cv AUC score is 0.643200, Gini_Norm score is 0.286399\n",
"it takes 31.880 seconds to perform cross validation\n"
]
}
],
"source": [
"xgb_lv2_outcomes=cross_validate_xgb(xgb_params, lv1_train_df, y_train, lv1_test_df, kf, \n",
" verbose=True, verbose_eval=False, use_rank=True)\n",
"\n",
"xgb_lv2_cv=xgb_lv2_outcomes[0]\n",
"xgb_lv2_train_pred=xgb_lv2_outcomes[1]\n",
"xgb_lv2_test_pred=xgb_lv2_outcomes[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"学習予測のOOFスコアが上がり学習予測結果の良くなりました。レベル1で最もよかったのはXGBで得た0.282で今回は0.284が得られ、どのレベル1OOF学習スコアよりも良くなっていることがわかります。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3 レベル2 LightGBM\n",
"LightGBMも同様です。*use_rank*機能を使いましょう。"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.640239, Gini_Norm score is 0.280479\n",
"fold cv 1 AUC score is 0.639798, Gini_Norm score is 0.279596\n",
"fold cv 2 AUC score is 0.640237, Gini_Norm score is 0.280474\n",
"fold cv 3 AUC score is 0.638335, Gini_Norm score is 0.276670\n",
"fold cv 4 AUC score is 0.646149, Gini_Norm score is 0.292299\n",
"cv AUC score is 0.640982, Gini_Norm score is 0.281964\n",
"it takes 9.026 seconds to perform cross validation\n"
]
}
],
"source": [
"lgb_lv2_outcomes=cross_validate_lgb(lgb_params,lv1_train_df, y_train ,lv1_test_df,kf, [], use_cat=False, \n",
" verbose_eval=False, use_rank=True)\n",
"\n",
"lgb_lv2_cv=xgb_lv2_outcomes[0]\n",
"lgb_lv2_train_pred=lgb_lv2_outcomes[1]\n",
"lgb_lv2_test_pred=lgb_lv2_outcomes[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3 レベル2 ランダムフォレスト\n",
"レベル2にもランダムフォレストを使ってみましょう。"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.642115, Gini_Norm score is 0.284230\n",
"fold cv 1 AUC score is 0.640611, Gini_Norm score is 0.281222\n",
"fold cv 2 AUC score is 0.642669, Gini_Norm score is 0.285338\n",
"fold cv 3 AUC score is 0.640341, Gini_Norm score is 0.280683\n",
"fold cv 4 AUC score is 0.646928, Gini_Norm score is 0.293856\n",
"cv AUC score is 0.642027, Gini_Norm score is 0.284055\n",
"it takes 179.781 seconds to perform cross validation\n"
]
}
],
"source": [
"rf_lv2=RandomForestClassifier(n_estimators=200, n_jobs=6, min_samples_split=5, max_depth=7,\n",
" criterion='gini', random_state=0)\n",
"rf_lv2_outcomes = cross_validate_sklearn(rf_lv2, lv1_train_df, y_train ,lv1_test_df, kf, \n",
" scale=True, verbose=True)\n",
"rf_lv2_cv=rf_lv2_outcomes[0]\n",
"rf_lv2_train_pred=rf_lv2_outcomes[1]\n",
"rf_lv2_test_pred=rf_lv2_outcomes[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.4 レベル2 ロジスティック回帰"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.641212, Gini_Norm score is 0.282425\n",
"fold cv 1 AUC score is 0.639367, Gini_Norm score is 0.278733\n",
"fold cv 2 AUC score is 0.641523, Gini_Norm score is 0.283046\n",
"fold cv 3 AUC score is 0.639232, Gini_Norm score is 0.278464\n",
"fold cv 4 AUC score is 0.646012, Gini_Norm score is 0.292024\n",
"cv AUC score is 0.641076, Gini_Norm score is 0.282152\n",
"it takes 6.127 seconds to perform cross validation\n"
]
}
],
"source": [
"logit_lv2=LogisticRegression(random_state=0, C=0.5)\n",
"logit_lv2_outcomes = cross_validate_sklearn(logit_lv2, lv1_train_df, y_train ,lv1_test_df, kf, \n",
" scale=True, verbose=True)\n",
"logit_lv2_cv=logit_lv2_outcomes[0]\n",
"logit_lv2_train_pred=logit_lv2_outcomes[1]\n",
"logit_lv2_test_pred=logit_lv2_outcomes[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"レベル1OOF出力で求めたメタフィーチャのおかげで、ランダムフォレストやロジスティック回帰といったモデルがレベル2で優れた結果を出すようになったことが確認できます。\n",
"\n",
"ここでやめては面白くないのでレベル3もやってみましょう!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. レベル3アンサンブル\n",
"レベル3でもレベル2に似た手順を追います。まずレベル2のOOF出力をまとめ、お好きな学習アルゴリズムに渡します。\n",
"\n",
"### 5.1 レベル2出力データフレームの作成"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"lv2_columns=['rf_lf2', 'logit_lv2', 'xgb_lv2','lgb_lv2']\n",
"train_lv2_pred_list=[rf_lv2_train_pred, logit_lv2_train_pred, xgb_lv2_train_pred, lgb_lv2_train_pred]\n",
"\n",
"test_lv2_pred_list=[rf_lv2_test_pred, logit_lv2_test_pred, xgb_lv2_test_pred, lgb_lv2_test_pred]\n",
"\n",
"lv2_train=pd.DataFrame(columns=lv2_columns)\n",
"lv2_test=pd.DataFrame(columns=lv2_columns)\n",
"\n",
"for i in range(0,len(lv2_columns)):\n",
" lv2_train[lv2_columns[i]]=train_lv2_pred_list[i]\n",
" lv2_test[lv2_columns[i]]=test_lv2_pred_list[i]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.2 レベル3 XGB\n",
"ここではXGBのみ使ってみましょう。"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.642765, Gini_Norm score is 0.285529\n",
"fold cv 1 AUC score is 0.642432, Gini_Norm score is 0.284865\n",
"fold cv 2 AUC score is 0.642701, Gini_Norm score is 0.285402\n",
"fold cv 3 AUC score is 0.640837, Gini_Norm score is 0.281673\n",
"fold cv 4 AUC score is 0.647882, Gini_Norm score is 0.295764\n",
"cv AUC score is 0.643337, Gini_Norm score is 0.286674\n",
"it takes 44.188 seconds to perform cross validation\n"
]
}
],
"source": [
"xgb_lv3_params = {\n",
" \"booster\" : \"gbtree\", \n",
" \"objective\" : \"binary:logistic\",\n",
" \"tree_method\": \"hist\",\n",
" \"eval_metric\": \"auc\",\n",
" \"eta\": 0.1,\n",
" \"max_depth\": 2,\n",
" \"min_child_weight\": 10,\n",
" \"gamma\": 0.70,\n",
" \"subsample\": 0.76,\n",
" \"colsample_bytree\": 0.95,\n",
" \"nthread\": 6,\n",
" \"seed\": 0,\n",
" 'silent': 1\n",
"}\n",
"\n",
"\n",
"\n",
"xgb_lv3_outcomes=cross_validate_xgb(xgb_lv3_params, lv2_train, y_train, lv2_test, kf, \n",
" verbose=True, verbose_eval=False, use_rank=True)\n",
"\n",
"xgb_lv3_cv=xgb_lv3_outcomes[0]\n",
"xgb_lv3_train_pred=xgb_lv3_outcomes[1]\n",
"xgb_lv3_test_pred=xgb_lv3_outcomes[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"レベル2で求めたXBGの結果よりも少しだけ良いですが大差はなく、レベルを増やすことで収穫逓減が起こり始めてしまいました。線形的なものと組み合わせてみましょう。\n",
"\n",
"### 5.3 レベル3 ロジスティック回帰\n",
"線形といえばロジスティック回帰ですね!"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fold cv 0 AUC score is 0.642561, Gini_Norm score is 0.285122\n",
"fold cv 1 AUC score is 0.642091, Gini_Norm score is 0.284182\n",
"fold cv 2 AUC score is 0.643098, Gini_Norm score is 0.286195\n",
"fold cv 3 AUC score is 0.640703, Gini_Norm score is 0.281406\n",
"fold cv 4 AUC score is 0.647777, Gini_Norm score is 0.295554\n",
"cv AUC score is 0.643160, Gini_Norm score is 0.286320\n",
"it takes 4.816 seconds to perform cross validation\n"
]
}
],
"source": [
"logit_lv3=LogisticRegression(random_state=0, C=0.5)\n",
"logit_lv3_outcomes = cross_validate_sklearn(logit_lv3, lv2_train, y_train ,lv2_test, kf, \n",
" scale=True, verbose=True)\n",
"logit_lv3_cv=logit_lv3_outcomes[0]\n",
"logit_lv3_train_pred=logit_lv3_outcomes[1]\n",
"logit_lv3_test_pred=logit_lv3_outcomes[2]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"このレベルまでくるとXGBとロジスティック回帰の差はあまり見られません。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.4 レベル3出力データの平均と回答の提出\n",
"2つのウェイトの平均をとり、もう少し絞れないか見てみましょう。"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.286698636514\n"
]
}
],
"source": [
"weight_avg=logit_lv3_train_pred*0.5+ xgb_lv3_train_pred*0.5\n",
"print(auc_to_gini_norm(roc_auc_score(y_train, weight_avg)))"
]
},
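  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the `auc_to_gini_norm` helper used above is a linear rescaling of AUC; judging from the paired AUC/Gini values in this notebook's logs, it is equivalent to:\n",
    "```python\n",
    "def auc_to_gini_norm(auc_score):\n",
    "    # normalized Gini is 2*AUC - 1 for binary targets\n",
    "    return 2 * auc_score - 1\n",
    "\n",
    "print(auc_to_gini_norm(0.643337))  # ~0.286674, matching the level-3 XGB log\n",
    "```"
   ]
  },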
{
"cell_type": "markdown",
"metadata": {},
"source": [
"学習スコアは0.28443と求まりました。回答の提出をするためテストデータの方も同じウェイトで掛けましょう。"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"submission=sample_submission.copy()\n",
"submission['target']=logit_lv3_test_pred*0.5+ xgb_lv3_test_pred*0.5\n",
"filename='stacking_demonstration.csv.gz'\n",
"submission.to_csv(filename,compression='gzip', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6 終えてみて\n",
"レベルを3つ使ったスタッキングのやり方をご理解いただけたでしょうか。学習データからより多くの情報を取り出すことで、より良いテスト予測が期待できます。しかし実際このチャレンジのデータにはノイズが多く、レベル2以上のスタッキングがスコアを向上させてくれる保証は正直ないような気もします。\n",
"\n",
"以下の手順でスタッキングを進めるのも案です。\n",
"\n",
"レベル2まで進めウェイトを平均する\n",
"いくつか異なった乱数シードに対しても同じような手順でスタッキングを行う\n",
"全ての平均をとる\n",
"\n",
"CV(交差検証)もこのチャレンジの鍵になるかと思います。もしかしたら締め切り前日になって誰かが0.291のスクリプトを流出させて私たちをこう興奮させてくれるかもしれません。締め切りわずかの時間を楽しんでくださいね!被保険者が請求する必要がないことを願って。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}