@riodeja5
Created November 18, 2018 13:49
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import mglearn\n",
"from IPython.display import display\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.3.1 Different Kinds of Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x576 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mglearn.plots.plot_scaling()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question: RobustScaler uses the median and quartiles, but why does that let it ignore outliers?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.3.2 Applying Data Transformations"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_breast_cancer\n",
"from sklearn.model_selection import train_test_split\n",
"cancer = load_breast_cancer()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(426, 30)\n",
"(143, 30)\n"
]
}
],
"source": [
"print(X_train.shape)\n",
"print(X_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Preprocessing: import MinMaxScaler and create an instance\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"scaler = MinMaxScaler()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MinMaxScaler(copy=True, feature_range=(0, 1))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit the scaler to the training data\n",
"# Note: a scaler's fit method is given only the data X; y is not passed\n",
"scaler.fit(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Transform the training data\n",
"X_train_scaled = scaler.transform(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"transformed shape: (426, 30)\n",
"per-feature minimum before scaling:\n",
" [6.981e+00 9.710e+00 4.379e+01 1.435e+02 5.263e-02 1.938e-02 0.000e+00\n",
" 0.000e+00 1.060e-01 5.024e-02 1.153e-01 3.602e-01 7.570e-01 6.802e+00\n",
" 1.713e-03 2.252e-03 0.000e+00 0.000e+00 9.539e-03 8.948e-04 7.930e+00\n",
" 1.202e+01 5.041e+01 1.852e+02 7.117e-02 2.729e-02 0.000e+00 0.000e+00\n",
" 1.566e-01 5.521e-02]\n",
"per-feature maximum before scaling:\n",
" [2.811e+01 3.928e+01 1.885e+02 2.501e+03 1.634e-01 2.867e-01 4.268e-01\n",
" 2.012e-01 3.040e-01 9.575e-02 2.873e+00 4.885e+00 2.198e+01 5.422e+02\n",
" 3.113e-02 1.354e-01 3.960e-01 5.279e-02 6.146e-02 2.984e-02 3.604e+01\n",
" 4.954e+01 2.512e+02 4.254e+03 2.226e-01 9.379e-01 1.170e+00 2.910e-01\n",
" 5.774e-01 1.486e-01]\n",
"per-feature minimum after scaling:\n",
" [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. 0. 0. 0. 0.]\n",
"per-feature maximum after scaling:\n",
" [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n",
" 1. 1. 1. 1. 1. 1.]\n"
]
}
],
"source": [
"# Print the training data's properties before and after scaling\n",
"print(\"transformed shape: {}\".format(X_train_scaled.shape))\n",
"print(\"per-feature minimum before scaling:\\n {}\".format(X_train.min(axis=0)))\n",
"print(\"per-feature maximum before scaling:\\n {}\".format(X_train.max(axis=0)))\n",
"print(\"per-feature minimum after scaling:\\n {}\".format(X_train_scaled.min(axis=0)))\n",
"print(\"per-feature maximum after scaling:\\n {}\".format(X_train_scaled.max(axis=0)))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Transform the test data\n",
"X_test_scaled = scaler.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"per-feature minimum after scaling:\n",
" [ 0.0336031 0.0226581 0.03144219 0.01141039 0.14128374 0.04406704\n",
" 0. 0. 0.1540404 -0.00615249 -0.00137796 0.00594501\n",
" 0.00430665 0.00079567 0.03919502 0.0112206 0. 0.\n",
" -0.03191387 0.00664013 0.02660975 0.05810235 0.02031974 0.00943767\n",
" 0.1094235 0.02637792 0. 0. -0.00023764 -0.00182032]\n",
"per-feature maximum after scaling:\n",
" [0.9578778 0.81501522 0.95577362 0.89353128 0.81132075 1.21958701\n",
" 0.87956888 0.9333996 0.93232323 1.0371347 0.42669616 0.49765736\n",
" 0.44117231 0.28371044 0.48703131 0.73863671 0.76717172 0.62928585\n",
" 1.33685792 0.39057253 0.89612238 0.79317697 0.84859804 0.74488793\n",
" 0.9154725 1.13188961 1.07008547 0.92371134 1.20532319 1.63068851]\n"
]
}
],
"source": [
"# Print the test data's properties after scaling\n",
"print(\"per-feature minimum after scaling:\\n {}\".format(X_test_scaled.min(axis=0)))\n",
"print(\"per-feature maximum after scaling:\\n {}\".format(X_test_scaled.max(axis=0)))"
]
},
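{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the test data gets the same transformation that was fitted on the training data (the training set's per-feature minimum and range), so scaled test values can fall outside the 0 to 1 range!"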
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.3.3 Scaling Training and Test Data the Same Way"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Make synthetic data\n",
"X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Split into training and test sets\n",
"X_train, X_test = train_test_split(X, random_state=5, test_size=.1)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 936x288 with 3 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot the training and test sets\n",
"fig, axes = plt.subplots(1, 3, figsize=(13, 4))\n",
"axes[0].scatter(X_train[:, 0], X_train[:, 1],\n",
"                color=mglearn.cm2(0), label=\"Training set\", s=60)\n",
"axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^',\n",
"                color=mglearn.cm2(1), label=\"Test set\", s=60)\n",
"axes[0].legend(loc='upper left')\n",
"axes[0].set_title(\"Original Data\")\n",
"\n",
"# Scale the data with MinMaxScaler (fitted on the training set only)\n",
"scaler = MinMaxScaler()\n",
"scaler.fit(X_train)\n",
"X_train_scaled = scaler.transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Visualize the properly scaled data\n",
"axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],\n",
"                color=mglearn.cm2(0), label=\"Training set\", s=60)\n",
"axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',\n",
"                color=mglearn.cm2(1), label=\"Test set\", s=60)\n",
"axes[1].set_title(\"Scaled Data\")\n",
"\n",
"# Scale the test set separately from the training set\n",
"# Its own minimum and maximum become 0 and 1; done on purpose here, for illustration only\n",
"# *Never do this in practice!*\n",
"test_scaler = MinMaxScaler()\n",
"test_scaler.fit(X_test)\n",
"X_test_scaled_badly = test_scaler.transform(X_test)\n",
"\n",
"# Visualize the improperly scaled data\n",
"axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],\n",
"                color=mglearn.cm2(0), label=\"Training set\", s=60)\n",
"axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^',\n",
"                color=mglearn.cm2(1), label=\"Test set\", s=60)\n",
"axes[2].set_title(\"Improperly Scaled Data\")\n",
"\n",
"for ax in axes:\n",
"    ax.set_xlabel(\"Feature 0\")\n",
"    ax.set_ylabel(\"Feature 1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.3.4 The Effect of Preprocessing on Supervised Learning"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test set accuracy: 0.63\n"
]
}
],
"source": [
"svm = SVC(C=100)\n",
"svm.fit(X_train, y_train)\n",
"print(\"Test set accuracy: {:.2f}\".format(svm.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without any preprocessing, the test accuracy is only this much (0.63)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, rescale the data with MinMaxScaler before passing it to SVC."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scaled test set accuracy: 0.97\n"
]
}
],
"source": [
"# Preprocess with 0-1 scaling\n",
"scaler = MinMaxScaler()\n",
"scaler.fit(X_train)\n",
"X_train_scaled = scaler.transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Train the SVM on the scaled training data\n",
"svm.fit(X_train_scaled, y_train)\n",
"\n",
"# Score on the scaled test set\n",
"print(\"Scaled test set accuracy: {:.2f}\".format(svm.score(X_test_scaled, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy went way up, from 0.63 to 0.97!"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scaled test set accuracy: 0.96\n"
]
}
],
"source": [
"# Preprocess to zero mean and unit variance\n",
"from sklearn.preprocessing import StandardScaler\n",
"scaler = StandardScaler()\n",
"scaler.fit(X_train)\n",
"X_train_scaled = scaler.transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Train the SVM on the scaled training data\n",
"svm.fit(X_train_scaled, y_train)\n",
"\n",
"# Score on the scaled test set\n",
"print(\"Scaled test set accuracy: {:.2f}\".format(svm.score(X_test_scaled, y_test)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
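
On the question noted under 3.3.1 (why do the median and quartiles let RobustScaler shrug off outliers?), here is a minimal sketch using only Python's standard library, not scikit-learn's actual implementation. It compares how the median and interquartile range (IQR) react to one extreme value versus how the mean, standard deviation, and maximum react. The helper names `robust_stats`, `standard_stats`, and `robust_scale` and the toy data are made up for illustration, and `statistics.quantiles` assumes Python 3.8 or later.

```python
import statistics


def robust_stats(values):
    """Return (median, interquartile range) of a list of numbers."""
    # method="inclusive" interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, q3 - q1


def standard_stats(values):
    """Return (mean, population standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.pstdev(values)


def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    median, iqr = robust_stats(values)
    return [(v - median) / iqr for v in values]


# Toy feature: nine well-behaved values, then the same feature with the
# largest value replaced by an extreme outlier (hypothetical data).
clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
with_outlier = clean[:-1] + [1000.0]

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    median, iqr = robust_stats(data)
    mean, std = standard_stats(data)
    print(f"{name:>12}: median={median:.2f} IQR={iqr:.2f} "
          f"mean={mean:.2f} std={std:.2f} max={max(data):.2f}")

# Compare the robust-scaled values of the eight non-outlier points
# between the two versions of the feature.
print(robust_scale(clean)[:8] == robust_scale(with_outlier)[:8])
```

The median and quartiles depend only on the rank order of the middle of the data, so pushing the largest value from 9 to 1000 leaves them where they were, while the mean, standard deviation, and maximum (the statistics StandardScaler and MinMaxScaler rely on) all shift. Strictly speaking the outlier is not dropped: it is still in the scaled data, just very far from the rest. It simply has no say in the center and scale used for every other point.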