@pon-x
Created November 9, 2020 01:47
Example of multiple regression analysis
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multiple Regression Analysis\n",
"\n",
"## Let's run a multiple regression analysis and evaluate the model."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import linear_model\n",
"# Note: load_boston was removed in scikit-learn 1.2, so this notebook needs an older scikit-learn release.\n",
"from sklearn.datasets import load_boston"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The data used here is the Boston house prices dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,\n",
" 4.9800e+00],\n",
" [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,\n",
" 9.1400e+00],\n",
" [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,\n",
" 4.0300e+00],\n",
" ...,\n",
" [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n",
" 5.6400e+00],\n",
" [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,\n",
" 6.4800e+00],\n",
" [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n",
" 7.8800e+00]]),\n",
" 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,\n",
" 18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,\n",
" 15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,\n",
" 13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,\n",
" 21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,\n",
" 35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,\n",
" 19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,\n",
" 20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,\n",
" 23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n",
" 33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,\n",
" 21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,\n",
" 20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,\n",
" 23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,\n",
" 15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,\n",
" 17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,\n",
" 25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,\n",
" 23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,\n",
" 32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n",
" 34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,\n",
" 20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,\n",
" 26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,\n",
" 31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,\n",
" 22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,\n",
" 42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,\n",
" 36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,\n",
" 32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,\n",
" 20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n",
" 20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,\n",
" 22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,\n",
" 21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,\n",
" 19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,\n",
" 32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,\n",
" 18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,\n",
" 16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,\n",
" 13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8,\n",
" 7.2, 10.5, 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,\n",
" 12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5, 5. , 11.9,\n",
" 27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3, 7. , 7.2, 7.5, 10.4,\n",
" 8.8, 8.4, 16.7, 14.2, 20.8, 13.4, 11.7, 8.3, 10.2, 10.9, 11. ,\n",
" 9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8,\n",
" 10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,\n",
" 15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,\n",
" 19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,\n",
" 29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,\n",
" 20.6, 21.2, 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,\n",
" 23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9]),\n",
" 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n",
" 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'),\n",
" 'DESCR': \".. _boston_dataset:\\n\\nBoston house prices dataset\\n---------------------------\\n\\n**Data Set Characteristics:** \\n\\n :Number of Instances: 506 \\n\\n :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\\n\\n :Attribute Information (in order):\\n - CRIM per capita crime rate by town\\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\\n - INDUS proportion of non-retail business acres per town\\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\\n - NOX nitric oxides concentration (parts per 10 million)\\n - RM average number of rooms per dwelling\\n - AGE proportion of owner-occupied units built prior to 1940\\n - DIS weighted distances to five Boston employment centres\\n - RAD index of accessibility to radial highways\\n - TAX full-value property-tax rate per $10,000\\n - PTRATIO pupil-teacher ratio by town\\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\\n - LSTAT % lower status of the population\\n - MEDV Median value of owner-occupied homes in $1000's\\n\\n :Missing Attribute Values: None\\n\\n :Creator: Harrison, D. and Rubinfeld, D.L.\\n\\nThis is a copy of UCI ML housing dataset.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\\n\\n\\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\\n\\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\\nprices and the demand for clean air', J. Environ. Economics & Management,\\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\\n...', Wiley, 1980. N.B. Various transformations are used in the table on\\npages 244-261 of the latter.\\n\\nThe Boston house-price data has been used in many machine learning papers that address regression\\nproblems. \\n \\n.. 
topic:: References\\n\\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\\n\",\n",
" 'filename': 'C:\\\\Users\\\\hide8\\\\Anaconda3\\\\lib\\\\site-packages\\\\sklearn\\\\datasets\\\\data\\\\boston_house_prices.csv'}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"boston = load_boston()\n",
"boston"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CRIM</th>\n",
" <th>ZN</th>\n",
" <th>INDUS</th>\n",
" <th>CHAS</th>\n",
" <th>NOX</th>\n",
" <th>RM</th>\n",
" <th>AGE</th>\n",
" <th>DIS</th>\n",
" <th>RAD</th>\n",
" <th>TAX</th>\n",
" <th>PTRATIO</th>\n",
" <th>B</th>\n",
" <th>LSTAT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.00632</td>\n",
" <td>18.0</td>\n",
" <td>2.31</td>\n",
" <td>0.0</td>\n",
" <td>0.538</td>\n",
" <td>6.575</td>\n",
" <td>65.2</td>\n",
" <td>4.0900</td>\n",
" <td>1.0</td>\n",
" <td>296.0</td>\n",
" <td>15.3</td>\n",
" <td>396.90</td>\n",
" <td>4.98</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.02731</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>6.421</td>\n",
" <td>78.9</td>\n",
" <td>4.9671</td>\n",
" <td>2.0</td>\n",
" <td>242.0</td>\n",
" <td>17.8</td>\n",
" <td>396.90</td>\n",
" <td>9.14</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.02729</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>7.185</td>\n",
" <td>61.1</td>\n",
" <td>4.9671</td>\n",
" <td>2.0</td>\n",
" <td>242.0</td>\n",
" <td>17.8</td>\n",
" <td>392.83</td>\n",
" <td>4.03</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.03237</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>6.998</td>\n",
" <td>45.8</td>\n",
" <td>6.0622</td>\n",
" <td>3.0</td>\n",
" <td>222.0</td>\n",
" <td>18.7</td>\n",
" <td>394.63</td>\n",
" <td>2.94</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>0.06905</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>7.147</td>\n",
" <td>54.2</td>\n",
" <td>6.0622</td>\n",
" <td>3.0</td>\n",
" <td>222.0</td>\n",
" <td>18.7</td>\n",
" <td>396.90</td>\n",
" <td>5.33</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \\\n",
"0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 \n",
"1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 \n",
"2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 \n",
"3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 \n",
"4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 \n",
"\n",
" PTRATIO B LSTAT \n",
"0 15.3 396.90 4.98 \n",
"1 17.8 396.90 9.14 \n",
"2 17.8 392.83 4.03 \n",
"3 18.7 394.63 2.94 \n",
"4 18.7 396.90 5.33 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Put the data into a DataFrame to make it easier to read.\n",
"\n",
"df = pd.DataFrame(boston.data, columns = boston.feature_names)\n",
"df.head()"
]
},
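{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Column | Description |\n",
"| :--- | :--- |\n",
"| CRIM | Per capita crime rate by town |\n",
"| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |\n",
"| INDUS | Proportion of non-retail business acres per town |\n",
"| CHAS | Charles River dummy variable (1 if the tract bounds the river, 0 otherwise) |\n",
"| NOX | Nitric oxides concentration (parts per 10 million) |\n",
"| RM | Average number of rooms per dwelling |\n",
"| AGE | Proportion of owner-occupied units built prior to 1940 |\n",
"| DIS | Weighted distances to five Boston employment centres |\n",
"| RAD | Index of accessibility to radial highways |\n",
"| TAX | Full-value property-tax rate per 10,000 USD |\n",
"| PTRATIO | Pupil-teacher ratio by town |\n",
"| B | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |\n",
"| LSTAT | Percentage of lower-status population |\n",
"| MEDV | Median value of owner-occupied homes, in units of 1,000 USD |\n",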
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The target variable is stored separately, so add it to the DataFrame as well."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CRIM</th>\n",
" <th>ZN</th>\n",
" <th>INDUS</th>\n",
" <th>CHAS</th>\n",
" <th>NOX</th>\n",
" <th>RM</th>\n",
" <th>AGE</th>\n",
" <th>DIS</th>\n",
" <th>RAD</th>\n",
" <th>TAX</th>\n",
" <th>PTRATIO</th>\n",
" <th>B</th>\n",
" <th>LSTAT</th>\n",
" <th>PRICE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.00632</td>\n",
" <td>18.0</td>\n",
" <td>2.31</td>\n",
" <td>0.0</td>\n",
" <td>0.538</td>\n",
" <td>6.575</td>\n",
" <td>65.2</td>\n",
" <td>4.0900</td>\n",
" <td>1.0</td>\n",
" <td>296.0</td>\n",
" <td>15.3</td>\n",
" <td>396.90</td>\n",
" <td>4.98</td>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.02731</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>6.421</td>\n",
" <td>78.9</td>\n",
" <td>4.9671</td>\n",
" <td>2.0</td>\n",
" <td>242.0</td>\n",
" <td>17.8</td>\n",
" <td>396.90</td>\n",
" <td>9.14</td>\n",
" <td>21.6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.02729</td>\n",
" <td>0.0</td>\n",
" <td>7.07</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>7.185</td>\n",
" <td>61.1</td>\n",
" <td>4.9671</td>\n",
" <td>2.0</td>\n",
" <td>242.0</td>\n",
" <td>17.8</td>\n",
" <td>392.83</td>\n",
" <td>4.03</td>\n",
" <td>34.7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.03237</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>6.998</td>\n",
" <td>45.8</td>\n",
" <td>6.0622</td>\n",
" <td>3.0</td>\n",
" <td>222.0</td>\n",
" <td>18.7</td>\n",
" <td>394.63</td>\n",
" <td>2.94</td>\n",
" <td>33.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>0.06905</td>\n",
" <td>0.0</td>\n",
" <td>2.18</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>7.147</td>\n",
" <td>54.2</td>\n",
" <td>6.0622</td>\n",
" <td>3.0</td>\n",
" <td>222.0</td>\n",
" <td>18.7</td>\n",
" <td>396.90</td>\n",
" <td>5.33</td>\n",
" <td>36.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \\\n",
"0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 \n",
"1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 \n",
"2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 \n",
"3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 \n",
"4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 \n",
"\n",
" PTRATIO B LSTAT PRICE \n",
"0 15.3 396.90 4.98 24.0 \n",
"1 17.8 396.90 9.14 21.6 \n",
"2 17.8 392.83 4.03 34.7 \n",
"3 18.7 394.63 2.94 33.4 \n",
"4 18.7 396.90 5.33 36.2 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"PRICE\"] = boston.target\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set the explanatory variables as X and the target variable as Y, using all columns for now."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"X = df[boston.feature_names]\n",
"Y = df[\"PRICE\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### This time, split the data into training and test sets for evaluation."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4.910525893692257"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
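],
"source": [
"lm = linear_model.LinearRegression()\n",
"lm.fit(X_train, Y_train) # Fit the model on the training data.\n",
"y_pred = lm.predict(X_test) # Predict from the test features.\n",
"np.sqrt(mean_squared_error(Y_test, y_pred)) # RMSE: the positive square root of the mean squared difference between the actual Y values and the predictions. Smaller is better."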
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine the regression coefficients to see how much each column affects the target variable."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>coef</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>CRIM</td>\n",
" <td>-0.094108</td>\n",
" </tr>\n",
" <tr>\n",
" <td>ZN</td>\n",
" <td>0.063418</td>\n",
" </tr>\n",
" <tr>\n",
" <td>INDUS</td>\n",
" <td>-0.025215</td>\n",
" </tr>\n",
" <tr>\n",
" <td>CHAS</td>\n",
" <td>2.807830</td>\n",
" </tr>\n",
" <tr>\n",
" <td>NOX</td>\n",
" <td>-21.966234</td>\n",
" </tr>\n",
" <tr>\n",
" <td>RM</td>\n",
" <td>2.510085</td>\n",
" </tr>\n",
" <tr>\n",
" <td>AGE</td>\n",
" <td>0.006912</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DIS</td>\n",
" <td>-1.870137</td>\n",
" </tr>\n",
" <tr>\n",
" <td>RAD</td>\n",
" <td>0.363952</td>\n",
" </tr>\n",
" <tr>\n",
" <td>TAX</td>\n",
" <td>-0.013587</td>\n",
" </tr>\n",
" <tr>\n",
" <td>PTRATIO</td>\n",
" <td>-1.083637</td>\n",
" </tr>\n",
" <tr>\n",
" <td>B</td>\n",
" <td>0.009199</td>\n",
" </tr>\n",
" <tr>\n",
" <td>LSTAT</td>\n",
" <td>-0.576695</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" coef\n",
"CRIM -0.094108\n",
"ZN 0.063418\n",
"INDUS -0.025215\n",
"CHAS 2.807830\n",
"NOX -21.966234\n",
"RM 2.510085\n",
"AGE 0.006912\n",
"DIS -1.870137\n",
"RAD 0.363952\n",
"TAX -0.013587\n",
"PTRATIO -1.083637\n",
"B 0.009199\n",
"LSTAT -0.576695"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
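],
"source": [
"# The closer a coefficient is to 0, the less the prediction changes per unit of that feature (the features are on different scales, so magnitudes are not directly comparable).\n",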
"\n",
"coef = pd.DataFrame(lm.coef_, columns=['coef'], index=boston.feature_names)\n",
"coef"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x2c819c31c48>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot the coefficients as a bar chart.\n",
"# NOX has by far the largest coefficient in absolute value.\n",
"\n",
"coef.plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drop the columns whose regression coefficient has an absolute value below 0.07 and evaluate again."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CRIM</th>\n",
" <th>CHAS</th>\n",
" <th>NOX</th>\n",
" <th>RM</th>\n",
" <th>DIS</th>\n",
" <th>RAD</th>\n",
" <th>PTRATIO</th>\n",
" <th>LSTAT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0.00632</td>\n",
" <td>0.0</td>\n",
" <td>0.538</td>\n",
" <td>6.575</td>\n",
" <td>4.0900</td>\n",
" <td>1.0</td>\n",
" <td>15.3</td>\n",
" <td>4.98</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0.02731</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>6.421</td>\n",
" <td>4.9671</td>\n",
" <td>2.0</td>\n",
" <td>17.8</td>\n",
" <td>9.14</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0.02729</td>\n",
" <td>0.0</td>\n",
" <td>0.469</td>\n",
" <td>7.185</td>\n",
" <td>4.9671</td>\n",
" <td>2.0</td>\n",
" <td>17.8</td>\n",
" <td>4.03</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0.03237</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>6.998</td>\n",
" <td>6.0622</td>\n",
" <td>3.0</td>\n",
" <td>18.7</td>\n",
" <td>2.94</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>0.06905</td>\n",
" <td>0.0</td>\n",
" <td>0.458</td>\n",
" <td>7.147</td>\n",
" <td>6.0622</td>\n",
" <td>3.0</td>\n",
" <td>18.7</td>\n",
" <td>5.33</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CRIM CHAS NOX RM DIS RAD PTRATIO LSTAT\n",
"0 0.00632 0.0 0.538 6.575 4.0900 1.0 15.3 4.98\n",
"1 0.02731 0.0 0.469 6.421 4.9671 2.0 17.8 9.14\n",
"2 0.02729 0.0 0.469 7.185 4.9671 2.0 17.8 4.03\n",
"3 0.03237 0.0 0.458 6.998 6.0622 3.0 18.7 2.94\n",
"4 0.06905 0.0 0.458 7.147 6.0622 3.0 18.7 5.33"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the selected columns.\n",
"\n",
"feature_cols = coef[abs(coef[\"coef\"]) >= 0.07].index\n",
"\n",
"X2 = df[feature_cols]\n",
"X2.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, Y_train, Y_test = train_test_split(X2, Y, test_size=0.3, random_state=1234)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4.875197690173779"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
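],
"source": [
"lm = linear_model.LinearRegression()\n",
"lm.fit(X_train, Y_train) # Fit the model on the training data.\n",
"y_pred = lm.predict(X_test) # Predict from the test features.\n",
"np.sqrt(mean_squared_error(Y_test, y_pred)) # RMSE: the positive square root of the mean squared difference between the actual Y values and the predictions. Smaller is better."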
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1\n",
"\n",
"The following shows the summary of a multiple regression analysis fitted to all of the data."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>PRICE</td> <th> R-squared: </th> <td> 0.741</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.734</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 108.1</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Sat, 10 Oct 2020</td> <th> Prob (F-statistic):</th> <td>6.72e-135</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>10:10:23</td> <th> Log-Likelihood: </th> <td> -1498.8</td> \n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 506</td> <th> AIC: </th> <td> 3026.</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 492</td> <th> BIC: </th> <td> 3085.</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 13</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>const</th> <td> 36.4595</td> <td> 5.103</td> <td> 7.144</td> <td> 0.000</td> <td> 26.432</td> <td> 46.487</td>\n",
"</tr>\n",
"<tr>\n",
" <th>CRIM</th> <td> -0.1080</td> <td> 0.033</td> <td> -3.287</td> <td> 0.001</td> <td> -0.173</td> <td> -0.043</td>\n",
"</tr>\n",
"<tr>\n",
" <th>ZN</th> <td> 0.0464</td> <td> 0.014</td> <td> 3.382</td> <td> 0.001</td> <td> 0.019</td> <td> 0.073</td>\n",
"</tr>\n",
"<tr>\n",
" <th>INDUS</th> <td> 0.0206</td> <td> 0.061</td> <td> 0.334</td> <td> 0.738</td> <td> -0.100</td> <td> 0.141</td>\n",
"</tr>\n",
"<tr>\n",
" <th>CHAS</th> <td> 2.6867</td> <td> 0.862</td> <td> 3.118</td> <td> 0.002</td> <td> 0.994</td> <td> 4.380</td>\n",
"</tr>\n",
"<tr>\n",
" <th>NOX</th> <td> -17.7666</td> <td> 3.820</td> <td> -4.651</td> <td> 0.000</td> <td> -25.272</td> <td> -10.262</td>\n",
"</tr>\n",
"<tr>\n",
" <th>RM</th> <td> 3.8099</td> <td> 0.418</td> <td> 9.116</td> <td> 0.000</td> <td> 2.989</td> <td> 4.631</td>\n",
"</tr>\n",
"<tr>\n",
" <th>AGE</th> <td> 0.0007</td> <td> 0.013</td> <td> 0.052</td> <td> 0.958</td> <td> -0.025</td> <td> 0.027</td>\n",
"</tr>\n",
"<tr>\n",
" <th>DIS</th> <td> -1.4756</td> <td> 0.199</td> <td> -7.398</td> <td> 0.000</td> <td> -1.867</td> <td> -1.084</td>\n",
"</tr>\n",
"<tr>\n",
" <th>RAD</th> <td> 0.3060</td> <td> 0.066</td> <td> 4.613</td> <td> 0.000</td> <td> 0.176</td> <td> 0.436</td>\n",
"</tr>\n",
"<tr>\n",
" <th>TAX</th> <td> -0.0123</td> <td> 0.004</td> <td> -3.280</td> <td> 0.001</td> <td> -0.020</td> <td> -0.005</td>\n",
"</tr>\n",
"<tr>\n",
" <th>PTRATIO</th> <td> -0.9527</td> <td> 0.131</td> <td> -7.283</td> <td> 0.000</td> <td> -1.210</td> <td> -0.696</td>\n",
"</tr>\n",
"<tr>\n",
" <th>B</th> <td> 0.0093</td> <td> 0.003</td> <td> 3.467</td> <td> 0.001</td> <td> 0.004</td> <td> 0.015</td>\n",
"</tr>\n",
"<tr>\n",
" <th>LSTAT</th> <td> -0.5248</td> <td> 0.051</td> <td> -10.347</td> <td> 0.000</td> <td> -0.624</td> <td> -0.425</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>178.041</td> <th> Durbin-Watson: </th> <td> 1.078</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td> 783.126</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td> 1.521</td> <th> Prob(JB): </th> <td>8.84e-171</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td> 8.281</td> <th> Cond. No. </th> <td>1.51e+04</td> \n",
"</tr>\n",
"</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 1.51e+04. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: PRICE R-squared: 0.741\n",
"Model: OLS Adj. R-squared: 0.734\n",
"Method: Least Squares F-statistic: 108.1\n",
"Date: Sat, 10 Oct 2020 Prob (F-statistic): 6.72e-135\n",
"Time: 10:10:23 Log-Likelihood: -1498.8\n",
"No. Observations: 506 AIC: 3026.\n",
"Df Residuals: 492 BIC: 3085.\n",
"Df Model: 13 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const 36.4595 5.103 7.144 0.000 26.432 46.487\n",
"CRIM -0.1080 0.033 -3.287 0.001 -0.173 -0.043\n",
"ZN 0.0464 0.014 3.382 0.001 0.019 0.073\n",
"INDUS 0.0206 0.061 0.334 0.738 -0.100 0.141\n",
"CHAS 2.6867 0.862 3.118 0.002 0.994 4.380\n",
"NOX -17.7666 3.820 -4.651 0.000 -25.272 -10.262\n",
"RM 3.8099 0.418 9.116 0.000 2.989 4.631\n",
"AGE 0.0007 0.013 0.052 0.958 -0.025 0.027\n",
"DIS -1.4756 0.199 -7.398 0.000 -1.867 -1.084\n",
"RAD 0.3060 0.066 4.613 0.000 0.176 0.436\n",
"TAX -0.0123 0.004 -3.280 0.001 -0.020 -0.005\n",
"PTRATIO -0.9527 0.131 -7.283 0.000 -1.210 -0.696\n",
"B 0.0093 0.003 3.467 0.001 0.004 0.015\n",
"LSTAT -0.5248 0.051 -10.347 0.000 -0.624 -0.425\n",
"==============================================================================\n",
"Omnibus: 178.041 Durbin-Watson: 1.078\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126\n",
"Skew: 1.521 Prob(JB): 8.84e-171\n",
"Kurtosis: 8.281 Cond. No. 1.51e+04\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 1.51e+04. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import statsmodels.api as sm\n",
" \n",
"x = df[boston.feature_names]\n",
"Y = df[\"PRICE\"]\n",
" \n",
"# Add a constant column for the intercept term.\n",
"X = sm.add_constant(x)\n",
" \n",
"# Set up an ordinary least squares (OLS) regression.\n",
"lm = sm.OLS(Y, X)\n",
" \n",
"# Run the regression.\n",
"result = lm.fit()\n",
" \n",
"# Show the detailed results.\n",
"result.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## From the results above, pick out the explanatory variables whose p-value is greater than 5% and discuss how they should be interpreted."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
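
The exercise at the end of the notebook asks which explanatory variables have a p-value greater than 5%. As a minimal sketch, the selection can be done on the values printed in the OLS summary. The rows below are copied by hand from that table (rounded as displayed), and the names summary_rows, not_significant and interval_contains_zero are hypothetical helpers introduced only for this illustration; they are not part of the notebook or of statsmodels.

```python
# Rows copied from the OLS summary table in the notebook:
# name -> (coef, p-value, 95% CI lower bound, 95% CI upper bound).
summary_rows = {
    "const":   (36.4595, 0.000, 26.432, 46.487),
    "CRIM":    (-0.1080, 0.001, -0.173, -0.043),
    "ZN":      (0.0464, 0.001, 0.019, 0.073),
    "INDUS":   (0.0206, 0.738, -0.100, 0.141),
    "CHAS":    (2.6867, 0.002, 0.994, 4.380),
    "NOX":     (-17.7666, 0.000, -25.272, -10.262),
    "RM":      (3.8099, 0.000, 2.989, 4.631),
    "AGE":     (0.0007, 0.958, -0.025, 0.027),
    "DIS":     (-1.4756, 0.000, -1.867, -1.084),
    "RAD":     (0.3060, 0.000, 0.176, 0.436),
    "TAX":     (-0.0123, 0.001, -0.020, -0.005),
    "PTRATIO": (-0.9527, 0.000, -1.210, -0.696),
    "B":       (0.0093, 0.001, 0.004, 0.015),
    "LSTAT":   (-0.5248, 0.000, -0.624, -0.425),
}

ALPHA = 0.05  # the 5% threshold named in the exercise


def not_significant(rows, alpha=ALPHA):
    """Return the names of variables whose p-value exceeds alpha."""
    return [name for name, (_, p_value, _, _) in rows.items() if p_value > alpha]


def interval_contains_zero(row):
    """Check whether the 95% confidence interval of a row includes zero."""
    _, _, lower, upper = row
    return lower <= 0.0 <= upper


candidates = not_significant(summary_rows)
for name in candidates:
    coef, p_value, lower, upper = summary_rows[name]
    print(
        f"{name}: coef={coef}, p={p_value}, 95% CI=[{lower}, {upper}], "
        f"contains zero: {interval_contains_zero(summary_rows[name])}"
    )
```

With the rounded values printed in the summary, this selects INDUS and AGE, and both 95% confidence intervals include zero. A p-value above 5% only means that, with the other variables already in the model, the data do not rule out a coefficient of zero; it does not show that the variable is unrelated to price. In the notebook itself the same p-values should be available from result.pvalues rather than a hand-copied table.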