@ereyester
Created July 31, 2020 08:57
jouhou2_3_13_python.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "jouhou2_3_13_python.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"authorship_tag": "ABX9TyMuo8hGdVdicLphI29h319U",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/ereyester/5c6e5a9b8aa55ba826c7c96a4daf7814/jouhou2_3_13_python.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h5sdaFZ62KMs",
"colab_type": "text"
},
"source": [
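"# High School Informatics \"Information II\" Teacher Training Materials\n",
"## Chapter 3, First Half: 13. Multiple Regression Analysis and Model Selection\n",
"### Python version"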
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8F_OYPOD2V3r",
"colab_type": "text"
},
"source": [
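"statsmodels provides a variety of linear regression models,\n",
"ranging from the basic (ordinary least squares, OLS) to the more complex (iteratively reweighted least squares, IRLS).\n",
"Its linear models have two main interfaces:\n",
"an array-based one and a formula-based one."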
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SLHIqDfcDKpT",
"colab_type": "text"
},
"source": [
"https://blog.amedama.jp/entry/2016/12/23/193452\n",
"https://qiita.com/0NE_shoT_/items/08376b08783cd554b02e\n",
"\n",
"http://pepper.is.sci.toho-u.ac.jp/pepper/index.php?%A5%CE%A1%BC%A5%C8%2FPython%2F%C5%FD%B7%D7%2F%B2%F3%B5%A2%CA%AC%C0%CF"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YFki6HUFTHEy",
"colab_type": "text"
},
"source": [
"The textbook's R code calls lm() with weights=NULL, which amounts to ordinary least squares (OLS), so OLS is used here as well."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iGUX3Kdzc2dk",
"colab_type": "text"
},
"source": [
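"- coef: estimated coefficients\n",
"- R-squared: coefficient of determination (proportion of variance explained)\n",
"- Adj. R-squared: R-squared adjusted for degrees of freedom\n"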
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "syuulWeP975p",
"colab_type": "text"
},
"source": [
"Formula-based code (close to the textbook's R code)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "q6OJEX7BLwpC",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 527
},
"outputId": "9a05dc34-a071-4c40-e486-eaa46e916bfc"
},
"source": [
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"high_male = pd.read_csv('/content/high_male_data.csv')\n",
"\n",
"model = smf.ols('X50m走 ~ 立ち幅跳び + ハンドボール投げ + 握力得点 + 上体起こし得点', data = high_male)\n",
"results = model.fit()\n",
"print(results.summary())\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: X50m走 R-squared: 0.525\n",
"Model: OLS Adj. R-squared: 0.510\n",
"Method: Least Squares F-statistic: 36.18\n",
"Date: Mon, 20 Jul 2020 Prob (F-statistic): 2.39e-20\n",
"Time: 04:16:26 Log-Likelihood: -41.639\n",
"No. Observations: 136 AIC: 93.28\n",
"Df Residuals: 131 BIC: 107.8\n",
"Df Model: 4 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 10.8194 0.325 33.331 0.000 10.177 11.462\n",
"立ち幅跳び -0.0120 0.002 -7.648 0.000 -0.015 -0.009\n",
"ハンドボール投げ -0.0144 0.006 -2.367 0.019 -0.026 -0.002\n",
"握力得点 -0.0402 0.024 -1.677 0.096 -0.088 0.007\n",
"上体起こし得点 -0.0255 0.020 -1.264 0.208 -0.065 0.014\n",
"==============================================================================\n",
"Omnibus: 3.064 Durbin-Watson: 1.813\n",
"Prob(Omnibus): 0.216 Jarque-Bera (JB): 2.955\n",
"Skew: 0.359 Prob(JB): 0.228\n",
"Kurtosis: 2.927 Cond. No. 2.61e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.61e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n"
],
"name": "stdout"
}
]
},
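{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough check, the Adj. R-squared in the summary above can be reproduced by hand. This is a sketch that assumes the usual definition for a model with an intercept, where $n$ is the number of observations and $p$ the number of explanatory variables, and it uses the rounded values printed above ($R^2 = 0.525$, $n = 136$, $p = 4$):\n",
"\n",
"$$\\bar{R}^2 = 1 - (1 - R^2)\\,\\frac{n - 1}{n - p - 1} = 1 - 0.475 \\times \\frac{135}{131} \\approx 0.510$$\n",
"\n",
"Because the correction factor grows with $p$, adding a variable raises the adjusted value only when it improves the fit by more than the penalty."
]
},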
{
"cell_type": "markdown",
"metadata": {
"id": "ZKjcnj-k-7s7",
"colab_type": "text"
},
"source": [
"Array-based code"
]
},
{
"cell_type": "code",
"metadata": {
"id": "hl15gVVm2Tve",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 527
},
"outputId": "39d4abd2-db27-4c2b-9a06-83b1a0edcddc"
},
"source": [
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"\n",
"high_male = pd.read_csv('/content/high_male_data.csv')\n",
"x = high_male[['立ち幅跳び', 'ハンドボール投げ', '握力得点','上体起こし得点']]\n",
"y = high_male['X50m走']\n",
"x = sm.add_constant(x)\n",
"model = sm.OLS(y, x)\n",
"results = model.fit()\n",
"print(results.summary())\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: X50m走 R-squared: 0.525\n",
"Model: OLS Adj. R-squared: 0.510\n",
"Method: Least Squares F-statistic: 36.18\n",
"Date: Mon, 20 Jul 2020 Prob (F-statistic): 2.39e-20\n",
"Time: 04:15:11 Log-Likelihood: -41.639\n",
"No. Observations: 136 AIC: 93.28\n",
"Df Residuals: 131 BIC: 107.8\n",
"Df Model: 4 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const 10.8194 0.325 33.331 0.000 10.177 11.462\n",
"立ち幅跳び -0.0120 0.002 -7.648 0.000 -0.015 -0.009\n",
"ハンドボール投げ -0.0144 0.006 -2.367 0.019 -0.026 -0.002\n",
"握力得点 -0.0402 0.024 -1.677 0.096 -0.088 0.007\n",
"上体起こし得点 -0.0255 0.020 -1.264 0.208 -0.065 0.014\n",
"==============================================================================\n",
"Omnibus: 3.064 Durbin-Watson: 1.813\n",
"Prob(Omnibus): 0.216 Jarque-Bera (JB): 2.955\n",
"Skew: 0.359 Prob(JB): 0.228\n",
"Kurtosis: 2.927 Cond. No. 2.61e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.61e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZtNTo17VZ-fi",
"colab_type": "text"
},
"source": [
"Alternatively, scikit-learn's LinearRegression can be used."
]
},
{
"cell_type": "code",
"metadata": {
"id": "IV5IdJEpGCL9",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"high_male = pd.read_csv('/content/high_male_data.csv')\n",
"# Instantiate the regression model\n",
"clf = LinearRegression()\n",
"\n",
"# Explanatory variables: the four fitness-test columns\n",
"X = high_male.loc[:, ['立ち幅跳び', 'ハンドボール投げ', '握力得点','上体起こし得点']].values\n",
"\n",
"# Response variable: the X50m走 (50 m run) column\n",
"Y = high_male['X50m走'].values\n",
"\n",
"# Fit the multiple regression model\n",
"results = clf.fit(X, Y)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "AyrQFX-l9Mod",
"colab_type": "text"
},
"source": [
"References:\n",
"https://tanuhack.com/statsmodels-multiple-lra/\n",
"\n",
"https://future-chem.com/esol-reg-aic/#stepwise_regression\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "6ZptolsJ-to7",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "3a93db00-6ae0-4eef-abe4-b4860e840bcd"
},
"source": [
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"high_male = pd.read_csv('/content/high_male_data.csv')\n",
"descriptors = ['立ち幅跳び', 'ハンドボール投げ', '握力得点','上体起こし得点']\n",
"\n",
"model = smf.ols('X50m走 ~ ' + ' + '.join(descriptors), data = high_male)\n",
"results = model.fit()\n",
"\n",
"print(results.summary())\n",
"\n",
"# Backward elimination: drop one variable at a time while AIC improves\n",
"best_aic = results.aic\n",
"best_model = model  # keep the unfitted model so .fit() below always works\n",
"while descriptors:\n",
"    desc_selected = ''\n",
"    flag = 0\n",
"    for desk in descriptors:\n",
"        used_desks = descriptors.copy()\n",
"        used_desks.remove(desk)\n",
"        # Fall back to an intercept-only model if no variables remain\n",
"        formula = 'X50m走 ~ ' + (' + '.join(used_desks) or '1')\n",
"        model = smf.ols(formula=formula, data=high_male)\n",
"        results = model.fit()\n",
"        if results.aic < best_aic:\n",
"            best_aic = results.aic\n",
"            best_model = model\n",
"            desc_selected = desk\n",
"            flag = 1\n",
"    if flag:\n",
"        descriptors.remove(desc_selected)\n",
"    else:\n",
"        # No removal lowered the AIC, so stop\n",
"        break\n",
"\n",
"stepwise_model = best_model.fit()\n",
"print(stepwise_model.summary())"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: X50m走 R-squared: 0.525\n",
"Model: OLS Adj. R-squared: 0.510\n",
"Method: Least Squares F-statistic: 36.18\n",
"Date: Fri, 24 Jul 2020 Prob (F-statistic): 2.39e-20\n",
"Time: 10:51:28 Log-Likelihood: -41.639\n",
"No. Observations: 136 AIC: 93.28\n",
"Df Residuals: 131 BIC: 107.8\n",
"Df Model: 4 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 10.8194 0.325 33.331 0.000 10.177 11.462\n",
"立ち幅跳び -0.0120 0.002 -7.648 0.000 -0.015 -0.009\n",
"ハンドボール投げ -0.0144 0.006 -2.367 0.019 -0.026 -0.002\n",
"握力得点 -0.0402 0.024 -1.677 0.096 -0.088 0.007\n",
"上体起こし得点 -0.0255 0.020 -1.264 0.208 -0.065 0.014\n",
"==============================================================================\n",
"Omnibus: 3.064 Durbin-Watson: 1.813\n",
"Prob(Omnibus): 0.216 Jarque-Bera (JB): 2.955\n",
"Skew: 0.359 Prob(JB): 0.228\n",
"Kurtosis: 2.927 Cond. No. 2.61e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.61e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: X50m走 R-squared: 0.519\n",
"Model: OLS Adj. R-squared: 0.508\n",
"Method: Least Squares F-statistic: 47.50\n",
"Date: Fri, 24 Jul 2020 Prob (F-statistic): 6.92e-21\n",
"Time: 10:51:28 Log-Likelihood: -42.463\n",
"No. Observations: 136 AIC: 92.93\n",
"Df Residuals: 132 BIC: 104.6\n",
"Df Model: 3 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 10.7121 0.314 34.114 0.000 10.091 11.333\n",
"立ち幅跳び -0.0121 0.002 -7.703 0.000 -0.015 -0.009\n",
"ハンドボール投げ -0.0169 0.006 -2.929 0.004 -0.028 -0.005\n",
"握力得点 -0.0439 0.024 -1.841 0.068 -0.091 0.003\n",
"==============================================================================\n",
"Omnibus: 2.668 Durbin-Watson: 1.820\n",
"Prob(Omnibus): 0.263 Jarque-Bera (JB): 2.632\n",
"Skew: 0.334 Prob(JB): 0.268\n",
"Kurtosis: 2.867 Cond. No. 2.51e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.51e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n"
],
"name": "stdout"
}
]
},
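{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two AIC values printed above can be checked by hand from the reported log-likelihoods. This sketch assumes the convention that $k$ counts every estimated coefficient including the intercept, which appears to be what statsmodels uses for OLS:\n",
"\n",
"$$\\mathrm{AIC} = -2\\ln L + 2k$$\n",
"\n",
"$$\\text{full model: } -2(-41.639) + 2 \\times 5 = 93.278 \\qquad \\text{reduced model: } -2(-42.463) + 2 \\times 4 = 92.926$$\n",
"\n",
"These round to the 93.28 and 92.93 shown in the summaries. Dropping 上体起こし得点 lowers the log-likelihood slightly, but the smaller penalty term more than compensates, so the reduced model is preferred under AIC."
]
},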
{
"cell_type": "code",
"metadata": {
"id": "CI6OXg4npAWh",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "0021272a-835e-418e-f7d4-d246e5bfe325"
},
"source": [
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"high_male = pd.read_csv('/content/high_male_data.csv')\n",
"descriptors = ['立ち幅跳び', 'ハンドボール投げ', '握力得点','上体起こし得点']\n",
"\n",
"model = smf.ols('X50m走 ~ ' + ' + '.join(descriptors), data = high_male)\n",
"results = model.fit()\n",
"\n",
"print(results.summary())\n",
"\n",
"best_aic = results.aic\n",
"best_model = model\n",
"while descriptors:\n",
"    # dict_models maps each removal candidate to [model, AIC];\n",
"    # it is rebuilt on every pass so stale candidates are not reused\n",
"    dict_models = {}\n",
"    for rm_desk in descriptors:\n",
"        used_desks = descriptors.copy()\n",
"        used_desks.remove(rm_desk)\n",
"        # Fall back to an intercept-only model if no variables remain\n",
"        formula = 'X50m走 ~ ' + (' + '.join(used_desks) or '1')\n",
"        resultmodel = smf.ols(formula = formula, data = high_male)\n",
"        dict_models[rm_desk] = [resultmodel, resultmodel.fit().aic]\n",
"    # Pick the candidate whose removal gives the smallest AIC\n",
"    min_k, min_v = min(dict_models.items(), key=lambda x: x[1][1])\n",
"    if min_v[1] < best_aic:\n",
"        best_model = min_v[0]\n",
"        best_aic = min_v[1]\n",
"        descriptors.remove(min_k)\n",
"    else:\n",
"        # Stop when removing a variable no longer improves the AIC\n",
"        break\n",
"\n",
"stepwise_model_fit = best_model.fit()\n",
"print(stepwise_model_fit.summary())"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: X50m走 R-squared: 0.525\n",
"Model: OLS Adj. R-squared: 0.510\n",
"Method: Least Squares F-statistic: 36.18\n",
"Date: Sat, 25 Jul 2020 Prob (F-statistic): 2.39e-20\n",
"Time: 02:25:36 Log-Likelihood: -41.639\n",
"No. Observations: 136 AIC: 93.28\n",
"Df Residuals: 131 BIC: 107.8\n",
"Df Model: 4 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 10.8194 0.325 33.331 0.000 10.177 11.462\n",
"立ち幅跳び -0.0120 0.002 -7.648 0.000 -0.015 -0.009\n",
"ハンドボール投げ -0.0144 0.006 -2.367 0.019 -0.026 -0.002\n",
"握力得点 -0.0402 0.024 -1.677 0.096 -0.088 0.007\n",
"上体起こし得点 -0.0255 0.020 -1.264 0.208 -0.065 0.014\n",
"==============================================================================\n",
"Omnibus: 3.064 Durbin-Watson: 1.813\n",
"Prob(Omnibus): 0.216 Jarque-Bera (JB): 2.955\n",
"Skew: 0.359 Prob(JB): 0.228\n",
"Kurtosis: 2.927 Cond. No. 2.61e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.61e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: X50m走 R-squared: 0.519\n",
"Model: OLS Adj. R-squared: 0.508\n",
"Method: Least Squares F-statistic: 47.50\n",
"Date: Sat, 25 Jul 2020 Prob (F-statistic): 6.92e-21\n",
"Time: 02:25:36 Log-Likelihood: -42.463\n",
"No. Observations: 136 AIC: 92.93\n",
"Df Residuals: 132 BIC: 104.6\n",
"Df Model: 3 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept 10.7121 0.314 34.114 0.000 10.091 11.333\n",
"立ち幅跳び -0.0121 0.002 -7.703 0.000 -0.015 -0.009\n",
"ハンドボール投げ -0.0169 0.006 -2.929 0.004 -0.028 -0.005\n",
"握力得点 -0.0439 0.024 -1.841 0.068 -0.091 0.003\n",
"==============================================================================\n",
"Omnibus: 2.668 Durbin-Watson: 1.820\n",
"Prob(Omnibus): 0.263 Jarque-Bera (JB): 2.632\n",
"Skew: 0.334 Prob(JB): 0.268\n",
"Kurtosis: 2.867 Cond. No. 2.51e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"[2] The condition number is large, 2.51e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n"
],
"name": "stdout"
}
]
}
]
}