Skip to content

Instantly share code, notes, and snippets.

@statkclee
Created November 30, 2015 10:47
Show Gist options
  • Save statkclee/9d54f17859763713973c to your computer and use it in GitHub Desktop.
Save statkclee/9d54f17859763713973c to your computer and use it in GitHub Desktop.
Think Stats 2, 11장 연습문제 해답
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"통계적 사고 (2판) 연습문제 ([thinkstats2.com](thinkstats2.com), [think-stat.xwmooc.org](http://think-stat.xwmooc.org))<br>\n",
"Allen Downey / 이광춘(xwMOOC)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 연습문제 11.1\n",
"\n",
"동료중 한명이 출산이 예정되어 있고, 출생일을 예측하는 사무실 내기게임에 참여한다고 가정하자. 임신 30주차에 내기를 건다고 가정하자. 가장 최선의 예측을 하는데 어떤 변수를 사용하면 될까? 출생전에 알려진 변수로 한정해야하고, 내기에 참여한 사람들에게 이용가능해야할 것 같다.\n",
"\n",
"## 연습문제 11.2\n",
"\n",
"트리버스-윌라드(Trivers-Willard) 가설은 많은 포유류에 있어, 성비는 “어머니 상태(maternal condition)”에 달려있다고 제시한다; 즉, 산모 연령, 크기, 건강, 사회적 지위 같은 요인. https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis 참조한다.\n",
"일부 연구는 사람사이에 이런 효과를 보여주고 있지만, 결과는 혼재한다. 이번 장에서, 이런 요인과 연관된 일부 변수를 검정하지만, 성비에 통계적으로 유의적인 효과를 갖는 어떤 변수도 발견하지 못했다.\n",
"\n",
"연습으로, 데이터 마이닝 접근법을 사용해서 임신파일과 응답자파일에서 다른 변수를 검정하라. 실질적인 효과를 갖는 다른 요인을 발견할 수 있는가?\n",
"\n",
"## 연습문제 11.3\n",
"예측하고자 하는 수량(quantity)이 갯수(count)라면, 포아송 회귀를 사용할 수 있다. StatModels에 poisson 함수로 구현되어 있다. ols 나 logit 과 동일한 방식으로 작동한다. 연습으로, 이것을 사용해서 한 여성에 대해서 얼마나 많은 아기가 태어나는지 예측한다; NSFG 데이터셋에서, 이 변수는 numbabes로 불린다..\n",
"나이가 35세, 흑인, 연가구소득이 $75,000을 넘는 대학을 졸업한 여성을 가정해보자. 그녀가 얼마나 많은 자녀를 출산할 것으로 예상되는가?\n",
"\n",
"## 연습문제 11.4\n",
"\n",
"만약 예측하고자 하는 수량이 범주형이라면, 다항 로지스틱 회귀를 사용한다. StatModels에서 mnlogit 함수로 구현되어 있다. 연습으로, 이를 사용해서, 한 여성이 혼인상태, 동거상태, 과부상태, 이혼상태, 별거상태, 미혼인지 추측해본다; NSFG 데이터셋에서, 혼인상태는 변수명 rmarital로 부호화되어 있다.\n",
"연령이 25세, 백인, 연가구소득이 약 $45,000 달러인 고졸 여성을 만났다고 가정해보자. 그녀가 혼인, 동거, 등일 확률은 얼마나 될까?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import first\n",
"live, firsts, others = first.MakeFrames()\n",
"df = live[live.prglngth>30]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"아래에 저자가 발견한 임신기간에 통계적으로 유의적인 변수만 나와있다."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>prglngth</td> <th> R-squared: </th> <td> 0.011</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.011</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 34.28</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 30 Nov 2015</td> <th> Prob (F-statistic):</th> <td>5.09e-22</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>10:43:48</td> <th> Log-Likelihood: </th> <td> -18247.</td> \n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 8884</td> <th> AIC: </th> <td>3.650e+04</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 8880</td> <th> BIC: </th> <td>3.653e+04</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 3</td> <th> </th> <td> </td> \n",
"</tr>\n",
"<tr>\n",
" <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 38.7617</td> <td> 0.039</td> <td> 1006.410</td> <td> 0.000</td> <td> 38.686 38.837</td>\n",
"</tr>\n",
"<tr>\n",
" <th>birthord == 1[T.True]</th> <td> 0.1015</td> <td> 0.040</td> <td> 2.528</td> <td> 0.011</td> <td> 0.023 0.180</td>\n",
"</tr>\n",
"<tr>\n",
" <th>race == 2[T.True]</th> <td> 0.1390</td> <td> 0.042</td> <td> 3.311</td> <td> 0.001</td> <td> 0.057 0.221</td>\n",
"</tr>\n",
"<tr>\n",
" <th>nbrnaliv > 1[T.True]</th> <td> -1.4944</td> <td> 0.164</td> <td> -9.086</td> <td> 0.000</td> <td> -1.817 -1.172</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td>1587.470</td> <th> Durbin-Watson: </th> <td> 1.619</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>6160.751</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td>-0.852</td> <th> Prob(JB): </th> <td> 0.00</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td> 6.707</td> <th> Cond. No. </th> <td> 10.9</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: prglngth R-squared: 0.011\n",
"Model: OLS Adj. R-squared: 0.011\n",
"Method: Least Squares F-statistic: 34.28\n",
"Date: Mon, 30 Nov 2015 Prob (F-statistic): 5.09e-22\n",
"Time: 10:43:48 Log-Likelihood: -18247.\n",
"No. Observations: 8884 AIC: 3.650e+04\n",
"Df Residuals: 8880 BIC: 3.653e+04\n",
"Df Model: 3 \n",
"Covariance Type: nonrobust \n",
"=========================================================================================\n",
" coef std err t P>|t| [95.0% Conf. Int.]\n",
"-----------------------------------------------------------------------------------------\n",
"Intercept 38.7617 0.039 1006.410 0.000 38.686 38.837\n",
"birthord == 1[T.True] 0.1015 0.040 2.528 0.011 0.023 0.180\n",
"race == 2[T.True] 0.1390 0.042 3.311 0.001 0.057 0.221\n",
"nbrnaliv > 1[T.True] -1.4944 0.164 -9.086 0.000 -1.817 -1.172\n",
"==============================================================================\n",
"Omnibus: 1587.470 Durbin-Watson: 1.619\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 6160.751\n",
"Skew: -0.852 Prob(JB): 0.00\n",
"Kurtosis: 6.707 Cond. No. 10.9\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import statsmodels.formula.api as smf\n",
"model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=df)\n",
"results = model.fit()\n",
"results.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"성비를 예측하는 변수에 대해서 좀더 깊게 살펴보자."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import regression\n",
"join = regression.JoinFemResp(live)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.687464\n",
" Iterations 4\n"
]
}
],
"source": [
"def GoMining(df):\n",
" \"\"\"Searches for variables that predict birth weight.\n",
"\n",
" df: DataFrame of pregnancy records\n",
"\n",
" returns: list of (rsquared, variable name) pairs\n",
" \"\"\"\n",
" df['boy'] = (df.babysex==1).astype(int)\n",
" variables = []\n",
" for name in df.columns:\n",
" try:\n",
" if df[name].var() < 1e-7:\n",
" continue\n",
"\n",
" formula='boy ~ agepreg + ' + name\n",
" model = smf.logit(formula, data=df)\n",
" nobs = len(model.endog)\n",
" if nobs < len(df)/2:\n",
" continue\n",
"\n",
" results = model.fit()\n",
" except:\n",
" continue\n",
"\n",
" variables.append((results.prsquared, name))\n",
"\n",
" return variables\n",
"\n",
"variables = GoMining(join)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"가장 높은 유사-$R^2$(pseudo-$R^2$) 값을 이끌어 내는 변수가 30개 있다."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"totalwgt_lb 0.00804663447667\n"
]
}
],
"source": [
"regression.MiningReport(variables)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"임신기간 동안 알려지지 않은 변수와 다양한 사유로 수상한 냄새가 나는 변수를 제거하고 나서, 저자가 찾은 가장 최적의 모형이 다음에 나와있다:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.691983\n",
" Iterations 4\n"
]
},
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>Logit Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>boy</td> <th> No. Observations: </th> <td> 9148</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>Logit</td> <th> Df Residuals: </th> <td> 9144</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>MLE</td> <th> Df Model: </th> <td> 3</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 30 Nov 2015</td> <th> Pseudo R-squ.: </th> <td>0.001525</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>10:45:58</td> <th> Log-Likelihood: </th> <td> -6330.3</td> \n",
"</tr>\n",
"<tr>\n",
" <th>converged:</th> <td>True</td> <th> LL-Null: </th> <td> -6339.9</td> \n",
"</tr>\n",
"<tr>\n",
" <th> </th> <td> </td> <th> LLR p-value: </th> <td>0.0002326</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> -0.1863</td> <td> 0.116</td> <td> -1.606</td> <td> 0.108</td> <td> -0.414 0.041</td>\n",
"</tr>\n",
"<tr>\n",
" <th>fmarout5 == 5[T.True]</th> <td> 0.1493</td> <td> 0.048</td> <td> 3.087</td> <td> 0.002</td> <td> 0.054 0.244</td>\n",
"</tr>\n",
"<tr>\n",
" <th>infever == 1[T.True]</th> <td> 0.2096</td> <td> 0.063</td> <td> 3.304</td> <td> 0.001</td> <td> 0.085 0.334</td>\n",
"</tr>\n",
"<tr>\n",
" <th>agepreg</th> <td> 0.0053</td> <td> 0.004</td> <td> 1.252</td> <td> 0.211</td> <td> -0.003 0.014</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: boy No. Observations: 9148\n",
"Model: Logit Df Residuals: 9144\n",
"Method: MLE Df Model: 3\n",
"Date: Mon, 30 Nov 2015 Pseudo R-squ.: 0.001525\n",
"Time: 10:45:58 Log-Likelihood: -6330.3\n",
"converged: True LL-Null: -6339.9\n",
" LLR p-value: 0.0002326\n",
"=========================================================================================\n",
" coef std err z P>|z| [95.0% Conf. Int.]\n",
"-----------------------------------------------------------------------------------------\n",
"Intercept -0.1863 0.116 -1.606 0.108 -0.414 0.041\n",
"fmarout5 == 5[T.True] 0.1493 0.048 3.087 0.002 0.054 0.244\n",
"infever == 1[T.True] 0.2096 0.063 3.304 0.001 0.085 0.334\n",
"agepreg 0.0053 0.004 1.252 0.211 -0.003 0.014\n",
"=========================================================================================\n",
"\"\"\""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"formula='boy ~ agepreg + fmarout5==5 + infever==1'\n",
"model = smf.logit(formula, data=join)\n",
"results = model.fit()\n",
"results.summary() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"응답자 자녀숫자를 예측하는 모형을 만들어보자. 연령에 대해 비선형모형을 사용했다. age3 항이 지나칠 수 있지만, 저자 생각에는 엉뚱한 선택은 아니라고 본다. 예측에 그다지 효과는 없다."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"join.numbabes.replace([97], np.nan, inplace=True)\n",
"join['age2'] = join.age_r**2\n",
"join['age3'] = join.age_r**3"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 1.677637\n",
" Iterations 7\n"
]
},
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>Poisson Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>numbabes</td> <th> No. Observations: </th> <td> 9148</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>Poisson</td> <th> Df Residuals: </th> <td> 9140</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>MLE</td> <th> Df Model: </th> <td> 7</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 30 Nov 2015</td> <th> Pseudo R-squ.: </th> <td>0.03665</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>10:46:04</td> <th> Log-Likelihood: </th> <td> -15347.</td> \n",
"</tr>\n",
"<tr>\n",
" <th>converged:</th> <td>True</td> <th> LL-Null: </th> <td> -15931.</td> \n",
"</tr>\n",
"<tr>\n",
" <th> </th> <td> </td> <th> LLR p-value: </th> <td>6.721e-248</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> -3.2731</td> <td> 0.719</td> <td> -4.551</td> <td> 0.000</td> <td> -4.683 -1.863</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.2]</th> <td> -0.1403</td> <td> 0.014</td> <td> -9.683</td> <td> 0.000</td> <td> -0.169 -0.112</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.3]</th> <td> -0.0959</td> <td> 0.024</td> <td> -3.957</td> <td> 0.000</td> <td> -0.143 -0.048</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age_r</th> <td> 0.3744</td> <td> 0.069</td> <td> 5.433</td> <td> 0.000</td> <td> 0.239 0.510</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age2</th> <td> -0.0090</td> <td> 0.002</td> <td> -4.158</td> <td> 0.000</td> <td> -0.013 -0.005</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age3</th> <td> 7.12e-05</td> <td> 2.2e-05</td> <td> 3.237</td> <td> 0.001</td> <td> 2.81e-05 0.000</td>\n",
"</tr>\n",
"<tr>\n",
" <th>totincr</th> <td> -0.0185</td> <td> 0.002</td> <td> -9.858</td> <td> 0.000</td> <td> -0.022 -0.015</td>\n",
"</tr>\n",
"<tr>\n",
" <th>educat</th> <td> -0.0463</td> <td> 0.003</td> <td> -15.988</td> <td> 0.000</td> <td> -0.052 -0.041</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" Poisson Regression Results \n",
"==============================================================================\n",
"Dep. Variable: numbabes No. Observations: 9148\n",
"Model: Poisson Df Residuals: 9140\n",
"Method: MLE Df Model: 7\n",
"Date: Mon, 30 Nov 2015 Pseudo R-squ.: 0.03665\n",
"Time: 10:46:04 Log-Likelihood: -15347.\n",
"converged: True LL-Null: -15931.\n",
" LLR p-value: 6.721e-248\n",
"================================================================================\n",
" coef std err z P>|z| [95.0% Conf. Int.]\n",
"--------------------------------------------------------------------------------\n",
"Intercept -3.2731 0.719 -4.551 0.000 -4.683 -1.863\n",
"C(race)[T.2] -0.1403 0.014 -9.683 0.000 -0.169 -0.112\n",
"C(race)[T.3] -0.0959 0.024 -3.957 0.000 -0.143 -0.048\n",
"age_r 0.3744 0.069 5.433 0.000 0.239 0.510\n",
"age2 -0.0090 0.002 -4.158 0.000 -0.013 -0.005\n",
"age3 7.12e-05 2.2e-05 3.237 0.001 2.81e-05 0.000\n",
"totincr -0.0185 0.002 -9.858 0.000 -0.022 -0.015\n",
"educat -0.0463 0.003 -15.988 0.000 -0.052 -0.041\n",
"================================================================================\n",
"\"\"\""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"formula='numbabes ~ age_r + age2 + age3 + C(race) + totincr + educat'\n",
"model = smf.poisson(formula, data=join)\n",
"results = model.fit()\n",
"results.summary() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"이제, 35세, 흑인, 대학졸업하고 연간 가구소등이 $75,000을 넘는 여성에 대한 자녀수를 예측할 수 있다. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2.47393019])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"columns = ['age_r', 'age2', 'age3', 'race', 'totincr', 'educat']\n",
"new = pandas.DataFrame([[35, 35**2, 35**3, 1, 14, 16]], columns=columns)\n",
"results.predict(new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"혼인상태를 예측하기 위하는데, 저자가 찾은 가장 최적의 모형이 다음에 나와 있다:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 1.092083\n",
" Iterations 8\n"
]
},
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>MNLogit Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>rmarital</td> <th> No. Observations: </th> <td> 9148</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>MNLogit</td> <th> Df Residuals: </th> <td> 9113</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>MLE</td> <th> Df Model: </th> <td> 30</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Mon, 30 Nov 2015</td> <th> Pseudo R-squ.: </th> <td>0.1661</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>10:46:25</td> <th> Log-Likelihood: </th> <td> -9990.4</td>\n",
"</tr>\n",
"<tr>\n",
" <th>converged:</th> <td>True</td> <th> LL-Null: </th> <td> -11981.</td>\n",
"</tr>\n",
"<tr>\n",
" <th> </th> <td> </td> <th> LLR p-value: </th> <td> 0.000</td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>rmarital=2</th> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 8.9153</td> <td> 0.792</td> <td> 11.251</td> <td> 0.000</td> <td> 7.362 10.468</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.2]</th> <td> -0.9260</td> <td> 0.087</td> <td> -10.705</td> <td> 0.000</td> <td> -1.096 -0.756</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.3]</th> <td> -0.6335</td> <td> 0.133</td> <td> -4.747</td> <td> 0.000</td> <td> -0.895 -0.372</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age_r</th> <td> -0.3567</td> <td> 0.050</td> <td> -7.132</td> <td> 0.000</td> <td> -0.455 -0.259</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age2</th> <td> 0.0047</td> <td> 0.001</td> <td> 6.054</td> <td> 0.000</td> <td> 0.003 0.006</td>\n",
"</tr>\n",
"<tr>\n",
" <th>totincr</th> <td> -0.1301</td> <td> 0.011</td> <td> -11.475</td> <td> 0.000</td> <td> -0.152 -0.108</td>\n",
"</tr>\n",
"<tr>\n",
" <th>educat</th> <td> -0.1940</td> <td> 0.018</td> <td> -10.534</td> <td> 0.000</td> <td> -0.230 -0.158</td>\n",
"</tr>\n",
"<tr>\n",
" <th>rmarital=3</th> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 2.9927</td> <td> 2.970</td> <td> 1.007</td> <td> 0.314</td> <td> -2.829 8.815</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.2]</th> <td> -0.3963</td> <td> 0.235</td> <td> -1.685</td> <td> 0.092</td> <td> -0.857 0.065</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.3]</th> <td> 0.0650</td> <td> 0.336</td> <td> 0.194</td> <td> 0.846</td> <td> -0.593 0.723</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age_r</th> <td> -0.3141</td> <td> 0.174</td> <td> -1.806</td> <td> 0.071</td> <td> -0.655 0.027</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age2</th> <td> 0.0064</td> <td> 0.003</td> <td> 2.532</td> <td> 0.011</td> <td> 0.001 0.011</td>\n",
"</tr>\n",
"<tr>\n",
" <th>totincr</th> <td> -0.3217</td> <td> 0.032</td> <td> -10.135</td> <td> 0.000</td> <td> -0.384 -0.259</td>\n",
"</tr>\n",
"<tr>\n",
" <th>educat</th> <td> -0.1093</td> <td> 0.048</td> <td> -2.266</td> <td> 0.023</td> <td> -0.204 -0.015</td>\n",
"</tr>\n",
"<tr>\n",
" <th>rmarital=4</th> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> -3.6475</td> <td> 1.193</td> <td> -3.059</td> <td> 0.002</td> <td> -5.985 -1.310</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.2]</th> <td> -0.3303</td> <td> 0.091</td> <td> -3.641</td> <td> 0.000</td> <td> -0.508 -0.153</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.3]</th> <td> -0.8227</td> <td> 0.170</td> <td> -4.853</td> <td> 0.000</td> <td> -1.155 -0.490</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age_r</th> <td> 0.1238</td> <td> 0.070</td> <td> 1.763</td> <td> 0.078</td> <td> -0.014 0.261</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age2</th> <td> -0.0008</td> <td> 0.001</td> <td> -0.814</td> <td> 0.416</td> <td> -0.003 0.001</td>\n",
"</tr>\n",
"<tr>\n",
" <th>totincr</th> <td> -0.2288</td> <td> 0.011</td> <td> -20.058</td> <td> 0.000</td> <td> -0.251 -0.206</td>\n",
"</tr>\n",
"<tr>\n",
" <th>educat</th> <td> 0.0661</td> <td> 0.016</td> <td> 4.015</td> <td> 0.000</td> <td> 0.034 0.098</td>\n",
"</tr>\n",
"<tr>\n",
" <th>rmarital=5</th> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> -2.3978</td> <td> 1.269</td> <td> -1.889</td> <td> 0.059</td> <td> -4.886 0.090</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.2]</th> <td> -1.0493</td> <td> 0.101</td> <td> -10.366</td> <td> 0.000</td> <td> -1.248 -0.851</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.3]</th> <td> -0.6065</td> <td> 0.154</td> <td> -3.937</td> <td> 0.000</td> <td> -0.908 -0.305</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age_r</th> <td> 0.2084</td> <td> 0.077</td> <td> 2.699</td> <td> 0.007</td> <td> 0.057 0.360</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age2</th> <td> -0.0030</td> <td> 0.001</td> <td> -2.619</td> <td> 0.009</td> <td> -0.005 -0.001</td>\n",
"</tr>\n",
"<tr>\n",
" <th>totincr</th> <td> -0.2900</td> <td> 0.014</td> <td> -20.314</td> <td> 0.000</td> <td> -0.318 -0.262</td>\n",
"</tr>\n",
"<tr>\n",
" <th>educat</th> <td> -0.0176</td> <td> 0.021</td> <td> -0.835</td> <td> 0.404</td> <td> -0.059 0.024</td>\n",
"</tr>\n",
"<tr>\n",
" <th>rmarital=6</th> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 8.1722</td> <td> 0.797</td> <td> 10.252</td> <td> 0.000</td> <td> 6.610 9.734</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.2]</th> <td> -2.1628</td> <td> 0.079</td> <td> -27.532</td> <td> 0.000</td> <td> -2.317 -2.009</td>\n",
"</tr>\n",
"<tr>\n",
" <th>C(race)[T.3]</th> <td> -1.9701</td> <td> 0.136</td> <td> -14.472</td> <td> 0.000</td> <td> -2.237 -1.703</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age_r</th> <td> -0.2188</td> <td> 0.050</td> <td> -4.337</td> <td> 0.000</td> <td> -0.318 -0.120</td>\n",
"</tr>\n",
"<tr>\n",
" <th>age2</th> <td> 0.0020</td> <td> 0.001</td> <td> 2.505</td> <td> 0.012</td> <td> 0.000 0.004</td>\n",
"</tr>\n",
"<tr>\n",
" <th>totincr</th> <td> -0.2888</td> <td> 0.011</td> <td> -25.418</td> <td> 0.000</td> <td> -0.311 -0.267</td>\n",
"</tr>\n",
"<tr>\n",
" <th>educat</th> <td> -0.0802</td> <td> 0.018</td> <td> -4.582</td> <td> 0.000</td> <td> -0.115 -0.046</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" MNLogit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: rmarital No. Observations: 9148\n",
"Model: MNLogit Df Residuals: 9113\n",
"Method: MLE Df Model: 30\n",
"Date: Mon, 30 Nov 2015 Pseudo R-squ.: 0.1661\n",
"Time: 10:46:25 Log-Likelihood: -9990.4\n",
"converged: True LL-Null: -11981.\n",
" LLR p-value: 0.000\n",
"================================================================================\n",
" rmarital=2 coef std err z P>|z| [95.0% Conf. Int.]\n",
"--------------------------------------------------------------------------------\n",
"Intercept 8.9153 0.792 11.251 0.000 7.362 10.468\n",
"C(race)[T.2] -0.9260 0.087 -10.705 0.000 -1.096 -0.756\n",
"C(race)[T.3] -0.6335 0.133 -4.747 0.000 -0.895 -0.372\n",
"age_r -0.3567 0.050 -7.132 0.000 -0.455 -0.259\n",
"age2 0.0047 0.001 6.054 0.000 0.003 0.006\n",
"totincr -0.1301 0.011 -11.475 0.000 -0.152 -0.108\n",
"educat -0.1940 0.018 -10.534 0.000 -0.230 -0.158\n",
"--------------------------------------------------------------------------------\n",
" rmarital=3 coef std err z P>|z| [95.0% Conf. Int.]\n",
"--------------------------------------------------------------------------------\n",
"Intercept 2.9927 2.970 1.007 0.314 -2.829 8.815\n",
"C(race)[T.2] -0.3963 0.235 -1.685 0.092 -0.857 0.065\n",
"C(race)[T.3] 0.0650 0.336 0.194 0.846 -0.593 0.723\n",
"age_r -0.3141 0.174 -1.806 0.071 -0.655 0.027\n",
"age2 0.0064 0.003 2.532 0.011 0.001 0.011\n",
"totincr -0.3217 0.032 -10.135 0.000 -0.384 -0.259\n",
"educat -0.1093 0.048 -2.266 0.023 -0.204 -0.015\n",
"--------------------------------------------------------------------------------\n",
" rmarital=4 coef std err z P>|z| [95.0% Conf. Int.]\n",
"--------------------------------------------------------------------------------\n",
"Intercept -3.6475 1.193 -3.059 0.002 -5.985 -1.310\n",
"C(race)[T.2] -0.3303 0.091 -3.641 0.000 -0.508 -0.153\n",
"C(race)[T.3] -0.8227 0.170 -4.853 0.000 -1.155 -0.490\n",
"age_r 0.1238 0.070 1.763 0.078 -0.014 0.261\n",
"age2 -0.0008 0.001 -0.814 0.416 -0.003 0.001\n",
"totincr -0.2288 0.011 -20.058 0.000 -0.251 -0.206\n",
"educat 0.0661 0.016 4.015 0.000 0.034 0.098\n",
"--------------------------------------------------------------------------------\n",
" rmarital=5 coef std err z P>|z| [95.0% Conf. Int.]\n",
"--------------------------------------------------------------------------------\n",
"Intercept -2.3978 1.269 -1.889 0.059 -4.886 0.090\n",
"C(race)[T.2] -1.0493 0.101 -10.366 0.000 -1.248 -0.851\n",
"C(race)[T.3] -0.6065 0.154 -3.937 0.000 -0.908 -0.305\n",
"age_r 0.2084 0.077 2.699 0.007 0.057 0.360\n",
"age2 -0.0030 0.001 -2.619 0.009 -0.005 -0.001\n",
"totincr -0.2900 0.014 -20.314 0.000 -0.318 -0.262\n",
"educat -0.0176 0.021 -0.835 0.404 -0.059 0.024\n",
"--------------------------------------------------------------------------------\n",
" rmarital=6 coef std err z P>|z| [95.0% Conf. Int.]\n",
"--------------------------------------------------------------------------------\n",
"Intercept 8.1722 0.797 10.252 0.000 6.610 9.734\n",
"C(race)[T.2] -2.1628 0.079 -27.532 0.000 -2.317 -2.009\n",
"C(race)[T.3] -1.9701 0.136 -14.472 0.000 -2.237 -1.703\n",
"age_r -0.2188 0.050 -4.337 0.000 -0.318 -0.120\n",
"age2 0.0020 0.001 2.505 0.012 0.000 0.004\n",
"totincr -0.2888 0.011 -25.418 0.000 -0.311 -0.267\n",
"educat -0.0802 0.018 -4.582 0.000 -0.115 -0.046\n",
"================================================================================\n",
"\"\"\""
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"formula='rmarital ~ age_r + age2 + C(race) + totincr + educat'\n",
"model = smf.mnlogit(formula, data=join)\n",
"results = model.fit()\n",
"results.summary() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"25세, 백인, 고등학교 졸업, 연간 가구소득이 $45,000인 여성에 대한 예측이 다음에 나와있다."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.74568869, 0.12829132, 0.0016241 , 0.03280099, 0.02188668,\n",
" 0.06970823]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"columns = ['age_r', 'age2', 'race', 'totincr', 'educat']\n",
"new = pandas.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)\n",
"results.predict(new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"그래서, 이 분의 경우 현재 결혼가능성이 75%이고, \"결혼하지 않고 다른 성과 동거\" 가능성이 13% 등이다."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment