Created August 20, 2015 23:09
Gist: clarkgrubb/8a1142b1e97310857a38
{
"metadata": {
"name": "",
"signature": "sha256:ad9def6cd2562187051a60bccf190018c12b68225779c07fe4622ad0800536cb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import scipy.stats as stats\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create some data. We will fit models of y and z as functions of x.\n",
"\n",
"y is generated from a single model, but z is generated from two different models."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"size = 50\n",
"\n",
"x1 = np.array(range(0, size))\n",
"y1 = 2 * x1 + 1 + stats.norm.rvs(0, 1, size)\n",
"z1 = 2 * x1 + 1 + stats.norm.rvs(0, 1, size)\n",
"\n",
"x2 = np.array(range(2 * size, 3 * size))\n",
"y2 = 2 * x2 + 1 + stats.norm.rvs(0, 1, size)\n",
"z2 = 3 * x2 - 1 + stats.norm.rvs(0, 1, size)\n",
"\n",
"x = np.concatenate([x1, x2])\n",
"y = np.concatenate([y1, y2])\n",
"z = np.concatenate([z1, z2])\n",
"\n",
"classifier = [i > size for i in x]\n",
"\n",
"df = pd.DataFrame({'x': x, 'y': y, 'z': z, 'classifier': classifier})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform the Levene test on the data to assess whether the two populations have the same variance.\n",
"\n",
"The return value is a tuple (statistic, p-value). A small p-value indicates that the populations probably do not have the same variance."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stats.levene(y1, y2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"(0.0013988754231943389, 0.97024092488680347)"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stats.levene(z1, z2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"(11.156732721202243, 0.0011854998375876872)"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit an OLS model with an interaction term to the data set.\n",
"\n",
"We are interested in whether the coefficients of `classifier` and `x:classifier` are significant. If they are, this is evidence that the two groups were generated from different models."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fit1 = smf.ols(formula='y ~ x + classifier + x * classifier', data=df).fit()\n",
"fit1.summary()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>y</td> <th> R-squared: </th> <td> 1.000</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 1.000</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>3.873e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Thu, 20 Aug 2015</td> <th> Prob (F-statistic):</th> <td>8.24e-196</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>16:05:29</td> <th> Log-Likelihood: </th> <td> -136.39</td> \n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 100</td> <th> AIC: </th> <td> 280.8</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 96</td> <th> BIC: </th> <td> 291.2</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 3</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 0.9109</td> <td> 0.269</td> <td> 3.384</td> <td> 0.001</td> <td> 0.377 1.445</td>\n",
"</tr>\n",
"<tr>\n",
" <th>classifier[T.True]</th> <td> -3.0054</td> <td> 1.217</td> <td> -2.470</td> <td> 0.015</td> <td> -5.420 -0.591</td>\n",
"</tr>\n",
"<tr>\n",
" <th>x</th> <td> 2.0061</td> <td> 0.009</td> <td> 211.922</td> <td> 0.000</td> <td> 1.987 2.025</td>\n",
"</tr>\n",
"<tr>\n",
" <th>x:classifier[T.True]</th> <td> 0.0188</td> <td> 0.013</td> <td> 1.407</td> <td> 0.163</td> <td> -0.008 0.045</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td> 0.486</td> <th> Durbin-Watson: </th> <td> 1.941</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.784</td> <th> Jarque-Bera (JB): </th> <td> 0.621</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td> 0.011</td> <th> Prob(JB): </th> <td> 0.733</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td> 2.615</td> <th> Cond. No. </th> <td>1.59e+03</td>\n",
"</tr>\n",
"</table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: y R-squared: 1.000\n",
"Model: OLS Adj. R-squared: 1.000\n",
"Method: Least Squares F-statistic: 3.873e+05\n",
"Date: Thu, 20 Aug 2015 Prob (F-statistic): 8.24e-196\n",
"Time: 16:05:29 Log-Likelihood: -136.39\n",
"No. Observations: 100 AIC: 280.8\n",
"Df Residuals: 96 BIC: 291.2\n",
"Df Model: 3 \n",
"========================================================================================\n",
" coef std err t P>|t| [95.0% Conf. Int.]\n",
"----------------------------------------------------------------------------------------\n",
"Intercept 0.9109 0.269 3.384 0.001 0.377 1.445\n",
"classifier[T.True] -3.0054 1.217 -2.470 0.015 -5.420 -0.591\n",
"x 2.0061 0.009 211.922 0.000 1.987 2.025\n",
"x:classifier[T.True] 0.0188 0.013 1.407 0.163 -0.008 0.045\n",
"==============================================================================\n",
"Omnibus: 0.486 Durbin-Watson: 1.941\n",
"Prob(Omnibus): 0.784 Jarque-Bera (JB): 0.621\n",
"Skew: 0.011 Prob(JB): 0.733\n",
"Kurtosis: 2.615 Cond. No. 1.59e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] The condition number is large, 1.59e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fit2 = smf.ols(formula='z ~ x + classifier + x * classifier', data=df).fit()\n",
"fit2.summary()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>z</td> <th> R-squared: </th> <td> 1.000</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 1.000</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>8.636e+05</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Thu, 20 Aug 2015</td> <th> Prob (F-statistic):</th> <td>1.58e-212</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>16:06:06</td> <th> Log-Likelihood: </th> <td> -142.57</td> \n",
"</tr>\n",
"<tr>\n",
" <th>No. Observations:</th> <td> 100</td> <th> AIC: </th> <td> 293.1</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Residuals:</th> <td> 96</td> <th> BIC: </th> <td> 303.6</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Df Model:</th> <td> 3</td> <th> </th> <td> </td> \n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[95.0% Conf. Int.]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>Intercept</th> <td> 0.7066</td> <td> 0.286</td> <td> 2.468</td> <td> 0.015</td> <td> 0.138 1.275</td>\n",
"</tr>\n",
"<tr>\n",
" <th>classifier[T.True]</th> <td> -2.0824</td> <td> 1.294</td> <td> -1.609</td> <td> 0.111</td> <td> -4.651 0.487</td>\n",
"</tr>\n",
"<tr>\n",
" <th>x</th> <td> 2.0095</td> <td> 0.010</td> <td> 199.543</td> <td> 0.000</td> <td> 1.989 2.029</td>\n",
"</tr>\n",
"<tr>\n",
" <th>x:classifier[T.True]</th> <td> 0.9929</td> <td> 0.014</td> <td> 69.718</td> <td> 0.000</td> <td> 0.965 1.021</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <th>Omnibus:</th> <td> 3.332</td> <th> Durbin-Watson: </th> <td> 2.180</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Prob(Omnibus):</th> <td> 0.189</td> <th> Jarque-Bera (JB): </th> <td> 3.325</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Skew:</th> <td>-0.412</td> <th> Prob(JB): </th> <td> 0.190</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Kurtosis:</th> <td> 2.653</td> <th> Cond. No. </th> <td>1.59e+03</td>\n",
"</tr>\n",
"</table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: z R-squared: 1.000\n",
"Model: OLS Adj. R-squared: 1.000\n",
"Method: Least Squares F-statistic: 8.636e+05\n",
"Date: Thu, 20 Aug 2015 Prob (F-statistic): 1.58e-212\n",
"Time: 16:06:06 Log-Likelihood: -142.57\n",
"No. Observations: 100 AIC: 293.1\n",
"Df Residuals: 96 BIC: 303.6\n",
"Df Model: 3 \n",
"========================================================================================\n",
" coef std err t P>|t| [95.0% Conf. Int.]\n",
"----------------------------------------------------------------------------------------\n",
"Intercept 0.7066 0.286 2.468 0.015 0.138 1.275\n",
"classifier[T.True] -2.0824 1.294 -1.609 0.111 -4.651 0.487\n",
"x 2.0095 0.010 199.543 0.000 1.989 2.029\n",
"x:classifier[T.True] 0.9929 0.014 69.718 0.000 0.965 1.021\n",
"==============================================================================\n",
"Omnibus: 3.332 Durbin-Watson: 2.180\n",
"Prob(Omnibus): 0.189 Jarque-Bera (JB): 3.325\n",
"Skew: -0.412 Prob(JB): 0.190\n",
"Kurtosis: 2.653 Cond. No. 1.59e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] The condition number is large, 1.59e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
"\"\"\""
]
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
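The notebook's workflow can be reproduced as a standalone script. This is a sketch, not the author's exact run: the fixed seed and the condensed `y ~ x * classifier` formula (which patsy expands to the same `x + classifier + x:classifier` terms) are my additions, and it assumes numpy, scipy, pandas, and statsmodels are installed.

```python
import numpy as np
import scipy.stats as stats
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)  # fixed seed (my addition) so the run is reproducible

size = 50

# Group 1: x in [0, 50); y and z both follow 2x + 1 plus N(0, 1) noise.
x1 = np.arange(size)
y1 = 2 * x1 + 1 + stats.norm.rvs(0, 1, size)
z1 = 2 * x1 + 1 + stats.norm.rvs(0, 1, size)

# Group 2: x in [100, 150); y keeps the same model, z switches to 3x - 1.
x2 = np.arange(2 * size, 3 * size)
y2 = 2 * x2 + 1 + stats.norm.rvs(0, 1, size)
z2 = 3 * x2 - 1 + stats.norm.rvs(0, 1, size)

x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
z = np.concatenate([z1, z2])
classifier = x > size  # boolean group indicator

df = pd.DataFrame({'x': x, 'y': y, 'z': z, 'classifier': classifier})

# Levene test: the y groups share a slope, so their spreads match; the
# steeper slope in z2 spreads the second z group out more.
print(stats.levene(y1, y2))
print(stats.levene(z1, z2))

# OLS with an interaction term: x * classifier expands to
# x + classifier + x:classifier, so the interaction coefficient
# measures how much the slope differs between the two groups.
fit_y = smf.ols('y ~ x * classifier', data=df).fit()
fit_z = smf.ols('z ~ x * classifier', data=df).fit()

print(fit_y.params['x:classifier[T.True]'])  # near 0: one shared slope
print(fit_z.params['x:classifier[T.True]'])  # near 1: slopes 2 vs 3
```

The interaction coefficient for z recovers the slope difference (3 − 2 = 1) between the two generating models, while for y it stays indistinguishable from zero, which is exactly the pattern in the notebook's two summaries.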