{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Lista 9 : Control 2"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Selecci\u00f3n de modelos"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Javier Sagastuy 122627"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"ITAM Oto\u00f1o 2013"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comenzamos por especificar las opciones de graficaci\u00f3n de iPython para poder desplegar las gr\u00e1ficas de manera adecuada."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# -*- coding: utf-8 -*-\n",
"%pylab inline \n",
"%pylab --no-import-all"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n",
"Using matplotlib backend: module://IPython.kernel.zmq.pylab.backend_inline\n",
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se importan todas las librer\u00edas necesarias para trabajar."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd \n",
"import numpy as np \n",
"import math\n",
"import statsmodels.formula.api as sm\n",
"import matplotlib.pyplot as plt \n",
"import scipy.stats as stats\n",
"from statsmodels.stats.anova import anova_lm\n",
"from statsmodels.graphics import regressionplots as rp\n",
"from statsmodels.stats.outliers_influence import summary_table"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se realiza la lectura de los datos."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"path = '/Users/jvrsgsty/Documents/ITAM/Semestre 7/Estadistica Aplicada 2/\\\n",
"ML-V2013/R/data/Hald.dat'\n",
"dat = pd.read_table(path, sep='\\s*')\n",
"n = len(dat)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se ajusta el modelo tomando en cuenta todos los regresores."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"model = sm.ols(formula='y ~ x1 + x2 + x3 + x4', data = dat)\n",
"fitted = model.fit()\n",
"print fitted.summary()\n",
"print \" ANOVA\"\n",
"print \"==============================================================================\"\n",
"print anova_lm(fitted)\n",
"CMres = fitted.ssr/fitted.df_resid"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: y R-squared: 0.982\n",
"Model: OLS Adj. R-squared: 0.974\n",
"Method: Least Squares F-statistic: 111.5\n",
"Date: Tue, 03 Dec 2013 Prob (F-statistic): 4.76e-07\n",
"Time: 13:58:55 Log-Likelihood: -26.918\n",
"No. Observations: 13 AIC: 63.84\n",
"Df Residuals: 8 BIC: 66.66\n",
"Df Model: 4 \n",
"==============================================================================\n",
" coef std err t P>|t| [95.0% Conf. Int.]\n",
"------------------------------------------------------------------------------\n",
"Intercept 62.4054 70.071 0.891 0.399 -99.179 223.989\n",
"x1 1.5511 0.745 2.083 0.071 -0.166 3.269\n",
"x2 0.5102 0.724 0.705 0.501 -1.159 2.179\n",
"x3 0.1019 0.755 0.135 0.896 -1.638 1.842\n",
"x4 -0.1441 0.709 -0.203 0.844 -1.779 1.491\n",
"==============================================================================\n",
"Omnibus: 0.165 Durbin-Watson: 2.053\n",
"Prob(Omnibus): 0.921 Jarque-Bera (JB): 0.320\n",
"Skew: 0.201 Prob(JB): 0.852\n",
"Kurtosis: 2.345 Cond. No. 6.06e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] The condition number is large, 6.06e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
" ANOVA\n",
"==============================================================================\n",
" df sum_sq mean_sq F PR(>F)\n",
"x1 1 1450.076328 1450.076328 242.367918 0.000000\n",
"x2 1 1207.782266 1207.782266 201.870528 0.000001\n",
"x3 1 9.793869 9.793869 1.636962 0.236600\n",
"x4 1 0.246975 0.246975 0.041280 0.844071\n",
"Residual 8 47.863639 5.982955 NaN NaN\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"/Users/jvrsgsty/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/stats/stats.py:1276: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13\n",
" int(n))\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A continueaci\u00f3n se crea un dataFrame que ser\u00e1 llenado con los estad\u00edsticos que usaremos para comparar modelos. Del mismo modo se crea una lista con todos los modelos a comparar. Se itera sobre la lista y se ajsutan los respectivos modelos. Para cada uno se calculan los estad\u00edsticos de la tabla y se a\u00f1ade el rengl\u00f3n a la tabla. Al final se muestra la tabla resultante. "
]
},
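{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the two less common statistics computed in the loop below follow their standard definitions: $C_p = \\dfrac{SC_{res}}{\\hat{\\sigma}^2} - n + 2q$, where $\\hat{\\sigma}^2$ is the residual mean square of the full model (CMres above), and $PRESS = \\sum_{i=1}^{n}\\left(\\dfrac{e_i}{1-h_{ii}}\\right)^2$, where $e_i$ are the residuals and $h_{ii}$ the diagonal entries of the hat matrix."
]
},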
{
"cell_type": "code",
"collapsed": false,
"input": [
"r = pd.DataFrame(columns=['i','p','q', 'SCRes', 'R2', 'R2aj', 'AIC','s2','Cp','PRESS'])\n",
"form = ['y~1','y~x1','y~x2','y~x3','y~x4','y~x1+x2','y~x1+x3','y~x1+x4','y~x2+x3',\\\n",
"\t\t'y~x2+x4','y~x3+x4','y~x1+x2+x3','y~x1+x2+x4','y~x2+x3+x4','y~x1+x3+x4',\\\n",
"\t\t'y~x1+x2+x3+x4']\n",
"for i in range(len(form)):\n",
"\tif len(form[i]) == 3:\n",
"\t\tq = 1\n",
"\tif len(form[i]) == 4:\n",
"\t\tq = 2\n",
"\tif len(form[i]) == 7:\n",
"\t\tq = 3\n",
"\tif len(form[i]) == 10:\n",
"\t\tq = 4\n",
"\tif len(form[i]) == 13:\n",
"\t\tq = 5\n",
"\tmodel2 = sm.ols(formula=form[i], data = dat)\n",
"\tfitted2 = model2.fit()\n",
"\tinfluence = fitted2.get_influence()\n",
"\ts2 = fitted2.ssr/fitted2.df_resid\n",
"\tCp = fitted2.ssr/CMres-n+2*q\n",
"\th = influence.hat_matrix_diag\n",
"\tPRESS = sum((np.array(fitted2.resid).reshape(-1,1)/(np.ones((n,1))-h.reshape(-1,1)))**2)\n",
"\trow = pd.DataFrame([{'i':i,'p':q-1,'q':q,'SCRes':fitted2.ssr,'R2':fitted2.rsquared,\\\n",
"\t\t'R2aj':fitted2.rsquared_adj,'AIC':fitted2.aic,'s2':s2,'Cp':Cp,'PRESS':PRESS}],)\n",
"\tr=r.append(row,ignore_index=True)\n",
"print r"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" AIC Cp PRESS R2 R2aj SCRes i \\\n",
"0 108.336827 442.916687 3187.249722 0.000000 0.000000 2715.763077 0 \n",
"1 100.411872 202.548769 1699.611598 0.533948 0.491580 1265.686749 1 \n",
"2 96.070396 142.486407 1202.086751 0.666268 0.635929 906.336344 2 \n",
"3 105.959804 315.154284 2616.363852 0.285873 0.220952 1939.400469 3 \n",
"4 95.744045 138.730833 1194.218203 0.674542 0.644955 883.866917 4 \n",
"5 62.312393 2.678242 93.882546 0.978678 0.974414 57.904483 5 \n",
"6 102.009080 198.094653 2218.118312 0.548167 0.457800 1227.072060 6 \n",
"7 65.634106 5.495851 121.224393 0.972471 0.966965 74.762112 7 \n",
"8 87.929542 62.437716 701.743183 0.847025 0.816430 415.442727 8 \n",
"9 97.521728 138.225920 1461.814208 0.680060 0.616072 868.880131 9 \n",
"10 76.744986 22.373112 294.013868 0.935290 0.922348 175.738005 10 \n",
"11 61.903597 3.041280 90.000012 0.982285 0.976380 48.110614 11 \n",
"12 61.866285 3.018233 85.351121 0.982335 0.976447 47.972729 12 \n",
"13 67.468287 7.337474 146.852692 0.972820 0.963760 73.814551 13 \n",
"14 62.619952 3.496824 94.537062 0.981281 0.975041 50.836118 14 \n",
"15 63.836690 5.000000 110.346557 0.982376 0.973563 47.863639 15 \n",
"\n",
" p q s2 \n",
"0 0 1 226.313590 \n",
"1 1 2 115.062432 \n",
"2 1 2 82.394213 \n",
"3 1 2 176.309134 \n",
"4 1 2 80.351538 \n",
"5 2 3 5.790448 \n",
"6 2 3 122.707206 \n",
"7 2 3 7.476211 \n",
"8 2 3 41.544273 \n",
"9 2 3 86.888013 \n",
"10 2 3 17.573800 \n",
"11 3 4 5.345624 \n",
"12 3 4 5.330303 \n",
"13 3 4 8.201617 \n",
"14 3 4 5.648458 \n",
"15 4 5 5.982955 \n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A continuaci\u00f3n se presentan los mejores modelos bajo distintos criterios dados por los estad\u00edsticos antes calculados. Se muestra el modelo ajustado y luego el resumen correspondiente a dicho modelo. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"\\nMejores modelos bajo el criterio: \"\n",
"print \"\\nMaxima R2:\"\n",
"k = r.R2.argmax()\n",
"f = form[k]\n",
"print f + \"\\n\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Mejores modelos bajo el criterio: \n",
"\n",
"Maxima R2:\n",
"y~x1+x2+x3+x4\n",
"\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"\\nMaxima R2aj:\"\n",
"k = r.R2aj.argmax()\n",
"f = form[k]\n",
"print f+ \"\\n\"\n",
"model3 = sm.ols(formula=f, data = dat)\n",
"fitted3 = model3.fit()\n",
"print fitted3.summary()\n",
"print \" ANOVA\"\n",
"print \"==============================================================================\"\n",
"print anova_lm(fitted3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Maxima R2aj:\n",
"y~x1+x2+x4\n",
"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: y R-squared: 0.982\n",
"Model: OLS Adj. R-squared: 0.976\n",
"Method: Least Squares F-statistic: 166.8\n",
"Date: Tue, 03 Dec 2013 Prob (F-statistic): 3.32e-08\n",
"Time: 13:58:58 Log-Likelihood: -26.933\n",
"No. Observations: 13 AIC: 61.87\n",
"Df Residuals: 9 BIC: 64.13\n",
"Df Model: 3 \n",
"==============================================================================\n",
" coef std err t P>|t| [95.0% Conf. Int.]\n",
"------------------------------------------------------------------------------\n",
"Intercept 71.6483 14.142 5.066 0.001 39.656 103.641\n",
"x1 1.4519 0.117 12.410 0.000 1.187 1.717\n",
"x2 0.4161 0.186 2.242 0.052 -0.004 0.836\n",
"x4 -0.2365 0.173 -1.365 0.205 -0.629 0.155\n",
"==============================================================================\n",
"Omnibus: 0.211 Durbin-Watson: 2.011\n",
"Prob(Omnibus): 0.900 Jarque-Bera (JB): 0.378\n",
"Skew: 0.202 Prob(JB): 0.828\n",
"Kurtosis: 2.270 Cond. No. 1.27e+03\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] The condition number is large, 1.27e+03. This might indicate that there are\n",
"strong multicollinearity or other numerical problems.\n",
" ANOVA\n",
"==============================================================================\n",
" df sum_sq mean_sq F PR(>F)\n",
"x1 1 1450.076328 1450.076328 272.043870 4.933517e-08\n",
"x2 1 1207.782266 1207.782266 226.587908 1.094167e-07\n",
"x4 1 9.931754 9.931754 1.863262 2.053954e-01\n",
"Residual 9 47.972729 5.330303 NaN NaN\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para el siguiente caso es importante resaltar que las l\u00e1minas especifican que el subconjunto de regresores que minimiza el $ECM$ maximiza la $R^2$. Como se puede notar, los resultados no coinciden con esto. P\u00e9nsandolo, esto se debe a que $ECM=\\dfrac{SC_{res}}{n-q}$ por lo que lo que se minimiza al minimizar el $ECM$ no es s\u00f3lo $SC_{res}$. Si s\u00f3lo se buscara minimizar $SC_{res}$, entonces el modelo que lo minimiza, maximiza tambi\u00e9n la $R^2$."
]
},
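{
"cell_type": "markdown",
"metadata": {},
"source": [
"In more detail: on a fixed data set $SC_{tot}$ is the same for every candidate model, so $R^2 = 1 - \\dfrac{SC_{res}}{SC_{tot}}$ is maximized exactly when $SC_{res}$ is minimized. The $ECM$, by contrast, divides $SC_{res}$ by $n-q$, which changes with the number of regressors, so the two criteria need not select the same model."
]
},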
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"\\nMinimo ECM:\"\n",
"k = r.s2.argmin()\n",
"f = form[k]\n",
"print f + \"\\n\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Minimo ECM:\n",
"y~x1+x2+x4\n",
"\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"\\nMinima Cp:\"\n",
"k = r.Cp.argmin()\n",
"f = form[k]\n",
"print f + \"\\n\"\n",
"model3 = sm.ols(formula=f, data = dat)\n",
"fitted3 = model3.fit()\n",
"print fitted3.summary()\n",
"print \" ANOVA\"\n",
"print \"==============================================================================\"\n",
"print anova_lm(fitted3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Minima Cp:\n",
"y~x1+x2\n",
"\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: y R-squared: 0.979\n",
"Model: OLS Adj. R-squared: 0.974\n",
"Method: Least Squares F-statistic: 229.5\n",
"Date: Tue, 03 Dec 2013 Prob (F-statistic): 4.41e-09\n",
"Time: 13:59:00 Log-Likelihood: -28.156\n",
"No. Observations: 13 AIC: 62.31\n",
"Df Residuals: 10 BIC: 64.01\n",
"Df Model: 2 \n",
"==============================================================================\n",
" coef std err t P>|t| [95.0% Conf. Int.]\n",
"------------------------------------------------------------------------------\n",
"Intercept 52.5773 2.286 22.998 0.000 47.483 57.671\n",
"x1 1.4683 0.121 12.105 0.000 1.198 1.739\n",
"x2 0.6623 0.046 14.442 0.000 0.560 0.764\n",
"==============================================================================\n",
"Omnibus: 1.509 Durbin-Watson: 1.922\n",
"Prob(Omnibus): 0.470 Jarque-Bera (JB): 1.104\n",
"Skew: 0.503 Prob(JB): 0.576\n",
"Kurtosis: 1.987 Cond. No. 175.\n",
"==============================================================================\n",
" ANOVA\n",
"==============================================================================\n",
" df sum_sq mean_sq F PR(>F)\n",
"x1 1 1450.076328 1450.076328 250.425571 2.088092e-08\n",
"x2 1 1207.782266 1207.782266 208.581823 5.028960e-08\n",
"Residual 10 57.904483 5.790448 NaN NaN\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"\\nMinima PRESS:\"\n",
"k = r.PRESS.argmin()\n",
"f = form[k]\n",
"print f + \"\\n\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Minima PRESS:\n",
"y~x1+x2+x4\n",
"\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lamentablemente, Python carece de una funci\u00f3n integrada como $R$ para contruir el mejor model con base en el criterio de Akaike. Dise\u00f1\u00e9 la siguiente funci\u00f3n que agrega un regresor con base en este criterio. An\u00e1logamente se podr\u00eda construir la funci\u00f3n que a partir del modelo va quitando regresores. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def aicStep(formula, inModel, names):\n",
"\tAIC = []\n",
"\tfor regressor in names:\n",
"\t\tmodel2 = sm.ols(formula=formula + ' + ' + regressor, data = dat)\n",
"\t\tfitted2 = model2.fit()\n",
"\t\tAIC.append(fitted2.aic)\n",
"\tAIC = np.array(AIC)\n",
"\tk = AIC.argmin()\n",
"\tformula += ' + ' + names[k]\n",
"\tinModel.append(names[k])\n",
"\tnames.remove(names[k])\n",
"\treturn (formula, inModel, names)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
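{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the backward analogue mentioned above (a hypothetical helper, not run in this notebook), each step would drop the regressor whose removal yields the lowest AIC:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def aicDropStep(inModel):\n",
"\t# Hypothetical backward analogue of aicStep: refit the model with\n",
"\t# each regressor removed in turn and keep the drop with lowest AIC.\n",
"\tAIC = []\n",
"\tfor regressor in inModel:\n",
"\t\tterms = [t for t in inModel if t != regressor]\n",
"\t\tf = 'y ~ 1' + ''.join([' + ' + t for t in terms])\n",
"\t\tAIC.append(sm.ols(formula=f, data=dat).fit().aic)\n",
"\tAIC = np.array(AIC)\n",
"\tinModel.remove(inModel[AIC.argmin()])\n",
"\tformula = 'y ~ 1' + ''.join([' + ' + t for t in inModel])\n",
"\treturn (formula, inModel)"
],
"language": "python",
"metadata": {},
"outputs": []
},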
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ahora inicializo la lista con los nombres de todos los regresores, la lista de los regresores actualmente en el modelo y la cadena de carateres con la f\u00f3rmula. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"names = fitted.model.data.xnames\n",
"names.remove('Intercept')\n",
"inModel = []\n",
"formula = 'y ~ 1'\n",
"print 'Start: ' + formula"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Start: y ~ 1\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se hacen ahora tres llamadas sucesivas a la funci\u00f3n antes definida para ir agregando uno a uno los regresores:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"(formula, inModel, names) = aicStep(formula, inModel, names)\n",
"print 'Step 1: ' + formula\n",
"\n",
"(formula, inModel, names) = aicStep(formula, inModel, names)\n",
"print 'Step 2: ' + formula\n",
"\n",
"(formula, inModel, names) = aicStep(formula, inModel, names)\n",
"print 'Step 3: ' + formula"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Step 1: y ~ 1 + x4\n",
"Step 2: y ~ 1 + x4 + x1\n",
"Step 3: y ~ 1 + x4 + x1 + x2\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare y discuta los modelos ajustados. \u00bfCu\u00e1l recomendar\u00eda?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Como pudimos ver, de los criterios anteriormente utilizados resultan tres modelos candidatos. El primero y menos factible es el modelo completo que maximiza la $R^2$. Dado que lo que se busca es seleccionar los regresores m\u00e1s significativos, este modelo ni lo tomamos en consideraci\u00f3n.\n",
"\n",
"Todos los dem\u00e1s criterios ($R^2$ ajustada, $ECM$, $AIC$ y $PRESS$) terminaron con un modelo que incluye a todos los regresores salvo a $x_3$. El \u00fanico modelo diferente a estos es que elige el criterio de m\u00ednima $C_p$ de Mallows. \n",
"\n",
"Analizando ambos modelos ajustados, a mi me parece que el mejor modelo es el que s\u00f3lo toma en cuenta los primeros dos regresores (i.e. $y \\sim x_1 + x_2$). Mi conclusi\u00f3n de debe a que si bien el modelo que tiene tres regresores tiene una $R^2$ mayor, la prueba de significancia de la regresi\u00f3n tiene un valor-p ligeramente mayor. Del mismo modo, el valor-p del estad\u00edstico $t$ para la prueba de hip\u00f3tesis $H_0:\\beta_4=0$ en el mdoelo con tres variables es relativamente alto. Al eliminar $x_4$ y quedarnos con el modelo con dos variables vemos una mejora en el estad\u00edstico $F$, un ligero decremento en la $R^2$ (evidentemente) y un ligero incremento en la $SC_{res}$. Sin embargo, los valores-p de significancia de los regresores $x_1$ y $x_2$ son mucho menores en este caso.\n",
"\n",
"Por estas razones es que me atrevo a decir que los componentes $x_3$ y $x_4$ del cemento no contribuyen de manera significativa en qu\u00e9 tanto se calienta la mezcla de cemento. "
]
},
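{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a supplement (a sketch, not executed in the original notebook), the comparison between the two candidate models can be made formal with a nested-model $F$ test via anova_lm, which tests $H_0:\\beta_4=0$ directly:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hypothetical check: partial F test comparing the two candidate models.\n",
"fit_small = sm.ols(formula='y ~ x1 + x2', data=dat).fit()\n",
"fit_big = sm.ols(formula='y ~ x1 + x2 + x4', data=dat).fit()\n",
"print anova_lm(fit_small, fit_big)"
],
"language": "python",
"metadata": {},
"outputs": []
},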
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}