@josef-pkt
Last active August 24, 2023 20:53
Poisson with endogeneity, GMM with IV versus MLE in python
{
"metadata": {
"name": "example_gmm_poisson_2"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Poisson: Maximum Likelihood and Generalized Method of Moments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following we estimate a Poisson model with MLE and with GMM.\n",
"\n",
"This is mainly a test example for GMM estimation with statsmodels; I did not add much explanation of the problem itself."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Setup and Data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\"\"\"\n",
"\n",
"Created on Tue Oct 08 09:27:31 2013\n",
"\n",
"Author: Josef Perktold\n",
"\"\"\"\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from statsmodels.sandbox.regression import gmm"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dta = pd.read_csv('docvisits.csv')\n",
"\n",
"print(dta.iloc[:5])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" docvis age income female black hispanic married physlim private \\\n",
"0 0 3.9 30.000000 1 0 1 1 0 1 \n",
"1 1 4.7 10.000000 0 0 1 1 0 1 \n",
"2 15 2.7 27.000000 1 0 0 0 0 1 \n",
"3 0 3.0 11.250000 0 0 0 1 0 0 \n",
"4 2 5.4 76.330002 1 0 0 1 0 1 \n",
"\n",
" chronic \n",
"0 0 \n",
"1 0 \n",
"2 1 \n",
"3 0 \n",
"4 0 \n"
]
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dta_clean = dta\n",
"dta_clean['const'] = 1\n",
"dta_clean['income'] *= 0.1 # rescale to have values into similar ranges\n",
"\n",
"endog_df = dta_clean[['docvis']]\n",
"exog_names = ['private', 'chronic', 'female', 'income', 'const']\n",
"exog_df = dta_clean[exog_names]\n",
"instrument_df = dta_clean[['private', 'chronic', 'female', 'age', 'black', \n",
" 'hispanic', 'const']]\n",
"\n",
"endog, exog, instrument = map(np.asarray, [endog_df, exog_df, instrument_df])\n",
"endog = np.squeeze(endog)\n",
"# TODO: no error checking in GMM, get_error breaks if endog is 2d"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following defines the mean function that enters the score equation of the log-likelihood of the Poisson model; we only need the $y = \\exp(x b)$ part. Our full moment condition is that the following expectation is zero:\n",
"\n",
"$ \\hspace{10pt} E \\[z \\hspace{3pt} (y - \\exp(x \\hspace{3pt} b))\\] = 0$\n",
"\n",
"**Aside:** \n",
"While working on this example in the notebook, I had problems with what seems to be a runaway `fmin_bfgs`. \n",
"I didn't have this problem when working on the original problem in Spyder, but I haven't checked whether the Python environment is different.\n",
"\n",
"I have been working on similar examples where the optimization is sensitive to the scaling of the variables. Since we take the exponential of the data multiplied by the parameters, this can easily result in numerical problems or overflow if the optimization algorithm tries to evaluate large parameter values. Additionally, the Jacobian, that is, the first derivative of the objective function, can become near singular with a large condition number. This also makes it difficult for nonlinear optimization algorithms to converge."
]
},
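{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative sketch, the sample analogue of this moment condition stacks the instruments against the exponential-mean residual; `sample_moments` here is a hypothetical helper for inspection only, not a statsmodels function:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def sample_moments(params, endog, exog, instrument):\n",
"    # sample analogue of E[z (y - exp(x b))] = 0:\n",
"    # average of instrument times residual; should be near zero at the solution\n",
"    resid = endog - np.exp(np.dot(exog, params))\n",
"    return np.dot(instrument.T, resid) / len(endog)"
],
"language": "python",
"metadata": {},
"outputs": []
},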
{
"cell_type": "code",
"collapsed": false,
"input": [
"def moment_exponential(params, exog, exp=True):\n",
" \n",
" # moment condition without instrument\n",
" if exp:\n",
" predicted = np.exp(np.dot(exog, params))\n",
" else:\n",
" predicted = np.dot(exog, params)\n",
" \n",
" #try to avoid runaway optimizers that cannot handle nans\n",
" if np.isnan(predicted).any():\n",
" raise ValueError('nans in predicted')\n",
" return predicted\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 26
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Maximum Likelihood Estimation (MLE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Statsmodels has had discrete models for a long time (well, on our time scale). \n",
"Poisson is a standard maximum likelihood estimator that assumes that observations are independent, \n",
"and that explanatory variables are exogenous (i.e. without measurement or endogeneity error)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from statsmodels.discrete.discrete_model import Poisson\n",
"res_poisson = Poisson(endog, exog).fit()\n",
"print(res_poisson.summary(yname='docvisits', xname=exog_names))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 4.193914\n",
" Iterations 12\n",
" Poisson Regression Results \n",
"==============================================================================\n",
"Dep. Variable: docvisits No. Observations: 4412\n",
"Model: Poisson Df Residuals: 4407\n",
"Method: MLE Df Model: 4\n",
"Date: Wed, 25 Dec 2013 Pseudo R-squ.: 0.1930\n",
"Time: 14:50:25 Log-Likelihood: -18504.\n",
"converged: True LL-Null: -22930.\n",
" LLR p-value: 0.000\n",
"==============================================================================\n",
" coef std err z P>|z| [95.0% Conf. Int.]\n",
"------------------------------------------------------------------------------\n",
"private 0.7987 0.028 28.813 0.000 0.744 0.853\n",
"chronic 1.0919 0.016 69.112 0.000 1.061 1.123\n",
"female 0.4925 0.016 30.770 0.000 0.461 0.524\n",
"income 0.0356 0.002 14.750 0.000 0.031 0.040\n",
"const -0.2297 0.029 -8.004 0.000 -0.286 -0.173\n",
"=============================================================================="
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can estimate the equivalent model with GMM using exog as instruments.\n",
"\n",
"**Aside:** I'm using scipy's `fmin` (Nelder-Mead), an optimizer that is more robust to numerical problems, \n",
"to get roughly the correct estimate. Then I use `fmin_bfgs` to get better local convergence.\n",
"\n",
"Running `bfgs` directly has problems in my current Python setup."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mod_ex = gmm.NonlinearIVGMM(endog, exog, exog, moment_exponential)\n",
"w0exinv = np.dot(exog.T, exog) / len(endog)\n",
"start_params = 0.1 + np.zeros(exog.shape[1])\n",
"res_ex0 = mod_ex.fit(start_params, maxiter=0, inv_weights=w0exinv, optim_method='nm')\n",
"res_ex = mod_ex.fit(res_ex0.params, maxiter=0, inv_weights=w0exinv, optim_method='bfgs')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.000000\n",
" Iterations: 310\n",
" Function evaluations: 502\n",
"Optimization terminated successfully.\n",
" Current function value: 0.000000\n",
" Iterations: 6\n",
" Function evaluations: 10\n",
" Gradient evaluations: 10\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res_ex.summary(yname='docvisits', xname=exog_names))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" NonlinearIVGMM Results \n",
"==============================================================================\n",
"Dep. Variable: docvisits Hansen J: 2.193e-10\n",
"Model: NonlinearIVGMM Prob (Hansen J): nan\n",
"Method: GMM \n",
"Date: Wed, 25 Dec 2013 \n",
"Time: 14:50:25 \n",
"No. Observations: 4412 \n",
"==============================================================================\n",
" coef std err z P>|z| [95.0% Conf. Int.]\n",
"------------------------------------------------------------------------------\n",
"private 0.7987 0.109 7.328 0.000 0.585 1.012\n",
"chronic 1.0919 0.056 19.501 0.000 0.982 1.202\n",
"female 0.4925 0.059 8.415 0.000 0.378 0.607\n",
"income 0.0356 0.011 3.286 0.001 0.014 0.057\n",
"const -0.2297 0.111 -2.072 0.038 -0.447 -0.012\n",
"==============================================================================\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res_poisson.params)\n",
"print(res_ex.params)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[ 0.79866538 1.09186511 0.49254807 0.03557013 -0.22972634]\n",
"[ 0.79866522 1.09186505 0.49254798 0.03557012 -0.22972605]\n"
]
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Difference in parameters between the MLE and the one-step GMM version:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res_ex.params - res_poisson.params)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[ -1.60226566e-07 -5.64773353e-08 -9.39717644e-08 -4.27707619e-09\n",
" 2.90754947e-07]\n"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the standard errors of the estimates are not the same:\n",
"GMM reports heteroscedasticity-robust standard errors."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res_poisson.bse)\n",
"print(res_ex.bse)\n",
"print(res_ex.bse - res_poisson.bse)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[ 0.02771898 0.01579852 0.01600728 0.0024115 0.02870217]\n",
"[ 0.10898908 0.05598878 0.05852984 0.01082366 0.11086067]\n",
"[ 0.08127009 0.04019027 0.04252255 0.00841216 0.0821585 ]\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Homoscedasticity, Heteroscedasticity and Overdispersion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the Poisson model the variance is equal to the mean. The maximum likelihood estimator in `discrete.Poisson` imposes this assumption. In contrast, GMM with heteroscedasticity-robust standard errors does not impose any assumption on the variance of the error term. In between these two extremes are overdispersed Poisson or exponential models, which assume that the variance is proportional to the mean as in the Poisson model but allow a scale factor larger than one.\n",
"\n",
"The assumption on the variance does not affect the parameter estimates in this case, but it does affect the estimated covariance of the parameter estimates. We can consistently estimate the mean or conditional expectation of the endogenous variable; other properties of the conditional distribution of the endogenous variable, however, depend on correctly specified higher moments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scale, that is, the variance scale factor, is fixed at one in the MLE of the Poisson model."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"res_poisson.scale"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 52,
"text": [
"1.0"
]
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can estimate the scale and standard deviation from the residuals of the Poisson MLE. In this case the estimated standard deviation is about 3.6, which indicates considerable overdispersion."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fittedvalues = res_poisson.predict(linear=False) #attribute is currently linear prediction :(\n",
"resid_pearson = res_poisson.resid / np.sqrt(fittedvalues)\n",
"scale = (resid_pearson**2).mean()\n",
"scale, np.sqrt(scale), resid_pearson.std()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 53,
"text": [
"(12.947923548090147, 3.5983223240963484, 3.5983077409797635)"
]
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the linear model, we can replicate OLS results with GMM if we specify `'iid'` as the weights method. This assumes that the variance of the error term is uncorrelated with, or independent of, the explanatory variables, but it also assumes that this variance is constant. The latter is not the case for the Poisson/exponential model, so this does not replicate the MLE standard errors."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"res_ex_iid = mod_ex.fit(res_ex.params, maxiter=0, inv_weights=w0exinv,\n",
" weights_method='iid',\n",
" optim_method='bfgs', optim_args={'disp':False})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 54
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res_ex_iid.params)\n",
"print(res_ex_iid.bse)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[ 0.79866522 1.09186505 0.49254798 0.03557012 -0.22972605]\n",
"[ 0.15455798 0.06488262 0.06341369 0.00791076 0.17186891]\n"
]
}
],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the parameter estimates are the same as before, the standard errors of the parameter estimates are larger than the heteroscedasticity-robust GMM standard errors in this example. This would correspond to a nonlinear model $y = \\exp(x \\hspace{3pt} \\beta) + u \\hspace{3pt}$ with homoscedastic error $u$."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Endogeneity and Instrumental Variables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are using income as an explanatory variable in `exog`, which might have \"measurement errors\".\n",
"\n",
"If income is correlated with the residual, then the Poisson maximum likelihood estimator \n",
"and GMM without instruments will produce biased estimates.\n",
"\n",
"We can use instrumental variables in GMM to remove the \"endogeneity\" problem. We include all explanatory variables except income as instruments, and additionally add variables for age and race."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mod = gmm.NonlinearIVGMM(endog, exog, instrument, moment_exponential)\n",
"w0inv = np.dot(instrument.T, instrument) / len(endog)\n",
"start_params = 0.1 + np.zeros(exog.shape[1])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 33
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"\n",
"res0 = mod.fit(start_params, maxiter=0, inv_weights=w0inv, optim_method='nm')\n",
"res = mod.fit(res0.params, maxiter=2, inv_weights=w0inv, optim_method='bfgs') "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.126724\n",
" Iterations: 433\n",
" Function evaluations: 694\n",
"Optimization terminated successfully.\n",
" Current function value: 0.126724\n",
" Iterations: 7\n",
" Function evaluations: 9\n",
" Gradient evaluations: 9\n",
"Optimization terminated successfully."
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
" Current function value: 0.002160\n",
" Iterations: 22\n",
" Function evaluations: 24\n",
" Gradient evaluations: 24\n"
]
}
],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res.params)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[ 0.53532318 1.09015077 0.66367986 0.1428607 -0.59840238]\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(res.summary(yname='docvisits', xname=exog_names))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" NonlinearIVGMM Results \n",
"==============================================================================\n",
"Dep. Variable: docvisits Hansen J: 9.529\n",
"Model: NonlinearIVGMM Prob (Hansen J): 0.00853\n",
"Method: GMM \n",
"Date: Wed, 25 Dec 2013 \n",
"Time: 14:50:26 \n",
"No. Observations: 4412 \n",
"==============================================================================\n",
" coef std err z P>|z| [95.0% Conf. Int.]\n",
"------------------------------------------------------------------------------\n",
"private 0.5353 0.160 3.348 0.001 0.222 0.849\n",
"chronic 1.0902 0.062 17.650 0.000 0.969 1.211\n",
"female 0.6637 0.096 6.915 0.000 0.476 0.852\n",
"income 0.1429 0.027 5.261 0.000 0.090 0.196\n",
"const -0.5984 0.138 -4.323 0.000 -0.870 -0.327\n",
"==============================================================================\n"
]
}
],
"prompt_number": 36
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of income is now `0.1429`, which is much larger than the effect estimated using MLE, `0.0356`, or `0.0143` versus `0.0036` in the original income scale.\n",
"\n",
"Based on a rough check against Stata output, my results agree to about 4 or 5 decimal places."
]
}
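,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative check, we can rescale the income coefficients back to the original income scale, since income was multiplied by 0.1 during setup (income is at position 3 in `exog_names`):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# income was rescaled by 0.1 above, so multiply the coefficients by 0.1\n",
"# to recover the effect in the original income scale\n",
"print(res.params[3] * 0.1)          # IV-GMM estimate, about 0.0143\n",
"print(res_poisson.params[3] * 0.1)  # Poisson MLE estimate, about 0.0036"
],
"language": "python",
"metadata": {},
"outputs": []
}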
],
"metadata": {}
}
]
}