@josef-pkt
Created June 14, 2021 23:40
discretized count model
{
"cells": [
{
"cell_type": "markdown",
"id": "rural-accommodation",
"metadata": {},
"source": [
"# Discretized distribution and count models\n",
"\n",
"Count distributions can be created by discretizing a continuous distribution. Two new classes are added to statsmodels: the distribution class `DiscretizedCount`, which subclasses `rv_discrete` from scipy.stats, and the model class `_DiscretizedModel`, which estimates the parameters of the discretized distribution by maximum likelihood. The model class is a subclass of `GenericLikelihoodModel`.\n",
"\n",
"The current version implements the basic discretized distribution; additional options and features will be added in future versions of statsmodels. One limitation compared to other count models is that it does not support regression, i.e. distribution parameters cannot depend on explanatory variables.\n",
"\n",
"The current implementation accepts any continuous distribution from scipy.stats with support on the non-negative or positive real line. The support of the discretized distribution is the set of non-negative integers 0, 1, 2, ...\n",
"\n",
"The probability that the count $k$ is realized is defined by the difference of the survival function evaluated at consecutive integers:\n",
"\n",
"$Prob(y = k) = sf(k, \\theta) - sf(k + 1, \\theta)$\n",
"\n",
"where $\\theta$ collects the distribution parameters, i.e. the shape and scale parameters of the underlying continuous distribution. The location is assumed to be fixed at `loc=0` in the scipy distributions. \n",
"Discrete distributions in scipy do not have a scale parameter; in our implementation the last shape parameter of the discretized distribution is the scale of the underlying continuous distribution.\n",
"\n",
"The implementation was based mainly on the following two articles:\n",
"\n",
"Chakraborty, Subrata, and Dhrubajyoti Chakravarty. 2012. “Discrete Gamma Distributions: Properties and Parameter Estimations.” Communications in Statistics - Theory and Methods 41 (18): 3301–3324.\n",
"\n",
"Alzaatreh, Ayman, Carl Lee, and Felix Famoye. 2012. “On the Discrete Analogues of Continuous Distributions.” Statistical Methodology 9 (6): 589–603. https://doi.org/10.1016/j.stamet.2012.03.003.\n",
"\n",
"However, many new discrete distributions have been developed in recent years based on the approach that we use here.\n",
"\n",
"Note: Several articles use a reparameterized version of the underlying distribution. In statsmodels, we reuse the parameterization provided by the scipy distributions."
]
},
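{
"cell_type": "markdown",
"id": "added-discretization-check",
"metadata": {},
"source": [
"As a quick check of the defining formula, the following minimal sketch (with an arbitrarily chosen scale) discretizes `scipy.stats.expon` by differencing its survival function. For the exponential, this construction reproduces a geometric distribution with $q = \\exp(-1/s)$, where $s$ is the scale.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# discretize the exponential by differencing its survival function\n",
"k = np.arange(5)\n",
"scale = 1.5  # arbitrary illustrative value\n",
"pmf = stats.expon.sf(k, scale=scale) - stats.expon.sf(k + 1, scale=scale)\n",
"\n",
"# for the exponential this equals a geometric pmf with q = exp(-1/scale)\n",
"q = np.exp(-1 / scale)\n",
"pmf_geom = q**k * (1 - q)\n",
"```"
]
},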
{
"cell_type": "code",
"execution_count": 1,
"id": "connected-smile",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"from statsmodels.distributions.discrete import (\n",
" DiscretizedCount, _DiscretizedModel)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "killing-salvation",
"metadata": {},
"outputs": [],
"source": [
"# expand frequencies to individual observations (no freq_weights support yet)\n",
"freq = np.array([46, 76, 24, 9, 1])\n",
"y = np.repeat(np.arange(5), freq)\n",
"nobs = freq.sum()"
]
},
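{
"cell_type": "markdown",
"id": "added-frequency-check",
"metadata": {},
"source": [
"A small sanity check of the expansion above: `np.bincount` applied to the expanded sample recovers the original frequency table.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# expanding frequencies and counting back should round-trip\n",
"freq = np.array([46, 76, 24, 9, 1])\n",
"y = np.repeat(np.arange(5), freq)\n",
"counts = np.bincount(y)\n",
"```"
]
},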
{
"cell_type": "code",
"execution_count": 3,
"id": "recreational-croatia",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 1.201728\n",
" Iterations: 55\n",
" Function evaluations: 103\n"
]
}
],
"source": [
"dp = DiscretizedCount(stats.gamma)\n",
"mod = _DiscretizedModel(y, distr=dp)\n",
"res = mod.fit(start_params=[1, 1])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "understanding-operator",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" _DiscretizedModel Results \n",
"==============================================================================\n",
"Dep. Variable: y Log-Likelihood: -187.47\n",
"Model: _DiscretizedModel AIC: 378.9\n",
"Method: Maximum Likelihood BIC: 385.0\n",
"Date: Mon, 14 Jun 2021 \n",
"Time: 19:37:20 \n",
"No. Observations: 156 \n",
"Df Residuals: 154 \n",
"Df Model: 2 \n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"par0 3.5264 0.517 6.816 0.000 2.512 4.540\n",
"par1 0.4256 0.064 6.681 0.000 0.301 0.550\n",
"==============================================================================\n"
]
}
],
"source": [
"print(res.summary())"
]
},
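{
"cell_type": "markdown",
"id": "added-gamma-reproduction",
"metadata": {},
"source": [
"The fitted probabilities can be reproduced directly in scipy by plugging the estimates into the survival-function difference, assuming (as described above) that `par0` is the gamma shape and `par1` the scale. The parameter values below are copied from the summary table, so the reproduced counts are only accurate to rounding.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# estimates copied from the summary above: par0 = shape, par1 = scale\n",
"shape, scale = 3.5264, 0.4256\n",
"k = np.arange(5)\n",
"probs = stats.gamma.sf(k, shape, scale=scale) - stats.gamma.sf(k + 1, shape, scale=scale)\n",
"expected_counts = np.round(probs * 156, 2)\n",
"```"
]
},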
{
"cell_type": "code",
"execution_count": 5,
"id": "functional-cancellation",
"metadata": {},
"outputs": [],
"source": [
"probs = res.predict(which=\"probs\", k_max=5)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cardiac-injection",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0. , 46. , 46.48],\n",
" [ 1. , 76. , 73.72],\n",
" [ 2. , 24. , 27.88],\n",
" [ 3. , 9. , 6.5 ],\n",
" [ 4. , 1. , 1.2 ]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table = np.column_stack((np.arange(5), freq, np.round(probs * nobs, 2)))\n",
"columns = [\"count\", \"observed\", \"gamma\"]\n",
"table"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "amazing-trace",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.9935897435897436]\n",
"Optimization terminated successfully.\n",
" Current function value: 1.381841\n",
" Iterations: 4\n",
" Function evaluations: 5\n",
" Gradient evaluations: 5\n",
" Hessian evaluations: 4\n",
"[8649968300695.803, 8594519002899.616]\n",
"Optimization terminated successfully.\n",
" Current function value: 1.387747\n",
" Iterations: 10\n",
" Function evaluations: 11\n",
" Gradient evaluations: 11\n",
" Hessian evaluations: 10\n",
"[1.0, 1.05, 1.0]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"...\\python-3.9.2.amd64\\lib\\site-packages\\scipy\\stats\\_distn_infrastructure.py:2494: RuntimeWarning: invalid value encountered in double_scalars\n",
" Lhat = muhat - Shat*mu\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 1.203549\n",
" Iterations: 10\n",
" Function evaluations: 11\n",
" Gradient evaluations: 11\n",
" Hessian evaluations: 10\n"
]
}
],
"source": [
"probs_all = []\n",
"res_all = []\n",
"\n",
"for distr in [stats.expon, stats.lomax, stats.burr12]:\n",
" dp = DiscretizedCount(distr)\n",
" # continuous MLE as a reference; drop the fixed loc from the parameter tuple\n",
" sp = list(distr.fit(y, floc=0))\n",
" del sp[-2]\n",
" print(sp)\n",
" mod = _DiscretizedModel(y, distr=dp)\n",
" # It is difficult to find starting parameters and optimizers that work in many cases,\n",
" # especially when a model is not appropriate for a specific dataset.\n",
" # Here we use start_params that work for these cases, found by trial and error.\n",
" #res = mod.fit(method=\"nm\", start_params=0.5 * np.ones(mod.nparams))\n",
" res = mod.fit(method=\"minimize\", min_method=\"dogleg\", start_params=1.1 * np.ones(mod.nparams))\n",
" probs = res.predict(which=\"probs\", k_max=5)\n",
" probs_all.append(np.round(probs * nobs, 2))\n",
" columns.append(distr.name)\n",
" res_all.append(res)\n",
"\n",
"table = np.column_stack([table] + probs_all)"
]
},
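{
"cell_type": "markdown",
"id": "added-geometric-mle",
"metadata": {},
"source": [
"For the discretized exponential, the fit above can be cross-checked in closed form: differencing the exponential survival function yields a geometric distribution, whose maximum likelihood estimate is available analytically as $\\hat{q} = \\bar{y} / (1 + \\bar{y})$. The resulting negative log-likelihood per observation should match the optimizer's reported function value of about 1.3818.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# closed-form MLE for the discretized exponential (a geometric distribution)\n",
"freq = np.array([46, 76, 24, 9, 1])\n",
"y = np.repeat(np.arange(5), freq)\n",
"m = y.mean()\n",
"q_hat = m / (1 + m)\n",
"# negative log-likelihood per observation at the closed-form MLE\n",
"negllf = -(m * np.log(q_hat) + np.log(1 - q_hat))\n",
"```"
]
},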
{
"cell_type": "code",
"execution_count": 8,
"id": "three-kenya",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0. , 46. , 46.48, 78.25, 78.5 , 46.79],\n",
" [ 1. , 76. , 73.72, 39. , 38.63, 73.12],\n",
" [ 2. , 24. , 27.88, 19.44, 19.19, 28.34],\n",
" [ 3. , 9. , 6.5 , 9.69, 9.62, 6.25],\n",
" [ 4. , 1. , 1.2 , 4.83, 4.87, 1.19]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ultimate-feedback",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" <th>observed</th>\n",
" <th>gamma</th>\n",
" <th>expon</th>\n",
" <th>lomax</th>\n",
" <th>burr12</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>46.0</td>\n",
" <td>46.48</td>\n",
" <td>78.25</td>\n",
" <td>78.50</td>\n",
" <td>46.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>76.0</td>\n",
" <td>73.72</td>\n",
" <td>39.00</td>\n",
" <td>38.63</td>\n",
" <td>73.12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" <td>24.0</td>\n",
" <td>27.88</td>\n",
" <td>19.44</td>\n",
" <td>19.19</td>\n",
" <td>28.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" <td>9.0</td>\n",
" <td>6.50</td>\n",
" <td>9.69</td>\n",
" <td>9.62</td>\n",
" <td>6.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" <td>1.20</td>\n",
" <td>4.83</td>\n",
" <td>4.87</td>\n",
" <td>1.19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count observed gamma expon lomax burr12\n",
"0 0.0 46.0 46.48 78.25 78.50 46.79\n",
"1 1.0 76.0 73.72 39.00 38.63 73.12\n",
"2 2.0 24.0 27.88 19.44 19.19 28.34\n",
"3 3.0 9.0 6.50 9.69 9.62 6.25\n",
"4 4.0 1.0 1.20 4.83 4.87 1.19"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.DataFrame(table, columns=columns)"
]
},
{
"cell_type": "markdown",
"id": "utility-diabetes",
"metadata": {},
"source": [
"Based on predicted counts, discrete gamma and discrete burr12 provide a close match to the observed counts. Discrete gamma has a slightly lower AIC: both fit about equally well in terms of log-likelihood, but gamma has one parameter fewer than burr12.\n",
"Also note that in burr12 we cannot reliably identify two of the parameters. In flexible distributions it often happens that parameters are closely related and can match similar features of the distribution; in those cases the predictive distribution may be well identified even though the individual parameters are not.\n",
"\n",
"The distribution in this dataset is underdispersed relative to Poisson. Both exponential and lomax have heavier tails and cannot provide a good fit to this underdispersed distribution.\n",
"\n",
"In the following we print the summary tables for all three distributions. \n",
"The discretized exponential has only a scale parameter, lomax has one additional shape parameter, and burr12 has two additional shape parameters."
]
},
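{
"cell_type": "markdown",
"id": "added-aic-by-hand",
"metadata": {},
"source": [
"The AIC comparison can be reproduced by hand from the reported log-likelihoods, using $\\text{AIC} = 2k - 2\\,llf$ and $\\text{BIC} = k \\log(n) - 2\\,llf$. The values below are copied from the summary tables above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# log-likelihoods and parameter counts copied from the summaries above\n",
"nobs = 156\n",
"llf_gamma, k_gamma = -187.47, 2\n",
"llf_burr12, k_burr12 = -187.75, 3\n",
"aic_gamma = 2 * k_gamma - 2 * llf_gamma\n",
"aic_burr12 = 2 * k_burr12 - 2 * llf_burr12\n",
"bic_gamma = k_gamma * np.log(nobs) - 2 * llf_gamma\n",
"```"
]
},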
{
"cell_type": "code",
"execution_count": 10,
"id": "hourly-italic",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"expon:\n",
"\n",
"Optimization terminated successfully.\n",
" Current function value: 1.203549\n",
" Iterations: 10\n",
" Function evaluations: 11\n",
" Gradient evaluations: 11\n",
" Hessian evaluations: 10\n",
" _DiscretizedModel Results \n",
"==============================================================================\n",
"Dep. Variable: y Log-Likelihood: -187.75\n",
"Model: _DiscretizedModel AIC: 381.5\n",
"Method: Maximum Likelihood BIC: 390.7\n",
"Date: Mon, 14 Jun 2021 \n",
"Time: 19:37:21 \n",
"No. Observations: 156 \n",
"Df Residuals: 153 \n",
"Df Model: 3 \n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"par0 2.2165 0.417 5.317 0.000 1.400 3.034\n",
"par1 4.6059 5.696 0.809 0.419 -6.558 15.770\n",
"par2 3.1166 2.358 1.322 0.186 -1.505 7.739\n",
"==============================================================================\n",
"\n",
"\n",
"lomax:\n",
"\n",
"Optimization terminated successfully.\n",
" Current function value: 1.203549\n",
" Iterations: 10\n",
" Function evaluations: 11\n",
" Gradient evaluations: 11\n",
" Hessian evaluations: 10\n",
" _DiscretizedModel Results \n",
"==============================================================================\n",
"Dep. Variable: y Log-Likelihood: -187.75\n",
"Model: _DiscretizedModel AIC: 381.5\n",
"Method: Maximum Likelihood BIC: 390.7\n",
"Date: Mon, 14 Jun 2021 \n",
"Time: 19:37:21 \n",
"No. Observations: 156 \n",
"Df Residuals: 153 \n",
"Df Model: 3 \n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"par0 2.2165 0.417 5.317 0.000 1.400 3.034\n",
"par1 4.6059 5.696 0.809 0.419 -6.558 15.770\n",
"par2 3.1166 2.358 1.322 0.186 -1.505 7.739\n",
"==============================================================================\n",
"\n",
"\n",
"burr12:\n",
"\n",
"Optimization terminated successfully.\n",
" Current function value: 1.203549\n",
" Iterations: 10\n",
" Function evaluations: 11\n",
" Gradient evaluations: 11\n",
" Hessian evaluations: 10\n",
" _DiscretizedModel Results \n",
"==============================================================================\n",
"Dep. Variable: y Log-Likelihood: -187.75\n",
"Model: _DiscretizedModel AIC: 381.5\n",
"Method: Maximum Likelihood BIC: 390.7\n",
"Date: Mon, 14 Jun 2021 \n",
"Time: 19:37:21 \n",
"No. Observations: 156 \n",
"Df Residuals: 153 \n",
"Df Model: 3 \n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"par0 2.2165 0.417 5.317 0.000 1.400 3.034\n",
"par1 4.6059 5.696 0.809 0.419 -6.558 15.770\n",
"par2 3.1166 2.358 1.322 0.186 -1.505 7.739\n",
"==============================================================================\n"
]
}
],
"source": [
"# print the summary for each of the results estimated above\n",
"for res in res_all:\n",
" print(f\"\\n\\n{res.model.distr.distr.name}:\\n\")\n",
" print(res.summary())"
]
},
{
"cell_type": "markdown",
"id": "metallic-stretch",
"metadata": {},
"source": [
"## A closer look at burr12 estimates"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "written-tumor",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 1.203549\n",
" Iterations: 0\n",
" Function evaluations: 1\n",
" Gradient evaluations: 1\n",
" Hessian evaluations: 1\n",
"\n",
" burr12\n",
" _DiscretizedModel Results \n",
"==============================================================================\n",
"Dep. Variable: y Log-Likelihood: -187.75\n",
"Model: _DiscretizedModel AIC: 381.5\n",
"Method: Maximum Likelihood BIC: 390.7\n",
"Date: Mon, 14 Jun 2021 \n",
"Time: 19:37:21 \n",
"No. Observations: 156 \n",
"Df Residuals: 153 \n",
"Df Model: 3 \n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"par0 2.2164 0.417 5.313 0.000 1.399 3.034\n",
"par1 4.6081 5.707 0.807 0.419 -6.578 15.794\n",
"par2 3.1175 2.362 1.320 0.187 -1.513 7.748\n",
"==============================================================================\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"...\\python-3.9.2.amd64\\lib\\site-packages\\scipy\\stats\\_distn_infrastructure.py:2494: RuntimeWarning: invalid value encountered in double_scalars\n",
" Lhat = muhat - Shat*mu\n"
]
}
],
"source": [
"dp = DiscretizedCount(stats.burr12)\n",
"sp = stats.burr12.fit(y, floc=0)  # continuous MLE, shown for reference only\n",
"mod = _DiscretizedModel(y, distr=dp)\n",
"# restart from the burr12 estimates obtained above\n",
"res = mod.fit(method=\"minimize\", min_method=\"dogleg\", start_params=res_all[-1].params)\n",
"print(\"\\n\", stats.burr12.name)\n",
"print(res.summary())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "convertible-stadium",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "annual-broad",
"metadata": {},
"source": [
"The estimation results for the discrete burr12 model show that neither par1 nor par2 is statistically significantly different from zero in the individual hypothesis tests. \n",
"However, a Wald test of the joint hypothesis that both are zero strongly rejects it."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "empirical-reconstruction",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'statsmodels.stats.contrast.ContrastResults'>\n",
"<Wald test (chi2): statistic=[[41.76881892]], p-value=8.511698133721117e-10, df_denom=2>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res.wald_test(\"par1, par2\")"
]
},
{
"cell_type": "markdown",
"id": "labeled-officer",
"metadata": {},
"source": [
"The covariance and correlation of the parameter estimates show that the estimates are strongly correlated, which makes it difficult to identify each parameter individually.\n",
"This need not affect how well the estimated distribution matches the data."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "foreign-blackberry",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.17402593, -2.17362355, -0.91808006],\n",
" [-2.17362355, 32.57393105, 13.43900532],\n",
" [-0.91808006, 13.43900532, 5.580505 ]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res.cov_params()"
]
},
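{
"cell_type": "markdown",
"id": "added-wald-by-hand",
"metadata": {},
"source": [
"Using the covariance block just printed, the joint Wald statistic reported above can be reproduced by hand as $W = r' V^{-1} r$, where $r$ holds the two estimates and $V$ their covariance block; under the null, $W$ is chi-squared with 2 degrees of freedom, so the p-value is $\\exp(-W/2)$. The numbers below are copied from the burr12 summary and the `cov_params` output.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# estimates (par1, par2) copied from the burr12 summary; covariance block\n",
"# copied from the cov_params output above\n",
"r = np.array([4.6081, 3.1175])\n",
"V = np.array([[32.57393105, 13.43900532],\n",
"              [13.43900532, 5.580505]])\n",
"wald = r @ np.linalg.solve(V, r)\n",
"pvalue = np.exp(-wald / 2)  # chi2 survival function with 2 df\n",
"```"
]
},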
{
"cell_type": "code",
"execution_count": 14,
"id": "strange-constitutional",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1. , -0.91294005, -0.93161575],\n",
" [-0.91294005, 1. , 0.99677076],\n",
" [-0.93161575, 0.99677076, 1. ]])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res.cov_params() / np.outer(res.bse, res.bse)"
]
},
{
"cell_type": "markdown",
"id": "level-shelter",
"metadata": {},
"source": [
"The model instance has a method to get the underlying discrete distribution with the estimated parameters. Specifically, it returns a frozen distribution instance based on inheritance from the scipy.stats distribution classes.\n",
"\n",
"We can then use the methods inherited from scipy's discrete distributions.\n",
"As an example, we use `pmf` to compute the expected probabilities, which returns the same results as the `predict` method of the model.\n",
"Next we compute moments and the dispersion index. The dispersion index, variance divided by mean, is about 0.74, which means the distribution is underdispersed relative to Poisson. Mean, variance, and dispersion index all closely match between the data and the estimated model. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "civil-doubt",
"metadata": {},
"outputs": [],
"source": [
"dfr = res.model.get_distr(res.params)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "precious-steel",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([0.29991925, 0.46869925, 0.18169602, 0.04003344, 0.00764496]),\n",
" array([46.79, 73.12, 28.34, 6.25, 1.19]))"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pr = dfr.pmf(np.arange(5))\n",
"pr, np.round(pr * nobs, 2)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "governmental-store",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array(0.99346659), array(0.74912935), 0.75405590469357)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m, v = dfr.stats()\n",
"m, v, v / m"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "violent-belfast",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.9935897435897436, 0.7371383957922419, 0.7418941273779983)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ms, vs = y.mean(), y.var()\n",
"ms, vs, vs / ms"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "corporate-export",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@josef-pkt (Author)

I just realized that line [10] was saved in the middle of refactoring and repeatedly computes the same distribution.