Skip to content

Instantly share code, notes, and snippets.

@asmaier
Created June 16, 2023 16:03
Show Gist options
  • Save asmaier/5d89179f67b04471e13ba93afa3bad24 to your computer and use it in GitHub Desktop.
Save asmaier/5d89179f67b04471e13ba93afa3bad24 to your computer and use it in GitHub Desktop.
Computing confidence intervals for Poisson distributions
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "872107a8-5805-47ac-9c12-2cd8fbc2ee8a",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Computing confidence intervals for Poisson distributions\n",
"\n",
"Statsmodels and astropy are two python packages which have implementations to compute confidence intervals for Poisson distributions"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7316124c-3b91-40c3-b509-aea2e4195bab",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import statsmodels.stats.rates as st\n",
"import astropy.stats as ast\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "98af47cd-ed97-43c4-8097-6d39f1110453",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"A Poisson distribution is defined by a rate $\\lambda$, which is equal to the mean $\\mu$ and the variation $\\sigma$ of the Poisson distribution. A simple and **wrong** way is therefore to compute confidence intervals like for the normal distribution:\n",
"$$\n",
"(\\mu - \\sqrt{\\sigma}, \\mu + \\sqrt{\\sigma}) = (\\lambda - \\sqrt{\\lambda}, \\lambda + \\sqrt{\\lambda})\n",
"$$\n",
"It should be noted that this will compute the 1$\\sigma$ confidence interval, which is equal to 68.27% of the probability around the mean of the distribution, see [68–95–99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4afda20f-0aa9-45c1-bda4-d2ff4f97595b",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# setting the rate\n",
"lam = 2"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "54e8cbf3-d736-4854-a79b-dbc8fb7cca27",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.5857864376269049, 3.414213562373095)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lam-lam**0.5, lam+lam**0.5"
]
},
{
"cell_type": "markdown",
"id": "4fcd5cc0-060d-4f11-8a9a-1c52646d2486",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This is the same you get when using astropy with the \"root-n\" algorithm. It is called like that, because the rate $\\lambda$ in astropy is called $n$. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "97918ed8-9e87-4470-8917-87106f93a5a4",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.58578644, 3.41421356])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ast.poisson_conf_interval(2, interval='root-n')"
]
},
{
"cell_type": "markdown",
"id": "88967018-7c46-4177-859d-409917777905",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In statsmodels this algorithm is called \"wald\". In statsmodel there is no rate parameter. Instead the function takes the parameters \"count\" and \"exposure\". The rate is internally computed as \n",
"$$\n",
"\\text{rate}=\\frac{\\text{count}}{\\text{exposure}} . \n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7031db28-06d1-4877-b1f9-cbaec6d3cd82",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.5857557303510352, 3.4142442696489645)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(lam, 1, method=\"wald\", alpha=1-0.6827) # 1 sigma = 68.27%"
]
},
{
"cell_type": "markdown",
"id": "52a5720d-de7c-40c2-8905-82a4f25b0ac2",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Now it is common in the literature to compute the 95% confidence interval. We can't do that with astropy and \"root-n\". But we can do it manually like suggested at https://stats.stackexchange.com/questions/15371/how-to-calculate-a-confidence-level-for-a-poisson-distribution "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e2cab60c-b49a-4576-abf5-5d2f6cb7fde8",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(-0.771807670563021, 4.771807670563021)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# a more precice value for the https://en.wikipedia.org/wiki/97.5th_percentile_point than 1.96 is 1.959964\n",
"lam-1.959964*lam**0.5, lam+1.959964*lam**0.5"
]
},
{
"cell_type": "markdown",
"id": "f65077fe-0bbb-4ee1-a7a4-1c35a14c2683",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Here we see a big problem with the root-n/wald algorithm. The confidence interval for small rates can have a negative lower limit. For the Poisson distribution which is strictly positive this doesn't make sense. So let's try this with statsmodel"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "32358176-1de9-4ba7-948e-83446244f6dc",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 4.771807648699356)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"wald\", alpha=1-0.95) # 1 sigma = 68.27%, 2 sigma = 95.45%"
]
},
{
"cell_type": "markdown",
"id": "64d6aa6a-e032-4ef0-abe7-670972ee680d",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Values of the lower interval border below zero will simply be set to zero by statsmodels code (https://www.statsmodels.org/dev/_modules/statsmodels/stats/rates.html#confint_poisson)"
]
},
{
"cell_type": "markdown",
"id": "468e2714-ddee-4314-8a94-983bd5eb6284",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## The correct way\n",
"\n",
"So how can we improve this situation? Simply by not treating the Poisson distribution as normal distribution and using the correct way by Garwood (1936) to compute the confidence intervals. In statsmodel the correct way is called \"exact-c\"."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7747e9f3-07d7-4cc1-aa36-3de2386d326e",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.24220927854396496, 7.224687667723961)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(lam, 1, method=\"exact-c\")"
]
},
{
"cell_type": "markdown",
"id": "a3a13a1a-4e94-4698-9342-78d0b59b7cd3",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In astropy this method is called \"frequentist-confidence\". And instead of a parameter \"alpha\" we set the confidence in terms of \"sigma\", so for a CI of 95% we need sigma=1.959964, see https://en.wikipedia.org/wiki/97.5th_percentile_point"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5fbfa4e0-24e1-4fc6-b735-15841b7fb741",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.24220927, 7.22468772])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ast.poisson_conf_interval(lam, interval=\"frequentist-confidence\", sigma=1.959964) "
]
},
{
"cell_type": "markdown",
"id": "f79c9046-4c6e-4364-b4fa-10d0589d861a",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"To compute the confidence interval for a column of numbers in a pandas dataframe we can do"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "6b1fad11-cf8a-4178-8b5e-074db7e29c3b",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>rate</th>\n",
" <th>lower_bound</th>\n",
" <th>upper_bound</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.1</td>\n",
" <td>5.791710e-17</td>\n",
" <td>3.896009</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.2</td>\n",
" <td>6.372540e-09</td>\n",
" <td>4.097368</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.5</td>\n",
" <td>4.910346e-04</td>\n",
" <td>4.674202</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.9</td>\n",
" <td>1.602430e-02</td>\n",
" <td>5.397120</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.0</td>\n",
" <td>2.531781e-02</td>\n",
" <td>5.571643</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2.0</td>\n",
" <td>2.422093e-01</td>\n",
" <td>7.224688</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3.0</td>\n",
" <td>6.186721e-01</td>\n",
" <td>8.767273</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>4.0</td>\n",
" <td>1.089865e+00</td>\n",
" <td>10.241589</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>5.0</td>\n",
" <td>1.623486e+00</td>\n",
" <td>11.668332</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10.0</td>\n",
" <td>4.795389e+00</td>\n",
" <td>18.390356</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" rate lower_bound upper_bound\n",
"0 0.1 5.791710e-17 3.896009\n",
"1 0.2 6.372540e-09 4.097368\n",
"2 0.5 4.910346e-04 4.674202\n",
"3 0.9 1.602430e-02 5.397120\n",
"4 1.0 2.531781e-02 5.571643\n",
"5 2.0 2.422093e-01 7.224688\n",
"6 3.0 6.186721e-01 8.767273\n",
"7 4.0 1.089865e+00 10.241589\n",
"8 5.0 1.623486e+00 11.668332\n",
"9 10.0 4.795389e+00 18.390356"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({\"rate\": [0.1,0.2,0.5,0.9,1.0,2,3,4,5,10]})\n",
"(df[\"lower_bound\"], df[\"upper_bound\"]) = st.confint_poisson(df, 1, method=\"exact-c\", alpha=1-0.95)\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"id": "7970b5de-6e96-4419-b00a-9fada19de845",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Remarks\n",
"Why do people use the \"wald/root-n method\" to compute the confidence limits of Poisson distribution? \n",
"\n",
"The reason is that for high rates the Poisson distribution approximates the Normal/Gaussian distribution. However even for high rates the \"wald method\" doesn't really converge to the exact CI values for the upper limit. This is because the Poisson distribution is right skewed. The assumption of symmetric confidence intervals like for the normal distribution is therefore not a good idea for the Poisson distribution."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "302722dc-63c5-4eb8-b6dd-3ab864c28355",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"((9804.952467260184, 10197.951624661473),\n",
" (9804.003601545994, 10195.996398454006))"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(10000, 1, method=\"exact-c\"), st.confint_poisson(10000,1 , method=\"wald\")"
]
},
{
"cell_type": "markdown",
"id": "ae4b9fd4-9071-4d9b-b726-d30a5c9cea74",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"So why are there so many other methods to compute the confidence intervals for Poisson distributions?\n",
"\n",
"> it is well known that the nominal coverage level for the Garwood (1936) limits serves only as a lower bound for their actual coverage.\n",
">\n",
"> *Swift (2009): Comparison of Confidence Intervals for a Poisson Mean*\n",
"\n",
"So the confidence intervals computed by that method are sometimes too \"conservative\", because for some rates they cover more than 95%. Scientist are happy about that problem, because that gives them the possibility to publish many, many papers with different ways to approximate the confidence limits of Poisson distribution. Some of these different methods are called \"liberal\", because their actual coverage can drop below 95%. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7c7bba19-9689-4bf2-880f-b9b78848e674",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 5.098975161522809)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"waldccv\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1ecc7955-a94b-482c-891f-72e88d87f5de",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.5484721382575759, 7.292986682436551)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"score\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f326f28a-e7a2-41ea-a471-80a62331a5c0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.3353290568582626, 6.607959749018163)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"midp-c\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "2f4d8688-026d-4b0b-9fac-2d23c080627e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.41560580674333125, 6.416250997015015)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"jeffreys\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e0fac9a5-d777-43cb-b80a-4246eca49e5d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.18855705647417542, 5.732172353872889)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"sqrt\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "f5a74d0b-8205-4c99-9db5-c9e741652ed0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.0, 5.9808720630769265)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"st.confint_poisson(2, 1, method=\"sqrt-a\")"
]
},
{
"cell_type": "markdown",
"id": "6bb87410-8ccc-4351-9ad0-33a7f172fc88",
"metadata": {},
"source": [
"Swift (2009) recommends the method of Blaker (2000). This method is not implemented in the packages mentioned here. But there is an R package if you want to go down that rabbit hole: https://cran.r-project.org/web/packages/BlakerCI/index.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87997f31-6ca4-49d7-8d37-7fa4a32d201b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment