@yifeihuang, last active July 10, 2023
AB test sample size calculation and analysis procedures
{
"cells": [
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import norm, t\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sample size for A/B tests\n",
"The sample size required for an A/B test depends on\n",
"- continuous vs. rate evaluation metric\n",
"- one-sided hypothesis (is treatment better than control?) vs. two-sided hypothesis (is treatment different from control, either better or worse?)\n",
"- significance threshold (critical p-value for rejecting the null hypothesis)\n",
"- power of the test (probability of detecting an effect if it exists)\n",
"- traffic allocation ratio between experiment and control\n",
"- desired minimum detectable effect size\n",
"\n",
"## Sample size for continuous evaluation metric, e.g. revenue per visit\n",
"\n",
"### Formula\n",
"The formula for sample size required under a one-sided hypothesis is given by\n",
"\n",
"\\begin{align*}\n",
"\\mathbf{N} &= (r+1) \\sigma^2 \\Bigl(\\frac{q_\\alpha + q_{1-\\beta}}{\\Delta\\mu} \\Bigr)^2 \\\\\n",
"\\\\\n",
"\\text{Where}& \\\\\n",
"\\mathbf{N} &= \\text{the sample size required for the treatment group} \\\\\n",
"r &= \\frac{N_{treatment}}{N_{control}} \\text{ traffic allocation ratio; any } r > 0 \\text{, with } r = 1 \\text{ for equal allocation} \\\\\n",
"\\sigma^2 &= \\text{estimate of the population variance of the evaluation metric} \\\\\n",
"\\alpha &= \\text{significance level (critical p-value), typically 5%} \\\\\n",
"\\beta &= \\text{power of the test (i.e. 1 - type II error rate), typically 80%} \\\\\n",
"q_x &= \\text{upper x level quantile of the standard normal distribution} \\\\\n",
"\\Delta\\mu &= \\text{desired minimum detectable difference in the evaluation metric between treatment and control} \\\\\n",
"\\\\\n",
"\\text{For common}&\\text{ use cases of a 5% critical p-value, 80% power, and equal traffic allocation, this simplifies to }\\\\\n",
"\\\\\n",
"\\mathbf{N} &\\approx 12.4 \\frac{\\sigma^2}{\\Delta\\mu^2} \\\\\n",
"\\\\\n",
"\\text{The formula for}&\\text{ a two-sided hypothesis is the same, with a small modification to }\\alpha\\\\\n",
"\\\\\n",
"\\mathbf{N} &= (r+1) \\sigma^2 \\Bigl(\\frac{q_{\\alpha/2} + q_{1-\\beta}}{\\Delta\\mu} \\Bigr)^2 \\\\\n",
"\\end{align*}\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# one-sided: treatment better than control\n",
"# returns the treatment sample size\n",
"def sample_size_mean_1side(sig, power, pop_var, lift, allocation_ratio=1):\n",
" q_alpha = norm.isf(sig)\n",
" q_beta = norm.isf(1-power)\n",
" n = (allocation_ratio + 1) * (q_alpha + q_beta) ** 2 * pop_var / lift ** 2\n",
" return int(np.ceil(n))\n",
"\n",
"# two-sided: treatment different from control (better or worse)\n",
"# returns the treatment sample size\n",
"def sample_size_mean_2side(sig, power, pop_var, lift, allocation_ratio=1):\n",
" return sample_size_mean_1side(sig/2, power, pop_var, lift, allocation_ratio)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example usage\n",
"We want to run an experiment to measure the effect of a new hero image on revenue per visit. From historical data, we know revenue per visit is about 10 dollars with a variance of 100. If we want to detect an increase of at least 5% (0.5 dollars) at a 5% significance threshold with 80% power, what sample size is required?"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4947"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_size_mean_1side(0.05, 0.8, 100, 0.5, allocation_ratio=1)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6280"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_size_mean_2side(0.05, 0.8, 100, 0.5, allocation_ratio=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sample size for rate evaluation metric, e.g. click through rate\n",
"\n",
"### Formula\n",
"The formula for sample size required under a one-sided hypothesis is given by\n",
"\n",
"\\begin{align*}\n",
"\\mathbf{N} &= \\Bigl[r(p+\\Delta p)[1-(p+\\Delta p)] + p(1-p)\\Bigr] \\Bigl(\\frac{q_\\alpha + q_{1-\\beta}}{\\Delta p} \\Bigr)^2 \\\\\n",
"\\\\\n",
"\\text{Where}& \\\\\n",
"\\mathbf{N} &= \\text{the sample size required for the treatment group} \\\\\n",
"r &= \\frac{N_{treatment}}{N_{control}} \\text{ traffic allocation ratio; any } r > 0 \\text{, with } r = 1 \\text{ for equal allocation} \\\\\n",
"p &= \\text{estimate of the baseline rate in the control population} \\\\\n",
"\\Delta p &= \\text{desired minimum detectable difference in the evaluation metric between treatment and control} \\\\\n",
"\\alpha &= \\text{significance level (critical p-value), typically 5%} \\\\\n",
"\\beta &= \\text{power of the test (i.e. 1 - type II error rate), typically 80%} \\\\\n",
"q_x &= \\text{upper x level quantile of the standard normal distribution} \\\\\n",
"\\\\\n",
"\\text{For common}&\\text{ use cases of a 5% critical p-value, 80% power, and equal traffic allocation, this simplifies to }\\\\\n",
"\\\\\n",
"\\mathbf{N} &\\approx 12.4 \\frac{p(1-p)}{\\Delta p^2} \\\\\n",
"\\\\\n",
"\\text{The formula for}&\\text{ a two-sided hypothesis is the same, with a small modification to }\\alpha\\\\\n",
"\\\\\n",
"\\mathbf{N} &= \\Bigl[r(p+\\Delta p)[1-(p+\\Delta p)] + p(1-p)\\Bigr] \\Bigl(\\frac{q_{\\alpha/2} + q_{1-\\beta}}{\\Delta p} \\Bigr)^2 \\\\\n",
"\\end{align*}\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"# one-sided: treatment better than control\n",
"# returns the treatment sample size\n",
"def sample_size_proportion_1side(sig, power, base_rate, abs_improvement, allocation_ratio=1):\n",
" z_alpha = norm.isf(sig)\n",
" z_beta = norm.isf(1-power)\n",
" improve_rate = base_rate + abs_improvement\n",
" n = (z_alpha + z_beta) ** 2 * (allocation_ratio*improve_rate*(1-improve_rate) + base_rate*(1-base_rate)) / (abs_improvement) ** 2\n",
" return int(np.ceil(n))\n",
"\n",
"\n",
"# two-sided: treatment different from control (better or worse)\n",
"# returns the treatment sample size\n",
"def sample_size_proportion_2side(sig, power, base_rate, abs_improvement, allocation_ratio=1):\n",
" return sample_size_proportion_1side(sig/2, power, base_rate, abs_improvement, allocation_ratio)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example usage\n",
"We want to run an experiment to measure the effect of a new hero image on signup rate. From historical data, we know the signup rate is about 10%. If we want to detect an increase of at least 5% (0.5% in absolute terms) at a 5% significance threshold with 80% power, what sample size is required?"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"45498"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_size_proportion_1side(0.05, 0.8, 0.1, 0.005, allocation_ratio=1)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"57760"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_size_proportion_2side(0.05, 0.8, 0.1, 0.005, allocation_ratio=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis procedure for A/B tests\n",
"\n",
"## Analysis for continuous evaluation metric, e.g. revenue per visit\n",
"\n",
"Use Welch's t-test (which does not assume equal variances) to compare continuous evaluation metrics across groups, assuming a one-sided hypothesis\n",
"\n",
"\\begin{align*}\n",
"S_p^2 &= \\frac{S_1^2}{N_1} + \\frac{S_2^2}{N_2} \\\\\n",
"t &= \\frac{\\Delta\\mu}{\\sqrt{S_p^2}} \\\\\n",
"df &= \\frac{(S_p^2)^2}{\\frac{(S_1^2/N_1)^2}{N_1-1} + \\frac{(S_2^2/N_2)^2}{N_2-1}}\\\\\n",
"p-value &= 1-cdf(t, df) \\\\\n",
"\\\\\n",
"\\text{Reject the null} &\\text{ hypothesis if the p-value is below the desired significance level, typically 5%}\n",
"\\\\\n",
"\\\\\n",
"\\text{Where}& \\\\\n",
"S_x^2 &= \\text{Sample variance of the treatment/control groups} \\\\\n",
"N_x &= \\text{Sample size of the treatment/control groups} \\\\\n",
"\\Delta\\mu &= \\text{evaluation metric of the treatment group - evaluation metric of the control group} \\\\\n",
"df &= \\text{degrees of freedom of the experiment} \\\\\n",
"cdf &= \\text{cumulative distribution function of the t-distribution}\n",
"\\\\\n",
"\\\\\n",
"\\text{Formula for}&\\text{ a two-sided hypothesis is very similar with a small modification on p-value}\\\\\n",
"\\\\\n",
"p-value &= 2(1-cdf(|t|, df)) \\\\\n",
"\\\\\n",
"\\text{If the confidence}&\\text{ interval around the difference of evaluation metric between treatment and control is of interest}\\\\\n",
"\\\\\n",
"ci &= \\bigl(\\Delta\\mu - t_{(1-conf)/2,\\,df}\\sqrt{S_p^2}, \\enspace \\Delta\\mu + t_{(1-conf)/2,\\,df}\\sqrt{S_p^2} \\bigr) \\\\\n",
"\\\\\n",
"\\text{Where}& \\\\\n",
"conf &= \\text{the confidence level of the confidence interval, typically 95%} \\\\\n",
"t_{x,df} &= \\text{upper x level quantile of the t-distribution with df degrees of freedom} \\\\\n",
"\\end{align*}\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# one-sided: mu_treat > mu_control\n",
"# does not assume equal variance between the 2 samples\n",
"def mean_test_1side(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):\n",
" diff = mu_treat - mu_control\n",
" s = (var_treat/n_treat + var_control/n_control) ** 0.5\n",
" t_stat = diff / s\n",
" df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n",
" prob = 1 - t.cdf(t_stat, df)\n",
" return prob\n",
"\n",
"\n",
"# two-sided: mu_treat != mu_control\n",
"# does not assume equal variance between the 2 samples\n",
"def mean_test_2side(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):\n",
" diff = mu_treat - mu_control\n",
" s = (var_treat/n_treat + var_control/n_control) ** 0.5\n",
" t_stat = diff / s\n",
" df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n",
" prob = 1 - t.cdf(abs(t_stat), df)\n",
" return 2*prob\n",
"\n",
"# confidence interval for the difference (t critical value with Welch df)\n",
"def mean_test_ci(var_treat, var_control, mu_treat, mu_control, n_treat, n_control, conf=0.95):\n",
" diff = mu_treat - mu_control\n",
" s = (var_treat/n_treat + var_control/n_control) ** 0.5\n",
" df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n",
" t_crit = t.isf((1-conf)/2, df)\n",
" return (diff - t_crit*s, diff + t_crit*s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example usage\n",
"We ran an experiment to measure the effect of a new hero image on revenue per visit. The control group saw revenue per visit of 10 dollars with a variance of 100 across 5500 visits; the treatment group saw revenue per visit of 10.5 dollars with a variance of 110 across 5000 visits. Can we reject the null hypothesis at the 5% significance level? And what is the confidence interval around the improvement?"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"p-value of the one-sided test is 0.6%\n"
]
}
],
"source": [
"print('p-value of the one-sided test is {:0.1%}'.format(mean_test_1side(110, 100, 10.5, 10, 5000, 5500)))"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"p-value of the two-sided test is 1.3%\n"
]
}
],
"source": [
"print('p-value of the two-sided test is {:0.1%}'.format(mean_test_2side(110, 100, 10.5, 10, 5000, 5500)))"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The 95% confidence interval of the improvement is between (0.1 dollars, 0.9 dollars)\n"
]
}
],
"source": [
"print('The 95% confidence interval of the improvement is between ({:0.1f} dollars, {:0.1f} dollars)'.format(*mean_test_ci(110, 100, 10.5, 10, 5000, 5500)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Treatment beats control at the 5% significance level (0.6% < 5%)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis for rate evaluation metric, e.g. click through rate\n",
"\n",
"Use the pooled two-proportion z-test to compare rate evaluation metrics across groups, assuming a one-sided hypothesis\n",
"\n",
"\\begin{align*}\n",
"p_{pool} &= \\frac{p_1N_1 + p_2N_2}{N_1 + N_2} \\\\\n",
"SE &= \\sqrt{p_{pool}(1-p_{pool})(\\frac{1}{N_1} + \\frac{1}{N_2})} \\\\\n",
"z &= \\frac{\\Delta p}{SE} \\\\\n",
"p-value &= 1-cdf(z) \\\\\n",
"\\\\\n",
"\\text{Reject the null} &\\text{ hypothesis if the p-value is below the desired significance level, typically 5%}\n",
"\\\\\n",
"\\\\\n",
"\\text{Where}& \\\\\n",
"p_x &= \\text{evaluation metric rate of the treatment/control groups} \\\\\n",
"N_x &= \\text{Sample size of the treatment/control groups} \\\\\n",
"\\Delta p &= \\text{evaluation metric of the treatment group - evaluation metric of the control group} \\\\\n",
"cdf &= \\text{cumulative distribution function of the standard normal distribution}\n",
"\\\\\n",
"\\\\\n",
"\\text{Formula for}&\\text{ a two-sided hypothesis is very similar with a small modification on p-value}\\\\\n",
"\\\\\n",
"p-value &= 2(1-cdf(|z|)) \\\\\n",
"\\\\\n",
"\\text{If the confidence}&\\text{ interval around the difference of evaluation metric between treatment and control is of interest}\\\\\n",
"\\\\\n",
"ci &= \\bigl(\\Delta p - q_{(1-conf)/2}SE, \\enspace \\Delta p + q_{(1-conf)/2}SE \\bigr) \\\\\n",
"\\\\\n",
"\\text{Where}& \\\\\n",
"conf &= \\text{the confidence level of the confidence interval, typically 95%} \\\\\n",
"q_x &= \\text{upper x level quantile of the standard normal distribution} \\\\\n",
"\\end{align*}\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# one-sided: p_treat > p_control\n",
"def proportion_test_1side(p_treat, p_control, n_treat, n_control):\n",
" pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n",
" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
" z = (p_treat - p_control) / standard_error\n",
" prob = 1 - norm.cdf(z)\n",
" return prob\n",
"\n",
"# two-sided: p_treat != p_control\n",
"def proportion_test_2side(p_treat, p_control, n_treat, n_control):\n",
" pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n",
" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
" z = (p_treat - p_control) / standard_error\n",
" prob = 1 - norm.cdf(abs(z))\n",
" return 2*prob\n",
"\n",
"def proportion_test_ci(p_treat, p_control, n_treat, n_control, conf=0.95):\n",
" pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n",
" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
" diff = p_treat-p_control\n",
" z_crit = norm.isf((1-conf)/2)\n",
" return (diff - z_crit*standard_error, diff + z_crit*standard_error)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example usage\n",
"We ran an experiment to measure the effect of a new hero image on signup rate. The control group saw a signup rate of 10% across 5500 visits; the treatment group saw a signup rate of 10.5% across 5000 visits. Can we reject the null hypothesis at the 5% significance level? And what is the confidence interval around the improvement?"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"p-value of the one-sided test is 19.9%\n"
]
}
],
"source": [
"print('p-value of the one-sided test is {:0.1%}'.format(proportion_test_1side(0.105, 0.1, 5000, 5500)))"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"p-value of the two-sided test is 39.9%\n"
]
}
],
"source": [
"print('p-value of the two-sided test is {:0.1%}'.format(proportion_test_2side(0.105, 0.1, 5000, 5500)))"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The 95% confidence interval of the improvement is between (-0.7%, 1.7%)\n"
]
}
],
"source": [
"print('The 95% confidence interval of the improvement is between ({:0.1%}, {:0.1%})'.format(*proportion_test_ci(0.105, 0.1, 5000, 5500)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We cannot conclude that treatment is better than control at the 5% significance level (19.9% > 5%)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}