Last active
July 10, 2023 02:55
AB test sample size calculation and analysis procedures
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from scipy.stats import norm, t\n", | |
"import numpy as np" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Sample size for A/B tests\n", | |
"The sample size required for an A/B test depends on\n", | |
"- the type of evaluation metric (continuous vs rate)\n", | |
"- one-sided hypothesis (is treatment better than control) vs two-sided hypothesis (is treatment different from control, either better or worse)\n", | |
"- significance level (critical p-value for rejecting the null hypothesis)\n", | |
"- power of the test (probability of detecting an effect if it exists)\n", | |
"- traffic allocation ratio between treatment and control\n", | |
"- desired minimum detectable effect size\n", | |
"\n", | |
"## Sample size for continuous evaluation metric, e.g. revenue per visit\n", | |
"\n", | |
"### Formula\n", | |
"The sample size required under a one-sided hypothesis is given by\n", | |
"\n", | |
"\\begin{align*}\n", | |
"\\mathbf{N} &= (r+1) \\sigma^2 \\Bigl(\\frac{q_{1-\\alpha} + q_{1-\\beta}}{\\Delta\\mu} \\Bigr)^2 \\\\\n", | |
"\\\\\n", | |
"\\text{Where}& \\\\\n", | |
"\\mathbf{N} &= \\text{the sample size required for the treatment group} \\\\\n", | |
"r &= \\frac{N_{treatment}}{N_{control}} \\text{ traffic allocation ratio, any positive value; 1 means equal allocation } \\\\\n", | |
"\\sigma^2 &= \\text{estimate of the population variance of the evaluation metric} \\\\\n", | |
"\\alpha &= \\text{significance level (critical p-value), typically 5%} \\\\\n", | |
"\\beta &= \\text{type II error rate, i.e. 1 - power; typically 20% for a power of 80%} \\\\\n", | |
"q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n", | |
"\\Delta\\mu &= \\text{desired minimum detectable difference in the evaluation metric between treatment and control} \\\\\n", | |
"\\\\\n", | |
"\\text{For the common}&\\text{ use case of a 5% significance level, 80% power, and equal traffic allocation, this simplifies to }\\\\\n", | |
"\\\\\n", | |
"\\mathbf{N} &\\approx 12.4 \\frac{\\sigma^2}{\\Delta\\mu^2} \\\\\n", | |
"\\\\\n", | |
"\\text{The formula for}&\\text{ a two-sided hypothesis is very similar, with a small modification on }\\alpha\\\\\n", | |
"\\\\\n", | |
"\\mathbf{N} &= (r+1) \\sigma^2 \\Bigl(\\frac{q_{1-\\alpha/2} + q_{1-\\beta}}{\\Delta\\mu} \\Bigr)^2 \\\\\n", | |
"\\end{align*}\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# one sided: treatment better than control\n", | |
"# returns the treatment sample size\n", | |
"def sample_size_mean_1side(sig, power, pop_var, lift, allocation_ratio=1):\n", | |
" q_alpha = norm.isf(sig)\n", | |
" q_beta = norm.isf(1-power)\n", | |
" n = (allocation_ratio + 1) * (q_alpha + q_beta) ** 2 * pop_var / lift ** 2\n", | |
" return int(np.ceil(n))\n", | |
"\n", | |
"#two-sided: both better and worse\n", | |
"# returns the treatment sample size\n", | |
"def sample_size_mean_2side(sig, power, pop_var, lift, allocation_ratio=1):\n", | |
" return sample_size_mean_1side(sig/2, power, pop_var, lift, allocation_ratio)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Example usage\n", | |
"We want to run an experiment to measure the effect of a new hero image on revenue per visit. From historical data, we know that revenue per visit is 10 dollars with a variance of 100. We want to be able to detect an increase of at least 5% (0.5 dollars) at a significance level of 5% and a power of 80%. What sample size does this experiment require?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4947" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sample_size_mean_1side(0.05, 0.8, 100, 0.5, allocation_ratio=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"6280" | |
] | |
}, | |
"execution_count": 49, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sample_size_mean_2side(0.05, 0.8, 100, 0.5, allocation_ratio=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Sample size for rate evaluation metric, e.g. click through rate\n", | |
"\n", | |
"### Formula\n", | |
"The sample size required under a one-sided hypothesis is given by\n", | |
"\n", | |
"\\begin{align*}\n", | |
"\\mathbf{N} &= \\Bigl[(p+\\Delta p)[1-(p+\\Delta p)] + r\\,p(1-p)\\Bigr] \\Bigl(\\frac{q_{1-\\alpha} + q_{1-\\beta}}{\\Delta p} \\Bigr)^2 \\\\\n", | |
"\\\\\n", | |
"\\text{Where}& \\\\\n", | |
"\\mathbf{N} &= \\text{the sample size required for the treatment group} \\\\\n", | |
"r &= \\frac{N_{treatment}}{N_{control}} \\text{ traffic allocation ratio, any positive value; 1 means equal allocation } \\\\\n", | |
"p &= \\text{estimate of the baseline rate in the control population} \\\\\n", | |
"\\Delta p &= \\text{desired minimum detectable absolute difference in the rate between treatment and control} \\\\\n", | |
"\\alpha &= \\text{significance level (critical p-value), typically 5%} \\\\\n", | |
"\\beta &= \\text{type II error rate, i.e. 1 - power; typically 20% for a power of 80%} \\\\\n", | |
"q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n", | |
"\\\\\n", | |
"\\text{For the common}&\\text{ use case of a 5% significance level, 80% power, equal traffic allocation, and a lift that is small relative to the baseline rate, this simplifies to }\\\\\n", | |
"\\\\\n", | |
"\\mathbf{N} &\\approx 12.4 \\frac{p(1-p)}{\\Delta p^2} \\\\\n", | |
"\\\\\n", | |
"\\text{The formula for}&\\text{ a two-sided hypothesis is very similar, with a small modification on }\\alpha\\\\\n", | |
"\\\\\n", | |
"\\mathbf{N} &= \\Bigl[(p+\\Delta p)[1-(p+\\Delta p)] + r\\,p(1-p)\\Bigr] \\Bigl(\\frac{q_{1-\\alpha/2} + q_{1-\\beta}}{\\Delta p} \\Bigr)^2 \\\\\n", | |
"\\end{align*}\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# one sided - treatment better than control\n", | |
"# returns the treatment sample size\n", | |
"def sample_size_proportion_1side(sig, power, base_rate, abs_improvement, allocation_ratio=1):\n", | |
" z_alpha = norm.isf(sig)\n", | |
" z_beta = norm.isf(1-power)\n", | |
" improve_rate = base_rate + abs_improvement\n", | |
" # allocation_ratio = n_treatment / n_control, so the control variance term is the one scaled by it\n", | |
" n = (z_alpha + z_beta) ** 2 * (improve_rate*(1-improve_rate) + allocation_ratio*base_rate*(1-base_rate)) / (abs_improvement) ** 2\n", | |
" return int(np.ceil(n))\n", | |
"\n", | |
"\n", | |
"#both better and worse\n", | |
"# returns the treatment sample size\n", | |
"def sample_size_proportion_2side(sig, power, base_rate, abs_improvement, allocation_ratio=1):\n", | |
" return sample_size_proportion_1side(sig/2, power, base_rate, abs_improvement, allocation_ratio)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Example usage\n", | |
"We want to run an experiment to measure the effect of a new hero image on sign up rate. From historical data, we know that the sign up rate is about 10%. We want to be able to detect a relative increase of at least 5% (0.5 percentage points in absolute terms) at a significance level of 5% and a power of 80%. What sample size does this experiment require?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"45498" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sample_size_proportion_1side(0.05, 0.8, 0.1, 0.005, allocation_ratio=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"57760" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sample_size_proportion_2side(0.05, 0.8, 0.1, 0.005, allocation_ratio=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Analysis procedure for A/B tests\n", | |
"\n", | |
"## Analysis for continuous evaluation metric, e.g. revenue per visit\n", | |
"\n", | |
"Use Welch's t-test (which does not assume equal variances) to compare continuous evaluation metrics across groups, assuming a one-sided hypothesis\n", | |
"\n", | |
"\\begin{align*}\n", | |
"S_p^2 &= \\frac{S_1^2}{N_1} + \\frac{S_2^2}{N_2} \\\\\n", | |
"t &= \\frac{\\Delta\\mu}{\\sqrt{S_p^2}} \\\\\n", | |
"df &= \\frac{(S_p^2)^2}{\\frac{(S_1^2/N_1)^2}{N_1-1} + \\frac{(S_2^2/N_2)^2}{N_2-1}}\\\\\n", | |
"\\text{p-value} &= 1-cdf(t, df) \\\\\n", | |
"\\\\\n", | |
"\\text{Reject the null} &\\text{ hypothesis if the p-value is below the desired significance level, typically 5%}\n", | |
"\\\\\n", | |
"\\\\\n", | |
"\\text{Where}& \\\\\n", | |
"S_x^2 &= \\text{Sample variance of the treatment/control groups} \\\\\n", | |
"N_x &= \\text{Sample size of the treatment/control groups} \\\\\n", | |
"\\Delta\\mu &= \\text{mean of the evaluation metric in the treatment group - mean in the control group} \\\\\n", | |
"df &= \\text{degrees of freedom (Welch-Satterthwaite approximation)} \\\\\n", | |
"cdf &= \\text{cumulative distribution function of the t-distribution}\n", | |
"\\\\\n", | |
"\\\\\n", | |
"\\text{The formula for}&\\text{ a two-sided hypothesis is very similar, with a small modification on the p-value}\\\\\n", | |
"\\\\\n", | |
"\\text{p-value} &= 2(1-cdf(|t|, df)) \\\\\n", | |
"\\\\\n", | |
"\\text{If the confidence}&\\text{ interval around the difference of evaluation metric between treatment and control is of interest}\\\\\n", | |
"\\\\\n", | |
"ci &= \\bigl(\\Delta\\mu - t_{(1+conf)/2,\\,df}\\sqrt{S_p^2}, \\enspace \\Delta\\mu + t_{(1+conf)/2,\\,df}\\sqrt{S_p^2}\\bigr) \\\\\n", | |
"\\\\\n", | |
"\\text{Where}& \\\\\n", | |
"conf &= \\text{the confidence level of the confidence interval, typically 95%} \\\\\n", | |
"t_{x,\\,df} &= \\text{x level quantile of the t-distribution with df degrees of freedom} \\\\\n", | |
"\\end{align*}\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# one-sided Welch's t-test, alternative hypothesis: mu_treat > mu_control\n", | |
"# does not assume equal variance between the 2 samples; returns the p-value\n", | |
"def mean_test_1side(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):\n", | |
" diff = mu_treat - mu_control\n", | |
" s = (var_treat/n_treat + var_control/n_control) ** 0.5\n", | |
" t_stat = diff / s\n", | |
" df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n", | |
" prob = 1 - t.cdf(t_stat, df)\n", | |
" return prob\n", | |
"\n", | |
"\n", | |
"# two-sided Welch's t-test, alternative hypothesis: mu_treat != mu_control\n", | |
"# does not assume equal variance between the 2 samples; returns the p-value\n", | |
"def mean_test_2side(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):\n", | |
" diff = mu_treat - mu_control\n", | |
" s = (var_treat/n_treat + var_control/n_control) ** 0.5\n", | |
" t_stat = diff / s\n", | |
" df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n", | |
" prob = 1 - t.cdf(abs(t_stat), df)\n", | |
" return 2*prob\n", | |
"\n", | |
"# confidence interval for the difference in means (treatment - control)\n", | |
"def mean_test_ci(var_treat, var_control, mu_treat, mu_control, n_treat, n_control, conf=0.95):\n", | |
" diff = mu_treat - mu_control\n", | |
" s = (var_treat/n_treat + var_control/n_control) ** 0.5\n", | |
" df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n", | |
" t_crit = t.isf((1-conf)/2, df)\n", | |
" return (diff - t_crit*s, diff + t_crit*s)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Example usage\n", | |
"We ran an experiment to measure the effect of a new hero image on revenue per visit. For the control group we saw revenue per visit of 10 dollars with a variance of 100 across 5500 visits, and for the treatment group we saw revenue per visit of 10.5 dollars with a variance of 110 across 5000 visits. Can we reject the null hypothesis at a significance level of 5%? And what is the confidence interval around the improvement?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"p-value of the one-sided test is 0.6%\n" | |
] | |
} | |
], | |
"source": [ | |
"print('p-value of the one-sided test is {:0.1%}'.format(mean_test_1side(110, 100, 10.5, 10, 5000, 5500)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"p-value of the two-sided test is 1.3%\n" | |
] | |
} | |
], | |
"source": [ | |
"print('p-value of the two-sided test is {:0.1%}'.format(mean_test_2side(110, 100, 10.5, 10, 5000, 5500)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The 95% confidence interval of the improvement is between (0.1 dollars, 0.9 dollars)\n" | |
] | |
} | |
], | |
"source": [ | |
"print('The 95% confidence interval of the improvement is between ({:0.1f} dollars, {:0.1f} dollars)'.format(*mean_test_ci(110, 100, 10.5, 10, 5000, 5500)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Treatment beats control at a significance level of 5% (p-value 0.6% < 5%)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Analysis for rate evaluation metric, e.g. click through rate\n", | |
"\n", | |
"Use the two-proportion z-test to compare rate evaluation metrics across groups, assuming a one-sided hypothesis\n", | |
"\n", | |
"\\begin{align*}\n", | |
"p_{pool} &= \\frac{p_1N_1 + p_2N_2}{N_1 + N_2} \\\\\n", | |
"SE &= \\sqrt{p_{pool}(1-p_{pool})\\Bigl(\\frac{1}{N_1} + \\frac{1}{N_2}\\Bigr)} \\\\\n", | |
"z &= \\frac{\\Delta p}{SE} \\\\\n", | |
"\\text{p-value} &= 1-cdf(z) \\\\\n", | |
"\\\\\n", | |
"\\text{Reject the null} &\\text{ hypothesis if the p-value is below the desired significance level, typically 5%}\n", | |
"\\\\\n", | |
"\\\\\n", | |
"\\text{Where}& \\\\\n", | |
"p_x &= \\text{evaluation metric rate of the treatment/control groups} \\\\\n", | |
"N_x &= \\text{Sample size of the treatment/control groups} \\\\\n", | |
"\\Delta p &= \\text{rate of the treatment group - rate of the control group} \\\\\n", | |
"cdf &= \\text{cumulative distribution function of the standard normal distribution}\n", | |
"\\\\\n", | |
"\\\\\n", | |
"\\text{The formula for}&\\text{ a two-sided hypothesis is very similar, with a small modification on the p-value}\\\\\n", | |
"\\\\\n", | |
"\\text{p-value} &= 2(1-cdf(|z|)) \\\\\n", | |
"\\\\\n", | |
"\\text{If the confidence}&\\text{ interval around the difference of evaluation metric between treatment and control is of interest}\\\\\n", | |
"\\\\\n", | |
"ci &= \\bigl(\\Delta p - q_{(1+conf)/2}SE, \\enspace \\Delta p + q_{(1+conf)/2}SE \\bigr) \\\\\n", | |
"\\\\\n", | |
"\\text{Where}& \\\\\n", | |
"conf &= \\text{the confidence level of the confidence interval, typically 95%} \\\\\n", | |
"q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n", | |
"\\end{align*}\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# p1 > p2\n", | |
"def proportion_test_1side(p_treat, p_control, n_treat, n_control):\n", | |
" pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n", | |
" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n", | |
" z = (p_treat - p_control) / standard_error\n", | |
" prob = 1 - norm.cdf(z)\n", | |
" return prob\n", | |
"\n", | |
"# p1 != p2\n", | |
"def proportion_test_2side(p_treat, p_control, n_treat, n_control):\n", | |
" pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n", | |
" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n", | |
" z = (p_treat - p_control) / standard_error\n", | |
" prob = 1 - norm.cdf(abs(z))\n", | |
" return 2*prob\n", | |
"\n", | |
"def proportion_test_ci(p_treat, p_control, n_treat, n_control, conf=0.95):\n", | |
" pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n", | |
" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n", | |
" diff = p_treat-p_control\n", | |
" z_crit = norm.isf((1-conf)/2)\n", | |
" return (diff - z_crit*standard_error, diff + z_crit*standard_error)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Example usage\n", | |
"We ran an experiment to measure the effect of a new hero image on signup rate. For the control group we saw a signup rate of 10% across 5500 visits, and for the treatment group we saw a signup rate of 10.5% across 5000 visits. Can we reject the null hypothesis at a significance level of 5%? And what is the confidence interval around the improvement?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"p-value of the one-sided test is 19.9%\n" | |
] | |
} | |
], | |
"source": [ | |
"print('p-value of the one-sided test is {:0.1%}'.format(proportion_test_1side(0.105, 0.1, 5000, 5500)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"p-value of the two-sided test is 39.9%\n" | |
] | |
} | |
], | |
"source": [ | |
"print('p-value of the two-sided test is {:0.1%}'.format(proportion_test_2side(0.105, 0.1, 5000, 5500)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The 95% confidence interval of the improvement is between (-0.7%, 1.7%)\n" | |
] | |
} | |
], | |
"source": [ | |
"print('The 95% confidence interval of the improvement is between ({:0.1%}, {:0.1%})'.format(*proportion_test_ci(0.105, 0.1, 5000, 5500)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### We cannot conclude that treatment is better than control at a significance level of 5% (p-value 19.9% > 5%)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
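
The sample size formula for continuous metrics in the notebook above can be sanity-checked by running it in reverse: given the returned treatment sample size, the approximate power should land at or just above the requested 80%. The sketch below is an illustration only. It uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`, and the function names `sample_size_mean` and `achieved_power_mean` are hypothetical, not part of the notebook.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def sample_size_mean(sig, power, pop_var, lift, allocation_ratio=1.0, two_sided=False):
    """Treatment-group sample size under the normal approximation.

    allocation_ratio is n_treatment / n_control, as in the notebook.
    """
    alpha = sig / 2 if two_sided else sig
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # upper-tail critical value
    z_beta = STD_NORMAL.inv_cdf(power)       # quantile at level 1 - beta
    n = (allocation_ratio + 1) * pop_var * ((z_alpha + z_beta) / lift) ** 2
    return ceil(n)


def achieved_power_mean(n_treat, sig, pop_var, lift, allocation_ratio=1.0, two_sided=False):
    """Approximate power for a given treatment-group size.

    For the two-sided case this ignores the (tiny) chance of rejecting
    in the wrong direction, which is the usual simplification.
    """
    alpha = sig / 2 if two_sided else sig
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    n_control = n_treat / allocation_ratio
    se = sqrt(pop_var / n_treat + pop_var / n_control)
    return STD_NORMAL.cdf(lift / se - z_alpha)


# Same inputs as the revenue-per-visit example: variance 100, lift 0.5 dollars.
n_one_sided = sample_size_mean(0.05, 0.8, 100, 0.5)
n_two_sided = sample_size_mean(0.05, 0.8, 100, 0.5, two_sided=True)
print(n_one_sided, n_two_sided)  # 4947 6280, matching the notebook outputs
print(achieved_power_mean(n_one_sided, 0.05, 100, 0.5) >= 0.8)  # True
```

Because the sample size is rounded up, the achieved power sits marginally above the target; a noticeably lower value would suggest the variance or lift inputs were mixed up.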