yifeihuang/AB test formulas.ipynb

## AB test formulas.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import norm, t\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sample size for A/B tests\n",
    "Sample size required for an A/B test is dependent on\n",
    "- continuous vs rate evaluation metric\n",
    "- one-side hypothesis (is treatment better than control) vs two-side hypothesis (is treatment different, either better or worse from control)\n",
    "- confidence threshold (critical p-value for rejecting the null hypothesis)\n",
    "- Power of the test (probability of detecting an effect if it exists)\n",
    "- traffic allocation ratio between experiment and control\n",
    "- desired minimum detectable effect size\n",
    "\n",
    "## Sample size for continuous evaluation metric, e.g. revenue per visit\n",
    "\n",
    "### Formula\n",
    "The formula for sample size required under a one-sided hypothesis is given by\n",
    "\n",
    "\\begin{align*}\n",
    "\\mathbf{N} &= (r+1) \\sigma^2 \\Bigl(\\frac{q_\\alpha + q_{1-\\beta}}{\\Delta\\mu} \\Bigr)^2 \\\\\n",
    "\\\\\n",
    "\\text{Where}& \\\\\n",
    "\\mathbf{N} &= \\text{the sample size required for the treatment group} \\\\\n",
    "r &= \\frac{N_{treatment}}{N_{control}} \\text{ traffic allocation ratio, between (0,1) } \\\\\n",
    "\\sigma^2 &= \\text{estimate of the population variance of the evaluation metric} \\\\\n",
    "\\alpha &= \\text{confidence level threshold or critical p-value, typically 5%} \\\\\n",
    "\\beta &= \\text{power of the test, typically 80%} \\\\\n",
    "q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n",
    "\\Delta\\mu &= \\text{desired miniumum detectable difference in the evaluation metric between treatment and control} \\\\\n",
    "\\\\\n",
    "\\text{For common}&\\text{ usecases of 5% critical p-value, 80% power, and equal traffic allocation, this simplifies to }\\\\\n",
    "\\\\\n",
    "\\mathbf{N} &\\approx 12.4 \\frac{\\sigma^2}{\\Delta\\mu^2} \\\\\n",
    "\\\\\n",
    "\\text{Formula for}&\\text{ a two-sided hypothesis is very similar with a small modification on }\\alpha\\\\\n",
    "\\\\\n",
    "\\mathbf{N} &= (r+1) \\sigma^2 \\Bigl(\\frac{q_{\\alpha/2} + q_{1-\\beta}}{\\Delta\\mu} \\Bigr)^2 \\\\\n",
    "\\end{align*}\n"
   ]
  },
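  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the simplified constant, take equal allocation (r = 1) with a 5% critical p-value and 80% power. The standard normal quantiles are approximately 1.645 (one-sided 5%), 1.960 (two-sided 5%, i.e. 2.5% per tail) and 0.842 (80% power), so\n",
    "\n",
    "\\begin{align*}\n",
    "\\text{one-sided: } (r+1)(q_\\alpha + q_{1-\\beta})^2 &\\approx 2 \\times (1.645 + 0.842)^2 \\approx 12.4 \\\\\n",
    "\\text{two-sided: } (r+1)(q_{\\alpha/2} + q_{1-\\beta})^2 &\\approx 2 \\times (1.960 + 0.842)^2 \\approx 15.7 \\\\\n",
    "\\end{align*}\n"
   ]
  },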
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "# one sided: treatment better than control\n",
    "# returns the treatment sample size\n",
    "def sample_size_mean_1side(sig, power, pop_var, lift, allocation_ratio=1):\n",
    "    q_alpha = norm.isf(sig)\n",
    "    q_beta = norm.isf(1-power)\n",
    "    n = (allocation_ratio + 1) * (q_alpha + q_beta) ** 2 * pop_var / lift ** 2\n",
    "    return int(np.ceil(n))\n",
    "\n",
    "#two-sided: both better and worse\n",
    "# returns the treatment sample size\n",
    "def sample_size_mean_2side(sig, power, pop_var, lift, allocation_ratio=1):\n",
    "    return sample_size_mean_1side(sig/2, power, pop_var, lift, allocation_ratio)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example usage\n",
    "We want to run an experiment to measure the effect of a new hero image on revenue per visit. We know from historical data, revenue per visit is 10 dollars with a variance of 100, and we want to be able to detect an increase of at least 5% (or 0.5 dollars) with a confidence threshold of 5% and power of 80%, what is the sample size required for this experiment?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4947"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_size_mean_1side(0.05, 0.8, 100, 0.5, allocation_ratio=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6280"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_size_mean_2side(0.05, 0.8, 100, 0.5, allocation_ratio=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample size for rate evaluation metric, e.g. click through rate\n",
    "\n",
    "### Formula\n",
    "The formula for sample size required under a one-sided hypothesis is given by\n",
    "\n",
    "\\begin{align*}\n",
    "\\mathbf{N} &= \\Bigl[r(p+\\Delta p)[1-(p+\\Delta p)] + p(1-p)\\Bigr] \\Bigl(\\frac{q_\\alpha + q_{1-\\beta}}{\\Delta p} \\Bigr)^2 \\\\\n",
    "\\\\\n",
    "\\text{Where}& \\\\\n",
    "\\mathbf{N} &= \\text{the sample size required for the treatment group} \\\\\n",
    "r &= \\frac{N_{treatment}}{N_{control}} \\text{ traffic allocation ratio, between (0,1) } \\\\\n",
    "p &= \\text{estimate of the baseline rate in the control population} \\\\\n",
    "\\Delta p &= \\text{desired miniumum detectable difference in the evaluation metric between treatment and control} \\\\\n",
    "\\alpha &= \\text{confidence level threshold or critical p-value, typically 5%} \\\\\n",
    "\\beta &= \\text{power of the test, typically 80%} \\\\\n",
    "q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n",
    "\\\\\n",
    "\\text{For common}&\\text{ usecases of 5% critical p-value, 80% power, and equal traffic allocation, this simplifies to }\\\\\n",
    "\\\\\n",
    "\\mathbf{N} &\\approx 12.4 \\frac{p(1-p)}{\\Delta p^2} \\\\\n",
    "\\\\\n",
    "\\text{Formula for}&\\text{ a two-sided hypothesis is very similar with a small modification on }\\alpha\\\\\n",
    "\\\\\n",
    "\\mathbf{N} &= \\Bigl[r(p+\\Delta p)[1-(p+\\Delta p)] + p(1-p)\\Bigr] \\Bigl(\\frac{q_{\\alpha/2} + q_{1-\\beta}}{\\Delta p} \\Bigr)^2 \\\\\n",
    "\\end{align*}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "# one sided - treatment better than control\n",
    "# returns the treatment sample size\n",
    "def sample_size_proportion_1side(sig, power, base_rate, abs_improvement, allocation_ratio=1):\n",
    "    z_alpha = norm.isf(sig)\n",
    "    z_beta = norm.isf(1-power)\n",
    "    improve_rate = base_rate + abs_improvement\n",
    "    n = (z_alpha + z_beta) ** 2 * (allocation_ratio*improve_rate*(1-improve_rate) + base_rate*(1-base_rate)) / (abs_improvement) ** 2\n",
    "    return int(np.ceil(n))\n",
    "\n",
    "\n",
    "#both better and worse\n",
    "# returns the treatment sample size\n",
    "def sample_size_proportion_2side(sig, power, base_rate, abs_improvement, allocation_ratio=1):\n",
    "    return sample_size_proportion_1side(sig/2, power, base_rate, abs_improvement, allocation_ratio)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example usage\n",
    "We want to run an experiment to measure the effect of a new hero image on sign up rate. We know from historical data, sign up rate is about 10%, and we want to be able to detect an increase of at least 5% (0.5% in absolute terms) with a confidence threshold of 5% and power of 80%, what is the sample size required for this experiment?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "45498"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_size_proportion_1side(0.05, 0.8, 0.1, 0.005, allocation_ratio=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "57760"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_size_proportion_2side(0.05, 0.8, 0.1, 0.005, allocation_ratio=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analysis procedure for A/B tests\n",
    "\n",
    "## Analysis for continuous evaluation metric, e.g. revenue per visit\n",
    "\n",
    "Use the t-test for comparing continuous evaluation metrics across groups, assuming one-sided hypothesis\n",
    "\n",
    "\\begin{align*}\n",
    "S_p^2 &= \\frac{S_1^2}{N_1} + \\frac{S_2^2}{N_2} \\\\\n",
    "t &= \\frac{\\Delta\\mu}{\\sqrt{S_p^2}} \\\\\n",
    "df &= \\frac{(S_p^2)^2}{\\frac{\\frac{S_1^2}{N_1}}{N_1-1} + \\frac{\\frac{S_2^2}{N_2}}{N_2-1}}\\\\\n",
    "p-value &= 1-cdf(t, df) \\\\\n",
    "\\\\\n",
    "\\text{Reject null} &\\text{ hypothesis if the p-value is below the desired confidence level, typically 5%}\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Where}& \\\\\n",
    "S_x^2 &= \\text{Sample variance of the treatment/control groups} \\\\\n",
    "N_x &= \\text{Sample size of the treatment/control groups} \\\\\n",
    "\\Delta\\mu &= \\text{evaluation metric of the treatment group - evaluation metric of the control group} \\\\\n",
    "df &= \\text{degrees of freedom of the experiment} \\\\\n",
    "cdf &= \\text{cumulative distribution function of the t-distribution}\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Formula for}&\\text{ a two-sided hypothesis is very similar with a small modification on p-value}\\\\\n",
    "\\\\\n",
    "p-value &= 2(1-cdf(|t|, df)) \\\\\n",
    "\\\\\n",
    "\\text{If the confidence}&\\text{ interval around the difference of evaluation metric between treatment and control is of interest}\\\\\n",
    "\\\\\n",
    "ci &= \\bigl(\\Delta\\mu - q_{(1-conf)/2}\\sqrt{S_p^2}, \\enspace \\Delta\\mu + q_{(1-conf)/2}\\sqrt{S_p^2}) \\\\\n",
    "\\\\\n",
    "\\text{Where}& \\\\\n",
    "conf &= \\text{the confidence level of the confidence interval, typically 95%} \\\\\n",
    "q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n",
    "\\end{align*}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mu1 > mu2\n",
    "# does not assume equal variance between the 2 samples\n",
    "def mean_test_1side(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):\n",
    "    diff = mu_treat - mu_control\n",
    "    s = (var_treat/n_treat + var_control/n_control) ** 0.5\n",
    "    t_stat = diff / s\n",
    "    df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n",
    "    prob = 1 - t.cdf(t_stat, df)\n",
    "    return prob\n",
    "\n",
    "\n",
    "# mu1 != mu2\n",
    "# does not assume equal variance between the 2 samples\n",
    "def mean_test_2side(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):\n",
    "    diff = mu_treat - mu_control\n",
    "    s = (var_treat/n_treat + var_control/n_control) ** 0.5\n",
    "    t_stat = diff / s\n",
    "    df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n",
    "    prob = 1 - t.cdf(abs(t_stat), df)\n",
    "    return 2*prob\n",
    "\n",
    "# confidence level\n",
    "def mean_test_ci(var_treat, var_control, mu_treat, mu_control, n_treat, n_control, conf=0.95):\n",
    "    diff = mu_treat - mu_control\n",
    "    s = (var_treat/n_treat + var_control/n_control) ** 0.5\n",
    "    df = (var_treat/n_treat + var_control/n_control) ** 2 / ((var_treat/n_treat) ** 2 / (n_treat-1) + (var_control/n_control) ** 2 / (n_control-1))\n",
    "    t_crit = t.isf((1-conf)/2, df)\n",
    "    return (diff - t_crit*s, diff + t_crit*s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example usage\n",
    "We ran an experiment to measure the effect of a new hero image on revenue per visit. For the control group we saw revenue per visit of 10 dollars with a variance of 100 across 5500 visits, and for the treatment group we saw revenue per visit of 10.5 dollars with a variance of 110 across 5000 visits, can we reject the null hypothesis at the confidence level of 5%? And what's the confidence level around the improvement?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p-value of the one-side test is 0.6%\n"
     ]
    }
   ],
   "source": [
    "print('p-value of the one-sided test is {:0.1%}'.format(mean_test_1side(110, 100, 10.5, 10, 5000, 5500)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p-value of the two-side test is 1.3%\n"
     ]
    }
   ],
   "source": [
    "print('p-value of the two-sided test is {:0.1%}'.format(mean_test_2side(110, 100, 10.5, 10, 5000, 5500)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The 95% confidence interval of improve is between (0.1 dollars, 0.9 dollars)\n"
     ]
    }
   ],
   "source": [
    "print('The 95% confidence interval of the improvement is between ({:0.1f} dollars, {:0.1f} dollars)'.format(*mean_test_ci(110, 100, 10.5, 10, 5000, 5500)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Treatment beats control at a confidence level of 5% (0.6% < 5%)"
   ]
  },
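  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the intermediate quantities behind this result, computed by hand following the steps in `mean_test_1side` with the treatment group as sample 1 and the control group as sample 2 (values rounded):\n",
    "\n",
    "\\begin{align*}\n",
    "S_p^2 &= \\frac{110}{5000} + \\frac{100}{5500} \\approx 0.0402 \\\\\n",
    "t &= \\frac{10.5 - 10}{\\sqrt{0.0402}} \\approx 2.49 \\\\\n",
    "df &= \\frac{(0.0402)^2}{\\frac{(0.022)^2}{4999} + \\frac{(0.0182)^2}{5499}} \\approx 1.03 \\times 10^4 \\\\\n",
    "\\end{align*}\n"
   ]
  },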
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis for rate evaluation metric, e.g. click through rate\n",
    "\n",
    "Use the z-test for comparing rate evaluation metrics across groups, assuming one-sided hypothesis\n",
    "\n",
    "\\begin{align*}\n",
    "p_{pool} &= \\frac{p_1N_1 + p_2N_2}{N_1 + N_2} \\\\\n",
    "SE &= \\sqrt{p_{pool}(1-p_{pool})(\\frac{1}{N_1} + \\frac{1}{N_2})} \\\\\n",
    "z &= \\frac{\\Delta p}{SE} \\\\\n",
    "p-value &= 1-cdf(z) \\\\\n",
    "\\\\\n",
    "\\text{Reject null} &\\text{ hypothesis if the p-value is below the desired confidence level, typically 5%}\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Where}& \\\\\n",
    "p_x &= \\text{evaluation metric rate of the treatment/control groups} \\\\\n",
    "N_x &= \\text{Sample size of the treatment/control groups} \\\\\n",
    "\\Delta p &= \\text{evaluation metric of the treatment group - evaluation metric of the control group} \\\\\n",
    "cdf &= \\text{cumulative distribution function of the standard normal distribution}\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Formula for}&\\text{ a two-sided hypothesis is very similar with a small modification on p-value}\\\\\n",
    "\\\\\n",
    "p-value &= 2(1-cdf(|z|)) \\\\\n",
    "\\\\\n",
    "\\text{If the confidence}&\\text{ interval around the difference of evaluation metric between treatment and control is of interest}\\\\\n",
    "\\\\\n",
    "ci &= \\bigl(\\Delta p - q_{(1-conf)/2}SE, \\enspace \\Delta p + q_{(1-conf)/2}SE \\bigr) \\\\\n",
    "\\\\\n",
    "\\text{Where}& \\\\\n",
    "conf &= \\text{the confidence level of the confidence interval, typically 95%} \\\\\n",
    "q_x &= \\text{x level quantile of the standard normal distribution} \\\\\n",
    "\\end{align*}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# p1 > p2\n",
    "def proportion_test_1side(p_treat, p_control, n_treat, n_control):\n",
    "    pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n",
    "    standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
    "    z = (p_treat - p_control) / standard_error\n",
    "    prob = 1 - norm.cdf(z)\n",
    "    return prob\n",
    "\n",
    "# p1 != p2\n",
    "def proportion_test_2side(p_treat, p_control, n_treat, n_control):\n",
    "    pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n",
    "    standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
    "    z = (p_treat - p_control) / standard_error\n",
    "    prob = 1 - norm.cdf(abs(z))\n",
    "    return 2*prob\n",
    "\n",
    "def proportion_test_ci(p_treat, p_control, n_treat, n_control, conf=0.95):\n",
    "    pooled_p = (p_treat*n_treat+p_control*n_control) / (n_treat+n_control)\n",
    "    standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
    "    diff = p_treat-p_control\n",
    "    z_crit = norm.isf((1-conf)/2)\n",
    "    return (diff - z_crit*standard_error, diff + z_crit*standard_error)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example usage\n",
    "We ran an experiment to measure the effect of a new hero image on signup rate. For the control group we saw signup rate of 10% across 5500 visits, and for the treatment group we saw signup rate of 10.5% across 5000 visits, can we reject the null hypothesis at the confidence level of 5%? And what's the confidence level around the improvement?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p-value of the one-sided test is 19.9%\n"
     ]
    }
   ],
   "source": [
    "print('p-value of the one-sided test is {:0.1%}'.format(proportion_test_1side(0.105, 0.1, 5000, 5500)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p-value of the two-sided test is 39.9%\n"
     ]
    }
   ],
   "source": [
    "print('p-value of the two-sided test is {:0.1%}'.format(proportion_test_2side(0.105, 0.1, 5000, 5500)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The 95% confidence interval of the improvement is between (-0.7%, 1.7%)\n"
     ]
    }
   ],
   "source": [
    "print('The 95% confidence interval of the improvement is between ({:0.1%}, {:0.1%})'.format(*proportion_test_ci(0.105, 0.1, 5000, 5500)))"
   ]
  },
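  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the intermediate quantities behind these results, computed by hand following the steps in `proportion_test_1side` with the treatment group as sample 1 and the control group as sample 2 (values rounded):\n",
    "\n",
    "\\begin{align*}\n",
    "p_{pool} &= \\frac{0.105 \\times 5000 + 0.1 \\times 5500}{5000 + 5500} \\approx 0.1024 \\\\\n",
    "SE &= \\sqrt{0.1024 \\times (1 - 0.1024) \\times (\\frac{1}{5000} + \\frac{1}{5500})} \\approx 0.00592 \\\\\n",
    "z &= \\frac{0.105 - 0.1}{0.00592} \\approx 0.84 \\\\\n",
    "\\end{align*}\n"
   ]
  },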
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We cannot conclude that treatment is better than control at a confidence level of 5% (19.9% > 5%)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
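
The notebook's hand-written Welch test can be cross-checked against SciPy's built-in `scipy.stats.ttest_ind_from_stats`. The sketch below is not part of the original notebook: `welch_p_two_sided` is a hypothetical helper that mirrors the steps of `mean_test_2side`, and it assumes the variances in the revenue example are sample variances (the `n - 1` kind), which is what SciPy expects when given standard deviations.

```python
from scipy.stats import t, ttest_ind_from_stats


def welch_p_two_sided(var_treat, var_control, mu_treat, mu_control, n_treat, n_control):
    """Two-sided Welch t-test p-value from summary statistics (hypothetical helper)."""
    se2_treat = var_treat / n_treat        # squared standard error of the treatment mean
    se2_control = var_control / n_control  # squared standard error of the control mean
    t_stat = (mu_treat - mu_control) / (se2_treat + se2_control) ** 0.5
    # Welch-Satterthwaite degrees of freedom
    df = (se2_treat + se2_control) ** 2 / (
        se2_treat ** 2 / (n_treat - 1) + se2_control ** 2 / (n_control - 1)
    )
    # t.sf is the survival function, i.e. 1 - cdf
    return 2 * t.sf(abs(t_stat), df)


# Same numbers as the revenue-per-visit example in the notebook.
manual_p = welch_p_two_sided(110, 100, 10.5, 10, 5000, 5500)

# SciPy takes standard deviations, so pass the square roots of the variances;
# equal_var=False selects Welch's test.
result = ttest_ind_from_stats(
    mean1=10.5, std1=110 ** 0.5, nobs1=5000,
    mean2=10.0, std2=100 ** 0.5, nobs2=5500,
    equal_var=False,
)

print(f"manual: {manual_p:.4f}, scipy: {result.pvalue:.4f}")
```

If the two p-values agree, the manual degrees-of-freedom calculation matches SciPy's Welch implementation for this example.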
	" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
	" z = (p_treat - p_control) / standard_error\n",
	" prob = 1 - norm.cdf(abs(z))\n",
	" return 2*prob\n",
	"\n",
	"def proportion_test_ci(p_treat, p_control, n_treat, n_control, conf=0.95):\n",
	" pooled_p = (p_treatn_treat+p_controln_control) / (n_treat+n_control)\n",
	" standard_error = (pooled_p * (1 - pooled_p) * (1/n_treat + 1/n_control)) ** 0.5\n",
	" diff = p_treat-p_control\n",
	" z_crit = norm.isf((1-conf)/2)\n",
	" return (diff - z_critstandard_error, diff + z_critstandard_error)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Example usage\n",
	"We ran an experiment to measure the effect of a new hero image on signup rate. For the control group we saw signup rate of 10% across 5500 visits, and for the treatment group we saw signup rate of 10.5% across 5000 visits, can we reject the null hypothesis at the confidence level of 5%? And what's the confidence level around the improvement?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 63,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"p-value of the one-sided test is 19.9%\n"
	]
	}
	],
	"source": [
	"print('p-value of the one-sided test is {:0.1%}'.format(proportion_test_1side(0.105, 0.1, 5000, 5500)))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 64,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"p-value of the two-sided test is 39.9%\n"
	]
	}
	],
	"source": [
	"print('p-value of the two-sided test is {:0.1%}'.format(proportion_test_2side(0.105, 0.1, 5000, 5500)))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 66,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"The 95% confidence interval of the improvement is between (-0.7%, 1.7%)\n"
	]
	}
	],
	"source": [
	"print('The 95% confidence interval of the improvement is between ({:0.1%}, {:0.1%})'.format(*proportion_test_ci(0.105, 0.1, 5000, 5500)))"
	]
	},
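	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As a rough hand check of the three outputs above, the formulas from this section can be evaluated directly with the example numbers, taking treatment as group 1 and control as group 2. All values are rounded, and 1.96 is used as an approximation of the 97.5% quantile of the standard normal distribution.\n",
	"\n",
	"\\begin{align*}\n",
	"p_{pool} &= \\frac{0.105 \\times 5000 + 0.1 \\times 5500}{5000 + 5500} = \\frac{1075}{10500} \\approx 0.1024 \\\\\n",
	"SE &\\approx \\sqrt{0.1024 \\times 0.8976 \\times \\Bigl(\\frac{1}{5000} + \\frac{1}{5500}\\Bigr)} \\approx 0.00592 \\\\\n",
	"z &\\approx \\frac{0.105 - 0.1}{0.00592} \\approx 0.844 \\\\\n",
	"\\text{one-sided p-value} &= 1 - cdf(0.844) \\approx 0.199 \\\\\n",
	"\\text{two-sided p-value} &= 2(1 - cdf(0.844)) \\approx 0.399 \\\\\n",
	"ci &\\approx \\bigl(0.005 - 1.96 \\times 0.00592, \\enspace 0.005 + 1.96 \\times 0.00592 \\bigr) \\approx (-0.0066, \\enspace 0.0166)\n",
	"\\end{align*}"
	]
	},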
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### We cannot conclude that treatment is better than control at a confidence level of 5% (19.9% > 5%)"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}