@petermchale
Last active December 2, 2023 15:46

A derivation of the bias-variance decomposition of test error in machine learning.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bias and Variance in Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## True model \n",
"Suppose that a target variable $Y$ and a feature variable $X$ are related via $Y = f(X) + \\epsilon$, where $X$ and $\\epsilon$ are independent random variables and the expected value of $\\epsilon$ is zero, $E[\\epsilon] = 0$. \n",
"\n",
"We can use this mathematical relationship to generate a data set $\\cal D$. Because data sets are always of finite size, we may think of $\\cal D$ as a random variable, the realizations of which take the form\n",
"$d = \\{ (x_1,y_1), \\ldots , (x_m,y_m) \\}$, where $x_i$ and $y_i$ are realizations of $X$ and $Y$. "
]
},
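{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this setup (not part of the derivation), the cell below generates one realization $d$ of $\\cal D$, assuming, purely for illustration, that $f(x) = \\sin(2 \\pi x)$ and that $\\epsilon$ is Gaussian with mean zero."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"def f(x):\n",
"    # the true (but, in practice, unknown) function; an illustrative assumption\n",
"    return np.sin(2 * np.pi * x)\n",
"\n",
"def sample_dataset(m, noise_sd=0.3):\n",
"    # draw one realization d = {(x_1, y_1), ..., (x_m, y_m)} of D\n",
"    x = rng.uniform(0.0, 1.0, size=m)\n",
"    epsilon = rng.normal(0.0, noise_sd, size=m)  # E[epsilon] = 0\n",
"    return x, f(x) + epsilon\n",
"\n",
"x_train, y_train = sample_dataset(m=20)"
]
},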
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimated model\n",
"\n",
"Machine learning uses a particular realization $d$ of $\\cal D$ to train an estimate of the function $f(x)$, called the hypothesis $h_d(x)$. The subscript $d$ reminds us that the hypothesis is a random function that varies over training data sets. \n",
"\n",
"\n"
]
},
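{
"cell_type": "markdown",
"metadata": {},
"source": [
"Continuing the sketch above, the cell below trains a hypothesis $h_d$ on one realization $d$. The model class, a least-squares polynomial fit of fixed degree, is an assumption made only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def train_hypothesis(x, y, degree=3):\n",
"    # fit h_d to the training data by least squares;\n",
"    # the polynomial model class is an arbitrary illustrative choice\n",
"    coefficients = np.polyfit(x, y, deg=degree)\n",
"    return lambda x_new: np.polyval(coefficients, x_new)\n",
"\n",
"h_d = train_hypothesis(x_train, y_train)"
]
},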
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test error of the estimated model\n",
"\n",
"Having learned an hypothesis for a particular training set $d$, we next evaluate the error made in predicting the value of $y$ on an unseen test value $x$. In linear regression, that test error is quantified by taking a test data set (also drawn from the distribution of $\\cal D$) and computing the average of $(Y - h_d)^2$ over the data set. If the size of the test data set is large enough, this average is approximated by $E_{X,\\epsilon} [ (Y(X,\\epsilon) - h_{d}(X))^2 ]$. As the training data set $d$ varies, so does the test error; in other words, test error is a random variable, the average of which over all training sets is given by \n",
"\n",
"\\begin{equation*}\n",
"\\text{expected test error} = E_{\\cal D} \\left[ E_{X,\\epsilon} \\left[ (Y(X,\\epsilon) - h_{\\cal D}(X))^2 \\right] \\right].\n",
"\\end{equation*}\n",
"\n",
"In the following sections, I will show how this error can be decomposed into three parts: a *bias* that quantifies how much (the average of) the hypothesis deviates from $f$; a *variance* term that quantifies how much the hypothesis varies among training data sets; and an *irreducible error* that describes the fact that one's ability to predict is always limited by the noise $\\epsilon$. \n",
"\n",
"\n"
]
},
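{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expected test error can be estimated by Monte Carlo: compute the test error of $h_d$ for many training sets drawn from the distribution of $\\cal D$, and average. The cell below sketches this procedure; the sample sizes are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def test_error(h, num_test_points=10000, noise_sd=0.3):\n",
"    # approximate E_{X, epsilon}[(Y - h(X))^2] with a large test set\n",
"    x_test, y_test = sample_dataset(num_test_points, noise_sd)\n",
"    return np.mean((y_test - h(x_test)) ** 2)\n",
"\n",
"def expected_test_error(num_training_sets=200, m=20):\n",
"    # average the test error over many realizations of D\n",
"    errors = []\n",
"    for _ in range(num_training_sets):\n",
"        x, y = sample_dataset(m)\n",
"        errors.append(test_error(train_hypothesis(x, y)))\n",
"    return np.mean(errors)\n",
"\n",
"print('expected test error (Monte Carlo):', expected_test_error())"
]
},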
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Establishing a useful order of integration \n",
"\n",
"To compute the expected test error analytically, we rewrite the expectation operators in two steps. The first step is to recognize that \n",
"$ E_{X,\\epsilon} [\\ldots] = E_X \\left[ E_\\epsilon [ \\ldots ] \\right],$\n",
"since $X$ and $E$ are independent. The second step is to use Fubini's theorem to reverse the order in which $X$ and $D$ are integrated out. The final result is that the expected test error is given by \n",
"\n",
"$$ \n",
"\\text{expected test error} = \n",
"E_X \\left[ E_{\\cal D} \\left[ E_\\epsilon \\left[\n",
"(Y - h)^2 \n",
"\\right] \\right] \\right], $$\n",
"\n",
"where I have dropped the dependence of $Y$ and $h$ on $X$, $\\epsilon$ and $\\cal D$ in the interests of clarity.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reducible and irreducible error\n",
"\n",
"We fix values of $X$ and $\\cal D$ (and therefore $f$ and $h$) and compute the inner-most integral in the expected test error:\n",
"\n",
"\\begin{align*}\n",
"E_\\epsilon \\left[ (Y - h)^2 \\right] \n",
"& = E_\\epsilon \\left[ (f + \\epsilon - h)^2 \\right]\\\\\n",
"& = E_\\epsilon \\left[ (f-h)^2 + \\epsilon^2 + 2\\epsilon (f-h) \\right]\\\\\n",
"& = (f-h)^2 + E_\\epsilon\\left[ \\epsilon^2 \\right] + 0 \\\\\n",
"& = (f-h)^2 + Var_\\epsilon \\left[ \\epsilon \\right].\n",
"\\end{align*}\n",
"\n",
"The last term remains unaltered by subsequent averaging over $X$ and $D$. It represents the irreducible error contribution to the expected test error.\n",
"\n",
"The average of the first term, \n",
"$E_X \\left[ E_{\\cal D} \\left[ \\left( f-h\\right)^2 \\right] \\right]$,\n",
"is sometimes called the reducible error.\n"
]
},
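{
"cell_type": "markdown",
"metadata": {},
"source": [
"This identity is easy to check numerically: fix $X$ and $\\cal D$ (so that $f$ and $h$ are fixed numbers) and average over draws of $\\epsilon$ alone. A sketch, using the illustrative functions defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"noise_sd = 0.3\n",
"x0 = 0.5  # fix a value of X\n",
"f0, h0 = f(x0), h_d(x0)  # with X and D fixed, f and h are fixed numbers\n",
"epsilon = rng.normal(0.0, noise_sd, size=100000)\n",
"\n",
"lhs = np.mean((f0 + epsilon - h0) ** 2)  # E_epsilon[(Y - h)^2]\n",
"rhs = (f0 - h0) ** 2 + noise_sd ** 2  # (f - h)^2 + Var_epsilon[epsilon]\n",
"print(lhs, rhs)"
]
},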
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decomposing the reducible error into 'bias' and 'variance' \n",
"\n",
"We relax our constraint that $\\cal D$ is fixed (but keep the constraint that $X$ is fixed) and compute the innermost integral in the reducible error:\n",
"\n",
"\\begin{align*}\n",
"E_{\\cal D} \\left[ (f-h)^2 \\right] \n",
"& = E_{\\cal D} \\left[ f^2 + h^2 - 2fh \\right]\\\\\n",
"& = f^2 + E_{\\cal D} \\left[ h^2 \\right] - 2f E_{\\cal D} \\left[h\\right]\\\\\n",
"\\end{align*}\n",
"\n",
"Adding and subtracting $E_{\\cal D} \\left[ h \\right]^2$, and rearranging terms, we may write the right-hand side above as \n",
"\n",
"$$\n",
"\\left( f - E_{\\cal D} \\left[ h \\right] \\right)^2 + Var_{\\cal D} \\left[ h \\right].\n",
"$$\n",
"\n",
"Averaging over $X$, and restoring the irreducible error, yields finally: \n",
"\n",
"\n",
"$$\n",
"\\boxed{\n",
"\\text{expected test error} = \n",
"E_X \\left[ \\left( f - E_{\\cal D} \\left[ h \\right] \\right)^2 \\right]\n",
"+ E_X \\left[ Var_{\\cal D} \\left[ h \\right] \\right] \n",
"+ Var_\\epsilon \\left[ \\epsilon \\right].\n",
"}\n",
"$$\n",
"\n",
"The first term is called the bias and the second term is called the variance. \n",
"\n",
"The variance component of the expected test error is a consequence of the finite size of the training data sets. In the limit that training sets contain an infinite number of data points, there are no fluctuations in $h$ among the training sets and the variance term vanishes. Put another way, when the size of the training set is large, the expected test error is expected to be solely due to bias (assuming the irreducible error is negligible). "
]
},
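{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a numerical check of the boxed decomposition (a sketch under the same illustrative assumptions as above), the cell below estimates the bias and variance terms on a grid of $x$ values and compares their sum, plus the irreducible error, with the direct Monte Carlo estimate of the expected test error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_grid = np.linspace(0.0, 1.0, 1000)  # stands in for the average over X\n",
"\n",
"# predictions of h_D on the grid, one row per realization of D\n",
"predictions = np.array([\n",
"    train_hypothesis(*sample_dataset(m=20))(x_grid)\n",
"    for _ in range(200)\n",
"])\n",
"\n",
"bias_squared = np.mean((f(x_grid) - predictions.mean(axis=0)) ** 2)\n",
"variance = np.mean(predictions.var(axis=0))\n",
"irreducible = noise_sd ** 2  # Var_epsilon[epsilon]\n",
"\n",
"print('bias^2 + variance + irreducible error:', bias_squared + variance + irreducible)\n",
"print('expected test error (Monte Carlo):', expected_test_error())"
]
},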
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## More info \n",
"An excellent exposition of these concepts and more can be found [here](https://www.youtube.com/watch?v=zrEyxfl2-a8)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@sdangi

sdangi commented Dec 14, 2017

Thanks for the helpful gist Peter! One potential correction -- instead of:
Adding and subtracting $E_{\cal D} \left[ h^2 \right]$
I believe it should be:
Adding and subtracting $E_{\cal D} \left[ h \right]^2$

@petermchale

@sdangi, I don't see why.

@bradyneal

Nice gist! Regarding the term that should be added and subtracted, I think you accidentally typed up the square inside of the expectation rather than outside of the expectation. This is necessary for the "rearranging terms, we may write the right-hand side above as" math to carry through.

@petermchale

petermchale commented Apr 5, 2018

@bradyneal, I understand what @sdangi meant. My issue is that I don't see why the suggested change is correct. I could be wrong, though. Happy to look over the algebra if you or @sdangi are interested in providing it. Best, Peter

@bradyneal

[image attachment: "bias variance peter", showing the algebra]

@petermchale

@bradyneal @sdangi : you are correct! Thanks! I've updated the jupyter notebook accordingly.

@rafgonsi

In the section 'Reducible and irreducible error', why is $E_\epsilon[2\epsilon (f - h)]$ equal to 0?

I agree that $E_\epsilon[\epsilon f] = 0$, but why is $E_\epsilon[\epsilon h] = 0$? The hypothesis $h$ is learned from some training set $(X, Y)$, and $Y$ depends on $\epsilon$, so the parameters of the learned model depend on $\epsilon$ too. Thus one cannot write $E_\epsilon[\epsilon h] = h E_\epsilon[\epsilon]$.

Could you explain this?

@petermchale

@rafgonsi : ... In performing the triple integral over $X$, $\cal{D}$ and $\epsilon$, I fix two variables ($X$ and $\cal{D}$) and vary the third ($\epsilon$). Since $X$ and $\cal{D}$ are fixed, so are $f$ and $h$, which may therefore be "pulled out" of the innermost integral over $\epsilon$.

@petermchale

An entirely analogous result to that outlined in this gist is obtained when one computes the error of an estimator of a parameter. Namely, the mean square error of any estimator is equal to its variance plus (the square of) its bias. See section 7.7 at https://www.sciencedirect.com/science/article/pii/B9780123948113500071

@petermchale

petermchale commented Dec 2, 2023

In active machine learning, we assume that the learner is unbiased, and focus on algorithms that minimize the learner's variance, as shown in Cohn et al (1996): https://arxiv.org/abs/cs/9603104 (Eq. 4 is difficult to interpret precisely, though, in the absence of further reading).
