@petermchale
Last active December 2, 2023 15:46

A derivation of the bias-variance decomposition of test error in machine learning.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bias and Variance in Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## True model \n",
"Suppose that a target variable $Y$ and a feature variable $X$ are related via $Y = f(X) + \\epsilon$, where $X$ and $\\epsilon$ are independent random variables and the expected value of $\\epsilon$ is zero, $E[\\epsilon] = 0$. \n",
"\n",
"We can use this mathematical relationship to generate a data set $\\cal D$. Because data sets are always of finite size, we may think of $\\cal D$ as a random variable, the realizations of which take the form\n",
"$d = \\{ (x_1,y_1), \\ldots , (x_m,y_m) \\}$, where $x_i$ and $y_i$ are realizations of $X$ and $Y$. "
]
},
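{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this setup (not part of the derivation), the cell below generates one realization $d$ of $\\cal D$, assuming, purely for illustration, that $f(x) = \\sin(2 \\pi x)$ and that $\\epsilon$ is Gaussian with mean zero."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"def f(x):\n",
"    # the true (but, in practice, unknown) function; an illustrative assumption\n",
"    return np.sin(2 * np.pi * x)\n",
"\n",
"def sample_dataset(m, noise_sd=0.3):\n",
"    # draw one realization d = {(x_1, y_1), ..., (x_m, y_m)} of D\n",
"    x = rng.uniform(0.0, 1.0, size=m)\n",
"    epsilon = rng.normal(0.0, noise_sd, size=m)  # E[epsilon] = 0\n",
"    return x, f(x) + epsilon\n",
"\n",
"x_train, y_train = sample_dataset(m=20)"
]
},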
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimated model\n",
"\n",
"Machine learning uses a particular realization $d$ of $\\cal D$ to train an estimate of the function $f(x)$, called the hypothesis $h_d(x)$. The subscript $d$ reminds us that the hypothesis is a random function that varies over training data sets. \n",
"\n",
"\n"
]
},
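{
"cell_type": "markdown",
"metadata": {},
"source": [
"Continuing the sketch above, the cell below trains a hypothesis $h_d$ on one realization $d$. The model class, a least-squares polynomial fit of fixed degree, is an assumption made only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def train_hypothesis(x, y, degree=3):\n",
"    # fit h_d to the training data by least squares;\n",
"    # the polynomial model class is an arbitrary illustrative choice\n",
"    coefficients = np.polyfit(x, y, deg=degree)\n",
"    return lambda x_new: np.polyval(coefficients, x_new)\n",
"\n",
"h_d = train_hypothesis(x_train, y_train)"
]
},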
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test error of the estimated model\n",
"\n",
"Having learned an hypothesis for a particular training set $d$, we next evaluate the error made in predicting the value of $y$ on an unseen test value $x$. In linear regression, that test error is quantified by taking a test data set (also drawn from the distribution of $\\cal D$) and computing the average of $(Y - h_d)^2$ over the data set. If the size of the test data set is large enough, this average is approximated by $E_{X,\\epsilon} [ (Y(X,\\epsilon) - h_{d}(X))^2 ]$. As the training data set $d$ varies, so does the test error; in other words, test error is a random variable, the average of which over all training sets is given by \n",
"\n",
"\\begin{equation*}\n",
"\\text{expected test error} = E_{\\cal D} \\left[ E_{X,\\epsilon} \\left[ (Y(X,\\epsilon) - h_{\\cal D}(X))^2 \\right] \\right].\n",
"\\end{equation*}\n",
"\n",
"In the following sections, I will show how this error can be decomposed into three parts: a *bias* that quantifies how much (the average of) the hypothesis deviates from $f$; a *variance* term that quantifies how much the hypothesis varies among training data sets; and an *irreducible error* that describes the fact that one's ability to predict is always limited by the noise $\\epsilon$. \n",
"\n",
"\n"
]
},
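{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expected test error can be estimated by Monte Carlo: compute the test error of $h_d$ for many training sets drawn from the distribution of $\\cal D$, and average. The cell below sketches this procedure; the sample sizes are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def test_error(h, num_test_points=10000, noise_sd=0.3):\n",
"    # approximate E_{X, epsilon}[(Y - h(X))^2] with a large test set\n",
"    x_test, y_test = sample_dataset(num_test_points, noise_sd)\n",
"    return np.mean((y_test - h(x_test)) ** 2)\n",
"\n",
"def expected_test_error(num_training_sets=200, m=20):\n",
"    # average the test error over many realizations of D\n",
"    errors = []\n",
"    for _ in range(num_training_sets):\n",
"        x, y = sample_dataset(m)\n",
"        errors.append(test_error(train_hypothesis(x, y)))\n",
"    return np.mean(errors)\n",
"\n",
"print('expected test error (Monte Carlo):', expected_test_error())"
]
},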
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Establishing a useful order of integration \n",
"\n",
"To compute the expected test error analytically, we rewrite the expectation operators in two steps. The first step is to recognize that \n",
"$ E_{X,\\epsilon} [\\ldots] = E_X \\left[ E_\\epsilon [ \\ldots ] \\right],$\n",
"since $X$ and $E$ are independent. The second step is to use Fubini's theorem to reverse the order in which $X$ and $D$ are integrated out. The final result is that the expected test error is given by \n",
"\n",
"$$ \n",
"\\text{expected test error} = \n",
"E_X \\left[ E_{\\cal D} \\left[ E_\\epsilon \\left[\n",
"(Y - h)^2 \n",
"\\right] \\right] \\right], $$\n",
"\n",
"where I have dropped the dependence of $Y$ and $h$ on $X$, $\\epsilon$ and $\\cal D$ in the interests of clarity.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reducible and irreducible error\n",
"\n",
"We fix values of $X$ and $\\cal D$ (and therefore $f$ and $h$) and compute the inner-most integral in the expected test error:\n",
"\n",
"\\begin{align*}\n",
"E_\\epsilon \\left[ (Y - h)^2 \\right] \n",
"& = E_\\epsilon \\left[ (f + \\epsilon - h)^2 \\right]\\\\\n",
"& = E_\\epsilon \\left[ (f-h)^2 + \\epsilon^2 + 2\\epsilon (f-h) \\right]\\\\\n",
"& = (f-h)^2 + E_\\epsilon\\left[ \\epsilon^2 \\right] + 0 \\\\\n",
"& = (f-h)^2 + Var_\\epsilon \\left[ \\epsilon \\right].\n",
"\\end{align*}\n",
"\n",
"The last term remains unaltered by subsequent averaging over $X$ and $D$. It represents the irreducible error contribution to the expected test error.\n",
"\n",
"The average of the first term, \n",
"$E_X \\left[ E_{\\cal D} \\left[ \\left( f-h\\right)^2 \\right] \\right]$,\n",
"is sometimes called the reducible error.\n"
]
},
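{
"cell_type": "markdown",
"metadata": {},
"source": [
"This identity is easy to check numerically: fix $X$ and $\\cal D$ (so that $f$ and $h$ are fixed numbers) and average over draws of $\\epsilon$ alone. A sketch, using the illustrative functions defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"noise_sd = 0.3\n",
"x0 = 0.5  # fix a value of X\n",
"f0, h0 = f(x0), h_d(x0)  # with X and D fixed, f and h are fixed numbers\n",
"epsilon = rng.normal(0.0, noise_sd, size=100000)\n",
"\n",
"lhs = np.mean((f0 + epsilon - h0) ** 2)  # E_epsilon[(Y - h)^2]\n",
"rhs = (f0 - h0) ** 2 + noise_sd ** 2  # (f - h)^2 + Var_epsilon[epsilon]\n",
"print(lhs, rhs)"
]
},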
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decomposing the reducible error into 'bias' and 'variance' \n",
"\n",
"We relax our constraint that $\\cal D$ is fixed (but keep the constraint that $X$ is fixed) and compute the innermost integral in the reducible error:\n",
"\n",
"\\begin{align*}\n",
"E_{\\cal D} \\left[ (f-h)^2 \\right] \n",
"& = E_{\\cal D} \\left[ f^2 + h^2 - 2fh \\right]\\\\\n",
"& = f^2 + E_{\\cal D} \\left[ h^2 \\right] - 2f E_{\\cal D} \\left[h\\right]\\\\\n",
"\\end{align*}\n",
"\n",
"Adding and subtracting $E_{\\cal D} \\left[ h \\right]^2$, and rearranging terms, we may write the right-hand side above as \n",
"\n",
"$$\n",
"\\left( f - E_{\\cal D} \\left[ h \\right] \\right)^2 + Var_{\\cal D} \\left[ h \\right].\n",
"$$\n",
"\n",
"Averaging over $X$, and restoring the irreducible error, yields finally: \n",
"\n",
"\n",
"$$\n",
"\\boxed{\n",
"\\text{expected test error} = \n",
"E_X \\left[ \\left( f - E_{\\cal D} \\left[ h \\right] \\right)^2 \\right]\n",
"+ E_X \\left[ Var_{\\cal D} \\left[ h \\right] \\right] \n",
"+ Var_\\epsilon \\left[ \\epsilon \\right].\n",
"}\n",
"$$\n",
"\n",
"The first term is called the bias and the second term is called the variance. \n",
"\n",
"The variance component of the expected test error is a consequence of the finite size of the training data sets. In the limit that training sets contain an infinite number of data points, there are no fluctuations in $h$ among the training sets and the variance term vanishes. Put another way, when the size of the training set is large, the expected test error is expected to be solely due to bias (assuming the irreducible error is negligible). "
]
},
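{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a numerical check of the boxed decomposition (a sketch under the same illustrative assumptions as above), the cell below estimates the bias and variance terms on a grid of $x$ values and compares their sum, plus the irreducible error, with the direct Monte Carlo estimate of the expected test error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_grid = np.linspace(0.0, 1.0, 1000)  # stands in for the average over X\n",
"\n",
"# predictions of h_D on the grid, one row per realization of D\n",
"predictions = np.array([\n",
"    train_hypothesis(*sample_dataset(m=20))(x_grid)\n",
"    for _ in range(200)\n",
"])\n",
"\n",
"bias_squared = np.mean((f(x_grid) - predictions.mean(axis=0)) ** 2)\n",
"variance = np.mean(predictions.var(axis=0))\n",
"irreducible = noise_sd ** 2  # Var_epsilon[epsilon]\n",
"\n",
"print('bias^2 + variance + irreducible error:', bias_squared + variance + irreducible)\n",
"print('expected test error (Monte Carlo):', expected_test_error())"
]
},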
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## More info \n",
"An excellent exposition of these concepts and more can be found [here](https://www.youtube.com/watch?v=zrEyxfl2-a8)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@sdangi

sdangi commented Dec 14, 2017

Thanks for the helpful gist Peter! One potential correction -- instead of:
Adding and subtracting $E_{\cal D} \left[ h^2 \right]$
I believe it should be:
Adding and subtracting $E_{\cal D} \left[ h \right]^2$

@petermchale

@sdangi, I don't see why.

@bradyneal

Nice gist! Regarding the term that should be added and subtracted, I think you accidentally typed up the square inside of the expectation rather than outside of the expectation. This is necessary for the "rearranging terms, we may write the right-hand side above as" math to carry through.

@petermchale

petermchale commented Apr 5, 2018

@bradyneal, I understand what @sdangi meant. My issue is that I don't see why the suggested change is correct. I could be wrong, though. Happy to look over the algebra if you or @sdangi are interested in providing it. Best, Peter

@bradyneal

[image attachment: "bias variance peter", showing the algebra]

@petermchale

@bradyneal @sdangi : you are correct! Thanks! I've updated the jupyter notebook accordingly.

@rafgonsi

In the section 'Reducible and irreducible error', why is $E_\epsilon[2\epsilon (f - h)]$ equal to 0?

I agree that $E_\epsilon[\epsilon f] = 0$, but why is $E_\epsilon[\epsilon h] = 0$? The hypothesis $h$ is learned from some training set $(X, Y)$, and $Y$ depends on $\epsilon$, so the parameters of the learned model depend on $\epsilon$ too. Thus one cannot write $E_\epsilon[\epsilon h] = h E_\epsilon[\epsilon]$.

Could you explain this?

@petermchale

@rafgonsi : ... In performing the triple integral over $X$, $\cal{D}$ and $\epsilon$, I fix two variables ($X$ and $\cal{D}$) and vary the third ($\epsilon$). Since $X$ and $\cal{D}$ are fixed, so are $f$ and $h$, which may therefore be "pulled out" of the innermost integral over $\epsilon$.

@petermchale

An entirely analogous result to that outlined in this gist is obtained when one computes the error of an estimator of a parameter. Namely, the mean square error of any estimator is equal to its variance plus (the square of) its bias. See section 7.7 at https://www.sciencedirect.com/science/article/pii/B9780123948113500071

@petermchale

petermchale commented Dec 2, 2023

In active machine learning, we assume that the learner is unbiased, and focus on algorithms that minimize the learner's variance, as shown in Cohn et al (1996): https://arxiv.org/abs/cs/9603104 (Eq. 4 is difficult to interpret precisely, though, in the absence of further reading).
