{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Probability Inequalities\n",
"These notes are based on [All of Statistics](http://www.amazon.com/All-Statistics-Statistical-Inference-Springer/dp/0387402721) by Wasserman. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can estimate the error rate of a model by finding a confidence interval for $\\hat{L}_{n}(h)$ (that is, the training error rate) using probability inequalities. We can think of this as a theoretical method that is similar in effect to cross-validation. It is also useful in the context of [empirical risk minimization (ERM)](https://en.wikipedia.org/wiki/Empirical_risk_minimization). I wrote my [Masters thesis](https://www.academia.edu/6660809/Concentration_Inequalities_and_some_applications_to_Statistical_Learning_Theory_Contents) on the applications of probability inequalities to time series data mining and machine learning, which may be of interest."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classifier problem\n",
"For simplicity we will consider a classification problem. Let $\\mathcal{H}$ be a set of classifiers, for example, all linear classifiers. ERM means choosing the classifier $\\hat{h} \\in \\mathcal{H}$ to minimize the training error $\\hat{L}_{n}(h)$, also called the **empirical risk**. Being all fancy and mathematical, we can write this as $$ \\hat{h} = \\operatorname{argmin}_{h \\in \\mathcal{H}} \\hat{L}_{n}(h) = \\operatorname{argmin}_{h \\in \\mathcal{H}} \\left(\\frac{1}{n} \\sum_{i=1}^{n} I(h(X_{i}) \\neq Y_{i})\\right)$$\n",
"Typically, $\\hat{L}_{n}(h)$ underestimates the true error rate, which is why we use **Hoeffding's inequality** to assess the underestimation. Recall that if $X_{1}, \\ldots, X_{n} \\sim \\text{Bernoulli}(p)$, then for any $\\epsilon \\gt 0$, $$\\mathbb{P}\\left(|\\hat{p} - p | \\gt \\epsilon \\right) \\leq 2e^{-2n \\epsilon^{2}} $$ \n",
"where $\\hat{p} = n^{-1} \\sum_{i=1}^{n} X_{i}$."
]
},
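{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on Hoeffding's inequality, the cell below runs a small simulation: it repeatedly draws $n$ Bernoulli samples and compares the observed frequency of $|\\hat{p} - p| \\gt \\epsilon$ with the bound $2e^{-2n\\epsilon^{2}}$. This is only a minimal sketch; the values of `n`, `p`, `eps` and the number of repetitions are illustrative choices, not taken from the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Illustrative parameters (not from the text): sample size, true p, tolerance, repetitions.\n",
"n, p, eps, trials = 100, 0.3, 0.1, 10000\n",
"\n",
"rng = np.random.RandomState(0)\n",
"# Each repetition draws n Bernoulli(p) variables; their mean is p-hat.\n",
"p_hat = rng.binomial(n, p, size=trials) / n\n",
"\n",
"empirical = np.mean(np.abs(p_hat - p) > eps)  # observed frequency of a large deviation\n",
"bound = 2 * np.exp(-2 * n * eps ** 2)         # Hoeffding bound\n",
"\n",
"print('empirical P(|p_hat - p| > eps) =', empirical)\n",
"print('Hoeffding bound                =', bound)"
]
},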
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, suppose that $\\mathcal{H} = \\lbrace h_{1}, \\ldots, h_{m} \\rbrace$ consists of finitely many classifiers. For any fixed $h$, $\\hat{L}_{n}(h)$ converges almost surely to $L(h)$ by the law of large numbers. We shall state a stronger result (you can Google the proof). \n",
"\n",
"\n",
"**Theorem** (Uniform Convergence) Assume $\\mathcal{H}$ is finite and has $m$ elements. Then,\n",
"$$ \\mathbb{P} \\left( \\max_{h \\in \\mathcal{H}} | \\hat{L}_{n}(h) - L(h) | \\gt \\epsilon \\right) \\leq 2m e^{-2 n \\epsilon^{2}}$$\n",
"This follows by applying Hoeffding's inequality to each fixed $h$ and taking a union bound over the $m$ classifiers."
]
},
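{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below illustrates the uniform convergence bound on a toy finite class of threshold classifiers $h_{t}(x) = I(x \\gt t)$. It is a rough sketch under assumptions of mine: the data-generating process, the $m$ thresholds, and the tolerance `eps` are all illustrative and not part of the text, and the true risks $L(h_{t})$ are approximated on a large held-out sample. The point is only to show that the observed maximal deviation stays below the union bound $2 m e^{-2 n \\epsilon^{2}}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(1)\n",
"\n",
"# Toy setup (illustrative, not from the text): x is uniform on [0, 1] and the\n",
"# label is I(x > 0.5), flipped with probability 0.1 to add label noise.\n",
"def make_data(size):\n",
"    x = rng.uniform(0, 1, size)\n",
"    y = (x > 0.5).astype(int)\n",
"    flip = rng.uniform(0, 1, size) < 0.1\n",
"    return x, np.where(flip, 1 - y, y)\n",
"\n",
"m, n, eps, trials = 50, 500, 0.1, 2000\n",
"thresholds = np.linspace(0, 1, m)  # the finite class: h_t(x) = I(x > t)\n",
"\n",
"# Approximate the true risk L(h_t) of each classifier on one large sample.\n",
"x_big, y_big = make_data(200000)\n",
"L = np.array([np.mean((x_big > t).astype(int) != y_big) for t in thresholds])\n",
"\n",
"# Estimate P(max_h |L_hat_n(h) - L(h)| > eps) over repeated training samples.\n",
"exceed = 0\n",
"for _ in range(trials):\n",
"    x, y = make_data(n)\n",
"    L_hat = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])\n",
"    exceed += np.max(np.abs(L_hat - L)) > eps\n",
"\n",
"print('empirical P(max deviation > eps) =', exceed / trials)\n",
"print('union bound 2 m exp(-2 n eps^2)  =', 2 * m * np.exp(-2 * n * eps ** 2))"
]
},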
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can state another theorem without proof.\n",
"\n",
"**Theorem** Let \n",
"$$ \\epsilon = \\sqrt{\\frac{2}{n} \\log \\left( \\frac{2m}{\\alpha} \\right)} $$\n",
"\n",
"Then,\n",
"$\\hat{L}_{n}\\left( \\hat{h} \\right) \\pm \\epsilon $ is a $1 - \\alpha$ confidence interval for $L(\\hat{h})$ (a small worked example is given in a code cell below).\n",
"\n",
"When $\\mathcal{H}$ is large, the confidence interval for $L(\\hat{h})$ is large. The more functions there are in $\\mathcal{H}$, the more likely it is that we have *overfit*, which we compensate for by having a larger confidence interval. \n",
"\n",
"In practice we usually use sets $\\mathcal{H}$ that are infinite, such as the set of linear classifiers or the set of linear regressions. To extend our analysis to these cases, we want to be able to say something like\n",
"\n",
"$\\mathbb{P} \\left( \\sup_{h \\in \\mathcal{H}} | \\hat{L}_{n}(h) - L(h) | \\gt \\epsilon \\right) \\leq $ something not too big.\n",
"\n",
"To do this we'll develop a generalization called VC theory. We'll cover this in the next blog post in this series. \n",
"\n",
"I hope this proves that there is some value in learning about probability theory."
]
},
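{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked example of the confidence-interval theorem above, the helper below computes $\\epsilon = \\sqrt{\\frac{2}{n} \\log \\left( \\frac{2m}{\\alpha} \\right)}$ and the resulting interval $\\hat{L}_{n}(\\hat{h}) \\pm \\epsilon$. The numbers plugged in (training error, $n$, $m$, $\\alpha$) are made up for illustration and are not from the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def erm_confidence_interval(train_error, n, m, alpha=0.05):\n",
"    \"\"\"1 - alpha confidence interval for L(h_hat), using\n",
"    epsilon = sqrt((2 / n) * log(2 m / alpha)) as in the theorem above.\"\"\"\n",
"    eps = np.sqrt((2.0 / n) * np.log(2.0 * m / alpha))\n",
"    return train_error - eps, train_error + eps\n",
"\n",
"# Illustrative numbers (not from the text): n = 1000 training points,\n",
"# a finite class of m = 100 classifiers, observed training error 0.12.\n",
"print(erm_confidence_interval(0.12, n=1000, m=100, alpha=0.05))"
]
},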
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}