{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to Expectation Maximization via Example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem\n",
"Consider an experiment with coin 1 that has a probability $\\alpha_1$ of heads, and a coin 2 that has a probability $\\alpha_2$ of heads. We draw $m$ samples as follows - for each sample, pick one of the coins at random, flip it $n$ times, and record the number of heads. If we recorded which coin we used for each sample, we have complete information and can estimate $\\alpha_1$ and $\\alpha_2$ in closed form. To be very explicit, suppose we drew 5 samples with the number of heads represented as a vector $x$, and the sequence of coins chosen was 1,2,2,1,2. If we know the sequence of coins, then it is easy to estimate -- for each coin, we can simply sum up the number of heads and divide by the number of tosses. However, if we do not know the sequence of coins, then it is harder. We will first start by writing the likelihood function. "
]
},
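{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the complete-information case concrete, here is a minimal sketch. It reuses the coin sequence 1, 2, 2, 1, 2 from above; the head counts in `heads` are hypothetical example values, not data from the experiment generated later in the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"heads = np.array([4, 7, 6, 3, 8]) # hypothetical head counts out of n tosses\n",
"coins = np.array([1, 2, 2, 1, 2]) # coin used for each sample (observed here)\n",
"n_demo = 10 # tosses per sample\n",
"\n",
"for j in (1, 2):\n",
"    mask = coins == j\n",
"    # closed-form MLE: total heads / total tosses for coin j\n",
"    print j, heads[mask].sum() / float(n_demo * mask.sum())"
]
},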
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Likelihood function\n",
"We would like to use the observed values of $x$'s to learn $\\theta = (\\alpha_1, \\alpha_2)$, where $\\alpha_i$ is probability of heads with coin $i$. The vector of $x$'s consist of iid samples. For a single sample of $x_i$, \n",
"$$\n",
"\\begin{split}\n",
"p(x_i|\\theta)\n",
"&= \n",
"p(x_i, \\alpha=\\alpha_1) + p(x_i, \\alpha=\\alpha_2)\n",
"\\\\\n",
"&=\n",
"p(x_i|\\alpha=\\alpha_1)p(\\alpha=\\alpha_1) + p(x_i|\\alpha=\\alpha_2)p(\\alpha=\\alpha_2)\n",
"\\\\\n",
"&=\n",
"\\frac{1}{2}\\left(B(x_i|n,\\alpha_1) + B(x_i|n,\\alpha_2)\\right),\n",
"\\end{split}\n",
"$$\n",
"where $B(x|n, \\alpha) = {n \\choose x}\\alpha^x (1-\\alpha)^{n-x}$ is the probability density of Binomial distribution.\n",
"The log-likelihood function \n",
"$$\n",
"\\begin{split}\n",
"l(\\theta)\n",
"&=\n",
"\\sum_{i=1}^m \\log p(x_i|\\theta)\n",
"\\\\\n",
"&=\n",
"\\sum_{i=1}^m \\log\\left(B(x_i|\\ n,\\alpha_1) + B(x_i|\\ n,\\alpha_2)\\right)\n",
"+ \\textsf{constant}\n",
"\\\\\n",
"\\end{split}\n",
"$$\n",
"This is a difficult function to optimize because of the sum within logarithm!"
]
},
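{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (with hypothetical head counts, not the data generated below), the sketch below evaluates $l(\\theta)$ on a coarse grid. Note that $l(\\alpha_1, \\alpha_2) = l(\\alpha_2, \\alpha_1)$, so the surface has two symmetric maximizers -- one reason this objective is awkward to optimize directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom\n",
"\n",
"x_demo = np.array([2, 3, 7, 8]) # hypothetical head counts\n",
"n_demo = 10 # tosses per sample\n",
"\n",
"def loglik(a1, a2):\n",
"    # incomplete-data log-likelihood of the 50/50 binomial mixture\n",
"    return np.log(0.5*(binom.pmf(x_demo, n_demo, a1)\n",
"                       + binom.pmf(x_demo, n_demo, a2))).sum()\n",
"\n",
"grid = np.linspace(0.05, 0.95, 19)\n",
"vals = [(loglik(a1, a2), a1, a2) for a1 in grid for a2 in grid]\n",
"print max(vals) # the swapped point (a2, a1) attains the same value"
]
},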
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"We will now write the complete data likelihood function, which assumes we observe both the number of heads and the coin number. Let $z_i$ be the latent variable that identifies the coin, i.e., $z_i \\in \\{1, 2\\}$. The complete data log-likelihood is \n",
"$$\n",
"\\begin{split}\n",
"l_c(\\theta) \n",
"&= \n",
"\\sum_{i=1}^m \\log p(x_i, z_i|\\theta)\n",
"\\\\\n",
"&= \n",
"\\sum_{i=1}^m \\log \n",
"\\Pi_{j=1}^2\\left(p(z_i=j)p(x_i|z_i=j)\\right)^{\\mathbb{I}_{\\{z_i = j\\}}}\n",
"\\\\\n",
"&= \n",
"\\sum_{i=1}^m \\sum_{j=1}^2 \n",
"\\mathbb{I}_{\\{z_i = j\\}}\\log\\left(p(z_i=j)p(x_i|z_i=j)\\right)\n",
"\\\\\n",
"\\end{split}\n",
"$$\n",
"This function cannot be computed because we do not know $z_i$. We work around this difficulty by defining expected complete data log likelihood as follows:\n",
"$$\n",
"\\begin{split}\n",
"Q(\\theta, \\theta_{t-1})\n",
"&=\n",
"E[l_c(\\theta)|\\mathcal{D}, \\theta^{t-1}]\n",
"\\\\\n",
"&=\n",
"\\sum_{i=1}^m \\sum_{j=1}^2 \n",
"E[\\mathbb{I}_{\\{z_i = j\\}}|\\mathcal{D}, \\theta^{t-1}]\\log\\left(p(z_i=j)p(x_i|z_i=j)\\right)\n",
"\\\\\n",
"&=\n",
"\\sum_{i=1}^m \\sum_{j=1}^2 \n",
"p(z_i = j|x_i, \\theta^{t-1})\\log\\left(\\frac{1}{2}B(x_i|n, \\alpha_j)\\right)\n",
"\\\\\n",
"\\end{split}\n",
"$$\n",
"The expectation is of random variable $z_i$ using parameters $\\theta^{t-1}$ conditional on the observed data $\\mathcal{D} = \\{x_i\\}_{i=1}^m$. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### E-step\n",
"The E-step simply evaluates\n",
"\n",
"$$\n",
"p(z_i = j|x_i, \\theta^{t-1})\n",
"=\n",
"\\frac{\\alpha_{j,t-1}^{x_i}(1-\\alpha_{j, t-1})^{n-x_i}}{\\alpha_{1,t-1}^{x_i}(1-\\alpha_{1,t-1})^{n-x_i} + \\alpha_{2,t-1}^{x_i}(1-\\alpha_{2,t-1})^{n-x_i}}\n",
"\\triangleq\n",
"r_{ij},\n",
"$$\n",
"where $r_{ij}$ is the probability that $i$th data $x_i$ is from coin $j$. The calculation is a simple application of Bayes rule. Substituting this expression we get\n",
"$$\n",
"Q(\\theta,\\theta^{t-1})\n",
"=\n",
"\\sum_{i=1}^m \\sum_{j=1}^{2} r_{ij}\\left(x_i\\log\\alpha_j + (n-x_i)\\log(1-\\alpha_j)\\right)\n",
"+ \\text{terms independent of }\\theta\n",
"$$"
]
},
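{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the E-step in vectorized form (the data and current parameter guesses are hypothetical): rows of `r` index samples, columns index coins."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom\n",
"\n",
"x_demo = np.array([2, 3, 7, 8]) # hypothetical head counts\n",
"n_demo = 10 # tosses per sample\n",
"alpha = np.array([0.3, 0.6]) # hypothetical current guess (alpha_1, alpha_2)\n",
"\n",
"# binom.pmf broadcasts: result has shape (samples, coins)\n",
"pmf = binom.pmf(x_demo[:, None], n_demo, alpha[None, :])\n",
"# normalize each row; the binomial coefficients and 1/2 priors cancel\n",
"r = pmf / pmf.sum(axis=1, keepdims=True)\n",
"print r # r[i, j] = p(z_i = j | x_i, theta)"
]
},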
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### M-step\n",
"The M-step is to compute \n",
"$$\n",
"\\theta^t\n",
"\\in\n",
"\\mathrm{argmax}_{\\theta} Q(\\theta, \\theta^{t-1})\n",
"$$\n",
"Taking the derivative wrt $\\alpha_j$ and setting it to zero, we get\n",
"$$\n",
"\\frac{\\sum_{i=1}^m r_{ij} x_i}{\\alpha_j} - \\frac{\\sum_{i=1}^m r_{ij}(n-x_i)}{1-\\alpha_j} = 0.\n",
"$$\n",
"Solving we get the maximizing $$\\alpha_j = \\frac{\\sum_{i=1}^m r_{ij}\\ x_i/n}{\\sum_{i=1}^m r_{ij}}.$$\n",
"The above expression is very interpretable: $x_i/n$ is an estimate for probability of head of the coin. The expression allows us to combine these estimates with weights equal to probability that it is from a particular coin.\n",
"\n",
"It can be shown that likelihood of the observed data is monotonically increasing, which can be used as a convenient debugging tool!\n",
"\n",
"**Reference**\n",
"1. Kevin Murphy, Machine Learning: a Probabilistic Perspective, MIT Press, 2011."
]
},
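{
"cell_type": "markdown",
"metadata": {},
"source": [
"Continuing the sketch from the E-step above (it reuses `r`, `x_demo`, and `n_demo`), the M-step update is one line: a weighted average of the per-sample estimates $x_i/n$ with weights $r_{ij}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# M-step: alpha_j = sum_i r[i,j]*x_i / (n * sum_i r[i,j])\n",
"new_alpha = (r * x_demo[:, None]).sum(axis=0) / (n_demo * r.sum(axis=0))\n",
"print new_alpha # updated (alpha_1, alpha_2)"
]
},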
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import binom, bernoulli\n",
"import math as mth\n",
"from IPython.display import display\n",
"import pandas as pd\n",
"\n",
"\n",
"# Define problem parameters\n",
"nc = 2 # number of coins\n",
"act_theta = [0.2, 0.7] # probability of success for coin 1 and coin 2\n",
"m, n = 20, 10 # m = number of coin picks, n = number of tosses for each coin pick\n",
"\n",
"def gen_data():\n",
" x = []\n",
" for _ in range(m):\n",
" # pick the coin randomly\n",
" p = act_theta[bernoulli.rvs(0.5, loc=0, size=1)[0]]\n",
" x += [binom.rvs(n, p)] # record number of heads in n tosses\n",
" return x"
]
},
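{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the generator (not part of the estimation): with a fair coin choice, the expected head count per sample is $n(\\alpha_1 + \\alpha_2)/2 = 4.5$ for the parameters above, so the sample mean should be in that vicinity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sample = gen_data()\n",
"# mean head count should be near n*(0.2 + 0.7)/2 = 4.5\n",
"print sum(sample) / float(len(sample))"
]
},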
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimate the $\\theta$s"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def next_theta(theta):\n",
" \"\"\"\n",
" Function uses EM method to calculate new theta using the given\n",
" value of theta\n",
" \"\"\"\n",
" \n",
" # E-step\n",
" r = np.zeros((m,nc))\n",
" for i in range(m):\n",
" r[i]=[binom.pmf(x[i], n, theta[j]) for j in range(nc)]\n",
" r[i] = r[i]/r[i].sum()\n",
" \n",
" # M-step\n",
" new_theta = [0.0, 0.0]\n",
" for j in range(nc):\n",
" new_theta[j] = sum([r[i,j]*x[i] for i in range(m)])/(n*r[:,j].sum())\n",
" return new_theta\n",
"\n",
"def log_likelihood(theta):\n",
" \"\"\"\n",
" Calculate log-likelihood function for given theta\n",
" \"\"\"\n",
" return sum([mth.log(binom.pmf(k, n, theta[0])+binom.pmf(k, n, theta[1])) for k in x])\n",
"\n",
"def EM(debug=False):\n",
" \"\"\"\n",
" Use EM algorithm to determine theta\n",
" \"\"\"\n",
" theta = np.random.rand(nc).tolist()\n",
" obj = log_likelihood(theta)\n",
" for i in range(500): # max_iter is 500\n",
" theta = next_theta(theta)\n",
" new_obj = log_likelihood(theta)\n",
"\n",
" # Sanity check: objective value always increases\n",
" assert new_obj >= obj, \"likelihood decreased!\"\n",
"\n",
" # stopping criterion\n",
" if new_obj <= obj + 1e-7:\n",
" break\n",
" obj = new_obj\n",
" return theta "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[6, 4, 2, 1, 7, 10, 2, 2, 3, 8, 6, 4, 1, 1, 2, 2, 8, 3, 6, 3]\n"
]
},
{
"data": {
"text/plain": [
"[0.22917590030584753, 0.70378821853450535]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = gen_data()\n",
"print x\n",
"EM()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"A subtle point is that in the estimated $\\theta$ the probabilities can be switched around because statistically it is equivalent to call either of the coins as coin 1 and the other one as coin 2!"
]
}
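,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compare the estimate against `act_theta` despite this label switching, one can simply sort both vectors before comparing, as in the sketch below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sorting makes the comparison invariant to label switching\n",
"print sorted(EM()), sorted(act_theta)"
]
}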
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}