{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# EM Algorithm\n", | |
"### Author: _Calvin Chi_" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 1. Algorithm" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The expectation-maximization algorithm, or commonly called the EM algorithm, is an iterative approach for finding the maximum likelihood estimate for an unknown parameter $\\theta$ of a statistical model, where the model is dependent on another unknown and unobserved latent variable. \n", | |
"\n", | |
"Suppose we have a training set of $\\{x^{(1)},...,x^{(m)}\\}$ independent samples, and we wish to use this dataset to estimate $\\theta$ in a model dependent on $z$ and $\\theta$. In this case, $z$ is our unobserved latent variable. If $z$ is known, finidng the maximum likelihood estimate for $\\theta$ may be straightforward. However, if $z$ is not observed, the task is not as straightforward and the EM algorithm is often used in such situations. \n", | |
"\n", | |
"Normally, finding the maximum likelihood estimate involves finding $\\theta$ that maximizes the log-likelihood of the data. The $\\theta$ maximizing the likelihood and log-likelihood are the same because the $log(x)$ function is a monotonically increasing function. Let $x$ represent the training set $\\{x^{(1)},...,x^{(m)}\\}$, $l(\\theta)$ represent the log-likelihood of $\\theta$, $f(\\theta)$ represent the likelihood of $\\theta$, and $p(x^{(i)};\\theta)$ represent the probability of observing $x^{(i)}$ under the model parameterized by $\\theta$, then:\n", | |
"\n", | |
"$$f(\\theta) = \\prod_{i=1}^{m}p(x^{(i)};\\theta)$$\n", | |
"\n", | |
"$$ln(f(\\theta)) = l(\\theta) = \\sum_{i=1}^{m}log\\:p(x^{(i)};\\theta)$$\n", | |
"\n", | |
"If however, the model includes a latent variable $z$ that can take on $u$ values with a probability distribution $Q$, then the log-likelihood expression should be further broken down to reflect that:\n", | |
"\n", | |
"$$l(\\theta) = \\sum_{i=1}^{m}log\\sum_{j=1}^{u}p(x^{(i)}|\\:z^{(j)};\\theta)\\:Q_{i}(z^{(j)})$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}log\\sum_{j=1}^{u}\\frac{p(z^{(j)}|\\:x^{(i)};\\theta)p(x^{(i)})}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)})\\:\\text{, by Bayes Rule}$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}log\\sum_{j=1}^{u}\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)})$$\n", | |
"\n", | |
"$$\\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:\\text{, by Jensen's inequality}$$\n", | |
"\n", | |
"Let us first state Jensen's inequality - Jensen's inequality states that given a convex function $f$ and a random variable $X$:\n", | |
"\n", | |
"$$E[f(X)] \\ge f(E[X])$$\n", | |
"\n", | |
"Jensen's inequality can easily be converted to another statement about concave functions $g$. A convex function can be converted to a concave function by multiplying by negative one, or $g = -f$. Since math dictates that multiplying by negative one reverses the direction of inequality: \n", | |
"\n", | |
"$$E[g(X)] \\le g(E[X])$$\n", | |
"\n", | |
"for concave functions. With Jensen's inequality defined, let us use it to understand the inequality in the log-likelihood expression. Since $Q_{i}(z^{(j)})$ means the probability of $z^{(j)}$ for sample $x^{(i)}$, $Q$ is a probability distribution. Let $g = log(x)$ be our concave function. We can then hold the view:\n", | |
"\n", | |
"$$log\\sum_{j=1}^{u}\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)}) = g\\Big(E\\Big[\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\Big]\\Big)$$\n", | |
"\n", | |
"and also:\n", | |
"\n", | |
"$$\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})} = E\\Big[g(\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})})\\Big]$$\n", | |
"\n", | |
"Thus by Jensen's inequality extended to concave functions, the last inequality expression for the log-likelihood expression holds. Which $Q_{i}(z^{(j)})$ shall we choose? It turns out that in order to guarantee convergence of the EM algorithm, we must choose $Q_{i}(z^{(j)})$ such that the equality portion of the inequality in the log-likelihood expression must be invoked. In Jensen's inequality, equality is invoked when $X$ is a constant (easily shown since $E[X] = X$). Similarly, invoking equality in our log-likelihood expression means:\n", | |
"\n", | |
"$$\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})} = c$$\n", | |
"\n", | |
"Where $c$ is a constant. To accomplish this, $Q_{i}(z^{(j)}) \\propto p(z^{(j)}, x^{(i)};\\theta)$ must be true. In order for this proportionality to be true while having $Q_{i}(z)$ remain as a probability distribution, the following must be true:\n", | |
"\n", | |
"$$Q_{i}(z^{(j)}) = \\frac{p(x^{(i)}, z^{(j)};\\theta)}{\\sum_{j=1}^{u}p(x^{(i)}, z^{(j)};\\theta)}$$\n", | |
"\n", | |
"$$=\\frac{p(x^{(i)}, z^{(j)};\\theta)}{p(x^{(i)};\\theta)}$$\n", | |
"\n", | |
"$$=p(z^{(j)}|x^{(i)};\\theta)$$\n", | |
"\n", | |
"Now that we know how to choose $Q_{i}(z^{(j)}$, we can proceed to describe the EM algorithm itself:\n", | |
"\n", | |
"Repeat until convergence {\n", | |
"\n", | |
"$\\qquad$ Expectation step: for each $i$, set \n", | |
"$$Q_{i}(z^{(j)}) = p(z^{(j)}|x^{(i)};\\theta)$$\n", | |
"\n", | |
"$\\qquad$ Maximization step: set\n", | |
"$$\\theta = \\arg\\max_{\\theta} = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"}\n", | |
"\n", | |
"Note the reason the first step is called expectation is because it chooses the appropriate $Q_{i}$ to define the log-likelihood written as a sum of expectations. We then find $\\theta$ to maximize the log-likelihood in the maximization step." | |
] | |
}, | |
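{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two steps above translate directly into code. Below is a minimal, generic sketch of the EM loop, where `e_step` and `m_step` are hypothetical user-supplied callables for the model-specific computations; a concrete version for a coin-tossing model is implemented in Section 3.2.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"# Generic EM loop. The callables are hypothetical placeholders:\n",
"#   e_step(data, theta) returns the distribution Q over the latent variable,\n",
"#   m_step(data, Q) returns the theta that maximizes the lower bound.\n",
"def em(data, theta_init, e_step, m_step, tol=1e-4, max_iter=100):\n",
"    theta = np.asarray(theta_init, dtype=float)\n",
"    for _ in range(max_iter):\n",
"        Q = e_step(data, theta)                               # E-step\n",
"        theta_new = np.asarray(m_step(data, Q), dtype=float)  # M-step\n",
"        if np.max(np.abs(theta_new - theta)) < tol:           # small update => converged\n",
"            return theta_new\n",
"        theta = theta_new\n",
"    return theta\n",
"```"
]
},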
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 2. Proof of Convergence\n", | |
"To show that the EM algorithm convergences upon the optimal $\\theta$ after $n$ iterations, we need to show that $l(\\theta^{(t+1)}) \\ge l(\\theta^{(t)})$, where $t$ represents the $t^{ith}$ iteration. \n", | |
"\n", | |
"Recall that we have chosen $Q_{i}(z^{(j)}) = p(z^{(j)}|\\:x^{(i)};\\theta)$ because it leads to the equality:\n", | |
"\n", | |
"$$l(\\theta^{(t)}) = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t)})}{Q_{i}^{(t)}(z^{(j)})}$$\n", | |
"\n", | |
"After finding maximiation step, since we set $\\theta$ to be the parameter that maximizes the log-likelihood, we have the inequality:\n", | |
"\n", | |
"$$l(\\theta^{(t+1)}) \\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t+1)})}{Q_{i}^{(t)}(z^{(j)})}$$\n", | |
"\n", | |
"$$ \\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t)})}{Q_{i}^{(t)}(z^{(j)})} = l(\\theta^{(t)})$$\n", | |
"\n", | |
"The inequality involving $l(\\theta^{(t+1)})$ simply comes from Jensen's inequality, using similar logic used before. Hence, we have shown that with each iteration updating $\\theta$, the log-likelihood increases monotonically. \n", | |
"\n", | |
"There is an alternative perspective viewing the EM algorithm as a method of iteratively increasing the lower-bound on the log-likelihood by maximizing it with respect to $Q$ in the expectation step and then maximizing it with respect to $\\theta$. Let us define lower-bound on the log-likelihood $l(\\theta)$ as $J(Q, \\theta)$:\n", | |
"\n", | |
"$$J(Q, \\theta) = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"Since $J(Q, \\theta)$ is the lower-bound on $l(\\theta)$, the largest $J(Q, \\theta)$ can be is equal to $l(\\theta)$. Thus, choosing $Q$ such that $J(Q, \\theta) = l(\\theta)$ maximizes $J(Q, \\theta)$ with respect to $Q$. Then, maximizing $J(Q, \\theta)$ with respect to $\\theta$ is simply the maxization step of the EM algorithm." | |
] | |
}, | |
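{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the lower-bound picture concrete, here is a minimal numeric sketch on a toy observation: 8 heads in $n = 10$ tosses of one of two coins, with a uniform prior over which coin was used and constant factors dropped (as is done in Section 3). The bound $J(Q, \\theta)$ equals $l(\\theta)$ at the $\\theta$ used to construct $Q$, and lies below $l(\\theta)$ for any other $\\theta$.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"x, n = 8, 10                        # toy observation: 8 heads in 10 tosses\n",
"def joint(theta):                   # joint p(x, z; theta) up to constants, uniform prior over z\n",
"    tA, tB = theta\n",
"    return 0.5 * np.array([tA**x * (1 - tA)**(n - x),\n",
"                           tB**x * (1 - tB)**(n - x)])\n",
"\n",
"def log_lik(theta):                 # l(theta) = log sum_z p(x, z; theta)\n",
"    return np.log(joint(theta).sum())\n",
"\n",
"def lower_bound(Q, theta):          # J(Q, theta) = sum_z Q(z) log(p(x, z; theta) / Q(z))\n",
"    return np.sum(Q * np.log(joint(theta) / Q))\n",
"\n",
"theta_t = (0.6, 0.5)\n",
"Q_t = joint(theta_t) / joint(theta_t).sum()                 # E-step posterior at theta_t: bound is tight\n",
"print(log_lik(theta_t), lower_bound(Q_t, theta_t))          # equal\n",
"print(log_lik((0.8, 0.5)), lower_bound(Q_t, (0.8, 0.5)))    # l(theta) >= J(Q_t, theta) elsewhere\n",
"```"
]
},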
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 3. Case Study\n", | |
"Suppose we wish to estimate the biases of two coins $A$ and $B$ as $p_{A}$ and $p_{B}$ respectively, given independent repeated sets of 10 coin tosses, where the identity of the coin used ($A$ or $B$) used in each set of tosses is unknown. \n", | |
"\n", | |
"<img src=\"http://i.imgur.com/lYzq37t.png\", width=250, height=250>\n", | |
"\n", | |
"## 3.1 EM Algorithm Application\n", | |
"Using the same notation in this tutorial. We can see that:\n", | |
"\n", | |
"$\\qquad x^{(i)} = \\text{number of heads in the $i^{ith}$ set of 10 tosses}$\n", | |
"\n", | |
"$\\qquad z^{(j)} = \\text{whether coin used was $A$ or $B$, with j $\\in \\{A, B\\}$}$\n", | |
"\n", | |
"$\\qquad Q_{i}(z^{(j)}) = \\text{probability of $z^{(j)}$ for the $i^{ith}$ set of 10 tosses}$\n", | |
"\n", | |
"Let $n$ be the number of coin tosses in a set and $m$ is the number of sets. We see that the likelihood function $l(\\theta)$ for set $i$ is simply the binomial distribution:\n", | |
"\n", | |
"$$l(\\theta)_{i} = \\binom{n}{x^{(i)}}\\theta^{x^{(i)}}(1-\\theta)^{n-x^{(i)}}$$\n", | |
"\n", | |
"Since the term $\\binom{n}{x^{(i)}}$ is not dependent on $\\theta$, we will drop it when writing our log-likelihood expression for a set of 10 coin tosses. \n", | |
"### 3.1.1 Expectation\n", | |
"Recall that in the expectation step for a sample $i$:\n", | |
"\n", | |
"$$Q_{i}(z^{(j)}) = p(z^{(j)}|x^{(i)};\\theta) = \\frac{p(x^{(i)}, z^{(j)};\\theta)}{\\sum_{j=1}^{u}p(x^{(i)}, z^{(j)};\\theta)}$$\n", | |
"\n", | |
"In our case study this turns out to be for coin $A$:\n", | |
"\n", | |
"$$Q_{i}(A) = \\frac{\\theta_{A}^{x^{(i)}}(1-\\theta_{A}^{n-x^{(i)}})}{\\theta_{A}^{x^{(i)}}(1-\\theta_{A}^{n-x^{(i)}}) + \\theta_{B}^{x^{(i)}}(1-\\theta_{B}^{n-x^{(i)}})}$$\n", | |
"\n", | |
"The term $Q_{i}(B)$ is derived similarly.\n", | |
"\n", | |
"### 3.1.2 Maximization\n", | |
"In the maximization step, we are to find:\n", | |
"\n", | |
"$$\\theta = \\arg\\max_{\\theta} = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"Which can be found by finding $\\theta$ such that the first derivative is equal to 0:\n", | |
"\n", | |
"$$\\frac{d}{d\\theta}\\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})} = 0$$\n", | |
"\n", | |
"The numerator $p(x^{(i)}, z^{(j)};\\theta)$ means what is the probability of observing $x^{(i)}$ number of heads when the probability of landing heads is $\\theta$ of coin $z^{(j)}$, and is expressed as the binomial distribution. Leaving out the binomial coefficient leads to the expression:\n", | |
"\n", | |
"$$\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"$$\\propto p_{z_{A}^{(i)}}\\Big[x^{(i)}log(\\theta_{A}) + (n-x^{(i)})log(1-\\theta_{A})-log(p_{z_{A}^{(i)}})\\Big]$$\n", | |
"\n", | |
"$$+ p_{z_{B}}^{(i)}\\Big[x^{(i)}log(\\theta_{B}) + (n-x^{(i)})log(1-\\theta_{B})-log(p_{z_{B}^{(i)}})\\Big]$$\n", | |
"\n", | |
"To find the maximal bias for coin $A$, take the first partial derivative with respect to $\\theta_{A}$:\n", | |
"\n", | |
"$$\\frac{d}{d\\theta_{A}}\\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}p_{z_{A}^{(i)}}\\Big[\\frac{x^{(i)}}{\\theta_{A}} - \\frac{n-x^{(i)}}{1-\\theta_{A}}\\Big]$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}\\Big[p_{z_{A}^{(i)}}x^{(i)}-p_{z_{A}^{(i)}}x^{(i)}\\theta_{A} - n\\theta_{A}p_{z_{A}^{(i)}} + \\theta_{A}x^{(i)}p_{z_{A}^{(i)}}\\Big] = 0$$\n", | |
"\n", | |
"$$n\\theta_{A}\\sum_{i=1}^{m}p_{z_{A}^{(i)}} = \\sum_{i=1}^{m}p_{z_{A}^{(i)}}x^{(i)}$$\n", | |
"\n", | |
"$$\\theta_{A} = \\frac{\\sum_{i=1}^{m}p_{z_{A}^{(i)}}x^{(i)}}{n\\sum_{i=1}^{m}p_{z_{A}^{(i)}}}$$\n", | |
"\n", | |
"The $\\theta_{B}$ maximizing the lower bound on the log-likelihood of the data is found in a similar way. " | |
] | |
}, | |
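{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting the two steps together, here is a minimal sketch of a single EM iteration on the five sets of tosses from the diagram (head counts 5, 9, 8, 4, 7), starting from $\\theta_{A} = 0.6$ and $\\theta_{B} = 0.5$: the E-step responsibilities are computed first, then the closed-form update above is applied.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"heads = np.array([5, 9, 8, 4, 7])   # heads per set, as in the diagram above\n",
"n = 10\n",
"theta_A, theta_B = 0.6, 0.5         # initial guesses\n",
"\n",
"# E-step: responsibility of each coin for each set\n",
"lik_A = theta_A**heads * (1 - theta_A)**(n - heads)\n",
"lik_B = theta_B**heads * (1 - theta_B)**(n - heads)\n",
"q_A = lik_A / (lik_A + lik_B)\n",
"q_B = 1 - q_A\n",
"\n",
"# M-step: closed-form updates derived above\n",
"theta_A_new = np.sum(q_A * heads) / (n * np.sum(q_A))\n",
"theta_B_new = np.sum(q_B * heads) / (n * np.sum(q_B))\n",
"print(round(theta_A_new, 3), round(theta_B_new, 3))   # approximately 0.713 and 0.581\n",
"```"
]
},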
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 3.2 EM Algorithm Implementation\n", | |
"Let us implement an EM algorithm to estimate the biases of two coins in a set of coin tosses. Our dataset will follow the one depicted in the diagram above. We will first implement an EM algorithm class with the following usage:\n", | |
"\n", | |
"```\n", | |
"clf = EM(initial=(0.600, 0.500), convergence=0.01, verbose=False)\n", | |
"```\n", | |
"\n", | |
"1. initial: tuple of size two indicating initial guesses to the biases of the two coins\n", | |
"2. convergence: threshold indicating difference in estimated parameters between iterations below which convergence is established\n", | |
"3. verbose: boolean indicating whether to output the estimated parameter values per iteration\n", | |
"\n", | |
"To train, where `data` is the variable assigned to a 2D numpy array with dimensions $m\\:x\\:n$: \n", | |
"\n", | |
"```\n", | |
"p1, p2 = clf.train(data)\n", | |
"```\n", | |
"Where `p1` and `p2` are the final estimated biases for the two coins. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np \n", | |
"import pandas as pd\n", | |
"\n", | |
"class EM: \n", | |
" # Set initial coin bias guesses\n", | |
" def __init__(self, initial=(0.600, 0.500), convergence=0.01, verbose=False):\n", | |
" self.initial = initial\n", | |
" self.convergence = convergence\n", | |
" self.verbose = verbose\n", | |
"\n", | |
" def train(self, data):\n", | |
" p1 = self.initial[0]\n", | |
" p2 = self.initial[1]\n", | |
" p1Old = 0\n", | |
" p2Old = 0\n", | |
"\n", | |
" def likelihood(n, p, k):\n", | |
" return (p**k)*(1-p)**(n-k)\n", | |
"\n", | |
" # Returns probability distribution coins used for each experiment\n", | |
" def expectation(data, p1, p2):\n", | |
" ncol = data.shape[1]\n", | |
" heads = np.sum(data, axis=1)\n", | |
" l1 = [likelihood(ncol, p1, head) for head in heads]\n", | |
" l2 = [likelihood(ncol, p2, head) for head in heads]\n", | |
" likelihoods = np.array([l1, l2]).T\n", | |
" total = np.sum(likelihoods, axis=1, keepdims=True)\n", | |
" prob1 = likelihoods[:, [0]] / total\n", | |
" prob2 = 1 - prob1\n", | |
" return np.concatenate((prob1, prob2), axis=1)\n", | |
"\n", | |
" # Returns updated coin biases\n", | |
" def maximization(data, probs): \n", | |
" ncol = data.shape[1]\n", | |
" heads = np.sum(data, axis=1, keepdims=True)\n", | |
" totals = ncol * np.sum(probs, axis=0)\n", | |
" numerators = np.sum(heads * probs, axis=0)\n", | |
" return numerators / totals\n", | |
"\n", | |
" i = 1\n", | |
" while (abs(p1-p1Old) + abs(p2-p2Old)) > self.convergence: \n", | |
" p1Old = p1\n", | |
" p2Old = p2\n", | |
" if self.verbose: \n", | |
" print(str(i) + \". \", round(p1, 3), round(p2, 3))\n", | |
" probs = expectation(data, p1, p2)\n", | |
" p1, p2 = maximization(data, probs)\n", | |
" i += 1\n", | |
"\n", | |
" return p1, p2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, let us create our dataset and train to estimate the coin biases:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1. 0.6 0.5\n", | |
"2. 0.713 0.581\n", | |
"3. 0.745 0.569\n", | |
"4. 0.768 0.55\n", | |
"5. 0.783 0.535\n", | |
"6. 0.791 0.526\n", | |
"Final Estimation: \n", | |
"0.794532537994 0.522390437518\n" | |
] | |
} | |
], | |
"source": [ | |
"data = np.array([[1, 0, 0, 0, 1, 1, 0, 1, 0, 1],\n", | |
" [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],\n", | |
" [1, 0, 1, 1, 1, 1, 1, 0, 1, 1], \n", | |
" [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],\n", | |
" [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])\n", | |
"clf = EM(verbose=True)\n", | |
"p1, p2 = clf.train(data)\n", | |
"print(\"Final Estimation: \")\n", | |
"print(p1, p2)" | |
] | |
}, | |
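{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the monotone-improvement property proved in Section 2, the sketch below evaluates the observed-data log-likelihood (up to constants, and assuming an equal prior over the two coins, an assumption implicit in the E-step normalization above) at the initial guesses and at the final estimates; the value at the final estimates should be larger.\n",
"\n",
"```\n",
"heads = np.sum(data, axis=1)   # number of heads per set, using the data defined above\n",
"n = data.shape[1]\n",
"\n",
"# Observed-data log-likelihood up to constants, assuming an equal prior over the two coins\n",
"def log_lik(tA, tB):\n",
"    lA = tA**heads * (1 - tA)**(n - heads)\n",
"    lB = tB**heads * (1 - tB)**(n - heads)\n",
"    return np.sum(np.log(0.5 * (lA + lB)))\n",
"\n",
"print(log_lik(0.600, 0.500))   # at the initial guesses\n",
"print(log_lik(p1, p2))         # at the final estimates; should be larger\n",
"```"
]
},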
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"Let us assess whether this answer makes sense in light of the data that we have. \n", | |
"\n", | |
"Set Index|Number of Heads\n", | |
"-|-\n", | |
"1|5\n", | |
"2|9\n", | |
"3|8\n", | |
"4|4\n", | |
"5|7\n", | |
"\n", | |
"Based on this data, it appears that sets 2, 3, and 5 results from using the coin with estimated bias of 0.79, while sets 1 and 4 results from using the coin with estimated bias of 0.52. The number of heads appearing in our dataset sounds reasonable with this coin to set assignment" | |
] | |
}, | |
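{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make this assignment quantitative with a short sketch that computes, under the final estimates and an equal prior over the two coins, the posterior probability that each set was generated by the higher-bias coin; sets 2, 3, and 5 should come out above one half, and sets 1 and 4 below.\n",
"\n",
"```\n",
"heads = np.sum(data, axis=1)\n",
"n = data.shape[1]\n",
"lik_high = p1**heads * (1 - p1)**(n - heads)   # likelihood under the ~0.79 coin\n",
"lik_low = p2**heads * (1 - p2)**(n - heads)    # likelihood under the ~0.52 coin\n",
"print(np.round(lik_high / (lik_high + lik_low), 2))   # posterior for the ~0.79 coin, per set\n",
"```"
]
},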
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Reference\n", | |
"1. Ng, A. _The EM Algorithm_. Retrieved from http://cs229.stanford.edu/\n", | |
"2. Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899. doi:10.1038/nbt1406" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |