{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# EM Algorithm\n",
"### Author: _Calvin Chi_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expectation-maximization algorithm, or commonly called the EM algorithm, is an iterative approach for finding the maximum likelihood estimate for an unknown parameter $\\theta$ of a statistical model, where the model is dependent on another unknown and unobserved latent variable. \n",
"\n",
"Suppose we have a training set of $\\{x^{(1)},...,x^{(m)}\\}$ independent samples, and we wish to use this dataset to estimate $\\theta$ in a model dependent on $z$ and $\\theta$. In this case, $z$ is our unobserved latent variable. If $z$ is known, finidng the maximum likelihood estimate for $\\theta$ may be straightforward. However, if $z$ is not observed, the task is not as straightforward and the EM algorithm is often used in such situations. \n",
"\n",
"Normally, finding the maximum likelihood estimate involves finding $\\theta$ that maximizes the log-likelihood of the data. The $\\theta$ maximizing the likelihood and log-likelihood are the same because the $log(x)$ function is a monotonically increasing function. Let $x$ represent the training set $\\{x^{(1)},...,x^{(m)}\\}$, $l(\\theta)$ represent the log-likelihood of $\\theta$, $f(\\theta)$ represent the likelihood of $\\theta$, and $p(x^{(i)};\\theta)$ represent the probability of observing $x^{(i)}$ under the model parameterized by $\\theta$, then:\n",
"\n",
"$$f(\\theta) = \\prod_{i=1}^{m}p(x^{(i)};\\theta)$$\n",
"\n",
"$$ln(f(\\theta)) = l(\\theta) = \\sum_{i=1}^{m}log\\:p(x^{(i)};\\theta)$$\n",
"\n",
"If however, the model includes a latent variable $z$ that can take on $u$ values with a probability distribution $Q$, then the log-likelihood expression should be further broken down to reflect that:\n",
"\n",
"$$l(\\theta) = \\sum_{i=1}^{m}log\\sum_{j=1}^{u}p(x^{(i)}|\\:z^{(j)};\\theta)\\:Q_{i}(z^{(j)})$$\n",
"\n",
"$$=\\sum_{i=1}^{m}log\\sum_{j=1}^{u}\\frac{p(z^{(j)}|\\:x^{(i)};\\theta)p(x^{(i)})}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)})\\:\\text{, by Bayes Rule}$$\n",
"\n",
"$$=\\sum_{i=1}^{m}log\\sum_{j=1}^{u}\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)})$$\n",
"\n",
"$$\\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:\\text{, by Jensen's inequality}$$\n",
"\n",
"Let us first state Jensen's inequality - Jensen's inequality states that given a convex function $f$ and a random variable $X$:\n",
"\n",
"$$E[f(X)] \\ge f(E[X])$$\n",
"\n",
"Jensen's inequality can easily be converted to another statement about concave functions $g$. A convex function can be converted to a concave function by multiplying by negative one, or $g = -f$. Since math dictates that multiplying by negative one reverses the direction of inequality: \n",
"\n",
"$$E[g(X)] \\le g(E[X])$$\n",
"\n",
"for concave functions. With Jensen's inequality defined, let us use it to understand the inequality in the log-likelihood expression. Since $Q_{i}(z^{(j)})$ means the probability of $z^{(j)}$ for sample $x^{(i)}$, $Q$ is a probability distribution. Let $g = log(x)$ be our concave function. We can then hold the view:\n",
"\n",
"$$log\\sum_{j=1}^{u}\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)}) = g\\Big(E\\Big[\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\Big]\\Big)$$\n",
"\n",
"and also:\n",
"\n",
"$$\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})} = E\\Big[g(\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})})\\Big]$$\n",
"\n",
"Thus by Jensen's inequality extended to concave functions, the last inequality expression for the log-likelihood expression holds. Which $Q_{i}(z^{(j)})$ shall we choose? It turns out that in order to guarantee convergence of the EM algorithm, we must choose $Q_{i}(z^{(j)})$ such that the equality portion of the inequality in the log-likelihood expression must be invoked. In Jensen's inequality, equality is invoked when $X$ is a constant (easily shown since $E[X] = X$). Similarly, invoking equality in our log-likelihood expression means:\n",
"\n",
"$$\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})} = c$$\n",
"\n",
"Where $c$ is a constant. To accomplish this, $Q_{i}(z^{(j)}) \\propto p(z^{(j)}, x^{(i)};\\theta)$ must be true. In order for this proportionality to be true while having $Q_{i}(z)$ remain as a probability distribution, the following must be true:\n",
"\n",
"$$Q_{i}(z^{(j)}) = \\frac{p(x^{(i)}, z^{(j)};\\theta)}{\\sum_{j=1}^{u}p(x^{(i)}, z^{(j)};\\theta)}$$\n",
"\n",
"$$=\\frac{p(x^{(i)}, z^{(j)};\\theta)}{p(x^{(i)};\\theta)}$$\n",
"\n",
"$$=p(z^{(j)}|x^{(i)};\\theta)$$\n",
"\n",
"Now that we know how to choose $Q_{i}(z^{(j)}$, we can proceed to describe the EM algorithm itself:\n",
"\n",
"Repeat until convergence {\n",
"\n",
"$\\qquad$ Expectation step: for each $i$, set \n",
"$$Q_{i}(z^{(j)}) = p(z^{(j)}|x^{(i)};\\theta)$$\n",
"\n",
"$\\qquad$ Maximization step: set\n",
"$$\\theta = \\arg\\max_{\\theta} = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n",
"\n",
"}\n",
"\n",
"Note the reason the first step is called expectation is because it chooses the appropriate $Q_{i}$ to define the log-likelihood written as a sum of expectations. We then find $\\theta$ to maximize the log-likelihood in the maximization step."
]
},
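{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick numerical sanity check (an addition to the original notes, not part of the derivation), the snippet below builds a toy joint distribution $p(x, z^{(j)})$ for a single sample and compares $log\\:p(x)$ against the lower bound $\\sum_{j}Q(z^{(j)})\\:log\\frac{p(x, z^{(j)})}{Q(z^{(j)})}$ for an arbitrary $Q$ and for the posterior choice $Q(z^{(j)}) = p(z^{(j)}|x)$. The bound should hold for any $Q$ and become an equality for the posterior:\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"np.random.seed(0)\n",
"u = 4                                         # number of values the latent variable can take\n",
"p_xz = np.random.dirichlet(np.ones(u)) * 0.3  # toy joint probabilities p(x, z^(j)) for one sample\n",
"log_px = np.log(p_xz.sum())                   # log p(x) = log sum_j p(x, z^(j))\n",
"\n",
"def lower_bound(Q):\n",
"    # sum_j Q(z^(j)) * log[ p(x, z^(j)) / Q(z^(j)) ]\n",
"    return np.sum(Q * np.log(p_xz / Q))\n",
"\n",
"Q_arbitrary = np.random.dirichlet(np.ones(u))  # some distribution over z\n",
"Q_posterior = p_xz / p_xz.sum()                # p(z | x), the EM choice\n",
"\n",
"print(log_px, lower_bound(Q_arbitrary), lower_bound(Q_posterior))\n",
"```\n",
"\n",
"The first and third numbers should agree, and the second should be no larger than either, illustrating why the posterior is the $Q$ that makes the bound tight."
]
},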
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Proof of Convergence\n",
"To show that the EM algorithm convergences upon the optimal $\\theta$ after $n$ iterations, we need to show that $l(\\theta^{(t+1)}) \\ge l(\\theta^{(t)})$, where $t$ represents the $t^{ith}$ iteration. \n",
"\n",
"Recall that we have chosen $Q_{i}(z^{(j)}) = p(z^{(j)}|\\:x^{(i)};\\theta)$ because it leads to the equality:\n",
"\n",
"$$l(\\theta^{(t)}) = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t)})}{Q_{i}^{(t)}(z^{(j)})}$$\n",
"\n",
"After finding maximiation step, since we set $\\theta$ to be the parameter that maximizes the log-likelihood, we have the inequality:\n",
"\n",
"$$l(\\theta^{(t+1)}) \\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t+1)})}{Q_{i}^{(t)}(z^{(j)})}$$\n",
"\n",
"$$ \\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t)})}{Q_{i}^{(t)}(z^{(j)})} = l(\\theta^{(t)})$$\n",
"\n",
"The inequality involving $l(\\theta^{(t+1)})$ simply comes from Jensen's inequality, using similar logic used before. Hence, we have shown that with each iteration updating $\\theta$, the log-likelihood increases monotonically. \n",
"\n",
"There is an alternative perspective viewing the EM algorithm as a method of iteratively increasing the lower-bound on the log-likelihood by maximizing it with respect to $Q$ in the expectation step and then maximizing it with respect to $\\theta$. Let us define lower-bound on the log-likelihood $l(\\theta)$ as $J(Q, \\theta)$:\n",
"\n",
"$$J(Q, \\theta) = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n",
"\n",
"Since $J(Q, \\theta)$ is the lower-bound on $l(\\theta)$, the largest $J(Q, \\theta)$ can be is equal to $l(\\theta)$. Thus, choosing $Q$ such that $J(Q, \\theta) = l(\\theta)$ maximizes $J(Q, \\theta)$ with respect to $Q$. Then, maximizing $J(Q, \\theta)$ with respect to $\\theta$ is simply the maxization step of the EM algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Case Study\n",
"Suppose we wish to estimate the biases of two coins $A$ and $B$ as $p_{A}$ and $p_{B}$ respectively, given independent repeated sets of 10 coin tosses, where the identity of the coin used ($A$ or $B$) used in each set of tosses is unknown. \n",
"\n",
"<img src=\"http://i.imgur.com/lYzq37t.png\", width=250, height=250>\n",
"\n",
"## 3.1 EM Algorithm Application\n",
"Using the same notation in this tutorial. We can see that:\n",
"\n",
"$\\qquad x^{(i)} = \\text{number of heads in the $i^{ith}$ set of 10 tosses}$\n",
"\n",
"$\\qquad z^{(j)} = \\text{whether coin used was $A$ or $B$, with j $\\in \\{A, B\\}$}$\n",
"\n",
"$\\qquad Q_{i}(z^{(j)}) = \\text{probability of $z^{(j)}$ for the $i^{ith}$ set of 10 tosses}$\n",
"\n",
"Let $n$ be the number of coin tosses in a set and $m$ is the number of sets. We see that the likelihood function $l(\\theta)$ for set $i$ is simply the binomial distribution:\n",
"\n",
"$$l(\\theta)_{i} = \\binom{n}{x^{(i)}}\\theta^{x^{(i)}}(1-\\theta)^{n-x^{(i)}}$$\n",
"\n",
"Since the term $\\binom{n}{x^{(i)}}$ is not dependent on $\\theta$, we will drop it when writing our log-likelihood expression for a set of 10 coin tosses. \n",
"### 3.1.1 Expectation\n",
"Recall that in the expectation step for a sample $i$:\n",
"\n",
"$$Q_{i}(z^{(j)}) = p(z^{(j)}|x^{(i)};\\theta) = \\frac{p(x^{(i)}, z^{(j)};\\theta)}{\\sum_{j=1}^{u}p(x^{(i)}, z^{(j)};\\theta)}$$\n",
"\n",
"In our case study this turns out to be for coin $A$:\n",
"\n",
"$$Q_{i}(A) = \\frac{\\theta_{A}^{x^{(i)}}(1-\\theta_{A}^{n-x^{(i)}})}{\\theta_{A}^{x^{(i)}}(1-\\theta_{A}^{n-x^{(i)}}) + \\theta_{B}^{x^{(i)}}(1-\\theta_{B}^{n-x^{(i)}})}$$\n",
"\n",
"The term $Q_{i}(B)$ is derived similarly.\n",
"\n",
"### 3.1.2 Maximization\n",
"In the maximization step, we are to find:\n",
"\n",
"$$\\theta = \\arg\\max_{\\theta} = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n",
"\n",
"Which can be found by finding $\\theta$ such that the first derivative is equal to 0:\n",
"\n",
"$$\\frac{d}{d\\theta}\\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})} = 0$$\n",
"\n",
"The numerator $p(x^{(i)}, z^{(j)};\\theta)$ means what is the probability of observing $x^{(i)}$ number of heads when the probability of landing heads is $\\theta$ of coin $z^{(j)}$, and is expressed as the binomial distribution. Leaving out the binomial coefficient leads to the expression:\n",
"\n",
"$$\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n",
"\n",
"$$\\propto p_{z_{A}^{(i)}}\\Big[x^{(i)}log(\\theta_{A}) + (n-x^{(i)})log(1-\\theta_{A})-log(p_{z_{A}^{(i)}})\\Big]$$\n",
"\n",
"$$+ p_{z_{B}}^{(i)}\\Big[x^{(i)}log(\\theta_{B}) + (n-x^{(i)})log(1-\\theta_{B})-log(p_{z_{B}^{(i)}})\\Big]$$\n",
"\n",
"To find the maximal bias for coin $A$, take the first partial derivative with respect to $\\theta_{A}$:\n",
"\n",
"$$\\frac{d}{d\\theta_{A}}\\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n",
"\n",
"$$=\\sum_{i=1}^{m}p_{z_{A}^{(i)}}\\Big[\\frac{x^{(i)}}{\\theta_{A}} - \\frac{n-x^{(i)}}{1-\\theta_{A}}\\Big]$$\n",
"\n",
"$$=\\sum_{i=1}^{m}\\Big[p_{z_{A}^{(i)}}x^{(i)}-p_{z_{A}^{(i)}}x^{(i)}\\theta_{A} - n\\theta_{A}p_{z_{A}^{(i)}} + \\theta_{A}x^{(i)}p_{z_{A}^{(i)}}\\Big] = 0$$\n",
"\n",
"$$n\\theta_{A}\\sum_{i=1}^{m}p_{z_{A}^{(i)}} = \\sum_{i=1}^{m}p_{z_{A}^{(i)}}x^{(i)}$$\n",
"\n",
"$$\\theta_{A} = \\frac{\\sum_{i=1}^{m}p_{z_{A}^{(i)}}x^{(i)}}{n\\sum_{i=1}^{m}p_{z_{A}^{(i)}}}$$\n",
"\n",
"The $\\theta_{B}$ maximizing the lower bound on the log-likelihood of the data is found in a similar way. "
]
},
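{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the updates concrete, here is a small worked example (an addition to the original write-up) that carries out a single expectation and maximization step by hand with numpy, using the head counts from the dataset constructed in Section 3.2 below and the initial guesses $\\theta_{A} = 0.6$, $\\theta_{B} = 0.5$:\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"theta_A, theta_B, n = 0.6, 0.5, 10\n",
"heads = np.array([5, 9, 8, 4, 7])   # x^(i): head counts per set, matching the data below\n",
"\n",
"# Expectation step (Section 3.1.1): posterior probability that each set used coin A\n",
"lik_A = theta_A**heads * (1 - theta_A)**(n - heads)\n",
"lik_B = theta_B**heads * (1 - theta_B)**(n - heads)\n",
"Q_A = lik_A / (lik_A + lik_B)\n",
"Q_B = 1 - Q_A\n",
"\n",
"# Maximization step (Section 3.1.2): weighted fraction of heads attributed to each coin\n",
"theta_A_new = np.sum(Q_A * heads) / (n * np.sum(Q_A))\n",
"theta_B_new = np.sum(Q_B * heads) / (n * np.sum(Q_B))\n",
"print(np.round(Q_A, 3), round(theta_A_new, 3), round(theta_B_new, 3))\n",
"```\n",
"\n",
"These updated biases should match the second line of the verbose output produced by the full implementation below."
]
},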
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 EM Algorithm Implementation\n",
"Let us implement an EM algorithm to estimate the biases of two coins in a set of coin tosses. Our dataset will follow the one depicted in the diagram above. We will first implement an EM algorithm class with the following usage:\n",
"\n",
"```\n",
"clf = EM(initial=(0.600, 0.500), convergence=0.01, verbose=False)\n",
"```\n",
"\n",
"1. initial: tuple of size two indicating initial guesses to the biases of the two coins\n",
"2. convergence: threshold indicating difference in estimated parameters between iterations below which convergence is established\n",
"3. verbose: boolean indicating whether to output the estimated parameter values per iteration\n",
"\n",
"To train, where `data` is the variable assigned to a 2D numpy array with dimensions $m\\:x\\:n$: \n",
"\n",
"```\n",
"p1, p2 = clf.train(data)\n",
"```\n",
"Where `p1` and `p2` are the final estimated biases for the two coins. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np \n",
"import pandas as pd\n",
"\n",
"class EM: \n",
" # Set initial coin bias guesses\n",
" def __init__(self, initial=(0.600, 0.500), convergence=0.01, verbose=False):\n",
" self.initial = initial\n",
" self.convergence = convergence\n",
" self.verbose = verbose\n",
"\n",
" def train(self, data):\n",
" p1 = self.initial[0]\n",
" p2 = self.initial[1]\n",
" p1Old = 0\n",
" p2Old = 0\n",
"\n",
" def likelihood(n, p, k):\n",
" return (p**k)*(1-p)**(n-k)\n",
"\n",
" # Returns probability distribution coins used for each experiment\n",
" def expectation(data, p1, p2):\n",
" ncol = data.shape[1]\n",
" heads = np.sum(data, axis=1)\n",
" l1 = [likelihood(ncol, p1, head) for head in heads]\n",
" l2 = [likelihood(ncol, p2, head) for head in heads]\n",
" likelihoods = np.array([l1, l2]).T\n",
" total = np.sum(likelihoods, axis=1, keepdims=True)\n",
" prob1 = likelihoods[:, [0]] / total\n",
" prob2 = 1 - prob1\n",
" return np.concatenate((prob1, prob2), axis=1)\n",
"\n",
" # Returns updated coin biases\n",
" def maximization(data, probs): \n",
" ncol = data.shape[1]\n",
" heads = np.sum(data, axis=1, keepdims=True)\n",
" totals = ncol * np.sum(probs, axis=0)\n",
" numerators = np.sum(heads * probs, axis=0)\n",
" return numerators / totals\n",
"\n",
" i = 1\n",
" while (abs(p1-p1Old) + abs(p2-p2Old)) > self.convergence: \n",
" p1Old = p1\n",
" p2Old = p2\n",
" if self.verbose: \n",
" print(str(i) + \". \", round(p1, 3), round(p2, 3))\n",
" probs = expectation(data, p1, p2)\n",
" p1, p2 = maximization(data, probs)\n",
" i += 1\n",
"\n",
" return p1, p2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let us create our dataset and train to estimate the coin biases:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. 0.6 0.5\n",
"2. 0.713 0.581\n",
"3. 0.745 0.569\n",
"4. 0.768 0.55\n",
"5. 0.783 0.535\n",
"6. 0.791 0.526\n",
"Final Estimation: \n",
"0.794532537994 0.522390437518\n"
]
}
],
"source": [
"data = np.array([[1, 0, 0, 0, 1, 1, 0, 1, 0, 1],\n",
" [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],\n",
" [1, 0, 1, 1, 1, 1, 1, 0, 1, 1], \n",
" [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],\n",
" [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])\n",
"clf = EM(verbose=True)\n",
"p1, p2 = clf.train(data)\n",
"print(\"Final Estimation: \")\n",
"print(p1, p2)"
]
},
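{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a check on the convergence argument from Section 2 (this cell is an addition, not part of the original notebook), the snippet below re-runs the same expectation and maximization updates inline on `data` and prints the observed-data log-likelihood at each iteration, dropping terms that do not depend on the biases (the binomial coefficients and an assumed equal prior over the two coins). The printed values should be non-decreasing:\n",
"\n",
"```\n",
"def log_likelihood(data, t1, t2):\n",
"    # log-likelihood of the observed tosses, up to an additive constant\n",
"    n = data.shape[1]\n",
"    h = data.sum(axis=1)\n",
"    lik = t1**h * (1 - t1)**(n - h) + t2**h * (1 - t2)**(n - h)\n",
"    return np.sum(np.log(lik))\n",
"\n",
"t1, t2 = 0.6, 0.5\n",
"n = data.shape[1]\n",
"h = data.sum(axis=1)\n",
"for it in range(1, 7):\n",
"    print(it, round(log_likelihood(data, t1, t2), 4), round(t1, 3), round(t2, 3))\n",
"    # expectation step: responsibilities for the first coin\n",
"    l1 = t1**h * (1 - t1)**(n - h)\n",
"    l2 = t2**h * (1 - t2)**(n - h)\n",
"    q1 = l1 / (l1 + l2)\n",
"    q2 = 1 - q1\n",
"    # maximization step: updated biases\n",
"    t1 = np.sum(q1 * h) / (n * np.sum(q1))\n",
"    t2 = np.sum(q2 * h) / (n * np.sum(q2))\n",
"```"
]
},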
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Let us assess whether this answer makes sense in light of the data that we have. \n",
"\n",
"Set Index|Number of Heads\n",
"-|-\n",
"1|5\n",
"2|9\n",
"3|8\n",
"4|4\n",
"5|7\n",
"\n",
"Based on this data, it appears that sets 2, 3, and 5 results from using the coin with estimated bias of 0.79, while sets 1 and 4 results from using the coin with estimated bias of 0.52. The number of heads appearing in our dataset sounds reasonable with this coin to set assignment"
]
},
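{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also back this assignment up numerically (an added check, not part of the original notebook) by evaluating the expectation-step formula at the final estimates, which gives the posterior probability that each set of tosses came from the coin with the larger estimated bias:\n",
"\n",
"```\n",
"n = data.shape[1]\n",
"heads = data.sum(axis=1)\n",
"l1 = p1**heads * (1 - p1)**(n - heads)\n",
"l2 = p2**heads * (1 - p2)**(n - heads)\n",
"print(np.round(l1 / (l1 + l2), 3))   # probability each set used the coin with bias ~0.79\n",
"```\n",
"\n",
"Sets 2, 3, and 5 should receive probabilities above one half, and sets 1 and 4 probabilities well below one half."
]
},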
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reference\n",
"1. Ng, A. _The EM Algorithm_. Retrieved from http://cs229.stanford.edu/\n",
"2. Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899. doi:10.1038/nbt1406"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}