{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# EM Algorithm\n", | |
"### Author: _Calvin Chi_" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 1. Algorithm" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The expectation-maximization algorithm, or commonly called the EM algorithm, is an iterative approach for finding the maximum likelihood estimate for an unknown parameter $\\theta$ of a statistical model, where the model is dependent on another unknown and unobserved latent variable. \n", | |
"\n", | |
"Suppose we have a training set of $\\{x^{(1)},...,x^{(m)}\\}$ independent samples, and we wish to use this dataset to estimate $\\theta$ in a model dependent on $z$ and $\\theta$. In this case, $z$ is our unobserved latent variable. If $z$ is known, finidng the maximum likelihood estimate for $\\theta$ may be straightforward. However, if $z$ is not observed, the task is not as straightforward and the EM algorithm is often used in such situations. \n", | |
"\n", | |
"Normally, finding the maximum likelihood estimate involves finding $\\theta$ that maximizes the log-likelihood of the data. The $\\theta$ maximizing the likelihood and log-likelihood are the same because the $log(x)$ function is a monotonically increasing function. Let $x$ represent the training set $\\{x^{(1)},...,x^{(m)}\\}$, $l(\\theta)$ represent the log-likelihood of $\\theta$, $f(\\theta)$ represent the likelihood of $\\theta$, and $p(x^{(i)};\\theta)$ represent the probability of observing $x^{(i)}$ under the model parameterized by $\\theta$, then:\n", | |
"\n", | |
"$$f(\\theta) = \\prod_{i=1}^{m}p(x^{(i)};\\theta)$$\n", | |
"\n", | |
"$$ln(f(\\theta)) = l(\\theta) = \\sum_{i=1}^{m}log\\:p(x^{(i)};\\theta)$$\n", | |
"\n", | |
"If however, the model includes a latent variable $z$ that can take on $u$ values with a probability distribution $Q$, then the log-likelihood expression should be further broken down to reflect that:\n", | |
"\n", | |
"$$l(\\theta) = \\sum_{i=1}^{m}log\\sum_{j=1}^{u}p(x^{(i)}|\\:z^{(j)};\\theta)\\:Q_{i}(z^{(j)})$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}log\\sum_{j=1}^{u}\\frac{p(z^{(j)}|\\:x^{(i)};\\theta)p(x^{(i)})}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)})\\:\\text{, by Bayes Rule}$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}log\\sum_{j=1}^{u}\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)})$$\n", | |
"\n", | |
"$$\\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:\\text{, by Jensen's inequality}$$\n", | |
"\n", | |
"Let us first state Jensen's inequality - Jensen's inequality states that given a convex function $f$ and a random variable $X$:\n", | |
"\n", | |
"$$E[f(X)] \\ge f(E[X])$$\n", | |
"\n", | |
"Jensen's inequality can easily be converted to another statement about concave functions $g$. A convex function can be converted to a concave function by multiplying by negative one, or $g = -f$. Since math dictates that multiplying by negative one reverses the direction of inequality: \n", | |
"\n", | |
"$$E[g(X)] \\le g(E[X])$$\n", | |
"\n", | |
"for concave functions. With Jensen's inequality defined, let us use it to understand the inequality in the log-likelihood expression. Since $Q_{i}(z^{(j)})$ means the probability of $z^{(j)}$ for sample $x^{(i)}$, $Q$ is a probability distribution. Let $g = log(x)$ be our concave function. We can then hold the view:\n", | |
"\n", | |
"$$log\\sum_{j=1}^{u}\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\:Q_{i}(z^{(j)}) = g\\Big(E\\Big[\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})}\\Big]\\Big)$$\n", | |
"\n", | |
"and also:\n", | |
"\n", | |
"$$\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})} = E\\Big[g(\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})})\\Big]$$\n", | |
"\n", | |
"Thus by Jensen's inequality extended to concave functions, the last inequality expression for the log-likelihood expression holds. Which $Q_{i}(z^{(j)})$ shall we choose? It turns out that in order to guarantee convergence of the EM algorithm, we must choose $Q_{i}(z^{(j)})$ such that the equality portion of the inequality in the log-likelihood expression must be invoked. In Jensen's inequality, equality is invoked when $X$ is a constant (easily shown since $E[X] = X$). Similarly, invoking equality in our log-likelihood expression means:\n", | |
"\n", | |
"$$\\frac{p(z^{(j)}, x^{(i)};\\theta)}{Q_{i}(z^{(j)})} = c$$\n", | |
"\n", | |
"Where $c$ is a constant. To accomplish this, $Q_{i}(z^{(j)}) \\propto p(z^{(j)}, x^{(i)};\\theta)$ must be true. In order for this proportionality to be true while having $Q_{i}(z)$ remain as a probability distribution, the following must be true:\n", | |
"\n", | |
"$$Q_{i}(z^{(j)}) = \\frac{p(x^{(i)}, z^{(j)};\\theta)}{\\sum_{j=1}^{u}p(x^{(i)}, z^{(j)};\\theta)}$$\n", | |
"\n", | |
"$$=\\frac{p(x^{(i)}, z^{(j)};\\theta)}{p(x^{(i)};\\theta)}$$\n", | |
"\n", | |
"$$=p(z^{(j)}|x^{(i)};\\theta)$$\n", | |
"\n", | |
"Now that we know how to choose $Q_{i}(z^{(j)}$, we can proceed to describe the EM algorithm itself:\n", | |
"\n", | |
"Repeat until convergence {\n", | |
"\n", | |
"$\\qquad$ Expectation step: for each $i$, set \n", | |
"$$Q_{i}(z^{(j)}) = p(z^{(j)}|x^{(i)};\\theta)$$\n", | |
"\n", | |
"$\\qquad$ Maximization step: set\n", | |
"$$\\theta = \\arg\\max_{\\theta} = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"}\n", | |
"\n", | |
"Note the reason the first step is called expectation is because it chooses the appropriate $Q_{i}$ to define the log-likelihood written as a sum of expectations. We then find $\\theta$ to maximize the log-likelihood in the maximization step." | |
] | |
}, | |
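{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two steps above translate directly into code. Below is a minimal, generic sketch of the EM loop, where `e_step` and `m_step` are hypothetical user-supplied callables for the model-specific computations; a concrete version for a coin-tossing model is implemented in Section 3.2.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"# Generic EM loop. The callables are hypothetical placeholders:\n",
"#   e_step(data, theta) returns the distribution Q over the latent variable,\n",
"#   m_step(data, Q) returns the theta that maximizes the lower bound.\n",
"def em(data, theta_init, e_step, m_step, tol=1e-4, max_iter=100):\n",
"    theta = np.asarray(theta_init, dtype=float)\n",
"    for _ in range(max_iter):\n",
"        Q = e_step(data, theta)                               # E-step\n",
"        theta_new = np.asarray(m_step(data, Q), dtype=float)  # M-step\n",
"        if np.max(np.abs(theta_new - theta)) < tol:           # small update => converged\n",
"            return theta_new\n",
"        theta = theta_new\n",
"    return theta\n",
"```"
]
},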
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 2. Proof of Convergence\n", | |
"To show that the EM algorithm convergences upon the optimal $\\theta$ after $n$ iterations, we need to show that $l(\\theta^{(t+1)}) \\ge l(\\theta^{(t)})$, where $t$ represents the $t^{ith}$ iteration. \n", | |
"\n", | |
"Recall that we have chosen $Q_{i}(z^{(j)}) = p(z^{(j)}|\\:x^{(i)};\\theta)$ because it leads to the equality:\n", | |
"\n", | |
"$$l(\\theta^{(t)}) = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t)})}{Q_{i}^{(t)}(z^{(j)})}$$\n", | |
"\n", | |
"After finding maximiation step, since we set $\\theta$ to be the parameter that maximizes the log-likelihood, we have the inequality:\n", | |
"\n", | |
"$$l(\\theta^{(t+1)}) \\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t+1)})}{Q_{i}^{(t)}(z^{(j)})}$$\n", | |
"\n", | |
"$$ \\ge \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}^{(t)}(z^{(j)})\\:log\\frac{p(z^{(j)}, x^{(i)};\\theta^{(t)})}{Q_{i}^{(t)}(z^{(j)})} = l(\\theta^{(t)})$$\n", | |
"\n", | |
"The inequality involving $l(\\theta^{(t+1)})$ simply comes from Jensen's inequality, using similar logic used before. Hence, we have shown that with each iteration updating $\\theta$, the log-likelihood increases monotonically. \n", | |
"\n", | |
"There is an alternative perspective viewing the EM algorithm as a method of iteratively increasing the lower-bound on the log-likelihood by maximizing it with respect to $Q$ in the expectation step and then maximizing it with respect to $\\theta$. Let us define lower-bound on the log-likelihood $l(\\theta)$ as $J(Q, \\theta)$:\n", | |
"\n", | |
"$$J(Q, \\theta) = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"Since $J(Q, \\theta)$ is the lower-bound on $l(\\theta)$, the largest $J(Q, \\theta)$ can be is equal to $l(\\theta)$. Thus, choosing $Q$ such that $J(Q, \\theta) = l(\\theta)$ maximizes $J(Q, \\theta)$ with respect to $Q$. Then, maximizing $J(Q, \\theta)$ with respect to $\\theta$ is simply the maxization step of the EM algorithm." | |
] | |
}, | |
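{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the lower-bound picture concrete, here is a minimal numeric sketch on a toy observation: 8 heads in $n = 10$ tosses of one of two coins, with a uniform prior over which coin was used and constant factors dropped (as is done in Section 3). The bound $J(Q, \\theta)$ equals $l(\\theta)$ at the $\\theta$ used to construct $Q$, and lies below $l(\\theta)$ for any other $\\theta$.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"x, n = 8, 10                        # toy observation: 8 heads in 10 tosses\n",
"def joint(theta):                   # joint p(x, z; theta) up to constants, uniform prior over z\n",
"    tA, tB = theta\n",
"    return 0.5 * np.array([tA**x * (1 - tA)**(n - x),\n",
"                           tB**x * (1 - tB)**(n - x)])\n",
"\n",
"def log_lik(theta):                 # l(theta) = log sum_z p(x, z; theta)\n",
"    return np.log(joint(theta).sum())\n",
"\n",
"def lower_bound(Q, theta):          # J(Q, theta) = sum_z Q(z) log(p(x, z; theta) / Q(z))\n",
"    return np.sum(Q * np.log(joint(theta) / Q))\n",
"\n",
"theta_t = (0.6, 0.5)\n",
"Q_t = joint(theta_t) / joint(theta_t).sum()                 # E-step posterior at theta_t: bound is tight\n",
"print(log_lik(theta_t), lower_bound(Q_t, theta_t))          # equal\n",
"print(log_lik((0.8, 0.5)), lower_bound(Q_t, (0.8, 0.5)))    # l(theta) >= J(Q_t, theta) elsewhere\n",
"```"
]
},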
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 3. Case Study\n", | |
"Suppose we wish to estimate the biases of two coins $A$ and $B$ as $p_{A}$ and $p_{B}$ respectively, given independent repeated sets of 10 coin tosses, where the identity of the coin used ($A$ or $B$) used in each set of tosses is unknown. \n", | |
"\n", | |
"<img src=\"http://i.imgur.com/lYzq37t.png\", width=250, height=250>\n", | |
"\n", | |
"## 3.1 EM Algorithm Application\n", | |
"Using the same notation in this tutorial. We can see that:\n", | |
"\n", | |
"$\\qquad x^{(i)} = \\text{number of heads in the $i^{ith}$ set of 10 tosses}$\n", | |
"\n", | |
"$\\qquad z^{(j)} = \\text{whether coin used was $A$ or $B$, with j $\\in \\{A, B\\}$}$\n", | |
"\n", | |
"$\\qquad Q_{i}(z^{(j)}) = \\text{probability of $z^{(j)}$ for the $i^{ith}$ set of 10 tosses}$\n", | |
"\n", | |
"Let $n$ be the number of coin tosses in a set and $m$ is the number of sets. We see that the likelihood function $l(\\theta)$ for set $i$ is simply the binomial distribution:\n", | |
"\n", | |
"$$l(\\theta)_{i} = \\binom{n}{x^{(i)}}\\theta^{x^{(i)}}(1-\\theta)^{n-x^{(i)}}$$\n", | |
"\n", | |
"Since the term $\\binom{n}{x^{(i)}}$ is not dependent on $\\theta$, we will drop it when writing our log-likelihood expression for a set of 10 coin tosses. \n", | |
"### 3.1.1 Expectation\n", | |
"Recall that in the expectation step for a sample $i$:\n", | |
"\n", | |
"$$Q_{i}(z^{(j)}) = p(z^{(j)}|x^{(i)};\\theta) = \\frac{p(x^{(i)}, z^{(j)};\\theta)}{\\sum_{j=1}^{u}p(x^{(i)}, z^{(j)};\\theta)}$$\n", | |
"\n", | |
"In our case study this turns out to be for coin $A$:\n", | |
"\n", | |
"$$Q_{i}(A) = \\frac{\\theta_{A}^{x^{(i)}}(1-\\theta_{A}^{n-x^{(i)}})}{\\theta_{A}^{x^{(i)}}(1-\\theta_{A}^{n-x^{(i)}}) + \\theta_{B}^{x^{(i)}}(1-\\theta_{B}^{n-x^{(i)}})}$$\n", | |
"\n", | |
"The term $Q_{i}(B)$ is derived similarly.\n", | |
"\n", | |
"### 3.1.2 Maximization\n", | |
"In the maximization step, we are to find:\n", | |
"\n", | |
"$$\\theta = \\arg\\max_{\\theta} = \\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"Which can be found by finding $\\theta$ such that the first derivative is equal to 0:\n", | |
"\n", | |
"$$\\frac{d}{d\\theta}\\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})} = 0$$\n", | |
"\n", | |
"The numerator $p(x^{(i)}, z^{(j)};\\theta)$ means what is the probability of observing $x^{(i)}$ number of heads when the probability of landing heads is $\\theta$ of coin $z^{(j)}$, and is expressed as the binomial distribution. Leaving out the binomial coefficient leads to the expression:\n", | |
"\n", | |
"$$\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"$$\\propto p_{z_{A}^{(i)}}\\Big[x^{(i)}log(\\theta_{A}) + (n-x^{(i)})log(1-\\theta_{A})-log(p_{z_{A}^{(i)}})\\Big]$$\n", | |
"\n", | |
"$$+ p_{z_{B}}^{(i)}\\Big[x^{(i)}log(\\theta_{B}) + (n-x^{(i)})log(1-\\theta_{B})-log(p_{z_{B}^{(i)}})\\Big]$$\n", | |
"\n", | |
"To find the maximal bias for coin $A$, take the first partial derivative with respect to $\\theta_{A}$:\n", | |
"\n", | |
"$$\\frac{d}{d\\theta_{A}}\\sum_{i=1}^{m}\\sum_{j=1}^{u}Q_{i}(z^{(j)})\\:log\\frac{p(x^{(i)}, z^{(j)};\\theta)}{Q_{i}(z^{(i)})}$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}p_{z_{A}^{(i)}}\\Big[\\frac{x^{(i)}}{\\theta_{A}} - \\frac{n-x^{(i)}}{1-\\theta_{A}}\\Big]$$\n", | |
"\n", | |
"$$=\\sum_{i=1}^{m}\\Big[p_{z_{A}^{(i)}}x^{(i)}-p_{z_{A}^{(i)}}x^{(i)}\\theta_{A} - n\\theta_{A}p_{z_{A}^{(i)}} + \\theta_{A}x^{(i)}p_{z_{A}^{(i)}}\\Big] = 0$$\n", | |
"\n", | |
"$$n\\theta_{A}\\sum_{i=1}^{m}p_{z_{A}^{(i)}} = \\sum_{i=1}^{m}p_{z_{A}^{(i)}}x^{(i)}$$\n", | |
"\n", | |
"$$\\theta_{A} = \\frac{\\sum_{i=1}^{m}p_{z_{A}^{(i)}}x^{(i)}}{n\\sum_{i=1}^{m}p_{z_{A}^{(i)}}}$$\n", | |
"\n", | |
"The $\\theta_{B}$ maximizing the lower bound on the log-likelihood of the data is found in a similar way. " | |
] | |
}, | |
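{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting the two steps together, here is a minimal sketch of a single EM iteration on the five sets of tosses from the diagram (head counts 5, 9, 8, 4, 7), starting from $\\theta_{A} = 0.6$ and $\\theta_{B} = 0.5$: the E-step responsibilities are computed first, then the closed-form update above is applied.\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"heads = np.array([5, 9, 8, 4, 7])   # heads per set, as in the diagram above\n",
"n = 10\n",
"theta_A, theta_B = 0.6, 0.5         # initial guesses\n",
"\n",
"# E-step: responsibility of each coin for each set\n",
"lik_A = theta_A**heads * (1 - theta_A)**(n - heads)\n",
"lik_B = theta_B**heads * (1 - theta_B)**(n - heads)\n",
"q_A = lik_A / (lik_A + lik_B)\n",
"q_B = 1 - q_A\n",
"\n",
"# M-step: closed-form updates derived above\n",
"theta_A_new = np.sum(q_A * heads) / (n * np.sum(q_A))\n",
"theta_B_new = np.sum(q_B * heads) / (n * np.sum(q_B))\n",
"print(round(theta_A_new, 3), round(theta_B_new, 3))   # approximately 0.713 and 0.581\n",
"```"
]
},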
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 3.2 EM Algorithm Implementation\n", | |
"Let us implement an EM algorithm to estimate the biases of two coins in a set of coin tosses. Our dataset will follow the one depicted in the diagram above. We will first implement an EM algorithm class with the following usage:\n", | |
"\n", | |
"```\n", | |
"clf = EM(initial=(0.600, 0.500), convergence=0.01, verbose=False)\n", | |
"```\n", | |
"\n", | |
"1. initial: tuple of size two indicating initial guesses to the biases of the two coins\n", | |
"2. convergence: threshold indicating difference in estimated parameters between iterations below which convergence is established\n", | |
"3. verbose: boolean indicating whether to output the estimated parameter values per iteration\n", | |
"\n", | |
"To train, where `data` is the variable assigned to a 2D numpy array with dimensions $m\\:x\\:n$: \n", | |
"\n", | |
"```\n", | |
"p1, p2 = clf.train(data)\n", | |
"```\n", | |
"Where `p1` and `p2` are the final estimated biases for the two coins. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np \n", | |
"import pandas as pd\n", | |
"\n", | |
"class EM: \n", | |
" # Set initial coin bias guesses\n", | |
" def __init__(self, initial=(0.600, 0.500), convergence=0.01, verbose=False):\n", | |
" self.initial = initial\n", | |
" self.convergence = convergence\n", | |
" self.verbose = verbose\n", | |
"\n", | |
" def train(self, data):\n", | |
" p1 = self.initial[0]\n", | |
" p2 = self.initial[1]\n", | |
" p1Old = 0\n", | |
" p2Old = 0\n", | |
"\n", | |
" def likelihood(n, p, k):\n", | |
" return (p**k)*(1-p)**(n-k)\n", | |
"\n", | |
" # Returns probability distribution coins used for each experiment\n", | |
" def expectation(data, p1, p2):\n", | |
" ncol = data.shape[1]\n", | |
" heads = np.sum(data, axis=1)\n", | |
" l1 = [likelihood(ncol, p1, head) for head in heads]\n", | |
" l2 = [likelihood(ncol, p2, head) for head in heads]\n", | |
" likelihoods = np.array([l1, l2]).T\n", | |
" total = np.sum(likelihoods, axis=1, keepdims=True)\n", | |
" prob1 = likelihoods[:, [0]] / total\n", | |
" prob2 = 1 - prob1\n", | |
" return np.concatenate((prob1, prob2), axis=1)\n", | |
"\n", | |
" # Returns updated coin biases\n", | |
" def maximization(data, probs): \n", | |
" ncol = data.shape[1]\n", | |
" heads = np.sum(data, axis=1, keepdims=True)\n", | |
" totals = ncol * np.sum(probs, axis=0)\n", | |
" numerators = np.sum(heads * probs, axis=0)\n", | |
" return numerators / totals\n", | |
"\n", | |
" i = 1\n", | |
" while (abs(p1-p1Old) + abs(p2-p2Old)) > self.convergence: \n", | |
" p1Old = p1\n", | |
" p2Old = p2\n", | |
" if self.verbose: \n", | |
" print(str(i) + \". \", round(p1, 3), round(p2, 3))\n", | |
" probs = expectation(data, p1, p2)\n", | |
" p1, p2 = maximization(data, probs)\n", | |
" i += 1\n", | |
"\n", | |
" return p1, p2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, let us create our dataset and train to estimate the coin biases:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1. 0.6 0.5\n", | |
"2. 0.713 0.581\n", | |
"3. 0.745 0.569\n", | |
"4. 0.768 0.55\n", | |
"5. 0.783 0.535\n", | |
"6. 0.791 0.526\n", | |
"Final Estimation: \n", | |
"0.794532537994 0.522390437518\n" | |
] | |
} | |
], | |
"source": [ | |
"data = np.array([[1, 0, 0, 0, 1, 1, 0, 1, 0, 1],\n", | |
" [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],\n", | |
" [1, 0, 1, 1, 1, 1, 1, 0, 1, 1], \n", | |
" [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],\n", | |
" [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])\n", | |
"clf = EM(verbose=True)\n", | |
"p1, p2 = clf.train(data)\n", | |
"print(\"Final Estimation: \")\n", | |
"print(p1, p2)" | |
] | |
}, | |
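{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the monotone-improvement property proved in Section 2, the sketch below evaluates the observed-data log-likelihood (up to constants, and assuming an equal prior over the two coins, an assumption implicit in the E-step normalization above) at the initial guesses and at the final estimates; the value at the final estimates should be larger.\n",
"\n",
"```\n",
"heads = np.sum(data, axis=1)   # number of heads per set, using the data defined above\n",
"n = data.shape[1]\n",
"\n",
"# Observed-data log-likelihood up to constants, assuming an equal prior over the two coins\n",
"def log_lik(tA, tB):\n",
"    lA = tA**heads * (1 - tA)**(n - heads)\n",
"    lB = tB**heads * (1 - tB)**(n - heads)\n",
"    return np.sum(np.log(0.5 * (lA + lB)))\n",
"\n",
"print(log_lik(0.600, 0.500))   # at the initial guesses\n",
"print(log_lik(p1, p2))         # at the final estimates; should be larger\n",
"```"
]
},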
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"Let us assess whether this answer makes sense in light of the data that we have. \n", | |
"\n", | |
"Set Index|Number of Heads\n", | |
"-|-\n", | |
"1|5\n", | |
"2|9\n", | |
"3|8\n", | |
"4|4\n", | |
"5|7\n", | |
"\n", | |
"Based on this data, it appears that sets 2, 3, and 5 results from using the coin with estimated bias of 0.79, while sets 1 and 4 results from using the coin with estimated bias of 0.52. The number of heads appearing in our dataset sounds reasonable with this coin to set assignment" | |
] | |
}, | |
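{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make this assignment quantitative with a short sketch that computes, under the final estimates and an equal prior over the two coins, the posterior probability that each set was generated by the higher-bias coin; sets 2, 3, and 5 should come out above one half, and sets 1 and 4 below.\n",
"\n",
"```\n",
"heads = np.sum(data, axis=1)\n",
"n = data.shape[1]\n",
"lik_high = p1**heads * (1 - p1)**(n - heads)   # likelihood under the ~0.79 coin\n",
"lik_low = p2**heads * (1 - p2)**(n - heads)    # likelihood under the ~0.52 coin\n",
"print(np.round(lik_high / (lik_high + lik_low), 2))   # posterior for the ~0.79 coin, per set\n",
"```"
]
},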
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Reference\n", | |
"1. Ng, A. _The EM Algorithm_. Retrieved from http://cs229.stanford.edu/\n", | |
"2. Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899. doi:10.1038/nbt1406" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |