Expectation Maximisation Algorithm
1. General Form of EM
1. Concave functions
2. Satisfy Jensen’s inequality: f(E[t]) >= E[f(t)] for concave f
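A quick numerical check of the inequality for the concave function f = log (the sampled distribution below is just an illustrative choice):

```python
import numpy as np

# Jensen's inequality for a concave f (here f = log): f(E[t]) >= E[f(t)]
rng = np.random.default_rng(0)
t = rng.exponential(scale=2.0, size=100_000)  # any positive random variable

lhs = np.log(t.mean())       # f(E[t])
rhs = np.log(t).mean()       # E[f(t)]
print(lhs, rhs, lhs >= rhs)  # the left-hand side is larger, as predicted
```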
3. Kullback-Leibler divergence: measures the difference between two probability distributions
1. KL divergence
2. compares the two densities at each point of the x-axis on a log scale, then takes the expectation
3. not symmetric: KL(q||p) != KL(p||q)
4. equals 0 when a distribution is compared with itself
5. always non-negative
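A minimal sketch of these properties for discrete distributions (the function name and the example probabilities are hypothetical):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) = sum_x q(x) * log(q(x) / p(x)) for discrete distributions."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p), kl_divergence(p, q))  # positive and unequal: not symmetric
print(kl_divergence(q, q))                       # 0.0 when compared with itself
```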
4. EM
1. Build a lower bound on the marginal log-likelihood that depends on the parameters theta and a distribution q over the latent variables
2. Optimize the lower bound until convergence
1. maximize the lower bound with respect to q (E step)
2. fix q and maximize the lower bound with respect to theta (M step)
3. E step
1. The gap between the log-likelihood and the lower bound equals the sum over data points of the KL divergence between q and the posterior p(t | x, theta), so the E step sets q to that posterior (written out after this block)
4. M Step
1. maximize E_q[log p(X, T | theta)] + const with respect to theta
5. Convergence is guaranteed, but only to a local maximum or a saddle point
1. The EM algorithm never makes things worse
2. On each iteration, EM does not decrease the objective
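The E-step statement above, written out (standard EM notation; the calligraphic L denotes the lower bound):

```latex
\log p(X \mid \theta)
  = \mathcal{L}(\theta, q)
  + \sum_{i=1}^{N} \mathrm{KL}\big(q(t_i) \,\|\, p(t_i \mid x_i, \theta)\big),
\qquad
\mathcal{L}(\theta, q)
  = \sum_{i=1}^{N} \mathbb{E}_{q(t_i)} \log \frac{p(x_i, t_i \mid \theta)}{q(t_i)}
```

Since each KL term is non-negative, maximizing over q closes the gap by setting q(t_i) = p(t_i | x_i, theta); maximizing over theta with q fixed is the M step.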
5. Example: a discrete mixture model
6. Summary
1. A method for training latent-variable models; often more efficient than general-purpose optimizers such as stochastic gradient descent
2. Handles missing data
3. Replaces a hard problem (maximizing the marginal log-likelihood) with a sequence of simpler tasks (maximizing a lower bound)
4. Guaranteed to converge, but only to a local maximum or saddle point
5. Helps with complicated parameter constraints
6. Has several extensions
2. Applications of EM
1. E step: set q to the posterior distribution of the latent variable t
2. M step: maximize the expected joint log-likelihood while keeping q fixed
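A minimal sketch of these two steps for a 1-D Gaussian mixture (the function name, initialisation, and toy data are illustrative assumptions):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Bare-bones EM for a 1-D Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(k, 1.0 / k)                   # mixture weights
    mu = rng.choice(x, size=k, replace=False)  # component means
    var = np.full(k, np.var(x))                # component variances

    for _ in range(n_iter):
        # E step: q(t_i = c) = posterior responsibility of component c for point i
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: closed-form maximisation of E_q[log p(X, T | theta)]
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# toy data: two well-separated clusters
x = np.concatenate([np.random.normal(-3, 1, 200), np.random.normal(3, 1, 200)])
print(em_gmm_1d(x))
```

Each pass through the loop can only keep or increase the marginal log-likelihood, which is the convergence guarantee noted above.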
3. K-means
1. Randomly initialize the parameters theta
2. Repeat until convergence: assign each point to its closest centroid, then update the centroids
3. K-means is a special case of EM
1. where q is restricted to delta functions (hard 0/1 assignments)
2. with fixed, shared spherical covariance matrices (circles of a fixed width)
4. K-means is faster but less flexible than a GMM
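A minimal sketch of this hard-assignment special case (Lloyd's algorithm); the function name and toy data are illustrative assumptions:

```python
import numpy as np

def kmeans(x, k=2, n_iter=50, seed=0):
    """K-means as EM with delta-function q and fixed spherical covariances."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # "E step": hard-assign each point to its closest centroid (q is 0 or 1)
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # "M step": move each centroid to the mean of its assigned points
        centroids = np.array([x[labels == c].mean(axis=0) if np.any(labels == c)
                              else centroids[c] for c in range(k)])
    return centroids, labels

# toy data: two 2-D clusters
x = np.vstack([np.random.randn(100, 2) - 3, np.random.randn(100, 2) + 3])
print(kmeans(x)[0])
```

Replacing the hard argmin with soft responsibilities and learning the covariances recovers the full GMM version of EM sketched earlier.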
4. Probabilistic PCA
1. Dimensionality reduction
1. useful when two variables are highly correlated
2. project two dimensions onto one while keeping as much information as possible
2. PCA finds the best linear transformation for projecting the data, e.g. from 2-D down to 1-D
3. In the presence of missing values, use probabilistic PCA
4. The marginal-likelihood integral then becomes intractable
5. The full exact EM algorithm cannot optimize an intractable objective, but it can be approximated to build an approximate EM algorithm
6. Advantages
1. missing data
2. straightforward iterative scheme for large dimensionalities
3. Can fit a mixture of PPCA models
4. hyperparameter tuning
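For the complete-data case (no missing values) the E and M steps of probabilistic PCA have closed form; a sketch of that iterative scheme, with the function name, initialisation, and toy data as illustrative assumptions:

```python
import numpy as np

def ppca_em(X, q=1, n_iter=100, seed=0):
    """EM updates for probabilistic PCA with an isotropic noise model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                      # centred data
    W = rng.normal(size=(d, q))      # loading matrix
    sigma2 = 1.0                     # noise variance

    for _ in range(n_iter):
        # E step: posterior moments of the latent variables z_i
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                       # E[z_i], shape (n, q)
        Ezz = n * sigma2 * Minv + Ez.T @ Ez      # sum_i E[z_i z_i^T]

        # M step: closed-form updates for W and sigma^2
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (n * d)
    return W, mu, sigma2

# toy 2-D data that is almost 1-D (the two variables are highly correlated)
t = np.random.randn(500)
X = np.column_stack([t, 0.9 * t + 0.1 * np.random.randn(500)])
print(ppca_em(X, q=1))
```

As the notes point out, with missing values the required integrals can become intractable, which is what motivates the approximate EM variant.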