# Expectation maximization

Chung-Jen Hsu

## Introduction

The Expectation Maximization (EM) technique.

Expectation maximization (EM) is a technique that was originally proposed by (Dempster et al.1977) as an iterative technique to calculate maximum likelihood (ML) even when there is hidden or missing data. According to one of Dr. Scott's lectures on EM in the Fall 2008 Machine Learning course, when this approach was originally presented, it was met with a fair amount of resistance and skepticism. Essentially, EM involves taking two 'guesses', and then taking turns modifying each guess based on the assumption that the other one is actually correct.

In general, EM consists of two distinct steps which are performed iteratively until convergence:

1. E-Step (expectation step): Estimate the values of each missing variable using (1) existing data, (2) current estimates of the values of missing variables, and(3) the current hypothesis</i>. Replace the old estimates with the new ones, go to step 2.
2. M-Step (maximization step): Assume that the missing data values calculated in the previous step are correct, and calculate the new maximum likelihood hypothesis based on these values. Replace the old hypothesis with the new one, go to step 1.
3. </ol>

Although the EM technique may have seemed ad-hoc at first blush, it has proved quite effective in solving some challenging real-world problems and has led to several well-known techniques. These include: the Baum-Welch algorithm for Hidden Markov Models and the K-means clustering algorithms, and EM also used to train Bayesian belief networks and radial basis function networks (Mitchell, 1997, p. 191).

## Maximum-likelihood

There is a density function $p(x|\Theta ),\Theta$ is the set of parameters. There is a data set of size N, it is drawn from this distribution, i.e., $X=\{x_{1},x_{2},...x_{N}\}$. Assuming that these data vectors are independent and identically distributed (i.i.d.). Therefore, the resulting density function for the samples is

$p(X|\Theta )=\prod _{{i=1}}^{N}\ p(x_{i}|\Theta )=\zeta (\Theta |X)$

The function $\zeta (\Theta |X)$ is called the likelihood of the parameters given the data, or simply the likelihood function. It is intuitively that given a data set X, we would like to know the maximum of the likelihood function. Therefore, we need to find $\Theta ^{*}$ that maximize the likelihood function.

$\Theta ^{*}=argmax\zeta (\Theta |X)$

If take the log of the likelihood function, the it becomes maximum log likelihood function which sometimes is easy on calculation or needed to avoid under flow problem.

## Expectation Maximization Algorithm

The EM algorithm is a general method of finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. Assume that a complete data set contain two variables X and Y. If Y is missing for any reason, then

$p(x,y|\Theta )=p(y|x,\Theta )p(x|\Theta )$

with this new density function we can define a new likelihood function $\zeta (\Theta |X,Y)=p(X,Y|\Theta$. Since X is not missing it can be treated as a constant, only Y can be treated as a random variable. $\Theta$ is the current parameter set.

The EM algorithm tries to find the expected value of the log-likelihood of $p(X,Y|\Theta )$ with Y as missing data, X as observed data and current parameter $\Theta$. So we can define

where $\Theta ^{{i-1}}$ is the current parameter estimate used to evaluate the expectation, and $\Theta$ is the new parameters that we optimize to increase Q.

Remember that only Y is the random variable, so we can rewrite the previous equation to be

Note that $f(y|X,\Theta ^{{i-1}})$ is the marginal distribution of the unobserved data and is dependent on both the observed data X and on the current parameters, and $\gamma$ is the space of values y can take on.

The evaluation of this expectation is called the E-step of the algorithm. The second step (the M-step) of the EM algorithm is to maximize the expectation we computed in the first step. That is, we find:

These two steps are repeated as necessary. Each iteration is guaranteed to increase the log likelihood and the algorithm is guaranteed to converge to a local maximum of the likelihood function.

## Applications

In general, the EM algorithm is more of an approach to algorithm design than an algorithm in and of itself (i.e., EM is more of a meta-algorithm). This approach is often utilized whenever an algorithm must start with an initial solution (parameters) and incrementally improve that solution over time by first matching the data to the parameters, then improve the parameters based on the data. Two derived algorithms for which this is true include K-Means clustering and the Baum-Welsh algorithm for hidden Markov model parameter learning.

### K-Means Clustering

In K-Means Clustering (Mitchell, 1997; Jain et al., 1999), the primary objective is to find a near optimal set of cluster centers (i.e., centroids) given a set of data and assign the data to the closest clusters. Here, the algorithm starts with a solution of a random set of cluster centers. Based on this initial set, it then groups all data points with their nearest centers. Based on this grouping, K-Means then computes the center of each cluster according to the distance function utilized and uses these new centers as the current set of cluster centers. Thus, the algorithm has revised its solution based on first matching data points to their respective clusters (improving cluster membership), then improves the set of cluster centers based on cluster membership. This process continues for either a set number of iterations or until the algorithm has converged on a stable set of cluster centers which do not reassign cluster membership or move during center computation.

Fitting this approach with our previous description of the EM algorithm, we can see K-Means tries to maximize the likelihood of a set of cluster centers (the learned parameters) given the data points (the data). This likelihood is incrementally improved by iteratively estimating which cluster each data point belongs to (matching data to the parameters), then re-estimating the cluster centers (matching the parameters to the data). A formal proof deriving K-Means from EM when the clusters are considered to be a mixture of Gaussian distributions over the data points (where each cluster is a distribution) is given on pages 195-196 of Mitchell (1997).

### Baum-Welsh

In order to build models of non-deterministic, partially observable environments, hidden Markov models (HMM) are commonly employed due to their ability to capture probabilistic state transitions and estimate (hidden) environment state from observations emitted by the environment. While these models are very useful, training such models is often challenging due to the difficulty in obtaining true parameters for the probabilistic state transition and observation functions. In other words, we often don't have an oracle to give us the probabilities of transitioning from one state to another or the likelihood of a state emitting a specific observation. In order to overcome this problem, one approach to training an HMM is to use the Baum-Welsh algorithm which follows an EM approach (Durbin et al., 1998).

Baum-Welsh starts with a random guess as to the true probabilities for the state transition and observation functions (Durbin et al., 1998). Using these estimates, Baum-Welsh then computes the forward and backward probability tables for a set of observation emission seqeuences and determines the likelihood that the HMM would have generated the emission sequences. Using these probability tables from the data, the algorithm then revises its state transition and observation functions to better match the data. Like with K-Means, this process continues iterativey until either a specific number of iterations have occurred or until the algorithm converges on a likelihood that does not change from iteration to iteration.

Fitting this approach to our description of EM, we can see that Baum-Welsh tries to maximize the likelihood of the state transition and observation functions (the learned parameters) given a set of observation emission seqeunces (the data). Over time, this likelihood is improved by incrementally computing the forward and backward probabilities tables for the data based on the probability function estimates (matching data to the parameters), then re-estimating the state transition and observation functions (matching the parameters to the data).

## Reference

Jeff A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. International Computer Science Institute, TR-97-021, April 1998.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from in-complete data via the em algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, November 1977.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge University Press, 1998.

A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999. pdf

T.M. Mitchell, Machine Learning, WCB McGraw-Hill: Boston, MA, pp. 191-196, 1997.