Expectation maximization algorithm
Expectation Maximization (EM) is an algorithm used in many applications; for example, it appears in Hidden Markov Models (HMM) and in Bayesian models. The algorithm has two steps: an Expectation step and a Maximization step. Its main advantage is that it handles problems with incomplete data or latent variables. Put simply, the E-step forms an expectation under the current guess of the parameters, and the M-step maximizes over that expectation to produce the parameters for the next E-step. The algorithm finishes when the estimates converge. We will talk about the main idea of the algorithm and the math behind it.
The most popular example of EM is flipping two coins A and B. Assume we have two biased coins A and B. We run the experiment $m$ times; in each run we pick one coin and flip it $n$ times. The question is: what are the head probabilities of coin A and coin B, $\theta_{A}$ and $\theta_{B}$, in this experiment?
If all information is provided, i.e. we know which coin (A or B) was used in each run, we can calculate the probabilities easily:
$$ \theta_{A} = \frac{\sum \text{heads}_A}{\sum(\text{heads}_A + \text{tails}_A)} \quad (1) $$
$$ \theta_{B} = \frac{\sum \text{heads}_B}{\sum(\text{heads}_B + \text{tails}_B)} \quad (2) $$
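As a minimal sketch of formulas (1) and (2), assuming hypothetical head counts and coin labels for 5 runs of 10 flips (the numbers below are made up for illustration):

```python
# Complete-data case: formulas (1) and (2) are just per-coin head fractions.
# The run data (head counts and coin labels) is hypothetical.
n = 10
heads = [5, 9, 8, 4, 7]
coins = ['B', 'A', 'A', 'B', 'A']

def head_fraction(label):
    h = sum(h for h, c in zip(heads, coins) if c == label)   # total heads for this coin
    runs = sum(1 for c in coins if c == label)                # number of runs with this coin
    return h / (runs * n)

theta_A = head_fraction('A')   # 24 / 30 = 0.8
theta_B = head_fraction('B')   #  9 / 20 = 0.45
```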
But how can we estimate $\theta_{A}$ and $\theta_{B}$ if we don't know which coin was used in each run?
A simple use case
We model the problem with parameters. Assume $x = \{x_1, ..., x_m\}$, where $x_{i}$ is the number of heads in the $i$-th sample, and $z = \{z_1, ..., z_m\}$, where $z_{i} \in \{A, B\}$ is the coin used. The problem then involves the joint probability of $x$ and $z$. In (1) and (2) above both $x$ and $z$ were known, so we could calculate the estimates directly.
We define the log-likelihood of the joint distribution as $\log p(x, z; \theta)$; it is a concave function whose optimum $\theta = (\theta_A, \theta_B)$ is given by (1) and (2).
When $z$ is latent, we use the EM algorithm to estimate the parameter $\theta$ as follows:
1. Assume we start with random initial values $\theta_A = 0.6$ and $\theta_B = 0.5$.
2. E-step: The input is 5 runs of choosing a coin; in each run we flip the chosen coin 10 times. We estimate, for each run, the chance that it came from coin A or coin B.
The pdf is the binomial: $$ pdf = \binom{n}{k} p^{k}(1-p)^{n-k} $$ We omit the binomial coefficient, since it is the same for coins A and B and cancels when normalizing.
For the first sample (5 heads out of 10):
$$ pdf_A = \theta_{A}^{5}(1-\theta_{A})^{10-5} = 0.6^{5} \cdot 0.4^{5} = 0.0007962624 $$
$$ pdf_B = \theta_{B}^{5}(1-\theta_{B})^{10-5} = 0.5^{5} \cdot 0.5^{5} = 0.0009765625 $$
The probability that this sample came from coin A or coin B:
$$ p_A = \frac{0.0007962624}{0.0007962624 + 0.0009765625} \approx 0.45 $$
and
$$ p_B \approx 0.55 $$
Weighting the 5 heads and 5 tails of this sample by these probabilities gives the expected counts:
Coin A: ~2.2H 2.2T, Coin B: ~2.8H 2.8T
Similarly, we build a table of expected head and tail counts under A and B for every sample.
3. M-step:
We recalculate $\theta_A$ and $\theta_B$ using formulas (1) and (2), with the expected counts in place of the observed ones.
4. After several iterations of the E-step and M-step, the algorithm converges. A code sketch of the whole procedure is given below.
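The whole procedure fits in a few lines. Here is a rough Python sketch of steps 1-4, assuming 5 hypothetical runs of 10 flips (the head counts are made up for illustration); as above, the binomial coefficient is dropped because it cancels when normalizing. With $\theta_A = 0.6$ and $\theta_B = 0.5$, the first run's responsibilities come out to roughly 0.45 and 0.55, matching the numbers above.

```python
import numpy as np

n = 10                                # flips per run
heads = np.array([5, 9, 8, 4, 7])     # hypothetical head counts for each run
tails = n - heads

theta_A, theta_B = 0.6, 0.5           # step 1: random initial guesses

for step in range(20):
    # E-step: responsibility of each coin for each run (binomial coefficient cancels)
    like_A = theta_A ** heads * (1 - theta_A) ** tails
    like_B = theta_B ** heads * (1 - theta_B) ** tails
    p_A = like_A / (like_A + like_B)
    p_B = 1 - p_A

    # expected head/tail counts attributed to each coin
    heads_A, tails_A = p_A @ heads, p_A @ tails
    heads_B, tails_B = p_B @ heads, p_B @ tails

    # M-step: formulas (1) and (2) with the expected counts
    theta_A = heads_A / (heads_A + tails_A)
    theta_B = heads_B / (heads_B + tails_B)

print(theta_A, theta_B)
```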
Behind the algorithm
Consider a mixture-of-Gaussians model with observations $x = \{x^{(1)},...,x^{(m)}\}$ and latent assignments $z = \{z^{(1)},...,z^{(m)}\}$:
$$p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$$
Jensen's inequality:
$$E[f(x)] \ge f(E[x]) $$
if $f(x)$ is convex; the inequality is reversed when $f(x)$ is concave.
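As a quick numerical sanity check (the sampling distribution below is arbitrary), $\log$ is concave, so the flipped inequality $E[\log x] \le \log E[x]$ should hold; it is exactly this concave case that the derivation below relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=100_000)  # arbitrary positive samples

print(np.mean(np.log(x)))  # E[log x]  ~= 0.69
print(np.log(np.mean(x)))  # log E[x]  ~= 0.94, larger because log is concave
```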
Log-likelihood: here $p(x^{(i)};\theta)$ is the marginal pdf of sample $i$ under parameters $\theta$:
$$ \ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)};\theta) $$
$$ = \sum_{i=1}^{m} \log\sum_{z^{(i)}} p(x^{(i)}, z^{(i)};\theta) $$
E-step: choose $Q_i$ to be the posterior of $z^{(i)}$ given $x^{(i)}$: $Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$.
$$ \ell(\theta) = \sum_{i=1}^{m}\log\sum_{z^{(i)}}Q_i(z^{(i)})\frac{p(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} $$
$$ \ge \sum_{i=1}^m E_{z^{(i)} \sim Q_i}\left[\log \frac{p(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} \right] = \sum_{i=1}^m \sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} $$
where the inequality is Jensen's, applied to the concave $\log$ function.
This choice of $Q_i$ gives a lower bound on the log-likelihood $\ell(\theta)$ that we are trying to maximize.
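One step worth spelling out: with $Q_i$ chosen as the posterior, the ratio inside the expectation is constant in $z^{(i)}$,
$$ \frac{p(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})} = \frac{p(x^{(i)}, z^{(i)};\theta)}{p(z^{(i)} \mid x^{(i)};\theta)} = p(x^{(i)};\theta), $$
so Jensen's inequality holds with equality and the lower bound touches $\ell(\theta)$ at the current parameters. Maximizing the bound in the M-step therefore cannot decrease the log-likelihood.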
M-step: solve $\theta := \arg\max_{\theta}\sum_{i=1}^m\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)}, z^{(i)};\theta)}{Q_i(z^{(i)})}$.
EM is a powerful algorithm. It is well suited to clustering problems and to models with latent variables or incompletely observed joint distributions.