Updated: 2024-10-31 Statistical Simulation, Wei Li
Suppose we have \(n\) observables \(x_{1},x_{2},...,x_{n} \stackrel{i.i.d.}{\sim} g(x|\theta)\). Let \(\boldsymbol{x}=\{x_i\}_{i=1}^n\). Our goal is to compute \[\begin{align*} \hat{\theta} = \arg\max_{\theta} L(\theta | \boldsymbol{x}), \qquad L(\theta | \boldsymbol{x}) = \prod^{n}_{i=1} g(x_{i} | \theta)= f(\boldsymbol{x}|\theta), \end{align*}\] where \(L(\theta | \boldsymbol{x})\) is the observed data likelihood function and \(f(\boldsymbol{x}|\theta)\) is the joint density.
Suppose there are some unobservables \(\boldsymbol{z}\) such that the joint density of \((\boldsymbol{x},\boldsymbol{z})\) satisfies \(f(\boldsymbol{x}|\theta)=\int f(\boldsymbol{x}, \boldsymbol{z}|\theta)\, d\boldsymbol{z}\). Denote \(L^{c}(\theta | \boldsymbol{x},\boldsymbol{z}) = f(\boldsymbol{x},\boldsymbol{z}|\theta)\), which is the complete data likelihood function.
Since \[\begin{align*} (\boldsymbol{x},\boldsymbol{z}) \sim f(\boldsymbol{x},\boldsymbol{z}|\theta), \end{align*}\] the conditional distribution of \(\boldsymbol{z}\) given the observables \(\boldsymbol{x}\) is \[\begin{align*} f(\boldsymbol{z}|\boldsymbol{x},\theta) = \frac{f(\boldsymbol{x},\boldsymbol{z}|\theta)}{f(\boldsymbol{x}|\theta)}. \end{align*}\]
Applying this result to the observed data log-likelihood: \[\begin{align*} \log L(\theta | \boldsymbol{x}) & = \log f(\boldsymbol{x}|\theta) \\ & = \log \frac{f(\boldsymbol{x},\boldsymbol{z}|\theta)}{f(\boldsymbol{z}|\boldsymbol{x},\theta)} \\ & = \log f(\boldsymbol{x},\boldsymbol{z}|\theta) - \log f(\boldsymbol{z}| \boldsymbol{x},\theta). \end{align*}\]
We shall use the notations:
\[\begin{align*} E_{g}(f(\boldsymbol{x})) = \int f(\boldsymbol{x})g(\boldsymbol{x}) d\boldsymbol{x}, \qquad E_{\boldsymbol{z}|\boldsymbol{x}}(h(\boldsymbol{z},\boldsymbol{x})) = \int h(\boldsymbol{z},\boldsymbol{x})f(\boldsymbol{z}|\boldsymbol{x}) d\boldsymbol{z}. \end{align*}\]
Let \(\theta^{(0)}\) be our initial guess of the parameter, and take the conditional expectation of \(\boldsymbol{z}\) given \(\boldsymbol{x}\) (that is, the integral is taken with respect to \(f(\boldsymbol{z}| \boldsymbol{x},\theta^{(0)})\)) on both sides:
\[\begin{align*} E_{\theta^{(0)}}[\log L(\theta|\boldsymbol{x})] = \log L(\theta|\boldsymbol{x}) & = E_{\theta^{(0)}}[\log f(\boldsymbol{x},\boldsymbol{z}|\theta)|\boldsymbol{x}] - E_{\theta^{(0)}}[\log f(\boldsymbol{z}|\boldsymbol{x},\theta)|\boldsymbol{x}] \\ & = Q(\theta | \theta^{(0)},\boldsymbol{x}) - K(\theta | \theta^{(0)},\boldsymbol{x}), \end{align*}\] where the first equality holds because \(\log L(\theta|\boldsymbol{x})\) does not depend on \(\boldsymbol{z}\), and we define \[\begin{align*} Q(\theta | \theta^{(0)},\boldsymbol{x}) &:= E_{\theta^{(0)}}[\log f(\boldsymbol{x},\boldsymbol{z}|\theta)|\boldsymbol{x}], \\ K(\theta | \theta^{(0)},\boldsymbol{x}) &:= E_{\theta^{(0)}}[\log f(\boldsymbol{z}|\boldsymbol{x},\theta)|\boldsymbol{x}]. \end{align*}\]
It turns out that, for any candidate \(\theta'\) for the next iterate, \[\begin{align*} K(\theta'|\theta^{(0)},\boldsymbol{x}) \le K(\theta^{(0)} | \theta^{(0)},\boldsymbol{x}). \end{align*}\]
To see this (stated for a generic iterate \(\theta^{(m)}\)), we show that for any \(\theta'\): \[\begin{align*} E_{\theta^{(m)}}(\log f(\boldsymbol{z}|\boldsymbol{x},\theta')|\boldsymbol{x}) \le E_{\theta^{(m)}}(\log f(\boldsymbol{z}|\boldsymbol{x},\theta^{(m)})|\boldsymbol{x})= \int \log f(\boldsymbol{z}| \boldsymbol{x},\theta^{(m)})f(\boldsymbol{z}|\boldsymbol{x},\theta^{(m)})\, d\boldsymbol{z}. \end{align*}\]
Call \(g(\boldsymbol{z}) = f(\boldsymbol{z}|\boldsymbol{x},\theta')\) and \(h(\boldsymbol{z}) = f(\boldsymbol{z}|\boldsymbol{x},\theta^{(m)})\). It suffices to show \[\begin{align*} E_{h}\Big(\log\frac{h(z)}{g(z)}\Big) \ge 0. \end{align*}\] The term \(E_{h}\Big(\log\frac{h(z)}{g(z)}\Big)\) is the so-called Kullback-Leibler divergence \(KL(h\,\|\,g)\) between the distributions \(h\) and \(g\). To see the inequality, we use Jensen's inequality (applied to the convex function \(-\log\)): \[\begin{align*} \text{LHS} & = \int \log \Big(\frac{h(z)}{g(z)}\Big) h(z)\, dz \\ & = -\int \log \Big(\frac{g(z)}{h(z)}\Big) h(z)\, dz \\ & \ge - \log \int \frac{g(z)}{h(z)}h(z)\, dz = -\log \int g(z)\, dz = 0, \end{align*}\] since \(g\) is a density and integrates to one. In fact, \(E_{h}(\log(h(z)/g(z))) = 0\) if and only if \(h=g\).
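The non-negativity of the KL divergence can also be checked numerically. Below is a small Monte Carlo sketch; the two Gaussian densities \(h=N(0,1)\) and \(g=N(1,1)\) are hypothetical choices for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical choices for illustration: h = N(0, 1), g = N(1, 1).
h = stats.norm(loc=0.0, scale=1.0)
g = stats.norm(loc=1.0, scale=1.0)

# Monte Carlo estimate of E_h[log(h(z) / g(z))] = KL(h || g).
z = h.rvs(size=200_000, random_state=rng)
kl_mc = np.mean(h.logpdf(z) - g.logpdf(z))

# For two normals with equal unit variance, KL(h || g) = (mu_h - mu_g)^2 / 2.
kl_exact = 0.5 * (0.0 - 1.0) ** 2
print(f"Monte Carlo KL estimate: {kl_mc:.4f} (exact: {kl_exact:.4f}, always >= 0)")
```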
So, to increase \(\log L(\theta|\boldsymbol{x})\), it suffices to maximize \(Q(\theta | \theta^{(0)},\boldsymbol{x})\) over \(\theta\): any \(\theta\) that increases \(Q(\theta | \theta^{(0)},\boldsymbol{x})\) cannot decrease \(\log L(\theta|\boldsymbol{x})\), because the \(K\) term can only decrease. Maximizing \(Q(\theta | \theta^{(0)},\boldsymbol{x})\) over \(\theta\) yields the next iterate \(\theta^{(1)}\); maximizing \(Q(\theta | \theta^{(1)},\boldsymbol{x})\) over \(\theta\) then yields \(\theta^{(2)}\), and the process continues until convergence.
EM algorithm
Based on this result, the EM algorithm consists of two main steps; at the \(m\)-th iteration:
E-step: evaluate \[ Q(\theta\mid\theta^{(m)}, {X})=E_{\theta^{(m)}}\big[\log L^c(\theta\mid {X}, {Z})\mid {X}\big], \]
where \({X}\) is observable and \({Z}\) is unobservable; \(E_{\theta^{(m)}}\) denotes the conditional expectation of \(Z\) given \(X\), with conditional density \(f( {Z}\mid {X},\theta^{(m)})\); and \(\log L^c(\theta\mid {X}, {Z})\) is the complete data log-likelihood.
M-step: \(\theta^{(m+1)}\gets\mathop{\arg\max}\limits_{\theta}\ \ Q(\theta\mid\theta^{(m)}, {X})\)
Example: Suppose \(X_i\overset{i.i.d.}{\sim}\frac{1}{4}N(\mu_1,1)+\frac{3}{4}N(\mu_2,1)\), with \(\theta=(\mu_1,\mu_2)\).
The observed data log-likelihood is \[\log L(\theta\mid {X})=\sum\limits_{i=1}^{n}\log\big(\frac{1}{4}f(X_i\mid\mu_1)+\frac{3}{4}f(X_i\mid\mu_2)\big),\] where \(L(\theta\mid {X})=\prod\limits_{i=1}^{n}\big(\frac{1}{4}f(X_i\mid\mu_1)+\frac{3}{4}f(X_i\mid\mu_2)\big)\) and \(f(\cdot\mid\mu_j)\) denotes the \(N(\mu_j,1)\) density: \[ f(X_i\mid \mu_j)=\frac{1}{\sqrt{2\pi}}e^{-\frac{(X_i-\mu_j)^2}{2}}. \] Define latent (binary) variables \(Z_i \overset{i.i.d.}{\sim} Bernoulli(\frac{1}{4})\): \[P(Z_i=1)=\frac{1}{4}, \qquad P(Z_i=0)=\frac{3}{4}.\]
Assuming \(W_{1i}\sim N(\mu_1,1)\) (with density \(f_1\)), \(W_{2i}\sim N(\mu_2,1)\) (with density \(f_2\)), \(Z_i\perp(W_{1i},W_{2i})\), and \(\{(Z_i,W_{1i},W_{2i})\}_{i=1}^n\) independent across \(i\), we can write \[ X_i=Z_iW_{1i}+(1-Z_i)W_{2i}. \]
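To make this latent-variable representation concrete, here is a short simulation sketch; the true means \(\mu_1=-2\) and \(\mu_2=3\) are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
mu1_true, mu2_true = -2.0, 3.0     # hypothetical true means, for illustration only

# Latent labels and the two component draws.
Z = rng.binomial(1, 0.25, size=n)          # P(Z_i = 1) = 1/4
W1 = rng.normal(mu1_true, 1.0, size=n)     # W_{1i} ~ N(mu1, 1)
W2 = rng.normal(mu2_true, 1.0, size=n)     # W_{2i} ~ N(mu2, 1)

# Observed data: X_i = Z_i * W_{1i} + (1 - Z_i) * W_{2i}.
X = Z * W1 + (1 - Z) * W2
```

Only \(X\) is handed to the EM algorithm; \(Z\), \(W_{1}\), and \(W_{2}\) play the role of the unobservables.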
The complete data likelihood function of \(({X}, {Z})\) is \[ \begin{aligned} L^c(\theta\mid{X}, {Z})&=\prod\limits_{i=1}^{n}P(X_i,Z_i)\\ &=\prod\limits_{i=1}^{n}\big[P(X_i\mid Z_i=1)P(Z_i=1)\big]^{Z_i}\big[P(X_i\mid Z_i=0)P(Z_i=0)\big]^{1-Z_i}. \end{aligned} \]
So the complete data log-likelihood is \[ \log\big(L^c(\theta\mid {X}, {Z})\big)=\sum\limits_{i=1}^{n}\big[Z_i\log f(X_i\mid \mu_1)+Z_i\log\frac{1}{4}+(1-Z_i)\log f(X_i\mid \mu_2)+(1-Z_i)\log\frac{3}{4}\big]. \] Let \(\theta^{(0)}=(\mu_1^{(0)},\mu_2^{(0)})\) be the current guess; then \[ \begin{aligned} Q(\theta\mid\theta^{(0)}, {X})&=E_{\theta^{(0)}}\big[\log L^c(\theta\mid {X}, {Z})\mid {X}\big]\\ &=-\sum\limits_{i=1}^{n}\Big\{\frac{1}{2}E_{\theta^{(0)}}[Z_i(X_i-\mu_1)^2\mid X_i]+\frac{1}{2}E_{\theta^{(0)}}\big[(1-Z_i)(X_i-\mu_2)^2\mid X_i\big]\Big\}\\ &\ \ \ \ \ +\sum\limits_{i=1}^{n}\Big\{\log(\frac{1}{4}) E_{\theta^{(0)}}(Z_i\mid X_i)+\log(\frac{3}{4}) E_{\theta^{(0)}}(1-Z_i\mid X_i)\Big\}+C, \end{aligned} \]
where \(C\) collects terms not depending on \(\theta\) and \[ E_{\theta^{(0)}}[Z_i(X_i-\mu_1)^2\mid X_i]=(X_i-\mu_1)^2E_{\theta^{(0)}}[Z_i\mid X_i]. \] Since \[E[Z\mid X=x]=\frac{P(X=x\mid Z=1)P(Z=1)}{P(X=x\mid Z=1)P(Z=1)+P(X=x\mid Z=0)P(Z=0)},\] we can calculate \(E_{\theta^{(0)}}[Z_i\mid X_i]\):
\[ E_{\theta^{(0)}}\big[Z_i\mid X_i\big]=\frac{\frac{1}{4}f_1(X_i\mid \mu_1^{(0)})}{\frac{1}{4}f_1(X_i\mid\mu_1^{(0)})+\frac{3}{4}f_2(X_i\mid\mu_2^{(0)})}=:r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)}). \]
Thus \(Q(\theta\mid\theta^{(0)}, {X})\) can be expressed as \[ -\sum\limits_{i=1}^{n}\Big\{\frac{1}{2}(X_i-\mu_1)^2r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})+\frac{1}{2}(X_i-\mu_2)^2\big(1-r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})\big)\Big\}+\text{const}. \]
Setting the derivatives of \(Q\) with respect to \(\theta\) to zero, \[ \begin{cases} \frac{\partial Q(\theta\mid\theta^{(0)}, {X})}{\partial \mu_1}\overset{set}{=}0\\ \frac{\partial Q(\theta\mid\theta^{(0)}, {X})}{\partial \mu_2}\overset{set}{=}0, \end{cases} \]
we obtain the solutions \(\mu_1^{(1)}\) and \(\mu_2^{(1)}\):
\[ \begin{cases} \mu_1^{(1)}=\frac{\sum\limits_{i=1}^{n}r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})X_i}{\sum\limits_{i=1}^{n}r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})}\\ \mu_2^{(1)}=\frac{\sum\limits_{i=1}^{n}\big(1-r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})\big)X_i}{\sum\limits_{i=1}^{n}\big(1-r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})\big)}. \end{cases} \]
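These closed-form updates make each EM iteration very cheap. Below is a minimal sketch in Python; the starting values and the simulated data (hypothetical true means \(-2\) and \(3\)) are for illustration only, not a production implementation.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(X, mu1, mu2, max_iter=200, tol=1e-8):
    """EM for X_i ~ (1/4) N(mu1, 1) + (3/4) N(mu2, 1), unknown (mu1, mu2)."""
    for _ in range(max_iter):
        # E-step: responsibilities r_i = E[Z_i | X_i] under the current (mu1, mu2).
        num = 0.25 * norm.pdf(X, loc=mu1, scale=1.0)
        den = num + 0.75 * norm.pdf(X, loc=mu2, scale=1.0)
        r = num / den
        # M-step: weighted means, as in the closed-form updates above.
        mu1_new = np.sum(r * X) / np.sum(r)
        mu2_new = np.sum((1 - r) * X) / np.sum(1 - r)
        if max(abs(mu1_new - mu1), abs(mu2_new - mu2)) < tol:
            return mu1_new, mu2_new
        mu1, mu2 = mu1_new, mu2_new
    return mu1, mu2

# Example usage with simulated data (hypothetical true means -2 and 3).
rng = np.random.default_rng(42)
Z = rng.binomial(1, 0.25, size=500)
X = np.where(Z == 1, rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500))
print(em_mixture(X, mu1=0.0, mu2=1.0))
```

The loop alternates the responsibility calculation (E-step) with the weighted-mean updates (M-step), stopping when the parameters stabilize.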
Example: Let \(X_i\overset{i.i.d.}{\sim}f(x-\theta)\), where \(f\) is the density of \(N(0,1)\) and \(F\) is the CDF of \(N(0,1)\), so \(X_i\sim N(\theta,1)\). The goal is to estimate \(\theta\). However, the \(X_i\) are not fully observed: they are right censored at a fixed point \(a\). Assume we observe \(Y_i=\min(X_i,a)\) and the censoring indicator \(\delta_i=1(X_i\le a)\), and (relabeling if necessary) that the first \(m\) observations are uncensored.
That is, \((Y_1, Y_2, \dots, Y_m)\) are uncensored and \((Y_{m+1}, Y_{m+2}, \dots, Y_n)\) are censored. The observed data likelihood function is \[ \begin{aligned} L(\theta\mid Y_1, Y_2, \dots, Y_n, \delta_1, \delta_2, \dots, \delta_n)&=\prod\limits_{i=1}^{n}\big(f(Y_i-\theta)\big)^{\delta_i}\big(P(Y_i=a)\big)^{1-\delta_i}\\ &=\prod\limits_{i=1}^{n}\big(f(Y_i-\theta)\big)^{\delta_i}\big(1-P(X_i\le a)\big)^{1-\delta_i}\\ &=\prod\limits_{i=1}^{n}\big(f(Y_i-\theta)\big)^{\delta_i}\big(1-F(a-\theta)\big)^{1-\delta_i}. \end{aligned} \]
Let \({Z}\) be the vector of the unobservables \(X_{m+1}, X_{m+2}, \dots, X_n\). The complete data likelihood function \(L^c(\theta\mid X_1, X_2, \dots, X_n)\) (equivalently \(L^c(\theta\mid Y_1, \dots, Y_m; {Z})\), since \(X_i=Y_i\) for \(i\le m\)) can be written as \[ L^c(\theta\mid Y_1, Y_2, \dots, Y_m; {Z})=\prod\limits_{i=1}^{m}f(Y_i-\theta)\prod\limits_{i=m+1}^{n}f(X_i-\theta). \] The complete data log-likelihood is \[ \log L^c(\theta\mid Y_1, Y_2, \dots, Y_m; {Z})=\sum\limits_{i=1}^{m}\log f(Y_i-\theta)+\sum\limits_{i=m+1}^{n}\log f(X_i-\theta). \]
Let \[ \begin{aligned} Q(\theta\mid\theta^{(0)};Y_1, \dots, Y_n, \delta_1, \dots, \delta_n)&=E_{\theta^{(0)}}\big[\log L^c(\theta\mid X_1, \dots, X_n)\mid Y_1, \dots, Y_n, \delta_1, \dots, \delta_n\big]\\ &=-\frac{1}{2}\sum\limits_{i=1}^{m}(Y_i-\theta)^2-\frac{1}{2}\sum\limits_{i=m+1}^{n}E_{\theta^{(0)}}\big[(X_i-\theta)^2\mid Y_1, \dots, Y_n, \delta_1, \dots, \delta_n\big]+\text{const}, \end{aligned} \] where the constant does not depend on \(\theta\).
For \(i=m+1, \dots, n\) we have \(\delta_i=0\), so \[ E_{\theta^{(0)}}\big[(X_i-\theta)^2\mid Y_i=a, \delta_i=0\big]=E_{\theta^{(0)}}\big[(X_i-\theta)^2\mid X_i\ge a \big]. \]
We can calculate \(\theta^{(1)}\) by: \[ \frac{\partial Q(\theta\mid\theta^{(0)};Y_1, \dots, Y_n, \delta_1, \dots, \delta_n)}{\partial\theta}\overset{set}{=}0, \]
which can be expressed as \[ \sum\limits_{i=1}^{m}(Y_i-\theta)+\sum\limits_{i=m+1}^{n}E_{\theta^{(0)}}\big[(X_i-\theta)\mid X_i\ge a\big]\overset{set}{=}0, \] i.e., \[ \sum\limits_{i=1}^{m}Y_i-m\theta+\sum\limits_{i=m+1}^{n}E_{\theta^{(0)}}[X_i\mid X_i\ge a]-(n-m)\theta\overset{set}{=}0. \]
We can calculate the solution \(\theta^{(1)}\): \[ \theta^{(1)}=\frac{(\sum\limits_{i=1}^{m}Y_i)+(n-m)E_{\theta^{(0)}}[X_i\mid X_i\ge a]}{n}. \]
Finally, we can show that \[ E_{\theta^{(0)}}[X_i\mid X_i\ge a]=\theta^{(0)}+\frac{\phi(a-\theta^{(0)})}{1-\Phi(a-\theta^{(0)})}, \] where \(\phi\) and \(\Phi\) denote the standard normal density and CDF (the \(f\) and \(F\) above).
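Putting the pieces together, here is a minimal sketch of this EM iteration in Python; the censoring point \(a=1\) and the true \(\theta=0.5\) used to simulate data are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_censored_normal(Y, delta, a, theta=0.0, max_iter=200, tol=1e-10):
    """EM for theta in X_i ~ N(theta, 1), right censored at a.

    Y[i] = X_i if delta[i] == 1 (uncensored), else Y[i] = a (censored).
    """
    n = len(Y)
    n_cens = np.sum(delta == 0)
    sum_obs = np.sum(Y[delta == 1])
    for _ in range(max_iter):
        # E-step: truncated-normal mean E[X | X >= a] under the current theta.
        z = a - theta
        e_trunc = theta + norm.pdf(z) / (1.0 - norm.cdf(z))
        # M-step: closed-form update derived above.
        theta_new = (sum_obs + n_cens * e_trunc) / n
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Example usage with simulated right-censored data (hypothetical theta = 0.5, a = 1).
rng = np.random.default_rng(1)
X = rng.normal(0.5, 1.0, size=1000)
a = 1.0
delta = (X <= a).astype(int)
Y = np.minimum(X, a)
print(em_censored_normal(Y, delta, a))
```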
EM for missing data
MAR (missing at random) assumption: \[\begin{align*} P(R=r \mid Z)=P(R=r \mid Z_{(r)}), \end{align*}\] where \(R\) denotes the vector of missingness indicators (the missingness pattern), \(Z_{(r)}\) denotes the components of \(Z\) observed under pattern \(r\), and \(Z_{(\bar{r})}\) the missing components.
Selection model factorization: suppose a joint model \(P(z, r ; \psi, \theta)\) for the density of \((R, Z)\) can be factorized as
\[\begin{align*} \begin{aligned} P(z, r ; \psi, \theta) & =P(Z=z, R=r ; \psi, \theta) \\ & =P(R=r|Z=z ; \psi) P(Z=z ; \theta) \\ & =P(R=r| Z_{(r)}=z_{(r)} ; \psi) P(Z=z ; \theta), \end{aligned} \end{align*}\] where the last equation is due to MAR. This factorization (together with MAR and the assumption that \(\psi\) and \(\theta\) share no elements) is called the separability condition; it implies that the factor involving \(\psi\) carries no information about \(\theta\). Therefore, if we are interested in the estimation of \(\theta\) alone, it suffices to consider \(P(Z=z ; \theta).\)
Now, the likelihood for the observed data point \(\left(R=r, Z_{(R)}=z_{(r)}\right)\) is obtained accordingly: \[\begin{align*} P\left(R=r, Z_{(R)}=z_{(r)} ; \psi, \theta\right) &=\int P(z, r ; \psi, \theta)\, d z_{(\bar{r})} \\ & =P\left(R=r \mid Z_{(r)}=z_{(r)} ; \psi\right) \int P(z ; \theta)\, d z_{(\bar{r})} \\ & =P\left(R=r \mid Z_{(r)}=z_{(r)} ; \psi\right) P\left(Z_{(r)}=z_{(r)} ; \theta\right). \end{align*}\]
The likelihood based on the observed data \(\left(R_i=r_i, Z_{\left(R_i\right)i}=z_{\left(r_i\right)}\right)_{i=1}^n\) is therefore given by
\[\begin{align*} & \prod_i \prod_r P\left(R_i=r , Z_{(r) i}=z_{(r)} ; \psi, \theta\right)^{1\left(R_i=r\right)} \\ = & \prod_i\left(\prod_r P\left(R_i=r\mid Z_{(r) i}=z_{(r)} ; \psi \right)^{1\left(R_i=r\right)} \prod_r P\left(Z_{(r) i}=z_{(r)} ; \theta\right)^{1\left(R_i=r\right)}\right). \end{align*}\] Goal: Suppose we are only interested in \(\theta\); the MLE then maximizes the following objective function over \(\theta\): \[\begin{align*} &\sum_{i=1}^n \sum_r 1\left(R_i=r\right) \log P\left(Z_{(r)i}=z_{(r)}; \theta\right) \\ =&\sum_{i=1}^n \sum_r \left[1\left(R_i=r\right) \log \left( \int P\left(z_{(r)i}, z_{(\bar{r})i}; \theta\right) d{z_{(\bar{r})}}\right)\right]. \end{align*}\]
EM algorithm: Let’s consider how the EM algorithm works in this situation. The ideal full (complete) data likelihood is given by \[ \prod_{i} \prod_{r} P(R_i=r, Z_{(r)i}, Z_{(\bar{r})i}; \psi, \theta)^{1\{R_i=r\}}. \] By MAR and separability, it suffices to consider the \(\theta\)-part of the full data likelihood: \[ \prod_{i} \prod_{r} P(Z_{(r)i}, Z_{(\bar{r})i}; \theta)^{1\{R_i=r\}}. \]
In the same spirit of the EM algorithm discussed previously, the E-step is \[\begin{align*} Q\left(\theta\mid \theta^{(t-1)}\right)=\sum_{i=1}^n E_{\theta^{(t-1)}}\Big(\log \prod_{r}P(Z_i;\theta)^{1(R_i=r)}\,\Big|\,R_i, Z_{(R_i)i}\Big)= \sum_{i=1}^n \sum_{r}1\left(R_i=r\right) E_{\theta^{(t-1)}}\left(\log P\left(Z_i; \theta\right) \mid Z_{(r) i}\right), \end{align*}\] where the expectation is conditional on the observables \((R_i, Z_{(R_i)i})_{i=1}^n\) under parameter \(\theta^{(t-1)}\). The second equality holds because, by MAR, for any \(\theta'\) \[\begin{align*} \sum_{r}1\left(R_i=r\right) E_{\theta^{\prime}}\left(\log P\left(Z_i; \theta\right) \mid R_i=r, Z_{(r)i}\right) =\sum_{r}1\left(R_i=r\right) E_{\theta^{\prime}}\left(\log P\left(Z_i; \theta\right) \mid Z_{(r) i}\right). \end{align*}\]
The M-step: \[\theta^{(t)} =\arg \max_{\theta} Q(\theta| \theta^{(t-1)}).\]
Example: Suppose \(Z\) follows a \(p\)-dimensional \(N(\mu,\Sigma)\); here \(\theta:=(\mu,\Sigma)\). Note that \[\begin{align*} &E_{\theta^{(t-1)}}\left(\log P\left(Z_i; \theta\right) \mid Z_{(r) i}\right) \\ =&-\frac{1}{2} \log |2 \pi {\Sigma}|-\frac{1}{2} {E}_{\theta^{(t-1)}}\left(\left(Z_i-{\mu}\right)^{\top} {\Sigma}^{-1}\left(Z_i-{\mu}\right) \mid Z_{(r) i} \right) \\ =&-\frac{1}{2} \log |{\Sigma}|-\frac{p}{2} \log (2 \pi)-\frac{1}{2} \operatorname{tr}\left({\Sigma}^{-1} {E}_{\theta^{(t-1)}}({S}({\mu})\mid Z_{(r) i})\right), \end{align*}\] where \({E}_{\theta^{(t-1)}}({S}({\mu})\mid Z_{(r) i}):= {E}_{\theta^{(t-1)}}(Z_i Z_i^{\top}\mid Z_{(r) i})+{\mu} {\mu}^{\top}-2 {\mu} {E}_{\theta^{(t-1)}}(Z_i\mid Z_{(r) i})^{\top}.\) The key is to compute \({E}_{\theta^{(t-1)}}(Z_i\mid Z_{(r) i})\) and \({E}_{\theta^{(t-1)}}(Z_i Z_i^{\top}\mid Z_{(r) i})\); these are the expected sufficient statistics and can be computed using the conditional mean and covariance formulas for the multivariate Gaussian distribution.
Gathering all observations across missing patterns, the M-step solution is the usual Gaussian MLE evaluated at the expected sufficient statistics: \[\begin{align*} {\mu}^{(t)} & =\frac{1}{n}\sum_{i=1}^n \sum_r 1(R_i=r){E}_{\theta^{(t-1)}}(Z_i\mid Z_{(r) i}), \\ {\Sigma}^{(t)} & =\frac{1}{n} \sum_{i=1}^n \sum_r 1(R_i=r){E}_{\theta^{(t-1)}}(Z_i Z_i^{\top}\mid Z_{(r) i})-{\mu}^{(t)}{\mu}^{(t)\top}. \end{align*}\]
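A minimal sketch of these updates, assuming the data are stored in a numpy array with np.nan marking missing entries (the function name em_mvn_missing, starting values, and the simulated example are hypothetical choices for illustration):

```python
import numpy as np

def em_mvn_missing(Z, max_iter=100, tol=1e-6):
    """EM for N(mu, Sigma) with entries of Z missing at random (np.nan)."""
    n, p = Z.shape
    obs_mask = ~np.isnan(Z)
    # Initialize with observed-entry means and an identity covariance.
    mu = np.nanmean(Z, axis=0)
    Sigma = np.eye(p)
    for _ in range(max_iter):
        EZ = np.empty((n, p))
        EZZ = np.zeros((p, p))
        for i in range(n):
            o, m = obs_mask[i], ~obs_mask[i]
            zi = np.copy(Z[i])
            Ci = np.zeros((p, p))           # conditional covariance, nonzero on the missing block
            if m.any():
                # Conditional Gaussian formulas: E[Z_mis | Z_obs] and Cov[Z_mis | Z_obs].
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                S_mm = Sigma[np.ix_(m, m)]
                B = S_mo @ np.linalg.inv(S_oo)
                zi[m] = mu[m] + B @ (Z[i, o] - mu[o])
                Ci[np.ix_(m, m)] = S_mm - B @ S_mo.T
            EZ[i] = zi                       # E[Z_i | observed part]
            EZZ += np.outer(zi, zi) + Ci     # accumulate E[Z_i Z_i^T | observed part]
        # M-step: Gaussian MLE at the expected sufficient statistics.
        mu_new = EZ.mean(axis=0)
        Sigma_new = EZZ / n - np.outer(mu_new, mu_new)
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new, Sigma_new
        mu, Sigma = mu_new, Sigma_new
    return mu, Sigma

# Example usage: bivariate normal with roughly 20% of entries masked at random.
rng = np.random.default_rng(0)
Z = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], size=300)
Z[rng.random(Z.shape) < 0.2] = np.nan
print(em_mvn_missing(Z))
```

Each row contributes its conditional mean \({E}(Z_i\mid Z_{(r)i})\) and second moment \({E}(Z_iZ_i^{\top}\mid Z_{(r)i})\) (the outer product of the conditional mean plus the conditional covariance on the missing block), which are exactly the expected sufficient statistics above.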
Provided and discussed in lab.