Updated: 2024-10-31. Statistical Simulation, Wei Li

EM Algorithm

Suppose we have \(n\) observables \(x_{1},x_{2},\ldots,x_{n} \stackrel{i.i.d}{\sim} g(x|\theta)\). Let \(\boldsymbol{x}=\{x_i\}_{i=1}^n\). Our goal is to compute \[\begin{align*} \hat{\theta} = \arg\max_{\theta} L(\theta | \boldsymbol{x}), \qquad L(\theta | \boldsymbol{x}) = \prod^{n}_{i=1} g(x_{i} | \theta)= f(\boldsymbol{x}|\theta), \end{align*}\] where \(L(\theta | \boldsymbol{x})\) is the observed data likelihood function and \(f(\boldsymbol{x}|\theta)\) is the joint density.

Suppose there are unobservables \(\boldsymbol{z}\) such that the joint density of \((\boldsymbol{x} ,\boldsymbol{z})\) satisfies \(f(\boldsymbol{x}|\theta)=\int f(\boldsymbol{x}, \boldsymbol{z}|\theta)\, d\boldsymbol{z}\). Denote by \(L^{c}(\theta | \boldsymbol{x},\boldsymbol{z}) = f(\boldsymbol{x},\boldsymbol{z}|\theta)\) the complete data likelihood function.

Since \[\begin{align*} (\boldsymbol{x},\boldsymbol{z}) \sim f(\boldsymbol{x},\boldsymbol{z}|\theta), \end{align*}\] the conditional distribution of \(\boldsymbol{z}\) given the observables \(\boldsymbol{x}\) is \[\begin{align*} f(\boldsymbol{z}|\boldsymbol{x},\theta) = \frac{f(\boldsymbol{x},\boldsymbol{z}|\theta)}{f(\boldsymbol{x}|\theta)}. \end{align*}\]

Applying this identity to the observed data log-likelihood: \[\begin{align*} \log L(\theta | \boldsymbol{x}) & = \log f(\boldsymbol{x}|\theta) \\ & = \log \frac{f(\boldsymbol{x},\boldsymbol{z}|\theta)}{f(\boldsymbol{z}|\boldsymbol{x},\theta)} \\ & = \log f(\boldsymbol{x},\boldsymbol{z}|\theta) - \log f(\boldsymbol{z}| \boldsymbol{x},\theta). \end{align*}\]

We shall use the notations:

\[\begin{align*} E_{g}(f(\boldsymbol{x})) = \int f(\boldsymbol{x})g(\boldsymbol{x}) d\boldsymbol{x}, \qquad E_{\boldsymbol{z}|\boldsymbol{x}}(h(\boldsymbol{z},\boldsymbol{x})) = \int h(\boldsymbol{z},\boldsymbol{x})f(\boldsymbol{z}|\boldsymbol{x}) d\boldsymbol{z}. \end{align*}\]

Let \(\theta^{(0)}\) be our initial guess of the parameter, and take the conditional expectation of \(\boldsymbol{z}\) given \(\boldsymbol{x}\) (that is, the integral is taken with respect to \(f(\boldsymbol{z}| \boldsymbol{x},\theta^{(0)})\)) on both sides:

\[\begin{align*} E_{\theta^{(0)}}[\log L(\theta|\boldsymbol{x})] = \log L(\theta|\boldsymbol{x}) & = E_{\theta^{(0)}}[\log f(\boldsymbol{x},\boldsymbol{z}|\theta)|\boldsymbol{x}] - E_{\theta^{(0)}}[\log f(\boldsymbol{z}|\boldsymbol{x},\theta)|\boldsymbol{x}] \\ & = Q(\theta | \theta^{(0)},\boldsymbol{x}) - K(\theta | \theta^{(0)},\boldsymbol{x}) \end{align*}\] (the first equality holds because \(\log L(\theta|\boldsymbol{x})\) does not involve \(\boldsymbol{z}\)), where we define \[\begin{align*} Q(\theta | \theta^{(0)},\boldsymbol{x}) &:= E_{\theta^{(0)}}[\log f(\boldsymbol{x},\boldsymbol{z}|\theta)|\boldsymbol{x}], \\ K(\theta | \theta^{(0)},\boldsymbol{x}) &:= E_{\theta^{(0)}}[\log f(\boldsymbol{z}|\boldsymbol{x},\theta)|\boldsymbol{x}]. \end{align*}\]

It turns out that for any candidate \(\theta'\) for the next iterate, \[\begin{align*} K(\theta'|\theta^{(0)},\boldsymbol{x}) \le K(\theta^{(0)} | \theta^{(0)},\boldsymbol{x}). \end{align*}\]

To see this, it suffices to show, at a generic iterate \(\theta^{(m)}\) (here \(m=0\)), that for any \(\theta'\): \[\begin{align*} E_{\theta^{(m)}}(\log f(\boldsymbol{z}|\boldsymbol{x},\theta')|\boldsymbol{x}) \le E_{\theta^{(m)}}(\log f(\boldsymbol{z}|\boldsymbol{x},\theta^{(m)})|\boldsymbol{x})= \int \log f(\boldsymbol{z}| \boldsymbol{x},\theta^{(m)})f(\boldsymbol{z}|\boldsymbol{x},\theta^{(m)})\, d\boldsymbol{z}. \end{align*}\]

Write \(g(\boldsymbol{z}) = f(\boldsymbol{z}|\boldsymbol{x},\theta')\) and \(h(\boldsymbol{z}) = f(\boldsymbol{z}|\boldsymbol{x},\theta^{(m)})\). It suffices to show \[\begin{align*} E_{h}\Big(\log\frac{h(z)}{g(z)}\Big) \ge 0. \end{align*}\] The quantity \(E_{h}\Big(\log\frac{h(z)}{g(z)}\Big)\) is the Kullback-Leibler divergence \(KL(h\|g)\) between the distributions \(h\) and \(g\). To see the inequality, we use Jensen's inequality:
\[\begin{align*} LHS & = \int \log \Big(\frac{h(z)}{g(z)}\Big) h(z)\, dz \\ & = -\int \log \Big(\frac{g(z)}{h(z)}\Big) h(z)\, dz \\ & \ge - \log \int \frac{g(z)}{h(z)}h(z)\, dz = 0, \end{align*}\] where the inequality is Jensen's (since \(\log\) is concave, \(E\log \le \log E\)). In fact \(E_{h}(\log(h(z)/g(z))) =0\) if and only if \(h=g\).
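As a quick numerical illustration of the nonnegativity of \(KL(h\|g)\), the sketch below estimates the divergence by Monte Carlo for two univariate normal densities; the choice of normals is only for illustration and is not part of the argument above.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check that KL(h || g) = E_h[log(h(Z)/g(Z))] >= 0.
# The two normal densities are illustrative stand-ins for
# f(z | x, theta^(m)) and f(z | x, theta').
rng = np.random.default_rng(0)
h = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=1.5)

z = h.rvs(size=200_000, random_state=rng)       # draw Z_1, ..., Z_N ~ h
kl_hat = np.mean(h.logpdf(z) - g.logpdf(z))     # average of log(h(Z)/g(Z))
print(f"estimated KL(h || g) = {kl_hat:.3f}")   # positive; zero only if h = g
```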

Since \(K(\theta|\theta^{(0)},\boldsymbol{x}) \le K(\theta^{(0)}|\theta^{(0)},\boldsymbol{x})\) for every \(\theta\), any \(\theta\) that increases \(Q(\theta | \theta^{(0)},\boldsymbol{x})\) above \(Q(\theta^{(0)} | \theta^{(0)},\boldsymbol{x})\) also increases \(\log L(\theta|\boldsymbol{x})\) above \(\log L(\theta^{(0)}|\boldsymbol{x})\). So to improve \(\log L(\theta|\boldsymbol{x})\), it suffices to maximize \(Q(\theta | \theta^{(0)},\boldsymbol{x})\) over \(\theta\). Maximizing \(Q(\theta | \theta^{(0)},\boldsymbol{x})\) over \(\theta\) gives the next iterate \(\theta^{(1)}\); maximizing \(Q(\theta | \theta^{(1)},\boldsymbol{x})\) over \(\theta\) then gives \(\theta^{(2)}\), and the process continues until convergence.

EM algorithm

Based on this result, the EM algorithm consists of two main steps at the \(m\)-th iteration (a generic driver in Python is sketched after these two steps):

  1. E Step: compute \(Q(\theta |\theta^{(m)},\boldsymbol{x})\) as a function of \(\theta\) and \(\theta^{(m)}\).
  2. M Step: find \(\theta^{(m+1)} = \arg\max\limits_{\theta} Q(\theta |\theta^{(m)},\boldsymbol{x})\).
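The two steps can be organized as a generic driver. This is a minimal sketch: the function name `make_Q` is a hypothetical user-supplied callable that returns \(\theta \mapsto Q(\theta\mid\theta^{(m)},\boldsymbol{x})\) (the E-step), and the M-step here uses a general-purpose numerical optimizer, although in many models it has a closed form.

```python
import numpy as np
from scipy.optimize import minimize

def em(x, theta0, make_Q, max_iter=100, tol=1e-8):
    """Generic EM driver (a sketch).

    make_Q(theta_m, x) is assumed to return the function
    theta -> Q(theta | theta_m, x) built in the E-step.
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        Q = make_Q(theta, x)                         # E-step
        res = minimize(lambda t: -Q(t), theta)       # M-step: maximize Q over theta
        theta_new = res.x
        if np.max(np.abs(theta_new - theta)) < tol:  # stop when iterates stabilize
            return theta_new
        theta = theta_new
    return theta
```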

Remark

  1. The EM algorithm only guarantees that a limit point of \(\{\theta^{(m)}\}\) is a stationary point of the objective function \(\log L(\theta|\boldsymbol{x})\), not necessarily a global maximizer. In practice, you’ll try different starting values of \(\theta^{(0)}\).
  2. If the marginal density \(f(\boldsymbol{z}|\theta)\) does not depend on \(\theta\), then the algorithm is equivalent to one that uses \[ Q(\theta |\theta^{(m)},\boldsymbol{x})= E_{\theta^{(m)}}[\log f(\boldsymbol{x}|\boldsymbol{z}, \theta)|\boldsymbol{x}], \] since \(\log f(\boldsymbol{x},\boldsymbol{z}|\theta)=\log f(\boldsymbol{x}|\boldsymbol{z},\theta)+\log f(\boldsymbol{z})\) and the second term does not involve \(\theta\).
  3. Consider \(h(x)=E(H(x,Z))\), where the expectation is taken with respect to the random variable \(Z\). Since \[\begin{align*} \max\limits_{x} h(x) &= \max \limits_{x} E(H(x,Z)) \\ &= \max \limits_{x} E(H(X,Z)|X = x) \\ &= \max \limits_{x} \int H(x,z)f(z|x)\,dz, \end{align*}\] we can use Monte Carlo to approximate the objective function: \[\begin{align*} \frac{1}{m}\sum^{m}_{i=1}H(x,Z_{i}) \to \int H(x,z)f(z|x)\,dz, \end{align*}\] where \(Z_{i} \stackrel{i.i.d}{\sim} f(z|x)\).
    If we approximate the \(Q\) function using this idea, we obtain the so-called Monte Carlo EM (a minimal sketch in Python follows after these remarks): \[\begin{align*} \hat{Q}(\theta|\theta^{(m)},\boldsymbol{x}) = \frac{1}{T}\sum^{T}_{j=1}\log L^{c}(\theta|\boldsymbol{x},\boldsymbol{z}_j), \end{align*}\]
    where \(\boldsymbol{z}_1, \ldots, \boldsymbol{z}_T\) is an i.i.d. random sample generated from \(f(\boldsymbol{z}|\theta^{(m)}, \boldsymbol{x})\).
  4. We may not need to find the exact maximizer in the M-step. Instead, it is sometimes enough to find a \(\theta^{(m+1)}\) that improves upon the value of \(Q\) at the current \(\theta^{(m)}\), that is, \[\begin{align*} Q(\theta^{(m+1)}|\theta^{(m)},\boldsymbol{x}) \ge Q(\theta^{(m)}|\theta^{(m)},\boldsymbol{x}). \end{align*}\] This is called the generalized EM algorithm.
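The Monte Carlo approximation of the \(Q\) function in Remark 3 can be sketched as follows. Both `sample_z` and `loglik_c` are hypothetical user-supplied functions, drawing from \(f(\boldsymbol{z}\mid\theta^{(m)},\boldsymbol{x})\) and evaluating \(\log L^{c}(\theta\mid\boldsymbol{x},\boldsymbol{z})\) respectively.

```python
import numpy as np

def mc_q_hat(theta, theta_m, x, sample_z, loglik_c, T=500, rng=None):
    """Monte Carlo EM approximation of Q(theta | theta_m, x) (Remark 3).

    sample_z(theta_m, x, rng): draws one z from f(z | theta_m, x) (assumed supplied).
    loglik_c(theta, x, z): evaluates log L^c(theta | x, z) (assumed supplied).
    """
    rng = np.random.default_rng() if rng is None else rng
    values = [loglik_c(theta, x, sample_z(theta_m, x, rng)) for _ in range(T)]
    return np.mean(values)   # (1/T) * sum_j log L^c(theta | x, z_j)
```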

EM Algorithms (examples)

E-step: evaluate \[ Q(\theta\mid\theta^{(m)}, {X})=E_{\theta^{(m)}}\big[\log L^c(\theta\mid {X}, {Z})\mid {X}\big], \]

where \({X}\) is observable and \({Z}\) is unobservable; \(E_{\theta^{(m)}}\) denotes the conditional expectation of \(Z\) given \(X\), with conditional density \(f( {Z}\mid {X},\theta^{(m)})\); and \(L^c(\theta\mid {X}, {Z})\) is the complete data likelihood function.

M-step: \(\theta^{(m+1)}\gets\mathop{\arg\max}\limits_{\theta}\ \ Q(\theta\mid\theta^{(m)}, {X})\)

Two component mixture of normals

Suppose \(X_i\overset{i.i.d.}{\sim}\frac{1}{4}N(\mu_1,1)+\frac{3}{4}N(\mu_2,1)\), \(\theta=(\mu_1,\mu_2).\)
The log-likelihood function is \[\log L(\theta\mid {X})=\sum\limits_{i=1}^{n}\log\big(\frac{1}{4}f(X_i\mid\mu_1)+\frac{3}{4}f(X_i\mid\mu_2)\big),\] where \(L(\theta\mid {X})=\prod\limits_{i=1}^{n}f(X_i\mid\mu_1,\mu_2)\) and \(f(\cdot\mid \mu_j)\) is the \(N(\mu_j,1)\) component density:
\[ f(X_i\mid \mu_j)=\frac{1}{\sqrt{2\pi}}e^{-\frac{(X_i-\mu_j)^2}{2}}. \] Define latent binary variables \(Z_i \overset{i.i.d.}{\sim} \text{Bernoulli}(\tfrac{1}{4})\):
\[P(Z_i=1)=\frac{1}{4}, \qquad P(Z_i=0)=\frac{3}{4}.\]

Assuming \(W_{1i}\sim f_1:=f(\cdot\mid\mu_1)\), \(W_{2i}\sim f_2:=f(\cdot\mid\mu_2)\), \(Z_i\perp(W_{1i},W_{2i})\), and that \(\{(Z_i,W_{1i},W_{2i})\}_{i=1}^n\) are independent across \(i\), we can write \[ X_i=Z_iW_{1i}+(1-Z_i)W_{2i}. \]

The complete data likelihood function based on \(({X}, {Z})\) is:
\[ \begin{aligned} L^c( \theta|{X}, {Z})&=\prod\limits_{i=1}^{n}P(X_i,Z_i)\\ &=\prod\limits_{i=1}^{n}\big[P(X_i\mid Z_i=1)P(Z_i=1)\big]^{Z_i}\big[P(X_i\mid Z_i=0)P(Z_i=0)\big]^{1-Z_i}. \end{aligned} \]

So the complete data log-likelihood function is: \[ \log\big(L^c(\theta\mid {X}, {Z})\big)=\sum\limits_{i=1}^{n}\big[Z_i\log f(X_i\mid \mu_1)+Z_i\log\frac{1}{4}+(1-Z_i)\log f(X_i\mid \mu_2)+(1-Z_i)\log\frac{3}{4}\big]. \] Let \(\theta^{(0)}=(\mu_1^{(0)},\mu_2^{(0)})\), \[ \begin{aligned} Q(\theta\mid\theta^{(0)}, {X})&=E_{\theta^{(0)}}\big[\log L^c(\theta\mid {X}, {Z})\mid {X}\big]\\ &=-\sum\limits_{i=1}^{n}\Big\{\frac{1}{2}E_{\theta^{(0)}}[Z_i(X_i-\mu_1)^2\mid X_i]+\frac{1}{2}E_{\theta^{(0)}}\big[(1-Z_i)(X_i-\mu_2)^2\mid X_i\big]\Big\}\\ &\ \ \ \ \ +\sum\limits_{i=1}^{n}\Big\{\log(\frac{1}{4}) E_{\theta^{(0)}}(Z_i\mid X_i)+\log(\frac{3}{4}) E_{\theta^{(0)}}(1-Z_i\mid X_i)\Big\}+C, \end{aligned} \] where \(C\) collects terms not involving \(\theta\).

Note that \[ E_{\theta^{(0)}}[Z_i(X_i-\mu_1)^2\mid X_i]=(X_i-\mu_1)^2E_{\theta^{(0)}}[Z_i\mid X_i], \] and for a binary \(Z\), \[E[Z\mid X=x]=P(Z=1\mid X=x)=\frac{P(X=x\mid Z=1)P(Z=1)}{P(X=x\mid Z=1)P(Z=1)+P(X=x\mid Z=0)P(Z=0)}.\]

So we can calculate \(E_{\theta^{(0)}}[Z_i\mid X_i]\):

\[ E_{\theta^{(0)}}\big[Z_i\mid X_i\big]=\frac{\frac{1}{4}f_1(X_i\mid \mu_1^{(0)})}{\frac{1}{4}f_1(X_i\mid\mu_1^{(0)})+\frac{3}{4}f_2(X_i\mid\mu_2^{(0)})}=:r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)}). \]

Then \(Q(\theta\mid\theta^{(0)}, {X})\) can be expressed as \[ -\sum\limits_{i=1}^{n}\Big\{\frac{1}{2}(X_i-\mu_1)^2r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})+\frac{1}{2}(X_i-\mu_2)^2\big(1-r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})\big)\Big\}+\text{const}. \]

Setting the derivatives with respect to \(\theta\) to zero: \[ \begin{cases} \frac{\partial Q(\theta\mid\theta^{(0)}, {X})}{\partial \mu_1}\overset{set}{=}0\\ \frac{\partial Q(\theta\mid\theta^{(0)}, {X})}{\partial \mu_2}\overset{set}{=}0, \end{cases} \]

we obtain the solutions \(\mu_1^{(1)}\) and \(\mu_2^{(1)}\):

\[ \begin{cases} \mu_1^{(1)}=\frac{\sum\limits_{i=1}^{n}r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})X_i}{\sum\limits_{i=1}^{n}r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})}\\ \mu_2^{(1)}=\frac{\sum\limits_{i=1}^{n}\big(1-r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})\big)X_i}{\sum\limits_{i=1}^{n}\big(1-r_i^{(0)}(X_i;\mu_1^{(0)},\mu_2^{(0)})\big)}. \end{cases} \]
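The responsibilities \(r_i^{(m)}\) and the closed-form updates above can be implemented directly. Below is a minimal sketch, assuming unit variances and the fixed weights \(1/4\) and \(3/4\) as in this example; the simulated data in the usage example uses illustrative true means.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, mu1, mu2, n_iter=200, tol=1e-8):
    """EM for the mixture (1/4) N(mu1, 1) + (3/4) N(mu2, 1) derived above."""
    for _ in range(n_iter):
        # E-step: responsibilities r_i = E[Z_i | X_i] at the current (mu1, mu2)
        p1 = 0.25 * norm.pdf(x, loc=mu1, scale=1.0)
        p2 = 0.75 * norm.pdf(x, loc=mu2, scale=1.0)
        r = p1 / (p1 + p2)
        # M-step: closed-form weighted-mean updates
        mu1_new = np.sum(r * x) / np.sum(r)
        mu2_new = np.sum((1.0 - r) * x) / np.sum(1.0 - r)
        if max(abs(mu1_new - mu1), abs(mu2_new - mu2)) < tol:
            return mu1_new, mu2_new
        mu1, mu2 = mu1_new, mu2_new
    return mu1, mu2

# Illustrative usage on simulated data (true means 0 and 3 are arbitrary choices):
rng = np.random.default_rng(1)
n = 1000
z = rng.random(n) < 0.25
x = np.where(z, rng.normal(0.0, 1.0, size=n), rng.normal(3.0, 1.0, size=n))
print(em_mixture(x, mu1=-1.0, mu2=1.0))  # iterates should approach (0, 3)
```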

Right censored data (R.C example 5.13 and 5.14)

Let \(X_i\overset{i.i.d.}{\sim}f(x-\theta)\), where \(f\) is the density function of \(N(0,1)\) and \(F\) is the CDF of \(N(0,1)\); thus \(X_i\sim N(\theta,1)\).
The goal is to estimate \(\theta\). However, the \(X_i\) are not fully observed; they are right censored.

  1. The actual observations are the \(Y_i\): \[ Y_i= \begin{cases} a & \text{if } X_i > a, \\ X_i & \text{if } X_i\le a, \end{cases} \] where \(a\) is some fixed constant. That is, \(Y_i=\min(X_i,a)\).
  2. \(\delta_i=1(X_i\le a)\) is the indicator for non-censoring, or equivalently, for observing the actual \(X_i\).
  3. \(n\) is the sample size.

Assume

  1. \((Y_1, Y_2, \dots, Y_m)\) are all less than \(a\),
  2. \((Y_{m+1}, Y_{m+2}, \dots, Y_n)\) are all equal to \(a\).

That is, \((Y_1, Y_2, \dots, Y_m)\) are uncensored, \((Y_{m+1}, Y_{m+2}, \dots, Y_n)\) are censored. So the observed data likelihood function: \[ \begin{aligned} L(\theta\mid Y_1, Y_2, \dots, Y_n, \delta_1, \delta_2, \dots, \delta_n)&=\prod\limits_{i=1}^{n}\big(f(Y_i-\theta)\big)^{\delta_i}\big(P(Y_i=a)\big)^{1-\delta_i}\\ &=\prod\limits_{i=1}^{n}\big(f(Y_i-\theta)\big)^{\delta_i}\big(1-P(X_i\le a)\big)^{1-\delta_i}\\ &=\prod\limits_{i=1}^{n}\big(f(Y_i-\theta)\big)^{\delta_i}\big(1-F(a-\theta)\big)^{1-\delta_i}. \end{aligned} \]

Let \({Z}\) be the vector of the unobservable \(X_{m+1}, X_{m+2}, \dots, X_n\).
The complete data likelihood function \(L^c(\theta\mid X_1, X_2, \dots, X_n)\) can be written as: \[ L^c(\theta\mid Y_1, Y_2, \dots, Y_m; {Z})=\prod\limits_{i=1}^{m}f(Y_i-\theta)\prod\limits_{i=m+1}^{n}f(X_i-\theta). \] The log-likelihood function: \[ \log L^c(\theta\mid Y_1, Y_2, \dots, Y_m; {Z})=\sum\limits_{i=1}^{m}\log f(Y_i-\theta)+\sum\limits_{i=m+1}^{n}\log f(X_i-\theta). \]

Let \[ \begin{aligned} Q(\theta\mid\theta^{(0)};Y_1, \dots, Y_n, \delta_1, \dots, \delta_n)&=E_{\theta^{(0)}}\big[\log L^c(\theta\mid X_1, \dots, X_n)\mid Y_1, \dots, Y_n, \delta_1, \dots, \delta_n\big]\\ &=-\frac{1}{2}\sum\limits_{i=1}^{m}(Y_i-\theta)^2-\frac{1}{2}\sum\limits_{i=m+1}^{n}E_{\theta^{(0)}}\big[(X_i-\theta)^2\mid Y_1, \dots, Y_n, \delta_1, \dots, \delta_n\big]. \end{aligned} \]

For \(i=m+1, \dots, n\), we have \(\delta_i=0\), so
\[ E_{\theta^{(0)}}\big[(X_i-\theta)^2\mid Y_i=a, \delta_i=0\big]=E_{\theta^{(0)}}\big[(X_i-\theta)^2\mid X_i\ge a \big]. \]

We can calculate \(\theta^{(1)}\) by: \[ \frac{\partial Q(\theta\mid\theta^{(0)};Y_1, \dots, Y_n, \delta_1, \dots, \delta_n)}{\partial\theta}\overset{set}{=}0, \]

which can be expressed as: \[ \sum\limits_{i=1}^{m}(Y_i-\theta)+\sum\limits_{i=m+1}^{n}E_{\theta^{(0)}}\big[(X_i-\theta)\mid X_i\ge a\big]\overset{set}{=}0, \] that is, \[ \sum\limits_{i=1}^{m}Y_i-m\theta+\sum\limits_{i=m+1}^{n}E_{\theta^{(0)}}[X_i\mid X_i\ge a]-(n-m)\theta\overset{set}{=}0. \]

We can calculate the solution \(\theta^{(1)}\): \[ \theta^{(1)}=\frac{(\sum\limits_{i=1}^{m}Y_i)+(n-m)E_{\theta^{(0)}}[X_i\mid X_i\ge a]}{n}. \]

Finally, we can show that \[ E_{\theta^{(0)}}[X_i\mid X_i\ge a]=\theta^{(0)}+\frac{\phi(a-\theta^{(0)})}{1-\Phi(a-\theta^{(0)})}, \] where \(\phi\) and \(\Phi\) denote the standard normal density and CDF (the \(f\) and \(F\) above).
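Putting the pieces together, the E-step uses the truncated-normal mean formula above and the M-step uses the closed-form update for \(\theta^{(1)}\). A minimal sketch on simulated data follows; the true \(\theta\) and the cutoff \(a\) in the usage example are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def em_censored(y, delta, a, theta0, n_iter=200, tol=1e-10):
    """EM for right-censored N(theta, 1) data: y_i = min(x_i, a),
    delta_i = 1 if x_i <= a (uncensored), following the updates above."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(y)
    m = int(delta.sum())               # number of uncensored observations
    sum_obs = y[delta == 1].sum()      # sum of the observed X_i (= Y_i)
    theta = float(theta0)
    for _ in range(n_iter):
        # E-step: E[X_i | X_i >= a] = theta + phi(a - theta) / (1 - Phi(a - theta))
        imputed = theta + norm.pdf(a - theta) / (1.0 - norm.cdf(a - theta))
        # M-step: closed-form update for theta
        theta_new = (sum_obs + (n - m) * imputed) / n
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Illustrative usage on simulated data (true theta = 1 and cutoff a = 1.5 are arbitrary):
rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=500)
a = 1.5
y = np.minimum(x, a)
delta = (x <= a).astype(int)
print(em_censored(y, delta, a, theta0=0.0))  # should be close to the MLE of theta
```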

Missing Data

  • \(Z\) full data: \(Z=\left(Z_1, \cdots, Z_p\right)\) is a vector of \(p\) variables; the components are subject to missingness.
  • \(R\) observation indicator: \(R=\left(R_1, \cdots, R_p\right) \in \{0,1\}^p\) such that \[\begin{align*} R_k= \begin{cases}1 & \text { if } Z_k \text { is observed } \\ 0 & \text { if } Z_k \text { is missing }.\end{cases} \end{align*}\] For a specific \(r=\left(r_1, \ldots, r_p\right) \in\{0,1\}^p\), \(Z_{(r)}\) is the subset of components that are observed and \(Z_{(\bar{r})}\) is the subset of components that are missing.
  • Let \(Z_{(R)}=\sum_r Z_{(r)} 1(R=r)\). Similarly, \(Z_{(\bar{R})}=Z \backslash Z_{(R)}\).
  • The ideal full data (complete data) is \((R, Z)\). The observed data is \(\left(R, Z_{(R)}\right)\).
    • A random sample of observed data is \(\left(R_i, Z_{(R_i)i}\right)_{i=1}^n.\)

MAR (missing at random) assumption: \[\begin{align*} P(R=r \mid Z)=P(R=r \mid Z_{(r)}). \end{align*}\]

Selection model factorization: suppose a joint model \(P(z, r ; \psi, \theta)\) for the density of \((R, Z)\) can be factorized as

\[\begin{align*} \begin{aligned} P(z, r ; \psi, \theta) & =P(Z=z, R=r ; \psi, \theta) \\ & =P(R=r|Z=z ; \psi) P(Z=z ; \theta) \\ & =P(R=r| Z_{(r)}=z_{(r)} ; \psi) P(Z=z ; \theta), \end{aligned} \end{align*}\] where the last equality is due to MAR. This factorization (with MAR) is called the separability condition, which says that there is no information about \(\theta\) in \(\psi\) (assuming their elements do not overlap). Therefore, if we are interested in the estimation of \(\theta\) alone, it suffices to consider \(P(Z=z ; \theta).\)

Now, the likelihood for the observed data point \(\left(R=r, Z_{(R)}=z_{(r)}\right)\) is obtained accordingly: \[\begin{align*} P\left(R=r, Z_{(R)}=z_{(r)} ; \psi, \theta\right) & =\int P(z, r ; \psi, \theta)\, d z_{(\bar{r})} \\ & =P\left(R=r \mid Z_{(r)}=z_{(r)} ; \psi\right) \int P(z ; \theta)\, d z_{(\bar{r})} \\ & =P\left(R=r \mid Z_{(r)}=z_{(r)} ; \psi\right) P\left(Z_{(r)}=z_{(r)} ; \theta\right). \end{align*}\]

The likelihood based on the observed data \(\left(R_i=r_i, Z_{\left(R_i\right)i}=z_{\left(r_i\right)}\right)_{i=1}^n\) is therefore given by

\[\begin{align*} & \prod_i \prod_r P\left(R_i=r, Z_{(r) i}=z_{(r)} ; \psi, \theta\right)^{1\left(R_i=r\right)} \\ = & \prod_i\left(\prod_r P\left(R_i=r\mid Z_{(r) i}=z_{(r)} ; \psi \right)^{1\left(R_i=r\right)} \prod_r P\left(Z_{(r) i}=z_{(r)} ; \theta\right)^{1\left(R_i=r\right)}\right). \end{align*}\] Goal: suppose we are only interested in \(\theta\); the MLE then maximizes the following objective function over \(\theta\): \[\begin{align*} &\sum_{i=1}^n \sum_r 1\left(R_i=r\right) \log P\left(Z_{(r)i}=z_{(r)}; \theta\right) \\ =&\sum_{i=1}^n \sum_r \left[1\left(R_i=r\right) \log \left( \int P\left(z_{(r)i}, z_{(\bar{r})i}; \theta\right) d{z_{(\bar{r})}}\right)\right]. \end{align*}\]

EM algorithm: let us consider how the EM algorithm works in this situation. The ideal full (complete) data likelihood is given by \[ \prod_{i} \prod_{r} P(R_i, Z_{(R_i)i}, Z_{(\bar{R}_i)i}; \psi, \theta)^{1\{R_i=r\}}. \] By MAR and separability, it suffices to consider the part of the full data likelihood involving \(\theta\): \[ \prod_{i} \prod_{r} P(Z_{(R_i)i}, Z_{(\bar{R}_i)i}; \theta)^{1\{R_i=r\}}. \]

In the same spirit of the EM algorithm discussed previously, the E-step is \[\begin{align*} Q\left(\theta| \theta^{(t-1)}\right)=\sum_{i=1}^n E_{\theta^{(t-1)}}\Big(\log \prod_{r}P(Z_i;\theta)^{1(R_i=r)}\,\Big|\,R_i, Z_{(R_i)i}\Big)= \sum_{i=1}^n \sum_{r}1\left(R_i=r\right) E_{\theta^{(t-1)}}\left(\log P\left(Z_i; \theta\right) \mid Z_{(r) i}\right), \end{align*}\] where the expectation is conditional on the observables \((R_i, Z_{(R_i)i})_{i=1}^n\) with parameter \(\theta^{(t-1)}\). The second equality holds because \[\begin{align*} \sum_{r}1\left(R_i=r\right) E_{\theta^{\prime}}\left(\log P\left(Z_i; \theta\right) \mid R_i=r, Z_{(r)i}\right) =\sum_{r}1\left(R_i=r\right) E_{\theta^{\prime}}\left(\log P\left(Z_i; \theta\right) \mid Z_{(r) i}\right), \end{align*}\] where the expectation on the left is conditional on \((R_i=r, Z_{(r)i})\) with parameter \(\theta'\); this equality is a consequence of MAR.

The M-step: \[\theta^{(t)} =\arg \max_{\theta} Q(\theta| \theta^{(t-1)}).\]

Example: Suppose \(Z\) follows a \(N(\mu,\Sigma)\) distribution; here \(\theta:=(\mu,\Sigma)\). Note that \[\begin{align*} &E_{\theta^{(t-1)}}\left(\log P\left(Z_i; \theta\right) \mid Z_{(r) i}\right) \\ =&-\frac{1}{2} \log |2 \pi {\Sigma}|-\frac{1}{2} {E}_{\theta^{(t-1)}}\left(\left(Z_i-{\mu}\right)^{\top} {\Sigma}^{-1}\left(Z_i-{\mu}\right) \mid Z_{(r) i} \right) \\ =&-\frac{1}{2} \log |{\Sigma}|-\frac{p}{2} \log (2 \pi)-\frac{1}{2} \operatorname{tr}\left({\Sigma}^{-1} {E}_{\theta^{(t-1)}}({S}({\mu})\mid Z_{(r) i})\right), \end{align*}\] where \({E}_{\theta^{(t-1)}}({S}({\mu})\mid Z_{(r) i}):= {E}_{\theta^{(t-1)}}(Z_i Z_i^{\top}\mid Z_{(r) i})+{\mu} {\mu}^{\top}-2 {\mu} {E}_{\theta^{(t-1)}}(Z_i\mid Z_{(r) i})^{\top}.\) The key is to compute \({E}_{\theta^{(t-1)}}(Z_i\mid Z_{(r) i})\) and \({E}_{\theta^{(t-1)}}(Z_i Z_i^{\top}\mid Z_{(r) i})\); these expected sufficient statistics can be computed using the conditioning formula for the multivariate Gaussian distribution.

Gathering all observations across all missing patterns in the above calculation, the M-step solution is given by the usual MLE formulas with the expected sufficient statistics plugged in: \[\begin{align*} {\mu}^{(t)} & =\frac{1}{n}\sum_{i=1}^n \sum_r 1(R_i=r){E}_{\theta^{(t-1)}}(Z_i\mid Z_{(r) i}), \\ {\Sigma}^{(t)} & =\frac{1}{n} \sum_{i=1}^n \sum_r 1(R_i=r){E}_{\theta^{(t-1)}}(Z_i Z_i^{\top}\mid Z_{(r) i})-{\mu}^{(t)}{\mu}^{(t)\top}. \end{align*}\]
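A minimal sketch of this EM for \(N(\mu,\Sigma)\) with entries missing at random follows. It assumes missing entries are coded as `np.nan` and that every column has at least one observed value; the function name `em_mvn_missing` is for illustration, and no numerical safeguards (e.g. regularization of \(\Sigma\)) are added.

```python
import numpy as np

def em_mvn_missing(Z, n_iter=100, tol=1e-8):
    """EM for N(mu, Sigma) with entries missing at random, coded as np.nan.

    E-step: fill in E[Z_i | Z_obs] and E[Z_i Z_i^T | Z_obs] by Gaussian conditioning.
    M-step: plug the expected sufficient statistics into the usual MLE formulas.
    """
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    obs = ~np.isnan(Z)
    mu = np.nanmean(Z, axis=0)                 # crude starting values
    Sigma = np.diag(np.nanvar(Z, axis=0))
    for _ in range(n_iter):
        EZ = np.zeros((n, p))
        EZZ = np.zeros((p, p))
        for i in range(n):
            o, m = obs[i], ~obs[i]
            zi = Z[i].copy()
            Ci = np.zeros((p, p))              # conditional covariance of the missing block
            if not o.any():                    # fully missing row: unconditional moments
                zi, Ci = mu.copy(), Sigma.copy()
            elif m.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                S_mm = Sigma[np.ix_(m, m)]
                B = S_mo @ np.linalg.inv(S_oo)
                zi[m] = mu[m] + B @ (Z[i, o] - mu[o])      # E[Z_mis | Z_obs]
                Ci[np.ix_(m, m)] = S_mm - B @ S_mo.T       # Cov[Z_mis | Z_obs]
            EZ[i] = zi
            EZZ += np.outer(zi, zi) + Ci       # accumulates E[Z_i Z_i^T | Z_obs]
        mu_new = EZ.mean(axis=0)                               # mu^(t)
        Sigma_new = EZZ / n - np.outer(mu_new, mu_new)         # Sigma^(t)
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new, Sigma_new
        mu, Sigma = mu_new, Sigma_new
    return mu, Sigma
```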

Python Code Examples

Provided and discussed in lab.
