Statistical Simulation, Wei Li. Updated: 2024-12-03

Metropolis-Hastings algorithm (blockwise)

Target density: \(f(x,y)\), which could be unnormalized.

We use two candidate transition densities \(q_{X} (x \vert x_c, y_c)\) and \(q_{Y} (y \vert x_c, y_c)\), where \(x_c\), \(y_c\) stand for current values.

Algorithm:

  1. Update \(X\)
    1. sample \(X^* \sim q_X(\cdot \vert X^{(s)}, Y^{(s)}),\)
    2. compute \(\alpha_X (X^* \vert X^{(s)}, Y^{(s)})=\displaystyle \frac{f(X^{*}, Y^{(s)})}{f(X^{(s)}, Y^{(s)})} \cdot \frac{q_X(X^{(s)}\vert X^{*}, Y^{(s)})}{q_X(X^{*} \vert X^{(s)}, Y^{(s)})},\)
    3. set \(X^{(s+1)}\) to \(X^*\) with probability \(\min (1, \alpha_X)\); set \(X^{(s+1)}\) to \(X^{(s)}\) otherwise.
  2. Update \(Y\)
    1. sample \(Y^* \sim q_Y(\cdot \vert X^{(s+1)}, Y^{(s)}),\)
    2. compute \(\alpha_Y (Y^* \vert X^{(s+1)}, Y^{(s)})=\displaystyle \frac{f(X^{(s+1)}, Y^{*})}{f(X^{(s+1)}, Y^{(s)})} \cdot \frac{q_Y(Y^{(s)}\vert X^{(s+1)}, Y^{*})}{q_Y(Y^{*} \vert X^{(s+1)}, Y^{(s)})},\)
    3. set \(Y^{(s+1)}\) to \(Y^*\) with probability \(\min (1, \alpha_Y)\); set \(Y^{(s+1)}\) to \(Y^{(s)}\) otherwise.

Remark: The detailed balance condition holds within each update component; however, it does not hold for the composition. That is, \(\displaystyle p \big( (x,y) \to (x^*,y^*) \big) f(x,y) = p \big( (x^*,y^*) \to (x,y) \big) f(x^*,y^*)\) does not hold in general for this algorithm.

Proof: provided in class.
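As an illustration, here is a minimal Python sketch of the blockwise scheme, taking Gaussian random-walk proposals for each block (an illustrative choice, not part of the notes); random-walk proposals are symmetric, so the proposal ratio in \(\alpha\) cancels.

```python
import numpy as np

# Minimal sketch of blockwise Metropolis-Hastings with Gaussian random-walk
# proposals for each block; proposal scales and chain length are illustrative.
rng = np.random.default_rng(0)

def blockwise_mh(f, x0, y0, n_iter=5000, scale_x=1.0, scale_y=1.0):
    """f is a kernel of the target density f(x, y), possibly unnormalized."""
    x, y = x0, y0
    chain = np.empty((n_iter, 2))
    for s in range(n_iter):
        # 1. update X given the current (x, y)
        x_star = rng.normal(x, scale_x)
        if rng.random() < min(1.0, f(x_star, y) / f(x, y)):
            x = x_star
        # 2. update Y given the (possibly) updated x and current y
        y_star = rng.normal(y, scale_y)
        if rng.random() < min(1.0, f(x, y_star) / f(x, y)):
            y = y_star
        chain[s] = (x, y)
    return chain
```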

Gibbs Sampling

Recall the blockwise Metropolis-Hastings: to sample from \(f(X,Y)\), we used two candidate transition densities \(q_X(x|X,Y)\) and \(q_Y(y|X,Y)\).

  1. Update \(X\): \[\begin{align*} \alpha_X(X^*|X^{(s)},Y^{(s)})&=\frac{f(X^*,Y^{(s)})}{f(X^{(s)},Y^{(s)})}\cdot\frac{q_X(X^{(s)}|X^{*},Y^{(s)})}{q_X(X^*|X^{(s)},Y^{(s)})}. \end{align*}\]

Suppose we take \(q_X(x|X^{(s)}, Y^{(s)})\) to be the full conditional distribution \(f_{X|Y}(x|Y^{(s)})\); then

\[\begin{align*} \alpha_X(X^*|X^{(s)},Y^{(s)})&=\frac{f(X^*,Y^{(s)})}{f(X^{(s)},Y^{(s)})}\cdot\frac{q_X(X^{(s)}|X^{*},Y^{(s)})}{q_X(X^*|X^{(s)},Y^{(s)})}\\ &=\frac{f_{X|Y}(X^*|Y^{(s)})}{f_{X|Y}(X^{(s)}|Y^{(s)})}\cdot\frac{f_{X|Y}(X^{(s)}|Y^{(s)})}{f_{X|Y}(X^*|Y^{(s)})}\\ &=1. \end{align*}\]

  2. Update \(Y\): A similar argument goes through if we use the full conditional distribution as the candidate transition density, i.e., taking \(q_Y(y|X^{(s+1)},Y^{(s)})\) to be the full conditional distribution \(f_{Y|X}(y|X^{(s+1)})\), so that \(\alpha_Y=1\).

Definition: Gibbs sampling is the special case of (blockwise) MH that takes \(q_X(x|X_c, Y_c)\) to be \(f_{X|Y}(x|Y_c)\) and \(q_Y(y|X_c,Y_c)\) to be \(f_{Y|X}(y|X_c)\), which then yields \(\alpha_X=\alpha_Y=1\). So under this sampling scheme, all proposals are automatically accepted.

Algorithm: Two-stage Gibbs sampling

\(f(X,Y)\), possibly unnormalized

Take some \(x^{(0)}\). For \(s=0,1,\cdots\), generate \[\begin{align*} (1)&\quad Y^{(s+1)} \sim f_{Y|X}(\cdot \mid X^{(s)}),\\ (2)&\quad X^{(s+1)} \sim f_{X|Y}(\cdot \mid Y^{(s+1)}). \end{align*}\]

Algorithm: Multi-stage Gibbs sampling

\(f(X_1,X_2,\cdots,X_d)\), possibly unnormalized

Starting Values: \(X^{(0)}=(X^{(0)}_1,X^{(0)}_2,\cdots,X^{(0)}_d)\). Let \(f_{j}(X_j|X_{-j})\) denote the conditional density of \(X_j\) given all the remaining components \(X_{-j}:=\{X_i: i\neq j\}\).

The algorithm generates \(X^{(s+1)}\) from \(X^{(s)}\) as follows:

\[\begin{align*} (1)&\quad X^{(s+1)}_1 \sim f_1(\cdot | X^{(s)}_2, \cdots,X^{(s)}_d)\\ (2)&\quad X^{(s+1)}_2 \sim f_2(\cdot | X^{(s+1)}_1, X^{(s)}_3, \cdots,X^{(s)}_d)\\ & \vdots \\ (d)&\quad X^{(s+1)}_d \sim f_d(\cdot | X^{(s+1)}_1, \cdots,X^{(s+1)}_{d-1}) \end{align*}\]

Repeat \(s \leftarrow s+1\).

\(\textbf{Advantage:}\) For high-dimensional problems, each component update reduces to sampling from a univariate full conditional distribution, and every proposal is accepted.
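A structural sketch of one sweep is given below; the interface `conditionals[j]` (a user-supplied function drawing \(X_j\) from \(f_j(\cdot\mid X_{-j})\)) is a hypothetical convention chosen for illustration, not fixed by the notes.

```python
def gibbs_sweep(x, conditionals, rng):
    # One sweep of multi-stage Gibbs sampling: x is a length-d sequence of the
    # current component values, and conditionals[j](x, rng) is assumed to draw
    # X_j from f_j(. | x_{-j}) using the most recently updated components.
    for j, sample_j in enumerate(conditionals):
        x[j] = sample_j(x, rng)
    return x
```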


Example: Two-stage Gibbs sampling

\[f(x)\propto\frac{e^{-x^2/20}}{(1+(z_1-x)^2)(1+(z_2-x)^2)},\quad z_1=-4.3, z_2=5.2\]

Note that \(\displaystyle \frac{1}{1+(z_i-x)^2}=\int_0^\infty e^{-w_i(1+(z_i-x)^2)}\,dw_i\), so we can write \[f(x,w_1,w_2)\propto e^{-x^2/20}\prod_{i=1}^2 e^{-w_i(1+(z_i-x)^2)},\] and \(f(x)\) is just the marginal p.d.f. (in \(x\)) of \(f(x,w_1,w_2)\).

Gibbs sampling, with \(W=(W_1, W_2)\):

  • \(X^{(s+1)}\sim f_{X|W}(\cdot|{W}^{(s)})\)
  • \(W^{(s+1)}\sim f_{W|X}(\cdot|X^{(s+1)})\)

Here \[f_{X|W}(x|w_1,w_2) \overset{x} \propto e^{-\big(\sum w_i\big)x^2+2x\sum w_iz_i} e^{-x^2/20}\sim N \Big(\frac{\sum w_iz_i}{\sum w_i+1/20},\frac{1}{2\big(\sum w_i+1/20\big)} \Big),\] \[f_{W|X}(w_1,w_2|x) \overset{w} \propto e^{-w_1(1+(z_1-x)^2)} e^{-w_2(1+(z_2-x)^2)}.\] Note from the factorization above that \(w_1,w_2\) are independent given \(x\); therefore, \[\begin{align*} W_1|X &\sim \mbox{exponential}(1+(z_1-X)^2),\\ W_2|X & \sim \mbox{exponential}(1+(z_2-X)^2), \end{align*}\] where the exponential distributions are parameterized by their rate (so \(W_i|X\) has mean \(1/(1+(z_i-X)^2)\)).
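Both full conditionals are standard distributions, so the sampler is straightforward to code. Below is a minimal Python sketch of this two-stage Gibbs sampler; the chain length, starting value, and variable names are illustrative choices.

```python
import numpy as np

# Two-stage Gibbs sampler for the augmented target f(x, w1, w2) above.
rng = np.random.default_rng(0)
z = np.array([-4.3, 5.2])

def gibbs(n_iter=5000, x0=0.0):
    x = x0
    xs = np.empty(n_iter)
    for s in range(n_iter):
        # W_i | X = x  ~  exponential with rate 1 + (z_i - x)^2
        rate = 1.0 + (z - x) ** 2
        w = rng.exponential(scale=1.0 / rate)   # numpy uses scale = 1/rate
        # X | W = w  ~  N( sum(w*z)/(sum(w)+1/20), 1/(2*(sum(w)+1/20)) )
        a = w.sum() + 1.0 / 20.0
        mean = (w * z).sum() / a
        var = 1.0 / (2.0 * a)
        x = rng.normal(mean, np.sqrt(var))
        xs[s] = x
    return xs

samples = gibbs()
```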

Multivariate Normal

The multivariate normal distribution of a \(d-\)dimensional random vector \(\mathbf{X}=(X_1, \cdots, X_d)^{\top}\in\mathbb{R}^d\) can be written in the notation \(\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{\Sigma}).\) When the symmetric covariance matrix \(\mathbf{\Sigma}\) is positive definite, then the multivariate normal distribution is non-degenerate, and the distribution has density function \[\begin{align*} f_{\mathbf{X}}(\mathbf{x} \vert \boldsymbol\mu, \mathbf{\Sigma}) &= \frac{1}{(2 \pi)^{\frac{d}{2}}} \big( \det(\mathbf{\Sigma}) \big)^{-\frac{1}{2}} \exp \big(-\frac{1}{2} (\mathbf{x}-\boldsymbol\mu)^{\top} \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol\mu) \big)\\ &\overset{\mathbf{x}} \propto \exp(-\frac{1}{2} \mathbf{x}^{\top} \mathbf{\Sigma}^{-1} \mathbf{x} + \mathbf{x}^{\top} \mathbf{\Sigma}^{-1} \boldsymbol\mu) \end{align*}\] where \(\mathbf{x}=(x_1, \cdots, x_d)^{\top}\).

For comparison, the density function of (univariate) normal distribution \(X\sim N(\mu, \sigma^2)\), \[ f_X(x \vert \mu, \sigma^2)=\frac{1}{\sqrt{2\pi}\sigma} \exp \big(-\frac{1}{2\sigma^2}(x-\mu)^2 \big) \overset{x} \propto \exp\big(-\frac{1}{2\sigma^2}(x^2-2x\mu+\mu^2)\big) \overset{x} \propto \exp\big(-\frac{1}{2} x^2 (\sigma^2)^{-1}+ x(\sigma^2)^{-1}\mu\big). \]

Bivariate normal \[\displaystyle (X,Y)\sim N \left(\begin{pmatrix}\mu_X\\ \mu_Y\end{pmatrix},\begin{pmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY}&\sigma_Y^2\end{pmatrix} \right),\] Let \(\rho=\frac{\sigma_{XY}}{\sigma_{X}\sigma_Y}\) be the correlation. It is a fact that \[f_{X|Y}(x|y)\sim N(\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y), \sigma_X^2(1-\rho^2)),\] \[f_{Y|X}(y|x)\sim N(\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \sigma_Y^2(1-\rho^2)).\]
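These conditionals give another simple two-stage Gibbs sampler. The sketch below samples from a bivariate normal via the two full conditional distributions stated above; the parameter values \(\mu_X, \mu_Y, \sigma_X, \sigma_Y, \rho\) and the chain length are illustrative assumptions.

```python
import numpy as np

# Gibbs sampling from a bivariate normal using its full conditionals.
rng = np.random.default_rng(1)
mu_x, mu_y = 0.0, 0.0
sd_x, sd_y, rho = 1.0, 2.0, 0.8

def gibbs_bvn(n_iter=5000, y0=0.0):
    y = y0
    out = np.empty((n_iter, 2))
    for s in range(n_iter):
        # X | Y = y ~ N(mu_X + rho*(sd_X/sd_Y)*(y - mu_Y), sd_X^2*(1 - rho^2))
        x = rng.normal(mu_x + rho * sd_x / sd_y * (y - mu_y),
                       sd_x * np.sqrt(1.0 - rho ** 2))
        # Y | X = x ~ N(mu_Y + rho*(sd_Y/sd_X)*(x - mu_X), sd_Y^2*(1 - rho^2))
        y = rng.normal(mu_y + rho * sd_y / sd_x * (x - mu_x),
                       sd_y * np.sqrt(1.0 - rho ** 2))
        out[s] = (x, y)
    return out

samples = gibbs_bvn()
```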

2D Ising model

The Ising model is a mathematical model of ferromagnetism in statistical mechanics. It comprises a lattice of spins, where each spin can be either up (+1) or down (-1). These spins represent magnetic dipole moments of atomic “spins”. The key interest in the Ising model is understanding how these spins interact under varying temperatures and how these interactions lead to macroscopic phenomena like phase transitions.

In image processing, a 2D lattice can represent an image where each lattice site corresponds to a pixel: \[\begin{align*} I=\{1,2, \ldots, L\} \times\{1,2, \ldots, L\} \end{align*}\] The value at each site (spin) can denote various image properties, like color or intensity. Here we consider binary images, where the state space is given by \(\mathcal{X}=\{-1, 1\}^{I}\).

The Ising model describes the energy of a system using an energy function:

\[\begin{equation*} H(x) = -J \sum_{\substack{i \sim j \\ i, j \in I}} x_i x_j - h\sum_{i \in I} x_i, \quad x \in \mathcal{X}, \end{equation*}\]

where \(i \sim j\) indicates that \(x_i\) and \(x_j\) are the spins of adjacent lattice sites. For simplicity, we consider only nearest neighbors: a pixel \(i=\left(i_1, i_2\right)\) in the interior of the grid has four neighbours, \(\left(i_1+1, i_2\right),\left(i_1, i_2+1\right),\left(i_1-1, i_2\right)\) and \(\left(i_1, i_2-1\right)\); a pixel on an edge of the grid (but not a corner) has three neighbours, and a corner pixel has two.

  • Spin (or color) \(x_i\): represents the state of each pixel in image processing contexts, \(+1\) is black, \(-1\) is white.
  • Interaction Energy \(J\): the inverse temperature, dictating how strongly neighboring pixels influence each other.
    • Positive \(J\): the energy of the system is lower when neighboring spins are aligned. This scenario promotes homogeneity in the system.
    • Negative \(J\): the system has lower energy when neighboring spins are opposite. This situation encourages heterogeneity and contrast.
  • External Field \(h\): represents an external field (which can be thought of as a bias or an external influence on the system). In image processing, this can be used to model external factors or a priori information about the image. For simplicity, we assume \(h=0\) for the moment.

The probability measure for the system on \(\mathcal{X}\) can be written as \[\begin{align*} \pi(x)=\frac{1}{Z} \exp (-H(x)), \end{align*}\] where \(Z\) is the normalizing constant.

It can be shown that \[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}\right) = \frac{1}{1+\exp \left(-2J \eta_i\right)}, \end{align*}\] where \(\eta_i := \sum_{j \sim i} x_j\). Indeed, with \(h=0\), \(\pi(x_i=+1 \mid x_{-i}) \propto e^{J \eta_i}\) and \(\pi(x_i=-1 \mid x_{-i}) \propto e^{-J \eta_i}\); normalizing these two values gives the formula above.
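This full conditional is all that is needed for a Gibbs sampler on \(\mathcal{X}\). Below is a minimal Python sketch of one-site Gibbs sweeps for the Ising model with \(h=0\); the lattice size \(L\), the value of \(J\), the number of sweeps, and the raster-scan update order are illustrative choices.

```python
import numpy as np

# Single-site Gibbs sweeps for the 2D Ising model (h = 0), using
# P(X_i = +1 | rest) = 1 / (1 + exp(-2*J*eta_i)).
rng = np.random.default_rng(2)

def gibbs_ising(L=32, J=0.4, n_sweeps=100):
    x = rng.choice([-1, 1], size=(L, L))      # random initial configuration
    for _ in range(n_sweeps):
        for i in range(L):
            for j in range(L):
                # eta_i: sum of the (up to four) nearest-neighbour spins
                eta = 0
                if i > 0:     eta += x[i - 1, j]
                if i < L - 1: eta += x[i + 1, j]
                if j > 0:     eta += x[i, j - 1]
                if j < L - 1: eta += x[i, j + 1]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * eta))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

config = gibbs_ising()
```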

Bayesian binary denoising

Assume that a collection of original binary images has a distribution that is described by the Ising model \(\pi(x)\) above. We further assume that the observed images are noisy versions of the original images.

To be concrete, we assume that for each pixel \(X_i\), we actually observe \(Y_i=X_i+\epsilon_i\), where the \(\epsilon_i\) are independently distributed \(N(0, \sigma^2).\)

Using Bayes’ rule, \[\begin{align*} p_{X \mid Y}(x \mid y)=\frac{p_{Y \mid X}(y \mid x) p_X(x)}{p_Y(y)}=\frac{1}{Z_y} \exp \left(-H(y, x)\right), \end{align*}\] where (using \(x_i^2=1\), so that factors not involving \(x\) are absorbed into \(Z_y\), and recalling \(h=0\)) \[\begin{align*} H(y, x)=-\frac{1}{\sigma^2} \sum_{i \in I} y_i x_i+H(x)=-\frac{1}{\sigma^2} \sum_{i \in I} y_i x_i-J \sum_{\substack{i \sim j}} x_i x_j. \end{align*}\] The full conditional distribution of \(X_i\) can be similarly derived: \[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}, Y=\boldsymbol{y} \right) = \frac{1}{1+\exp \left(-2(J \eta_i+ y_i/\sigma^2)\right)}, \end{align*}\] where \(\eta_i := \sum_{j \sim i} x_j\).
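A Gibbs sampler for the posterior therefore only needs this full conditional. The sketch below performs one Gibbs sweep over the pixels of a noisy image \(y\); the values of \(J\) and \(\sigma\), and the raster-scan update order, are illustrative assumptions.

```python
import numpy as np

# One Gibbs sweep for Bayesian binary denoising, using
# P(X_i = +1 | X_{-i}, Y) = 1 / (1 + exp(-2*(J*eta_i + y_i/sigma^2))).
rng = np.random.default_rng(3)

def denoise_sweep(x, y, J=0.7, sigma=1.0):
    n_rows, n_cols = x.shape
    for i in range(n_rows):
        for j in range(n_cols):
            eta = 0
            if i > 0:          eta += x[i - 1, j]
            if i < n_rows - 1: eta += x[i + 1, j]
            if j > 0:          eta += x[i, j - 1]
            if j < n_cols - 1: eta += x[i, j + 1]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (J * eta + y[i, j] / sigma ** 2)))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x
```

A natural starting point is, for example, `x = np.where(y >= 0, 1, -1)`, after which `denoise_sweep` is repeated for a number of sweeps.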

Slice Sampling

Slice sampling is a technique that combines the idea of the Fundamental Theorem of Simulation with Gibbs sampling.

Recall from the Fundamental Theorem of Simulation, \[ f(x)=\int_{0}^{f(x)}1du. \] So \(f(x)\) is a marginal density (in \(X\)) of the joint uniform distribution on the set \[ \{(x, u): 0<u<f(x), x\in\mathcal{X}\}, \] with the p.d.f. \(f(x, u)=1_{\{(x, u): 0<u<f(x)\}}(x, u).\)

To obtain a sample of \((X ,U)\) from this distribution, we can run a Markov chain that has this joint distribution as its stationary distribution, which can be done using Gibbs sampling, noting that \[\begin{align*} &f_{X|U}(x|u) \sim \operatorname{unif}\{x: u< f(x) \}, \\ &f_{U|X}(u|x) \sim \operatorname{unif}\{u: 0< u< f(x) \}. \end{align*}\]

Slice sampling:

At iteration \(s\), simulate

  1. \(U^{(s+1)} \sim \operatorname{unif}(0, f(X^{(s)}))\),
  2. \(X^{(s+1)} \sim \operatorname{unif}(A^{(s+1)}),\) where \(A^{(s+1)}:=\{x: f(x)>U^{(s+1)} \}.\)

Remark 1: A major difficulty in the algorithm is to determine the set \(A^{(s+1)}\). For a univariate distribution, we can apply an iterative expansion called stepping out: start with some small constants \(L>0, R>0\) and consider the candidate interval \((X^{(s)}-L, X^{(s)}+R)\), then extend each endpoint until it falls outside the slice (i.e., until \(f(X^{(s)}-L) \leq U^{(s+1)}\) and \(f(X^{(s)}+R)\leq U^{(s+1)}\)), or until some fixed number of steps is reached. Eventually we aim for the interval \[ \tilde{A}^{(s+1)}=\{x\in (X^{(s)}-L, X^{(s)}+R): f(x)>U^{(s+1)} \}. \] Then, in a loop, we draw \(X^*\) uniformly from the interval \((X^{(s)}-L, X^{(s)}+R)\) and accept/reject: if \(f(X^*)>U^{(s+1)}\), we accept and set \(X^{(s+1)} \leftarrow X^*\); if \(f(X^*)\leq U^{(s+1)}\), we shrink the interval towards \(X^{(s)}\), specifically, if \(X^*>X^{(s)}\), set \(R \leftarrow X^*-X^{(s)}\), and if \(X^*<X^{(s)}\), set \(L \leftarrow X^{(s)}-X^*\); then we redraw from the shrunken interval.
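Below is a minimal Python sketch of univariate slice sampling with the stepping-out and shrinkage steps described in Remark 1; per Remark 3, the target \(f\) may be an unnormalized kernel. The initial width `w`, the cap on expansion steps, and the example target (the one from the two-stage Gibbs example) are illustrative choices.

```python
import numpy as np

# Univariate slice sampling with stepping out and shrinkage.
rng = np.random.default_rng(0)

def slice_sample(f, x0, n_iter=5000, w=1.0, max_steps=50):
    x = x0
    xs = np.empty(n_iter)
    for s in range(n_iter):
        u = rng.uniform(0.0, f(x))               # 1. U ~ unif(0, f(x))
        # stepping out: expand each endpoint until it leaves the slice
        left, right = x - w, x + w
        steps = 0
        while f(left) > u and steps < max_steps:
            left -= w
            steps += 1
        steps = 0
        while f(right) > u and steps < max_steps:
            right += w
            steps += 1
        # 2. draw uniformly from (left, right); shrink towards x on rejection
        while True:
            x_star = rng.uniform(left, right)
            if f(x_star) > u:
                x = x_star
                break
            if x_star > x:
                right = x_star
            else:
                left = x_star
        xs[s] = x
    return xs

# example: the unnormalized target from the two-stage Gibbs example above
z1, z2 = -4.3, 5.2
f = lambda x: np.exp(-x**2 / 20) / ((1 + (z1 - x)**2) * (1 + (z2 - x)**2))
samples = slice_sample(f, x0=0.0)
```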

Remark 2: The procedure described in Remark 1 can be generalized to multivariate distributions.

Remark 3: The algorithm remains valid if we use a kernel \(\tilde{f}\) of the p.d.f. (i.e., an unnormalized version of \(f\)) in the algorithm.

General form:

Suppose \(f(x) \propto \prod_{i=1}^k f_i(x)\), where each \(f_i(x)>0\) but is not necessarily a density. Note that \(f_i(x) = \int 1\{0 \leq w_i \leq f_i(x) \}\,dw_i\). Thus \(f\) is the marginal distribution (in \(x\)) of the joint distribution \[ f(x, w_1, \ldots, w_k) \overset{x, w_1, \ldots, w_k}{\propto} \prod_{i=1}^k 1\{0\leq w_i \leq f_i(x) \}, \] noting \[ f(x) \overset{x}{\propto} \int \cdots \int \prod_{i=1}^k 1\{0 \leq w_i \leq f_i(x)\}\,dw_{1}\cdots dw_{k}= \prod_{i}f_i(x). \]

Slice sampling:

At iteration \(s\), simulate

  1. \(W_1^{(s+1)} \sim \operatorname{unif}(0, f_1(X^{(s)}))\),

    \(W_2^{(s+1)} \sim \operatorname{unif}(0, f_2(X^{(s)}))\),

    \(\vdots\)

    \(W_{k}^{(s+1)} \sim \operatorname{unif}(0, f_k(X^{(s)}))\),

  2. \(X^{(s+1)} \sim \operatorname{unif}(A^{(s+1)})\), where \(A^{(s+1)}:=\{x: f_i(x) \geq W_i^{(s+1)}, \forall i=1, \ldots, k\}.\)
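As an illustration of the general form, the target from the earlier example, \(f(x)\propto e^{-x^2/20}\cdot\frac{1}{1+(z_1-x)^2}\cdot\frac{1}{1+(z_2-x)^2}\), is a product of \(k=3\) positive factors. The Python sketch below runs the product slice sampler on it; step 2 samples uniformly from \(A^{(s+1)}\) by noting that the Gaussian factor confines \(A^{(s+1)}\) to a bounded interval and then accepting/rejecting on the two remaining constraints. The chain length and starting value are illustrative choices.

```python
import numpy as np

# Product slice sampler for f(x) ∝ f1(x) f2(x) f3(x) with the factors below.
rng = np.random.default_rng(5)
z1, z2 = -4.3, 5.2
f1 = lambda x: np.exp(-x ** 2 / 20)
f2 = lambda x: 1.0 / (1.0 + (z1 - x) ** 2)
f3 = lambda x: 1.0 / (1.0 + (z2 - x) ** 2)

def product_slice(n_iter=5000, x0=0.0):
    x = x0
    xs = np.empty(n_iter)
    for s in range(n_iter):
        # 1. W_i ~ unif(0, f_i(x)) for each factor
        w1 = rng.uniform(0.0, f1(x))
        w2 = rng.uniform(0.0, f2(x))
        w3 = rng.uniform(0.0, f3(x))
        # 2. X ~ unif(A), A = {x : f_i(x) >= W_i for all i}. The constraint
        # f1(x) >= w1 means |x| <= B with B = sqrt(-20 log w1), so uniform
        # draws on [-B, B] plus accept/reject on the other two constraints
        # give a uniform draw on A.
        B = np.sqrt(-20.0 * np.log(w1))
        while True:
            x_star = rng.uniform(-B, B)
            if f2(x_star) >= w2 and f3(x_star) >= w3:
                x = x_star
                break
        xs[s] = x
    return xs

samples = product_slice()
```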

Python Code Examples

Provided and discussed in lab.
