Statistical Simulation, Wei Li. Updated: 2025-11-18

Metropolis-Hastings algorithm (blockwise)

Target density: \(f(x,y)\), which could be unnormalized.

We use two candidate transition densities \(q_{X} (x \vert x_c, y_c)\) and \(q_{Y} (y \vert x_c, y_c)\), where \(x_c\), \(y_c\) stand for current values.

Algorithm:

  1. Update \(X\)
    1. sample \(X^* \sim q_X(\cdot \vert X^{(s)}, Y^{(s)}),\)
    2. compute \(\alpha_X (X^* \vert X^{(s)}, Y^{(s)})=\displaystyle \frac{f(X^{*}, Y^{(s)})}{f(X^{(s)}, Y^{(s)})} \cdot \frac{q_X(X^{(s)}\vert X^{*}, Y^{(s)})}{q_X(X^{*} \vert X^{(s)}, Y^{(s)})},\)
    3. set \(X^{(s+1)}\) to \(X^*\) with probability \(\min (1, \alpha_X)\); set \(X^{(s+1)}\) to \(X^{(s)}\) otherwise.
  2. Update \(Y\)
    1. sample \(Y^* \sim q_Y(\cdot \vert X^{(s+1)}, Y^{(s)}),\)
    2. compute \(\alpha_Y (Y^* \vert X^{(s+1)}, Y^{(s)})=\displaystyle \frac{f(X^{(s+1)}, Y^{*})}{f(X^{(s+1)}, Y^{(s)})} \cdot \frac{q_Y(Y^{(s)}\vert X^{(s+1)}, Y^{*})}{q_Y(Y^{*} \vert X^{(s+1)}, Y^{(s)})},\)
    3. set \(Y^{(s+1)}\) to \(Y^*\) with probability \(\min (1, \alpha_Y)\); set \(Y^{(s+1)}\) to \(Y^{(s)}\) otherwise.
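As a concrete illustration, here is a minimal Python sketch of the blockwise update above, assuming Gaussian random-walk proposals for both blocks (so the \(q\)-ratios cancel) and a generic, possibly unnormalized, target `f`; the function name and step sizes are illustrative choices, not part of the notes.

```python
import numpy as np

def blockwise_mh(f, x0, y0, n_iter=5000, sx=1.0, sy=1.0, rng=None):
    """Blockwise Metropolis-Hastings with Gaussian random-walk proposals.

    f      : unnormalized target density f(x, y)
    sx, sy : proposal standard deviations for the X- and Y-blocks
    """
    rng = np.random.default_rng() if rng is None else rng
    x, y = x0, y0
    samples = np.empty((n_iter, 2))
    for s in range(n_iter):
        # --- update X given the current (x, y) ---
        x_star = x + sx * rng.standard_normal()
        # symmetric proposal: the q-ratio cancels, alpha is a ratio of target values
        alpha_x = f(x_star, y) / f(x, y)
        if rng.uniform() < min(1.0, alpha_x):
            x = x_star
        # --- update Y given the new x and the current y ---
        y_star = y + sy * rng.standard_normal()
        alpha_y = f(x, y_star) / f(x, y)
        if rng.uniform() < min(1.0, alpha_y):
            y = y_star
        samples[s] = (x, y)
    return samples

# example usage with an illustrative unnormalized target:
# samples = blockwise_mh(lambda x, y: np.exp(-0.5 * (x**2 + y**2 + x*y)), 0.0, 0.0)
```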

Remark:

The conditional transition kernel for \(X\) (given \(Y\)) satisfies detailed balance with respect to \(f(x, y)\), and the same is true for the kernel for \(Y\) (given \(X\)). Thus, each componentwise MH step individually satisfies detailed balance and is reversible with respect to \(f(x, y)\).

However, the detailed balance condition does not hold for the composition: when the two steps are composed (first update \(X\), then \(Y\)), the overall transition kernel \[\begin{align*} p\left((x, y),\left(x^*, y^*\right)\right)=p_X\left((x, y),\left(x^*, y\right)\right) p_Y\left(\left(x^*, y\right),\left(x^*, y^*\right)\right) \end{align*}\] generally does not satisfy the global detailed balance condition \(\displaystyle p \big( (x,y) \to (x^*,y^*) \big) f(x,y) = p \big( (x^*,y^*) \to (x,y) \big) f(x^*,y^*)\).

Proof: provided in class.

Gibbs Sampling

Recall the block-wise Metropolis-Hastings: to sample from \(f(X,Y)\), we used two candidate transition densities \(q_X(x|X,Y)\) and \(q_Y(y|X,Y)\).

  1. Update \(X\): \[\begin{align*} \alpha_X(X^*|X^{(s)},Y^{(s)})&=\frac{f(X^*,Y^{(s)})}{f(X^{(s)},Y^{(s)})}\cdot\frac{q_X(X^{(s)}|X^{*},Y^{(s)})}{q_X(X^*|X^{(s)},Y^{(s)})}. \end{align*}\]

Suppose we take \(q_X(x|X^{(s)}, Y^{(s)})\) to be the full conditional distribution \(f_{X|Y}(x|Y^{(s)})\), then

\[\begin{align*} \alpha_X(X^*|X^{(s)},Y^{(s)})&=\frac{f(X^*,Y^{(s)})}{f(X^{(s)},Y^{(s)})}\cdot\frac{q_X(X^{(s)}|X^{*},Y^{(s)})}{q_X(X^*|X^{(s)},Y^{(s)})}\\ &=\frac{f_{X|Y}(X^*|Y^{(s)})}{f_{X|Y}(X^{(s)}|Y^{(s)})}\cdot\frac{f_{X|Y}(X^{(s)}|Y^{(s)})}{f_{X|Y}(X^*|Y^{(s)})}\\ &=1. \end{align*}\]

  2. Update \(Y\): A similar argument goes through if we use the full conditional distribution as the candidate transition density, i.e., taking \(q_Y(y|X^{(s+1)},Y^{(s)})\) to be the full conditional distribution \(f_{Y|X}(y|X^{(s+1)})\), which again gives \(\alpha_Y=1\).

Definition: Gibbs sampling is the special case of (blockwise) MH that takes \(q_X(x|X_c, Y_c)\) to be \(f_{X|Y}(x|Y_c)\) and \(q_Y(y|X_c,Y_c)\) to be \(f_{Y|X}(y|X_c)\), which yields \(\alpha_X=\alpha_Y=1\). Under this sampling scheme, all proposals are automatically accepted.

Algorithm: Two-stage Gibbs sampling

Target density: \(f(X,Y)\), possibly unnormalized.

Take some \(x^{(0)}\). For \(s=0,1,\cdots\), generate \[\begin{align*} (1)&\quad Y^{(s+1)} \sim f_{Y|X}(\cdot \mid X^{(s)})\\ (2)&\quad X^{(s+1)} \sim f_{X|Y}(\cdot \mid Y^{(s+1)}) \end{align*}\]

Algorithm: Multi-stage Gibbs sampling

Target density: \(f(X_1,X_2,\cdots,X_d)\), possibly unnormalized.

Starting values: \(X^{(0)}=(X^{(0)}_1,X^{(0)}_2,\cdots,X^{(0)}_d)\). Let \(f_{j}(X_j|X_{-j})\) denote the conditional density of \(X_j\) given all the remaining components \(X_{-j}:=\{X_i: i\neq j\}\).

The algorithm generates \(X^{(s+1)}\) from \(X^{(s)}\) as follows:

\[\begin{align*} (1)&\quad X^{(s+1)}_1 \sim f_1(\cdot | X^{(s)}_2, \cdots,X^{(s)}_d)\\ (2)&\quad X^{(s+1)}_2 \sim f_2(\cdot | X^{(s+1)}_1, X^{(s)}_3, \cdots,X^{(s)}_d)\\ & \vdots \\ (d)&\quad X^{(s+1)}_d \sim f_d(\cdot | X^{(s+1)}_1, \cdots,X^{(s+1)}_{d-1}) \end{align*}\]

Repeat \(s \leftarrow s+1\).
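The sweep above can be written as a generic driver once samplers for the full conditionals are available. Below is a minimal Python skeleton, assuming the user supplies one function per coordinate that draws from \(f_j(\cdot \mid x_{-j})\); the list `conditionals` is a placeholder for these user-supplied samplers.

```python
import numpy as np

def gibbs_sweep(conditionals, x, rng):
    """One systematic-scan Gibbs sweep.

    conditionals : list of functions; conditionals[j](x, rng) must return a
                   draw from f_j( . | x_{-j}) using the current values in x
    x            : current state, updated coordinate by coordinate
    """
    for j in range(len(x)):
        x[j] = conditionals[j](x, rng)
    return x

def gibbs_sampler(conditionals, x0, n_iter=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    out = np.empty((n_iter, len(x)))
    for s in range(n_iter):
        x = gibbs_sweep(conditionals, x, rng)
        out[s] = x
    return out
```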

\(\textbf{Advantage:}\) For high-dimensional problems, every update reduces to sampling from a univariate full conditional, and every draw is automatically accepted.


Example: Two-stage Gibbs sampling

\[f(x)\propto\frac{e^{-x^2/20}}{(1+(z_1-x)^2)(1+(z_2-x)^2)},\quad z_1=-4.3, z_2=5.2\]

Note that \(\displaystyle \frac{1}{1+(z_i-x)^2}=\int_0^\infty e^{-w_i(1+(z_i-x)^2)}\,dw_i\), so we can write \[f(x,w_1,w_2)\propto e^{-x^2/20}\prod_{i=1}^2 e^{-w_i(1+(z_i-x)^2)},\] and \(f(x)\) is just the marginal pdf (in \(x\)) of \(f(x,w_1,w_2)\).

Gibbs sampling on \((X, W)\), where \(W=(W_1, W_2)\):

  • \(X^{(s+1)}\sim f_{X|W}(\cdot|{W}^{(s)})\)
  • \(W^{(s+1)}\sim f_{W|X}(\cdot|X^{(s+1)})\)

Here \[f_{X|W}(x|w_1,w_2) \overset{x} \propto e^{-\big(\sum w_i\big)x^2+2x\sum w_iz_i} e^{-x^2/20}\sim N \Big(\frac{\sum w_iz_i}{\sum w_i+1/20},\frac{1}{2\big(\sum w_i+1/20\big)} \Big),\] \[f_{W|X}(w_1,w_2|x) \overset{w} \propto e^{-w_1(1+(z_1-x)^2)} e^{-w_2(1+(z_2-x)^2)}.\] Note from the factorization above, \(w_1,w_2\) are independent given \(x\), therefore, \[\begin{align*} W_1|X &\sim \mbox{exponential}(1+(z_1-X)^2),\\ W_2|X & \sim \mbox{exponential}(1+(z_2-X)^2). \end{align*}\]
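A short Python implementation of this two-stage Gibbs sampler, following the conditionals derived above (a sketch; note that the exponential distributions have rate \(1+(z_i-x)^2\), so the NumPy `scale` argument is the reciprocal).

```python
import numpy as np

z = np.array([-4.3, 5.2])

def gibbs_example(n_iter=10000, x0=0.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    w = np.ones(2)                       # arbitrary starting values for (w1, w2)
    xs = np.empty(n_iter)
    for s in range(n_iter):
        # X | W ~ N( sum(w*z) / (sum(w) + 1/20), 1 / (2*(sum(w) + 1/20)) )
        prec = w.sum() + 1.0 / 20.0
        mean = (w * z).sum() / prec
        var = 1.0 / (2.0 * prec)
        x = rng.normal(mean, np.sqrt(var))
        # W_i | X ~ Exponential(rate = 1 + (z_i - x)^2), i.e. scale = 1/rate
        w = rng.exponential(scale=1.0 / (1.0 + (z - x) ** 2))
        xs[s] = x
    return xs
```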

Multivariate Normal

The multivariate normal distribution of a \(d-\)dimensional random vector \(\mathbf{X}=(X_1, \cdots, X_d)^{\top}\in\mathbb{R}^d\) can be written in the notation \(\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{\Sigma}).\) When the symmetric covariance matrix \(\mathbf{\Sigma}\) is positive definite, then the multivariate normal distribution is non-degenerate, and the distribution has density function \[\begin{align*} f_{\mathbf{X}}(\mathbf{x} \vert \boldsymbol\mu, \mathbf{\Sigma}) &= \frac{1}{(2 \pi)^{\frac{d}{2}}} \big( \det(\mathbf{\Sigma}) \big)^{-\frac{1}{2}} \exp \big(-\frac{1}{2} (\mathbf{x}-\boldsymbol\mu)^{\top} \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol\mu) \big)\\ &\overset{\mathbf{x}} \propto \exp(-\frac{1}{2} \mathbf{x}^{\top} \mathbf{\Sigma}^{-1} \mathbf{x} + \mathbf{x}^{\top} \mathbf{\Sigma}^{-1} \boldsymbol\mu) \end{align*}\] where \(\mathbf{x}=(x_1, \cdots, x_d)^{\top}\).

For comparison, the density function of (univariate) normal distribution \(X\sim N(\mu, \sigma^2)\), \[ f_X(x \vert \mu, \sigma^2)=\frac{1}{\sqrt{2\pi}\sigma} \exp \big(-\frac{1}{2\sigma^2}(x-\mu)^2 \big) \overset{x} \propto \exp\big(-\frac{1}{2\sigma^2}(x^2-2x\mu+\mu^2)\big) \overset{x} \propto \exp\big(-\frac{1}{2} x^2 (\sigma^2)^{-1}+ x(\sigma^2)^{-1}\mu\big). \]

Bivariate normal \[\displaystyle (X,Y)\sim N \left(\begin{pmatrix}\mu_X\\ \mu_Y\end{pmatrix},\begin{pmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY}&\sigma_Y^2\end{pmatrix} \right),\] Let \(\rho=\frac{\sigma_{XY}}{\sigma_{X}\sigma_Y}\) be the correlation. It is a fact that \[f_{X|Y}(x|y)\sim N(\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y), \sigma_X^2(1-\rho^2)),\] \[f_{Y|X}(y|x)\sim N(\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \sigma_Y^2(1-\rho^2)).\]
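These full conditionals make the bivariate normal a standard test case for Gibbs sampling. Below is a minimal Python sketch; the parameter values in the usage line are chosen purely for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(mu_x, mu_y, sigma_x, sigma_y, rho,
                           n_iter=5000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, y = mu_x, mu_y                      # arbitrary starting values
    out = np.empty((n_iter, 2))
    sd_x = sigma_x * np.sqrt(1.0 - rho ** 2)
    sd_y = sigma_y * np.sqrt(1.0 - rho ** 2)
    for s in range(n_iter):
        # X | Y = y ~ N(mu_x + rho*(sigma_x/sigma_y)*(y - mu_y), sigma_x^2*(1 - rho^2))
        x = rng.normal(mu_x + rho * sigma_x / sigma_y * (y - mu_y), sd_x)
        # Y | X = x ~ N(mu_y + rho*(sigma_y/sigma_x)*(x - mu_x), sigma_y^2*(1 - rho^2))
        y = rng.normal(mu_y + rho * sigma_y / sigma_x * (x - mu_x), sd_y)
        out[s] = (x, y)
    return out

# example: strongly correlated components (illustrative values)
samples = gibbs_bivariate_normal(0.0, 0.0, 1.0, 2.0, rho=0.9)
```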

2D Ising model

The Ising model is a mathematical model of ferromagnetism in statistical mechanics. It comprises a lattice of spins, where each spin can be either up (+1) or down (-1). These spins represent magnetic dipole moments of atomic “spins”. The key interest in the Ising model is understanding how these spins interact under varying temperatures and how these interactions lead to macroscopic phenomena like phase transitions.

In image processing, a 2D lattice can represent an image where each lattice site corresponds to a pixel: \[\begin{align*} I=\{1,2, \ldots, L\} \times\{1,2, \ldots, L\} \end{align*}\] The value at each site (spin) can denote various image properties, like color or intensity. Here we consider binary images, where the state space is given by \(\mathcal{X}=\{-1, 1\}^{I}\).

The Ising model describes the energy of a system using an energy function:

\[\begin{equation*} H(x) = -J \sum_{\substack{i \sim j \\ i, j \in I}} x_i x_j - h\sum_{i} x_i, \quad x \in \mathcal{X}, \end{equation*}\]

where \(i \sim j\) indicates that \(x_i\) and \(x_j\) are spins at adjacent lattice sites. For simplicity, we consider only nearest neighbours: a pixel \(i=\left(i_1, i_2\right)\) in the interior of the grid has four neighbours, \(\left(i_1+1, i_2\right)\), \(\left(i_1, i_2+1\right)\), \(\left(i_1-1, i_2\right)\), and \(\left(i_1, i_2-1\right)\); a pixel on an edge of the grid (but not a corner) has three neighbours, and a corner pixel has two.

  • Spin (or color) \(x_i\): represents the state of each pixel in image processing contexts, \(+1\) is black, \(-1\) is white.
  • Interaction energy \(J\): the coupling strength (playing the role of an inverse temperature here), dictating how strongly neighboring pixels influence each other.
    • Positive \(J\): the energy of the system is lower when neighboring spins are aligned. This scenario promotes homogeneity in the system.
    • Negative \(J\): the system has lower energy when neighboring spins are opposite. This situation encourages heterogeneity and contrast.
  • External Field \(h\): represents an external field (which can be thought of as a bias or an external influence on the system). In image processing, this can be used to model external factors or a priori information about the image. For simplicity, we assume \(h=0\) for the moment.

The probability measure (via the Boltzmann-Gibbs transformation) for the system on \(\mathcal{X}\) can be written as \[\begin{align*} \pi(x)=\frac{1}{Z} \exp (-H(x)), \end{align*}\] where \(Z\) is the normalizing constant.

It can be shown that \[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}\right) = \frac{1}{1+\exp \left(-2J \eta_i\right)}, \end{align*}\] where \(\eta_i := \sum_{j \sim i} x_j\).
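A brief justification (a standard calculation, with \(h=0\)): only the terms of \(H\) that involve \(x_i\) matter, and \(\pi(x_i=\pm 1 \mid \boldsymbol{x}_{-i}) \propto \exp(\pm J \eta_i)\), so

\[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}\right) =\frac{\exp\left(J \eta_i\right)}{\exp\left(J \eta_i\right)+\exp\left(-J \eta_i\right)} =\frac{1}{1+\exp\left(-2 J \eta_i\right)}. \end{align*}\]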

The following figures show some typical samples as \(J\) varies.
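Such samples can be generated with a single-site Gibbs sampler based on the full conditional above. Below is a minimal Python sketch (free boundary conditions, \(h=0\); the lattice size and the value of \(J\) are illustrative choices).

```python
import numpy as np

def ising_gibbs_sweep(x, J, rng):
    """One systematic Gibbs sweep over an L x L lattice of +/-1 spins."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            # sum of nearest-neighbour spins (missing neighbours contribute 0)
            eta = 0
            if i > 0:
                eta += x[i - 1, j]
            if i < L - 1:
                eta += x[i + 1, j]
            if j > 0:
                eta += x[i, j - 1]
            if j < L - 1:
                eta += x[i, j + 1]
            # P(X_ij = +1 | rest) = 1 / (1 + exp(-2 J eta))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * eta))
            x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x

rng = np.random.default_rng(0)
L, J = 64, 0.6                                   # illustrative values
x = rng.choice([-1, 1], size=(L, L))             # random initial configuration
for _ in range(200):                             # a few hundred sweeps
    x = ising_gibbs_sweep(x, J, rng)
```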

Bayesian binary denoising

Assume that a collection of original binary images has a distribution that is described by the Ising model \(\pi(x)\) above. We further assume that the observed images are noisy versions of the original images.

To be concrete, we assume that for each pixel \(X_i\) we actually observe \(Y_i=X_i+\epsilon_i\), where the \(\epsilon_i\) are independently distributed \(N(0, \sigma^2)\).

Using Bayes’ rule, \[\begin{align*} p_{X \mid Y}(x \mid y)=\frac{p_{Y \mid X}(y \mid x) p_X(x)}{p_Y(y)}=\frac{1}{Z_y} \exp \left(-H(y, x)\right), \end{align*}\] where (using \(x_i^2=1\), so that the Gaussian likelihood contributes \(\frac{1}{\sigma^2}\sum_i y_i x_i\) to \(-H(y,x)\) up to terms not depending on \(x\)) \[\begin{align*} H(y, x)=-\frac{1}{\sigma^2} \sum_{i \in I} y_i x_i+H(x)=-\frac{1}{\sigma^2} \sum_{i \in I} y_i x_i-J \sum_{\substack{i \sim j}} x_i x_j. \end{align*}\] The full conditional distribution of \(X_i\) can be derived similarly: \[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}, Y=\boldsymbol{y} \right) = \frac{1}{1+\exp \left(-2(J \eta_i+ y_i/\sigma^2)\right)}, \end{align*}\] where \(\eta_i := \sum_{j \sim i} x_j\).
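The resulting single-site Gibbs update differs from the prior-only Ising sweep only through the data term \(y_i/\sigma^2\) inside the sigmoid. A minimal Python sketch (free boundary conditions; `sigma2` denotes \(\sigma^2\)):

```python
import numpy as np

def denoising_gibbs_sweep(x, y, J, sigma2, rng):
    """One Gibbs sweep targeting the posterior p(x | y) for binary denoising."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            eta = 0
            if i > 0:
                eta += x[i - 1, j]
            if i < L - 1:
                eta += x[i + 1, j]
            if j > 0:
                eta += x[i, j - 1]
            if j < L - 1:
                eta += x[i, j + 1]
            # P(X_ij = +1 | rest, y) = 1 / (1 + exp(-2 (J*eta + y_ij / sigma^2)))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (J * eta + y[i, j] / sigma2)))
            x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x
```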

An example of an original binary image, its noisy version, and a reconstruction based on the posterior probabilities.

Slice Sampling

Slice sampling is a technique that combines the idea of the Fundamental Theorem of Simulation with Gibbs sampling.

Recall from the Fundamental Theorem of Simulation, \[ f(x)=\int_{0}^{f(x)}1du. \] So \(f(x)\) is a marginal density (in \(X\)) of the joint uniform distribution on the set \[ \{(x, u): 0<u<f(x), x\in\mathcal{X}\}, \] with the p.d.f. \(f(x, u)=1_{\{(x, u): 0<u<f(x), x\in \mathcal{X}\}}(x, u).\)

To obtain a sample of \((X, U)\) from this distribution, we can run a Markov chain with this joint distribution as its stationary distribution, which can be done using Gibbs sampling, noting that \[\begin{align*} &f_{X|U}(x|u) \sim \operatorname{unif}\{x: u< f(x) \}, \\ &f_{U|X}(u|x) \sim \operatorname{unif}\{u: 0< u< f(x) \}. \end{align*}\]

Slice sampling:

At iteration \(s\), simulate

  1. \(U^{(s+1)} \sim \operatorname{unif}(0, f(X^{(s)}))\),
  2. \(X^{(s+1)} \sim \operatorname{unif}(A^{(s+1)}),\) where \(A^{(s+1)}:=\{x: f(x)>U^{(s+1)} \}.\)

Conceptually, Step 2 can indeed be understood as an accept-reject procedure: draw candidates uniformly from a set known to contain \(A^{(s+1)}\) (for instance, a bounding interval) and accept the first candidate \(x\) with \(f(x)>U^{(s+1)}\); the stepping-out and shrinkage procedure described later implements exactly this idea.
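When the slice is available in closed form, no accept-reject step is needed at all. Below is a minimal Python sketch, using the illustrative kernel \(\tilde f(x)=e^{-x^2/2}\) (standard normal up to a constant), for which the slice is the interval \((-\sqrt{-2\ln u},\, \sqrt{-2\ln u})\); the target choice is an assumption made here for illustration.

```python
import numpy as np

def slice_sampler_normal(n_iter=10000, x0=0.0, rng=None):
    """Slice sampler for the (unnormalized) kernel f(x) = exp(-x^2 / 2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    out = np.empty(n_iter)
    for s in range(n_iter):
        # Step 1: U | X ~ unif(0, f(X))
        u = rng.uniform(0.0, np.exp(-0.5 * x * x))
        # Step 2: X | U ~ unif on the slice {x : exp(-x^2/2) > u}
        half_width = np.sqrt(-2.0 * np.log(u))
        x = rng.uniform(-half_width, half_width)
        out[s] = x
    return out
```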

General form:

Suppose \(f(x) \propto \prod_{i=1}^k f_i(x)\), where each \(f_i(x)>0\) is not necessarily a density. Note that \(f_i(x) = \int 1\{x: 0 \leq w_i \leq f_i(x) \}dw_i\). Thus \(f\) is the marginal distribution of the joint distribution \[ f(x, w_1, \ldots, w_k) \overset{x, w_1, \ldots, w_k}{\propto} \prod_{i=1}^k 1\{w_i : 0\leq w_i \leq f_i(x) \}, \] noting \[ f(x) \overset{x}{\propto} \int \cdots \int \prod_{i=1}^k 1\{w_i: 0 \leq w_i \leq f_i(x)\}dw_{1}\cdots dw_{k}= \prod_{i}f_i(x). \]

Slice sampling:

At iteration \(s\), simulate

  1. \(W_1^{(s+1)} \sim \operatorname{unif}(0, f_1(X^{(s)}))\),

    \(W_2^{(s+1)} \sim \operatorname{unif}(0, f_2(X^{(s)}))\),

    \(\vdots\)

    \(W_{k}^{(s+1)} \sim \operatorname{unif}(0, f_k(X^{(s)}))\),

  2. \(X^{(s+1)} \sim \operatorname{unif}(A^{(s+1)})\), where \(A^{(s+1)}:=\{x: f_i(x) \geq W_i^{(s+1)}, \forall i=1, \ldots, k\}.\)
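As a concrete instance of the general form (with a target chosen here purely for illustration), take \(f(x)\propto f_1(x) f_2(x)\) with \(f_1(x)=e^{-x^2/2}\) and \(f_2(x)=1/(1+x^2)\); each individual slice is a symmetric interval around \(0\), so \(A^{(s+1)}\) is simply their intersection. A minimal Python sketch:

```python
import numpy as np

def product_slice_sampler(n_iter=10000, x0=0.0, rng=None):
    """Slice sampler for f(x) proportional to exp(-x^2/2) * 1/(1 + x^2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    out = np.empty(n_iter)
    for s in range(n_iter):
        # Step 1: one auxiliary variable per factor
        w1 = rng.uniform(0.0, np.exp(-0.5 * x * x))     # W1 ~ unif(0, f1(x))
        w2 = rng.uniform(0.0, 1.0 / (1.0 + x * x))      # W2 ~ unif(0, f2(x))
        # Step 2: A = {x : f1(x) >= w1 and f2(x) >= w2}, an interval around 0
        b1 = np.sqrt(-2.0 * np.log(w1))                 # f1(x) >= w1  <=>  |x| <= b1
        b2 = np.sqrt(1.0 / w2 - 1.0)                    # f2(x) >= w2  <=>  |x| <= b2
        b = min(b1, b2)
        x = rng.uniform(-b, b)
        out[s] = x
    return out
```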

Stepping-out and shrinkage procedure

A major challenge in slice sampling is to identify or approximate the slice \(A^{(s+1)}\). For a univariate target \(f\), a common strategy is the stepping-out and shrinkage procedure:

  1. Start with small positive constants \(L\) and \(R\), and define an initial candidate interval

\[\begin{align*} \left(X^{(s)}-L, X^{(s)}+R\right) . \end{align*}\]

  2. Stepping out: iteratively (and possibly independently) expand the interval until both ends fall outside the slice (that is, until \(f\left(X^{(s)}-L\right) \leq U^{(s+1)}\) and \(f\left(X^{(s)}+R\right) \leq U^{(s+1)}\)) or until a fixed maximum number of expansions is reached.
  3. Define the working interval \[\begin{align*} \tilde{A}^{(s+1)}=\left\{x \in\Big(X^{(s)}-L, X^{(s)}+R\Big): f(x)>U^{(s+1)}\right\} . \end{align*}\]
  4. Sampling by accept-reject on \(\tilde{A}^{(s+1)}\) with shrinkage:
    • Draw a candidate \(X^*\) uniformly from the current interval.
    • If \(f\left(X^*\right)>U^{(s+1)}\), accept and set \(X^{(s+1)} \leftarrow X^*\).
    • Otherwise, shrink the interval (by adjusting the widths) and redraw from the reduced interval:
      • If \(X^*>X^{(s)}\), set \(R \leftarrow X^*-X^{(s)}\);
      • If \(X^*<X^{(s)}\), set \(L \leftarrow X^{(s)}-X^*\).

This iterative “shrinkage” ensures that the accepted point always lies within the slice.
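A Python sketch of the stepping-out and shrinkage procedure for a univariate target, following the steps above; the initial width `w` and the expansion cap `m` are illustrative tuning choices, and the usage example reuses the target from the two-stage Gibbs example earlier.

```python
import numpy as np

def slice_step(f, x, u, w=1.0, m=50, rng=None):
    """One slice-sampling update: stepping out, then shrinkage.

    f : (possibly unnormalized) target density
    x : current point X^(s);  u : slice level U^(s+1) in (0, f(x))
    w : initial interval half-width;  m : max number of expansions per side
    """
    rng = np.random.default_rng() if rng is None else rng
    left, right = x - w, x + w
    # stepping out: expand each end until it falls outside the slice
    k = m
    while f(left) > u and k > 0:
        left -= w
        k -= 1
    k = m
    while f(right) > u and k > 0:
        right += w
        k -= 1
    # accept-reject with shrinkage: rejected candidates shrink the interval
    while True:
        x_star = rng.uniform(left, right)
        if f(x_star) > u:
            return x_star
        if x_star > x:
            right = x_star
        else:
            left = x_star

def slice_sampler(f, x0, n_iter=5000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    out = np.empty(n_iter)
    for s in range(n_iter):
        u = rng.uniform(0.0, f(x))          # U^(s+1) ~ unif(0, f(X^(s)))
        x = slice_step(f, x, u, rng=rng)    # X^(s+1) via stepping out + shrinkage
        out[s] = x
    return out

# example: the target from the two-stage Gibbs example earlier
z1, z2 = -4.3, 5.2
f = lambda x: np.exp(-x**2 / 20) / ((1 + (z1 - x)**2) * (1 + (z2 - x)**2))
samples = slice_sampler(f, x0=0.0)
```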

Remark 1: The above procedure generalizes to multivariate distributions, though defining and sampling from \(A^{(s+1)}\) becomes more challenging.

Remark 2: The algorithm remains valid if we use the kernel \(\tilde{f}\) (the unnormalized version of the p.d.f.) in place of \(f\) throughout.

Remark 3: The Gibbs sampler samples directly from the original space based on full conditionals and therefore requires closed-form distributions or invertible CDFs (unless combined with some other sampling methods). Slice sampling can be viewed as a Gibbs sampler operating on an augmented space. With stepping-out and shrinkage procedures, it often mixes better than Gibbs sampling on the original state space. Furthermore, this works even when the conditional distribution (in the original space) does not have a closed form.

Python Code Examples

Provided and discussed in lab.
