Statistical Simulation, Wei Li. Updated: 2025-11-18.
Target density: \(f(x,y)\), which could be unnormalized.
We use two candidate transition densities \(q_{X} (x \vert x_c, y_c)\) and \(q_{Y} (y \vert x_c, y_c)\), where \(x_c\), \(y_c\) stand for current values.
Algorithm (blockwise MH): given the current state \((X^{(s)}, Y^{(s)})\), generate
\[\begin{align*} (1)&\quad \text{draw } X^* \sim q_X(\cdot \mid X^{(s)}, Y^{(s)}); \text{ set } X^{(s+1)}=X^* \text{ with probability } \min\{1, \alpha_X\}, \text{ else } X^{(s+1)}=X^{(s)},\\ (2)&\quad \text{draw } Y^* \sim q_Y(\cdot \mid X^{(s+1)}, Y^{(s)}); \text{ set } Y^{(s+1)}=Y^* \text{ with probability } \min\{1, \alpha_Y\}, \text{ else } Y^{(s+1)}=Y^{(s)}, \end{align*}\]
where \(\alpha_X=\dfrac{f(X^*, Y^{(s)})}{f(X^{(s)}, Y^{(s)})}\cdot\dfrac{q_X(X^{(s)} \mid X^{*}, Y^{(s)})}{q_X(X^* \mid X^{(s)}, Y^{(s)})}\) and \(\alpha_Y\) is defined analogously.
Remark:
The conditional transition kernel for \(X\) (given \(Y\)) satisfies detailed balance with respect to \(f(x, y)\); the same is true for \(Y\) (given \(X\)). Thus each componentwise MH step individually satisfies detailed balance and is reversible with respect to \(f(x, y)\).
However, the detailed balance condition does not hold for the composition: when the two steps are composed (first update \(X\), then \(Y\)), the overall transition kernel \[\begin{align*} p\left((x, y),\left(x^*, y^*\right)\right)=p_X\left((x, y),\left(x^*, y\right)\right) p_Y\left(\left(x^*, y\right),\left(x^*, y^*\right)\right) \end{align*}\] generally does not satisfy the global detailed balance condition \(\displaystyle p \big( (x,y) \to (x^*,y^*) \big) f(x,y) = p \big( (x^*,y^*) \to (x,y) \big) f(x^*,y^*)\).
Proof: provided in class.
Recall blockwise Metropolis-Hastings: to sample from \(f(X,Y)\), we used two candidate transition densities \(q_X(x|X,Y)\) and \(q_Y(y|X,Y)\).
Suppose we take \(q_X(x|X^{(s)}, Y^{(s)})\) to be the full conditional density \(f_{X|Y}(x|Y^{(s)})\); then
\[\begin{align*} \alpha_X(X^*|X^{(s)},Y^{(s)})&=\frac{f(X^*,Y^{(s)})}{f(X^{(s)},Y^{(s)})}\cdot\frac{q_X(X^{(s)}|X^{*},Y^{(s)})}{q_X(X^*|X^{(s)},Y^{(s)})}\\ &=\frac{f_{X|Y}(X^*|Y^{(s)})}{f_{X|Y}(X^{(s)}|Y^{(s)})}\cdot\frac{f_{X|Y}(X^{(s)}|Y^{(s)})}{f_{X|Y}(X^*|Y^{(s)})}\\ &=1. \end{align*}\]
Definition: Gibbs sampling is the special case of (blockwise) MH that takes \(q_X(x|X_c, Y_c)\) to be \(f_{X|Y}(x|Y_c)\) and \(q_Y(y|X_c,Y_c)\) to be \(f_{Y|X}(y|X_c)\), which yields \(\alpha_X=\alpha_Y=1\). Under this sampling scheme, every proposal is automatically accepted.
Algorithm: Two-stage Gibbs sampling
Target: \(f(X,Y)\), possibly unnormalized.
Take some \(x^{(0)}\). For \(s=0,1,\cdots\), generate \[\begin{align*} (1)&\quad Y^{(s+1)} \sim f_{Y|X}(\cdot \mid X^{(s)})\\ (2)&\quad X^{(s+1)} \sim f_{X|Y}(\cdot \mid Y^{(s+1)}) \end{align*}\]
Algorithm: Multi-stage Gibbs sampling
Target: \(f(X_1,X_2,\cdots,X_d)\), possibly unnormalized.
Starting Values: \(X^{(0)}=(X^{(0)}_1,X^{(0)}_2,\cdots,X^{(0)}_d)\). Let \(f_{j}(X_j|X_{-j})\) denote the conditional density of \(X_j\) given all the rest components \(X_{-j}:=\{X_i: i\neq j\}\).
The algorithm generates \(X^{(s+1)}\) from \(X^{(s)}\) as follows:
\[\begin{align*} (1)&\quad X^{(s+1)}_1 \sim f_1(\cdot | X^{(s)}_2, \cdots,X^{(s)}_d)\\ (2)&\quad X^{(s+1)}_2 \sim f_2(\cdot | X^{(s+1)}_1, X^{(s)}_3, \cdots,X^{(s)}_d)\\ & \vdots \\ (d)&\quad X^{(s+1)}_d \sim f_d(\cdot | X^{(s+1)}_1, \cdots,X^{(s+1)}_{d-1}) \end{align*}\]
Set \(s \leftarrow s+1\) and repeat.
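In code, one sweep of this scheme has the following shape. This is a minimal sketch, where `sample_full_conditional` is a hypothetical user-supplied function that draws \(X_j\) from \(f_j(\cdot \mid X_{-j})\):

```python
def gibbs_sweep(x, sample_full_conditional):
    """One systematic-scan Gibbs sweep over the d components of x (a list).

    sample_full_conditional(j, x) is a hypothetical user-supplied function
    returning one draw of X_j from f_j(. | x_{-j}). Because x is updated
    in place, component j is drawn given the *new* values of components
    0..j-1 and the old values of components j+1..d-1, as in steps (1)-(d).
    """
    for j in range(len(x)):
        x[j] = sample_full_conditional(j, x)
    return x
```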
\(\textbf{Advantage:}\) For high-dimensional problems, every update reduces to a univariate draw, and every proposal is accepted.
Remark:
Example: consider the target density \[f(x)\propto\frac{e^{-x^2/20}}{(1+(z_1-x)^2)(1+(z_2-x)^2)},\quad z_1=-4.3,\ z_2=5.2.\]
Note that \(\displaystyle \frac{1}{1+(z_i-x)^2}=\int_0^\infty e^{-w_i(1+(z_i-x)^2)}\,dw_i\), so we can write \[f(x,w_1,w_2)\propto e^{-x^2/20}\prod_{i=1}^2 e^{-w_i(1+(z_i-x)^2)},\] and \(f(x)\) is just the marginal p.d.f. of \(f(x,w_1,w_2)\).
Gibbs sampling on the augmented vector \((x, w)\), with \(w=(w_1, w_2)\):
Here \[f_{X|W}(x|w_1,w_2) \overset{x} \propto e^{-\big(\sum w_i\big)x^2+2x\sum w_iz_i} e^{-x^2/20}\sim N \Big(\frac{\sum w_iz_i}{\sum w_i+1/20},\frac{1}{2\big(\sum w_i+1/20\big)} \Big),\] \[f_{W|X}(w_1,w_2|x) \overset{w} \propto e^{-w_1(1+(z_1-x)^2)} e^{-w_2(1+(z_2-x)^2)}.\] Note from the factorization above, \(w_1,w_2\) are independent given \(x\), therefore, \[\begin{align*} W_1|X &\sim \mbox{exponential}(1+(z_1-X)^2),\\ W_2|X & \sim \mbox{exponential}(1+(z_2-X)^2). \end{align*}\]
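As a concrete illustration, here is a minimal sketch of this two-block Gibbs sampler in Python (the seed, chain length, and variable names are our choices; note that NumPy's exponential is parameterized by the scale, the reciprocal of the rate):

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([-4.3, 5.2])      # z_1, z_2
n_iter = 10_000

x = 0.0
xs = np.empty(n_iter)
for s in range(n_iter):
    # W_i | X = x ~ exponential(rate = 1 + (z_i - x)^2), i = 1, 2
    w = rng.exponential(scale=1.0 / (1.0 + (z - x) ** 2))
    # X | W = w ~ N( sum(w_i z_i)/(sum(w_i) + 1/20), 1/(2(sum(w_i) + 1/20)) )
    prec = w.sum() + 1.0 / 20.0
    x = rng.normal((w * z).sum() / prec, np.sqrt(1.0 / (2.0 * prec)))
    xs[s] = x

# after burn-in, xs holds approximate draws from the marginal f(x)
```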
The multivariate normal distribution of a \(d-\)dimensional random vector \(\mathbf{X}=(X_1, \cdots, X_d)^{\top}\in\mathbb{R}^d\) can be written in the notation \(\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{\Sigma}).\) When the symmetric covariance matrix \(\mathbf{\Sigma}\) is positive definite, then the multivariate normal distribution is non-degenerate, and the distribution has density function \[\begin{align*} f_{\mathbf{X}}(\mathbf{x} \vert \boldsymbol\mu, \mathbf{\Sigma}) &= \frac{1}{(2 \pi)^{\frac{d}{2}}} \big( \det(\mathbf{\Sigma}) \big)^{-\frac{1}{2}} \exp \big(-\frac{1}{2} (\mathbf{x}-\boldsymbol\mu)^{\top} \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol\mu) \big)\\ &\overset{\mathbf{x}} \propto \exp(-\frac{1}{2} \mathbf{x}^{\top} \mathbf{\Sigma}^{-1} \mathbf{x} + \mathbf{x}^{\top} \mathbf{\Sigma}^{-1} \boldsymbol\mu) \end{align*}\] where \(\mathbf{x}=(x_1, \cdots, x_d)^{\top}\).
For comparison, the density function of (univariate) normal distribution \(X\sim N(\mu, \sigma^2)\), \[ f_X(x \vert \mu, \sigma^2)=\frac{1}{\sqrt{2\pi}\sigma} \exp \big(-\frac{1}{2\sigma^2}(x-\mu)^2 \big) \overset{x} \propto \exp\big(-\frac{1}{2\sigma^2}(x^2-2x\mu+\mu^2)\big) \overset{x} \propto \exp\big(-\frac{1}{2} x^2 (\sigma^2)^{-1}+ x(\sigma^2)^{-1}\mu\big). \]
Bivariate normal \[\displaystyle (X,Y)\sim N \left(\begin{pmatrix}\mu_X\\ \mu_Y\end{pmatrix},\begin{pmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY}&\sigma_Y^2\end{pmatrix} \right),\] Let \(\rho=\frac{\sigma_{XY}}{\sigma_{X}\sigma_Y}\) be the correlation. It is a fact that \[f_{X|Y}(x|y)\sim N(\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y), \sigma_X^2(1-\rho^2)),\] \[f_{Y|X}(y|x)\sim N(\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \sigma_Y^2(1-\rho^2)).\]
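These two conditionals yield an immediate Gibbs sampler; a minimal sketch (function name and defaults are ours):

```python
import numpy as np

def gibbs_bvn(mu_x, mu_y, sd_x, sd_y, rho, n_iter=10_000, seed=0):
    """Gibbs sampler for a bivariate normal via the two full conditionals."""
    rng = np.random.default_rng(seed)
    x, y = mu_x, mu_y
    out = np.empty((n_iter, 2))
    for s in range(n_iter):
        # X | Y = y ~ N(mu_x + rho (sd_x/sd_y)(y - mu_y), sd_x^2 (1 - rho^2))
        x = rng.normal(mu_x + rho * sd_x / sd_y * (y - mu_y),
                       sd_x * np.sqrt(1.0 - rho ** 2))
        # Y | X = x ~ N(mu_y + rho (sd_y/sd_x)(x - mu_x), sd_y^2 (1 - rho^2))
        y = rng.normal(mu_y + rho * sd_y / sd_x * (x - mu_x),
                       sd_y * np.sqrt(1.0 - rho ** 2))
        out[s] = (x, y)
    return out
```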
The Ising model is a mathematical model of ferromagnetism in statistical mechanics. It comprises a lattice of spins, where each spin can be either up (+1) or down (-1). These spins represent magnetic dipole moments of atomic “spins”. The key interest in the Ising model is understanding how these spins interact under varying temperatures and how these interactions lead to macroscopic phenomena like phase transitions.
In image processing, a 2D lattice can represent an image where each lattice site corresponds to a pixel: \[\begin{align*} I=\{1,2, \ldots, L\} \times\{1,2, \ldots, L\} \end{align*}\] The value at each site (spin) can denote various image properties, like color or intensity. Here we consider binary images, where the state space is given by \(\mathcal{X}=\{-1, 1\}^{I}\).
The Ising model describes the energy of a system using an energy function:
\[\begin{equation*} H(x) = -J \sum_{\substack{i \sim j \\ i, j \in I}} x_i x_j - h\sum_{i \in I} x_i, \quad x \in \mathcal{X}, \end{equation*}\]
where \(i \sim j\) indicates that \(x_i\) and \(x_j\) are the spins of adjacent lattice sites. For simplicity we consider only nearest neighbours: a pixel \(i=\left(i_1, i_2\right)\) in the interior of the grid has four neighbours \(\left(i_1+1, i_2\right)\), \(\left(i_1, i_2+1\right)\), \(\left(i_1-1, i_2\right)\), and \(\left(i_1, i_2-1\right)\); a pixel on an edge of the grid (but not a corner) has three neighbours, and a corner pixel has two.
The probability measure (via the Boltzmann-Gibbs transformation) for the system on \(\mathcal{X}\) can be written as \[\begin{align*} \pi(x)=\frac{1}{Z} \exp (-H(x)), \end{align*}\] where \(Z\) is the normalizing constant.
It can be shown that, for \(h=0\), \[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}\right) = \frac{1}{1+\exp \left(-2J \eta_i\right)}, \end{align*}\] where \(\eta_i := \sum_{j \sim i} x_j\) is the sum of the spins neighbouring site \(i\); we take \(h=0\) in what follows.
The following figures give some typical samples as \(J\) varies.
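Such samples can be generated by repeatedly sweeping the lattice with the full conditional above. A minimal sketch, assuming free boundary conditions and \(h=0\) (names and parameter values are ours):

```python
import numpy as np

def ising_gibbs_sweep(x, J, rng):
    """One systematic Gibbs sweep over an L x L lattice of +/-1 spins (h = 0)."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            # eta: sum of nearest-neighbour spins (free boundary conditions)
            eta = 0
            if i > 0:
                eta += x[i - 1, j]
            if i < L - 1:
                eta += x[i + 1, j]
            if j > 0:
                eta += x[i, j - 1]
            if j < L - 1:
                eta += x[i, j + 1]
            # P(X_i = +1 | rest) = 1 / (1 + exp(-2 J eta))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * eta))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# usage: start from random spins and sweep repeatedly
rng = np.random.default_rng(0)
spins = rng.choice(np.array([-1, 1]), size=(64, 64))
for _ in range(200):
    ising_gibbs_sweep(spins, J=0.4, rng=rng)
```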
Bayesian binary denoising
Assume that a collection of original binary images has a distribution that is described by the Ising model \(\pi(x)\) above. We further assume that the observed images are noisy versions of the original images.
To be concrete, we assume that for each pixel \(X_i\) we actually observe \(Y_i=X_i+\epsilon_i\), where the \(\epsilon_i\) are independently distributed \(N(0, \sigma^2)\).
Using Bayes’ rule, \[\begin{align*} p_{X \mid Y}(x \mid y)=\frac{p_{Y \mid X}(y \mid x) p_X(x)}{p_Y(y)}=\frac{1}{Z_y} \exp \left(-H(y, x)\right), \end{align*}\] where (since \(x_i^2=1\), the Gaussian likelihood contributes \(\exp\big(\sum_i y_i x_i/\sigma^2\big)\) up to factors free of \(x\), and we take \(h=0\)) \[\begin{align*} H(y, x)=-\frac{1}{\sigma^2} \sum_{i \in I} y_i x_i+H(x)=-\frac{1}{\sigma^2} \sum_{i \in I} y_i x_i-J \sum_{i \sim j} x_i x_j. \end{align*}\] The full conditional distribution of \(X_i\) can be derived similarly: \[\begin{align*} P\left(X_i=+1 \mid X_{-i}=\boldsymbol{x}_{-i}, Y=\boldsymbol{y} \right) = \frac{1}{1+\exp \left(-2(J \eta_i+ y_i/\sigma^2)\right)}, \end{align*}\] where \(\eta_i := \sum_{j \sim i} x_j\).
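The denoising sampler is the same sweep as before with the extra data term \(y_i/\sigma^2\); a minimal sketch (initializing \(x\) at \(\operatorname{sign}(y)\) is our choice). Averaging the events \(\{x_{ij}=+1\}\) over sweeps estimates the posterior probabilities used for reconstruction.

```python
import numpy as np

def denoise_gibbs_sweep(x, y, J, sigma2, rng):
    """One Gibbs sweep targeting the posterior p(x | y) for binary denoising."""
    L = x.shape[0]
    for i in range(L):
        for j in range(L):
            eta = 0  # sum of nearest-neighbour spins (free boundary conditions)
            if i > 0:
                eta += x[i - 1, j]
            if i < L - 1:
                eta += x[i + 1, j]
            if j > 0:
                eta += x[i, j - 1]
            if j < L - 1:
                eta += x[i, j + 1]
            # P(X_i = +1 | rest, y) = 1 / (1 + exp(-2 (J eta + y_i / sigma^2)))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (J * eta + y[i, j] / sigma2)))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# usage sketch: x = np.sign(y) as the initial state, then sweep and average
```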
A sample real image, the corresponding noisy image, and a reconstructed image based on posterior probabilities.
Slice sampling is a technique that combines the idea of the Fundamental Theorem of Simulation with Gibbs sampling.
Recall from the Fundamental Theorem of Simulation, \[ f(x)=\int_{0}^{f(x)}1du. \] So \(f(x)\) is a marginal density (in \(X\)) of the joint uniform distribution on the set \[ \{(x, u): 0<u<f(x), x\in\mathcal{X}\}, \] with the p.d.f. \(f(x, u)=1_{\{(x, u): 0<u<f(x), x\in \mathcal{X}\}}(x, u).\)
To obtain a sample of \((X, U)\) from this distribution, we can run a Markov chain whose stationary distribution is this joint uniform distribution, which can be done using Gibbs sampling, noting that \[\begin{align*} &f_{X|U}(x|u) \sim \operatorname{unif}\{x: u< f(x) \}, \\ &f_{U|X}(u|x) \sim \operatorname{unif}\{u: 0< u< f(x) \}. \end{align*}\]
Slice sampling:
At iteration \(s\), simulate \[\begin{align*} (1)&\quad U^{(s+1)} \sim \operatorname{unif}\big(0, f(X^{(s)})\big),\\ (2)&\quad X^{(s+1)} \sim \operatorname{unif}\big(A^{(s+1)}\big), \text{ where } A^{(s+1)}:=\{x: f(x) \geq U^{(s+1)}\}. \end{align*}\]
Conceptually, Step 2 can indeed be understood as an accept-reject procedure: propose \(x\) uniformly from any set containing the slice \(A^{(s+1)}\) and accept the first proposal with \(f(x) \geq U^{(s+1)}\); the accepted draw is then uniform on \(A^{(s+1)}\).
General form:
Suppose \(f(x) \propto \prod_{i=1}^k f_i(x)\), where each \(f_i(x)>0\) is not necessarily a density. Note that \(f_i(x) = \int 1\{0 \leq w_i \leq f_i(x) \}\,dw_i\). Thus \(f\) is the marginal distribution (in \(x\)) of the joint distribution \[ f(x, w_1, \ldots, w_k) \overset{x, w_1, \ldots, w_k}{\propto} \prod_{i=1}^k 1\{0\leq w_i \leq f_i(x) \}, \] noting \[ f(x) \overset{x}{\propto} \int \cdots \int \prod_{i=1}^k 1\{0 \leq w_i \leq f_i(x)\}\,dw_{1}\cdots dw_{k}= \prod_{i}f_i(x). \]
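For instance, the earlier target \(f(x)\propto e^{-x^2/20}\big[(1+(z_1-x)^2)(1+(z_2-x)^2)\big]^{-1}\) factors naturally with \(k=3\):
\[ f_1(x)=e^{-x^2/20}, \quad f_2(x)=\frac{1}{1+(z_1-x)^2}, \quad f_3(x)=\frac{1}{1+(z_2-x)^2}, \]
and each individual slice is an explicit interval, e.g. \(\{x: f_1(x) \geq w_1\} = \{x: x^2 \leq -20\log w_1\}\) and \(\{x: f_2(x) \geq w_2\} = \{x: (z_1-x)^2 \leq 1/w_2-1\}\), so the set \(A^{(s+1)}\) below is an intersection of three intervals.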
Slice sampling:
At iteration \(s\), simulate
\(W_1^{(s+1)} \sim \operatorname{unif}(0, f_1(X^{(s)}))\),
\(W_2^{(s+1)} \sim \operatorname{unif}(0, f_2(X^{(s)}))\),
\(\vdots\)
\(W_{k}^{(s+1)} \sim \operatorname{unif}(0, f_k(X^{(s)}))\),
\(X^{(s+1)} \sim \operatorname{unif}(A^{(s+1)})\), where \(A^{(s+1)}:=\{x: f_i(x) \geq W_i^{(s+1)}, \forall i=1, \ldots, k\}.\)
A major challenge in slice sampling is to identify or approximate the slice \(A^{(s+1)}\). For a univariate target \(f\), a common strategy is the stepping-out and shrinkage procedure: start from an interval \[\begin{align*} \left(X^{(s)}-L,\; X^{(s)}+R\right) \end{align*}\] containing \(X^{(s)}\), and expand each endpoint outward in fixed increments until it falls outside the slice (stepping-out). Then repeatedly draw a candidate uniformly from the interval; if the candidate falls outside the slice, move the endpoint on the candidate's side of \(X^{(s)}\) to the candidate and draw again. This iterative “shrinkage” ensures that the accepted point always lies within the slice.
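The stepping-out and shrinkage update can be sketched as follows, working on the log scale for numerical stability (the function name and the initial width `w` are our choices; `logf` is the log of the possibly unnormalized target):

```python
import numpy as np

def slice_sample_step(x, logf, w=1.0, rng=None):
    """One univariate slice-sampling update using stepping-out and shrinkage."""
    rng = rng or np.random.default_rng()
    # 1. Auxiliary level: log U = log f(x) + log(unif(0,1)) = log f(x) - Exp(1).
    log_u = logf(x) - rng.exponential()
    # 2. Stepping-out: randomly position an interval of width w around x,
    #    then expand each endpoint in steps of w until it leaves the slice.
    left = x - w * rng.random()
    right = left + w
    while logf(left) > log_u:
        left -= w
    while logf(right) > log_u:
        right += w
    # 3. Shrinkage: draw uniformly from (left, right); if the draw is outside
    #    the slice, move the endpoint on its side of x inward and redraw.
    while True:
        x_new = rng.uniform(left, right)
        if logf(x_new) > log_u:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new
```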
Remark 1: The above procedure generalizes to multivariate distributions, though defining and sampling from \(A^{(s+1)}\) becomes more challenging.
Remark 2: The algorithm remains valid if we use \(\tilde{f}\), the kernel (unnormalized version) of the p.d.f., in place of \(f\).
Remark 3: The Gibbs sampler samples directly from the original space based on full conditionals and therefore requires closed-form distributions or invertible CDFs (unless combined with some other sampling methods). Slice sampling can be viewed as a Gibbs sampler operating on an augmented space. With stepping-out and shrinkage procedures, it often mixes better than Gibbs sampling on the original state space. Furthermore, this works even when the conditional distribution (in the original space) does not have a closed form.
Provided and discussed in lab.