Statistical Simulation, Wei Li. Updated: 2025-10-22

Markov Chain with continuous state space

We consider Markov chains on a continuous state space, that is, the state space \(S\) is an uncountable set.

Transition probability from each possible state \(x_0\) to each possible set of states \(A\): \[\begin{align*} &P(x_0,A):=P(X^{(n+1)} \in A | X^{(n)}=x_0),\\ \text{CDF}:\ &F(x_1|x_0):=P(X^{(n+1)} \leq x_1 | X^{(n)}=x_0),\\ \text{One-step transition density function}:\ &f(x_0, x_1):= \frac{\partial F(x_1|x_0)}{\partial x_1}= p(X^{(n+1)}=x_1|X^{(n)}=x_0). \end{align*}\]
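
As a concrete illustration (an example chosen here, not taken from the notes), consider the Gaussian AR(1) chain \(X^{(n+1)}=\rho X^{(n)}+\varepsilon^{(n)}\) with \(\varepsilon^{(n)}\sim N(0,\sigma^2)\) independent of the past. Conditional on \(X^{(n)}=x_0\), the next state is \(N(\rho x_0,\sigma^2)\), so the one-step transition density is \[f(x_0,x_1)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x_1-\rho x_0)^2}{2\sigma^2}\Big).\] This example is used again below to check the Chapman-Kolmogorov equation numerically.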

Transition density

Define one-step transition density:

\(f^{[1]}(x_n,x_{n+1}):=p(x_{n+1}|x_n)=p(X^{(n+1)}=x_{n+1}|X^{(n)}=x_n).\)

Two-step transition density:

\[\begin{align*} f^{[2]}(x_n,x_{n+2}):=p(X^{(n+2)}=x_{n+2}|X^{(n)}=x_n)&= \int p(X^{(n+2)}=x_{n+2}|X^{(n+1)}=x_{n+1},X^{(n)}=x_n)\, p(X^{(n+1)}=x_{n+1}|X^{(n)}=x_n)d x_{n+1}\\ &=\int p(X^{(n+2)}=x_{n+2}|X^{(n+1)}=x_{n+1})\, p(X^{(n+1)}=x_{n+1}|X^{(n)}=x_n)d x_{n+1}\\ &=\int f^{[1]}(x_n,x_{n+1}) f^{[1]}(x_{n+1},x_{n+2})d x_{n+1}. \end{align*}\]

Three-step transition density:

\[\begin{align*} f^{[3]}(x_n,x_{n+3}):=p(X^{(n+3)}=x_{n+3}|X^{(n)}=x_n) &= \int p(X^{(n+3)}=x_{n+3}|X^{(n+2)}=x_{n+2}) \cdot p(X^{(n+2)}=x_{n+2}|X^{(n)}=x_n)d x_{n+2}\\ &=\int f^{[1]}(x_{n+2},x_{n+3}) f^{[2]}(x_n,x_{n+2})d x_{n+2}\\ &=\int \int f^{[1]}(x_{n+2},x_{n+3}) f^{[1]}(x_{n+1},x_{n+2}) f^{[1]}(x_n,x_{n+1})d x_{n+1} d x_{n+2}.\\ \end{align*}\]

Arguing as above, we can obtain the m-step transition density:

\(f^{[m]}(x_n,x_{n+m})= \int \cdots \int \prod_{k=n+1}^{n+m}f^{[1]}(x_{k-1},x_k)dx_{n+1} \cdots dx_{n+m-1}.\)

The corresponding m-step transition probability can be written as \(P(X^{(n+m)} \in A| X^{(n)}=x_n)= \int_A f^{[m]}(x_n,x_{n+m})dx_{n+m}.\) More generally, we have the Chapman-Kolmogorov equation: \[ f^{[n+m]}(x_0, x_{n+m})=\int f^{[m]}(x_0, x_{m})f^{[n]}(x_m, x_{n+m})dx_{m}. \]
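
To make the composition rule concrete, here is a small Python check (an illustration of my own, using the Gaussian AR(1) example introduced above with assumed values of \(\rho\) and \(\sigma\)): the two-step density obtained by numerically integrating the Chapman-Kolmogorov formula is compared with the closed form \(f^{[2]}(x_0,x_2)=N\big(x_2;\ \rho^2 x_0,\ \sigma^2(1+\rho^2)\big)\).

```python
import numpy as np
from scipy import integrate, stats

rho, sigma = 0.7, 1.0          # AR(1) parameters (illustrative values)

def f1(x_prev, x_next):
    """One-step transition density f^[1](x_prev, x_next) of the AR(1) chain."""
    return stats.norm.pdf(x_next, loc=rho * x_prev, scale=sigma)

def f2_numeric(x0, x2):
    """Two-step density via the Chapman-Kolmogorov integral over the middle state."""
    integrand = lambda x1: f1(x0, x1) * f1(x1, x2)
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

def f2_exact(x0, x2):
    """Closed form: X^(n+2) | X^(n)=x0 ~ N(rho^2 x0, sigma^2 (1 + rho^2))."""
    return stats.norm.pdf(x2, loc=rho**2 * x0, scale=sigma * np.sqrt(1 + rho**2))

x0, x2 = 0.5, -1.2
print(f2_numeric(x0, x2), f2_exact(x0, x2))   # the two values should agree closely
```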

Some important properties

If a regular (ergodic) M.C. possesses a limiting transition density independent of the initial state, that is, \[\lim_{n \to \infty}f^{[n]}(x,v)=g(v),\] then \(g(v)\) is called the “stationary/steady-state density” (long-run probability density) of the M.C., and it is a solution to the steady-state equation:

\[g(v)=\int_{-\infty}^{+\infty}g(w)f(w,v)dw \qquad (*)\] \[(\text{recall the discrete-space analogue}: \pi_j=\sum_{i=1}^{K} \pi_i p_{ij} \qquad \forall j=1,2,\cdots ,K),\] where \(g(w)\) may be viewed as the current (starting) density and \(f(w,v)\) is the transition density function. Note that an equivalent expression to \((*)\) is \[\int_A g(v) dv=\int_{-\infty}^{+\infty}g(w)P(w,A)dw \text{, for all sets } A, \text{ where } P(w, A):=P(X^{(n+1)} \in A|X^{(n)}=w).\]

A sufficient condition for \(g(x)\) to be a steady-state density is the “detailed balance condition” given by \[g(x)f(x,v)=g(v)f(v,x) \qquad \forall x,v \qquad (**),\] where \(f(x,v)\) is the one-step transition density of the M.C. As in the discrete case, any Markov chain satisfying the “detailed balance condition” \((**)\) has \(g\) as its steady-state density, i.e., \((*)\) holds.
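
To see why \((**)\) implies \((*)\), integrate both sides of \((**)\) over \(x\): \[\int_{-\infty}^{+\infty}g(x)f(x,v)dx=g(v)\int_{-\infty}^{+\infty}f(v,x)dx=g(v),\] since \(f(v,\cdot)\) is a probability density and integrates to one; this is exactly the steady-state equation \((*)\) (with \(w\) relabeled as \(x\)).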

Ergodic Theorem

If \((X^{(1)},X^{(2)},\cdots)\) is an ergodic M.C. whose steady-state density is given by \(g\), then as \(n \to \infty\), \[P(X^{(n)} \in A) \to P(X^{(\infty)} \in A) = \int_A g(x)dx.\] In addition, \[\frac{1}{T}\sum_{n=1}^{T}h(X^{(n)}) \to E[h(X^{(\infty)})]=\int h(x)g(x)dx, \text{ as } T\to \infty.\]
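
As a small illustration of the second statement (again using the AR(1) chain from above, whose steady-state density is \(N\big(0,\ \sigma^2/(1-\rho^2)\big)\)), the following Python sketch compares the time average of \(h(x)=x^2\) with \(E[h(X^{(\infty)})]=\sigma^2/(1-\rho^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, T = 0.7, 1.0, 200_000   # illustrative parameters and chain length

# simulate the AR(1) chain X^(n+1) = rho * X^(n) + eps^(n)
x = np.empty(T)
x[0] = 0.0
eps = rng.normal(0.0, sigma, size=T)
for n in range(T - 1):
    x[n + 1] = rho * x[n] + eps[n]

# ergodic average of h(x) = x^2 vs. the stationary second moment sigma^2 / (1 - rho^2)
print(np.mean(x**2), sigma**2 / (1 - rho**2))
```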

Metropolis Algorithm in Continuous Space

We consider a continuous state space on which we would like to sample from a target density \(f(\cdot)\), which may be available only through an unnormalized kernel \(\tilde f\). Our goal is to find a transition density \(p(\cdot |\cdot)\) satisfying the ‘detailed balance condition’, i.e., for a proposed \(x^*\), \[ p(x^* |x)f(x) =p(x|x^*)f(x^*).\]

If we can find such a \(p\), then the long-run distribution of the ergodic Markov chain with transition density \(p(\cdot | \cdot)\) is \(f(\cdot)\).

Method

Choose an arbitrary transition density \(q(\cdot | \cdot)\), without requiring that the detailed balance condition hold. Then apply an acceptance probability to proposed samples, \[\alpha(x^*|x) = \min \Big(1, \frac{q(x|x^*)f(x^*)}{q(x^*|x)f(x)} \Big).\] Note that the ratio \(f(x^*)/f(x)\) can equally be computed from the kernel \(\tilde f\), since any normalizing constant cancels.

The purpose of \(\alpha\) is to regulate the flow of probability between the current state and the proposed state. Given that our current state is \(x\), the probability of a particular state \(x^*\) being proposed and accepted is \(\alpha (x^*|x) q(x^*|x)\). Similarly, the probability that some proposal is accepted and a move occurs is \(a(x) :=\int \alpha (x^*|x) q(x^*|x)dx^*\). A proposal is not accepted (hence no move) with probability \(r(x) := 1- a(x)\), in which case we set \(x^* = x\). The (effective) transition density is then

\[p(x^*|x) = \alpha (x^*|x) q(x^*|x) + r(x)\delta_x(x^*),\]

where \(\delta_x(x^*) = 1\) if \(x^*=x\) and 0 otherwise, i.e., a point mass at \(x\). Since we are in a continuous state space, the corresponding transition probability of a set \(A\) is

\[P(x, A)=\int_A p(x^*|x)dx^* = \int_A\alpha (x^*|x) q(x^*|x)dx^* + r(x)\delta_A(x),\]

where \(\delta_A(x) = 1\) if \(x\in A\) and 0 otherwise.

We can show that the (effective) transition density indeed satisfies the detailed balance condition.

Proof: provided in class.
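
For reference, here is a brief sketch using only the definitions above: for \(x^* \neq x\) the point-mass term vanishes, and \[p(x^*|x)f(x) = \alpha(x^*|x)q(x^*|x)f(x) = \min \big(q(x^*|x)f(x),\ q(x|x^*)f(x^*)\big),\] which is symmetric in \(x\) and \(x^*\) and hence equals \(p(x|x^*)f(x^*)\); for \(x^*=x\) the detailed balance condition holds trivially.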

Metropolis-Hastings algorithm

  1. Given a target \(f(\cdot)\) and a chosen \(q(\cdot|\cdot)\), generate \(x^*\) from \(q(\cdot| x^{(s)})\).
  2. Compute \[\alpha(x^*|x^{(s)}) = \min \left(1, \frac{q(x^{(s)}|x^*)f(x^*)}{q(x^*|x^{(s)})f(x^{(s)})} \right).\]
  3. Generate \(u\sim \mbox{unif}(0,1)\):
    1. if \(u\leq \alpha(x^*|x^{(s)})\), set \(x^{(s+1)} = x^*\);
    2. if \(u> \alpha(x^*|x^{(s)})\), set \(x^{(s+1)} = x^{(s)}\).
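
The three steps above translate directly into a short Python sketch (my own illustration, not the lab code): the target here is assumed to be the standard normal density, supplied through its log, and the proposal is a Gaussian random walk \(N(x^{(s)},\tau^2)\); since this proposal is symmetric, the \(q\)-ratio in step 2 cancels (see choice 1 under “Choice of Transition Density” below).

```python
import numpy as np
from scipy import stats

def metropolis_hastings(log_f, x0, n_iter, tau=1.0, rng=None):
    """MH sampler with a symmetric Gaussian random-walk proposal of scale tau.

    log_f: log of the target density (an unnormalized kernel also works).
    """
    rng = rng or np.random.default_rng()
    x = np.empty(n_iter + 1)
    x[0] = x0
    for s in range(n_iter):
        x_star = rng.normal(x[s], tau)                      # step 1: propose x* ~ q(. | x^(s))
        log_alpha = min(0.0, log_f(x_star) - log_f(x[s]))   # step 2 (q terms cancel by symmetry)
        u = rng.uniform()                                   # step 3: accept or reject
        x[s + 1] = x_star if u <= np.exp(log_alpha) else x[s]
    return x

# example run: target f = N(0, 1); after burn-in the draws have mean ~0 and sd ~1
chain = metropolis_hastings(stats.norm.logpdf, x0=5.0, n_iter=20_000, tau=1.0)
print(chain[2000:].mean(), chain[2000:].std())
```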

Remark:

A necessary condition for the algorithm to work is \(\mbox{supp}(f)\subseteq \mbox{supp}(q(\cdot|x))\) for all \(x\in \mbox{supp}(f)\). A minimal necessary condition would be \(\mbox{supp}(f)\subseteq \cup_{x\in \mbox{supp}(f)}[\mbox{supp}(q(\cdot|x))].\)

Choice of Transition Density

  1. Classical Choice (Random Walk): Choose \(q(x^*|x) = \tilde q (x^*-x)\), where \(\tilde q\) is symmetric about 0. For example, \(x^* = x^{(s)}+\epsilon^{(s)}\) with \(\epsilon^{(s)}\sim \mbox{unif}(-1,1)\) or \(\epsilon^{(s)}\sim N(0,\tau^2)\), so that \(x^*\sim \mbox{unif}(x^{(s)}-1,x^{(s)}+1)\) or \(x^*\sim N(x^{(s)},\tau^2)\), respectively. Due to the symmetry of \(\tilde q\) about 0, \(q(x^*|x) =q(x|x^*)\), so the acceptance probability simplifies to \(\alpha(x^*|x) = \min (1, \frac{f(x^*)}{f(x)})\).

  2. Independent Metropolis-Hastings: Choose \(x^*\sim q(\cdot)=q(\cdot|x)\), independent of the current value \(x^{(s)}\). For example, take \(q\) to be the \(N(0,1)\) density throughout.

  3. Metropolis-Adjusted Langevin Algorithm (MALA)

MALA originates from the Langevin diffusion, a continuous-time stochastic process whose stationary distribution is exactly \(f(x)\): \[\begin{align*} d X_t=\frac{1}{2} \nabla \log f\left(X_t\right) d t+d W_t \end{align*}\]

where \(W_t\) is a standard Wiener process (Brownian motion). The drift term \(\frac{1}{2} \nabla \log f(x)\) drives the process toward regions of high probability density. The noise term \(d W_t\) ensures exploration of the full space. This diffusion has \(f(x)\) as its invariant (stationary) distribution under mild regularity conditions.

To simulate this diffusion in discrete steps of size \(\delta\), the so-called Euler-Maruyama approximation is applied: \[\begin{align*} x^*=x^{(s)}+\frac{\delta^2}{2} \nabla \log f\left(x^{(s)}\right)+\delta \varepsilon^{(s)}, \quad \varepsilon^{(s)} \sim N(0, I). \end{align*}\] However, because this is a numerical approximation, it does not exactly preserve \(f(x)\). To correct this bias, we add a Metropolis-Hastings acceptance step, yielding MALA (see the sketch after the notes below): \[\begin{align*} \alpha\left(x^* \mid x\right)=\min \left(1, \frac{f\left(x^*\right) q\left(x \mid x^*\right)}{f(x) q\left(x^* \mid x\right)}\right), \end{align*}\] where \[\begin{align*} q\left(x' \mid x\right)={N}\Big(x' ; x+\frac{\delta^2}{2} \nabla \log f(x), \delta^2 I\Big). \end{align*}\]

  • Note 1: The optimal scaling of the step size satisfies roughly \(\delta \propto d^{-1 / 6}\), where \(d\) is the dimension of the parameter space. In practice, one needs to experiment with \(\delta\).
  • Note 2: Langevin Monte Carlo (LMC) is a special case of Hamiltonian Monte Carlo (HMC).
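
Below is a minimal Python sketch of MALA (my own illustration; the target, its gradient, the step size \(\delta\), and the chain length are assumed choices), following the proposal and acceptance formulas above.

```python
import numpy as np

def mala(log_f, grad_log_f, x0, n_iter, delta=0.5, rng=None):
    """Metropolis-Adjusted Langevin Algorithm for a d-dimensional target."""
    rng = rng or np.random.default_rng()
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    d = x.size
    samples = np.empty((n_iter, d))

    def log_q(x_to, x_from):
        # log density (up to a constant) of the proposal N(x_from + (delta^2/2) grad log f(x_from), delta^2 I)
        mean = x_from + 0.5 * delta**2 * grad_log_f(x_from)
        return -np.sum((x_to - mean) ** 2) / (2 * delta**2)

    for s in range(n_iter):
        # Euler-Maruyama proposal
        x_star = x + 0.5 * delta**2 * grad_log_f(x) + delta * rng.standard_normal(d)
        # Metropolis-Hastings correction
        log_alpha = min(0.0, log_f(x_star) - log_f(x) + log_q(x, x_star) - log_q(x_star, x))
        if rng.uniform() <= np.exp(log_alpha):
            x = x_star
        samples[s] = x
    return samples

# example: standard normal target in 2 dimensions
log_f = lambda z: -0.5 * np.sum(z**2)
grad_log_f = lambda z: -z
out = mala(log_f, grad_log_f, x0=np.zeros(2), n_iter=10_000, delta=0.8)
print(out[1000:].mean(axis=0), out[1000:].std(axis=0))   # roughly [0, 0] and [1, 1]
```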

Practicalities

Proposals should not be too bold (proposal standard deviation not too large), but they should be large enough to be consequential (standard deviation not too small). An acceptance rate between \(30\%\) and \(60\%\) is ideal.

  1. Burn-in

    1. Run the algorithm until some iteration \(B\) at which the Markov chain appears to reach stationarity.
    2. Run the algorithm for \(B'\) more iterations and keep \(\{x^{(B+1)},\ldots,x^{(B+B')} \}.\)
    3. Use empirical distribution of \(\{x^{(B+1)},\ldots,x^{(B+B')} \}\) to approximate the density \(f\).
  2. Thinning: After burn-in discards the first \(B\) observations, thin the chain by keeping only every \(k\)th draw.

  3. Trace Plot: The plot of the values generated by the Metropolis algorithm against the iteration number. Check:

    1. The chain does not become trapped in a specific region of the state space.
    2. The chain moves reasonably quickly across the support, suggesting good mixing.
  4. “ACF” Autocorrelation Function: Calculate the “lag-\(j\) sample covariance” as \[\hat\gamma_j = \frac{1}{T}\sum_{t=j+1}^T(x^{(t)}-\bar x)(x^{(t-j)}-\bar x)\] and the “lag-\(j\) sample autocorrelation” as \(\hat{\rho}_j = \hat\gamma_j/\hat\gamma_0\). Plotting the autocorrelation \(\hat{\rho}_j\) against the lag index \(j\), relatively high values indicate a high degree of correlation between draws and therefore slow mixing.

  5. Gelman-Rubin Statistic \((\hat{R})\): Run multiple chains from overdispersed starting values and compare the between-chain and within-chain variances: \[\begin{align*} \hat{R}=\sqrt{\frac{\widehat{\operatorname{Var}}^{+}(x)}{W}}, \end{align*}\] where \(\widehat{\operatorname{Var}}^{+}(x)\) is the pooled variance estimate (combining between- and within-chain variability) and \(W\) is the within-chain variance. Rule of thumb: declare convergence if \(\hat{R}<1.01\) (or, more loosely, \(1.05\)). A computational sketch of these diagnostics follows this list.
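
The following Python sketch (my own illustration; the \(\hat R\) formula below uses the common between/within-chain construction, which I assume matches the pooled-variance definition above) implements burn-in, thinning, the lag-\(j\) autocorrelation, and the Gelman-Rubin statistic for chains stored as rows of an array.

```python
import numpy as np

def acf(x, max_lag=50):
    """Lag-j sample autocorrelations rho_hat_j = gamma_hat_j / gamma_hat_0, j = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    T = x.size
    gamma = np.array([np.sum(xc[j:] * xc[:T - j]) / T for j in range(max_lag + 1)])
    return gamma / gamma[0]

def gelman_rubin(chains):
    """R_hat from an (m, T) array of m chains of length T (after burn-in)."""
    chains = np.asarray(chains, dtype=float)
    m, T = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()              # within-chain variance
    B = T * chain_means.var(ddof=1)                    # between-chain variance
    var_plus = (T - 1) / T * W + B / T                 # pooled variance estimate
    return np.sqrt(var_plus / W)

# usage on hypothetical chains: burn in the first B draws, thin by k, then diagnose
B, k = 1000, 5
rng = np.random.default_rng(1)
chains = rng.normal(size=(4, 20_000))                  # placeholder for 4 MCMC chains
kept = chains[:, B::k]                                 # burn-in and thinning
print(acf(kept[0], max_lag=5))
print(gelman_rubin(kept))
```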

Python Code Examples

Provided and discussed in lab.
