Updated: 2025-10-20. Statistical Simulation, Wei Li

Markov Chain

Consider a sequence of random variables \(\{ X^{(n)}: n=0, 1, \ldots \}\), each \(X^{(n)} \in S\) (the state space). We mainly discuss Markov chains (M.C.) on a discrete state space here, say \(S = \{1,2, \cdots, K\}\).

Definition: A Markov Chain is a sequence of random variables \(X^{(n)}\) such that \[\begin{align*} &P(X^{(n)}\in A_n | X^{(n-1)}\in A_{n-1},\cdots, X^{(0)}\in A_0)= P(X^{(n)}\in A_n | X^{(n-1)}\in A_{n-1}). \end{align*}\] That is, the probability distribution of \(X^{(n)}\) given the past variables depends only on \(X^{(n-1)}\). This is called a first-order Markov Chain.

Define the one-step transition kernel \[p^{[ n-1,n ]}_{ij} := P(X^{(n)}=j| X^{(n-1)}=i) \qquad \text{(transition kernel)}.\] We restrict ourselves to time-invariant M.C., that is, \(p^{[n-1, n]}_{ij}\) does not depend on the time reference point but only on how far apart the two time points are. So we write \(p_{ij}^{[1]}:=p^{[n, n+1]}_{ij}=p^{[n-1, n]}_{ij}=\cdots=p^{[0, 1]}_{ij}\), and similarly for multi-step kernels.

one-step transition kernel

\[p_{ij} \geq 0, \ \forall i=1,\cdots, K; \ \sum_{j=1}^K p_{ij} = 1.\]

m-step transition kernel

Similarly, define the m-step transition kernel \[p^{[m]}_{ij} := P(X^{(n+m)}=j \mid X^{(n)}=i),\] which does not depend on \(n\) for a time-invariant chain; write \(P^{[m]} = [p^{[m]}_{ij}]\) for the corresponding matrix.

Let’s consider the state distribution (occupation probability distribution) at each time:

Time 0: \[\alpha^{(0)} = (\alpha_1^{(0)}, \cdots,\alpha_K^{(0)}), \text{ where } \ P(X^{(0)}=i)=\alpha_i^{(0)}.\]

Time 1: \[\alpha^{(1)} = (\alpha_1^{(1)}, \cdots,\alpha_K^{(1)})= \alpha^{(0)}P.\] One can show that the state distribution at time \(m\) is given by \(\alpha^{(m)}=\alpha^{(0)}P^m\). By definition, \(\alpha^{(m)}=\alpha^{(0)} P^{[m]}\).

Therefore, \[P^{[m]}=P^m.\]

More generally, one can show the Chapman-Kolmogorov equations: \[ P^{[m+n]}=P^{[m]}P^{[n]}, \] that is, \[ p^{[m+n]}_{ij}=\sum_{s=1}^Kp^{[m]}_{is}p^{[n]}_{sj}, \] which states that the probability of moving from \(i\) to \(j\) in \(m+n\) steps is just the probability of moving from \(i\) to \(s\) in \(m\) steps, and then from \(s\) to \(j\) in \(n\) steps, summing over all \(s\).
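As a quick numerical check of these identities, the sketch below (assuming an illustrative 3-state kernel, not one from the course) computes \(P^{[m]}\) as the matrix power \(P^m\), evolves the state distribution as \(\alpha^{(m)}=\alpha^{(0)}P^m\), and verifies the Chapman-Kolmogorov equation \(P^{[m+n]}=P^{[m]}P^{[n]}\).

```python
import numpy as np

# Illustrative 3-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
alpha0 = np.array([1.0, 0.0, 0.0])       # start in state 1 with probability 1

# m-step kernel as a matrix power: P^[m] = P^m.
m, n = 4, 3
P_m = np.linalg.matrix_power(P, m)
P_n = np.linalg.matrix_power(P, n)

# State distribution at time m: alpha^(m) = alpha^(0) P^m.
alpha_m = alpha0 @ P_m
print(alpha_m, alpha_m.sum())            # a probability vector

# Chapman-Kolmogorov: P^[m+n] = P^[m] P^[n].
print(np.allclose(np.linalg.matrix_power(P, m + n), P_m @ P_n))   # True
```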

Definition (steady-state distribution / stationary distribution): If a state distribution \(\pi\) satisfies \[\pi P = \pi,\] then \(\pi\) is called a “steady-state distribution” (stationary distribution). Note that this is the same as \[\begin{align*} \pi_j = \sum_{i=1}^K\pi_ip_{ij}, \ \forall \ j = 1,2,\cdots , K. \end{align*}\] The LHS \(\pi_j\) is the steady-state probability of state \(j\). The RHS \(\sum_{i=1}^K\pi_ip_{ij}\) is the total probability flowing into state \(j\) from all states. So the equation asserts that, in equilibrium, the fraction of time spent in state \(j\) remains constant because the total probability flowing into \(j\) equals the probability already in \(j\).

An equivalent expression is the global balance equations: \[\begin{align*} \pi_j \sum_{i: i \neq j} p_{ji}= \sum_{i: i\neq j}\pi_i p_{ij}, \ \forall \ j = 1,2,\cdots , K, \end{align*}\] which state that the total probability per time step flowing out of state \(j\) (to all other states) equals the total probability per time step entering \(j\) from other states.

To see the meaning of the steady-state distribution exactly, consider a scenario in which, at some step \(n\), we reach the steady-state distribution: \[\begin{align*} \alpha^{(n)} &= \pi, \\ \alpha^{(n+1)} &= \alpha^{(n)}P = \pi. \end{align*}\] Therefore, once we reach the steady-state distribution, the state distribution becomes stationary.

Finding the steady-state distribution: For a given transition matrix \(P\) (non-degenerate, possibly with zero entries), note that \(\pi P = \pi\) is equivalent to \[P^\top\pi^\top = 1\cdot\pi^\top.\] So \(\pi^\top\) is an eigenvector of \(P^\top\) corresponding to eigenvalue 1. The eigenvector found this way then needs to be normalized so that its entries sum to 1. A limitation is that a numerical eigen-solver is not guaranteed to return the eigenvector in real-valued form (it may come back complex-typed).
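A minimal NumPy sketch of this eigenvector approach, using the same illustrative kernel as above; the eigenvector associated with eigenvalue 1 is normalized to sum to 1, and its real part is taken in case the solver returns a complex-typed array.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])

# pi^T is the eigenvector of P^T associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))     # index of the eigenvalue closest to 1
v = np.real(eigvecs[:, k])               # may come back complex-typed; keep the real part
pi = v / v.sum()                         # normalize so the entries sum to 1

print(pi)
print(np.allclose(pi @ P, pi))           # check the stationarity condition pi P = pi
```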

A more general approach is to solve the following system of equations: \[M^\top\pi^\top = r^\top,\] where \(M\) is the \(K \times K\) matrix obtained from \([I-P, \mathbf{1}]\) by removing the last column of \(I-P\) (one of the equations in \(\pi(I-P)=0\) is redundant), and \(r\) is the \(1 \times K\) vector \((0, 0, \ldots, 0, 1)\), whose final entry encodes the normalization \(\pi\mathbf{1}=1\).
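A sketch of this linear-system approach under the same illustrative kernel; \(M\) is built by dropping the last column of \(I-P\) and appending a column of ones, and \(r=(0,\ldots,0,1)\).

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
K = P.shape[0]

# Build M = [(I - P) with its last column removed, 1]: a K x K matrix.
A = np.eye(K) - P
M = np.hstack([A[:, :K - 1], np.ones((K, 1))])

# r = (0, ..., 0, 1): the zeros encode pi (I - P) = 0 (one redundant equation
# dropped), the final 1 encodes the normalization pi 1 = 1.
r = np.zeros(K)
r[-1] = 1.0

pi = np.linalg.solve(M.T, r)             # solves M^T pi^T = r^T
print(pi)
print(np.allclose(pi @ P, pi))           # pi P = pi
```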

Limiting distribution: A Markov Chain has a limiting distribution if \[ \pi_j = \lim_{n\to \infty} p_{ij}^{[n]} \] exists for all states \(i, j\) and the limit is independent of the starting state \(i\). This says that the long-run distribution satisfies \(P(X^{(n)}=j) \to \pi_j\) as \(n \to \infty\), independently of the state at time \(0\). Intuitively, the chain “forgets” where it started: its distribution stabilizes over time to a unique equilibrium \(\pi\), which is the same for all initial states.

Regular Markov chains

Most of the M.C. encountered in MCMC settings enjoy some nice properties. In particular, we’re interested in M.C.s that satisfy the following three properties.

irreducible

The chain has a positive probability of eventually reaching any state \(j\) from any state \(i\), i.e., \[\forall i,j, \ p_{ij}^{[n_0]} > 0 \text{ for some } \ n_0.\] Irreducibility means the chain is “fully connected”: there are no isolated subsets of states that the chain cannot escape from or enter.

aperiodic
  • A state \(i\) has period \(v > 1\) if \(p_{ii}^{[m]}>0\) holds only for \(m\) that are multiples of \(v\) (formally, the period is \(\gcd\{m\ge 1: p_{ii}^{[m]}>0\}\)). That is, the chain can return to \(i\) only at fixed, regular intervals.
  • A state \(i\) is aperiodic if its period is 1; equivalently, \(p_{ii}^{[m]}>0\) for all sufficiently large \(m\). In other words, there is no fixed cycle: once the chain can return to \(i\), it can do so at irregular intervals.
  • A chain is called “aperiodic” if all states are aperiodic.

Theorem: Every irreducible, aperiodic finite state Markov Chain has a limiting distribution which is equal to its unique stationary distribution.
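As a numerical illustration of the theorem (the illustrative kernel used above has all entries positive, hence is irreducible and aperiodic), every row of \(P^n\) converges to the same vector, the stationary distribution \(\pi\):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])

# For an irreducible, aperiodic finite chain, every row of P^n converges to pi,
# so the limiting distribution does not depend on the starting state.
print(np.linalg.matrix_power(P, 50))     # all rows are (numerically) identical
```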

recurrent
  • A state \(i\) is said to be (positive) recurrent if, when we run the M.C. from \(i\), we are guaranteed to return to state \(i\) with probability \(1\) (and hence infinitely often), and the first return happens within a finite number of steps on average.
  • A chain is called recurrent if all states are.

In fact, every irreducible finite-state Markov Chain is also positive recurrent in this sense: with finitely many states, it is impossible for probability mass to “leak away” indefinitely, so every state must eventually be revisited.

Ergodic M.C.

Henceforth, we restrict ourselves to time-invariant M.C. (possibly on a continuous state space) that are irreducible, aperiodic, and (positive) recurrent. We call these M.C. “ergodic M.C.”.

Theorem: Any ergodic M.C. has a unique stationary distribution \(\pi\), i.e., \(\pi P = \pi\), and it is equal to the long-run distribution: \[\begin{align*} \lim _{n \rightarrow \infty} p_{i j}^{[n]}=\pi_j \quad \forall i, j \end{align*}\] That is, the distribution of the chain converges to \(\pi\) regardless of the initial state.

Sampling from a stationary distribution

Let \(\alpha^{(m)}\) be the state distribution at time \(m\). How can we gather a sample from the stationary distribution? One approach involves initiating \(N\) Markov chains, running them for a sufficiently long period, and then collecting the values close to the end of each chain’s run. These values then serve as a sample from the stationary distribution. An alternative and often more practical method, made possible by ergodicity, is to take several values near the end of a single long-running Markov chain. This set of values can also be considered a sample from the stationary distribution; a simulation sketch is given after the list below.

  • ensemble: the set of all possible chains \[\begin{bmatrix} \text{chain 1} & \text{chain 2} & \cdots & \text{chain N} \\ X^{(1)}_1 & X^{(1)}_2 & \cdots & X^{(1)}_N \\ X^{(2)}_1 & X^{(2)}_2 & \cdots & X^{(2)}_N \\ \vdots & \vdots & \cdots & \vdots \\ X^{(\infty)}_1 & X^{(\infty)}_2 & \cdots & X^{(\infty)}_N \\ \end{bmatrix}\]
  • single chain \[\begin{align*} X^{(1)} &\sim \alpha^{(1)}\\ X^{(2)} &\sim \alpha^{(2)}\\ & \ \ \vdots\\ X^{(\infty)} &\sim \alpha^{(\infty)} \equiv \pi\\ \end{align*}\]
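A small simulation contrasting the two approaches, again with the illustrative kernel used earlier (a sketch, not the lab code): the final states of many independent chains and the late portion of one long chain both have empirical distributions close to \(\pi\).

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
K = P.shape[0]

def run_chain(T, x0=0):
    """Simulate one trajectory of length T, starting from state x0."""
    x, path = x0, np.empty(T, dtype=int)
    for t in range(T):
        x = rng.choice(K, p=P[x])        # draw the next state from row x of P
        path[t] = x
    return path

# Ensemble approach: N independent chains, keep only the final state of each.
N, T = 5000, 200
ensemble_sample = np.array([run_chain(T)[-1] for _ in range(N)])
print(np.bincount(ensemble_sample, minlength=K) / N)             # approximately pi

# Single-chain approach: one long run, keep the values near the end.
long_chain = run_chain(100_000)
print(np.bincount(long_chain[-50_000:], minlength=K) / 50_000)   # also approximately pi
```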

The Law of Large Numbers

An ergodic Markov chain (M.C.) has a very important property: the time average computed along a single realization of the chain converges to the ensemble average (the expected value over all possible realizations of the chain) after the chain has run sufficiently long.

Goal: Estimate

\[\begin{align*} E\left[h\left(X^{(\infty)}\right)\right]=\int h(x) \pi(x) d x \end{align*}\]

where \(\pi(x)\) is the stationary (or equilibrium) distribution of the chain.

Classical Law of Large Numbers:

In the conventional setting, where one draws an independent and identically distributed (i.i.d.) sample \(\left\{X_1^{(\infty)}, \ldots, X_N^{(\infty)}\right\}\) from the target distribution \(\pi\),

\[\begin{align*} \frac{1}{N} \sum_{i=1}^N h\left(X_i^{(\infty)}\right) \longrightarrow E\left[h\left(X^{(\infty)}\right)\right] \quad \text { as } N \rightarrow \infty . \end{align*}\]

This convergence is across the ensemble, that is, across many independent samples from \(\pi\).

Ergodic Law of Large Numbers: Ergodicity guarantees a similar property along time, even when we have only one realization of the Markov chain:

\[\begin{align*} \frac{1}{T} \sum_{n=1}^T h\left(X^{(n)}\right) \longrightarrow E\left[h\left(X^{(\infty)}\right)\right] \quad \text { as } T \rightarrow \infty . \end{align*}\]

That is, the time average of a single trajectory converges to the same value as the ensemble average across independent draws. This is the theoretical justification for Markov Chain Monte Carlo (MCMC): by running one sufficiently long chain, we can estimate expectations with respect to \(\pi\) as if we had many independent samples.

Remark:

In practice, one runs the Markov chain for a period called the burn-in phase, long enough for the chain to “forget” its initial state and reach its stationary regime.

After burn-in, samples \(\left\{X^{(n)}\right\}\) are treated as approximately drawn from \(\pi\), though they are correlated rather than independent. Nevertheless, under ergodicity, the law of large numbers for Markov chains ensures that averages of the form

\[\begin{align*} \hat{\mu}_T=\frac{1}{T} \sum_{n=1}^T h\left(X^{(n)}\right) \end{align*}\]

still converge to the true expectation \(E_\pi[h(X)]\) as \(T \rightarrow \infty\).
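A sketch of this time-average estimator for an illustrative function \(h\) and the same illustrative kernel as before; after discarding a burn-in period, \(\hat{\mu}_T\) computed from a single trajectory is compared against the ensemble average \(E_\pi[h(X)]=\sum_j \pi_j h(j)\).

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
K = P.shape[0]
h = np.array([1.0, 4.0, 9.0])            # illustrative function h(j), indexed by state

# Simulate one long trajectory of the chain.
T, burn_in = 200_000, 10_000
x, chain = 0, np.empty(T, dtype=int)
for t in range(T):
    x = rng.choice(K, p=P[x])
    chain[t] = x

# Time average after burn-in: (1/T') sum_n h(X^(n)).
mu_hat = h[chain[burn_in:]].mean()

# Ensemble average E_pi[h(X)] = sum_j pi_j h(j), with pi from the linear system
# of the earlier section.
A = np.eye(K) - P
M = np.hstack([A[:, :K - 1], np.ones((K, 1))])
r = np.zeros(K); r[-1] = 1.0
pi = np.linalg.solve(M.T, r)
print(mu_hat, pi @ h)                    # the two numbers should be close
```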

Detailed balance condition

Given a target distribution \(\pi\), how do we find a M.C. that has \(\pi\) as its stationary distribution? It turns out that the key is finding a so-called time-reversible M.C. that satisfies the detailed balance condition.

Goal: to find the M.C. with the right transition kernel \(P_{ij}.\)

Idea: Consider a M.C. that is “time reversible”, that is, there exists some distribution \(\pi\) for which the detailed balance condition holds: \[ \pi_jP_{ji}=\pi_iP_{ij}, \quad \forall i,j, \] implying that the probability flows between any two states \(i, j\) balance each other when weighted by the probabilities of the outgoing states under \(\pi\): \[ P(X^{(n)}=j)P(X^{(n+1)}=i|X^{(n)}=j)=P(X^{(n)}=i)P(X^{(n+1)}=j|X^{(n)}=i). \] This means that, under \(\pi\), the probability flow from state \(i\) to state \(j\) is exactly balanced by the flow from \(j\) to \(i\).

Note also that time reversibility of the chain is implied: in this regime, the sequence of states has the same probabilistic behavior whether we observe it forward or backward in time. That is, the joint distribution of two consecutive states is symmetric: \[\begin{align*} P\big(X^{(n)}=i, X^{(n+1)}=j\big)=P\big(X^{(n)}=j, X^{(n+1)}=i\big) \end{align*}\] Consequently, the direction of time does not matter for transitions: \[\begin{align*} P\big(X^{(n)}=i \mid X^{(n+1)}=j\big)=P\big(X^{(n+1)}=i \mid X^{(n)}=j\big) \end{align*}\]

Theorem: A set of transition probabilities \(P_{ij}\) satisfying the “detailed balance condition” with respect to \(\pi\) will have \(\pi\) as its stationary distribution. Indeed, summing the detailed balance condition over \(i\) gives \(\sum_{i}\pi_i P_{ij} = \pi_j \sum_{i} P_{ji} = \pi_j\).

Remark: The detailed balance condition is sufficient but not necessary for \(\pi\) to be a stationary distribution of the chain. The key insight for sampling from a given stationary distribution \(\pi\) is that it is enough to construct a transition mechanism \(P\) that satisfies the detailed balance condition with respect to \(\pi\).

In other words, we do not need to sample directly from \(\pi\); it suffices to design a Markov chain whose transitions are reversible with respect to \(\pi\). Once the chain reaches equilibrium, its samples are effectively drawn from the desired distribution \(\pi\).
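As a concrete illustration (not an example from the notes), a random walk on a graph with symmetric weights \(W_{ij}=W_{ji}\) is reversible: with \(d_i=\sum_j W_{ij}\), \(P_{ij}=W_{ij}/d_i\), and \(\pi_i\propto d_i\), detailed balance \(\pi_iP_{ij}=\pi_jP_{ji}\) holds, and hence \(\pi P=\pi\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Reversible chain from symmetric weights: P_ij = W_ij / d_i, pi_i proportional to d_i.
W = rng.random((4, 4))
W = W + W.T                                  # symmetric weights, W_ij = W_ji
d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()

flow = pi[:, None] * P                       # flow[i, j] = pi_i P_ij
print(np.allclose(flow, flow.T))             # detailed balance: pi_i P_ij = pi_j P_ji
print(np.allclose(pi @ P, pi))               # hence pi is stationary
```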

Metropolis Algorithm (discrete state space)

Goal: to find the transition kernel \(\tilde{P}_{ij}\), s.t., the detailed balance condition holds, i.e., \(\pi_j\tilde{P}_{ji}=\pi_i\tilde{P}_{ij},\forall i,j\).

We can start from an arbitrary transition kernel, called \(P_{ij}\) (the candidate kernel). By following the modification described below, we effectively turn \(P_{ij}\) into \(\tilde{P}_{ij}\).

Intuition goes like this:

Whenever a move is proposed, we accept it only with some acceptance probability, effectively adjusting the transition probabilities.

Formally, Metropolis’ idea works as follows. Given \(\pi\), the stationary (target) distribution, choose some candidate transition kernel \(P_{ij}\). For all \(i\):

  1. Define acceptance probability \[\alpha_{ij}=\min\Big(\frac{\pi_jP_{ji}}{\pi_iP_{ij}},1\Big),\] for each \(i\) and \(j\neq i\).
  2. Let \[ \tilde{P}_{ij}=\alpha_{ij}P_{ij}, j \neq i, \] \[ \tilde{P}_{ii}=1-\sum_{j:j\neq i}\tilde{P}_{ij}. \]

Then \(\pi\) is the stationary distribution from M.C. with transition probability given by this \(\tilde{P}_{ij}\).

Proof: provided in class.

Note: \(\tilde{P}_{ii}=1-\sum_{j:j\neq i}\tilde{P}_{ij}\) collects the probability of proposing \(i\) itself plus the probability of rejecting a proposed move from \(i\) to some other state \(j\neq i\); that is, it is the overall probability of remaining in \(i\). We can prove that \([\tilde{P}_{ij}: i, j]\) is a well-defined transition kernel.
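To make the construction concrete, the sketch below (with an illustrative target \(\pi\) and candidate kernel \(P\) whose entries are all positive) builds \(\tilde{P}\) as defined above and checks that it is a valid transition kernel, satisfies detailed balance with respect to \(\pi\), and therefore has \(\pi\) as its stationary distribution.

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])               # target stationary distribution
P = np.array([[0.4, 0.3, 0.3],               # an arbitrary candidate kernel
              [0.2, 0.5, 0.3],
              [0.5, 0.2, 0.3]])

# Acceptance probabilities alpha_ij = min(pi_j P_ji / (pi_i P_ij), 1).
alpha = np.minimum((pi[None, :] * P.T) / (pi[:, None] * P), 1.0)

# Off-diagonal: P~_ij = alpha_ij P_ij; diagonal: the leftover mass so rows sum to 1.
P_tilde = alpha * P
np.fill_diagonal(P_tilde, 0.0)
np.fill_diagonal(P_tilde, 1.0 - P_tilde.sum(axis=1))

print(np.allclose(P_tilde.sum(axis=1), 1.0))   # rows sum to 1: valid kernel
flow = pi[:, None] * P_tilde
print(np.allclose(flow, flow.T))               # detailed balance w.r.t. pi
print(np.allclose(pi @ P_tilde, pi))           # pi P~ = pi
```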

Algorithm (Metropolis)

Initialize \(X^{(0)}\in S\), where \(S\) is a discrete state space. For \(t = 1, 2, \ldots\):

  1. draw \(X^* \sim P(\cdot|X^{(t-1)})\);

  2. compute acceptance probability \[ \alpha(X^{(t-1)},X^*):=\min\left\{1,\frac{P(X^{(t-1)}|X^*)\pi_{X^*}}{P(X^*|X^{(t-1)})\pi_{X^{(t-1)}}}\right\}, \] where \(\pi_{X^*}\) is the probability of taking value \(X^*\) under the stationary distribution \(\pi\), and \(\pi_{X^{(t-1)}}\) is the probability of taking value \(X^{(t-1)}\) under the stationary distribution \(\pi\).

  3. set \(X^{(t)}\leftarrow X^*\) with probability \(\alpha(X^{(t-1)},X^*)\); otherwise \(X^{(t)}\leftarrow X^{(t-1)}\).

To implement step 3: generate \(U \sim \text{Unif}(0,1)\); if \(U \leq \alpha(X^{(t-1)},X^*)\), accept \(X^*\), i.e., set \(X^{(t)}\leftarrow X^*\); otherwise, reject \(X^*\) and set \(X^{(t)}\leftarrow X^{(t-1)}\).
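A minimal sketch of steps 1-3 as a sampler, assuming the target probabilities \(\pi\) (needed only up to a normalizing constant) and the proposal kernel are supplied as arrays; the function name and example values are illustrative, not the lab code.

```python
import numpy as np

def metropolis(pi, prop, T, x0=0, seed=None):
    """Discrete-space Metropolis sampler following steps 1-3 above.

    pi   : target probabilities, shape (K,), needed only up to a constant
    prop : proposal kernel, prop[i, j] = P(propose j | current state i)
    T    : number of iterations
    """
    rng = np.random.default_rng(seed)
    K = len(pi)
    x = x0
    chain = np.empty(T, dtype=int)
    for t in range(T):
        x_star = rng.choice(K, p=prop[x])                    # step 1: propose X*
        ratio = (prop[x_star, x] * pi[x_star]) / (prop[x, x_star] * pi[x])
        alpha = min(1.0, ratio)                              # step 2: acceptance prob
        if rng.random() <= alpha:                            # step 3: accept or reject
            x = x_star
        chain[t] = x
    return chain

# Illustrative target on K = 3 states with a uniform proposal.
pi = np.array([0.2, 0.5, 0.3])
prop = np.full((3, 3), 1.0 / 3.0)
chain = metropolis(pi, prop, T=50_000, seed=3)
burn_in = 5_000
print(np.bincount(chain[burn_in:], minlength=3) / (len(chain) - burn_in))  # close to pi
```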

Python Code Examples

Provided and discussed in lab.
