Updated: 2024-11-08. Statistical Simulation, Wei Li

Markov Chain

Consider a sequence of random variables \(\{ X^{(n)}: n=0, 1, \ldots \}\), each \(X^{(n)} \in S\) (the state space). We mainly discuss M.C.s on a discrete state space here, say \(S = \{1,2, \cdots, K\}\).

Definition: A Markov Chain is a sequence of random variables \(X^{(n)}\) such that \[\begin{align*} &P(X^{(n)}\in A_n | X^{(n-1)}\in A_{n-1},\cdots, X^{(0)}\in A_0)= P(X^{(n)}\in A_n | X^{(n-1)}\in A_{n-1}). \end{align*}\] That is, the probability distribution of \(X^{(n)}\) given the past variables depends only on \(X^{(n-1)}\). This is called a first-order Markov Chain.

Define the one-step transition kernel \[p^{[ n-1,n ]}_{ij} := P(X^{(n)}=j| X^{(n-1)}=i).\] We’ll restrict ourselves to time-invariant M.C.s, that is, \(p^{[n-1, n]}_{ij}\) does not depend on the time reference point but only on how far apart the two time points are. So we write \(p_{ij}:=p_{ij}^{[1]}:=p^{[n, n+1]}_{ij}=p^{[n-1, n]}_{ij}=\cdots=p^{[0, 1]}_{ij}\), etc.

one-step transition kernel

\[p_{ij} \geq 0, \ \forall\, i,j=1,\cdots, K; \qquad \sum_{j=1}^K p_{ij} = 1, \ \forall\, i=1,\cdots,K.\] These entries form the \(K\times K\) transition matrix \(P=(p_{ij})\), each of whose rows is a probability distribution.
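
As a quick illustration (a minimal sketch; the 3-state matrix below is a made-up toy example, not from the lab code), a one-step transition kernel can be stored as a row-stochastic numpy array and the two defining properties checked directly:

    import numpy as np

    # Toy 3-state transition matrix (rows = current state i, columns = next state j).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    assert np.all(P >= 0)                   # p_ij >= 0 for all i, j
    assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1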

m-step transition kernel

Define the \(m\)-step transition kernel \(p^{[m]}_{ij} := P(X^{(n+m)}=j \mid X^{(n)}=i)\) and collect it into the matrix \(P^{[m]}=(p^{[m]}_{ij})\). To see how it relates to \(P\), let’s consider the state distribution (occupation probability distribution) at each time:

Time 0: \[\alpha^{(0)} = (\alpha_1^{(0)}, \cdots,\alpha_K^{(0)}), \text{ where } \ P(X^{(0)}=i)=\alpha_i^{(0)}.\]

Time 1: \[\alpha^{(1)} = (\alpha_1^{(1)}, \cdots,\alpha_K^{(1)})= \alpha^{(0)}P, \qquad \text{since } \alpha_j^{(1)}=\sum_{i=1}^K\alpha_i^{(0)}p_{ij}.\] Iterating, the state distribution at time \(m\) is given by \(\alpha^{(m)}=\alpha^{(0)}P^m\). On the other hand, by the definition of the \(m\)-step kernel, \(\alpha^{(m)}=\alpha^{(0)} P^{[m]}\).

Therefore, \[P^{[m]}=P^m.\]

More generally, one can show the Chapman-Kolmogorov equations: \[ P^{[m+n]}=P^{[m]}P^{[n]}, \] that is, \[ p^{[m+n]}_{ij}=\sum_{s=1}^Kp^{[m]}_{is}p^{[n]}_{sj}, \] which states that the probability of moving from \(i\) to \(j\) in \(m+n\) steps is the probability of moving from \(i\) to \(s\) in \(m\) steps and then from \(s\) to \(j\) in \(n\) steps, summed over all intermediate states \(s\).
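
A small numerical check of \(P^{[m]}=P^m\) and the Chapman-Kolmogorov equations (a sketch reusing the toy matrix above; any row-stochastic matrix works):

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    m, n = 3, 4
    P_m = np.linalg.matrix_power(P, m)
    P_n = np.linalg.matrix_power(P, n)
    assert np.allclose(np.linalg.matrix_power(P, m + n), P_m @ P_n)  # P^[m+n] = P^[m] P^[n]

    # State distribution at time m: alpha^(m) = alpha^(0) P^m (row vector times matrix).
    alpha0 = np.array([1.0, 0.0, 0.0])      # start deterministically in state 1
    alpha_m = alpha0 @ P_m
    print(alpha_m, alpha_m.sum())           # a probability vector that sums to 1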

Definition (steady-state distribution/stationary distribution): If a state distribution \(\pi\) satisfies \[\pi P = \pi,\] then \(\pi\) is called a “steady-state distribution” (stationary distribution). Note that this is the same as \[\begin{align*} \pi_j = \sum_{i=1}^K\pi_ip_{ij}, \ \forall \ j = 1,2,\cdots , K. \end{align*}\] The LHS \(\pi_j\) is the steady-state probability of state \(j\). The RHS \(\sum_{i=1}^K\pi_ip_{ij}\) is the total probability flowing into state \(j\) from all states. An equivalent expression is the set of global balance equations: \[\begin{align*} \pi_j \sum_{i: i \neq j} p_{ji}= \sum_{i: i\neq j}\pi_i p_{ij}, \ \forall \ j = 1,2,\cdots , K, \end{align*}\] which state that the probability of being in state \(j\) times the total probability of flowing out of state \(j\) equals the sum, over every state \(i \neq j\), of the probability of being in state \(i\) times the probability of flowing from that state into \(j\).

To see the meaning of the steady-state distribution, suppose that at some step \(n\) we have reached it: \[\begin{align*} \alpha^{(n)} &= \pi, \\ \alpha^{(n+1)} &= \alpha^{(n)}P = \pi P = \pi. \end{align*}\] Therefore, once we have reached the steady-state distribution, the state distribution stays there; it is stationary.

Find the steady-state distribution: For a given transition matrix \(P\) (non-degenerate, possibly with zero entries), note that \[P^\top\pi^\top = 1\cdot\pi^\top.\] So \(\pi^\top\) is an eigenvector of \(P^\top\) corresponding to the eigenvalue 1. The eigenvector found numerically then needs to be normalized so that its entries sum to 1. A limitation is that the eigenvector returned by a numerical routine is not guaranteed to be in real-valued form.

A more general approach is to solve the system of equations \[M^\top\pi^\top = r^\top,\] where \(M\) is the \(K \times K\) matrix obtained from \([I-P, \mathbf{1}]\) by deleting the last column of \(I-P\) (so the last column of \(M\) is the all-ones vector \(\mathbf{1}\)), and \(r\) is the \(1 \times K\) vector \((0, 0, \ldots, 0, 1)\). This system encodes the stationarity equations \(\pi(I-P)=0\) (with one redundant equation dropped) together with the normalization constraint \(\pi\mathbf{1}=1\).
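
Both approaches can be implemented in a few lines of numpy (a sketch with the same toy matrix assumed for illustration):

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])
    K = P.shape[0]

    # Approach 1: eigenvector of P^T for the eigenvalue 1 (a left eigenvector of P).
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))       # index of the eigenvalue closest to 1
    pi_eig = np.real(vecs[:, k])            # may be returned as complex; take the real part
    pi_eig = pi_eig / pi_eig.sum()          # normalize so the entries sum to 1

    # Approach 2: solve M^T pi^T = r^T, with M = [(I - P) minus its last column, 1].
    M = np.hstack([(np.eye(K) - P)[:, :-1], np.ones((K, 1))])
    r = np.zeros(K); r[-1] = 1.0
    pi_lin = np.linalg.solve(M.T, r)

    assert np.allclose(pi_eig, pi_lin)
    assert np.allclose(pi_lin @ P, pi_lin)  # pi P = pi
    print(pi_lin)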

Limiting distribution: A Markov Chain has a limiting distribution if \[ \pi_j = \lim_{n\to \infty} p_{ij}^{[n]} \] exists and is independent of the starting state \(i\). This basically says that the long-run distribution satisfies \(P(X^{(n)}=j) \to \pi_j\) as \(n \to \infty\), regardless of the state at time \(0\).
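
For an irreducible, aperiodic finite chain (the toy matrix above qualifies, since all its entries are positive), the limiting distribution can be seen numerically: as \(n\) grows, every row of \(P^n\) converges to the same vector. A sketch:

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    Pn = np.linalg.matrix_power(P, 50)
    print(Pn)   # all rows are (numerically) identical; each row is the limiting distribution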

Regular Markov chains

Most of the M.C.s encountered in MCMC settings enjoy some nice properties. In particular, we’re interested in M.C.s that satisfy the following three properties.

irreducible

The chain has a positive probability of eventually reaching any state from any state, i.e. \[\forall i,j, \ p_{ij}^{[n_0]} > 0 \ \text{ for some } \ n_0 \ (\text{which may depend on } i, j).\]

aperiodic
  • The period of a state \(i\) is \(v = \gcd\{m \geq 1: p_{ii}^{[m]}>0\}\); state \(i\) is periodic with period \(v>1\) if returns to \(i\) can occur only at multiples of \(v\).
  • A state \(i\) is aperiodic if its period is 1; equivalently (when returns to \(i\) are possible at all), \(p_{ii}^{[m]}>0\) for all sufficiently large \(m\).
  • A chain is called “aperiodic” if all of its states are aperiodic. (A numerical sketch for computing the period of a state follows this list.)
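
A minimal sketch of computing the period of a state: take the gcd of the return times \(m\) with \(p_{ii}^{[m]}>0\), scanned up to a finite horizon (the horizon cut-off and the toy chains below are assumptions made for illustration):

    from math import gcd
    import numpy as np

    def period(P, i, horizon=200):
        """gcd of {m >= 1 : (P^m)_{ii} > 0}, scanning m up to `horizon`."""
        g, Pm = 0, np.eye(P.shape[0])
        for m in range(1, horizon + 1):
            Pm = Pm @ P
            if Pm[i, i] > 1e-12:
                g = gcd(g, m)
        return g

    flip_flop = np.array([[0.0, 1.0],
                          [1.0, 0.0]])      # deterministic switch between two states
    print(period(flip_flop, 0))             # 2: state 0 is periodic with period 2

    lazy = np.array([[0.5, 0.5],
                     [0.5, 0.5]])
    print(period(lazy, 0))                  # 1: state 0 is aperiodic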

Theorem: Every irreducible, aperiodic finite state Markov Chain has a limiting distribution which is equal to its unique stationary distribution.

recurrent
  • A state \(i\) is said to be (positive) recurrent if, when we run the M.C. starting from \(i\), we are guaranteed to return to state \(i\) infinitely often, and the first return happens within a finite number of steps on average.
  • A chain is called recurrent if all of its states are.

As a matter of fact, every irreducible finite state Markov Chain is also positive recurrent as defined above.

Ergodic M.C.

Henceforth, we restrict ourselves to time-invariant M.C.s (possibly on a continuous state space) that are irreducible, aperiodic, and (positive) recurrent. We call these M.C.s “ergodic M.C.s”.

Theorem: Any ergodic M.C. has a unique stationary distribution \(\pi\), i.e., \(\pi P = \pi\), and it is equal to the long-run distribution: \[\pi_j = \lim_{n \to \infty} p_{ij}^{[n]}, \qquad \text{for all} \ i.\]

Sampling from a stationary distribution

Let \(\alpha^{(m)}\) be the state distribution at time \(m\). How can we gather a sample from the stationary distribution? One approach involves initiating N Markov chains, running them for a sufficiently long period, and then gathering the values close to the end of each chain’s run. These values would then serve as a sample from the stationary distribution. An alternative and often more practical method, made possible by ergodicity, is to take several values near the end of a single long-running Markov chain. This set of values can also be considered a sample from the stationary distribution.

  • ensemble: the set of all possible chains \[\begin{bmatrix} \text{chain 1} & \text{chain 2} & \cdots & \text{chain N} \\ X^{(1)}_1 & X^{(1)}_2 & \cdots & X^{(1)}_N \\ X^{(2)}_1 & X^{(2)}_2 & \cdots & X^{(2)}_N \\ \vdots & \vdots & \cdots & \vdots \\ X^{(\infty)}_1 & X^{(\infty)}_2 & \cdots & X^{(\infty)}_N \\ \end{bmatrix}\]
  • single chain \[\begin{align*} X^{(1)} &\sim \alpha^{(1)}\\ X^{(2)} &\sim \alpha^{(2)}\\ & \ \ \vdots\\ X^{(\infty)} &\sim \alpha^{(\infty)} \equiv \pi\\ \end{align*}\]

The Law of Large Numbers

An ergodic M.C. has a very important property: the time average along a single realization approaches the average over all possible realizations of the chain (the ensemble) evaluated at some particular point in the far future.

Goal: estimate \(E(h(X^{(\infty)})) = \int h(x)\pi(x)dx\)

The conventional law of large numbers says that, using the ensemble, \[\frac{1}{N}\sum_{i=1}^Nh(X^{(\infty)}_i) \rightarrow E(h(X^{(\infty)})) \text{ as } \ N \rightarrow \infty.\]

Ergodicity guarantees that, using just a single M.C., \[\frac{1}{T}\sum_{n=1}^Th(X^{(n)}) \rightarrow E[h(X^{(\infty)})] \text{ as }\ T \rightarrow \infty.\] This provides the theoretical justification for why, by obtaining samples from a single M.C. (after it has run long enough), we are effectively obtaining samples from the steady-state distribution.

In practice, we take a sample from a single realization of the M.C. after letting it run for a period of time called the “burn-in” time, which we consider long enough for the chain to reach the stationary distribution. The sample we then obtain behaves as if it were a “random” sample from the stationary distribution (though with some weak correlation).
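
A simulation sketch of the ergodic law of large numbers for a single chain (the toy kernel, the function \(h\), the run length, and the burn-in below are all assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])
    h = np.array([1.0, 2.0, 3.0])           # h(j) for states j = 0, 1, 2

    # One long realization of the chain.
    T, burn_in = 100_000, 1_000
    x, path = 0, np.empty(T, dtype=int)
    for t in range(T):
        x = rng.choice(3, p=P[x])
        path[t] = x
    time_avg = h[path[burn_in:]].mean()

    # Stationary expectation E_pi[h(X)] = sum_j h(j) pi_j, with pi from the linear system.
    K = P.shape[0]
    M = np.hstack([(np.eye(K) - P)[:, :-1], np.ones((K, 1))])
    r = np.zeros(K); r[-1] = 1.0
    pi = np.linalg.solve(M.T, r)

    print(time_avg, h @ pi)                 # the two numbers should be close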

Detailed balance condition

Given a desired stationary distribution \(\pi\), how do we find an M.C. that has \(\pi\) as its stationary distribution? It turns out that the key is to find a so-called time-reversible M.C. that satisfies the detailed balance condition.

Goal: to find the M.C. with the right transition kernel \(P_{ij}.\)

Idea: Consider an M.C. that is “time reversible”, that is, there exists some distribution \(\pi\) for which the detailed balance condition holds: \[ \pi_jP_{ji}=\pi_iP_{ij}, \quad \forall i,j, \] implying that the flows of probability between any two states \(i, j\) balance each other when weighted by the occupation probabilities of the outgoing states: \[ P(X^{(n)}=j)P(X^{(n+1)}=i|X^{(n)}=j)=P(X^{(n)}=i)P(X^{(n+1)}=j|X^{(n)}=i). \] Note also that time reversibility of the chain (run at stationarity) is implied, so the direction of time does not matter for the transition probabilities of the chain: \(P(X^{(n)}=i|X^{(n+1)}=j)=P(X^{(n+1)}=i|X^{(n)}=j)=P_{ji}\).

Theorem: A set of transition probabilities \(P_{ij}\) satisfying the “detailed balance condition” with respect to \(\pi\) will have \(\pi\) as its stationary distribution. Indeed, summing the condition over \(i\) gives \(\sum_{i}\pi_i P_{ij} = \pi_j\sum_{i} P_{ji} = \pi_j\).

Remark: The detailed balance condition is a sufficient but not necessary condition for an M.C. to have \(\pi\) as a stationary distribution. The key insight for sampling from some stationary distribution \(\pi\) is that it suffices to simulate from some transition kernel that satisfies the detailed balance condition defined by that stationary distribution.

Metropolis Algorithm (discrete state space)

Goal: to find the transition kernel \(\tilde{P}_{ij}\), s.t., the detailed balance condition holds, i.e., \(\pi_j\tilde{P}_{ji}=\pi_i\tilde{P}_{ij},\forall i,j\).

We can use some arbitrary transition kernel, called \({P}_{ij}\) (the proposal kernel). By following the modification described below, we effectively turn \(P_{ij}\) into \(\tilde{P}_{ij}\).

Intuition goes like this:

Whenever a move is proposed, we want to weight it by some acceptance probability (i.e., the probability of accepting the move), effectively adjusting the transition probability.

Formally, Metropolis’ idea works as follows. Given some \(\pi\) (the stationary/target distribution), choose some candidate transition kernel \(P_{ij}\). For all \(i\),

  1. Define acceptance probability \[\alpha_{ij}=\min\Big(\frac{\pi_jP_{ji}}{\pi_iP_{ij}},1\Big),\] for each \(i\) and \(j\neq i\).
  2. Let \[ \tilde{P}_{ij}=\alpha_{ij}P_{ij}, j \neq i, \] \[ \tilde{P}_{ii}=1-\sum_{j:j\neq i}\tilde{P}_{ij}. \]

Then \(\pi\) is the stationary distribution from M.C. with transition probability given by this \(\tilde{P}_{ij}\).

Proof: provided in class.

Note: \(\tilde{P}_{ii}=1-\sum_{j:j\neq i}\tilde{P}_{ij}\) is the probability that we reject any proposal that moves from \(i\) to any other state \(j\neq i\), that is, it is the probability of moving from \(i\) to \(i\) itself (remaining in \(i\)). We can prove that our \([\tilde{P}_{ij}: i, j]\) is a well-defined transition kernel.
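A sketch of this kernel modification for a discrete target (the target \(\pi\) and the uniform proposal kernel below are made-up toy choices): build \(\tilde{P}\) from an arbitrary proposal and verify detailed balance and stationarity numerically.

    import numpy as np

    pi = np.array([0.2, 0.3, 0.5])          # target stationary distribution on 3 states
    P = np.full((3, 3), 1.0 / 3.0)          # an arbitrary proposal kernel (here: uniform)

    K = len(pi)
    P_tilde = np.zeros_like(P)
    for i in range(K):
        for j in range(K):
            if j != i:
                alpha_ij = min(1.0, (pi[j] * P[j, i]) / (pi[i] * P[i, j]))
                P_tilde[i, j] = alpha_ij * P[i, j]
        P_tilde[i, i] = 1.0 - P_tilde[i].sum()   # rejection mass stays at state i

    F = pi[:, None] * P_tilde               # F_ij = pi_i * P_tilde_ij
    assert np.allclose(F, F.T)              # detailed balance: pi_i P~_ij = pi_j P~_ji
    assert np.allclose(pi @ P_tilde, pi)    # hence pi P~ = pi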

Algorithm (Metropolis)

Initialize \(X^{(0)}\in S\), where \(S\) is a discrete state space. For \(t = 1, 2, \ldots\):

  1. draw \(X^* \sim P(\cdot|X^{(t-1)})\);

  2. compute acceptance probability \[ \alpha(X^{(t-1)},X^*):=\min\left\{1,\frac{P(X^{(t-1)}|X^*)\pi_{X^*}}{P(X^*|X^{(t-1)})\pi_{X^{(t-1)}}}\right\}, \] where \(\pi_{X^*}\) is the probability of taking value \(X^*\) under the stationary distribution \(\pi\), and \(\pi_{X^{(t-1)}}\) is the probability of taking value \(X^{(t-1)}\) under the stationary distribution \(\pi\).

  3. set \(X^{(t)}\leftarrow X^*\) with probability \(\alpha(X^{(t-1)},X^*)\); otherwise \(X^{(t)}\leftarrow X^{(t-1)}\).

To implement step 3, one can generate \(U \sim \text{Unif}(0,1)\); if \(U \leq \alpha(X^{(t-1)},X^*)\), then accept \(X^*\), i.e., set \(X^{(t)}\leftarrow X^*\); otherwise, reject \(X^*\) and set \(X^{(t)}\leftarrow X^{(t-1)}\).

Python Code Examples

Provided and discussed in lab.
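
The lab code is not reproduced here; below is an independent minimal sketch of the algorithm above for a discrete state space (the target \(\pi\), the uniform proposal, the run length, and the burn-in are all assumed toy choices).

    import numpy as np

    rng = np.random.default_rng(1)

    pi = np.array([0.2, 0.3, 0.5])          # target distribution on states {0, 1, 2}
    P = np.full((3, 3), 1.0 / 3.0)          # proposal kernel P(. | x); here uniform

    def metropolis(n_iter, x0=0):
        x = x0
        samples = np.empty(n_iter, dtype=int)
        for t in range(n_iter):
            x_star = rng.choice(3, p=P[x])                       # 1. propose X* ~ P(. | x)
            ratio = (P[x_star, x] * pi[x_star]) / (P[x, x_star] * pi[x])
            alpha = min(1.0, ratio)                              # 2. acceptance probability
            if rng.uniform() <= alpha:                           # 3. accept or reject
                x = x_star
            samples[t] = x
        return samples

    samples = metropolis(50_000)
    burn_in = 1_000
    freq = np.bincount(samples[burn_in:], minlength=3) / (len(samples) - burn_in)
    print(freq)                             # should be close to pi = (0.2, 0.3, 0.5)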
