Statistical Simulation, Wei Li. Updated: 2024-11-08.
Consider a sequence of random variables \(\{ X^{(n)}: n=0, 1, \ldots \}\), each \(X^{(n)} \in S\) (the state space). We mainly discuss M.C.s in a discrete state space here, say \(S = \{1,2, \cdots, K\}\).
Definition: A Markov Chain is a sequence of random variables \(X^{(n)}\) such that \[\begin{align*} &P(X^{(n)}\in A_n | X^{(n-1)}\in A_{n-1},\cdots, X^{(0)}\in A_0)= P(X^{(n)}\in A_n | X^{(n-1)}\in A_{n-1}). \end{align*}\] That is, the probability distribution of \(X^{(n)}\) given the past variables depends only on \(X^{(n-1)}\). This is called a first-order Markov Chain.
Define the one-step transition kernel \[p^{[ n-1,n ]}_{ij} := P(X^{(n)}=j| X^{(n-1)}=i) \qquad \text{transition kernel}.\] We’ll restrict ourselves to time-invariant M.C.s, that is, \(p^{[n-1, n]}_{ij}\) does not depend on the time reference point, but only on how far apart the two time points are. So we write \(p_{ij}^{[1]}:=p^{[n, n+1]}_{ij}=p^{[n-1, n]}_{ij}=\cdots=p^{[0, 1]}_{ij}\) etc.
one-step transition kernel
transition probability \[p_{ij}:=p^{[1]}_{ij}= P(X^{(n+1)}=j| X^{(n)}=i),\]
Transition matrix \[P := \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K}\\ p_{21} & p_{22} & \cdots & p_{2K}\\ \vdots & \vdots & \ddots & \vdots\\ p_{K1} & p_{K2} & \cdots & p_{KK}\\ \end{bmatrix}.\]
\[p_{ij} \geq 0, \ \forall i=1,\cdots, K; \ \sum_{j=1}^K p_{ij} = 1.\]
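For concreteness, here is a minimal numpy sketch (the 3-state matrix is made up for illustration and reused in later examples) that checks the two defining constraints:

```python
import numpy as np

# An illustrative 3-state transition matrix (made up for demonstration).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Check the defining constraints: nonnegative entries, each row sums to 1.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=1), 1.0)
```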
m-step transition kernel
transition probability \[p^{[m]}_{ij} := P(X^{(n+m)}=j| X^{(n)}=i),\]
Transition matrix \[P^{[m]} := \begin{bmatrix} p_{11}^{[m]} & p_{12}^{[m]} & \cdots & p_{1K}^{[m]}\\ p_{21}^{[m]} & p_{22}^{[m]} & \cdots & p_{2K}^{[m]}\\ \vdots & \vdots & \ddots & \vdots\\ p_{K1}^{[m]} & p_{K2}^{[m]} & \cdots & p_{KK}^{[m]}\\ \end{bmatrix}.\]
Let’s consider the state distribution (occupation probability distribution) over time:
Time 0: \[\alpha^{(0)} = (\alpha_1^{(0)}, \cdots,\alpha_K^{(0)}), \text{ where } \ P(X^{(0)}=i)=\alpha_i^{(0)}.\]
Time 1: \[\alpha^{(1)} = (\alpha_1^{(1)}, \cdots,\alpha_K^{(1)})= \alpha^{(0)}P.\] One can show that the state distribution at time \(m\) is given by \(\alpha^{(m)}=\alpha^{(0)}P^m\). By definition, \(\alpha^{(m)}=\alpha^{(0)} P^{[m]}\).
Therefore, \[P^{[m]}=P^m.\]
More generally, one can show the Chapman-Kolmogorov equations: \[ P^{[m+n]}=P^{[m]}P^{[n]}, \] that is, \[ p^{[m+n]}_{ij}=\sum_{s=1}^Kp^{[m]}_{is}p^{[n]}_{sj}, \] which states that the probability of moving from \(i\) to \(j\) in \(m+n\) steps is just the probability of moving from \(i\) to \(s\) in \(m\) steps, and then from \(s\) to \(j\) in \(n\) steps, summing over all \(s\).
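A quick numerical illustration (a sketch in numpy; the 3-state matrix and initial distribution are made up for demonstration) of propagating \(\alpha^{(0)}\) forward and of the Chapman-Kolmogorov relation:

```python
import numpy as np

# Same illustrative 3-state transition matrix as above.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Initial state distribution alpha^(0): start in state 1 with probability 1.
alpha0 = np.array([1.0, 0.0, 0.0])

# State distribution at time m: alpha^(m) = alpha^(0) P^m.
m = 5
alpha_m = alpha0 @ np.linalg.matrix_power(P, m)
print(alpha_m)

# Chapman-Kolmogorov: P^[m+n] = P^[m] P^[n].
n = 3
lhs = np.linalg.matrix_power(P, m + n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))  # True
```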
Definition (steady-state distribution/stationary distribution): If a state distribution \(\pi\) satisfies \[\pi P = \pi,\] then \(\pi\) is called a “steady-state distribution” (stationary distribution). Note that this is the same as \[\begin{align*} \pi_j = \sum_{i=1}^K\pi_ip_{ij}, \ \forall \ j = 1,2,\cdots , K. \end{align*}\] The LHS \(\pi_j\) is the steady-state probability of state \(j\). The RHS \(\sum_{i=1}^K\pi_ip_{ij}\) is the total probability flowing into state \(j\) from all states. An equivalent expression is the set of global balance equations: \[\begin{align*} \pi_j \sum_{i: i \neq j} p_{ji}= \sum_{i: i\neq j}\pi_i p_{ij}, \ \forall \ j = 1,2,\cdots , K, \end{align*}\] which states that the probability of being in state \(j\) times the total probability of flowing out of state \(j\) is equal to the sum of the probability of being in every state \(i\) (except \(j\)) times the probability of flowing from that state into \(j\).
To see the meaning of the steady-state distribution, consider a scenario in which, at some step \(n\), we have reached the steady-state distribution: \[\begin{align*} \alpha^{(n)} &= \pi, \\ \alpha^{(n+1)} &= \alpha^{(n)}P = \pi. \end{align*}\] Therefore, once we reach the steady-state distribution, the state distribution stays there, i.e., it becomes stationary.
Find the steady-state distribution: For a given transition matrix \(P\) (non-degenerate with possibly zero entries), note that \[P^\top\pi^\top = 1\cdot\pi^\top.\] So \(\pi^\top\) is an eigenvector of \(P^\top\) corresponding to eigenvalue 1. The found eigenvector then needs to be normalized. A limitation is that the eigenvector returned by a numerical solver is not guaranteed to be real-valued.
A more general approach is to solve the following system of equations: \[M^\top\pi^\top = r^\top,\] where \(M\) is the \(K \times K\) matrix obtained from \([I-P, \mathbf{1}]\) by removing the last column of \(I-P\), and \(r\) is the \(1 \times K\) vector \((0, 0, \ldots, 0, 1)\). This system combines \(K-1\) of the (redundant) equations in \(\pi(I-P)=0\) with the normalization constraint \(\sum_{i}\pi_i = 1\).
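A minimal sketch of both approaches in numpy (the illustrative matrix and all variable names are ours, not part of the notes):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
K = P.shape[0]

# Approach 1: eigenvector of P^T associated with eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi_eig = np.real(eigvecs[:, idx])
pi_eig = pi_eig / pi_eig.sum()

# Approach 2: solve M^T pi^T = r^T, where M is [I - P, 1] with the
# last column of I - P removed, and r = (0, ..., 0, 1).
A = np.hstack([np.eye(K) - P, np.ones((K, 1))])  # K x (K+1)
M = np.delete(A, K - 1, axis=1)                  # drop last column of I - P
r = np.zeros(K)
r[-1] = 1.0
pi_lin = np.linalg.solve(M.T, r)

print(pi_eig, pi_lin)  # the two answers should agree
```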
Limiting distribution: A Markov Chain has a limiting distribution if \[ \pi_j = \lim_{n\to \infty} p_{ij}^{[n]} \] exists and the distribution is independent of the starting state. This basically says that the long-run distribution \(P(X^{(n)}=j) \to \pi_j\) as \(n \to \infty\), independently of the state at time \(0\).
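For the illustrative matrix above, the limiting behavior can be seen directly (a quick numerical check, not a proof): every row of \(P^{[n]}\) becomes essentially the same vector for large \(n\).

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# For large n, every row of P^n is (approximately) the same probability
# vector pi, so the long-run distribution does not depend on the start state.
Pn = np.linalg.matrix_power(P, 100)
print(Pn)
```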
Most of the M.C. encountered in MCMC settings enjoy some nice properties. In particular, we’re interested in M.C.s that satisfy the following three properties.
Irreducibility: the chain has a positive probability of eventually reaching any state from any other state, i.e., \[\forall i,j, \ p_{ij}^{[n_0]} > 0 \text{ for some } \ n_0.\]
Aperiodicity: for every state \(i\), the greatest common divisor of the possible return times \(\{n \geq 1: p_{ii}^{[n]} > 0\}\) is 1, so the chain does not cycle through the states in a fixed pattern.
Positive recurrence: starting from any state \(i\), the chain returns to \(i\) with probability 1, and the expected return time is finite.
Theorem: Every irreducible, aperiodic finite state Markov Chain has a limiting distribution which is equal to its unique stationary distribution.
As a matter of fact, every irreducible finite state Markov Chain is also positive recurrent as defined above.
Henceforth, we restrict ourselves to time-invariant M.C.s (possibly in a continuous state space) that are irreducible, aperiodic, and (positive) recurrent. We call these M.C.s “ergodic M.C.s”.
Theorem: Any ergodic M.C. has a unique stationary distribution \(\pi\), i.e., \(\pi P = \pi\), and it is equal to the long-run distribution: \[\pi_j = \lim_{n \to \infty} p_{ij}^{[n]}, \qquad \text{for all} \ i.\]
Let \(\alpha^{(m)}\) be the state distribution at time \(m\). How can we gather a sample from the stationary distribution? One approach involves initiating N Markov chains, running them for a sufficiently long period, and then gathering the values close to the end of each chain’s run. These values would then serve as a sample from the stationary distribution. An alternative and often more practical method, made possible by ergodicity, is to take several values near the end of a single long-running Markov chain. This set of values can also be considered a sample from the stationary distribution.
An ergodic M.C. has a very important property: the time average along a single realization approaches the average over all possible realizations of the chain (the ensemble) at some particular point in the far future.
Goal: estimate \(E(h(X^{(\infty)})) = \int h(x)\pi(x)dx\)
The conventional law of large numbers says that, using the ensemble, \[\frac{1}{N}\sum_{i=1}^Nh(X^{(\infty)}_i) \rightarrow E(h(X^{(\infty)})) \text{ as } \ N \rightarrow \infty.\]
Ergodicity guarantees that, using just a single M.C., \[\frac{1}{T}\sum_{n=1}^Th(X^{(n)}) \rightarrow E[h(X^{(\infty)})] \text{ as }\ T \rightarrow \infty.\] This provides the theoretical justification for why, by obtaining samples from a single M.C. (after it has run long enough), we are effectively obtaining samples from the steady-state distribution.
In practice, we take a sample from a single realization of the M.C. after we have let it run for a period of time called the “burn-in” time, which we consider long enough for the chain to reach the stationary distribution. The sample we then obtain behaves as if it were a “random” sample from the stationary distribution (though with weak correlation).
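A minimal simulation sketch of this recipe (the transition matrix, the test function \(h\), and the burn-in length are all made up for illustration): simulate one long chain, discard the burn-in, and compare the time average of \(h\) with the stationary expectation \(\sum_j h(j)\pi_j\).

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
K = P.shape[0]

def simulate_chain(P, x0, T, rng):
    """Simulate T steps of the chain with transition matrix P from state x0."""
    x = np.empty(T, dtype=int)
    x[0] = x0
    for t in range(1, T):
        x[t] = rng.choice(K, p=P[x[t - 1]])
    return x

h = np.array([1.0, 4.0, 9.0])          # an arbitrary test function h(j)

T, burn_in = 100_000, 1_000
x = simulate_chain(P, x0=0, T=T, rng=rng)
time_avg = h[x[burn_in:]].mean()        # time average of h after burn-in

# Stationary expectation sum_j h(j) pi_j, with pi found as before.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()
print(time_avg, h @ pi)                 # these should be close
```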
Given a target distribution \(\pi\), how do we find a M.C. that has \(\pi\) as its stationary distribution? It turns out that the key is to find a so-called time-reversible M.C. that satisfies the detailed balance condition.
Goal: to find the M.C. with the right transition kernel \(P_{ij}.\)
Idea: Consider a M.C. that is “time reversible”, that is, there exists some distribution \(\pi\) for which the detailed balance condition holds: \[ \pi_jP_{ji}=\pi_iP_{ij}, \quad \forall i,j, \] meaning that the probability flows between any two states \(i, j\) balance each other, where each flow is the probability of occupying the outgoing state times the transition probability: \[ P(X^{(n)}=j)P(X^{(n+1)}=i|X^{(n)}=j)=P(X^{(n)}=i)P(X^{(n+1)}=j|X^{(n)}=i). \] Note also that time reversibility of the chain is implied, so the direction of time does not matter for the probability of a move: \(P(X^{(n)}=i|X^{(n+1)}=j)=P(X^{(n+1)}=i|X^{(n)}=j)=P_{ji}\).
Theorem: A set of transition probabilities \(P_{ij}\) satisfying the “detailed balance condition” with respect to \(\pi\) will have \(\pi\) as its stationary distribution.
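To see why (the standard one-line check, using the fact that each row of the transition matrix sums to one), sum the detailed balance condition over \(i\): \[\sum_{i=1}^K \pi_i P_{ij} = \sum_{i=1}^K \pi_j P_{ji} = \pi_j \sum_{i=1}^K P_{ji} = \pi_j, \qquad \forall \ j,\] which is exactly \(\pi P = \pi\).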
Remark: The detailed balance condition is a sufficient but not necessary condition for a M.C. to have \(\pi\) as its stationary distribution. The key insight for sampling from a given stationary distribution \(\pi\) is that it suffices to sample from some transition kernel that satisfies the detailed balance condition defined by \(\pi\).
Goal: to find the transition kernel \(\tilde{P}_{ij}\), s.t., the detailed balance condition holds, i.e., \(\pi_j\tilde{P}_{ji}=\pi_i\tilde{P}_{ij},\forall i,j\).
We can start from some arbitrary transition kernel, call it \(P_{ij}\). By following the modification described below, we can effectively turn \(P_{ij}\) into \(\tilde{P}_{ij}\).
Intuition goes like this:
Whenever a move is proposed, we accept it only with some acceptance probability, which effectively adjusts the transition probability.
Formally, Metropolis’ idea works as follows. Given some \(\pi\), the stationary (target) distribution, choose some candidate transition kernel \(P_{ij}\). For all \(i\), define \[\tilde{P}_{ij} := P_{ij}\,\alpha_{ij} \ \text{ for } j \neq i, \quad \text{where } \alpha_{ij} := \min\left\{1, \frac{\pi_j P_{ji}}{\pi_i P_{ij}}\right\}, \qquad \tilde{P}_{ii} := 1-\sum_{j: j\neq i}\tilde{P}_{ij}.\]
Then \(\pi\) is the stationary distribution of the M.C. with transition probabilities given by this \(\tilde{P}_{ij}\).
Proof: provided in class.
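A small numerical check of the construction (a sketch; the target \(\pi\) and proposal kernel \(P\) below are made up for illustration): build \(\tilde{P}\) from \(P\) and verify that detailed balance and \(\pi\tilde{P}=\pi\) hold.

```python
import numpy as np

# Target stationary distribution pi and an arbitrary proposal kernel P
# (both made up for illustration; rows of P sum to 1).
pi = np.array([0.2, 0.5, 0.3])
P = np.array([[0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4],
              [0.5, 0.25, 0.25]])
K = len(pi)

# Metropolis-style modification: P_tilde[i, j] = P[i, j] * alpha_ij for j != i,
# with alpha_ij = min(1, pi_j P_ji / (pi_i P_ij)), and the leftover mass on i.
P_tilde = np.zeros_like(P)
for i in range(K):
    for j in range(K):
        if j != i:
            alpha_ij = min(1.0, (pi[j] * P[j, i]) / (pi[i] * P[i, j]))
            P_tilde[i, j] = P[i, j] * alpha_ij
    P_tilde[i, i] = 1.0 - P_tilde[i].sum()

# Detailed balance: pi_i P_tilde[i, j] == pi_j P_tilde[j, i] for all i, j,
# i.e., the flow matrix F[i, j] = pi_i P_tilde[i, j] is symmetric.
F = pi[:, None] * P_tilde
print(np.allclose(F, F.T))            # True
# Stationarity: pi P_tilde = pi.
print(np.allclose(pi @ P_tilde, pi))  # True
```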
Note: \(\tilde{P}_{ii}=1-\sum_{j:j\neq i}\tilde{P}_{ij}\) is the probability that we reject any proposal moving from \(i\) to some other state \(j\neq i\); that is, it is the probability of moving from \(i\) to \(i\) itself (remaining at \(i\)). One can verify that \([\tilde{P}_{ij}: i, j]\) is a well-defined transition kernel.

The resulting sampler proceeds as follows. Initialize \(X^{(0)}\in S\), where \(S\) is a discrete state space. Then, for \(t = 1, 2, \ldots\):
1. Draw \(X^* \sim P(\cdot|X^{(t-1)})\).
2. Compute the acceptance probability \[ \alpha(X^{(t-1)},X^*):=\min\left\{1,\frac{P(X^{(t-1)}|X^*)\pi_{X^*}}{P(X^*|X^{(t-1)})\pi_{X^{(t-1)}}}\right\}, \] where \(\pi_{X^*}\) is the probability of taking value \(X^*\) under the stationary distribution \(\pi\), and \(\pi_{X^{(t-1)}}\) is the probability of taking value \(X^{(t-1)}\) under \(\pi\).
3. Set \(X^{(t)}\leftarrow X^*\) with probability \(\alpha(X^{(t-1)},X^*)\); otherwise set \(X^{(t)}\leftarrow X^{(t-1)}\).
To implement step 3, generate \(U \sim \mathrm{Unif}(0,1)\); if \(U \leq \alpha(X^{(t-1)},X^*)\), then accept \(X^*\), i.e., set \(X^{(t)}\leftarrow X^*\); otherwise, reject \(X^*\) and set \(X^{(t)}\leftarrow X^{(t-1)}\).
Provided and discussed in lab.
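For reference, here is a minimal sketch of the sampler for a discrete target, separate from the lab implementation mentioned above (the target \(\pi\) and the proposal kernel are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Target distribution pi on S = {0, 1, 2} and a proposal kernel P (both made up).
pi = np.array([0.2, 0.5, 0.3])
P = np.array([[0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4],
              [0.5, 0.25, 0.25]])
K = len(pi)

def metropolis_hastings(pi, P, x0, T, rng):
    x = np.empty(T, dtype=int)
    x[0] = x0
    for t in range(1, T):
        x_prev = x[t - 1]
        # Step 1: draw a proposal X* ~ P(. | X^(t-1)).
        x_star = rng.choice(K, p=P[x_prev])
        # Step 2: acceptance probability alpha(X^(t-1), X*).
        alpha = min(1.0, (P[x_star, x_prev] * pi[x_star]) /
                         (P[x_prev, x_star] * pi[x_prev]))
        # Step 3: accept with probability alpha, otherwise stay put.
        x[t] = x_star if rng.uniform() <= alpha else x_prev
    return x

samples = metropolis_hastings(pi, P, x0=0, T=50_000, rng=rng)
burn_in = 1_000
# Empirical state frequencies after burn-in should be close to pi.
print(np.bincount(samples[burn_in:], minlength=K) / len(samples[burn_in:]))
```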