Statistical Simulation, Wei Li. Updated: 2024-09-30.
Ref. Robert, C.P. and Casella, G., 2010. Introducing Monte Carlo Methods with R. New York: Springer.
The general problem is to evaluate \[ {E}_f (h(X))=\int_{\mathcal{X}} h(x) f(x)\,dx, \]
where \(h\) is a known function, \(X\) is a random variable with density \(f\), and \(\mathcal{X}=supp(f)\). In the following discussion, the domain \(\mathcal{X}\) is left implicit.
The principle of the Monte Carlo method for evaluating \({E}_f (h(X))\) is to generate an i.i.d. random sample \((X_1, \cdots, X_n)\) from \(f\) and estimate it by the empirical average \[ \overline{h}_n=\frac{1}{n}\sum_{j=1}^n h(X_j), \] called the Monte Carlo estimator of \({E}_f (h(X))\). This works because, by the Law of Large Numbers, \(\overline h_n \rightarrow {E}_f (h(X))\) almost surely as \(n\to\infty\).
Properties of estimator \(\overline h_n\)
Note that \(\overline h_n\) is unbiased with \(\operatorname{Var}_f (\overline h_n)=\operatorname{Var}_f(h(X))/n\), which can also be estimated from \((X_1, \cdots, X_n)\) by \(\sum_{j=1}^n \big(h(X_j)-\frac{1}{n}\sum_{i=1}^n h(X_i)\big)^2/n^2\).
Example: For \({N}(0,1)\), we want to estimate the c.d.f. \(\Phi(t)=\displaystyle\int_{-\infty}^t \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx\). We can rewrite \(\Phi(t)=\displaystyle\int_{-\infty}^{\infty}\mathbf{1}_{(-\infty,t)}(x) \, f(x) \,dx\), where \(f(x)=\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\). We can generate a sample of size \(n\), \(\{X_i\}_{i=1}^n \overset{i.i.d.}\sim f\), then use \(\overline h_n = \sum_{i=1}^n \mathbf{1}_{(-\infty,t)}(X_i)/n\) to estimate \(\Phi(t)\).
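A minimal R sketch of this example (the choices \(t=1.96\) and \(n=10^5\) are illustrative); the standard error line uses the variance estimate from the remark above.

```r
## Monte Carlo estimate of Phi(t) for t = 1.96, with its estimated standard error.
set.seed(1)
n <- 1e5
t <- 1.96
x <- rnorm(n)                              # X_1, ..., X_n i.i.d. from f = N(0, 1)
h <- as.numeric(x < t)                     # h(X_i) = indicator of (-Inf, t)
h_bar <- mean(h)                           # Monte Carlo estimator of Phi(t)
se_hat <- sqrt(sum((h - h_bar)^2) / n^2)   # estimated sd of h_bar
c(h_bar, pnorm(t), se_hat)                 # compare with the true value pnorm(1.96)
```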
Example (a drawback of the classical method): If \(Z \sim \mathcal{N}(0,1)\), we want to evaluate \(P(Z>4.5)\). Note that \(P(Z>4.5)={E}_f (\mathbf{1}_{(4.5,\infty)}(Z))\), so we could use classical Monte Carlo integration: generate a sample of size \(n\), \(\{Z_i\}_{i=1}^n \overset{i.i.d.}\sim \mathcal{N}(0,1)\), then use \(\overline h_n = \sum_{i=1}^n \mathbf{1}_{(4.5,\infty)}(Z_i)/n\) to estimate \(P(Z>4.5)\). But since \(P(Z>4.5)\approx 3.4\times 10^{-6}\), simulating \(\{Z_i\}_{i=1}^n \overset{i.i.d.}\sim \mathcal{N}(0,1)\) will typically produce no hits \(Z_i\in(4.5,\infty)\) at all, even over a very large number of iterations.
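A quick R sketch illustrating the problem (the sample size \(10^6\) is illustrative):

```r
## Naive Monte Carlo for P(Z > 4.5): almost all samples miss the region (4.5, Inf).
set.seed(1)
n <- 1e6
z <- rnorm(n)
mean(z > 4.5)                       # usually 0, or a few multiples of 1e-6
pnorm(4.5, lower.tail = FALSE)      # true value, about 3.4e-6
```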
Importance sampling
Suppose we have another p.d.f. \(g\) s.t. \(supp(g)\supseteq supp(f)\). Then \[ {E}_f (h(X))=\int h(x) f(x)\,dx=\int h(x) \frac{f(x)}{g(x)}g(x)\,dx={E}_g \Big(h(X) \frac{f(X)}{g(X)}\Big). \] So we can generate a sample of size \(n\), \(\{X_i\}_{i=1}^n \overset{i.i.d.}\sim g\), then use \(\overline h_n = \frac{1}{n}\sum_{i=1}^n h(X_i) \frac{f(X_i)}{g(X_i)}\) to estimate \({E}_f (h(X))=\int h(x) f(x)\,dx\).
Here \(g\) is called the importance function, \(f(X_i)/g(X_i)\) is called the importance weight for \(X_i\), and \(\left( X_i, f(X_i)/g(X_i) \right)\) is called an importance sample; \(\frac{1}{n}\sum_{i=1}^n h(X_i) f(X_i)/g(X_i)\) is called the importance sampling estimator of \({E}_f (h(X))\).
Example (continued): recall that \(Z \sim \mathcal{N}(0,1)\) and we want to evaluate \(P(Z>4.5)\). Now take \(g\) to be the density of an \(\operatorname{Exp}(1)\) random variable truncated to \((4.5,\infty)\): for \(y>4.5\), \[\begin{align*} g(y)=\frac{e^{-y}}{P\left(\operatorname{Exp}(1)>4.5\right)}=\frac{e^{-y}}{e^{-4.5}}=e^{-(y-4.5)}. \end{align*}\]
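By the memoryless property of the exponential distribution, \(4.5+E\) with \(E\sim\operatorname{Exp}(1)\) has exactly this density, so simulating from \(g\) is immediate; a one-line R sketch:

```r
## Simulating from g: shift a standard exponential by 4.5.
y <- 4.5 + rexp(10)   # ten draws from the Exp(1) density truncated to (4.5, Inf)
```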
Now generate a sample of size \(n\), \(\{Y_i\}_{i=1}^n \overset{i.i.d.}\sim g\). Recalling \(f(x)=\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\), the importance sampling estimator of \(P(Z>4.5)={E}_f (\mathbf{1}_{(4.5,\infty)}(Z))\) becomes \[ \displaystyle\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{(4.5,\infty)}(Y_i) \frac{f(Y_i)}{g(Y_i)}= \displaystyle\frac{1}{n}\sum_{i=1}^n \frac{e^{-\frac{Y_i^2}{2}+Y_i-4.5}}{\sqrt{2\pi}}, \]
where \(\mathbf{1}_{(4.5,\infty)}(Y_i)=1\) for every \(i\), since each \(Y_i \sim g\) lies in \((4.5,\infty)\).
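A minimal R sketch of this importance sampling estimator (\(n=10^4\) draws, for illustration):

```r
## Importance sampling estimate of P(Z > 4.5) with g = Exp(1) truncated to (4.5, Inf).
set.seed(1)
n <- 1e4
y <- 4.5 + rexp(n)                          # Y_1, ..., Y_n i.i.d. from g
w <- dnorm(y) / exp(-(y - 4.5))             # importance weights f(Y_i) / g(Y_i)
mean(w)                                     # importance sampling estimate
pnorm(4.5, lower.tail = FALSE)              # true value, about 3.4e-6
```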
Remark: we need \(supp(g)\supseteq supp(f)\) to make sure the importance weight \(\frac{f(X_i)}{g(X_i)}\) is well defined. Actually, the weaker condition \(supp(g)\supseteq supp(h \times f)\) suffices, since the estimator only involves the ratio \(h(X_i)f(X_i)/g(X_i)\).
Variance reduction: Let \(h^*(x):=h(x)f(x)/g(x)\). Note that \(E_g(h^*(X))=E_f(h(X))\), and \[\begin{equation*} \operatorname{Var}_g\left(h^*({X})\right) =\int \frac{h({x})^2 f({x})}{g({x})} f({x}) d {x}-\big(E_g(h^*(X))\big)^2. \end{equation*}\]
Since the two estimators have the same mean, the squared-mean terms cancel and \[\begin{equation*} \operatorname{Var}_f(h({X}))-\operatorname{Var}_g\left(h^*({X})\right)=\int h({x})^2\left(1-\frac{f({x})}{g({x})}\right) f({x}) d {x}. \end{equation*}\]
A variance reduction is achieved if: (1) \(f({x}) / g({x})>1\) for \(x\) where \(h({x})^2 f({x})\) is small; and (2) \(f({x}) / g({x})<1\) for \(x\) where \(h({x})^2 f({x})\) is large. In other words, an ideal \(g\) puts more mass where \(|h({x})| f({x})\) is large (nontrivial).
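For the tail-probability example above, the effect is dramatic. A rough R sketch comparing the spread of the two estimators over repeated simulations (the replication count and sample size are illustrative):

```r
## Comparing the variability of the naive and importance sampling estimators
## of P(Z > 4.5), over 200 replications of size n = 1e4 each.
set.seed(1)
n <- 1e4
reps <- 200
naive <- replicate(reps, mean(rnorm(n) > 4.5))
is_est <- replicate(reps, {
  y <- 4.5 + rexp(n)
  mean(dnorm(y) / exp(-(y - 4.5)))
})
c(sd(naive), sd(is_est))   # the importance sampling estimator is far less variable
```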
Self-normalized version of importance sampling
Again, we want to evaluate \({E}_f (h(X))\), but now \(f\) and \(g\) are known only up to normalizing constants. Say \(f=c_0 f_0\), where \(f_0\) is the unnormalized p.d.f. and \(c_0\) is the normalizing constant, and \(g=c_1 g_0\), where \(g_0\) is the unnormalized p.d.f. and \(c_1\) is the normalizing constant. Suppose we know \(f_0\) and \(g_0\), but possibly not \(c_0\) and \(c_1\), and we know how to generate a random sample from \(g\). Let \(w(x)=f_0(x)/g_0(x)\). Since \(\int w(x)h(x)g(x)\,dx=\frac{c_1}{c_0}\int h(x)f(x)\,dx\) and \(\int w(x)g(x)\,dx=\frac{c_1}{c_0}\), the unknown constants cancel in the ratio: \[\begin{align*} {E}_f (h(X))=\int h(x) f(x)\, dx=\frac{\int w(x)h(x)g(x)\,dx}{\int w(x)g(x)\,dx}= \frac{{E}_g(w(X)h(X))}{{E}_g(w(X))}. \end{align*}\]
We then generate a sample of size \(n\), \(\{X_i\}_{i=1}^n \overset{i.i.d.}\sim g\), and use \(\hat \mu = \frac{1}{n}\sum_{i=1}^n w(X_i)h(X_i)\big/\frac{1}{n}\sum_{i=1}^n w(X_i)\) to estimate \(\mu={E}_f (h(X))\). Here \(\hat\mu\) is called the self-normalized importance sampling estimator of \(\mu\). The expectation and variance of this estimator can be approximated along similar lines; as an alternative, in practice, one can estimate them using the bootstrap.
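A minimal R sketch of the self-normalized estimator; the target, importance density, and \(h\) below are illustrative choices (target known only up to a constant, \(g\) a standard Cauchy, \(h(x)=x^2\) with true value \(1\)):

```r
## Self-normalized importance sampling.
## Target known only up to a constant: f0(x) = exp(-x^2/2), so f is N(0, 1);
## importance density g = standard Cauchy; h(x) = x^2, with E_f[X^2] = 1.
set.seed(1)
n <- 1e5
x <- rcauchy(n)                       # sample from g
w <- exp(-x^2 / 2) / dcauchy(x)       # unnormalized weights f0(x) / g(x)
h <- x^2
mu_hat <- sum(w * h) / sum(w)         # self-normalized IS estimator
mu_hat                                # should be close to 1
```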
Suppose \(f\propto f_0\) and the importance p.d.f. \(g\propto g_0\) satisfies \(supp(f)\subseteq supp(g)\). Self-normalized importance sampling gives a collection of importance samples \(\Big(x_i , \frac{f_0(x_i)}{g_0(x_i)}\Big)_{i}\). It turns out the collection can be recycled, by multinomial resampling, into a sample that is approximately distributed from \(f\), as in the following steps (a code sketch follows the remarks below).
Step 1: Sample \(x_i\sim g\), obtain \(\big(x_i,\frac{f_0(x_i)}{g_0(x_i)}\big)_{i=1}^n.\)
Step 2: Compute importance weights \(w_i = \frac{f_0(x_i)}{g_0(x_i)}\) and normalize them: \[\hat{w}_i=\frac{w_i}{\sum_{j=1}^n w_j}.\] Notice that \(\hat{w}_i\in(0,1)\) and \(\sum_i\hat{w}_i=1.\)
Step 3: Draw a random sample of size \(m\) with replacement from \(x_1,\cdots,x_n\) with probabilities \(\hat{w}_1,\cdots,\hat{w}_n\): that is, for \(k=1, 2, \ldots, m\), \[\begin{equation*} X^*_k= \left\{ \begin{aligned} x_1 &\quad& \mbox{with prob}=\hat{w}_1\\ x_2 &\quad& \mbox{with prob}=\hat{w}_2\\ &\vdots&\\ x_n &\quad& \mbox{with prob}=\hat{w}_n. \end{aligned} \right. \end{equation*}\] It turns out that \((X_1^*,X_2^*,\cdots,X_m^*)\) is a random sample approximately from \(f\) if \(n\) is large.
Remark: for this sampling to work, \(n\) should be large and the target sample size \(m\) should be no more than \(10\%\) of \(n\).
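A minimal R sketch of Steps 1-3; the target \(f_0(x)=x^2(1-x)\) on \((0,1)\) (so \(f\) is Beta(3, 2)) and the Uniform(0,1) importance density are illustrative choices:

```r
## Sampling importance resampling with f0(x) = x^2 * (1 - x) on (0, 1), i.e. f = Beta(3, 2),
## and importance density g = Uniform(0, 1), so g0 = g = 1 on (0, 1).
set.seed(1)
n <- 1e5
x <- runif(n)                       # Step 1: sample from g
w <- x^2 * (1 - x)                  # Step 2: unnormalized weights f0(x) / g0(x)
w_hat <- w / sum(w)                 # normalized weights
m <- n / 10                         # keep m well below n (here 10% of n)
x_star <- sample(x, size = m, replace = TRUE, prob = w_hat)  # Step 3: multinomial resampling
## x_star is approximately a sample from Beta(3, 2):
c(mean(x_star), 3 / (3 + 2))        # compare the resampled mean with the true mean 0.6
```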
Remark: If \(f \propto f_0\) and \(g\) is a (normalized) p.d.f., a byproduct of the algorithm is an estimator of the normalizing constant of \(f_0\): since \(E_g\big(f_0(X)/g(X)\big)=\int f_0(x)\,dx = 1/c_0\), the average \((1/n)\times\sum_{i} (f_0(X_i) /g(X_i))\) with \(X_i \sim g\) estimates \(\int f_0(x)\,dx\), and \(c_0\) can then be estimated by its reciprocal.
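Continuing the illustrative Beta example above (with \(g\) the Uniform(0,1) density), the average unnormalized weight estimates \(\int_0^1 x^2(1-x)\,dx = 1/12\):

```r
## Estimating the normalizing constant of f0(x) = x^2 * (1 - x) on (0, 1),
## with importance density g = Uniform(0, 1).
set.seed(1)
x <- runif(1e5)
mean(x^2 * (1 - x))   # estimates the integral of f0; true value 1/12, about 0.0833
```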
Remark: Accept-reject sampling and self-normalized importance sampling may be used for multi-dimensional density functions. However, in practice, the performance can be unsatisfactory.
Provided and discussed in lab.