Statistical Simulation, Wei Li. Updated: 2024-10-31

Global strategies

Some algorithms have been developed as attempts to find the global optimum of a smooth target function; since maximizing \(h\) is equivalent to minimizing \(-h\), everything below is stated for maximization. We discuss two here: stochastic local search and simulated annealing. These two algorithms provide no absolute guarantee of finding the global optimum, but they are less likely to get stuck at local optima.

Simulated Annealing

In metallurgy, annealing refers to the process where a solid is heated to release its thermal stresses, and then cooled slowly to allow its crystals to arrange in a configuration that minimizes energy. The quality of the end result, i.e., the arrangement of the crystals, is dependent on the rate of cooling. Faster cooling may trap the material in a sub-optimal configuration, whereas slow cooling is more likely to result in a configuration closer to the global minimum energy state.

Consider maximizing \(h(\theta)\) (global maximization):

Assume \(h: \Theta \mapsto \mathbb{R}\) and \(\int_\Theta \exp\big(h(\theta)/T\big)\,d\theta<\infty\) for every \(T>0\), so that the transformation below has a finite normalizing constant.

Define the so-called Boltzmann-Gibbs transformation of \(h\) to be \[\pi_t(\theta) = \exp \Big(\frac{h(\theta)}{T_t}\Big),\] where \(T_t>0\) is the temperature at time \(t\).

One can treat \(\pi_t(\theta)\) as an unnormalized version (i.e., the kernel) of a density function for \(\theta\).

As \(T_t \downarrow 0\), this density kernel becomes more and more concentrated in a narrow neighborhood of the maximizer of \(h\).

Example: the following is the plot of \(\exp \left(\frac{h(\theta)}{T_t}\right)\), where \(h\) is the negative of the two-dimensional Rastrigin function, \(h(x)=-\big(2A+\sum_{i=1}^2(x_i^2-A \cos \left(2 \pi x_i\right))\big)\), with \(A=10\) and \(x_i \in[-5.12,5.12]\):
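As a rough numerical check of this concentration effect, here is a minimal sketch (assuming NumPy; the grid resolution, the three temperatures, and the 0.5 cutoff are illustrative choices, not from the lecture):

```python
import numpy as np

A = 10.0

def h(x1, x2):
    # negative two-dimensional Rastrigin function; global maximum 0 at (0, 0)
    return -(2 * A + (x1**2 - A * np.cos(2 * np.pi * x1))
                   + (x2**2 - A * np.cos(2 * np.pi * x2)))

grid = np.linspace(-5.12, 5.12, 257)
X1, X2 = np.meshgrid(grid, grid)

for T in [10.0, 1.0, 0.1]:
    kernel = np.exp(h(X1, X2) / T)            # unnormalized density pi_t
    near = (np.abs(X1) < 0.5) & (np.abs(X2) < 0.5)
    frac = kernel[near].sum() / kernel.sum()  # share of total mass near the maximizer
    print(f"T_t = {T:5.1f}: mass within |x_i| < 0.5 is {frac:.3f}")
```

As \(T_t\) drops from 10 to 0.1, essentially all of the mass should collect in the central basin around the maximizer.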

Algorithm (Metropolis-Hastings):

Suppose \(g(\cdot)\) is a symmetric density function, i.e., \(g(-x)=g(x)\), e.g., \(N(0,1)\).

  1. simulate \(\zeta^{(t)}\) from \(g\).

  2. generate \(\theta^{(t+1)}\) as \[\theta^{(t+1)}= \begin{cases} \theta^{(t)}+\zeta^{(t)} & \text{with probability } \rho=\min\big\{\exp\big(\frac{\Delta h}{T_t}\big),\,1\big\},\\ \theta^{(t)} & \text{with probability } 1-\rho, \end{cases}\] where \(\Delta h = h(\theta^{(t)}+\zeta^{(t)})-h(\theta^{(t)})\).

  3. reduce the temperature from \(T_t\) to \(T_{t+1}<T_t\) according to a cooling schedule. For example, one can let \(T_t = 1/\log(1+t)\) or \(T_t = (1/1.2)^t.\)
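The following is a minimal Python sketch of the three steps above (not the lab code; the proposal scale `sigma`, the iteration count, and the choice of the logarithmic cooling schedule are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(h, theta0, n_iter=50_000, sigma=0.5):
    theta = np.asarray(theta0, dtype=float)
    h_cur = h(theta)
    best, best_h = theta.copy(), h_cur
    for t in range(1, n_iter + 1):
        T = 1.0 / np.log(1.0 + t)                        # cooling schedule T_t = 1/log(1+t)
        zeta = rng.normal(0.0, sigma, size=theta.shape)  # step 1: draw zeta ~ g
        h_new = h(theta + zeta)
        dh = h_new - h_cur                               # step 2: Delta h
        if np.log(rng.uniform()) <= dh / T:              # accept w.p. min{exp(dh/T_t), 1}
            theta, h_cur = theta + zeta, h_new
            if h_cur > best_h:                           # track the best point visited
                best, best_h = theta.copy(), h_cur
    return best, best_h

def neg_rastrigin(x, A=10.0):
    # negative two-dimensional Rastrigin function from the example above
    return -(2 * A + np.sum(x**2 - A * np.cos(2 * np.pi * x)))

theta_hat, h_hat = simulated_annealing(neg_rastrigin, theta0=[4.0, -3.0])
print(theta_hat, h_hat)  # typically near (0, 0) with h near 0
```

Starting far from the origin, the run typically ends near the global maximizer \((0,0)\); a faster schedule such as \((1/1.2)^t\) cools more aggressively and, as with fast cooling in metallurgy, is more prone to trapping in a sub-optimal basin.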

Remarks:

  1. If the perturbation increases \(h\) (i.e., \(\Delta h\geq 0\)), the new proposal is automatically accepted. If \(\Delta h<0\), it is still accepted with a certain probability \(\rho<1\).

  2. When \(T_t\) is relatively large, many “bad” proposals (that is, those with \(\Delta h<0\)) may still be accepted, so a larger part of the space can be explored. As \(T_t \downarrow 0\), fewer and fewer “bad” proposals are allowed through.

  3. When \(T_t\) is sufficiently small, one essentially “anneals” the process by accepting only “good” proposals (i.e., those with \(\Delta h\geq 0\)).

  4. In step 2, one can instead sample \(u\sim \mathrm{Unif}(0, 1)\): if \(u \leq \exp(\Delta h/T_t)\), set \(\theta^{(t+1)} \leftarrow \theta^{(t)} + \zeta^{(t)}\); otherwise, set \(\theta^{(t+1)} \leftarrow \theta^{(t)}\).

  5. If \(T_t\) is fixed at some \(T>0\), the algorithm essentially draws a Markov chain whose stationary distribution is given by \(\pi(\theta) \propto \exp(h(\theta)/T)\), as illustrated in the sketch below.
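A minimal sketch of Remark 5 (the target \(h(\theta)=-\theta^2/2\) with fixed \(T=1\) is an illustrative choice, picked so that \(\exp(h(\theta)/T)\) is the standard normal kernel and the stationary distribution is easy to check):

```python
import numpy as np

rng = np.random.default_rng(1)

def h(theta):
    # h(theta) = -theta^2 / 2, so exp(h/T) with T = 1 is the N(0, 1) kernel
    return -0.5 * theta**2

T = 1.0
theta, samples = 0.0, []
for t in range(100_000):
    zeta = rng.normal(0.0, 1.0)          # symmetric proposal g = N(0, 1)
    dh = h(theta + zeta) - h(theta)
    if np.log(rng.uniform()) <= dh / T:  # fixed-temperature Metropolis acceptance
        theta = theta + zeta
    samples.append(theta)

samples = np.array(samples[10_000:])     # discard burn-in
print(samples.mean(), samples.std())     # should be close to 0 and 1
```

The printed mean and standard deviation should be close to 0 and 1, consistent with a \(N(0,1)\) stationary distribution.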

The following is the plot of the acceptance probability \(\rho(T_t)=\min\{\exp(\Delta h/T_t),\,1\}\) against \(T_t\), with \(T_t\) decreasing from 10 to 0.1 and \(\Delta h=-0.1\):
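The plotted values are easy to reproduce; a minimal sketch (the temperature grid is an illustrative choice):

```python
import numpy as np

dh = -0.1  # a fixed "bad" proposal, Delta h < 0
for T in [10.0, 5.0, 1.0, 0.5, 0.1]:
    rho = min(np.exp(dh / T), 1.0)  # acceptance probability at temperature T
    print(f"T_t = {T:4.1f}: rho = {rho:.4f}")
```

At \(T_t=10\) this bad proposal is accepted with probability about 0.99, while at \(T_t=0.1\) the probability drops to \(e^{-1}\approx 0.37\).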

Python Code Examples

Provided and discussed in lab.
