Updated: 2024-09-06 Statistical Simulation, Wei Li

Numerical method for optimization

For some smooth function \(f\), consider \(\min f(x) \rightarrow x^*\).

We’ll consider the gradient method, and Newton’s method (and its variants). Both methods aim to find a stationary point of \(f\) (i.e. a point \(x^*\) with \(\nabla f(x^*) = 0\)). The basic idea is

  1. start the process with some initial point \(x_0\);
  2. iterate steps \(x_k \rightarrow x_{k+1}\) by going downhill;
  3. repeat step 2 until the sequence of points converges to a stationary point.

For a general (non-convex) function, we run the procedure several times with different initial values \(x_0\), since different starting points may converge to different stationary points.

Gradient descent

Choose a direction \(p_k\) and search along that direction from the current iterate \(x_k\) for a new iterate with a lower function value \(f(x_{k+1}) < f(x_k)\). Consider \(x_k+\alpha p_k\) (where the scalar \(\alpha>0\) is the “step length” or “learning rate”) as the next iterate.

Fix \(\alpha\): the first-order Taylor approximation of \(f\) at \(x_k\) is given by \[f(x_k +\alpha p) \approx f(x_k)+\alpha p^\top\nabla f(x_k):=T(x_k, \alpha p).\] To solve for the direction \(p\), consider \(\min_{||p||=1} p^\top \nabla f(x_k)\), whose solution gives us the unit direction of most rapid decrease in \(f\).

Since \(p^\top\nabla f(x_k) = ||p|| \cdot ||\nabla f(x_k)|| \cos\theta\) with \(0\leq \theta \leq \pi\), the minimum is attained at \(\cos\theta = -1\) \(\Rightarrow\) \(p\) is the exact opposite direction of \(\nabla f(x_k)\), \(p=\frac{-\nabla f(x_k)}{||\nabla f(x_k)||}.\)

Note: any \(p_k\) such that \(p_k^\top \nabla f(x_k)<0\) would work; such a \(p_k\) is called a “descent direction”.

Algorithm

Gradient descent (fix \(\alpha\)), set \(k = 0\), given \(x_0\).

Repeat

  • \(x_{k+1} \leftarrow x_k - \alpha \nabla f(x_k)\)
  • \(k \leftarrow k+1\)

Until a stopping condition is met (e.g., \(||\nabla f(x_k)|| \le \epsilon\) for a small tolerance \(\epsilon\)).
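The fixed-step algorithm above can be sketched in a few lines of NumPy. The quadratic test function and the specific values of \(\alpha\) and the tolerance are illustrative assumptions, not part of the notes.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stopping condition: ||grad f(x_k)|| small
            break
        x = x - alpha * g
    return x

# Toy example (assumed): f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimizer (1, -3)
grad_f = lambda x: np.array([2 * (x[0] - 1.0), 4 * (x[1] + 3.0)])
x_star = gradient_descent(grad_f, x0=[0.0, 0.0])
```

In practice the fixed \(\alpha\) must be small enough for the iterates to contract; a line search (e.g. under the Wolfe conditions mentioned later) removes this tuning burden.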

Newton’s Method

Take a second-order Taylor expansion of \(f\) at \(x_k\) in the direction given by \(\alpha p\): \[T(x_k,\alpha p):=f(x_k)+\alpha p^\top\nabla f(x_k)+\alpha^2p^\top\nabla^2f(x_k)p/2.\] We travel to a stationary point of this quadratic approximation by solving for a minimizer \(p\) (a minimizer exists when \(\nabla^2f(x_k)\) is positive definite) and move \(x_{k+1} \leftarrow x_k +\alpha p\).

To see this, set \(\alpha \equiv 1\); set the gradient of \(T(x_k,\alpha p)\) to zero and solve for \(p\): \[\begin{align*} \text{FOC: } &\nabla_pT(x_k,\alpha p) \overset{set}{=} 0\\ &\nabla f(x_k) + \nabla^2f(x_k)p = 0\\ &\Rightarrow p = p_k = -\big[\nabla^2f(x_k)\big]^{-1} \cdot \nabla f(x_k), \text{ if} \ \nabla^2f(x_k) \ \text{ is invertible}\\ &x_{k+1} \leftarrow x_k - \big[\nabla^2f(x_k)\big]^{-1} \cdot \nabla f(x_k). \end{align*}\]

If \(\nabla^2f(x_k)\) is not invertible, you can use the Moore–Penrose pseudoinverse \(\big[\nabla^2f(x_k)\big]^\dagger\). An alternative way is to solve the linear system: \[\begin{align*} &\nabla f(x_k) + \nabla^2 f(x_k)(x_{k+1} - x_k) = 0\\ \Leftrightarrow &\nabla^2f(x_k)x_{k+1} = \nabla^2f(x_k) x_k - \nabla f(x_k)\\ \Rightarrow &\text{solve for } x_{k+1}.\\ \end{align*}\]

Remark: Newton’s method produces a sequence of points \(x_1, x_2, \ldots\) by repeatedly minimizing a quadratic approximation to \(f\) centered at the current iterate \(x_k\). Because the quadratic approximation mimics the objective function more closely, Newton’s method is often more effective than the gradient method. However, the reliance on the quadratic approximation makes Newton’s method more difficult to use, especially for non-convex functions: \(\nabla^2f(x_k)\) may not be invertible, in which case one can use \(\big[ \nabla^2f(x_k) \big]^\dagger\) or solve for \(x_{k+1}\) from the linear system above. However, neither approach guarantees that \(p_k^\top \nabla f(x_k) < 0.\)
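A minimal sketch of the Newton iteration, solving the linear system \(\nabla^2 f(x_k)\,p = -\nabla f(x_k)\) rather than forming the inverse explicitly (the numerically preferred route). The test function and its hand-coded gradient/Hessian are illustrative assumptions.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton's method: solve hess(x_k) p = -grad(x_k) for the step p."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(hess(x), -g)   # Newton direction, no explicit inverse
        x = x + p
    return x

# Toy example (assumed): f(x) = x1^4 + x2^2, minimizer (0, 0)
grad_f = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess_f = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
x_star = newton(grad_f, hess_f, x0=[1.0, 1.0])
```

Solving the system (`np.linalg.solve`) costs the same factorization as inverting but is more stable, and it is exactly the "alternative way" given above.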

Quasi Newton Method

\[\begin{align*} & p_k = -\big[\nabla^2f(x_k)\big]^{-1} \cdot \nabla f(x_k) \qquad \text{Newton method}\\ & p_k = -B_k^{-1} \nabla f(x_k) \qquad \text{quasi-Newton method} \end{align*}\] Here, \(B_k\) is some approximation to \(\nabla^2f(x_k)\). Note that \[\begin{align*} & \nabla f(x_{k+1}) = \nabla f(x_k) + \nabla^2f(x_k)(x_{k+1} - x_k) + o(||x_{k+1} - x_k||)\\ & \Rightarrow \nabla^2f(x_k)(x_{k+1} - x_k) \approx \nabla f(x_{k+1}) - \nabla f(x_k). \end{align*}\]

\(B_k\) is required to satisfy: (1) \(B_k\) is symmetric; (2) \(B_k(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)\) (the secant equation).

Let \(s_k= x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\). If \(B_0 >0\) and \(s_k^\top y_k > 0\) (curvature condition), the BFGS update produces a positive definite approximation to the Hessian. The curvature condition is satisfied if the Wolfe conditions are imposed on the line search for the step length \(\alpha_k\) (see below).

BFGS: The idea of BFGS method is to update \(B_k\) using \(B_{k+1}=B_k+\alpha {u} {u}^{\top}+\beta {v} {v}^{\top}\). Imposing the secant condition, \(B_{k+1} {s}_k={y}_k\). Choosing \({u}={y}_k\) and \({v}=B_k {s}_k\), we can solve for \(\alpha, \beta\): \[\begin{align*} \alpha=\frac{1}{{y}_k^\top {s}_k}, \qquad \beta=-\frac{1}{{s}_k^\top B_k {s}_k} . \end{align*}\] Then substitute \(\alpha\) and \(\beta\) back into \(B_{k+1}=B_k+\alpha {u} {u}^{\top}+\beta {v} {v}^{\top}\) and get the update equation of \(B_{k+1}\) : \[\begin{align*} B_{k+1} = B_k - \frac{B_ks_ks_k^\top B_k}{s_k^\top B_ks_k} + \frac{y_ky_k^\top}{y_k^\top s_k}.\\ \end{align*}\]
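The update equation above is a one-liner in NumPy; by construction \(B_{k+1}\) satisfies the secant equation \(B_{k+1}s_k = y_k\), which the snippet checks on random data (the starting matrix \(B_k = I\) and the random vectors are illustrative assumptions).

```python
import numpy as np

def bfgs_update(B, s, y):
    """Rank-two BFGS update of the Hessian approximation B_k (needs s^T y > 0)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Sanity check of the secant equation B_{k+1} s_k = y_k on random data
rng = np.random.default_rng(0)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)   # keep s^T y > 0 (curvature condition)
B_next = bfgs_update(np.eye(3), s, y)
```

Note that the update also preserves symmetry: both rank-one terms \(uu^\top\) and \(vv^\top\) are symmetric.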

Alternative version of BFGS

Directly update the inverse of the “Hessian” \(B_k\): \(p_k= -\tilde{B}_k \nabla f(x_k)\) where \(\tilde{B}_k = B_k^{-1}\): \[\begin{align*} \tilde{B}_{k+1}=\left({I}-\frac{{s}_k {y}_k^{\top}}{{y}_k^{\top} {s}_k}\right) \tilde{B}_k\left({I}-\frac{{y}_k {s}_k^{\top}}{{y}_k^{\top} {s}_k}\right)+\frac{{s}_k {s}_k^{\top}}{{y}_k^{\top}{s}_k}. \end{align*}\]
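A sketch of the inverse form, which avoids solving a linear system at each step since \(p_k = -\tilde{B}_k\nabla f(x_k)\) is just a matrix-vector product. The inverse secant equation \(\tilde{B}_{k+1}y_k = s_k\) holds exactly by construction; the starting matrix and random vectors below are illustrative assumptions.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Update the inverse-Hessian approximation H = B_k^{-1} directly."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    # (I - s y^T / y^T s) H (I - y s^T / y^T s) + s s^T / y^T s
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

# Sanity check of the inverse secant equation H_{k+1} y_k = s_k
rng = np.random.default_rng(1)
s = rng.standard_normal(4)
y = s + 0.1 * rng.standard_normal(4)   # keep y^T s > 0
H_next = bfgs_inverse_update(np.eye(4), s, y)
```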

Maximization Problem

Note: maximizing \(f(x)\) is equivalent to minimizing \(-f(x)\) (same optimizer \(x^*\)).

Newton method \[ x_{k+1} \leftarrow x_k - \big[\nabla^2f(x_k)\big]^{-1} \cdot \nabla f(x_k) \quad \text{ note: } \nabla^2f(x_k) \text{ is n.d. } \] gradient ascent method \[ x_{k+1} \leftarrow x_k - \alpha\,(-\nabla f(x_k)) = x_k + \alpha \nabla f(x_k). \]
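Gradient ascent is just gradient descent applied to \(-f\): flip the sign of the step. A minimal sketch with an assumed concave toy function:

```python
import numpy as np

def gradient_ascent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Gradient ascent for max f: step along +grad (descent on -f)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x + alpha * g   # uphill step
    return x

# Toy example (assumed): f(x) = -(x - 2)^2, maximizer x* = 2
x_star = gradient_ascent(lambda x: np.array([-2 * (x[0] - 2.0)]), x0=[0.0])
```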