Updated: 2024-09-24 Statistical Simulation, Wei Li
Data: \(\left\{x_{i}, y_{i}\right\}_{i=1}^{n}\).
Goal: find \(w\) that solves \[\underset{w}{\operatorname{minimize}}\; L(w)=\frac{1}{n}\sum_{i=1}^{n}\ell\left(y_{i},f\left(x_{i}\right)\right),\] where \(f(x_i)=f(x_i;w)\), i.e., \(f\) is parametrized by some parameters \(w\). An important special case is the squared loss, \(\ell\left(y_{i},f\left(x_{i}\right)\right)=\left(y_{i}-f\left(x_{i}\right)\right)^{2}\).
Consider an iterative method to find the minimizer \(w^*\): \[w_{new}=w_{old}+\alpha \Delta,\] for some direction \(\Delta\) and some fixed positive scalar (step size) \(\alpha\).
By Taylor expansion, note that \[L(w+\alpha \Delta)=L(w)+\alpha \Delta^\top \nabla L(w)+ \cdots.\] For small \(\alpha\), the first-order term dominates, so the requirement that \(L(w+\alpha \Delta)<L(w)\) implies \(\Delta^\top \nabla L(w)<0\).
Let \(\beta\) be the angle between \(\Delta\) and \(\nabla L(w)\). By a standard linear-algebra fact, \[-1 \leq \cos(\beta)= \frac{ \Delta^\top \nabla L(w)}{ \|\Delta\| \|\nabla L(w)\|}\leq 1.\] Therefore, \(L(w+\alpha \Delta)-L(w)\approx\alpha \Delta^\top \nabla L(w)= \alpha \|\Delta\| \|\nabla L(w)\|\cos(\beta)\) becomes smallest (most negative) when \(\cos(\beta)=-1\), i.e., \(\beta=180^{\circ}.\)
Therefore, the best direction \(\Delta\) is the opposite direction of the gradient \(\nabla L(w)\). The solution \(\Delta = -\nabla L(w)\) is the steepest descent direction.
X, Y: data matrices
def gradient_descent():
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):   # accumulate the gradient across observations
            dw += grad_w(w, x, y)
        w = w - alpha * dw       # one parameter update per pass over the data
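The pseudocode assumes a function grad_w(w, x, y) returning the per-observation gradient \(\nabla_w \ell(y, f(x;w))\). As a minimal sketch, assuming the squared loss above together with a linear model \(f(x;w)=w^\top x\) (the model and the names below are illustrative, not specified in the notes), grad_w could be:

import numpy as np

def f(x, w):
    # hypothetical linear model f(x; w) = w^T x
    return np.dot(w, x)

def grad_w(w, x, y):
    # gradient of the squared loss (y - w^T x)^2 with respect to w:
    # d/dw (y - w^T x)^2 = -2 (y - w^T x) x
    return -2.0 * (y - f(x, w)) * x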
One major problem with gradient descent is that the parameter updates become very slow in flat regions, where the gradients are small.
Gradient descent with momentum instead updates a velocity (momentum) term using the gradient vector, and then updates the parameters (locations) using the velocity: \[\begin{align*} {v}_{t} &=\gamma {v}_{t-1}+\alpha \nabla L(w_{t}), \qquad \gamma>0, \alpha>0 \\ w_{t+1} &=w_{t}-{v}_{t}. \end{align*}\]
Note that, if \(\gamma=0\), we recover gradient descent. When \(\gamma\neq 0\) (setting \(v_0=0\)), one gets \[\begin{align*} {v}_{t}=\gamma {v}_{t-1}+\alpha \nabla L(w_{t})=\gamma^{t-1} \alpha \nabla L(w_{1})+\gamma^{t-2} \alpha \nabla L(w_{2})+\ldots+\gamma \alpha \nabla L(w_{t-1})+\alpha \nabla L(w_{t}). \end{align*}\] Indeed, the current velocity (or momentum) depends on the history of the gradients, hence the name “with momentum”.
With the momentum update, the parameter vector builds up velocity in any direction that has a consistent gradient. The velocity is effectively an exponentially decaying moving average of the gradients, as shown in \[v_t=\alpha \sum_{j=1}^t \gamma^{t-j} \nabla L(w_j).\] The weight for \(\nabla L(w_j)\) is \(\gamma^{t-j}\), a factor that decays exponentially the further in the past the gradient is (i.e., as \(j\) decreases). So \(\gamma\) is the exponential decay rate: the larger \(\gamma\), the slower the decay.
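For example, with \(\gamma=0.9\) and \(t=3\), \[v_3=\alpha\left(0.9^2\,\nabla L(w_1)+0.9\,\nabla L(w_2)+\nabla L(w_3)\right)=\alpha\left(0.81\,\nabla L(w_1)+0.9\,\nabla L(w_2)+\nabla L(w_3)\right),\] so older gradients contribute with geometrically smaller weights.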
X, Y: data matrices
def momentum_gradient_descent():
    prev_vw = 0
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):   # accumulate the gradient across observations
            dw += grad_w(w, x, y)
        v_w = gamma * prev_vw + alpha * dw   # update the velocity
        w = w - v_w                          # update the parameters
        prev_vw = v_w
An equivalent algorithm is: \[\begin{align*} {v}_{t} &=\gamma {v}_{t-1} - \alpha \nabla L(w_{t}), \qquad \gamma>0, \alpha>0 \\ w_{t+1} &=w_{t}+{v}_{t}. \end{align*}\]
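To see the equivalence, let \(\tilde{v}_t=-v_t\) in the original formulation: then \(\tilde{v}_{t}=\gamma \tilde{v}_{t-1}-\alpha \nabla L(w_{t})\) and \(w_{t+1}=w_{t}-v_t=w_{t}+\tilde{v}_{t}\), which is exactly the form above; only the sign convention for the velocity changes.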
There is a potential issue with momentum. Even in regions having gentle slopes, momentum-based gradient descent is able to take large steps because the momentum carries it along. But is moving fast always good? Would there be a situation where momentum causes us to run past our goal? In some situations, momentum-based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley. We want to do something to reduce these oscillations.
We know that we will use our momentum term \(\gamma v_{t-1}\) to move the parameters \(w_t\). To attempt to decrease the effect of momentum preemptively, we compute \(w_t-\gamma v_{t-1}\) which gives us an approximation of the next position of the parameters. We can now effectively “look ahead” by calculating the gradient evaluated not at our current parameters \(w_t\) but at the approximate future position of our parameters.
Nesterov Momentum: It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.
\[\begin{align*} w_{\text{look ahead},t} &=w_{t}-\gamma {v}_{t-1} \\ {v}_{t} &=\gamma {v}_{t-1}+\alpha \nabla L(w_{\text{look ahead},t}) \\ w_{t+1} &=w_{t}-v_{t} \end{align*}\]
X, Y: data matrices
def nesterov_gradient_descent():
    prev_vw = 0
    for b in range(max_epochs):
        dw = 0
        v_w = gamma * prev_vw                 # partial update (look-ahead)
        for x, y in zip(X, Y):                # accumulate the gradient across observations
            dw += grad_w(w - v_w, x, y)       # gradient at the look-ahead position
        v_w = gamma * prev_vw + alpha * dw    # full update of the velocity
        w = w - v_w
        prev_vw = v_w
An alternative form is to store only the look-ahead version of \(w\), and update this version: \[\begin{align*} {v}_{t} &=\gamma {v}_{t-1}+\alpha \nabla L(w_{\text{look ahead}, t}), \\ w_{\text{look ahead}, t+1} &=w_{\text{look ahead}, t}+\gamma v_{t-1}-(1+\gamma){v}_{t}. \end{align*}\]
Looking ahead helps correct the descent course more quickly than momentum-based gradient descent. Hence the oscillations are smaller, and the chances of escaping the minima valley are also smaller.
An equivalent algorithm is: \[\begin{align*} w_{\text{look ahead},t} &=w_{t} + \gamma {v}_{t-1}, \\ {v}_{t} &=\gamma {v}_{t-1} - \alpha \nabla L(w_{\text{look ahead},t}), \\ w_{t+1} &=w_{t} + v_{t}. \end{align*}\]
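A minimal sketch of the alternative form above that stores only the look-ahead version of \(w\) is given below; it reuses the same grad_w, alpha, gamma, and max_epochs as the earlier pseudocode, and the variable names (e.g., w_la) are illustrative.

X, Y: data matrices
def nesterov_lookahead_only():
    w_la = w        # look-ahead parameters; with v_0 = 0 they start at the initial w
    prev_vw = 0
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):                # gradient at the look-ahead position
            dw += grad_w(w_la, x, y)
        v_w = gamma * prev_vw + alpha * dw    # velocity update
        w_la = w_la + gamma * prev_vw - (1 + gamma) * v_w   # look-ahead update
        prev_vw = v_w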
Adagrad:
Note that momentum and Nesterov momentum update all components of \(w\) with the same learning rate \(\alpha\). Adagrad, or the adaptive subgradient method, adapts a learning rate for each component of \(w\): the learning rate of each component is divided by a component-wise discount factor, namely the square root of the cumulative sum of that component's squared gradients.
Its update rule is (interpreted element-wise): \[\begin{align*} s_{t} &=s_{t-1}+\left(\nabla L(w_{t})\right)^{2}, \\ w_{t+1} &=w_{t}-\frac{\alpha}{\sqrt{s_{t}+\epsilon}} \nabla L( w_{t}), \quad \epsilon>0, \end{align*}\] or, setting \(s_{0}=0\), for the \(j\)-th component, \[\begin{align*} s_{t,j}&=\sum_{\tau=1}^t \left(\nabla_j L(w_{\tau})\right)^{2},\\ w_{t+1,j} &=w_{t,j}-\frac{\alpha}{\sqrt{s_{t,j}+\epsilon}} \nabla_j L( w_{t}). \end{align*}\]
Adagrad is particularly effective when the gradient is sparse (only a subset of features are informative). It decreases the impact of parameters that have consistently large gradients, allowing parameters that are updated less frequently to have more influence. However, Adagrad has a downside: the components of its squared-gradient accumulator \(s\) may grow too quickly. As a result, over time, parameters that are updated frequently may suffer from a vanishingly small effective learning rate before the model fully converges. Also, Adagrad does not have a mechanism to “forget” past gradients, so it may become overly conservative in its updates.
X, Y: data matrices
def adagrad():
    s_w = 0
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):        # accumulate the gradient across all observations
            dw += grad_w(w, x, y)
        s_w = s_w + dw**2             # accumulate squared gradients (element-wise)
        w = w - (alpha / sqrt(s_w + eps)) * dw
RMSProp:
RMSProp (Root Mean Square Propagation) builds upon Adagrad to avoid the rapid, one-directional decay of the effective learning rate. It accomplishes this by normalizing the update with an exponentially decaying moving average of past squared gradients.
The update rule is \[\begin{align*} s_{t} &=\beta s_{t-1}+(1-\beta)\left(\nabla L(w_{t})\right)^{2} \\ w_{t+1} &=w_{t}-\frac{\alpha}{\sqrt{s_{t}+\epsilon}} \nabla L( w_{t}) \end{align*}\]
Note that, setting \(s_0=0\), we obtain \[s_t=(1-\beta)\sum_{i=1}^t \beta^{t-i} (\nabla L(w_i))^2,\] so the weight for \((\nabla L(w_i))^2\) is \((1-\beta)\beta^{t-i}\), a factor that decays exponentially the further in the past the gradient is (i.e., as \(i\) decreases). So \(\beta\) is the exponential decay rate: the larger \(\beta\), the slower the decay of the weights, and the greater the effect the past gradients have on \(s_t\).
An alternative update rule is \(w_{t+1} =w_{t}-\frac{\alpha}{\sqrt{s_{t}}+\epsilon} \nabla L( w_{t})\), where \(\sqrt{s_t}\) is the square root of the exponentially decaying moving average of squared gradients (hence the name RMS).
The value of \(\epsilon\) is set at \(10^{-8}\), so that we do not get a divide-by-zero situation. Some references suggest a value of \(0.9\) for \(\beta\) and a value of \(0.001\) for the learning rate \(\alpha\).
X, Y: data matrices
def rmsprop():
    s_w = 0
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):        # accumulate the gradient across all observations
            dw += grad_w(w, x, y)
        s_w = beta * s_w + (1 - beta) * dw**2   # decaying average of squared gradients
        w = w - (alpha / sqrt(s_w + eps)) * dw
Adam:
Adam, the adaptive moment estimation method, stores both an exponentially decaying average of squared gradients, as in RMSProp, and an exponentially decaying average of gradients, as in momentum. Some references describe Adam as combining the advantages of both AdaGrad and RMSProp.
\[\begin{align*} v_{t} &=\beta_{1} v_{t-1}+\left(1-\beta_{1}\right) \nabla L( w_{t}) \text{ : (biased) decaying momentum}\\ s_{t} &=\beta_{2} s_{t-1}+\left(1-\beta_{2}\right) \left(\nabla L( w_{t})\right)^{2} \text{ : (biased) decaying sq. gradients} \\ \hat{v}_{t} &=\frac{v_{t}}{1-\beta_{1}^{t}}, \quad \hat{s}_{t}=\frac{s_{t}}{1-\beta_{2}^{t}} \text{ : bias-corrected momentum and sq. gradients} \\ w_{t+1} &=w_{t}-\frac{\alpha}{\sqrt{\hat{s}_{t}+\epsilon}} \hat{v}_{t} \text{ : parameter update} \end{align*}\]
X, Y: data matrices
def adam():
    v_w, s_w = 0, 0
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):        # accumulate the gradient across all observations
            dw += grad_w(w, x, y)
        v_w = beta1 * v_w + (1 - beta1) * dw        # decaying momentum
        s_w = beta2 * s_w + (1 - beta2) * dw**2     # decaying squared gradients
        v_w_hat = v_w / (1 - pow(beta1, b + 1))     # bias correction
        s_w_hat = s_w / (1 - pow(beta2, b + 1))     # bias correction
        w = w - (alpha / sqrt(s_w_hat + eps)) * v_w_hat
Default choices \(\beta_{1}=0.9, \beta_{2}=0.999\), \(\epsilon=10^{-8}\), and \(\alpha=0.001\) or \(0.0001\) seem to work well in many circumstances.
Why do we need the bias-corrected terms \(\hat{v}_t\) and \(\hat{s}_t\)?
The purpose of the bias-correction term \(1-\beta_1^t\) is to counteract the bias introduced by the initialization at zero and the decay factor \(\beta_1\). When \(t\) is small, \(\beta_1^t\) is relatively large, making the term \(1-\beta_1^t\) small. As \(t\) increases, \(\beta_1^t\) approaches zero, making \(1-\beta_1^t\) approach 1. Therefore, the bias-correction term effectively scales up the estimated average gradient \(v_t\) at the beginning of training, where the bias is most significant, and gradually becomes less impactful as training progresses.
Note that we are taking a running average of the gradients as \(v_{t}\). The reason we are doing this is that we don’t want to rely too much on the current gradient and instead rely on the overall behaviour of the gradients over many timesteps. One way of looking at this is that we are interested in the expected value of the gradients and not on a single point estimate computed at time \(t\). However, instead of computing \(E(\nabla L(w_{t}))\) we are computing \(v_{t}\) as the exponentially moving average. Ideally we would want \(E(v_{t})\) to be equal to \(E(\nabla L(w_{t}))\). It turns out that for stationary \(E(\nabla L(w_{t}))\), one can show that \(E(\hat{v}_{t})=E(\nabla L(w_{t})).\)
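To see why dividing by \(1-\beta_1^t\) removes the bias, set \(v_0=0\) and unroll the recursion, then take expectations assuming the gradients are stationary (i.e., \(E(\nabla L(w_j))=E(\nabla L(w_t))\) for all \(j\le t\)): \[v_t=(1-\beta_1)\sum_{j=1}^{t}\beta_1^{t-j}\,\nabla L(w_j) \;\Longrightarrow\; E(v_t)=(1-\beta_1)\sum_{j=1}^{t}\beta_1^{t-j}\,E(\nabla L(w_t))=(1-\beta_1^{t})\,E(\nabla L(w_t)),\] so \(E(\hat{v}_t)=E(v_t)/(1-\beta_1^t)=E(\nabla L(w_t))\). The same argument applies to \(s_t\) with \(\beta_2\).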
Comments (KM (2022), p. 295): for adaptive learning-rate methods, the base learning rate is still needed. It has been shown that methods based on exponentially decaying moving averages may fail to converge even on convex problems.
Note that our objective function has the form \(L(w)=E_{\mathcal{P}_n}(\ell(Y, f(X; w)))\), where \(\mathcal{P}_n\) is the empirical distribution of the \((X, Y)\)'s; the population version of \(L(w)\) is \(E(\ell(Y, f(X; w)))\). In principle, for the descent algorithm, any gradient \(g\) such that \(E(g)=E(\nabla\ell(Y, f(X; w)))\) would work. Obviously, \(\nabla L(w)\) is one example: \[ w_{t+1} = w_t - \alpha \nabla L(w_t)=w_t - \alpha\frac{1}{n}\sum_{i=1}^n \nabla\ell (y_i, f(x_i ;w_t)).\] Note that computing \(\nabla L(w)\) sums over the entire data set once before updating the parameters. This is the vanilla gradient descent, or (batch) gradient descent, since we calculate the gradient using the whole dataset to perform just one update.
The advantage is that the gradient is calculated accurately. The disadvantage is that when the data size is large (as in deep learning), each parameter update is very slow.
More generally, in each step, instead of using \(w_{t+1} = w_t - \alpha \nabla L(w_t)\), \(\nabla L(w_t)\) may be replaced with the gradient at one data point at a time as we update \[ w_{t+1} = w_t - \alpha \nabla \ell(y_i, f(x_i; w_t)), \qquad \text{ for } (y_i, x_i) \sim \mathcal{P}_n. \]
X, Y: data matrices
def stochastic_gradient_descent():
    for b in range(max_epochs):
        for x, y in zip(X, Y):         # one update per observation
            dw = grad_w(w, x, y)       # gradient at a single data point (replaces dw)
            w = w - alpha * dw
This is often called stochastic gradient descent or online learning, although the term “stochastic gradient descent” can be more broadly defined (to be discussed in a separate note).
X, Y: data matrices
def stochastic_momentum_gradient_descent():
    prev_vw = 0
    for b in range(max_epochs):
        for x, y in zip(X, Y):         # one update per observation
            dw = grad_w(w, x, y)                  # gradient at a single data point
            v_w = gamma * prev_vw + alpha * dw    # update the velocity
            w = w - v_w
            prev_vw = v_w
X, Y: data matrices
def stochastic_nesterov_gradient_descent():
    prev_vw = 0
    for b in range(max_epochs):
        for x, y in zip(X, Y):         # one update per observation
            v_w = gamma * prev_vw                 # partial update (look-ahead)
            dw = grad_w(w - v_w, x, y)            # gradient at the look-ahead position
            v_w = gamma * prev_vw + alpha * dw    # full update of the velocity
            w = w - v_w
            prev_vw = v_w
The drawback with this approach is that for each update the gradient is not accurately estimated. A compromise is mini-batch gradient descent, which uses the gradient \[\frac{1}{|B_t|}\sum_{i\in B_t} \nabla \ell(y_i, f(x_i; w_t)),\] where \(B_t\) is a set of randomly chosen observations at iteration \(t\).
In general, we choose a fixed \(B\), called the batch size.
In the mini-batch gradient descent method, the parameters are updated based only on the current mini-batch and continue iterating over all the mini-batches till we have seen the entire data set. The process of iterating through the whole dataset is referred to as an epoch. We often will have to loop over the whole dataset for multiple times, i.e., for multiple epochs.
Terminologies: 1 epoch = 1 pass over all data; 1 step = 1 update of parameters.
\[\begin{align*} \begin{array}{cc} \text { Algorithm } & \# \text { of steps per epoch } \\ \hline \text { (Batch) Gradient Descent } & 1 \\ \text { Stochastic Gradient Descent } & \mathrm{n} \\ \text { Mini-Batch Gradient Descent } & \frac{n}{B} \\ \hline \end{array} \end{align*}\]
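For example, with \(n=10{,}000\) observations and batch size \(B=100\), one epoch consists of \(1\) step for (batch) gradient descent, \(10{,}000\) steps for stochastic gradient descent, and \(10{,}000/100=100\) steps for mini-batch gradient descent.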
With mini-batch gradient descent, we will loop over the mini-batches instead of looping over individual examples.
The algorithm makes one update on the parameters after it sees \(B\) data points.
X, Y: data matrices
def minibatch_gradient_descent():
    minibatch_size, num_points_seen = B, 0   # B: the chosen batch size
    for b in range(max_epochs):
        dw = 0
        for x, y in zip(X, Y):               # accumulate the gradient across observations
            dw += grad_w(w, x, y)
            num_points_seen += 1
            if num_points_seen % minibatch_size == 0:
                # seen one mini-batch
                w = w - alpha * dw
                dw = 0                       # reset gradients
Oftentimes there are two stages involved in building the mini-batches: first, shuffle the data at the start of each epoch; second, partition the shuffled data into consecutive blocks of size minibatch_size, as in the pseudocode below.
In training a neural network with a large data set, the commonly used mini-batch sizes are between 50 and 256.
X, Y: data matrices
def minibatch_gradient_descent():
    for b in range(max_epochs):
        dw = 0
        # stage 1: shuffle the data
        combined = list(zip(X, Y))
        random.shuffle(combined)
        X[:], Y[:] = zip(*combined)
        # stage 2: create mini-batches
        for i in range(0, len(X), minibatch_size):
            X_mini_batch = X[i:i + minibatch_size]
            Y_mini_batch = Y[i:i + minibatch_size]
            # loop over points in the mini-batch
            for x, y in zip(X_mini_batch, Y_mini_batch):
                dw += grad_w(w, x, y)
            # update weights after processing one mini-batch
            w = w - alpha * dw
            dw = 0   # reset gradients
Note: the mini-batch versions of momentum and Nesterov gradient descent can be obtained similarly.
Learning rate: The choice of the learning rate is delicate and very influential on the convergence of the SGD algorithm. It is helpful to anneal (decay) the learning rate \(\alpha\) over time. If we decay it slowly, there is a chance that the iterates bounce around chaotically with little improvement; if we decay it very fast, the iterates may be unable to reach the best position. A sufficient condition for SGD to converge is the Robbins–Monro condition: \[\begin{align*} \sum_{t=1}^{\infty} \alpha_t=\infty \quad \text{and} \quad \sum_{t=1}^{\infty} \alpha_t^2<\infty, \end{align*}\] which in particular implies \(\frac{\sum_{t=1}^{T} \alpha_t^2}{\sum_{t=1}^{T} \alpha_t} \rightarrow 0\) as \(T \rightarrow \infty\).
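As a minimal sketch (the schedule and constants below are illustrative, not from the notes), the annealed rate \(\alpha_t=\alpha_0/t\) satisfies the Robbins–Monro condition and can be plugged into stochastic gradient descent as follows:

X, Y: data matrices
def sgd_with_decaying_learning_rate(w, alpha0=0.1):
    # alpha_t = alpha0 / t: sum(alpha_t) diverges while sum(alpha_t^2) is finite
    t = 1
    for b in range(max_epochs):
        for x, y in zip(X, Y):
            alpha_t = alpha0 / t          # annealed (decaying) learning rate
            w = w - alpha_t * grad_w(w, x, y)
            t += 1
    return w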
In machine learning, the separation of training and validation data is a common practice to ensure that machine learning models not only learn the patterns from the data but also generalize well to unseen data.
Training data is the actual dataset on which the machine learning model is trained. It consists of labeled examples, where the input features (data points) are paired with the corresponding target output (labels) for supervised learning tasks. The model uses this data to learn the relationship between inputs and outputs, identifying patterns and adjusting its internal parameters to minimize the error in its predictions.
However, if the model learns the training data too well, it may “overfit” by memorizing the data rather than learning generalizable patterns, which leads to poor performance on unseen data (test data). To assess the performance on unseen data, a portion of the original data can be set aside, called the validation data, to serve as the “test data”. The validation dataset is used to evaluate the model’s performance during or after training. Unlike the training data, the validation data is never used to train the model; it is used to assess how well the model generalizes to new data.
For instance, the original dataset is randomly split into two parts, \(\{(x_i, y_i)\}_{i=1}^{n_{\text{train}}}\) and \(\{(x_i, y_i)\}_{i=1}^{n_{\text{val}}}\). While we minimize over the training data \[\begin{align*} \underset{w}{\operatorname{minimize}} \frac{1}{n_{\text{train}}} \sum_{i=1}^{n_{\text{train}}} \ell\left(y_i, f\left(x_i; w\right)\right), \end{align*}\] we also evaluate the performance of \(w\), periodically during and after the training, by assessing the loss on the validation data \[\begin{align*} \frac{1}{n_{\text{val}}} \sum_{i=1}^{n_{\text{val}}} \ell\left(y_i, f\left(x_i; w\right)\right). \end{align*}\] The desired \(w\) is the one that minimizes the loss on the validation data.
X_train, Y_train: training data matrices
X_val, Y_val: validation data matrices
loss_function: function to calculate loss (e.g., MSE, cross-entropy)
def minibatch_gradient_descent():
    best_w = None                        # to store the best weights
    best_val_loss = float('inf')         # to store the lowest validation loss
    for epoch in range(max_epochs):
        dw = 0
        # shuffle the training data
        combined_train = list(zip(X_train, Y_train))
        random.shuffle(combined_train)
        X_train[:], Y_train[:] = zip(*combined_train)
        # mini-batch gradient descent
        for i in range(0, len(X_train), minibatch_size):
            X_mini_batch = X_train[i:i + minibatch_size]
            Y_mini_batch = Y_train[i:i + minibatch_size]
            # calculate gradients for the mini-batch
            for x, y in zip(X_mini_batch, Y_mini_batch):
                dw += grad_w(w, x, y)
            # update weights after processing one mini-batch
            w = w - alpha * dw
            dw = 0                       # reset gradients after each mini-batch
        # evaluate on the validation set after each epoch
        val_loss = 0
        for x_val, y_val in zip(X_val, Y_val):
            val_loss += loss_function(w, x_val, y_val)
        val_loss /= len(X_val)           # average loss over the validation set
        # check if the validation loss improves
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_w = w                   # store the best weights
        # (optional) early stopping condition: stop if the validation loss increases
        elif val_loss > best_val_loss:
            print("Early stopping at epoch", epoch)
            break
    return best_w                        # return the best weights after training
Provided and discussed in lab.