Momentum
| Created | |
| --- | --- |
| Tags | NN |
42) What is Momentum (w.r.t NN optimization)?
Momentum is
- a method of avoiding local minima by weighting more recent gradient values more heavily.
- It lets the optimization algorithm remember its last step and add some proportion of it to the current step.
- This way, even if the algorithm is stuck in a flat region or a small local minimum, it can escape and continue towards the true minimum. [src]
In the context of optimization algorithms, momentum is a technique used to accelerate gradient descent algorithms, especially in the training of neural networks and other machine learning models. It helps to overcome local minima, saddle points, and oscillations in the loss landscape, leading to faster convergence and better generalization.
Description:
Momentum is based on the idea of adding a fraction of the update vector from the previous time step to the current update. This momentum term helps to smooth out the variations in the gradient descent trajectory and helps the optimization algorithm to maintain a more consistent direction towards the minimum.
Mathematics:
Mathematically, the update rule with momentum can be expressed as:
\[
v_t = \beta v_{t-1} + \eta \nabla J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t
\]
where:
- \(v_t\) is the momentum term at time step \(t\),
- \(\eta\) is the learning rate,
- \(\nabla J(\theta_t)\) is the gradient of the loss function \(J\) with respect to the parameters \(\theta_t\) at time step \(t\),
- \(\beta\) is the momentum parameter, usually a value between 0 and 1.
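The update rule above can be sketched in a few lines of plain Python (a minimal illustration of the math, not any library's API; the function name and the toy objective \(J(\theta) = \theta^2\) are chosen here for demonstration):

```python
def momentum_step(theta, v, grad, eta=0.05, beta=0.9):
    """One momentum update: v_t = beta*v_{t-1} + eta*grad, theta_{t+1} = theta_t - v_t."""
    v = beta * v + eta * grad
    theta = theta - v
    return theta, v

# Minimize J(theta) = theta^2, whose gradient is 2*theta
theta, v = 5.0, 0.0
for _ in range(200):
    grad = 2 * theta
    theta, v = momentum_step(theta, v, grad)
print(theta)  # converges close to 0.0
```

Note that the iterates overshoot and oscillate around the minimum before settling, which is characteristic of momentum on a quadratic bowl.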
Interpretation:
- When the gradient keeps pointing in the same direction (with consistent sign), momentum will increase the size of the steps in that direction.
- When the gradient changes direction (e.g., in the presence of oscillations or noise), momentum will dampen the oscillations and prevent frequent changes in direction.
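The first point can be quantified: if the gradient has a constant value \(g\), the momentum term converges to \(\eta g / (1 - \beta)\), i.e. an effective step size \(1/(1-\beta)\) times larger (10x for \(\beta = 0.9\)). A quick numeric check (toy values, not from the source):

```python
# Accumulate the momentum term under a constant gradient g.
eta, beta, g = 0.01, 0.9, 1.0
v = 0.0
for _ in range(100):
    v = beta * v + eta * g  # v_t = beta * v_{t-1} + eta * grad
print(v)  # approaches eta * g / (1 - beta) = 0.1
```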
Advantages:
- Faster Convergence: Momentum helps to accelerate the convergence of optimization algorithms, especially in regions with high curvature or long narrow valleys.
- Better Generalization: By smoothing out the trajectory, momentum can help escape shallow local minima and saddle points, leading to better generalization and improved performance on unseen data.
Python Implementation (using TensorFlow):
```python
import tensorflow as tf

# Define an SGD optimizer with momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Example usage in a custom training loop
for x_batch, y_batch in dataset:
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)
        loss = loss_fn(y_batch, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```
In this example, we create a stochastic gradient descent (SGD) optimizer with a momentum parameter of 0.9 using TensorFlow. During training, the optimizer applies momentum to the computed gradients when updating the model parameters.