Adam optimizer

Adam computes adaptive learning rates for each parameter. It stores an exponentially decaying average of past gradients (an estimate of the first moment, the mean) and an exponentially decaying average of past squared gradients (an estimate of the second moment, the uncentered variance).

Adam is a gradient-descent-based algorithm that uses these mean and variance estimates to adapt the learning rate of each parameter.

Adam Optimizer

The Adam optimizer (short for Adaptive Moment Estimation) is a popular gradient-based optimization algorithm for training machine learning models, especially deep neural networks. Introduced by Diederik P. Kingma and Jimmy Ba in a paper titled "Adam: A Method for Stochastic Optimization," Adam combines the best properties of the AdaGrad and RMSProp optimizers to handle sparse gradients on noisy problems.

How Adam Works

Adam maintains two moving averages for each parameter: one for the gradients (similar to momentum) and one for the squared gradients (similar to RMSprop). These moving averages estimate the first moment (mean) and second moment (uncentered variance) of the gradients, and the optimizer uses these estimates to adaptively adjust the learning rate for each parameter. The steps involved in the Adam optimization algorithm are listed below, followed by a minimal code sketch:

  1. Initialize moving averages: Both the moving averages of the gradients and the squared gradients are initialized to zero.
  2. Compute gradient: For each parameter, compute the gradient of the loss function with respect to the parameter at the current step.
  3. Update biased first moment estimate: Update the moving average of the gradients, incorporating the gradient computed at the current step.
  4. Update biased second raw moment estimate: Update the moving average of the squared gradients.
  5. Correct bias in first and second moments: Compute bias-corrected estimates of both moments. This correction is necessary because both moving averages are initialized as zeros, leading to initial estimates that are biased towards zero.
  6. Update parameters: Use the bias-corrected estimates to adaptively adjust learning rates for each parameter and update the parameters.
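
The loop below is a minimal from-scratch sketch of these steps using NumPy, applied to a toy quadratic loss. The function name adam_step, the hyperparameter values, and the toy loss are illustrative assumptions, not a reference implementation.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Steps 3-4: update the biased first- and second-moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Step 5: bias-correct both moments (t is 1-based)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 6: adaptive per-parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Step 1: initialize parameters and both moving averages (to zero)
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 1001):
    # Step 2: gradient of a toy loss f(theta) = sum(theta ** 2)
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)

print(theta)  # ends up close to the minimizer [0, 0]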

Formulae

The parameter update at step \(t\) is given by:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \]

\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \]

\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]

\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \]

where:

  - \(g_t\) is the gradient of the loss with respect to the parameters at step \(t\)
  - \(m_t\) and \(v_t\) are the moving averages of the gradients and of the squared gradients
  - \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected estimates
  - \(\beta_1\) and \(\beta_2\) are the exponential decay rates of the two moving averages
  - \(\eta\) is the learning rate and \(\epsilon\) is a small constant that prevents division by zero
  - \(\theta_t\) are the model parameters at step \(t\)
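
To see why the bias correction matters, consider a quick worked example with illustrative numbers (not taken from the source): at \(t = 1\) with \(\beta_1 = 0.9\), \(m_0 = 0\), and a first gradient \(g_1 = 0.5\), the raw average is \(m_1 = 0.9 \cdot 0 + 0.1 \cdot 0.5 = 0.05\), far smaller than the actual gradient because of the zero initialization. The correction \(\hat{m}_1 = 0.05 / (1 - 0.9^1) = 0.5\) restores the gradient's true scale, and the same reasoning applies to \(v_t\).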

Advantages of Adam

  - Adaptive per-parameter learning rates work well with sparse gradients and noisy objectives.
  - Requires little manual tuning; the default hyperparameters perform well across many problems.
  - Computationally efficient, needing only two extra moving averages per parameter.

Limitations

  - Can generalize worse than SGD with momentum on some tasks.
  - The adaptive steps can fail to converge on certain problems, which motivated variants such as AMSGrad.
  - Adds memory overhead for the two moment estimates of every parameter.

Implementation in Python with TensorFlow

Here's an example of how to use the Adam optimizer with TensorFlow:

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Assuming X_train and y_train are loaded with training data
# model.fit(X_train, y_train, epochs=10, batch_size=32)

In this example, the optimizer argument of the compile method is set to 'adam', indicating that the Adam optimizer should be used for training. TensorFlow's implementation of Adam includes reasonable default values for the learning rate (\(\eta\)), \(\beta_1\), \(\beta_2\), and \(\epsilon\), making it easy to use without manual hyperparameter tuning.
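
If you need to override those defaults, the optimizer can also be passed to compile as an object. Below is a minimal sketch, assuming the TensorFlow 2.x Keras API and reusing the model defined above; the hyperparameter values shown are illustrative, not recommendations.

import tensorflow as tf

# Construct Adam explicitly to control its hyperparameters
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # eta
    beta_1=0.9,           # decay rate for the first-moment estimate
    beta_2=0.999,         # decay rate for the second-moment estimate
    epsilon=1e-7          # small constant to avoid division by zero
)

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])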