Adam optimizer

Adam computes adaptive learning rates for each parameter. It stores an exponentially decaying average of past gradients (an estimate of the first moment, the mean) and an exponentially decaying average of past squared gradients (an estimate of the second moment, the uncentered variance).

Adam is a gradient-descent-based algorithm that uses these mean and variance estimates to adapt the learning rate of each parameter.

Adam Optimizer

The Adam optimizer (short for Adaptive Moment Estimation) is a popular gradient-based optimization algorithm for training machine learning models, especially deep neural networks. Introduced by Diederik P. Kingma and Jimmy Ba in a paper titled "Adam: A Method for Stochastic Optimization," Adam combines the best properties of the AdaGrad and RMSProp optimizers to handle sparse gradients on noisy problems.

How Adam Works

Adam maintains two moving averages for each parameter: one for the gradients (similar to momentum) and one for the squared gradients (similar to RMSprop). These moving averages estimate the first moment (mean) and second moment (uncentered variance) of the gradients, and the optimizer uses these estimates to adaptively adjust the learning rate for each parameter. The steps involved in the Adam optimization algorithm are listed below, followed by a minimal code sketch:

  1. Initialize moving averages: Both the moving averages of the gradients and the squared gradients are initialized to zero.
  2. Compute gradient: For each parameter, compute the gradient of the loss function with respect to the parameter at the current step.
  3. Update biased first moment estimate: Update the moving average of the gradients, incorporating the gradient computed at the current step.
  4. Update biased second raw moment estimate: Update the moving average of the squared gradients.
  5. Correct bias in first and second moments: Compute bias-corrected estimates of both moments. This correction is necessary because both moving averages are initialized as zeros, leading to initial estimates that are biased towards zero.
  6. Update parameters: Use the bias-corrected estimates to adaptively adjust learning rates for each parameter and update the parameters.
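
The loop below is a minimal from-scratch sketch of these steps using NumPy, applied to a toy quadratic loss. The function name adam_step, the hyperparameter values, and the toy loss are illustrative assumptions, not a reference implementation.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Steps 3-4: update the biased first- and second-moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Step 5: bias-correct both moments (t is 1-based)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 6: adaptive per-parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Step 1: initialize parameters and both moving averages (to zero)
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 1001):
    # Step 2: gradient of a toy loss f(theta) = sum(theta ** 2)
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)

print(theta)  # ends up close to the minimizer [0, 0]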

Formulae

The parameter update at step \(t\) is given by:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \]

\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \]

\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]

\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \]

where:

  - \(g_t\) is the gradient of the loss with respect to the parameters at step \(t\)
  - \(m_t\) and \(v_t\) are the moving averages of the gradients and of the squared gradients
  - \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected estimates
  - \(\beta_1\) and \(\beta_2\) are the exponential decay rates of the two moving averages
  - \(\eta\) is the learning rate and \(\epsilon\) is a small constant that prevents division by zero
  - \(\theta_t\) are the model parameters at step \(t\)
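
To see why the bias correction matters, consider a quick worked example with illustrative numbers (not taken from the source): at \(t = 1\) with \(\beta_1 = 0.9\), \(m_0 = 0\), and a first gradient \(g_1 = 0.5\), the raw average is \(m_1 = 0.9 \cdot 0 + 0.1 \cdot 0.5 = 0.05\), far smaller than the actual gradient because of the zero initialization. The correction \(\hat{m}_1 = 0.05 / (1 - 0.9^1) = 0.5\) restores the gradient's true scale, and the same reasoning applies to \(v_t\).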

Advantages of Adam

  - Adaptive per-parameter learning rates work well with sparse gradients and noisy objectives.
  - Requires little manual tuning; the default hyperparameters perform well across many problems.
  - Computationally efficient, needing only two extra moving averages per parameter.

Limitations

  - Can generalize worse than SGD with momentum on some tasks.
  - The adaptive steps can fail to converge on certain problems, which motivated variants such as AMSGrad.
  - Adds memory overhead for the two moment estimates of every parameter.

Implementation in Python with TensorFlow

Here's an example of how to use the Adam optimizer with TensorFlow:

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Assuming X_train and y_train are loaded with training data
# model.fit(X_train, y_train, epochs=10, batch_size=32)

In this example, the optimizer argument of the compile method is set to 'adam', indicating that the Adam optimizer should be used for training. TensorFlow's implementation of Adam includes reasonable default values for the learning rate (\(\eta\)), \(\beta_1\), \(\beta_2\), and \(\epsilon\), making it easy to use without manual hyperparameter tuning.
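
If you need to override those defaults, the optimizer can also be passed to compile as an object. Below is a minimal sketch, assuming the TensorFlow 2.x Keras API and reusing the model defined above; the hyperparameter values shown are illustrative, not recommendations.

import tensorflow as tf

# Construct Adam explicitly to control its hyperparameters
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # eta
    beta_1=0.9,           # decay rate for the first-moment estimate
    beta_2=0.999,         # decay rate for the second-moment estimate
    epsilon=1e-7          # small constant to avoid division by zero
)

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])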