Batch norm, mini-batch, layer norm


Mini-batch size

BN: why do we need it, and how does it help?

Core operation: \(\hat{x} = (x - \mu)/\sigma\)

Batch Normalization

Batch Normalization is a technique used in deep learning to stabilize and accelerate the training of deep neural networks. It does this by normalizing the inputs to a layer for each mini-batch, which helps to reduce the internal covariate shift problem. Internal covariate shift refers to the change in the distribution of network activations due to the update of weights during training, which can slow down the training process and make it harder to tune hyperparameters.

Definition and Operation:

The process involves normalizing the inputs of each layer so that they have a mean of 0 and a standard deviation of 1. This normalization is followed by a scaling and shifting operation, in which two trainable parameters per normalized feature are introduced to scale and shift the normalized values. This allows the network to undo the normalization if that turns out to be beneficial for learning. The operations can be summarized as follows for a given layer's input \(X\):

  1. Compute the mean and variance of \(X\) within a mini-batch.
  2. Normalize \(X\) using the computed mean and variance.
  3. Scale and shift the normalized \(X\) using trainable parameters \(\gamma\) (scale) and \(\beta\) (shift).

The mathematical formula for Batch Normalization is given by:

\[ \hat{X} = \frac{X - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \]

\[ Y = \gamma \hat{X} + \beta \]

where \(\mu_B\) and \(\sigma_B^2\) are the mean and variance of the batch, respectively, \(\epsilon\) is a small constant added for numerical stability, and \(Y\) is the output.
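
As an illustrative sketch of these three steps, the per-feature batch statistics can be computed by hand with NumPy; the input values, \(\gamma\), and \(\beta\) below are placeholders:

import numpy as np

# Toy mini-batch: 4 samples, 3 features (illustrative values only)
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

mu_B = X.mean(axis=0)    # per-feature batch mean
var_B = X.var(axis=0)    # per-feature batch variance
eps = 1e-5               # small constant for numerical stability

X_hat = (X - mu_B) / np.sqrt(var_B + eps)   # step 2: normalize

gamma = np.ones(3)       # trainable scale, initialized to 1
beta = np.zeros(3)       # trainable shift, initialized to 0
Y = gamma * X_hat + beta # step 3: scale and shift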

Pros:

Batch Normalization typically allows higher learning rates, speeds up and stabilizes convergence, reduces sensitivity to weight initialization, and adds a mild regularization effect because the batch statistics inject a small amount of noise into each update.

Cons:

It relies on reasonably large mini-batches for reliable statistics, behaves differently at training and inference time (running averages of the mean and variance are used at inference), adds some computational overhead, and is awkward to apply to recurrent or variable-length sequence models.
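
To illustrate the training/inference difference mentioned above, a Keras BatchNormalization layer can be called with an explicit training flag; the input here is just a randomly generated placeholder batch:

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((32, 64))        # placeholder mini-batch: 32 examples, 64 features

y_train_mode = bn(x, training=True)   # normalizes with batch statistics, updates moving averages
y_infer_mode = bn(x, training=False)  # normalizes with the accumulated moving mean and variance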

Mini-Batch

A mini-batch is a subset of the training dataset used to train a model in iterations. Instead of training the model on the entire dataset at once (batch gradient descent) or on a single sample at a time (stochastic gradient descent), mini-batch gradient descent splits the training dataset into small batches and updates the model's weights based on the gradient of the loss with respect to each mini-batch.

Key Points:

Typical mini-batch sizes range from roughly 32 to 512 examples. Smaller batches give noisier but cheaper gradient estimates (the noise can act as a regularizer), while larger batches make better use of parallel hardware and yield more stable batch statistics for Batch Normalization; the sketch below illustrates the basic update loop.
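
As a minimal sketch of this idea (the dataset, model, and hyperparameters are made up for illustration), mini-batch gradient descent for a linear model with a squared-error loss looks like this:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # placeholder dataset
y = X @ rng.normal(size=5)            # placeholder targets
w = np.zeros(5)                       # model parameters
batch_size, lr = 32, 0.1

for epoch in range(5):
    perm = rng.permutation(len(X))    # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]                     # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient of the loss on the mini-batch
        w -= lr * grad                                           # one weight update per mini-batch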

Implementing Batch Normalization and Using Mini-Batches in Python

Here's a simplified example of how you might configure a neural network layer with batch normalization in TensorFlow/Keras:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(784,)),   # linear transform of the 784-dimensional input
    BatchNormalization(),            # normalize the pre-activations over each mini-batch
    Activation('relu'),              # non-linearity applied after normalization
    Dense(10, activation='softmax')  # output layer for 10 classes
])

This example demonstrates the common pattern of placing a Batch Normalization layer between a Dense layer and its activation in a Keras model. During training, data is fed to the model in mini-batches, automatically leveraging the benefits of both batch normalization and mini-batch gradient descent.
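
To tie the two ideas together, training this model would pass the mini-batch size through the batch_size argument of fit; the data arrays below are random placeholders standing in for a real dataset:

import numpy as np

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Random placeholder data standing in for a real 784-feature dataset with 10 classes
x_train = np.random.rand(1000, 784).astype('float32')
y_train = np.random.randint(0, 10, size=(1000,))

# batch_size sets the mini-batch size used both for the gradient updates
# and for the batch statistics inside the BatchNormalization layer
model.fit(x_train, y_train, epochs=3, batch_size=32)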

Batch normalization and mini-batches together play a crucial role in modern deep learning by improving training stability, efficiency, and model performance.