Batch norm, mini-batch, layer norm


Mini-batch size

BN: why do we need it, and how does it help?

Core operation: \(\hat{x} = (x - \mu)/\sigma\)

Batch Normalization

Batch Normalization is a technique used in deep learning to stabilize and accelerate the training of deep neural networks. It does this by normalizing the inputs to a layer for each mini-batch, which helps to reduce the internal covariate shift problem. Internal covariate shift refers to the change in the distribution of network activations due to the update of weights during training, which can slow down the training process and make it harder to tune hyperparameters.

Definition and Operation:

The process involves normalizing the inputs of each layer so that they have a mean of 0 and a standard deviation of 1. This normalization is followed by a scaling and shifting operation, in which two trainable parameters per normalized feature are introduced to scale and shift the normalized values. This allows the network to undo the normalization if that turns out to be beneficial for learning. The operations can be summarized as follows for a given layer's input \(X\):

  1. Compute the mean and variance of \(X\) within a mini-batch.
  2. Normalize \(X\) using the computed mean and variance.
  3. Scale and shift the normalized \(X\) using trainable parameters \(\gamma\) (scale) and \(\beta\) (shift).

The mathematical formula for Batch Normalization is given by:

\[ \hat{X} = \frac{X - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \]

\[ Y = \gamma \hat{X} + \beta \]

where \(\mu_B\) and \(\sigma_B^2\) are the mean and variance of the batch, respectively, \(\epsilon\) is a small constant added for numerical stability, and \(Y\) is the output.
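
As an illustrative sketch of these three steps, the per-feature batch statistics can be computed by hand with NumPy; the input values, \(\gamma\), and \(\beta\) below are placeholders:

import numpy as np

# Toy mini-batch: 4 samples, 3 features (illustrative values only)
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

mu_B = X.mean(axis=0)    # per-feature batch mean
var_B = X.var(axis=0)    # per-feature batch variance
eps = 1e-5               # small constant for numerical stability

X_hat = (X - mu_B) / np.sqrt(var_B + eps)   # step 2: normalize

gamma = np.ones(3)       # trainable scale, initialized to 1
beta = np.zeros(3)       # trainable shift, initialized to 0
Y = gamma * X_hat + beta # step 3: scale and shift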

Pros:

Batch Normalization typically allows higher learning rates, speeds up and stabilizes convergence, reduces sensitivity to weight initialization, and adds a mild regularization effect because the batch statistics inject a small amount of noise into each update.

Cons:

It relies on reasonably large mini-batches for reliable statistics, behaves differently at training and inference time (running averages of the mean and variance are used at inference), adds some computational overhead, and is awkward to apply to recurrent or variable-length sequence models.
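
To illustrate the training/inference difference mentioned above, a Keras BatchNormalization layer can be called with an explicit training flag; the input here is just a randomly generated placeholder batch:

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((32, 64))        # placeholder mini-batch: 32 examples, 64 features

y_train_mode = bn(x, training=True)   # normalizes with batch statistics, updates moving averages
y_infer_mode = bn(x, training=False)  # normalizes with the accumulated moving mean and variance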

Mini-Batch

A mini-batch is a subset of the training dataset used to train a model in iterations. Instead of training the model on the entire dataset at once (batch gradient descent) or on a single sample at a time (stochastic gradient descent), mini-batch gradient descent splits the training dataset into small batches and updates the model's weights based on the gradient of the loss with respect to each mini-batch.

Key Points:

Typical mini-batch sizes range from roughly 32 to 512 examples. Smaller batches give noisier but cheaper gradient estimates (the noise can act as a regularizer), while larger batches make better use of parallel hardware and yield more stable batch statistics for Batch Normalization; the sketch below illustrates the basic update loop.
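
As a minimal sketch of this idea (the dataset, model, and hyperparameters are made up for illustration), mini-batch gradient descent for a linear model with a squared-error loss looks like this:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # placeholder dataset
y = X @ rng.normal(size=5)            # placeholder targets
w = np.zeros(5)                       # model parameters
batch_size, lr = 32, 0.1

for epoch in range(5):
    perm = rng.permutation(len(X))    # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]                     # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient of the loss on the mini-batch
        w -= lr * grad                                           # one weight update per mini-batch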

Implementing Batch Normalization and Using Mini-Batches in Python

Here's a simplified example of how you might configure a neural network layer with batch normalization in TensorFlow/Keras:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(784,)),   # linear transform of the 784-dimensional input
    BatchNormalization(),            # normalize the pre-activations over each mini-batch
    Activation('relu'),              # non-linearity applied after normalization
    Dense(10, activation='softmax')  # output layer for 10 classes
])

This example demonstrates the common pattern of placing a Batch Normalization layer between a Dense layer and its activation in a Keras model. During training, data is fed to the model in mini-batches, automatically leveraging the benefits of both batch normalization and mini-batch gradient descent.
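
To tie the two ideas together, training this model would pass the mini-batch size through the batch_size argument of fit; the data arrays below are random placeholders standing in for a real dataset:

import numpy as np

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Random placeholder data standing in for a real 784-feature dataset with 10 classes
x_train = np.random.rand(1000, 784).astype('float32')
y_train = np.random.randint(0, 10, size=(1000,))

# batch_size sets the mini-batch size used both for the gradient updates
# and for the batch statistics inside the BatchNormalization layer
model.fit(x_train, y_train, epochs=3, batch_size=32)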

Batch normalization and mini-batches together play a crucial role in modern deep learning by improving training stability, efficiency, and model performance.