Vanishing gradient
| Property | Value |
| --- | --- |
| Created | |
| Tags | NN |
As we add more and more hidden layers, backpropagation becomes less and less effective at passing error information down to the lower layers. As the error signal is propagated backwards, it is repeatedly multiplied by activation-function derivatives and weight matrices, so the gradients reaching the early layers shrink layer by layer.
The gradient there becomes vanishingly small, effectively preventing those weights from changing their values. In the worst case, this may completely stop the neural network from training any further.
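A minimal NumPy sketch of this effect (illustrative only; the depth, width, and weight scale are arbitrary choices): backpropagating through a stack of sigmoid layers multiplies the gradient by the sigmoid derivative (at most 0.25) and the transposed weight matrix at every layer, so its norm collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 20, 64
weights = [rng.normal(0.0, 0.1, size=(width, width)) for _ in range(n_layers)]

# Forward pass, keeping each layer's output for the backward pass.
a = rng.normal(size=(1, width))
outputs = []
for W in weights:
    a = sigmoid(a @ W)
    outputs.append(a)

# Backward pass: at every layer the gradient is multiplied by the sigmoid
# derivative a * (1 - a) (at most 0.25) and by W^T, so its norm shrinks steadily.
grad = np.ones_like(a)
for W, a in zip(reversed(weights), reversed(outputs)):
    grad = (grad * a * (1.0 - a)) @ W.T
    print(f"gradient norm: {np.linalg.norm(grad):.3e}")
```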
Common solutions (a short sketch of each follows the list):
- Use of different activation functions: Rectified Linear Unit (ReLU) and its variants (e.g., Leaky ReLU, Parametric ReLU) have a gradient of 1 for positive inputs, so they do not squash the gradient the way saturating functions such as sigmoid and tanh do, making them more suitable for deep networks.
- Careful weight initialization: Initializing weights using techniques such as Xavier or He initialization can help alleviate the vanishing gradient problem by ensuring that activations and gradients are neither too small nor too large.
- Batch normalization: Normalizing the activations within each mini-batch can help stabilize training and mitigate the vanishing gradient problem by reducing the internal covariate shift.
- Skip connections: Architectures like Residual Networks (ResNets) use skip connections to bypass certain layers, allowing gradients to flow more directly and avoid vanishing in deep networks.
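For the activation-function remedy, a small PyTorch sketch (depth, width, and batch size are arbitrary choices) that compares the gradient norm at the first layer of an otherwise identical stack built with sigmoid versus ReLU activations; the ReLU version keeps the gradient orders of magnitude larger.

```python
import torch
import torch.nn as nn

def deep_stack(activation, depth=20, width=64):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(8, 64)

for name, act in [("sigmoid", nn.Sigmoid), ("relu", nn.ReLU)]:
    net = deep_stack(act)
    net(x).sum().backward()
    # Gradient reaching the very first Linear layer after backpropagating
    # through the whole stack.
    print(name, net[0].weight.grad.norm().item())
```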
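For weight initialization, a sketch of applying He (Kaiming) initialization to every Linear layer of a ReLU network; `xavier_uniform_` would be the analogous choice for sigmoid/tanh layers. Layer sizes here are arbitrary.

```python
import torch.nn as nn

def init_he(module):
    # He initialization scales weights by sqrt(2 / fan_in), keeping the
    # variance of activations and gradients roughly constant across ReLU layers.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
net.apply(init_he)  # applies init_he recursively to every submodule
```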
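For batch normalization, a sketch of the usual Linear → BatchNorm → ReLU ordering for fully connected layers (the feature width is an arbitrary choice).

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),  # normalizes each feature over the mini-batch
    nn.ReLU(),
)

out = block(torch.randn(32, 64))  # training mode needs a batch dimension > 1
```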
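For skip connections, a minimal residual block in the spirit of ResNets: the identity path in `x + self.body(x)` gives gradients a direct route past the stacked layers (widths and depth are arbitrary).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Skip connection: the input is added back to the block's output.
        return self.act(x + self.body(x))

net = nn.Sequential(*[ResidualBlock(64) for _ in range(10)])
out = net(torch.randn(8, 64))
```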