ReLU

Tags: Activation Function

ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero.

Pros: it avoids the vanishing gradient problem.

Cons: it is not differentiable at 0, and because its output is unbounded it may contribute to exploding gradients.

Leaky ReLU: similar to ReLU, but with a small slope for negative values instead of a hard zero.

Compared with standard ReLU, leaky ReLU does not suffer from the dying ReLU problem.
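
A minimal NumPy sketch of leaky ReLU, assuming the commonly used default slope of 0.01 for negative inputs (the slope is a hyperparameter, not part of the definition):

import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Behaves exactly like ReLU for positive inputs; negative inputs are
    # scaled by a small slope instead of being zeroed out.
    return np.where(x > 0, x, negative_slope * x)

# Example usage
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(leaky_relu(x))  # -0.02, -0.01, 0.0, 1.0, 2.0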

Imagine a network with randomly initialized (or normalized) weights, where almost 50% of the network yields 0 activation because of the characteristic of ReLU (it outputs 0 for negative values of x). This means fewer neurons are firing (sparse activation) and the network is lighter. [src]
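
As a rough illustration of that claim, here is a small sketch (hypothetical layer sizes, standard normal initialization) that measures the fraction of zero activations after a single randomly initialized ReLU layer:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 256))  # a batch of random inputs
W = rng.standard_normal((256, 512))   # randomly initialized weights

activation = np.maximum(0, X @ W)

# The pre-activations are roughly zero-mean, so about half of them are
# negative and get clipped to 0 by ReLU, giving sparse activations.
print((activation == 0).mean())  # close to 0.5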

Why it avoids the vanishing gradient problem:

If the input value is positive, the ReLU function returns it; if it is negative, it returns 0. The derivative of ReLU is 1 for values greater than zero. Because multiplying 1 by itself any number of times still gives 1, this essentially addresses the vanishing gradient problem.
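
A toy numeric sketch of that argument: keep only the activation-derivative factors that backpropagation multiplies together across layers, assuming the idealized case where every ReLU unit stays on its positive side (derivative 1), and compare with the sigmoid, whose derivative is at most 0.25:

# Product of per-layer activation derivatives over 50 layers.
layers = 50

sigmoid_factor = 0.25 ** layers  # sigmoid's derivative never exceeds 0.25
relu_factor = 1.0 ** layers      # ReLU's derivative is 1 for positive inputs

print(sigmoid_factor)  # ~7.9e-31: the gradient has effectively vanished
print(relu_factor)     # 1.0: the gradient passes through unchanged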

ReLU (Rectified Linear Unit)

ReLU stands for Rectified Linear Unit, and it is a type of activation function that is widely used in deep learning models, particularly in convolutional neural networks (CNNs) and deep feed-forward neural networks. The function is defined as:

\(\text{ReLU}(x) = \max(0, x)\)

This means that for any input \(x\), the ReLU function outputs \(x\) if \(x\) is greater than zero, and it outputs \(0\) otherwise. The simplicity of this function allows for efficient computation and gradient propagation, as its derivative is either \(0\) (for \(x < 0\)) or \(1\) (for \(x > 0\)). The gradient is undefined at \(x = 0\), but in practice this point is not problematic: implementations simply assign a conventional subgradient value there (commonly \(0\) or \(1\)).
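
A small sketch of that derivative, assuming the common convention of using the subgradient value 0 at \(x = 0\) (frameworks differ on this choice):

import numpy as np

def relu_grad(x):
    # 1 for x > 0, 0 for x < 0; the value at exactly x == 0 is a convention,
    # since any number in [0, 1] is a valid subgradient -- here we pick 0.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu_grad(x))  # 0.0, 0.0, 1.0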

Pros of ReLU

  1. Efficiency: ReLU is computationally efficient, allowing for faster training of deep neural networks due to its simple mathematical operations.
  2. Sparsity: By outputting zero for negative inputs, ReLU creates sparse activations at the neuron level, which can be advantageous for neural networks.
  3. Mitigates the Vanishing Gradient Problem: Unlike the sigmoid or tanh activation functions, ReLU does not saturate in the positive domain, which helps alleviate the vanishing gradient problem that can hinder the training of deep networks.

Cons of ReLU

  1. Dying ReLU Problem: Neurons can sometimes output zero for all inputs, at which point they stop contributing to the learning process, a problem known as the "dying ReLU" phenomenon. This can happen if a large gradient flows through a ReLU neuron and updates its weights in such a way that the neuron only ever outputs zeros (see the sketch after this list).
  2. Non-zero Centered Output: ReLU activations are not centered around zero, which can sometimes make the optimization process less efficient.
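
A minimal illustration of the dying ReLU problem, assuming a single hypothetical neuron whose bias has been pushed strongly negative (for example by one very large gradient update), so that its pre-activation is negative for every input:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 8))  # some random input data

w = rng.standard_normal(8)
b = -100.0  # bias driven far negative by a large update

pre_activation = X @ w + b
output = np.maximum(0, pre_activation)
grad_wrt_pre = (pre_activation > 0).astype(float)

print(output.max())        # 0.0 -> the neuron never fires
print(grad_wrt_pre.sum())  # 0.0 -> no gradient flows back, so it cannot recover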

Applications

ReLU and its variants (like Leaky ReLU, Parametric ReLU, and the Exponential Linear Unit) are used across a wide range of deep learning applications, including image classification, object detection, speech recognition, and natural language processing.

Python Code Example

Here's a simple Python example of implementing the ReLU function:

import numpy as np

def relu(x):
    return np.maximum(0, x)

# Example usage
x = np.array([-2, -1, 0, 1, 2])
print(relu(x))

This code demonstrates the ReLU function using NumPy, a popular library for numerical computations in Python. The np.maximum function is used to compute the ReLU of each element in the input array x, resulting in an array where all negative values are replaced with zero, while positive values are left unchanged.

ReLU has become a default choice for activation in many types of neural networks due to its simplicity and effectiveness in promoting faster and more effective training.