GELU
| Tags | Activation Function |
| --- | --- |
The Gaussian Error Linear Unit (GELU) is an activation function that has gained wide popularity in deep learning, particularly in natural language processing (NLP). It was introduced in the paper "Gaussian Error Linear Units (GELUs)" by Dan Hendrycks and Kevin Gimpel. Like other rectified units such as ReLU, Leaky ReLU, and ELU, GELU introduces non-linearity, but it does so differently: rather than gating the input by its sign, it weights the input by the probability that a standard Gaussian variable falls below it, a formulation motivated by stochastic regularizers such as dropout.
Formula
The GELU activation function is defined as:

$$\mathrm{GELU}(x) = x \, \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard Gaussian distribution. In simpler terms, $\Phi(x)$ represents the probability that a random variable with a standard normal distribution takes on a value less than or equal to $x$.
An approximate formulation of the GELU function, which is computationally more efficient, is given by:

$$\mathrm{GELU}(x) \approx 0.5\,x \left(1 + \tanh\!\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$
This approximation makes it easier to implement and compute in practice while maintaining similar characteristics to the exact formulation.
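To make the two formulations concrete, here is a minimal from-scratch sketch comparing the exact (erf-based) GELU with the tanh approximation; the function names are illustrative, not from any library:

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh-based approximation from the GELU paper
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
```

The two agree to within about 10⁻³ across typical input ranges, which is why the cheaper tanh form is often used in practice.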
Characteristics and Advantages
- Smooth Non-linearity: GELU provides a smooth, non-linear thresholding behavior. Unlike ReLU, which abruptly cuts off at zero, GELU smoothly interpolates between the linear and non-linear regime, which can help in learning complex patterns.
- Non-monotonic: Unlike ReLU, the GELU function is non-monotonic: for negative inputs it dips slightly below zero before approaching zero again. Negative inputs therefore receive a small but non-zero output and gradient, which helps mitigate the dying-neuron problem common with ReLU activations.
- Dynamic Gate: The GELU activation acts like a dynamic gate for the input signals. For large positive values, the function approximates the identity function, allowing the input to pass through, while for negative values, it suppresses the signal.
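The non-monotonic dip described above can be verified numerically with a simple grid scan (a rough sketch; the grid resolution is arbitrary):

```python
import math

def gelu(x: float) -> float:
    # Exact GELU via the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Scan the negative axis to locate the dip that makes GELU non-monotonic
xs = [i / 1000.0 for i in range(-3000, 1)]
ys = [gelu(x) for x in xs]
x_min = xs[ys.index(min(ys))]
print(f"minimum ≈ {min(ys):.4f} at x ≈ {x_min:.3f}")
```

The scan finds a minimum of roughly −0.17 near x ≈ −0.75, after which the function climbs back toward zero.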
Applications
The GELU activation function has seen significant adoption in state-of-the-art models, especially in NLP. For instance, it is the activation used in the feed-forward layers of Transformer-based models such as GPT (Generative Pretrained Transformer) and BERT (Bidirectional Encoder Representations from Transformers), where it replaced the ReLU used in the original Transformer architecture. Its success in these models highlights its effectiveness in handling complex patterns and sequences in data.
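To illustrate where GELU sits in such models, here is a minimal sketch of a Transformer-style position-wise feed-forward block in PyTorch (the class name and dimensions are illustrative, not from any specific model):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Transformer-style feed-forward block with GELU, as in BERT/GPT (illustrative sizes)."""
    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand to the hidden dimension
        self.act = nn.GELU()                  # GELU between the two projections
        self.fc2 = nn.Linear(d_ff, d_model)   # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

ffn = FeedForward()
out = ffn(torch.randn(2, 10, 64))  # (batch, sequence, d_model)
print(out.shape)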
Implementing GELU in Python
Most deep learning frameworks, such as TensorFlow and PyTorch, include built-in support for the GELU activation function. Here's how you can use it in PyTorch:
```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 1.0])
y = F.gelu(x)  # applies the exact (erf-based) GELU element-wise
print(y)
```
And in TensorFlow:
```python
import tensorflow as tf

x = tf.constant([-1.0, 0.0, 1.0])
y = tf.nn.gelu(x)  # exact GELU by default; pass approximate=True for the tanh form
print(y)
```
These examples show how to apply GELU to a tensor in both PyTorch and TensorFlow, illustrating its ease of use in modern deep learning frameworks. Its adoption in prominent models underscores its utility as an activation function in complex neural network architectures.