Activation functions

Activation functions are crucial components of neural networks: applied to each neuron's weighted input, they determine that neuron's output. By introducing non-linearity into the network, they enable it to learn complex patterns and perform tasks beyond linear regression. Without non-linear activation functions, a network would behave like a single-layer network no matter how many layers it has, because the composition of linear functions is itself a linear function.

https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
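
To make the last point concrete, here is a minimal NumPy sketch (the layer sizes and random weights are arbitrary) showing that two stacked linear layers with no activation function collapse into a single linear layer:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two "layers" with no activation function: y = W2 @ (W1 @ x + b1) + b2
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
    x = rng.normal(size=3)
    two_layers = W2 @ (W1 @ x + b1) + b2

    # The same mapping written as a single linear layer
    W, b = W2 @ W1, W2 @ b1 + b2
    one_layer = W @ x + b

    print(np.allclose(two_layers, one_layer))  # True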

Types of Activation Functions

  1. Linear Activation Function:
    • Formula: $f(x) = x$
    • Use: Mostly used in the output layer for regression problems.
    • Characteristics: It doesn’t introduce non-linearity into the network.
  2. Sigmoid (Logistic) Activation Function:
    • Formula: $f(x) = \frac{1}{1 + e^{-x}}$
    • Use: Previously popular for hidden layers and for binary classification problems in the output layer.
    • Characteristics: Introduces non-linearity; outputs are in the range (0, 1). However, it suffers from vanishing gradients for very high or very low input values.
  3. Hyperbolic Tangent (tanh) Activation Function:
    • Formula: $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
    • Use: Hidden layers in neural networks.
    • Characteristics: Similar to the sigmoid but outputs range from (-1, 1), which can lead to faster convergence in some cases due to being zero-centered.
  4. Rectified Linear Unit (ReLU) Activation Function:
    • Formula: $f(x) = \max(0, x)$
    • Use: Very popular in hidden layers of neural networks, especially in deep learning models.
    • Characteristics: Introduces non-linearity with efficient computation and helps with the vanishing gradient problem. However, it can suffer from the "dying ReLU" problem, where neurons stop learning completely.
  5. Leaky Rectified Linear Unit (Leaky ReLU) Activation Function:
    • Formula: $f(x) = x$ if $x > 0$, else $\alpha x$, where $\alpha$ is a small constant.
    • Use: An attempt to fix the dying ReLU problem.
    • Characteristics: Allows a small, non-zero gradient when the unit is not active.
  6. Softmax Activation Function:
    • Formula: For a vector $x$ of raw class scores from the last layer of a network, $f(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$.
    • Use: Output layer of a neural network for multi-class classification problems.
    • Characteristics: Converts raw class scores into probabilities by taking the exponentials of the inputs and normalizing them.
  7. Exponential Linear Unit (ELU) Activation Function:
    • Formula: $f(x) = x$ if $x > 0$, else $\alpha(e^x - 1)$, where $\alpha$ is a hyperparameter.
    • Use: Hidden layers in neural networks.
    • Characteristics: Tries to make the mean activations closer to zero, which speeds up learning. It has all the benefits of ReLU and fixes the dying ReLU problem, with a non-zero gradient for negative input.

  8. Gaussian Error Linear Unit (GELU) Activation Function:
    • Formula: $f(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard normal; in practice it is often computed with the tanh approximation used in the code below.
    • Use: Hidden layers, particularly in Transformer-based models.
    • Characteristics: A smooth alternative to ReLU; instead of a hard cutoff at zero, each input is scaled by $\Phi(x)$, the probability that a standard normal variable falls below it.

    import numpy as np

    def gelu(x):
        """Gaussian Error Linear Unit (GELU) activation function (tanh approximation)."""
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))
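
For quick reference, here is a minimal NumPy sketch of the other functions in the list above (the function names and the default α values, 0.01 for Leaky ReLU and 1.0 for ELU, are illustrative choices rather than fixed conventions); the softmax subtracts the maximum score before exponentiating, a standard trick for numerical stability:

    import numpy as np

    def linear(x):
        """Identity / linear activation."""
        return x

    def sigmoid(x):
        """Logistic sigmoid; outputs lie in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        """Hyperbolic tangent; outputs lie in (-1, 1)."""
        return np.tanh(x)

    def relu(x):
        """Rectified Linear Unit: max(0, x)."""
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        """Leaky ReLU: x for x > 0, alpha * x otherwise."""
        return np.where(x > 0, x, alpha * x)

    def elu(x, alpha=1.0):
        """Exponential Linear Unit: x for x > 0, alpha * (e^x - 1) otherwise."""
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))

    def softmax(x):
        """Softmax over the last axis; subtracting the max avoids overflow in exp."""
        z = x - np.max(x, axis=-1, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=-1, keepdims=True)

Apart from softmax, which normalizes a whole vector of scores, each function is applied element-wise, e.g. relu(np.array([-2.0, 0.0, 3.0])) returns array([0., 0., 3.]).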
    

Choosing an Activation Function

The choice of activation function depends on the specific application and the problem you are trying to solve. As a rule of thumb from the list above, ReLU (or a variant such as Leaky ReLU, ELU, or GELU) is a common default for hidden layers, sigmoid or softmax suit binary or multi-class classification output layers, and a linear activation suits regression output layers.

It's often beneficial to experiment with different activation functions and their configurations to determine what works best for a specific neural network architecture and dataset.
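
As one way to set up such an experiment, the sketch below assumes a Keras-style API; the layer sizes, input dimension, number of classes, candidate activations, and the commented-out training call (which would need real x_train/y_train data) are placeholder choices rather than recommendations:

    import tensorflow as tf

    def build_model(hidden_activation: str) -> tf.keras.Model:
        """Small classifier whose hidden-layer activation can be swapped out for comparison."""
        return tf.keras.Sequential([
            tf.keras.Input(shape=(20,)),                      # placeholder feature dimension
            tf.keras.layers.Dense(64, activation=hidden_activation),
            tf.keras.layers.Dense(64, activation=hidden_activation),
            tf.keras.layers.Dense(10, activation="softmax"),  # multi-class output
        ])

    for act in ["relu", "tanh", "elu", "gelu"]:
        model = build_model(act)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)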