Loss Functions


In neural networks (NN), various loss functions are used depending on the type of problem being solved. Here are some commonly used loss functions in neural networks:

Regression Loss (continuous value)

  1. Mean Absolute Error (MAE), also known as L1 loss

    \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

    • Provides a straightforward interpretation of the average error magnitude across all predictions.
    • Scale-dependent, more robust to outliers than MSE, and easy to interpret.
  1. Mean Squared Error (MSE), also known as L2 loss
    • \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
    • Suitable for regression problems where the goal is to minimize the squared differences between predicted and actual values.
  1. Huber Loss:
    • L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2} (y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta \left( |y - \hat{y}| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}
    • Quadratic for small errors and linear for large ones, combining the smoothness of MSE with the outlier robustness of MAE (see the huber_loss implementation below).
  1. Quantile loss:
    • Weights positive and negative errors asymmetrically, so over-prediction and under-prediction can be penalized differently (a minimal sketch follows this list).
    • L_{\tau}(y, \hat{y}) = \begin{cases} \tau \cdot (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1 - \tau) \cdot (\hat{y} - y) & \text{if } y < \hat{y} \end{cases}
    • If you set \tau to 0.5, it is equivalent to MAE up to a factor of 1/2; in general, minimizing it estimates the \tau-th quantile of the target.

      Uber uses pseudo-Huber loss and log-cosh loss to approximate the Huber loss and Mean Absolute Error in their distributed XGBoost training. DoorDash's Estimated Time of Arrival models started with MSE and later moved to quantile loss and a custom asymmetric MSE.
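
Quantile loss and log-cosh are not included in the implementation at the end of this note, so here is a minimal NumPy sketch of both, following the formulas above; the function names and example values are my own.

import numpy as np

def quantile_loss(y_true, y_pred, tau=0.5):
    # Pinball loss: under-predictions (y > y_hat) are weighted by tau,
    # over-predictions by (1 - tau).
    error = y_true - y_pred
    return np.mean(np.where(error >= 0, tau * error, (tau - 1) * error))

def log_cosh_loss(y_true, y_pred):
    # Smooth curve that behaves like 0.5 * error^2 for small errors
    # and like |error| - log(2) for large ones.
    return np.mean(np.log(np.cosh(y_pred - y_true)))

y_true = np.array([10.0, 12.0, 8.0])
y_pred = np.array([9.0, 13.5, 8.0])
print(quantile_loss(y_true, y_pred, tau=0.9))  # penalizes under-prediction more heavily
print(log_cosh_loss(y_true, y_pred))

Setting tau above 0.5 penalizes under-prediction more than over-prediction, which is the kind of asymmetry an ETA model might want when arriving late is costlier than arriving early.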

Classification Loss (discrete value)

  1. Log Loss (Logarithmic Loss)
    • \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
    • Log Loss is widely used in binary and multi-class classification problems. It's particularly valuable when you need to predict the probability of an outcome rather than just the outcome itself.
  1. Binary Cross-Entropy Loss (same as Log Loss):
    • \text{Binary Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
    • Used for binary classification problems where the output is a probability between 0 and 1. Penalizes models that are confidently wrong.
  1. Categorical Cross-Entropy Loss (Multiclass Log Loss):
    • \text{Categorical Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(\hat{y}_{ij})
    • Suitable for multiclass classification problems. Penalizes models based on the divergence between the predicted probability distribution and the true distribution.
  1. Focal Loss
    • \text{Focal Loss} = -\alpha_t (1 - \hat{y}_t)^\gamma \log(\hat{y}_t)
    • Focal Loss modifies the standard Cross-Entropy loss by adding a focusing parameter γ (gamma), which reduces the relative loss for well-classified examples and puts more focus on hard, misclassified examples (a sketch follows this list).
  1. Sparse Categorical Cross-Entropy Loss:
    • Similar to Categorical Cross-Entropy Loss but suitable when the target labels are integers rather than one-hot encoded vectors (also sketched after this list).
  1. Hinge Loss:
    • \text{Hinge Loss} = \max(0, 1 - y_i \cdot \hat{y}_i)
    • Used for binary classification with support vector machines (SVMs) or linear classifiers, with labels y_i \in \{-1, +1\}. Encourages correct classification with a margin of at least 1.
  1. Kullback-Leibler Divergence (KL Divergence):
    • \text{KL Divergence} = \sum_{i} y_i \log\left(\frac{y_i}{\hat{y}_i}\right)
    • Measures how one probability distribution diverges from a second, expected probability distribution. It is used in variational autoencoders (VAEs) and other generative models.
  1. Custom Loss Functions:
    • Tailored loss functions can be defined based on specific requirements of the problem or to incorporate domain knowledge, for example an asymmetric MSE that penalizes under-prediction more heavily than over-prediction (sketched after this list).
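
Focal loss, sparse categorical cross-entropy, and an asymmetric MSE of the kind mentioned above are not included in the implementation below, so here is a minimal NumPy sketch of each; the function names, weights, and example values are my own, and alpha = 0.25 with gamma = 2 are the focal-loss defaults commonly used in practice.

import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-12):
    # Binary focal loss: the (1 - p_t)^gamma factor down-weights well-classified examples.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)    # probability assigned to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

def sparse_categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true holds integer class indices; y_pred is an (n_samples, n_classes)
    # matrix of predicted probabilities.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.log(y_pred[np.arange(len(y_true)), y_true]))

def asymmetric_mse(y_true, y_pred, over_weight=1.0, under_weight=2.0):
    # Custom loss example: squared error weighted differently for
    # over-prediction and under-prediction.
    error = y_true - y_pred
    weights = np.where(error > 0, under_weight, over_weight)
    return np.mean(weights * error ** 2)

print(focal_loss(np.array([1, 0, 1, 0]), np.array([0.9, 0.1, 0.6, 0.4])))
print(sparse_categorical_cross_entropy(np.array([2, 0]),
                                       np.array([[0.1, 0.2, 0.7],
                                                 [0.8, 0.1, 0.1]])))
print(asymmetric_mse(np.array([10.0, 12.0]), np.array([9.0, 13.0])))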

The choice of loss function depends on the nature of the problem, the output space, and the desired behavior of the model. Selecting an appropriate loss function is crucial for training neural networks effectively and achieving good performance on the task at hand.

Here's a Python implementation of some commonly used loss functions in neural networks:

import numpy as np

class LossFunctions:
    @staticmethod
    def mean_squared_error(y_true, y_pred):
        # Average of the squared differences between targets and predictions.
        return np.mean((y_true - y_pred)**2)

    @staticmethod
    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        # Clip predicted probabilities away from 0 and 1 to avoid log(0).
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    @staticmethod
    def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
        # Expects one-hot y_true and an (n_samples, n_classes) matrix of probabilities.
        y_pred = np.clip(y_pred, eps, 1.0)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

    @staticmethod
    def hinge_loss(y_true, y_pred):
        # Expects labels in {-1, +1}; penalizes predictions inside the margin.
        return np.mean(np.maximum(0, 1 - y_true * y_pred))

    @staticmethod
    def huber_loss(y_true, y_pred, delta=1.0):
        # Quadratic for |error| <= delta, linear beyond it.
        error = y_true - y_pred
        is_small_error = np.abs(error) <= delta
        squared_loss = 0.5 * error**2
        absolute_loss = delta * (np.abs(error) - 0.5 * delta)
        return np.mean(np.where(is_small_error, squared_loss, absolute_loss))

    @staticmethod
    def kl_divergence(y_true, y_pred, eps=1e-12):
        # Expects two probability distributions; clip to avoid division by zero and log(0).
        y_true = np.clip(y_true, eps, 1.0)
        y_pred = np.clip(y_pred, eps, 1.0)
        return np.sum(y_true * np.log(y_true / y_pred))

# Example usage:
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.2])

print("Mean Squared Error:", LossFunctions.mean_squared_error(y_true, y_pred))
print("Binary Cross-Entropy Loss:", LossFunctions.binary_cross_entropy(y_true, y_pred))

This implementation defines static methods for each loss function, allowing you to easily compute the loss given the true labels y_true and the predicted values y_pred. Predicted probabilities are clipped by a small eps so the logarithms stay finite, and you can adjust the parameters of the loss functions as needed, such as the delta parameter for the Huber loss.
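
As a further usage sketch, the remaining methods can be called the same way; the labels and values below are made up for illustration. Categorical cross-entropy expects one-hot labels, and hinge loss expects labels in {-1, +1}.

# One-hot labels and predicted class probabilities for categorical cross-entropy.
y_true_onehot = np.array([[1, 0, 0], [0, 1, 0]])
y_pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print("Categorical Cross-Entropy:", LossFunctions.categorical_cross_entropy(y_true_onehot, y_pred_probs))

# Continuous targets and predictions for Huber loss.
y_true_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_reg = np.array([2.5, 0.0, 2.0, 8.0])
print("Huber Loss:", LossFunctions.huber_loss(y_true_reg, y_pred_reg, delta=1.0))

# {-1, +1} labels and raw scores for hinge loss.
y_true_pm = np.array([1, -1, 1, -1])
scores = np.array([0.8, -0.5, 0.3, 0.1])
print("Hinge Loss:", LossFunctions.hinge_loss(y_true_pm, scores))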