L1 loss vs. L2 loss

The differences between the L1 norm and the L2 norm as loss functions can be summarized as follows:

L2 penalizes outliers more heavily, so in applications that care about large errors, where we do not want any prediction to land far from the actual value, L2 is preferable.

L1 loss and L2 loss, also known as Mean Absolute Error (MAE) and Mean Squared Error (MSE) respectively, are two commonly used loss functions in machine learning. Here are the main differences between them:

  1. Formula:
    • L1 loss (MAE): $L_1 = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
    • L2 loss (MSE): $L_2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
  2. Sensitivity to Outliers:
    • L1 loss is less sensitive to outliers than L2 loss. Because L1 loss penalizes errors linearly, large errors do not disproportionately influence the total loss. In contrast, L2 loss squares the errors, so larger errors have a much greater impact on the total loss (see the numeric sketch after this list).
  3. Robustness:
    • L1 loss is more robust to outliers and noisy data because it treats large errors more gently. It is often preferred in situations where the dataset contains outliers or when the goal is to minimize the impact of extreme errors.
    • L2 loss is sensitive to outliers due to its squaring effect, which amplifies the impact of large errors. However, it can provide better performance when the dataset is clean and free of outliers.
  4. Solution Space:
    • When the L1 norm is used as a regularization penalty on model coefficients (as in Lasso), it tends to produce sparse solutions, driving some coefficients to exactly zero. This property makes it useful for feature selection and model interpretability (see the regression sketch at the end of this note).
    • The L2 penalty (as in Ridge) typically produces dense solutions and does not encourage sparsity. It shrinks all coefficients toward zero smoothly without eliminating any of them.
  5. Gradient Behavior:
    • The gradient of L1 loss has constant magnitude for every non-zero error and is undefined at zero error. This can make optimization challenging near the minimum.
    • The gradient of L2 loss is proportional to the error, resulting in smoother optimization behavior (both gradients are compared numerically in the second sketch below).
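
To make the formulas and the outlier sensitivity concrete, here is a minimal NumPy sketch (the array values are illustrative, not from any real dataset) that computes both losses on the same predictions with and without a single outlier:

```python
import numpy as np

def l1_loss(y_true, y_pred):
    """L1 loss (MAE): mean of |y_i - y_hat_i|."""
    return np.mean(np.abs(y_true - y_pred))

def l2_loss(y_true, y_pred):
    """L2 loss (MSE): mean of (y_i - y_hat_i)^2."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_clean = np.array([1.1, 1.9, 3.2, 3.8, 5.1])     # small errors only
y_pred_outlier = np.array([1.1, 1.9, 3.2, 3.8, 15.0])  # one large error

print(l1_loss(y_true, y_pred_clean), l2_loss(y_true, y_pred_clean))      # ~0.14, ~0.022
print(l1_loss(y_true, y_pred_outlier), l2_loss(y_true, y_pred_outlier))  # ~2.12, ~20.02
```

The single bad prediction raises MAE by a factor of about 15 but MSE by a factor of roughly 900, which is exactly the squaring effect described in points 2 and 3.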
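
The gradient contrast in point 5 can be checked the same way. Differentiating the formulas above with respect to each prediction gives $\frac{1}{n}\,\mathrm{sign}(\hat{y}_i - y_i)$ for L1 and $\frac{2}{n}(\hat{y}_i - y_i)$ for L2; a short sketch using the same illustrative arrays:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 15.0])
n = len(y_true)

# dL1/dy_hat_i = sign(y_hat_i - y_i) / n -- constant magnitude, undefined at zero error
grad_l1 = np.sign(y_pred - y_true) / n

# dL2/dy_hat_i = 2 * (y_hat_i - y_i) / n -- proportional to the error
grad_l2 = 2 * (y_pred - y_true) / n

print(grad_l1)  # [ 0.2  -0.2   0.2  -0.2   0.2 ] -- same step size for every error
print(grad_l2)  # [ 0.04 -0.04  0.08 -0.08  4.0 ] -- the outlier dominates
```

The L1 gradient gives every example the same pull regardless of error size, while the L2 gradient lets the outlier dominate the update and shrinks smoothly to zero as the fit improves.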

In summary, L1 loss and L2 loss have different characteristics and suit different scenarios. L1 is preferred when the dataset contains outliers, or, used as a penalty, when sparsity and feature selection are important. L2 loss is suitable for clean datasets and provides smoother optimization behavior.
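
As a final note on point 4: the sparsity effect shows up when the L1 norm is applied as a regularization penalty on the coefficients (Lasso) rather than as the data-fit loss. A small scikit-learn sketch on synthetic data (the alpha values are hypothetical, chosen only for illustration) shows the difference:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.5, 1.0]                # only 3 of 20 features matter
y = X @ true_coef + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty on the coefficients
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty on the coefficients

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

On data like this, Lasso typically zeroes out most of the 17 irrelevant coefficients, while Ridge shrinks them toward zero without ever making any exactly zero.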