Why is L1 regularization supposed to lead to more sparsity than L2?

Tags: Regularization

An obvious disadvantage of ridge regression is model interpretability: it shrinks the coefficients of the least important predictors very close to zero, but it never makes them exactly zero, so the final model always includes all predictors. In the case of the lasso, by contrast, the L1 penalty forces some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large. The lasso therefore also performs variable selection and is said to yield sparse models.
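A minimal sketch of this difference, using scikit-learn's Lasso and Ridge on synthetic data (the dataset shape, true coefficients, and alpha values are illustrative choices, not tuned recommendations):

```python
# Illustrative comparison: with the same penalty strength, Lasso drives the
# coefficients of irrelevant features exactly to zero, while Ridge only
# shrinks them.  The dataset and alpha are arbitrary illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first three features influence the target; the rest are noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))  # noise coefficients are exactly 0.0
print("Ridge:", np.round(ridge.coef_, 3))  # noise coefficients are small but nonzero
print("exact zeros: Lasso", (lasso.coef_ == 0).sum(), "| Ridge", (ridge.coef_ == 0).sum())
```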

To see this analytically, consider a single weight w with original loss L(w). The L2 penalty λw² has derivative 2λw, which vanishes at w = 0; so if L'(0) ≠ 0, the penalized objective still has a nonzero derivative at 0, and its optimum cannot sit exactly at w = 0. The L1 penalty λ|w| is not differentiable at 0; its subgradient there is the interval [−λ, λ]. As long as the regularization coefficient λ is larger than |L'(0)|, the absolute value of the derivative of the original cost function at 0, zero lies in the subgradient of the penalized objective at w = 0, and w = 0 becomes a minimum.
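A quick numeric check of this argument, using the toy loss L(w) = (w - 1)^2, so that L'(0) = -2 (both the loss and the value of λ are illustrative):

```python
# Numeric check of the subgradient argument with the toy loss L(w) = (w - 1)^2,
# for which L'(0) = -2.  Here lam = 3 > |L'(0)|, so w = 0 should minimize the
# L1-penalized objective, while the L2-penalized optimum stays nonzero.
import numpy as np

w = np.linspace(-2.0, 2.0, 100001)   # grid that contains w = 0 exactly
L = (w - 1.0) ** 2
lam = 3.0

w_l1 = w[np.argmin(L + lam * np.abs(w))]
w_l2 = w[np.argmin(L + lam * w ** 2)]

print(f"L1 minimizer: {w_l1:.4f}")   # 0.0000: exactly zero
print(f"L2 minimizer: {w_l2:.4f}")   # 1 / (1 + lam) = 0.2500: shrunk, not zero
```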

L1 regularization (Lasso) is more likely to produce sparse solutions compared to L2 regularization (Ridge) due to the geometric properties of the penalty terms.

Geometric Interpretation:

  1. L1 Regularization (Lasso):
    • L1 regularization adds a penalty term proportional to the sum of the absolute values of the coefficients ($\sum_{i=1}^{n} |w_i|$).
    • The constraint region of L1 regularization is a diamond (a rotated square) in two-dimensional coefficient space.
    • Because the diamond has corners that lie on the axes, the contours of the loss are likely to first touch the constraint region at a corner, where one or more coefficients are exactly zero.
    • As a result, Lasso tends to perform feature selection, effectively ignoring less important features by setting their coefficients to zero.
  2. L2 Regularization (Ridge):
    • L2 regularization adds a penalty term proportional to the sum of the squared magnitudes of the coefficients ($\sum_{i=1}^{n} w_i^2$).
    • The constraint region of L2 regularization is a circle (a sphere in higher dimensions) in coefficient space.
    • Because the circle has no corners, the loss contours almost never touch it exactly on an axis, so coefficients shrink toward zero but are rarely forced to exactly zero.
    • As a result, Ridge tends to shrink all coefficients smoothly, without performing aggressive feature selection (a numeric sketch follows this list).
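The corner-touching picture has an algorithmic counterpart: in proximal gradient descent (ISTA), the proximal step for the L1 penalty is soft-thresholding, which snaps small coordinates exactly onto the axes (the diamond's corners), whereas the L2 proximal step only rescales. A minimal 2-D sketch; the data, λ, step size, and iteration count are illustrative assumptions:

```python
# Proximal-gradient sketch: the L1 prox (soft-thresholding) lands coordinates
# exactly on the axes, while the L2 prox only rescales.  The data, lam, and
# iteration count are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, 0.3]) + rng.normal(scale=0.1, size=50)

step = 0.5 / np.linalg.eigvalsh(X.T @ X).max()   # 1/L for the smooth part
lam = 60.0                                       # strong enough to zero the weak coefficient

def grad(w):
    return 2.0 * X.T @ (X @ w - y)               # gradient of ||Xw - y||^2

w_l1 = np.zeros(2)
w_l2 = np.zeros(2)
for _ in range(500):
    z = w_l1 - step * grad(w_l1)
    w_l1 = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of lam*||w||_1
    z = w_l2 - step * grad(w_l2)
    w_l2 = z / (1.0 + 2.0 * step * lam)                          # prox of lam*||w||_2^2

print("L1 (lasso) solution:", np.round(w_l1, 4))   # second coordinate exactly 0.0
print("L2 (ridge) solution:", np.round(w_l2, 4))   # both coordinates shrunk, nonzero
```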

Summary:

While both L1 and L2 regularization help prevent overfitting and improve the generalization performance of models, L1 regularization (Lasso) tends to produce sparse solutions by setting many coefficients exactly to zero, which makes it well suited to feature selection. In contrast, L2 regularization (Ridge) shrinks coefficients toward zero without zeroing them out, yielding smoother, denser solutions. The choice between Lasso and Ridge depends on the characteristics of the dataset and the goals of the modeling task.