Why doesn’t MSE work with logistic regression?

Tags: Metrics

The prediction of logistic regression is a non-linear function of the parameters (because of the sigmoid transform). Feeding this prediction into a squared-error term, as MSE does, produces a non-convex cost function with many local minima, which makes it difficult for gradient descent to find the global minimum.

Instead, logistic regression uses the cost function derived from maximum likelihood estimation (MLE), the negative log-likelihood (cross-entropy), which is convex in the parameters.
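For concreteness, with the sigmoid σ(z) = 1 / (1 + exp(-z)), labels yᵢ ∈ {0, 1}, and parameters θ (notation introduced here for illustration, not taken from the note above), the two candidate cost functions are:

```latex
% MSE applied to the sigmoid output: non-convex in \theta
J_{\mathrm{MSE}}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\sigma(\theta^{\top}x_i) - y_i\bigr)^{2}

% Negative log-likelihood from MLE (cross-entropy): convex in \theta
J_{\mathrm{CE}}(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\log\sigma(\theta^{\top}x_i) + (1 - y_i)\log\bigl(1 - \sigma(\theta^{\top}x_i)\bigr)\Bigr]
```

Minimizing J_CE is the same as maximizing the Bernoulli log-likelihood of the observed labels, which is what “using MLE as the cost function” refers to.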

Mean Squared Error (MSE) is not typically used as the loss function for logistic regression because of the nature of logistic regression's output and objective. Logistic regression is used for binary classification tasks, where the goal is to predict the probability that a given input belongs to the positive class (labels are usually coded 0 or 1). The main reasons MSE is not suitable for logistic regression are:

  1. Non-Convexity with Respect to the Parameters: Logistic regression maps the linear score θᵀx through the sigmoid (logistic) function to produce a probability. Applying MSE to that output composes the sigmoid with a squared-error term, and the resulting loss surface is non-convex in the parameters: it can have plateaus and multiple local minima, so gradient-based optimizers are not guaranteed to find the global minimum efficiently (a small numerical sketch after this list illustrates the effect).
  2. Mismatch with the Model's Output: Logistic regression outputs the probability that an input belongs to the positive class, while MSE measures the squared difference between predicted and actual numerical values. For binary classification it is more meaningful to measure how well the model discriminates between the classes than how small the squared gap between a predicted probability and a 0/1 label is.
  3. Fit to the Probabilistic Objective: In classification it is important that the predicted probabilities reflect the true likelihood of the outcomes. Cross-entropy (log loss) is exactly the negative log-likelihood of a Bernoulli outcome, so minimizing it directly penalizes the divergence between the predicted probabilities and the actual class labels; MSE does not arise from that likelihood, so the training objective is disconnected from the probabilistic interpretation of the output.
  4. Weak Penalty for Confident Mistakes: Log loss penalizes an incorrect confident prediction far more heavily than an incorrect uncertain one; the penalty grows without bound as the predicted probability of the true class approaches 0. MSE caps the per-example penalty at 1, so a confidently wrong prediction is punished comparatively lightly and, because the sigmoid is saturated there, it generates almost no gradient to correct it.
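To make point 1 concrete, here is a minimal NumPy sketch (the toy data, weight range, and helper names are mine, purely for illustration). It sweeps a single weight w and checks the discrete second differences of each loss along that slice: the MSE curve bends concavely as it saturates toward its plateaus, while the cross-entropy curve stays convex.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset; the values and the deliberately noisy labels are illustrative only.
x = np.array([-4.0, -2.0, 1.0, 3.0, 5.0])
y = np.array([ 1.0,  0.0, 1.0, 0.0, 1.0])

# Sweep a single weight w (no bias) and evaluate both losses along that slice.
ws = np.linspace(-8.0, 8.0, 2001)
z = np.outer(ws, x)                       # shape: (n_weights, n_points)
p = sigmoid(z)

# Mean squared error of the sigmoid output against the 0/1 labels.
mse = np.mean((p - y) ** 2, axis=1)

# Cross-entropy via the numerically stable identity -log(sigmoid(z)) = log(1 + exp(-z)).
ce = np.mean(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z), axis=1)

def looks_convex(curve, tol=1e-9):
    # A convex curve sampled on a uniform grid has non-negative second differences.
    return bool(np.all(np.diff(curve, 2) >= -tol))

print("MSE slice convex?          ", looks_convex(mse))  # expected: False
print("Cross-entropy slice convex?", looks_convex(ce))   # expected: True
```

The one-dimensional slice is only a visualization device; the general statements are that cross-entropy is convex in the full parameter vector, while MSE composed with the sigmoid generally is not.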

For these reasons, logistic regression typically uses a loss function like cross-entropy (log loss), which is convex with respect to the model's parameters, so any minimum that gradient descent reaches is the global one, and which matches the probabilistic nature of the outputs in classification tasks. Cross-entropy directly measures the divergence between the predicted probability distribution and the actual distribution of the labels, making it a natural fit for logistic regression.
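A standard supporting detail (the derivation is not shown in this note): the gradients of the two cost functions defined above differ in an instructive way.

```latex
% Gradient of cross-entropy: the sigmoid's derivative cancels out
\nabla_{\theta} J_{\mathrm{CE}}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\sigma(\theta^{\top}x_i) - y_i\bigr)\,x_i

% Gradient of MSE: an extra \sigma(1-\sigma) factor appears
\nabla_{\theta} J_{\mathrm{MSE}}(\theta) = \frac{2}{n}\sum_{i=1}^{n}\bigl(\sigma(\theta^{\top}x_i) - y_i\bigr)\,\sigma(\theta^{\top}x_i)\bigl(1 - \sigma(\theta^{\top}x_i)\bigr)\,x_i
```

The extra σ(1 − σ) factor in the MSE gradient goes to zero whenever the sigmoid saturates, including when the model is confidently wrong, so learning stalls exactly where a large correction is needed; the cross-entropy gradient has no such factor.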