Log loss
Created | |
---|---|
Tags | Loss |
Log loss, also known as logarithmic loss or cross-entropy loss, is a widely used loss function in binary and multiclass classification tasks. It measures the performance of a classification model where the predicted output is a probability value between 0 and 1.
Definition:
For binary classification, the log loss function is defined as:

\[
\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
\]

where:
- \(N\) is the number of samples,
- \(y_i\) is the true label of the \(i\)-th sample (either 0 or 1),
- \(p_i\) is the predicted probability that the \(i\)-th sample belongs to the positive class.
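As a small sketch (the function name and sample values are illustrative, not from the original), the definition above can be computed directly with NumPy; clipping the probabilities away from 0 and 1 avoids taking the log of zero:

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss computed directly from the definition above."""
    y = np.asarray(y_true, dtype=float)
    # Clip predicted probabilities into (0, 1) to avoid log(0)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# True labels and predicted probabilities of the positive class
y_true = [0, 1, 1, 0]
p_pred = [0.1, 0.9, 0.8, 0.3]
print(binary_log_loss(y_true, p_pred))  # about 0.1976
```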
Interpretation:
- Log loss measures the quality of the model's predicted probabilities by penalizing predictions that assign low probability to the true class.
- Lower log loss values indicate better model performance, with 0 representing perfect predictions.
- Log loss is sensitive to the uncertainty of predictions and penalizes confidently wrong predictions more heavily than uncertain ones.
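To illustrate the last point with a quick sketch (the specific probability values are illustrative): for a sample whose true label is 1, the per-sample penalty is \(-\log(p)\), which grows steeply as the predicted probability of the positive class shrinks:

```python
import math

# Per-sample log loss when the true label is 1 is -log(p),
# where p is the predicted probability of the positive class.
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:>4}: penalty = {-math.log(p):.2f}")
```

A confidently wrong prediction (p = 0.01 for a true positive) incurs a penalty of about 4.61, roughly forty times the penalty of a confidently correct one (p = 0.9, about 0.11).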
Applications:
- Classification Models Evaluation:
- Log loss is commonly used as a performance metric for evaluating the quality of probabilistic predictions generated by classification models, especially in scenarios where class probabilities are important, such as in medical diagnosis or fraud detection.
- Kaggle Competitions:
- Log loss is a frequently used evaluation metric in data science competitions on platforms like Kaggle. Competitors aim to minimize log loss to improve the predictive performance of their models.
- Multi-class Classification:
- Log loss can be extended to multi-class classification tasks, where it measures the accuracy of predicted probabilities across multiple classes.
Python Implementation (using scikit-learn):
```python
from sklearn.metrics import log_loss

# Example ground truth and predicted probabilities
y_true = [0, 1, 1, 0, 1]
y_prob = [[0.9, 0.1], [0.3, 0.7], [0.8, 0.2], [0.2, 0.8], [0.6, 0.4]]  # Each row: [P(class 0), P(class 1)]

# Calculate log loss
logloss = log_loss(y_true, y_prob)
print("Log Loss:", logloss)
```
In this example, `y_true` contains the true labels of the samples (0 for the negative class and 1 for the positive class), and `y_prob` contains the predicted probability of each class for each sample. We calculate the log loss using the `log_loss` function from scikit-learn's metrics module. Lower log loss values indicate better predictive performance.