Logistic Regression
Tags: Basic Concepts
Introduction
Logistic Regression is a statistical method used for binary classification tasks, and it can also be extended to multiclass classification. It models the probability that a given input belongs to a particular category. Despite its name suggesting regression analysis, logistic regression predicts a discrete set of classes rather than a continuous value, making it a fundamental classification algorithm in the machine learning toolbox.
How Logistic Regression Works
Logistic Regression uses the logistic function (also known as the sigmoid function) to convert linear combinations of features into values between 0 and 1, which are interpreted as probabilities. The logistic function is defined as:

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

where \(z\) is the linear combination of the input features (\(X\)) and weights (\(w\)), plus a bias term (\(b\)), i.e., \(z = w^T X + b\). The output of the sigmoid function is then used to determine the probability of the input belonging to the positive class (usually denoted as "1").
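As a quick illustration, here is a minimal NumPy sketch of the sigmoid; the sample inputs are arbitrary and only show how large negative values map near 0 and large positive values map near 1:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input into the (0, 1) interval."""
    return 1 / (1 + np.exp(-z))

# Arbitrary sample inputs to illustrate the squashing behavior
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # ~[0.00005, 0.269, 0.5, 0.731, 0.99995]
```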
Model Training
Training a logistic regression model involves finding the set of weights (\(w\)) that minimize a cost function, which for binary classification tasks is typically the binary cross-entropy loss, also known as the log loss. The cost function is given by:

\[J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]\]

where \(N\) is the number of training examples, \(y_i\) is the actual label of the \(i\)th example, and \(\hat{y}_i\) is the predicted probability that the \(i\)th example belongs to the positive class. The optimization of \(J(w)\) is typically performed using gradient descent or another optimization algorithm.
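For concreteness, here is a minimal NumPy sketch of this loss; the labels and probabilities are made-up illustrative values, and the small epsilon is an added safeguard against \(\log(0)\), not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss over N examples; eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Made-up labels and predicted probabilities
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # ~0.266
```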
Decision Boundary
The decision boundary in logistic regression is a property that makes it very intuitive. It is the set of points where the predicted probability of belonging to the positive class is exactly 50%; since \(\sigma(0) = 0.5\), this corresponds to the linear equation \(w^T X + b = 0\). For two-dimensional data this boundary is a line; for higher-dimensional data it can be a plane or a hyperplane.
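A minimal sketch of how the boundary can be read off a fitted model, assuming a tiny made-up two-feature dataset: the learned `coef_` and `intercept_` of a scikit-learn model define the line \(w_1 x_1 + w_2 x_2 + b = 0\):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up 2-feature dataset: class 1 when x1 + x2 is large
X = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 2], [2, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
(w1, w2), b = model.coef_[0], model.intercept_[0]

# Points on the boundary satisfy w1*x1 + w2*x2 + b = 0,
# so for any x1 the boundary's x2 coordinate is:
x1 = 1.5
x2 = -(w1 * x1 + b) / w2
print(f"Boundary passes through ({x1}, {x2:.2f})")
```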
Multiclass Classification
For multiclass classification tasks, logistic regression can be extended using techniques such as "One-vs-Rest" (OvR) or "One-vs-One" (OvO), or by using the softmax function in place of the sigmoid for direct multiclass classification (multinomial logistic regression).
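As a brief illustration (the Iris dataset here is just a convenient example), scikit-learn's LogisticRegression handles multiclass targets out of the box; in recent versions the default solver fits a multinomial (softmax) model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # three classes: 0, 1, 2
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))           # predicted class labels
print(clf.predict_proba(X[:3]))     # one probability per class, rows sum to 1
```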
Python Implementation Example
Implementing logistic regression for a binary classification task using scikit-learn is straightforward:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = (data.target != 0) * 1  # Convert to binary classification problem

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f"Model accuracy: {accuracy * 100:.2f}%")
```
Pros and Cons of Logistic Regression
Pros:
- Simple and efficient for linearly separable data.
- Provides probabilities for outcomes, which can be useful for decision-making.
- Easy to implement, interpret, and understand.
Cons:
- Assumes a linear decision boundary, which limits the complexity of the relationships it can model.
- Can be outperformed by more complex models on non-linear data.
- Sensitive to imbalanced data and outliers.
Logistic regression remains a popular choice for binary classification problems, especially as a baseline model, due to its simplicity, interpretability, and efficiency.
Comparison Between Linear Regression and Logistic Regression
Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes.
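A minimal sketch of this difference on made-up one-dimensional data: the linear model's raw outputs are unbounded, while the logistic model returns probabilities in (0, 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up 1-D data: label flips from 0 to 1 as x grows
X = np.array([[0], [1], [2], [3], [4], [5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-2], [2.5], [8]])
print(lin.predict(X_new))               # unbounded continuous values (can be < 0 or > 1)
print(log.predict_proba(X_new)[:, 1])   # probabilities, always in (0, 1)
```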
Log Loss and Maximum Likelihood Estimation
The log loss is not arbitrary: it follows from maximum likelihood estimation. Maximizing the likelihood of the observed labels under a Bernoulli model of \(P(y \mid x)\) is equivalent to minimizing the binary cross-entropy \(J(w)\) defined above.
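A short sketch of that equivalence, using the notation already introduced (with \(\hat{y}_i = \sigma(w^T x_i + b)\)): the likelihood of the observed labels under a Bernoulli model is

\[L(w) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{1 - y_i}\]

and taking the logarithm turns the product into a sum:

\[\log L(w) = \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]\]

Maximizing \(\log L(w)\) is therefore the same as minimizing \(J(w) = -\frac{1}{N}\log L(w)\), which is exactly the binary cross-entropy above.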
Python code example for implementing Logistic Regression from scratch:
```python
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        # Initialize weights and bias to zeros
        self.weights = np.zeros(X.shape[1])
        self.bias = 0

        for i in range(self.num_iterations):
            # Linear combination
            model = np.dot(X, self.weights) + self.bias
            # Sigmoid activation
            predictions = self.sigmoid(model)

            # Compute gradient
            dw = (1 / X.shape[0]) * np.dot(X.T, (predictions - y))
            db = (1 / X.shape[0]) * np.sum(predictions - y)

            # Update weights and bias
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        model = np.dot(X, self.weights) + self.bias
        predictions = self.sigmoid(model)
        return [1 if i > 0.5 else 0 for i in predictions]

# Assuming X_train, X_test, y_train, y_test are already defined
model = LogisticRegression(learning_rate=0.01, num_iterations=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Evaluate accuracy
accuracy = np.mean(predictions == y_test)
print(f"Model accuracy: {accuracy}")
```
This class initializes with a learning rate and a number of iterations. The `fit` method trains the model using gradient descent, and the `predict` method outputs predictions for given input features. Note: this is a basic example and, for simplicity, lacks features like regularization and convergence checking.
Multiclass Logistic Regression
Procedure (One-vs-Rest):
- Divide the problem into one binary classification problem per class.
- For each sub-problem, select one class as the positive class ("yes") and lump all the other classes together as the negative class ("no").
- Predict the probability that an observation belongs to the positive class of each sub-problem.
- Prediction: take the class whose classifier outputs the highest probability, i.e. \(\hat{y} = \arg\max_k P(y = k \mid x)\); see the sketch after this list.
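A minimal sketch of this procedure, assuming the from-scratch `LogisticRegression` class defined earlier is in scope; `one_vs_rest_fit` and `one_vs_rest_predict` are hypothetical helper names, not part of any library:

```python
import numpy as np

def one_vs_rest_fit(X, y):
    """Train one binary classifier per class: that class vs. all the rest."""
    models = {}
    for c in np.unique(y):
        binary_y = (y == c).astype(int)   # 1 for class c ("yes"), 0 otherwise ("no")
        clf = LogisticRegression(learning_rate=0.1, num_iterations=3000)
        clf.fit(X, binary_y)
        models[c] = clf
    return models

def one_vs_rest_predict(models, X):
    """Pick the class whose binary model assigns the highest probability."""
    classes = sorted(models)
    # Probability of the positive class from each binary model
    probs = np.column_stack(
        [models[c].sigmoid(np.dot(X, models[c].weights) + models[c].bias) for c in classes]
    )
    return np.array(classes)[np.argmax(probs, axis=1)]
```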
Softmax activation:
Where the sigmoid squashes a single score, the softmax turns a vector of \(K\) class scores \(z\) into a probability distribution over the classes:

\[\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K\]
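A minimal NumPy sketch of the softmax, with the standard max-subtraction trick (an added numerical-stability detail, not part of the formula); the scores are made-up:

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw class scores into probabilities that sum to 1."""
    z = z - np.max(z)            # numerical stability; does not change the result
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1])   # made-up scores for three classes
print(softmax(scores))               # ~[0.659, 0.242, 0.099]
```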