Cross-Validation / Stratified Cross-Validation

Cross-Validation in Machine Learning

Cross-validation is a statistical method used to estimate the skill of machine learning models. It helps protect against overfitting in a predictive model, particularly when the amount of data is limited. In cross-validation, the dataset is randomly split into k groups, or folds, of approximately equal size. The model is trained on k-1 of these folds and tested on the remaining fold. This process is repeated k times, with each of the k folds used exactly once as the test set. The k results can then be averaged (or otherwise combined) to produce a single estimate of model performance. The most common technique is k-fold cross-validation.
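
As a minimal sketch of the loop just described, here is what the split-train-test-average cycle looks like with scikit-learn's KFold splitter (the iris dataset and logistic regression model are stand-ins chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
import numpy as np

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out fold

print("Per-fold accuracy:", scores)
print("Mean accuracy:", np.mean(scores))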

Types of Cross-Validation

  1. K-Fold Cross-Validation: The dataset is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set, and the other k-1 subsets are put together to form the training set.
  2. Stratified K-Fold Cross-Validation: Similar to k-fold cross-validation, but the folds are made by preserving the percentage of samples for each class. This is particularly useful for imbalanced datasets.
  3. Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold cross-validation where k equals the number of data points in the dataset. For n data points, n models are trained, each on all data points except one and tested on the remaining single data point.
  4. Leave-P-Out Cross-Validation (LPOCV): Similar to LOOCV, but instead of leaving one observation out, p observations are left out in each iteration.
  5. Repeated K-Fold Cross-Validation: The entire k-fold procedure is repeated n times, with the data reshuffled before each repetition, yielding n x k performance estimates (see the sketch after this list).
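
As referenced in the list above, a brief sketch of how these variants are exposed in scikit-learn; the dataset and model are placeholders, and LeavePOut appears only in a comment because its combinatorial number of fits makes it impractical to run here:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold, LeaveOneOut,
                                     RepeatedKFold, cross_val_score)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Each splitter plugs into cross_val_score via the cv argument.
# LeavePOut(p=2) is omitted: on 150 samples it would require C(150, 2) = 11175 fits.
splitters = {
    "k-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "stratified k-fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),  # one fit per data point
    "repeated k-fold": RepeatedKFold(n_splits=5, n_repeats=3, random_state=0),
}

for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy = {scores.mean():.3f} over {len(scores)} fits")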

Implementing K-Fold Cross-Validation in Python

Here's an example using Scikit-learn to perform k-fold cross-validation:

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a logistic regression model
model = LogisticRegression(max_iter=200)

# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy scores for the 5 folds: ", scores)
print("Mean cross-validation score: {:.2f}".format(scores.mean()))

Pros and Cons of Cross-Validation

Pros:

  - Gives a more reliable estimate of performance on unseen data than a single train/test split.
  - Makes efficient use of limited data, since every observation is used for both training and testing.
  - Averaging over folds reduces the variance of the performance estimate.

Cons:

  - Computationally expensive, since the model must be trained k times (or n times for LOOCV).
  - Not directly suitable for time-series data, where random splits break the temporal order.
  - Results can still be optimistic if preprocessing leaks information across folds.

Applications

Cross-validation is widely used in machine learning for tuning hyperparameters, comparing the performance of different models, and as part of the model selection process to ensure that the chosen model has not just memorized the training data but can generalize well to unseen data.
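
For instance, hyperparameter tuning with cross-validation is commonly done via scikit-learn's GridSearchCV, which scores every candidate setting by k-fold cross-validation; the parameter grid below is illustrative only:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative grid: each candidate value of C is scored by 5-fold CV
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Best mean CV accuracy: {:.2f}".format(search.best_score_))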

In cross-validation, data is split into k equally sized folds. One of the folds is used as the validation set and the rest are used to train the model, yielding a score. This process is repeated until each fold has served as the validation set. The average of the scores is then used to assess the performance of the overall model.

Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation this split is done randomly, but in stratified cross-validation the split preserves the ratio of the categories in both the training and validation datasets.

For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.
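
A small sketch illustrating this difference on a synthetic 10%/90% dataset (the labels below are fabricated purely for illustration):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 10% category A (1), 90% category B (0)
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    counts = [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]
    print(f"{name}: category-A samples per validation fold = {counts}")

With stratification, every validation fold contains exactly two category-A samples; with a plain random split, the counts vary from fold to fold.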

Stratified cross-validation may be applied in the following scenarios:

  1. Datasets with imbalanced class distributions, where a random split could under-represent (or entirely miss) a minority class in some folds.
  2. Small datasets, where even moderate imbalance across folds can distort the performance estimate.
  3. Classification problems in general, where keeping class proportions consistent across folds makes fold scores more comparable.