Handling Outliers in Data

Tags: Data prepare

How would you remove outliers when trying to estimate a flat plane from noisy samples?

Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. [src]
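A minimal sketch of RANSAC plane fitting using scikit-learn's RANSACRegressor on synthetic data (the plane coefficients and outlier fraction below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)

# Noisy samples from the plane z = 2x - y + 3, plus 10% gross outliers.
X = rng.uniform(-5, 5, size=(200, 2))
z = 2 * X[:, 0] - X[:, 1] + 3 + rng.normal(0, 0.1, 200)
z[:20] += rng.uniform(20, 50, 20)  # corrupt the first 20 points

# RANSAC repeatedly fits the default LinearRegression to random minimal
# subsets and keeps the model with the largest consensus (inlier) set.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)
ransac.fit(X, z)

print(ransac.estimator_.coef_, ransac.estimator_.intercept_)  # ≈ [2, -1], 3
print((~ransac.inlier_mask_).sum())  # points rejected as outliers
```

Because the final fit uses only the consensus set, the corrupted points get no influence on the estimated plane.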

Outlier handling is very situation-dependent. In some cases outliers are problematic, at other times, outliers are best ignored or even embraced. For this question we look at ways to handle problematic outliers. Whether we should handle outliers is a separate discussion.

Huber Loss

Huber loss is a loss function that combines MSE and MAE: it is quadratic for small residuals and linear for large ones. It is less sensitive to outliers than squared error loss and has the advantage of not requiring outliers to be removed from the dataset, since it penalizes large residuals far less severely than MSE alone.
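The piecewise definition is short enough to write out directly (a minimal NumPy sketch; the threshold delta=1.0 is an assumed default):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic (MSE-like) for residuals <= delta, linear (MAE-like) above."""
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# A small residual (0.5) is penalized quadratically -> 0.125,
# while a large residual (10) is only penalized linearly -> 9.5
# (squared error would give 50).
print(huber_loss(np.array([0.0, 0.0]), np.array([0.5, 10.0])))
```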

Winsorization

Winsorization clamps outlier values to the value at a chosen percentile. For example, 90% winsorization limits all values below the 5th percentile to the value at the 5th percentile, and all values above the 95th percentile to the value at the 95th percentile.
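SciPy ships this as `scipy.stats.mstats.winsorize`; a small sketch on made-up data (note its `limits` are per tail, so `[0.1, 0.1]` clamps the bottom and top 10%):

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1, 4, 5, 6, 7, 8, 9, 10, 12, 500])

# Clamp the lowest 10% and highest 10% of values: with 10 points, the 1
# becomes 4 (next-lowest value) and the 500 becomes 12 (next-highest).
clamped = winsorize(data, limits=[0.1, 0.1])
print(clamped)
```

Unlike trimming, winsorization keeps the sample size unchanged, which matters for paired data or fixed-length feature vectors.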

Discretization

One common form of discretization is “binning”, which sorts ordered values into groups and then replaces every value in a group with that group's mean or median. This reduces the number of distinct values, dampens outlier impact, and smooths the data.
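A sketch of bin-mean smoothing with pandas, assuming equal-frequency bins via `pd.qcut` (the bin count and data are illustrative):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Sort values into 3 equal-frequency bins, then replace each value with
# its bin's mean: the outlier 100 is pulled toward its bin companions.
bins = pd.qcut(values, q=3, labels=False)
smoothed = values.groupby(bins).transform("mean")
print(smoothed.tolist())
```

The outlier still raises its own bin's mean, so bin-median smoothing (`transform("median")`) is the more robust variant.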

Handling outliers in data is an important preprocessing step in data analysis and machine learning. Outliers are data points that deviate significantly from the rest of the data and may negatively impact the performance of models. Here are some common techniques for handling outliers:

  1. Identify Outliers:
    • Visual inspection: Plot the data using histograms, box plots, or scatter plots to identify data points that lie far from the bulk of the data.
    • Statistical methods: Calculate summary statistics such as the mean, median, standard deviation, and interquartile range (IQR) to detect outliers based on cutoff values or z-scores.
  2. Remove Outliers:
    • Trimming: Remove outliers from the dataset based on a predefined threshold, such as removing data points beyond a certain number of standard deviations from the mean.
    • Percentile-based trimming: Remove data points beyond a certain percentile (e.g., the 99th) to eliminate extreme values.
  3. Transform Data:
    • Log transformation: Apply a logarithmic transformation to reduce the impact of extreme values and make the distribution more symmetric.
    • Box-Cox transformation: Use the Box-Cox transformation to stabilize variance and make the data more Gaussian-like.
  4. Winsorization:
    • Replace outliers with the nearest non-outlying values, e.g., clamp them to the values at a chosen percentile range.
  5. Robust Statistical Methods:
    • Use robust estimators that are less sensitive to outliers, such as the median instead of the mean for central tendency and the median absolute deviation (MAD) instead of the standard deviation for dispersion.
  6. Model-Based Approaches:
    • Train models that are robust to outliers, such as robust regression models (e.g., RANSAC, Huber regression) or tree-based models (e.g., Random Forests) that are less sensitive to individual data points.
  7. Anomaly Detection:
    • Apply anomaly detection algorithms, such as isolation forests, k-nearest neighbors (k-NN), or autoencoders, to detect and flag outliers in the dataset.
  8. Feature Engineering:
    • Create new features or transformations of existing features that are less sensitive to outliers, such as ratios, differences, or rank-based features.
  9. Data Partitioning:
    • Partition the data into subsets based on known or suspected outlier characteristics and apply different preprocessing techniques to each subset.
  10. Consider Domain Knowledge:
    • Leverage domain knowledge to determine the nature of outliers and choose appropriate techniques for handling them. For example, in financial data an extreme price swing may be a legitimate outlier worth keeping, while a negative stock price almost certainly indicates a data error.
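The identify-then-trim steps above can be sketched with Tukey's IQR fences (a common default; the data and fence factor k=1.5 are illustrative):

```python
import numpy as np

def iqr_trim(x, k=1.5):
    """Drop points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x >= q1 - k * iqr) & (x <= q3 + k * iqr)
    return x[mask]

x = np.array([10, 11, 12, 12, 13, 14, 15, 300])
print(iqr_trim(x))  # the 300 falls outside the fences and is dropped
```

Because the fences are built from quartiles rather than the mean and standard deviation, the outlier itself barely moves the cutoffs — unlike a plain z-score rule.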
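The anomaly-detection route can be sketched with scikit-learn's IsolationForest on synthetic data (the planted outlier cluster and `contamination` value are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # bulk of the data
               rng.uniform(8, 10, (5, 2))])   # 5 planted outliers

# Isolation forests flag points that are easy to isolate with random
# splits; contamination sets the expected outlier fraction.
iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)                   # -1 = outlier, 1 = inlier
print(np.flatnonzero(labels == -1))
```

Flagged points can then be dropped, down-weighted, or routed to one of the other techniques above rather than discarded blindly.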

It's important to carefully evaluate the impact of outlier handling techniques on the data distribution and model performance, as inappropriate handling may lead to loss of information or biased results. Additionally, it's advisable to document the outlier handling process to ensure transparency and reproducibility of the analysis.