Handle Imbalanced Class Distribution


In machine learning use cases like fraud detection, click prediction, or spam detection, it's common to have imbalanced labels. For example, in ad click prediction it's very common to have a 0.2% conversion rate: out of 1,000 clicks, only two lead to the desired action, such as installing the app or buying the product. Why is this a problem? With so few positive examples compared to negative examples, your model spends most of its time learning about negative examples and gets very little signal about what distinguishes the positives.
There are a few strategies to handle this.

Use class weights in the loss function. For example, in a spam detection problem where non-spam accounts for 95% of the data and spam for only 5%, we want mistakes on the rare spam class to be penalized more heavily than mistakes on the majority class. We can do this by adding per-class weights to the binary cross-entropy loss:
// w0 is the weight for class 0 (negative), w1 is the weight for class 1 (positive)

Loss = -w1 * y * log(p) - w0 * (1 - y) * log(1 - p)

Setting w1 > w0 (for example, weights inversely proportional to class frequency) makes errors on the rare positive class more costly.



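A minimal sketch of this weighted loss in NumPy (the helper name weighted_bce and the 19x weight are illustrative; most libraries expose the same idea directly, e.g., class_weight in scikit-learn and Keras, or pos_weight in PyTorch's BCEWithLogitsLoss):

```python
import numpy as np

def weighted_bce(y_true, p_pred, w0=1.0, w1=1.0, eps=1e-7):
    """Weighted binary cross-entropy: w1 scales the positive-class term,
    w0 scales the negative-class term."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return np.mean(-w1 * y_true * np.log(p)
                   - w0 * (1 - y_true) * np.log(1 - p))

# Toy example: 19 negatives and 1 positive (~5% positive rate)
y = np.array([0] * 19 + [1])
p = np.full(20, 0.05)                              # model predicting the base rate
print(weighted_bce(y, p))                          # uniform weights
print(weighted_bce(y, p, w0=1.0, w1=19.0))         # penalize missed positives more
```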
Use naive resampling: under-sample the majority class at a certain rate to reduce the imbalance in the training set. It's important to keep the validation and test data intact (no resampling), so evaluation reflects the true class distribution.
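A sketch of this, assuming NumPy and scikit-learn; the helper undersample_majority and the keep_rate value are illustrative. Note the split happens first, and only the training set is resampled:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def undersample_majority(X, y, majority_label=0, keep_rate=0.1, seed=42):
    """Keep all minority examples and only a fraction of majority examples."""
    rng = np.random.default_rng(seed)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    kept_majority = rng.choice(majority_idx,
                               size=int(len(majority_idx) * keep_rate),
                               replace=False)
    idx = np.concatenate([kept_majority, minority_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy data with ~2% positives
X = np.random.randn(10_000, 5)
y = (np.random.rand(10_000) < 0.02).astype(int)

# Split first; validation/test keep the true distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Resample only the training set
X_train_bal, y_train_bal = undersample_majority(X_train, y_train, keep_rate=0.1)
```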

Use synthetic resampling: the synthetic minority over-sampling technique (SMOTE) synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing the k nearest minority-class neighbors of that point, and adding synthetic points along the segments between the chosen point and its neighbors. For practical reasons, SMOTE is not as widely used as the other methods, especially for large-scale applications.
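A short sketch of SMOTE, assuming the imbalanced-learn package (pip install imbalanced-learn) is available; as with naive resampling, apply it only to the training split:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: ~2% positives
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.98, 0.02], random_state=0)
print("before:", Counter(y))

# k_neighbors controls how many minority neighbors are used to interpolate new points
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))
```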