Common Resampling Use Cases

Tags: Training

Because their datasets are so large, big companies like Facebook and Google more commonly downsample the dominant (majority) class. In a training pipeline, if your feature store has a SQL interface, you can use the built-in RAND() function to downsample your dataset.

-- Sample roughly 10% of the rows (source: nqbao.medium.com)
-- The function name varies by engine, e.g. RANDOM() in PostgreSQL
SELECT d.* FROM dataset d WHERE RAND() < 0.1;

For deep learning models, we can sometimes downsample the majority class examples and then upweight the ones we keep. The downsampling makes training faster, and the upweighting keeps the model's predictions calibrated to the true class distribution.

$$\text{example\_weight} = \text{original\_weight} \times \text{downsampling\_factor}$$
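The downsample-then-upweight idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `downsample_and_upweight`, the `(features, label, weight)` tuple layout, and the toy data are all hypothetical and not taken from any particular library or pipeline.

```python
import random


def downsample_and_upweight(examples, majority_label, downsampling_factor, seed=0):
    """Keep each majority-class example with probability 1 / downsampling_factor
    and multiply the weight of every kept majority example by downsampling_factor.

    `examples` is assumed to be an iterable of (features, label, weight) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    keep_prob = 1.0 / downsampling_factor
    result = []
    for features, label, weight in examples:
        if label == majority_label:
            if rng.random() < keep_prob:
                # example_weight = original_weight * downsampling_factor
                result.append((features, label, weight * downsampling_factor))
        else:
            # Minority-class examples are kept unchanged.
            result.append((features, label, weight))
    return result


# Toy data (made up): 10,000 negatives (majority) and 100 positives, weight 1.0 each.
examples = [(i, 0, 1.0) for i in range(10_000)] + [(i, 1, 1.0) for i in range(100)]
sampled = downsample_and_upweight(examples, majority_label=0, downsampling_factor=10)

kept_negatives = [e for e in sampled if e[1] == 0]
kept_positives = [e for e in sampled if e[1] == 1]
total_negative_weight = sum(w for _, _, w in kept_negatives)
```

With a downsampling factor of 10, roughly one in ten negatives survives, and each survivor carries a weight of 10, so the total negative weight stays close to the original 10,000. The resulting weights would then typically be passed to the training loss as per-example weights.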

Quiz about Weight for Positive Class