Data augmentation

Created	@May 6, 2022
Tags	Data prepare

34) What is data augmentation? Can you give some examples? [src]

Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the target is not changed, or it is changed in a known way.
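A small illustration of a target that "is changed in a known way": when an image with a bounding-box label is flipped horizontally, the box must be mirrored too. This is only a sketch with NumPy; the function name hflip_with_box and the (x_min, y_min, x_max, y_max) box convention with an exclusive x_max are assumptions made for the example.

```python
import numpy as np

def hflip_with_box(image, box):
    """Flip an H x W (x C) image left-right and mirror its bounding box.

    box is (x_min, y_min, x_max, y_max) in pixels, with x_max exclusive
    (an assumed convention for this sketch).
    """
    width = image.shape[1]
    x_min, y_min, x_max, y_max = box
    flipped = image[:, ::-1]
    # Mirroring swaps the roles of the left and right edges of the box.
    new_box = (width - x_max, y_min, width - x_min, y_max)
    return flipped, new_box

image = np.zeros((4, 6), dtype=np.uint8)
image[1:3, 0:2] = 255            # a bright object near the left edge
flipped, new_box = hflip_with_box(image, (0, 1, 2, 3))
print(new_box)                   # the box now sits near the right edge
```

For a plain classification label (for example "cat"), the same flip would leave the target unchanged.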

Computer vision is one of the fields where data augmentation is very useful. There are many modifications that we can apply to images:

Resize

Horizontal or vertical flip

Rotate

Add noise

Deform

Translate

CutMix

Gaussian blur

Modify colors

Each problem needs a customized data augmentation pipeline. For example, in OCR, flipping an image changes the text it contains and won't be beneficial; however, resizing and small rotations may help.

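Resampling techniques for imbalanced classes are closely related:

Random oversampling: duplicate samples from the class with less data until the classes are balanced.
Undersampling: keep only a random subset of the samples from the class with more data.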
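The two resampling strategies can be sketched with NumPy index sampling. The function names and the choice to balance every class to the largest (or smallest) class size are assumptions for this example, not the only way to do it.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate rows (sampling with replacement) until every class
    has as many samples as the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=target - n, replace=True)
        parts.append(np.concatenate([members, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(np.bincount(y_over), np.bincount(y_under))
```

Oversampling keeps all the data but repeats minority rows exactly, which can encourage overfitting to them; undersampling discards majority rows instead.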

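SMOTE: oversampling by interpolation, x_new = x + rand(0, 1) * (x_n - x), where x_n is one of the k nearest neighbors of x within the same class.
Bagging: repeated undersampling. Randomly draw multiple subsets of the class with more data, each the same size as the class with less data, train one model per subset, and combine the predictions of the models.
Sample weight: compute a weight for each sample (for example, inversely proportional to its class frequency) and apply it to that sample's term in the loss.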
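As a rough sketch of two of these ideas, the code below interpolates between a minority sample and one of its nearest minority neighbors (the core step of SMOTE) and computes inverse-frequency sample weights. The function names are made up for this example, the neighbor search is a brute-force distance matrix that only suits small arrays, and the weight formula n_samples / (n_classes * class_count) is one common choice rather than the only one.

```python
import numpy as np

def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic points from minority samples X_min of shape (n, d).

    Assumes 1 <= k <= n - 1.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Brute-force pairwise squared distances between minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)     # which sample to start from
    nn = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[nn] - X_min[base])

def inverse_frequency_weights(y):
    """Per-sample weights so that each class contributes equally in total."""
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    return class_weight[np.searchsorted(classes, y)]

def weighted_mean_loss(per_sample_loss, weights):
    """Weighted average of per-sample losses."""
    return np.sum(weights * per_sample_loss) / np.sum(weights)

rng = np.random.default_rng(0)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=5, k=2, rng=rng)
weights = inverse_frequency_weights(np.array([0, 0, 0, 1]))
print(synthetic.shape, weights)
```

Because each synthetic point lies on the segment between two real minority samples, it stays inside the region the minority class already occupies.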

Data augmentation is a technique used to increase the diversity of your dataset by applying various transformations to your existing data. This method helps improve the performance and generalizability of machine learning models, especially in domains where collecting more data is challenging or expensive. Data augmentation is particularly popular in image processing, natural language processing (NLP), and audio analysis but can be applied to any data type. Here are some examples across different domains:

Image Data Augmentation

Rotation: Rotating the image by a certain angle to simulate the effect of viewing the object from different perspectives.

Translation: Shifting the image horizontally or vertically to mimic the object's position changes in the frame.

Rescaling: Adjusting the size of the image to simulate the object being closer or further away.

Flipping: Mirroring the image either horizontally or vertically to increase variability.

Cropping: Cutting out parts of the image and using the crops as new images to focus on different parts of the scene.

Color Jittering: Modifying the colors of the image (e.g., adjusting brightness, contrast, saturation) to simulate different lighting conditions.

Noise Injection: Adding random noise to images to mimic imperfections in the data acquisition process.
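Several of the image transformations above can be sketched with plain NumPy on a uint8 array of shape (H, W, C). The function names, the probability of flipping, and the parameter ranges in augment are arbitrary choices for illustration. Arbitrary-angle rotation and rescaling need interpolation and are usually delegated to an image library, so they are left out here.

```python
import numpy as np

def translate(image, dx, dy):
    """Shift by (dx, dy) pixels, filling the uncovered area with zeros.

    Assumes |dx| < width and |dy| < height.
    """
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of the image."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    return image[top:top + crop_h, left:left + crop_w]

def jitter_brightness(image, factor):
    """Scale pixel intensities and clip back to the uint8 range."""
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def augment(image, rng):
    """Apply a random flip, shift, brightness change, and noise."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                      # horizontal flip
    dx, dy = (int(v) for v in rng.integers(-4, 5, size=2))
    image = translate(image, dx, dy)
    image = jitter_brightness(image, rng.uniform(0.8, 1.2))
    return add_gaussian_noise(image, sigma=5.0, rng=rng)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)
```

Calling augment repeatedly on the same image yields a different variant each time, which is typically done on the fly during training rather than stored on disk.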

Text Data Augmentation

Synonym Replacement: Replacing words with their synonyms to change the text slightly without altering the meaning.

Back Translation: Translating the text into another language and then back to the original language to introduce variability in phrasing.

Sentence Shuffling: Rearranging the sentences in a paragraph to change the structure. This only suits tasks where sentence order carries little meaning for the target (e.g., topic classification).

Word or Character Dropout: Randomly removing words or characters from the text to simulate missing or incomplete data.
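A minimal sketch of three of these text augmentations using only the standard library. The SYNONYMS table is a hypothetical toy dictionary invented for the example; a real pipeline would draw synonyms from a thesaurus or a language model, and back translation is omitted because it needs a translation system.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}

def replace_synonyms(tokens, p, rng):
    """Replace each token that has a known synonym with probability p."""
    return [
        rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
        for t in tokens
    ]

def drop_words(tokens, p, rng):
    """Drop each token with probability p, but never return an empty list."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def shuffle_sentences(sentences, rng):
    """Return the sentences in a random order without modifying the input."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
tokens = "the quick dog is happy".split()
print(replace_synonyms(tokens, p=0.5, rng=rng))
print(drop_words(tokens, p=0.2, rng=rng))
print(shuffle_sentences(["First.", "Second.", "Third."], rng))
```

Passing an explicit random.Random instance keeps the augmentation reproducible when a seed is fixed.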

Audio Data Augmentation

Time Stretching: Slowing down or speeding up the audio to simulate different speaking rates or musical tempos.

Pitch Shifting: Changing the pitch of the audio without altering the tempo to simulate different voice characteristics or musical keys.

Background Noise Injection: Adding different background sounds (e.g., street noise, cafe ambiance) to simulate various recording environments.

Volume Adjustment: Varying the audio's volume to simulate changes in the speaker's distance or recording levels.
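Volume adjustment and noise injection are simple enough to sketch with NumPy on a mono waveform stored as a float array. The function names and the decibel-based parameters are assumptions for this example. The change_speed helper is a naive resampler that alters tempo and pitch together; true time stretching or pitch shifting, which changes one without the other, needs a dedicated audio library.

```python
import numpy as np

def change_volume(wave, gain_db):
    """Scale the amplitude by a gain given in decibels."""
    return wave * 10.0 ** (gain_db / 20.0)

def add_noise_at_snr(wave, noise, snr_db):
    """Mix noise into wave so the result has the requested signal-to-noise ratio."""
    noise = np.resize(noise, wave.shape)      # repeat or trim noise to the clip length
    signal_power = np.mean(wave ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return wave + scale * noise

def change_speed(wave, rate):
    """Naive resampling by linear interpolation; rate > 1 shortens the clip.

    Note: this changes pitch as well as duration.
    """
    n_out = int(round(len(wave) / rate))
    positions = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(positions, np.arange(len(wave)), wave)

rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # one second of a 440 Hz tone
quieter = change_volume(wave, gain_db=-6.0)
noisy = add_noise_at_snr(wave, rng.normal(size=4000), snr_db=10.0)
faster = change_speed(wave, rate=2.0)
print(len(faster), np.abs(quieter).max())
```

Here the synthetic Gaussian noise stands in for recorded background sound such as street or cafe noise.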

General Techniques

Random Erasing or Dropout: Randomly removing parts of the data (pixels in images, tokens in text, samples in audio) to make the model more robust to parts of the input being missing or occluded.

Feature Space Augmentation: Generating new samples by applying transformations in the feature space, such as interpolating between existing samples (e.g., SMOTE for tabular data).
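Random erasing on an image can be sketched as blanking out one randomly placed rectangle. The function name, the max_fraction parameter (assumed to lie in (0, 1]), and the constant fill value are choices made for this example; other variants fill the region with noise or the dataset mean.

```python
import numpy as np

def random_erase(image, max_fraction, rng, fill=0):
    """Return a copy of the image with one random rectangle set to `fill`.

    The rectangle is at most max_fraction of the height and of the width.
    """
    h, w = image.shape[:2]
    erase_h = int(rng.integers(1, max(2, int(h * max_fraction) + 1)))
    erase_w = int(rng.integers(1, max(2, int(w * max_fraction) + 1)))
    top = int(rng.integers(0, h - erase_h + 1))
    left = int(rng.integers(0, w - erase_w + 1))
    out = image.copy()                         # leave the input untouched
    out[top:top + erase_h, left:left + erase_w] = fill
    return out

rng = np.random.default_rng(0)
image = np.full((16, 16, 3), 200, dtype=np.uint8)
erased = random_erase(image, max_fraction=0.5, rng=rng)
print((erased == 0).all(axis=-1).sum(), "pixels erased")
```

The same idea carries over to text (masking a span of tokens) and audio (zeroing a short time window).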

Data augmentation can improve model robustness and generalization by presenting the model with a wider variety of examples during training. This helps prevent overfitting, especially when the amount of available training data is limited.