Loss becomes Inf or NaN: reasons
- Input data is wrong
    - Check the input and target data; print the tensor values and look for NaN/Inf (see the sketch after this item).
    - Preprocessing: wrong normalization (e.g. normalizing with the wrong mean/std).
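A minimal sketch of this check, assuming `batch` and `target` stand in for one batch from your DataLoader (both names are placeholders):

```python
import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    # Raise as soon as a tensor contains NaN or Inf.
    if not torch.isfinite(t).all():
        bad = (~torch.isfinite(t)).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite values")

batch = torch.randn(8, 3, 32, 32)    # stands in for real input images
target = torch.randint(0, 10, (8,))  # stands in for real labels

assert_finite("input", batch)
assert_finite("target", target.float())

# Sanity-check the normalization: per-channel mean should be near 0 and
# std near 1 if the preprocessing statistics are correct.
print(batch.mean(dim=(0, 2, 3)))
print(batch.std(dim=(0, 2, 3)))
```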
- Model is wrong
    - `torch.autograd.set_detect_anomaly(True)` (or the `torch.autograd.detect_anomaly()` context manager) makes `backward()` report the op that produced the NaN (sketch below).
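A minimal sketch of anomaly detection; the tiny linear model and the random data are placeholders:

```python
import torch
import torch.nn as nn

# With detection on, backward() raises at the first op whose gradient
# contains NaN and prints a traceback to the forward call that created it.
torch.autograd.set_detect_anomaly(True)  # global switch; slows training

model = nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # would raise RuntimeError here if a backward op produced NaN
```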
- Gradient explosion
    - Decrease the learning rate.
    - Adjust the loss-weight hyperparameters when the total loss is a weighted sum of terms (see the sketch after this item).
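A minimal sketch: the lowered learning rate follows the note above, while gradient clipping is my addition, a remedy that commonly accompanies it:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # lowered lr

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Clipping (not in the notes) caps the update magnitude so a single
# bad batch cannot blow up the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```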
- Loss explosion
    - Check whether the loss can be backpropagated normally.
    - Cast input tensors to the same dtype before combining them.
    - Add a small constant to every divisor to keep the computation numerically stable (see the sketch after this list).
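A minimal sketch of the dtype cast and the epsilon-guarded division; all tensors are placeholders:

```python
import torch

eps = 1e-8  # small constant added to divisors for numerical stability

pred = torch.rand(8)
weight = torch.rand(8, dtype=torch.float16)

# Cast operands to one dtype before combining them; an implicit
# half-precision intermediate can overflow to Inf long before float32.
weighted = pred * weight.to(pred.dtype)

# Without eps the denominator can be ~0 and yield Inf.
normalized = weighted / (weighted.sum() + eps)
loss = -torch.log(normalized + eps).mean()  # eps also guards log(0)
```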
- BatchNorm
    - Check whether values become NaN right after the BatchNorm layer.
    - When the training and validation sets have different distributions, the running mean and variance learned during training may not apply to the validation set.
    - The same mismatch appears when the encoder and decoder have different architectures, e.g.:
        - encoder: UNet + ResNet34
        - decoder: UNet + ResNet50
    - Workarounds (sketched below):
        - instead of `model.eval()`, keep the model in `model.train(True)` so BatchNorm uses batch statistics;
        - or construct BatchNorm with `track_running_stats=False`.
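A minimal sketch combining both workarounds with a NaN probe right after BatchNorm; the tiny conv model is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    # Workaround 2: never track running stats; the layer then always
    # normalizes with the current batch's mean/var, even in eval().
    nn.BatchNorm2d(16, track_running_stats=False),
    nn.ReLU(),
)

def nan_hook(module, inputs, output):
    # Flag NaN/Inf right after the module, as the notes suggest.
    if not torch.isfinite(output).all():
        raise RuntimeError(f"non-finite output after {module}")

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.register_forward_hook(nan_hook)

# Workaround 1: keep the model in train mode at inference instead of
# calling model.eval(), so BatchNorm uses batch statistics.
model.train(True)
out = model(torch.randn(4, 3, 32, 32))
```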
- Pooling layer
    - A stride larger than the kernel size makes the pooling windows skip input positions, e.g. this Caffe layer:

      ```
      layer {
        name: "faulty_pooling"
        type: "Pooling"
        bottom: "x"
        top: "y"
        pooling_param {
          pool: AVE
          kernel_size: 3
          stride: 5
        }
      }
      ```
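The same misconfiguration translated to PyTorch as an illustration (my translation, not from the notes):

```python
import torch
import torch.nn as nn

# Stride 5 with a 3x3 kernel means the pooling windows skip two of every
# five input positions, which is almost never intended.
pool = nn.AvgPool2d(kernel_size=3, stride=5)
x = torch.randn(1, 1, 32, 32)
print(pool(x).shape)  # torch.Size([1, 1, 6, 6]); much of the input is never pooled
```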
- Shuffle
    - BatchNorm on data whose distribution varies from batch to batch:
        - training stage: shuffle enabled
        - testing stage: no shuffle
        - the batch distributions can then be very different, which can drive the BatchNorm statistics to NaN (see the sketch after this list).
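A minimal sketch of the train/test shuffle settings with a toy dataset (placeholder tensors):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# If the data is stored in a sorted order (e.g. by class), unshuffled
# batches can have very different statistics from batch to batch, and
# BatchNorm computed on them can diverge.
ds = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

train_loader = DataLoader(ds, batch_size=16, shuffle=True)   # training stage
test_loader = DataLoader(ds, batch_size=16, shuffle=False)   # testing stage
```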
- GPU vs. CPU
    - A model trained and saved on GPU fails to load on a CPU-only machine unless the checkpoint tensors are remapped to CPU (sketch below).
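A minimal sketch using `map_location`, which remaps CUDA tensors in the checkpoint to CPU; `model.pt` is a placeholder path:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
torch.save(model.state_dict(), "model.pt")  # saved on the GPU box

# On the CPU-only machine: remap all tensors in the checkpoint to CPU.
state = torch.load("model.pt", map_location=torch.device("cpu"))
model.load_state_dict(state)
model.eval()
```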