Loss becomes Inf or NaN: reasons
- Input data is wrong
    - Check the input and target data; print the tensor values and look for NaN/Inf (see the sketch after this item).
    - Preprocessing: wrong normalization (e.g. normalizing with the wrong mean/std).
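A minimal sketch of this check, assuming `batch` and `target` stand in for one batch from your DataLoader (both names are placeholders):

```python
import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    # Raise as soon as a tensor contains NaN or Inf.
    if not torch.isfinite(t).all():
        bad = (~torch.isfinite(t)).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite values")

batch = torch.randn(8, 3, 32, 32)    # stands in for real input images
target = torch.randint(0, 10, (8,))  # stands in for real labels

assert_finite("input", batch)
assert_finite("target", target.float())

# Sanity-check the normalization: per-channel mean should be near 0 and
# std near 1 if the preprocessing statistics are correct.
print(batch.mean(dim=(0, 2, 3)))
print(batch.std(dim=(0, 2, 3)))
```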
- Model is wrong
    - `torch.autograd.set_detect_anomaly(True)` (or the `torch.autograd.detect_anomaly()` context manager) makes `backward()` report the op that produced the NaN (sketch below).
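A minimal sketch of anomaly detection; the tiny linear model and the random data are placeholders:

```python
import torch
import torch.nn as nn

# With detection on, backward() raises at the first op whose gradient
# contains NaN and prints a traceback to the forward call that created it.
torch.autograd.set_detect_anomaly(True)  # global switch; slows training

model = nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # would raise RuntimeError here if a backward op produced NaN
```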
- Gradient explosion
    - Decrease the learning rate.
    - Adjust the loss-weight hyperparameters when the total loss is a weighted sum of terms (see the sketch after this item).
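A minimal sketch: the lowered learning rate follows the note above, while gradient clipping is my addition, a remedy that commonly accompanies it:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # lowered lr

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Clipping (not in the notes) caps the update magnitude so a single
# bad batch cannot blow up the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```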
- Loss explosion
    - Check whether the loss can be backpropagated normally.
    - Cast input tensors to the same dtype before combining them.
    - Add a small constant to every divisor to keep the computation numerically stable (see the sketch after this list).
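A minimal sketch of the dtype cast and the epsilon-guarded division; all tensors are placeholders:

```python
import torch

eps = 1e-8  # small constant added to divisors for numerical stability

pred = torch.rand(8)
weight = torch.rand(8, dtype=torch.float16)

# Cast operands to one dtype before combining them; an implicit
# half-precision intermediate can overflow to Inf long before float32.
weighted = pred * weight.to(pred.dtype)

# Without eps the denominator can be ~0 and yield Inf.
normalized = weighted / (weighted.sum() + eps)
loss = -torch.log(normalized + eps).mean()  # eps also guards log(0)
```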
- BatchNorm
    - Check whether values become NaN right after the BatchNorm layer.
    - When the training and validation sets have different distributions, the running mean and variance learned during training may not apply to the validation set.
    - The same mismatch appears when the encoder and decoder have different architectures, e.g.:
        - encoder: UNet + ResNet34
        - decoder: UNet + ResNet50
    - Workarounds (sketched below):
        - instead of `model.eval()`, keep the model in `model.train(True)` so BatchNorm uses batch statistics;
        - or construct BatchNorm with `track_running_stats=False`.
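A minimal sketch combining both workarounds with a NaN probe right after BatchNorm; the tiny conv model is a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    # Workaround 2: never track running stats; the layer then always
    # normalizes with the current batch's mean/var, even in eval().
    nn.BatchNorm2d(16, track_running_stats=False),
    nn.ReLU(),
)

def nan_hook(module, inputs, output):
    # Flag NaN/Inf right after the module, as the notes suggest.
    if not torch.isfinite(output).all():
        raise RuntimeError(f"non-finite output after {module}")

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.register_forward_hook(nan_hook)

# Workaround 1: keep the model in train mode at inference instead of
# calling model.eval(), so BatchNorm uses batch statistics.
model.train(True)
out = model(torch.randn(4, 3, 32, 32))
```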
- Pooling layer
    - A stride larger than the kernel size makes the pooling windows skip input positions, e.g. this Caffe layer:

      ```
      layer {
        name: "faulty_pooling"
        type: "Pooling"
        bottom: "x"
        top: "y"
        pooling_param {
          pool: AVE
          kernel_size: 3
          stride: 5
        }
      }
      ```
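The same misconfiguration translated to PyTorch as an illustration (my translation, not from the notes):

```python
import torch
import torch.nn as nn

# Stride 5 with a 3x3 kernel means the pooling windows skip two of every
# five input positions, which is almost never intended.
pool = nn.AvgPool2d(kernel_size=3, stride=5)
x = torch.randn(1, 1, 32, 32)
print(pool(x).shape)  # torch.Size([1, 1, 6, 6]); much of the input is never pooled
```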
- Shuffle
    - BatchNorm on data whose distribution varies from batch to batch:
        - training stage: shuffle enabled
        - testing stage: no shuffle
        - the batch distributions can then be very different, which can drive the BatchNorm statistics to NaN (see the sketch after this list).
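A minimal sketch of the train/test shuffle settings with a toy dataset (placeholder tensors):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# If the data is stored in a sorted order (e.g. by class), unshuffled
# batches can have very different statistics from batch to batch, and
# BatchNorm computed on them can diverge.
ds = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

train_loader = DataLoader(ds, batch_size=16, shuffle=True)   # training stage
test_loader = DataLoader(ds, batch_size=16, shuffle=False)   # testing stage
```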
- GPU vs. CPU
    - A model trained and saved on GPU fails to load on a CPU-only machine unless the checkpoint tensors are remapped to CPU (sketch below).
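A minimal sketch using `map_location`, which remaps CUDA tensors in the checkpoint to CPU; `model.pt` is a placeholder path:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
torch.save(model.state_dict(), "model.pt")  # saved on the GPU box

# On the CPU-only machine: remap all tensors in the checkpoint to CPU.
state = torch.load("model.pt", map_location=torch.device("cpu"))
model.load_state_dict(state)
model.eval()
```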