Moving average ignored a value of inf/nan #222
What was the command you used to run this? And are you on the latest master?
Yes, it was the latest one. After that, I did the following, and it failed once again.
Wow, that's quite weird. This is the unmodified base config? I'll see if I can reproduce this on Monday. And just to sanity check, does it happen with 1 GPU?
As I mentioned above, I changed the COCO DB location in the config file so that I can use my COCO DB without moving it into the 'data' folder. Other than that, the values are the defaults. I'm running with:

export CUDA_VISIBLE_DEVICES=0

I'll let you know what happens.
Quick related question!
You mean at what point the loss shouldn't be able to explode anymore? That should be around epoch 1-2, but evidently it can still explode at epoch 2, as you found. I'm currently testing 4 GPUs on yolact_base_config; I'll let you know if I can reproduce this.
Ok, this is an issue: I was able to reproduce this on 4 GPUs. I'm currently testing on 1 GPU, but dang, everything was working fine with Pascal, no idea why it's not working with COCO. I'll take a deeper look at why this is happening and see if I can fix it.
I hope all of those problems that occurred in the single/multi GPU cases originate from the same flaw.
And what about gradient clipping?
Ok, sorry about the delay (Thanksgiving and all). The single-GPU test came back successful and reproduced the mAP in the paper, so I guess the issue may be somewhere else? Now that I'm back, I'll investigate this further.
OK, thank you for letting me know.
An update: I figured out the source of the random infs during training that you see every now and again. Augmentation could actually output detections with 0 bbox area, hence why some things get divided by 0 and thus produce infinity. Since I ignore batches with inf loss, this wouldn't be the cause of the explosion, but it might be boxes with really small area instead that are causing it (since I don't filter those out). Right now I'm trying out discarding any box with fewer than 4 pixels of width and 4 pixels of height. I haven't seen a single inf or nan so far, so this looks promising. I'll update you once / if it finishes training.
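For readers following along, here is a minimal sketch of that kind of filter, assuming boxes are stored as (x1, y1, x2, y2) in absolute pixel coordinates after augmentation; the function name, signature, and threshold are illustrative, not necessarily how the actual YOLACT augmentation code implements it:

```python
import numpy as np

def discard_tiny_boxes(boxes, labels, masks=None, min_size=4):
    # boxes:  (N, 4) float array of [x1, y1, x2, y2] in absolute pixels (post-augmentation)
    # labels: (N,)   int array of class ids
    # masks:  optional (N, H, W) array of instance masks
    # Drop any box whose width or height falls below min_size pixels, since
    # zero- or near-zero-area boxes lead to divisions by ~0 and inf losses.
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_size) & (h >= min_size)

    boxes, labels = boxes[keep], labels[keep]
    if masks is not None:
        masks = masks[keep]
    return boxes, labels, masks
```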
That's good news!
@dbolya Thanks for your work. Have you finished the training run that discards any box with fewer than 4 pixels of width and 4 pixels of height?
It's at epoch 49/54, but that's enough for me to verify that it reached the same mAP as in the paper with 4 GPUs and a batch size of 28 (so 7 per GPU) with no loss explosion! I'll push the code now so you guys can test.
Pushed the change. Let me know if I borked anything!
It works fine at epoch 45/54.
…ugmented boxes < 4 px in width and height. Fixes dbolya#222
Hi, me again.
"Moving average ignored a value of inf/nan "
issue happens again with multi-GPUs!
Would you please take a look at this issue once again?
(prev issue link : #186)
Thanks!
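For context on the warning text itself, here is a rough sketch of a windowed moving average that skips non-finite values, which is the kind of guard that produces this message; the class and method names are illustrative, not necessarily the ones used in the YOLACT training loop:

```python
import math
from collections import deque

class MovingAverage:
    """Windowed average of recent loss values that ignores inf/nan entries."""

    def __init__(self, window_size=100):
        self.values = deque(maxlen=window_size)

    def add(self, value):
        # Skip non-finite losses instead of poisoning the running average.
        if not math.isfinite(value):
            print('Warning: Moving average ignored a value of %f' % value)
            return
        self.values.append(value)

    def get_avg(self):
        return sum(self.values) / max(len(self.values), 1)
```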