Moving average ignored a value of inf/nan #222

Closed · emjay73 opened this issue Nov 22, 2019 · 19 comments

Comments

@emjay73 commented Nov 22, 2019

Hi, me again.
The "Moving average ignored a value of inf/nan" issue is happening again with multiple GPUs!

Would you please take a look at this issue once again?

[Screenshot from 2019-11-22 21-26-17]

(previous issue: #186)
Thanks!
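
(For context, this warning typically comes from a loss-averaging helper that skips non-finite values instead of letting them poison the running mean. A minimal sketch of that behaviour, assuming a simple windowed average; the class name, window size, and exact warning text here are illustrative, not the actual yolact code:)

```python
import math
from collections import deque

class MovingAverage:
    """Windowed moving average that refuses to absorb inf/nan values.

    Minimal sketch of the behaviour behind the warning above; not the
    exact yolact implementation.
    """

    def __init__(self, max_len=1000):
        self.values = deque(maxlen=max_len)

    def append(self, value):
        if math.isinf(value) or math.isnan(value):
            # A non-finite loss would corrupt the average permanently,
            # so it is dropped and a warning is printed instead.
            print('Warning: Moving average ignored a value of inf/nan')
            return
        self.values.append(value)

    def get_avg(self):
        return sum(self.values) / max(len(self.values), 1)
```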

@dbolya (Owner) commented Nov 25, 2019

What command did you use to run this? And are you on the latest master?

@emjay73 (Author) commented Nov 25, 2019

Yes, it was the latest one.
But to make sure, I cloned the repository into a newly created folder and changed the COCO dataset location in the config file so that I can use my own COCO data without moving it into the 'data' folder.
Then I copied over the 'weights' folder with the following contents.

[Screenshot from 2019-11-25 11-28-31]

After that, I did the following.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python train.py --config=yolact_base_config --batch_size=32

and it failed once again.

Screenshot from 2019-11-25 11-23-21

Screenshot from 2019-11-25 11-23-52

Screenshot from 2019-11-25 11-25-50

@dbolya (Owner) commented Nov 25, 2019

Wow, that's quite weird. Is this the unmodified base config? I'll see if I can reproduce this on Monday.

And just to sanity check, does it happen with 1 GPU?

@emjay73 (Author) commented Nov 25, 2019

As I mentioned above, I changed the COCO dataset location in the config file so that I can use my own COCO data without moving it into the 'data' folder.

[Screenshot from 2019-11-25 11-45-03]

Other than that, everything keeps its default value.
I just started training with the following commands:

export CUDA_VISIBLE_DEVICES=0
python train.py --config=yolact_base_config --batch_size=8

I'll let you know what happens.

@emjay73 (Author) commented Nov 25, 2019

Quick related question!
At what point do you consider it trained enough?

@dbolya (Owner) commented Nov 25, 2019

You mean at what point the loss shouldn't be able to explode anymore? That should be around epoch 1-2, but evidently it can still explode in epoch 2, as you found. I'm currently testing 4 GPUs on yolact_base_config; I'll let you know if I can reproduce this.

@emjay73 (Author) commented Nov 26, 2019

I trained it for roughly a day and it's stable with a single GPU!

[Screenshot from 2019-11-26 09-59-58]

@emjay73 (Author) commented Nov 26, 2019

On the second attempt, the loss exploded with a single GPU :<

[Screenshot from 2019-11-26 17-17-18]

[Screenshot from 2019-11-26 17-17-33]

[Screenshot from 2019-11-26 17-17-40]

@dbolya (Owner) commented Nov 27, 2019

OK, this is an issue: I was able to reproduce it on 4 GPUs. I'm currently testing on 1 GPU, but dang, everything was working fine with Pascal; no idea why it's not working with COCO.

I'll take a deeper look as to why this is happening and see if I can fix it.

@emjay73 (Author) commented Nov 27, 2019

I hope all of those problems that occurred in the single- and multi-GPU cases originate from the same flaw.
Thanks, and please,
you are my only hope T-T

@emjay73 (Author) commented Nov 27, 2019

And what about gradient clipping?
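
(For reference, gradient clipping in PyTorch is typically one call to torch.nn.utils.clip_grad_norm_ between the backward pass and the optimizer step. A toy sketch under that assumption; the model, data, and max_norm=10 value are arbitrary illustrations, not anything from yolact's training loop:)

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration; the point is where the
# clipping call sits relative to backward() and step().
net = nn.Linear(4, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(net(x), y)
loss.backward()
# Cap the global gradient norm before the update so one bad batch
# cannot blow up the weights; max_norm=10 is an arbitrary choice here.
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
optimizer.step()
```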

@dbolya (Owner) commented Dec 2, 2019

OK, sorry about the delay (Thanksgiving and all). The single-GPU test came back successful and reproduced the mAP in the paper, so I guess the issue may be somewhere else? Now that I'm back, I'll investigate this further.

@emjay73 (Author) commented Dec 3, 2019

OK, thank you for the update.

@dbolya (Owner) commented Dec 4, 2019

An update: I figured out the source of the random infs you see every now and again during training. Augmentation could actually output detections with 0 bbox area, which is why some things get divided by 0 and produce infinity. Since I ignore batches with inf loss, this wouldn't be the cause of the explosion, but it might instead be boxes with a really small area that are causing it (since I don't filter those out).

Right now I'm trying out discarding any box with fewer than 4 pixels of width and 4 pixels of height. I haven't seen a single inf or nan so far, so this looks promising. I'll update you once/if it finishes training.
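
(A minimal sketch of that kind of post-augmentation filter, assuming boxes in [x1, y1, x2, y2] pixel coordinates and the 4-pixel threshold described above; the function name discard_tiny_boxes and the exact filtering rule are illustrative, not the actual patch that was pushed:)

```python
import numpy as np

def discard_tiny_boxes(boxes, labels, min_size=4):
    """Drop boxes whose width or height is below `min_size` pixels.

    `boxes` is assumed to be an (N, 4) float array of [x1, y1, x2, y2]
    in pixel coordinates; sketch only, not the exact yolact fix.
    """
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_size) & (h >= min_size)
    return boxes[keep], labels[keep]

# Example: the second box is only 2 px tall, the kind of degenerate
# detection that leads to divide-by-zero losses, so it is filtered out.
boxes = np.array([[10, 10, 50, 60], [5, 5, 30, 7]], dtype=np.float32)
labels = np.array([1, 2])
boxes, labels = discard_tiny_boxes(boxes, labels)
```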

@emjay73 (Author) commented Dec 5, 2019

That's good news!
And well, it seems like the problem is not related to multiple GPUs, then..?

@HuangLian126 commented:

@dbolya Thanks for your work. Did you finish the training that discards any box with fewer than 4 pixels of width and 4 pixels of height?

@dbolya (Owner) commented Dec 6, 2019

It's at epoch 49/54, but that's enough for me to verify that it reached the same mAP as in the paper with 4 GPUs and batch size 28 (so 7 per GPU), with no loss explosion! I'll push the code now so you guys can test.

@dbolya (Owner) commented Dec 6, 2019

Pushed the change. Let me know if I borked anything!

@emjay73 (Author) commented Dec 9, 2019

It works fine at epoch 45/54.
Thank you!
