Moving average ignored a value of inf/nan #222

Closed · emjay73 opened this issue Nov 22, 2019 · 19 comments

Comments

@emjay73 commented Nov 22, 2019

Hi, me again.
The "Moving average ignored a value of inf/nan" issue is happening again with multiple GPUs!

Would you please take a look at this issue once again?

[Screenshot from 2019-11-22 21-26-17]

(previous issue: #186)
Thanks!
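
(For context, this warning typically comes from a loss-averaging helper that skips non-finite values instead of letting them poison the running mean. A minimal sketch of that behaviour, assuming a simple windowed average; the class name, window size, and exact warning text here are illustrative, not the actual yolact code:)

```python
import math
from collections import deque

class MovingAverage:
    """Windowed moving average that refuses to absorb inf/nan values.

    Minimal sketch of the behaviour behind the warning above; not the
    exact yolact implementation.
    """

    def __init__(self, max_len=1000):
        self.values = deque(maxlen=max_len)

    def append(self, value):
        if math.isinf(value) or math.isnan(value):
            # A non-finite loss would corrupt the average permanently,
            # so it is dropped and a warning is printed instead.
            print('Warning: Moving average ignored a value of inf/nan')
            return
        self.values.append(value)

    def get_avg(self):
        return sum(self.values) / max(len(self.values), 1)
```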

@dbolya (Owner) commented Nov 25, 2019

What command did you use to run this? And are you on the latest master?

@emjay73 (Author) commented Nov 25, 2019

Yes, it was the latest one.
But to make sure, I cloned the repository into a newly created folder and changed the COCO dataset location in the config file so that I can use my own COCO data without moving it into the 'data' folder.
Then I copied over the 'weights' folder with the following contents.

[Screenshot from 2019-11-25 11-28-31]

After that, I did the following.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python train.py --config=yolact_base_config --batch_size=32

and it failed once again.

Screenshot from 2019-11-25 11-23-21

Screenshot from 2019-11-25 11-23-52

Screenshot from 2019-11-25 11-25-50

@dbolya (Owner) commented Nov 25, 2019

Wow, that's quite weird. Is this the unmodified base config? I'll see if I can reproduce this on Monday.

And just to sanity check, does it happen with 1 GPU?

@emjay73 (Author) commented Nov 25, 2019

As I mentioned above, I changed the COCO dataset location in the config file so that I can use my own COCO data without moving it into the 'data' folder.

[Screenshot from 2019-11-25 11-45-03]

Other than that, everything keeps its default value.
I just started training with the following commands:

export CUDA_VISIBLE_DEVICES=0
python train.py --config=yolact_base_config --batch_size=8

I'll let you know what happens.

@emjay73 (Author) commented Nov 25, 2019

Quick related question!
At what point do you consider it trained enough?

@dbolya (Owner) commented Nov 25, 2019

You mean at what point the loss shouldn't be able to explode anymore? That should be around epoch 1-2, but evidently it can still explode in epoch 2, as you found. I'm currently testing 4 GPUs on yolact_base_config; I'll let you know if I can reproduce this.

@emjay73 (Author) commented Nov 26, 2019

I trained it for roughly a day and it's stable with a single GPU!

[Screenshot from 2019-11-26 09-59-58]

@emjay73 (Author) commented Nov 26, 2019

On the second attempt, the loss exploded with a single GPU :<

[Screenshot from 2019-11-26 17-17-18]

[Screenshot from 2019-11-26 17-17-33]

[Screenshot from 2019-11-26 17-17-40]

@dbolya (Owner) commented Nov 27, 2019

OK, this is an issue: I was able to reproduce it on 4 GPUs. I'm currently testing on 1 GPU, but dang, everything was working fine with Pascal; no idea why it's not working with COCO.

I'll take a deeper look as to why this is happening and see if I can fix it.

@emjay73 (Author) commented Nov 27, 2019

I hope all of those problems that occurred in the single- and multi-GPU cases originate from the same flaw.
Thanks, and please,
you are my only hope T-T

@emjay73 (Author) commented Nov 27, 2019

And what about gradient clipping?
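
(For reference, gradient clipping in PyTorch is typically one call to torch.nn.utils.clip_grad_norm_ between the backward pass and the optimizer step. A toy sketch under that assumption; the model, data, and max_norm=10 value are arbitrary illustrations, not anything from yolact's training loop:)

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration; the point is where the
# clipping call sits relative to backward() and step().
net = nn.Linear(4, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(net(x), y)
loss.backward()
# Cap the global gradient norm before the update so one bad batch
# cannot blow up the weights; max_norm=10 is an arbitrary choice here.
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
optimizer.step()
```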

@dbolya (Owner) commented Dec 2, 2019

OK, sorry about the delay (Thanksgiving and all). The single-GPU test came back successful and reproduced the mAP in the paper, so I guess the issue may be somewhere else? Now that I'm back, I'll investigate this further.

@emjay73 (Author) commented Dec 3, 2019

OK, thank you for the update.

@dbolya (Owner) commented Dec 4, 2019

An update: I figured out the source of the random infs you see every now and again during training. Augmentation could actually output detections with 0 bbox area, which is why some things get divided by 0 and produce infinity. Since I ignore batches with inf loss, this wouldn't be the cause of the explosion, but it might instead be boxes with a really small area that are causing it (since I don't filter those out).

Right now I'm trying out discarding any box with fewer than 4 pixels of width and 4 pixels of height. I haven't seen a single inf or nan so far, so this looks promising. I'll update you once/if it finishes training.
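
(A minimal sketch of that kind of post-augmentation filter, assuming boxes in [x1, y1, x2, y2] pixel coordinates and the 4-pixel threshold described above; the function name discard_tiny_boxes and the exact filtering rule are illustrative, not the actual patch that was pushed:)

```python
import numpy as np

def discard_tiny_boxes(boxes, labels, min_size=4):
    """Drop boxes whose width or height is below `min_size` pixels.

    `boxes` is assumed to be an (N, 4) float array of [x1, y1, x2, y2]
    in pixel coordinates; sketch only, not the exact yolact fix.
    """
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_size) & (h >= min_size)
    return boxes[keep], labels[keep]

# Example: the second box is only 2 px tall, the kind of degenerate
# detection that leads to divide-by-zero losses, so it is filtered out.
boxes = np.array([[10, 10, 50, 60], [5, 5, 30, 7]], dtype=np.float32)
labels = np.array([1, 2])
boxes, labels = discard_tiny_boxes(boxes, labels)
```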

@emjay73 (Author) commented Dec 5, 2019

That's good news!
And well, it seems like the problem is not related to multiple GPUs, then..?

@HuangLian126 commented:

@dbolya Thanks for your work. Did you finish the training that discards any box with fewer than 4 pixels of width and 4 pixels of height?

@dbolya (Owner) commented Dec 6, 2019

It's at epoch 49/54, but that's enough for me to verify that it reached the same mAP as in the paper with 4 GPUs and batch size 28 (so 7 per GPU), with no loss explosion! I'll push the code now so you guys can test.

@dbolya (Owner) commented Dec 6, 2019

Pushed the change. Let me know if I borked anything!

@emjay73 (Author) commented Dec 9, 2019

It works fine at epoch 45/54.
Thank you!
