
custom dataset training: mask mAP 0 and "Moving average ignored a value of inf/nan" error #340

Closed
jungseokhong opened this issue Feb 16, 2020 · 6 comments


jungseokhong commented Feb 16, 2020

Hello,

I've been trying to train on my custom dataset. It's a fairly small dataset: about 700 images in total, 500 for training and 200 for validation, with 3 classes. I used yolact_resnet50_54_800000.pth as the starting weights and passed --start-iter:0 as an option. Most images are 480x640, but some are larger, e.g. 920x1040.

my_config={
    'num_classes': 4,
    # dw' = momentum * dw - lr * (grad + decay * w)
    'lr': 5e-4, # i changed this
    'momentum': 0.9,
    'decay': 5e-4,

    # Image Size
    'max_size': 550,
    # Training params
    'lr_steps': (280000, 600000, 700000, 750000), # I left it as it is since I wanted to see if training works at all.
    'max_iter': 10000
}

When I ran training, it seemed to work well at the beginning, but the mask mAP is never greater than 0, which seems odd to me, and then the loss starts to explode. Since my dataset is pretty small, increasing the size of the dataset is not an option. Any ideas? Should I change the base weights?


I've attached the JSON file that I used (converted to a .txt file so it could be uploaded).
instances_train.txt

@sdimantsd

@jungseokhong
Take a look here:
#318


zhawhjw commented Feb 17, 2020

Same problem here. I trained the "resnet101" model on the COCO dataset with 1 GPU and the default batch size of 8, and I got a loss explosion when I reached around the 5000th iteration of the first epoch.

I looked at #318 and adjusted the batch size and the warmup, but it still failed after several trials. Then I switched from 1 GPU to 2 GPUs and everything looks fine (at least for the first 2 epochs so far).

The explanation of this problem from #318 is that a high batch size causes a high learning rate (batch size * 12, I suppose?). I guess the code evenly distributes the batch size across the GPUs when more are available, so the learning rate applied in the loss calculation is effectively reduced, which makes training stable.

If that is the case then, no offence intended, the ability to run the job on a single GPU does not seem convincing (I assume the settings in the paper are aligned with the default settings in config.py).

Correct me if I am wrong.


dbolya commented Feb 19, 2020

@jungseokhong Did you forget to specify the dataset in your config? I don't see it in the config you posted. Also with a dataset that small, I'd suggest doing something like in #334.
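For reference, wiring up a custom dataset in data/config.py usually looks roughly like the sketch below; the dataset name, paths, and class names are placeholders for your own files, and the rest of your posted settings can stay as they are:

my_custom_dataset = dataset_base.copy({
    'name': 'My Dataset',                              # placeholder name
    'train_images': 'path/to/train/images/',           # placeholder paths
    'train_info':   'path/to/instances_train.json',
    'valid_images': 'path/to/valid/images/',
    'valid_info':   'path/to/instances_valid.json',
    'has_gt': True,
    'class_names': ('class_a', 'class_b', 'class_c'),  # your 3 classes
})

my_config = yolact_resnet50_config.copy({
    'name': 'my_config',
    'dataset': my_custom_dataset,
    'num_classes': len(my_custom_dataset.class_names) + 1,  # +1 for background
    'max_size': 550,
    'max_iter': 10000,
})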

@zhawhjw If you're getting a loss explosion on 1 GPU with batch size 8 using COCO, then that's a problem, since we used those parameters to train all the models in our paper. And #222 should have fixed any random losses hitting inf. If I can ask, what configs were you training with? And was this after the commit that fixed #222?

Also, the learning rate is increased when more GPUs are added, which was the problem in #318 (it was increased too high), so the most stable setup should be 1 GPU with batch_size=8 (any lower and you get instability due to batch norm).
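To make that scaling concrete, here's a rough illustration of the idea (not the exact code in train.py):

base_lr = 1e-3          # LR that is stable for batch_size = 8 on 1 GPU
base_batch_size = 8

def effective_lr(num_gpus, per_gpu_batch=8):
    # More GPUs -> larger total batch -> the LR gets scaled up by the same factor.
    total_batch = num_gpus * per_gpu_batch
    return base_lr * total_batch / base_batch_size

# effective_lr(1) -> 1e-3, effective_lr(2) -> 2e-3, and so on.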


zhawhjw commented Feb 22, 2020

@dbolya Sorry for the late reply. I used the version labeled "Latest commit f54b0a5 11 days ago" on the master branch.

The running environment is an RTX Titan with CUDA 10.1 and PyTorch 1.4.

I used 'yolact_base_config' to train ResNet-101 on COCO.

I checked my config.py; the only parameter I modified is the LR warmup ('lr_warmup_until'), which is originally set to 500 and which I changed to 1000 to try to solve the divergence problem.

The weird thing is that training is fine on two GPUs but diverges on one GPU, even after I changed the warmup back to 500.
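For context, my understanding is that the warmup just ramps the learning rate linearly from a small initial value up to the configured 'lr' over the first N iterations, roughly like this (a sketch of the idea, not the actual training-loop code):

def warmed_up_lr(iteration, lr=1e-3, lr_warmup_init=1e-4, lr_warmup_until=500):
    # Linearly interpolate from lr_warmup_init to lr during the warmup window.
    if iteration < lr_warmup_until:
        return lr_warmup_init + (lr - lr_warmup_init) * iteration / lr_warmup_until
    return lr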

@jungseokhong (Author)

@sdimantsd, @dbolya Thanks! I found that I had a problem with my annotation files. For training on the small dataset, I will follow your suggestion.
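In case it helps anyone else who sees mask mAP stuck at 0, a quick sanity check of the COCO-style annotation file can catch things like missing segmentation entries or mismatched ids (a rough sketch, assuming the standard COCO fields; adjust the file name to your own):

import json

with open('instances_train.json') as f:   # your annotation file
    coco = json.load(f)

cat_ids = {c['id'] for c in coco['categories']}
img_ids = {im['id'] for im in coco['images']}

for ann in coco['annotations']:
    assert ann['category_id'] in cat_ids, 'unknown category in annotation %d' % ann['id']
    assert ann['image_id'] in img_ids, 'unknown image in annotation %d' % ann['id']
    assert ann.get('segmentation'), 'annotation %d has no segmentation' % ann['id']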


dbolya commented Feb 25, 2020

@zhawhjw Ok, I'll have to investigate this later (though it'll be two weeks until I can get around to doing that). It should definitely be trainable on one GPU, but maybe some change broke something and I haven't realized it.

For now, I'll just close this issue since @jungseokhong figured it out, but I'll reply here if I find anything.
