
custom dataset training: mask mAP 0 and "Moving average ignored a value of inf/nan" error #340

Closed
jungseokhong opened this issue Feb 16, 2020 · 6 comments


jungseokhong commented Feb 16, 2020

Hello,

I've been trying to train on my custom dataset. It's a fairly small dataset: about 700 images in total, 500 for training and 200 for validation, with 3 classes. I used yolact_resnet50_54_800000.pth as the starting weights and passed --start-iter:0 as an option. Most images are 480x640, but some are larger, e.g. 920x1040.

my_config={
    'num_classes': 4,
    # dw' = momentum * dw - lr * (grad + decay * w)
    'lr': 5e-4, # i changed this
    'momentum': 0.9,
    'decay': 5e-4,

    # Image Size
    'max_size': 550,
    # Training params
    'lr_steps': (280000, 600000, 700000, 750000), # I left it as it is since I wanted to see if training works at all.
    'max_iter': 10000
}

When I ran training, it seemed to work well at the beginning, but the mask mAP is never greater than 0, which seems odd to me, and then the loss starts to explode. Since my dataset is pretty small, increasing the size of the dataset is not an option. Any ideas? Should I change the base weights?


I've attached the JSON file that I used (converted to a .txt file so it could be uploaded).
instances_train.txt

@sdimantsd

@jungseokhong
Take a look here:
#318


zhawhjw commented Feb 17, 2020

Same problem here. I trained the "resnet101" model on the COCO dataset with 1 GPU and the default batch size of 8, and I got a loss explosion when I reached around the 5000th iteration of the first epoch.

I looked at #318 and adjusted the batch size and the warmup, but it still failed after several trials. Then I switched from 1 GPU to 2 GPUs and everything looks fine (at least for the first 2 epochs so far).

The explanation of this problem from #318 is that a high batch size causes a high learning rate (batch size * 12, I suppose?). I guess the code evenly distributes the batch size across the GPUs when more are available, so the learning rate applied in the loss calculation is effectively reduced, which makes training stable.

If that is the case then, no offence intended, the ability to run the job on a single GPU does not seem convincing (I assume the settings in the paper are aligned with the default settings in config.py).

Correct me if I am wrong.


dbolya commented Feb 19, 2020

@jungseokhong Did you forget to specify the dataset in your config? I don't see it in the config you posted. Also with a dataset that small, I'd suggest doing something like in #334.
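For reference, wiring up a custom dataset in data/config.py usually looks roughly like the sketch below; the dataset name, paths, and class names are placeholders for your own files, and the rest of your posted settings can stay as they are:

my_custom_dataset = dataset_base.copy({
    'name': 'My Dataset',                              # placeholder name
    'train_images': 'path/to/train/images/',           # placeholder paths
    'train_info':   'path/to/instances_train.json',
    'valid_images': 'path/to/valid/images/',
    'valid_info':   'path/to/instances_valid.json',
    'has_gt': True,
    'class_names': ('class_a', 'class_b', 'class_c'),  # your 3 classes
})

my_config = yolact_resnet50_config.copy({
    'name': 'my_config',
    'dataset': my_custom_dataset,
    'num_classes': len(my_custom_dataset.class_names) + 1,  # +1 for background
    'max_size': 550,
    'max_iter': 10000,
})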

@zhawhjw If you're getting a loss explosion on 1 GPU with batch size 8 using COCO, then that's a problem, since we used those parameters to train all the models in our paper. And #222 should have fixed any random losses hitting inf. If I can ask, what configs were you training with? And was this after the commit that fixed #222?

Also, the learning rate is increased when more GPUs are added, which was the problem in #318 (it was increased too high), so the most stable setup should be 1 GPU with batch_size=8 (any lower and you get instability due to batch norm).
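To make that scaling concrete, here's a rough illustration of the idea (not the exact code in train.py):

base_lr = 1e-3          # LR that is stable for batch_size = 8 on 1 GPU
base_batch_size = 8

def effective_lr(num_gpus, per_gpu_batch=8):
    # More GPUs -> larger total batch -> the LR gets scaled up by the same factor.
    total_batch = num_gpus * per_gpu_batch
    return base_lr * total_batch / base_batch_size

# effective_lr(1) -> 1e-3, effective_lr(2) -> 2e-3, and so on.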


zhawhjw commented Feb 22, 2020

@dbolya Sorry for the late reply. I used the version labeled "Latest commit f54b0a5 11 days ago" on the master branch.

The running environment is an RTX Titan with CUDA 10.1 and PyTorch 1.4.

I used 'yolact_base_config' to train ResNet-101 on COCO.

I checked my config.py; the only parameter I modified is the LR warmup ('lr_warmup_until'), which is originally set to 500 and which I changed to 1000 to try to solve the divergence problem.

The weird thing is that training is fine on two GPUs but diverges on one GPU, even after I changed the warmup back to 500.
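For context, my understanding is that the warmup just ramps the learning rate linearly from a small initial value up to the configured 'lr' over the first N iterations, roughly like this (a sketch of the idea, not the actual training-loop code):

def warmed_up_lr(iteration, lr=1e-3, lr_warmup_init=1e-4, lr_warmup_until=500):
    # Linearly interpolate from lr_warmup_init to lr during the warmup window.
    if iteration < lr_warmup_until:
        return lr_warmup_init + (lr - lr_warmup_init) * iteration / lr_warmup_until
    return lr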

@jungseokhong (Author)

@sdimantsd, @dbolya Thanks! I found that I had a problem with my annotation files. For training on the small dataset, I will follow your suggestion.
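In case it helps anyone else who sees mask mAP stuck at 0, a quick sanity check of the COCO-style annotation file can catch things like missing segmentation entries or mismatched ids (a rough sketch, assuming the standard COCO fields; adjust the file name to your own):

import json

with open('instances_train.json') as f:   # your annotation file
    coco = json.load(f)

cat_ids = {c['id'] for c in coco['categories']}
img_ids = {im['id'] for im in coco['images']}

for ann in coco['annotations']:
    assert ann['category_id'] in cat_ids, 'unknown category in annotation %d' % ann['id']
    assert ann['image_id'] in img_ids, 'unknown image in annotation %d' % ann['id']
    assert ann.get('segmentation'), 'annotation %d has no segmentation' % ann['id']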


dbolya commented Feb 25, 2020

@zhawhjw Ok, I'll have to investigate this later (though it'll be two weeks until I can get around to doing that). It should definitely be trainable on one GPU, but maybe some change broke something and I haven't realized it.

For now, I'll just close this issue since @jungseokhong figured it out, but I'll reply here if I find anything.
