Acc drop significantly during the last epoch of stage1 #16

FANG-Xiaolin · 2018-04-16T14:30:16Z

Hi Xingyi,
After training the 2D hourglass component for 50+ epochs, the accuracy is approximately 83%, but after the 60th epoch, the accuracy suddenly drops to 43%.

Here's the log.

xingyizhou · 2018-04-16T16:36:27Z

Hi,
As far as I know, it should be caused by an internal PyTorch bug in BN. You can comment out model.eval() in testing to see if the validation acc gets better (but it still won't match the desired performance). The bug should not be reproducible, and re-training the network once more (preferably on another machine) should give different results. Or you can downgrade PyTorch to version 0.1.12, a version in which I haven't met or heard about this bug (but this is still not guaranteed). Please let me know if the above solutions help. Thanks!

FANG-Xiaolin · 2018-04-19T09:45:39Z

I tried another 3 times. The train acc is approximately 0.87 during the last epoch (the 60th epoch), but the validation acc changes every time and is always lower than 0.50. The validation acc is around 0.80 in the 55th epoch, so it seems that there is a sudden drop during the last epoch. I also notice that the training loss gets slightly higher during the last epoch.

xingyizhou · 2018-04-19T22:17:05Z

Hi,
Thanks for reporting the problem. However, I don't have other solutions yet and will keep looking into it. It might not be a bug in this code, since an independent implementation of HourglassNet also has this problem (bearpaw/pytorch-pose#33), though I am not sure whether the bug comes from the network architecture. People there suggest using a learning rate of 1e-4. You can give it a try to see if the bug still exists.

FANG-Xiaolin · 2018-04-20T13:29:52Z

Hi,
Thanks for your advice. Yes, it works when using LR 1e-4. The val acc is 0.80+ this way.

xingyizhou · 2018-04-27T18:31:59Z

Hi,
I have investigated this problem (on another project, since I cannot reproduce the bug on this one). It seems to be caused by very large intermediate features (e.g. > 10000) before batch normalization. When train() mode is on, the features are normalized by their own batch statistics, so training is OK. But when eval() mode is on, a slight difference between the intermediate features and the BN mean/std stored from training results in large offsets in the output. I don't know the cause of the problem, but it looks mathematically reasonable. However, downgrading PyTorch to version 0.1.12 will eliminate the problem. Please notify me if you have any other observations on this bug. Thanks!
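
To make the arithmetic in the comment above concrete, here is a minimal pure-Python sketch. It is not code from this repository: the activation values, the 0.1% mismatch in the stored mean, and the helper names are all assumptions chosen for illustration. It only shows why a small relative error in the stored mean turns into a large output offset when the features are large and their spread is small.

```python
import math

EPS = 1e-5  # small constant for numerical stability, as in typical BN layers


def mean_var(values):
    """Return the mean and (biased) variance of a list of floats."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


def normalize(values, mean, var):
    """Batch-norm style normalization without the learned scale and shift."""
    return [(x - mean) / math.sqrt(var + EPS) for x in values]


# Hypothetical pre-BN activations: very large offset, small spread.
batch = [10000.0, 10001.0, 10002.0, 10003.0]

# train() mode: the statistics come from the batch itself.
batch_mean, batch_var = mean_var(batch)
train_out = normalize(batch, batch_mean, batch_var)

# eval() mode: stored statistics are used instead. Here the stored mean
# is assumed to be off by only 0.1% relative to the true mean.
stored_mean = batch_mean * 0.999
eval_out = normalize(batch, stored_mean, batch_var)

print("train-mode outputs:", [round(x, 3) for x in train_out])
print("eval-mode outputs: ", [round(x, 3) for x in eval_out])
```

In train mode the outputs stay centered around zero, while in eval mode every output is shifted by several standard deviations, even though the stored mean differs from the batch mean by only about 10 in absolute terms.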

FANG-Xiaolin · 2018-05-01T01:27:30Z

Hi,
Yes I think it is reasonable. Sure I will notify you if I observe something new. Thanks for your reply!

ssnl · 2018-06-15T06:21:59Z

IIRC, your repo sets batch size to 1. If that is the case it's not really a PyTorch bug. Running stats with batch size = 1 is unstable itself.

xingyizhou · 2018-06-15T06:25:58Z

Thanks for the suggestion! The training batch size is 6 and testing is 1. When testing, eval() mode is on and the batch size does not affect the computation.

ssnl · 2018-06-15T06:27:18Z

I see. 6 is still too small though. People usually use >128 with BN.
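
As a rough illustration of the batch-size point, the following pure-Python sketch estimates how much a per-batch mean fluctuates for batch sizes 1, 6 and 128. The Gaussian inputs, the number of batches and the function name are assumptions made for this example; it says nothing about this repository's actual features, and it concerns only the training-time estimates that feed the running statistics, not the eval-time computation.

```python
import random
import statistics


def spread_of_batch_means(batch_size, num_batches=2000, seed=0):
    """Standard deviation of per-batch means for unit-variance Gaussian data."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)


# Smaller batches give noisier estimates of the mean.
spreads = {n: spread_of_batch_means(n) for n in (1, 6, 128)}
for n, s in spreads.items():
    print(f"batch size {n:>3}: spread of batch means = {s:.3f}")
```

The spread shrinks roughly as one over the square root of the batch size, so statistics estimated from batches of 6 are considerably noisier than those from batches of 128.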


xingyizhou · 2018-07-14T06:08:53Z

Hi all,
As pointed out by @leoxiaobin, turning off cuDNN for the BN layers resolves the issue. This can be done either by setting torch.backends.cudnn.enabled = False in main.py, which disables cuDNN for all layers and slows down training by about 1.5x, or by re-building PyTorch from source after hacking out the cuDNN path in the BN layers: https://github.com/pytorch/pytorch/blob/e8536c08a16b533fe0a9d645dd4255513f9f4fdd/aten/src/ATen/native/Normalization.cpp#L46 .
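
For reference, the first option amounts to the fragment below. This is a sketch of the setting named in the comment above; where exactly it goes in main.py is an assumption (anywhere before the model runs should do).

```python
import torch

# Disable cuDNN globally, so BN (and every other layer) falls back to
# PyTorch's native kernels. Expect slower training, as noted above.
torch.backends.cudnn.enabled = False
```

Note that torch.backends.cudnn.benchmark is a separate flag: it only toggles cuDNN's autotuning of convolution algorithms and does not turn cuDNN off.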

FANG-Xiaolin · 2018-07-14T06:36:25Z

Got it. Thanks.

xingyizhou · 2018-07-14T06:39:01Z

Oh, I would still like to keep this issue open to wait for better solutions...

FANG-Xiaolin · 2018-07-14T06:42:13Z

Sure! My bad.

wangg12 · 2019-07-28T15:14:49Z

@ssnl @xingyizhou Does this bug still exist with pytorch >= 1.0?

ujsyehao · 2019-12-26T08:01:25Z

@wangg12 I am doing experiments to observe if the bug exists in pytorch >= 1.0.

qiangruoyu · 2020-01-18T12:07:36Z

@wangg12 I am running experiments to see whether the bug exists in pytorch >= 1.0.

Do you still hit this error with pytorch >= 1.0?

ygean · 2020-05-26T05:00:18Z

@ujsyehao Hi, may I ask how your experiments turned out?

sisrfeng · 2020-07-21T08:21:04Z

Quoting @xingyizhou:
"Hi all, As pointed out by @leoxiaobin, turning off cuDNN for the BN layers resolves the issue. This can be done either by setting torch.backends.cudnn.enabled = False in main.py, which disables cuDNN for all layers and slows down training by about 1.5x, or by re-building PyTorch from source after hacking out the cuDNN path in the BN layers: https://github.com/pytorch/pytorch/blob/e8536c08a16b533fe0a9d645dd4255513f9f4fdd/aten/src/ATen/native/Normalization.cpp#L46 ."

About "torch.backends.cudnn.enabled = False in main.py": should it be "torch.backends.cudnn.benchmark = False" instead?

Also, if I have already followed the step below, I don't need to modify main.py, right?
"For other PyTorch versions, you can manually open torch/nn/functional.py, find the line with torch.batch_norm, and replace torch.backends.cudnn.enabled with False."

xingyizhou mentioned this issue Jun 15, 2018

validation acc drops down drastically after epoch 10 bearpaw/pytorch-pose#33

Open

FightForCS mentioned this issue Jul 12, 2018

About pytorch version #29

Closed

FANG-Xiaolin closed this as completed Jul 14, 2018

xingyizhou reopened this Jul 14, 2018

This was referenced Nov 19, 2018

Question：result of stage1 model to estimate 3d depth is different? #44

Closed

Why effect is very poor use the model trained by myself? #43

Open

dragonbook mentioned this issue Dec 12, 2018

I got stuck in val，can you give me some suggest? dragonbook/V2V-PoseNet-pytorch#2

Closed

xingyizhou mentioned this issue Dec 20, 2018

Terrible result during training #49

Closed

xingyizhou closed this as completed Jan 11, 2019

xingyizhou mentioned this issue Sep 24, 2019

Detecte nothing during validation , train loss decrease but test increase ？ xingyizhou/CenterNet#335

Open

zzzxxxttt mentioned this issue Jan 6, 2020

Hi, why do you setup the torch.backends.cudnn.enabled False? zzzxxxttt/pytorch_simple_CenterNet_45#1

Closed

xingyizhou mentioned this issue Mar 18, 2021

Output logs from coco pose training xingyizhou/CenterTrack#198

Open
