
The output format of mkldnn conv is wrong when data_format is NHWC #38126

Closed
baoachun opened this issue Dec 14, 2021 · 18 comments

@baoachun (Contributor)

[screenshot]

The problem can be reproduced with this PR: #38107.

Run the following commands to reproduce:

git clone the code and run cmake with -DWITH_TESTING=ON, then run:

ctest -R test_mkldnn_conv_gelu_fuse_pass -V
@paddle-bot-old

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for answers in the official API docs, FAQ, historical issues, and the AI community. Have a nice day!

@lidanqing-intel (Contributor)

@jczaja Please comment on this issue; otherwise the disabling PR will be merged.

@lidanqing-intel lidanqing-intel changed the title the output format of mkldnn conv is wrong when data_format is NHWC The output format of mkldnn conv is wrong when data_format is NHWC Dec 23, 2021
@jczaja (Contributor) commented Dec 23, 2021

@baoachun I implemented NHWC support for oneDNN kernels in PaddlePaddle, so I can offer some insight.

If a PaddlePaddle input has layout NCHW, its shape is [N, C, H, W]; if it has layout NHWC, its shape is [N, H, W, C]. For oneDNN kernels, the shape is always described as [N, C, H, W], regardless of whether the layout is NCHW or NHWC. So internally, for NHWC models, when a CPU op is followed by a oneDNN op there is a shape rotation from [N, H, W, C] to [N, C, H, W], and when a oneDNN op is followed by a CPU op there is a shape rotation from [N, C, H, W] to [N, H, W, C]. The catch is that this assumes a oneDNN op is always followed by some CPU op. When we implemented NHWC support, the fetch op was always the last one, so the shape rotation happened there. But I noticed that PaddlePaddle now skips the fetch op by default, so the user gets access to data that still has the oneDNN shape. The proper fix is to add this shape rotation to the mechanism through which you get access to a oneDNN tensor. Here is the line calling the shape rotation:

// For expected NHWC data format we need to reshape the Output tensor
// as the MKL-DNN description was in NCHW and paddle is expecting NHWC
platform::MatchShapeToLayout(out, in_layout, out_layout);

Some more information (design doc on NHWC in oneDNN kernels):
https://github.com/PaddlePaddle/docs/blob/develop/docs/design/mkldnn/nhwc/nhwc.md

@baoachun (Contributor, Author)

Hi @jczaja, at present a user is hitting an inference error, which is caused by conv2d not supporting the NHWC format.

[screenshots]

am_veo.tar.gz

@lidanqing-intel lidanqing-intel removed their assignment Jan 4, 2022
@lidanqing-intel lidanqing-intel modified the milestones: Q4, 2022 Q1 Jan 4, 2022
@jczaja (Contributor) commented Jan 10, 2022

@baoachun Thanks for the additional information. I'm planning to work on this issue right after I finish my current one, so I will start on NHWC next week.

@jczaja (Contributor) commented Jan 18, 2022

@baoachun Just to let you know that I have started investigating this issue. I have reproduced it. The problem seems to be that the oneDNN elementwise_add(X, Y) kernel, with Y being broadcast, does not work properly in the NHWC case. I will write more when I have more details.

@jczaja jczaja mentioned this issue Jan 20, 2022
@jczaja (Contributor) commented Jan 20, 2022

@baoachun The situation is as follows. There are two problems:

  1. Broadcasting in oneDNN's elementwise kernels does not work properly for NHWC. A candidate fix is here: #38126.
  2. Batch norm in oneDNN's kernel does not work properly for 3D NHWC data. Status: under investigation.

@jczaja (Contributor) commented Jan 24, 2022

@baoachun I just rebased #39097 and it makes this issue go away. Please test whether PR #39097 works for you.

jczaja added a commit that referenced this issue Feb 8, 2022
* - 38126 potential fix

* - fix

* - build fix

* - another candidate fix

* - compilation fix

* - another fix

* - Fix to activation of NHWC being first oneDNN op in chain on oneDNN ops

* - compilation fix

* - added NHWC rotating for elementwise being first op

* - compilation fix

* - compilation fix

* - Added UT

* - cosmetic fixes
@lidanqing-intel lidanqing-intel assigned baoachun and unassigned jczaja Feb 14, 2022
@baoachun (Contributor, Author)

Hi @jczaja, most of the UTs have passed, except for test_mkldnn_conv_elementwise_add_fuse_pass; the error message is as follows. I noticed that the default input format in the conv operator is NCHW. Is this judgment reasonable?

const bool channel_last = (ctx->IsRunMKLDNNKernel() == false) &&

[screenshot]

@jczaja (Contributor) commented Feb 15, 2022

@baoachun OK, I understand that in #38107 a UT was added with the NHWC case disabled, and that I should enable and fix it. I will start working on this.

@jczaja (Contributor) commented Feb 15, 2022

@baoachun I'm not sure I'm solving the right issue, but I enabled the NHWC part of the UT that checks the conv + gelu fusing pass. The PR is here: #39591. Please check whether this is what you expected to be fixed.

@baoachun (Contributor, Author)

Hi @jczaja, I have removed most of the skip settings for NHWC; the PR is here: #39551. But the test_mkldnn_conv_elementwise_add_fuse_pass UT failed and I got the above error message. I'm not sure this judgment is reasonable.

@jczaja (Contributor) commented Feb 16, 2022

@baoachun OK, so to reproduce this problem I should enable NHWC in test_mkldnn_conv_elementwise_add_fuse_pass?

@baoachun (Contributor, Author)

Hi @jczaja, you can use this PR to reproduce: #39654.

@jczaja (Contributor) commented Mar 15, 2022

The issues reported here are fixed by #40049. Please retest and let us know if everything works for you.

@lidanqing-intel (Contributor)

@baoachun said PaddleHub reported failures after the PR was merged, so please keep this issue open. More info will follow.

@lidanqing-intel lidanqing-intel assigned jczaja and unassigned jczaja Mar 28, 2022
@lidanqing-intel (Contributor) commented Mar 28, 2022

To summarize:

  1. The model error was fixed by the oneDNN NHWC fixes in #40049. However, a newer Baidu commit broke the model again; see #40540 ("Advice on model that stopped to work").
  2. We expect the commit author to fix the issue; if you need any support or explanation of the oneDNN NHWC fixes in #40049, @jczaja will answer in time.

@yaomichael

Notes from the 5/20 meeting: the regression caused by the PaddleHub team should already have been fixed. @jiangjiajun will check internally and close this ticket.

@paddle-bot-old paddle-bot-old bot added the status/close 已关闭 label May 30, 2022