
The output format of mkldnn conv is wrong when data_format is NHWC #38126

Closed
baoachun opened this issue Dec 14, 2021 · 18 comments

@baoachun (Contributor)

[screenshot]

The problem can be reproduced with this PR: #38107.

Run the following commands to reproduce:

git clone the code and run cmake with -DWITH_TESTING=ON, then run:

ctest -R test_mkldnn_conv_gelu_fuse_pass -V
@paddle-bot-old

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for answers in the official API docs, FAQ, historical issues, and the AI community. Have a nice day!

@lidanqing-intel (Contributor)

@jczaja Please comment on this issue; otherwise the disabling PR will be merged.

@lidanqing-intel lidanqing-intel changed the title the output format of mkldnn conv is wrong when data_format is NHWC The output format of mkldnn conv is wrong when data_format is NHWC Dec 23, 2021
@jczaja (Contributor) commented Dec 23, 2021

@baoachun I implemented NHWC support for oneDNN kernels in PaddlePaddle, so I can offer some insight.

If a PaddlePaddle input has layout NCHW, its shape is [N, C, H, W]; if it has layout NHWC, its shape is [N, H, W, C]. For oneDNN kernels, the shape is always described as [N, C, H, W], regardless of whether the layout is NCHW or NHWC. So internally, for NHWC models, when a CPU op is followed by a oneDNN op there is a shape rotation from [N, H, W, C] to [N, C, H, W], and when a oneDNN op is followed by a CPU op there is a shape rotation from [N, C, H, W] to [N, H, W, C]. The catch is that this assumes a oneDNN op is always followed by some CPU op. When we implemented NHWC support, the fetch op was always the last one, so the shape rotation happened there. But I noticed that PaddlePaddle now skips the fetch op by default, so the user gets access to data that still has the oneDNN shape. The proper fix is to add this shape rotation to the mechanism through which you get access to a oneDNN tensor. Here is the line calling the shape rotation:

// For expected NHWC data format we need to reshape the Output tensor
// as the MKL-DNN description was in NCHW and paddle is expecting NHWC
platform::MatchShapeToLayout(out, in_layout, out_layout);

Some more information (design doc on NHWC in oneDNN kernels):
https://github.com/PaddlePaddle/docs/blob/develop/docs/design/mkldnn/nhwc/nhwc.md

@baoachun (Contributor, Author)

Hi @jczaja, at present a user is hitting an inference error, which is caused by conv2d not supporting the NHWC format.

[screenshots]

am_veo.tar.gz

@lidanqing-intel lidanqing-intel removed their assignment Jan 4, 2022
@lidanqing-intel lidanqing-intel modified the milestones: Q4, 2022 Q1 Jan 4, 2022
@jczaja (Contributor) commented Jan 10, 2022

@baoachun Thanks for the additional information. I'm planning to work on this issue right after I finish my current one, so I will start on NHWC next week.

@jczaja (Contributor) commented Jan 18, 2022

@baoachun Just to let you know that I have started investigating this issue. I have reproduced it. The problem seems to be that the oneDNN elementwise_add(X, Y) kernel, with Y being broadcast, does not work properly in the NHWC case. I will write more when I have more details.

@jczaja jczaja mentioned this issue Jan 20, 2022
@jczaja (Contributor) commented Jan 20, 2022

@baoachun The situation is as follows. There are two problems:

  1. Broadcasting in oneDNN's elementwise kernels does not work properly for NHWC. A candidate fix is here: #38126.
  2. Batch norm in oneDNN's kernel does not work properly for 3D NHWC data. Status: under investigation.

@jczaja (Contributor) commented Jan 24, 2022

@baoachun I just rebased #39097 and it makes this issue go away. Please test whether PR #39097 works for you.

jczaja added a commit that referenced this issue Feb 8, 2022
* - 38126 potential fix

* - fix

* - build fix

* - another candidate fix

* - compilation fix

* - another fix

* - Fix to activation of NHWC being first oneDNN op in chain on oneDNN ops

* - compilation fix

* - added NHWC rotating for elementwise being first op

* - compilation fix

* - compilation fix

* - Added UT

* - cosmetic fixes
@lidanqing-intel lidanqing-intel assigned baoachun and unassigned jczaja Feb 14, 2022
@baoachun (Contributor, Author)

Hi @jczaja, most of the UTs have passed, except for test_mkldnn_conv_elementwise_add_fuse_pass; the error message is as follows. I noticed that the default input format in the conv operator is NCHW. Is this judgment reasonable?

const bool channel_last = (ctx->IsRunMKLDNNKernel() == false) &&

[screenshot]

@jczaja (Contributor) commented Feb 15, 2022

@baoachun OK, I understand that in #38107 a UT was added with the NHWC case disabled, and that I should enable and fix it. I will start working on this.

@jczaja (Contributor) commented Feb 15, 2022

@baoachun I'm not sure I'm solving the right issue, but I enabled the NHWC part of the UT that checks the conv + gelu fusing pass. The PR is here: #39591. Please check whether this is what you expected to be fixed.

@baoachun (Contributor, Author)

Hi @jczaja, I have removed most of the skip settings for NHWC; the PR is here: #39551. But the test_mkldnn_conv_elementwise_add_fuse_pass UT failed and I got the above error message. I'm not sure this judgment is reasonable.

@jczaja (Contributor) commented Feb 16, 2022

@baoachun OK, so to reproduce this problem I should enable NHWC in test_mkldnn_conv_elementwise_add_fuse_pass?

@baoachun (Contributor, Author)

Hi @jczaja, you can use this PR to reproduce: #39654.

@jczaja (Contributor) commented Mar 15, 2022

The issues reported here are fixed by #40049. Please retest and let us know if everything works for you.

@lidanqing-intel (Contributor)

@baoachun said PaddleHub reported failures after the PR was merged, so please keep this issue open. More info will follow.

@lidanqing-intel lidanqing-intel assigned jczaja and unassigned jczaja Mar 28, 2022
@lidanqing-intel (Contributor) commented Mar 28, 2022

To summarize:

  1. The model error was fixed by the oneDNN NHWC fixes in #40049. However, a newer Baidu commit broke the model again; see #40540 ("Advice on model that stopped to work").
  2. We expect the commit author to fix the issue; if you need any support or explanation of the oneDNN NHWC fixes in #40049, @jczaja will answer in time.

@yaomichael

Notes from the 5/20 meeting: the regression caused by the PaddleHub team should already have been fixed. @jiangjiajun will check internally and close this ticket.

@paddle-bot-old paddle-bot-old bot added the status/close 已关闭 label May 30, 2022