Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't clear mkldnn cache in block_op executor dtor #25735

Closed

Conversation

sfraczek
Copy link
Contributor

@sfraczek sfraczek commented Jul 27, 2020

PR types

Bug fixes

PR changes

Others

Describe

MKLDNN cache is removed in Executor's destructor. This should not happen at the end of RunImpl of conditional_block_op, where locally created Executor is destroyed. When working on dygraph resnet enablement of MKLDNN, I found that test_resnet.py with MKLDNN enabled will crash because it cannot find forward MKLDNN primitive in cache.

@paddle-bot-old
Copy link

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

grygielski
grygielski previously approved these changes Jul 27, 2020
Copy link
Contributor

@grygielski grygielski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sfraczek sfraczek added the dygraph issues related to dygraph mode label Jul 28, 2020
@grygielski
Copy link
Contributor

grygielski commented Aug 4, 2020

@luotao1 PR-CI-Coverage fails because of test coverage in fl_listen_and_serve_op and conditional_block_infer_op. However, these lines are not executed in tests, only because test build does not have WITH_MKLDNN flag. Could you advice how we can pass this CI?

@luotao1
Copy link
Contributor

luotao1 commented Aug 5, 2020

When working on dygraph resnet enablement of MKLDNN, I found that test_resnet.py with MKLDNN enabled will crash because it cannot find forward MKLDNN primitive in cache.

Is this related with don't clear mkldnn cache at the end of RunImpl of conditional_block_op?

MKLDNN cache is removed in Executor's destructor. This should not happen at the end of RunImpl of conditional_block_op, where locally created Executor is destroyed.

Do you have any other method to solve this problem? Since the current method is not grace.

@@ -99,6 +99,9 @@ class CGenNCCLIdOp : public framework::OperatorBase {

framework::ProgramDesc empty_program;
framework::Executor executor(dev_ctx.GetPlace());
#ifdef PADDLE_WITH_MKLDNN
executor.KeepMKLDNNCache(true);
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c_gen_nccl_id_op.cc is used only in GPU, thus, it doesn't need to be updated.

@@ -214,6 +214,9 @@ class GenNCCLIdOp : public framework::OperatorBase {

framework::ProgramDesc empty_program;
framework::Executor executor(dev_ctx.GetPlace());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gen_nccl_id_op.cc is used only in GPU, thus, it doesn't need to be updated.

@@ -134,6 +134,9 @@ class TensorRTEngineOp : public framework::OperatorBase {
void RunNativeImpl(const framework::Scope &scope,
const platform::Place &dev_place) const {
framework::Executor executor(dev_place);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tensorrt_engine_op.h is used only in GPU, thus, it doesn't need to be updated.

@lidanqing-intel
Copy link
Contributor

Luotao think this is not elegant. Please consider submitting one issue they will change the executor.

@grygielski
Copy link
Contributor

@luotao1 I've submitted an Issue about this problem: #25988

@lidanqing-intel
Copy link
Contributor

This PR and issue #25988 Luotao said they need to discuss inside team. We wait some time.

@lidanqing-intel
Copy link
Contributor

lidanqing-intel commented Aug 19, 2020

@grygielski This PR seems have compatible issues. Test resnet does not pass on windows but on Linux it can pass. We could ask luotao for help for deploying on Windows and see what is the difference between windows and linux. Why linux pass but windows don't

@grygielski
Copy link
Contributor

Closing since #26502 is already merged

@grygielski grygielski closed this Aug 24, 2020
@sfraczek sfraczek deleted the block-op-executor-dnnl-cache branch September 3, 2020 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dygraph issues related to dygraph mode Intel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants