# Verification of dygraph MKLDNN accuracy convergence #25872
**Update:** Because some PRs are already merged, please see the updated info in the "Required" subsection of "Which PRs have to be merged to run Dygraph OneDNN training" below.
### Instructions how to run Dygraph OneDNN training

#### ResNet
```diff
diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py
index 6bf86f9..f53c5a2 100644
--- a/dygraph/resnet/train.py
+++ b/dygraph/resnet/train.py
@@ -239,9 +239,9 @@ class BottleneckBlock(fluid.dygraph.Layer):
         else:
             short = self.short(inputs)
         y = fluid.layers.elementwise_add(x=short, y=conv2)
-        layer_helper = LayerHelper(self.full_name(), act='relu')
+        layer_helper = LayerHelper(self.full_name(), act='relu', use_mkldnn=True)
         return layer_helper.append_activation(y)
```
mobilenet-v1.log
Since you haven't replied for more than a year, we have closed this issue/PR.
Quick Note: OneDNN was previously named DNNL and MKLDNN.
### Instructions how to run Dygraph OneDNN training
Once you merge the required pull requests, you can run training of a few dygraph models with OneDNN kernels. Some modifications to the models are still required. The models whose training we are starting to support now are Mnist, ResNet, MobileNetV1, and MobileNetV2.
You can prepend `DNNL_VERBOSE=1` to the command to see which primitives the OneDNN library creates, so you can verify which ops are using OneDNN primitives.
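For example, combined with one of the training commands below (a minimal sketch; the verbose output format depends on your OneDNN version):

```bash
# Prepend DNNL_VERBOSE=1 so OneDNN prints each primitive it creates during training.
DNNL_VERBOSE=1 FLAGS_use_mkldnn=true python train.py
```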
#### All models info

For some training scripts you have to add a switch such as `--use_gpu` and disable it to run on CPU, because the GPU is used by default (e.g. in ResNet). Then add the switch to the command.

#### Mnist
```bash
FLAGS_use_mkldnn=true python train.py
```
#### Mobilenet
```bash
FLAGS_use_mkldnn=true python train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=/data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2
```
(Use `--model=MobileNetV1` for MobileNetV1.)

You also have to add `use_mkldnn=True` to ops which are not imported from dygraph.

#### ResNet
```bash
FLAGS_use_mkldnn=true python train.py
```
You also have to add `use_mkldnn=True` to ops which are not imported from dygraph, as in the sketch below.
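A minimal sketch of that change (an assumption based on the ResNet diff earlier in this issue; the helper name `add_and_relu` is hypothetical and only illustrates the pattern):

```python
import paddle.fluid as fluid
from paddle.fluid.layer_helper import LayerHelper


def add_and_relu(block, short, conv2):
    # Elementwise add of the shortcut branch and the last convolution output.
    y = fluid.layers.elementwise_add(x=short, y=conv2)
    # LayerHelper is not a dygraph sublayer, so pass use_mkldnn=True explicitly
    # to make the appended ReLU activation use the OneDNN kernel.
    layer_helper = LayerHelper(block.full_name(), act='relu', use_mkldnn=True)
    return layer_helper.append_activation(y)
```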
### Which PRs have to be merged to run Dygraph OneDNN training

#### Required
#### Related PRs
### Request for verifying training accuracy convergence
Could you please provide us with a verification of proper accuracy convergence? With our limited resources it is hard for us to do: we don't have a procedure for that in place, and, for example, MobileNet might take many days to train.
We have only been able to run a full test with the `--ce` flag for Mnist and ResNet (Flowers).
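A hedged sketch of such a run (assuming the script accepts `--ce` as a plain flag; the exact arguments differ per model):

```bash
# Run the training script with OneDNN kernels and the continuous-evaluation flag.
FLAGS_use_mkldnn=true python train.py --ce
```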
- Mnist training
- ResNet Flowers training
### Final note
Since we are just starting to support OneDNN training in PaddlePaddle, there may still be some training bugs that impact accuracy.