
Verification of dygraph MKLDNN accuracy convergence #25872

Closed
sfraczek opened this issue Jul 31, 2020 · 4 comments
Labels: dygraph (issues related to dygraph mode), Intel

Comments

@sfraczek (Contributor) commented Jul 31, 2020

Quick Note: OneDNN was previously named DNNL and MKLDNN.

Instructions on how to run Dygraph OneDNN training

Once the required pull requests are merged, you can run training of a few dygraph models with OneDNN kernels. Some modifications to the models are still required. The models whose training we are starting to support now are Mnist, ResNet, MobileNetV1, and MobileNetV2.

You can prepend DNNL_VERBOSE=1 to the command to see which primitives are created by the OneDNN library and verify which ops are using OneDNN primitives.
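For example (assuming the same train.py invocation as in the Mnist section below):

DNNL_VERBOSE=1 FLAGS_use_mkldnn=true python train.py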

Info common to all models

Some training scripts (e.g. ResNet) always use the GPU, so you have to add a switch such as --use_gpu to the script and disable it on the command line to run on CPU, as sketched below.
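A minimal sketch of such a switch (the argument name, default, and helper below are illustrative assumptions, not taken from any particular model script):

import argparse
import paddle.fluid as fluid

def str2bool(v):
    # accept --use_gpu=False / --use_gpu=True style values
    return str(v).lower() in ("true", "1")

parser = argparse.ArgumentParser()
parser.add_argument("--use_gpu", type=str2bool, default=True,
                    help="Run on GPU; pass --use_gpu=False to use CPU (OneDNN) kernels")
args = parser.parse_args()

# pick the execution place before entering the dygraph guard
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
with fluid.dygraph.guard(place):
    pass  # build the model and run the training loop here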

Mnist

FLAGS_use_mkldnn=true python train.py
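Alternatively (this is an assumption about how FLAGS_* values are picked up from the environment, not part of the original instructions), the flag can be set from inside the script, as long as it happens before Paddle is imported:

import os
# FLAGS_* values are read when the Paddle core library is loaded,
# so set this before the first `import paddle` / `import paddle.fluid`
os.environ["FLAGS_use_mkldnn"] = "true"

import paddle.fluid as fluid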

Mobilenet

FLAGS_use_mkldnn=true python train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=/data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2 (or --model=MobileNetV1)
You also have to add use_mkldnn=True to ops which are not imported from dygraph:

diff --git a/dygraph/mobilenet/mobilenet_v2.py b/dygraph/mobilenet/mobilenet_v2.py
index 6da031f..8541c87 100644
--- a/dygraph/mobilenet/mobilenet_v2.py
+++ b/dygraph/mobilenet/mobilenet_v2.py
@@ -66,7 +66,7 @@ class ConvBNLayer(fluid.dygraph.Layer):
         y = self._conv(inputs)
         y = self._batch_norm(y)
         if if_act:
-            y = fluid.layers.relu6(y)
+            y = fluid.layers.relu6(y, use_mkldnn=True)
         return y


@@ -112,7 +112,7 @@ class InvertedResidualUnit(fluid.dygraph.Layer):
         y = self._bottleneck_conv(y, if_act=True)
         y = self._linear_conv(y, if_act=False)
         if ifshortcut:
-            y = fluid.layers.elementwise_add(inputs, y)
+            y = fluid.layers.elementwise_add(inputs, y, use_mkldnn=True)
         return y
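A small sanity-check sketch (not part of the original instructions; it assumes the required PRs listed below are merged so that these ops accept use_mkldnn) that exercises the same patched calls on CPU. Run it with FLAGS_use_mkldnn=true and DNNL_VERBOSE=1 and the corresponding OneDNN primitives should appear in the verbose log:

import numpy as np
import paddle.fluid as fluid

with fluid.dygraph.guard(fluid.CPUPlace()):
    x = fluid.dygraph.to_variable(np.random.rand(1, 3, 8, 8).astype("float32"))
    # same calls as in the diff above
    y = fluid.layers.relu6(x, use_mkldnn=True)
    z = fluid.layers.elementwise_add(x, y, use_mkldnn=True)
    print(z.numpy().shape)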

ResNet

FLAGS_use_mkldnn=true python train.py
You also have to add use_mkldnn=True to ops which are not imported from dygraph:

diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py
index 6bf86f9..f53c5a2 100644
--- a/dygraph/resnet/train.py
+++ b/dygraph/resnet/train.py
@@ -239,9 +239,9 @@ class BottleneckBlock(fluid.dygraph.Layer):
         else:
             short = self.short(inputs)

-        y = fluid.layers.elementwise_add(x=short, y=conv2)
+        y = fluid.layers.elementwise_add(x=short, y=conv2, use_mkldnn=True)

-        layer_helper = LayerHelper(self.full_name(), act='relu')
+        layer_helper = LayerHelper(self.full_name(), act='relu', use_mkldnn=True)
         return layer_helper.append_activation(y)

Which PRs have to be merged to run Dygraph OneDNN training

Required

Related PRs

Request for verifying training accuracy convergence

Could you please verify that accuracy converges properly? With our limited resources this is hard for us to do, and we don't have a procedure for it in place. For example, MobileNet might take many days to train.

We have only been able to run the full test (with the --ce flag) for Mnist and ResNet (Flowers).

Mnist training

Name       Result
Reference  Loss at epoch 4, Test avg_loss is: 0.0372143830631, acc is: 0.989182692308
OneDNN     Loss at epoch 4, Test avg_loss is: 0.0370462594453, acc is: 0.98828125

ResNet Flowers training

Name       Result
Reference  final eval acc1 0.689, acc5 0.886
OneDNN     final eval acc1 0.746, acc5 0.927

Final note

Since we are just starting to support OneDNN training in PaddlePaddle, there may still be some bugs in training that impact accuracy.

@sfraczek added the dygraph and Intel labels on Jul 31, 2020
@arlesniak (Contributor) commented:

Update

Because some PRs have already been merged, please see the updated info:

Which PRs have to be merged to run Dygraph OneDNN training

Required

Instructions on how to run Dygraph OneDNN training

ResNet

FLAGS_use_mkldnn=true python train.py
You also have to add use_mkldnn=True to ops which are not imported from dygraph:

diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py
index 6bf86f9..f53c5a2 100644
--- a/dygraph/resnet/train.py
+++ b/dygraph/resnet/train.py
@@ -239,9 +239,9 @@ class BottleneckBlock(fluid.dygraph.Layer):
         else:
             short = self.short(inputs)

         y = fluid.layers.elementwise_add(x=short, y=conv2)

-        layer_helper = LayerHelper(self.full_name(), act='relu')
+        layer_helper = LayerHelper(self.full_name(), act='relu', use_mkldnn=True)
         return layer_helper.append_activation(y)

@luotao1 (Contributor) commented Aug 21, 2020

mobilenet-v1.log
mobilenet-v2.log
resnet50.log
I downloaded the training logs of the three models in https://github.com/PaddlePaddle/benchmark/tree/master/dynamic_graph. @sfraczek @lidanqing-intel

@luotao1 (Contributor) commented Sep 14, 2020

MobileNetV1_1card.log

@paddle-bot-old commented:

Since you haven't replied for more than a year, we have closed this issue/PR.
If the problem is not solved or there is a follow-up question, please reopen it at any time and we will continue to follow up.
