Add v2 dist benchmark vgg #7539

typhoonzero · 2018-01-15T13:06:14Z

No description provided.

helinwang · 2018-01-24T00:22:03Z

benchmark/cluster/vgg16/README.md

+
+| Batch Size | 32 | 64 | 128 | 256 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | 247.40 | - | - |


It seems Fluid's performance is 247.40/64 = 3.866 batches per second, while v2's performance is 256.14/128 = 2.001 batches per second.
The difference seems huge; do you have any idea why? (Also, could you please check whether my math is correct?)

Sorry, wrong column. I'll update this PR with full test result.
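
The question above divides a throughput figure by the batch size to get batches per second. A minimal sketch of that arithmetic, assuming the table cells are samples per second; the function name is illustrative and not part of the benchmark scripts, and since the reply notes the figures came from the wrong column, this only shows the calculation, not a valid comparison:

```python
def batches_per_second(samples_per_second, batch_size):
    """Convert a samples/sec throughput into mini-batches/sec."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return samples_per_second / float(batch_size)


# Figures quoted in the review comment above.
fluid_rate = batches_per_second(247.40, 64)
v2_rate = batches_per_second(256.14, 128)

print("fluid: %.3f batches/sec" % fluid_rate)
print("v2:    %.3f batches/sec" % v2_rate)
```

Comparing rows only makes sense when both figures use the same unit and the same batch size column.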

gongweibao

As discussed offline, we should think about how to avoid duplicating content that already exists in PaddleCloud.

helinwang · 2018-01-30T23:47:51Z

Thanks! Looks like we have a nice improvement over V2 on batch size 256!

gongweibao · 2018-01-31T07:56:30Z

benchmark/cluster/vgg16/Dockerfile

+#RUN mkdir -p /workspace
+#ADD reader.py /workspace/
+#RUN python /workspace/reader.py
+FROM python:2.7.14


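Since this is for testing, I think it would be better to use paddle:dev as the base image instead of this one.

No other dependencies need to be installed.

When debugging, you can enter the container and use various commands to inspect the system state.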

gongweibao · 2018-01-31T07:57:07Z

benchmark/cluster/vgg16/Dockerfile

+RUN pip install /*.whl && rm -f /*.whl
+ENV LD_LIBRARY_PATH=/usr/local/lib
+ADD reader.py /workspace/
+RUN python /workspace/reader.py


This download basically never succeeds, so we need to add a note telling users to use a proxy.
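
One way to surface such a note is to wrap the download step and attach a proxy hint when it fails. This is a minimal sketch under assumptions: `run_with_proxy_hint` and the `download` callable are hypothetical and not part of `reader.py`, and the conventional `http_proxy`/`https_proxy` environment variables are assumed to be what the user would set.

```python
import os


def run_with_proxy_hint(download):
    """Call `download()` and add a proxy hint if it fails with an I/O error."""
    try:
        return download()
    except (IOError, OSError) as err:
        # Hypothetical hint text; adjust to the project's own wording.
        proxy = os.environ.get("https_proxy") or os.environ.get("http_proxy")
        hint = (
            "dataset download failed: %s. "
            "If the dataset host is unreachable from your network, "
            "set http_proxy/https_proxy and retry." % err
        )
        if proxy:
            hint += " (current proxy: %s)" % proxy
        raise RuntimeError(hint)
```

The same hint could equally live in the README next to the image build instructions.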

gongweibao · 2018-01-31T07:58:35Z

benchmark/cluster/vgg16/v2_trainer.yaml

+        - name: TOPOLOGY
+          value: ""
+        - name: ENTRY
+          value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16_v2.py"


Use `python -u` to force unbuffered log output.
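
`python -u` disables stdout/stderr buffering for the whole process, so log lines reach the container logs immediately. Where the interpreter flags cannot be changed, flushing explicitly after each line has a similar effect for those lines. A minimal sketch; the `log` helper is hypothetical and not part of the benchmark scripts:

```python
import sys


def log(message, stream=None):
    """Write one log line and flush immediately so it is not held in a buffer."""
    stream = stream if stream is not None else sys.stdout
    stream.write(message + "\n")
    stream.flush()


log("pass 0, batch 10")
```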

gongweibao · 2018-01-31T07:59:09Z

benchmark/cluster/vgg16/v2_pserver.yaml

+        - name: TOPOLOGY
+          value: ""
+        - name: ENTRY
+          value: "python train.py"


Use `python -u` to force unbuffered log output.

Yancey1989 · 2018-02-01T05:53:04Z

benchmark/cluster/vgg16/README.md

+| PaddlePaddle v2 | 15.97 | 17.04 | 17.60 | 17.83 |
+| TensorFlow | - | - | - | - |
+
+### different batch size


different batch size
=>
Different Batch Size

Yancey1989 · 2018-02-01T05:53:13Z

benchmark/cluster/vgg16/README.md

+| TensorFlow | - | - | - | - |
+
+
+### Accelerate rate


Accelerate Rate

Yancey1989 · 2018-02-01T05:53:28Z

benchmark/cluster/vgg16/README.md

+| PaddlePaddle v2 (need more tests) | 326.85 | 534.58 | 853.30 | 1041.99 |
+| TensorFlow | - | - | - | - |
+
+### different pserver number


Different PServer Count

Yancey1989 · 2018-02-01T05:56:25Z

benchmark/cluster/vgg16/README.md

+| TensorFlow | - | - | - | - |
+
+
+### Accelerate rate


It's a rate metric, so maybe we should calculate this value as described in https://github.com/PaddlePaddle/Paddle/tree/develop/benchmark/cluster#measure-parallel-efficiency-by-increasing-trainer-count ?
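
A rate of this kind is commonly derived from a single-trainer baseline. The sketch below assumes the usual definitions, speedup S = throughput_N / throughput_1 and parallel efficiency E = S / N; whether the linked section uses exactly these formulas is an assumption, and the throughput numbers are made up for illustration, not benchmark results.

```python
def speedup(throughput_n, throughput_1):
    """Speedup of N trainers relative to a single-trainer baseline."""
    return throughput_n / float(throughput_1)


def parallel_efficiency(throughput_n, throughput_1, trainer_count):
    """Speedup divided by trainer count; 1.0 means perfect linear scaling."""
    return speedup(throughput_n, throughput_1) / trainer_count


# Illustrative (made-up) throughputs in samples/sec, keyed by trainer count.
baseline = 100.0
measured = {10: 900.0, 20: 1600.0}

for n in sorted(measured):
    s = speedup(measured[n], baseline)
    e = parallel_efficiency(measured[n], baseline, n)
    print("trainers=%d speedup=%.2f efficiency=%.2f" % (n, s, e))
```

Reporting efficiency rather than raw throughput makes rows with different trainer counts directly comparable.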

Add results.

Yancey1989

LGTM. Please also refine the title capitalization using this website: http://www.titlecase.com

Yancey1989 · 2018-02-01T10:52:22Z

benchmark/cluster/vgg16/README.md

+
+- Trainer Count: 60
+- Batch Size: 128
+- Metrics: mini-batch / sec


mini-batch / sec

Do you mean samples / sec?

Yancey1989 · 2018-02-01T10:54:16Z

benchmark/cluster/vgg16/README.md

+
+## Enable verbos logs
+
+Edit `pserver.yaml` and `trainer.yaml` and add an environment variable `GLOG_v=3` to see what happend in detail.


I'm not sure whether we also need to add `GLOG_logtostderr=1`. If you have already tested this, please ignore this comment.

Yancey1989 · 2018-02-01T10:59:13Z

benchmark/cluster/vgg16/Dockerfile

+RUN pip install -U kubernetes opencv-python &&   apt-get update -y &&   apt-get install -y iputils-ping libgtk2.0-dev
+# NOTE: By default CI built wheel packages turn WITH_DISTRIBUTE=OFF,
+#       so we must build one with distribute support to install in this image.
+RUN pip install paddlepaddle


Is this `pip install` redundant? Could the dataset download be moved to after line 12?

No. The lines below change often and downloading the dataset is slow, so this line was added to make debugging faster.

Yancey1989 · 2018-02-02T02:33:22Z

paddle/gserver/layers/MultiBoxLossLayer.h

@@ -1,3 +1,16 @@
+//  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.


The copyright message is duplicated.

Yancey1989

LGTM!!

typhoonzero added 7 commits January 15, 2018 21:03

add v2 dist benchmark vgg

373f8ba

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

27e31f6

… dist_train_benchmark_vgg16

update docker file

bbff57e

fix copyright check

9ad149a

add copyright for newly merged files

311d159

update job

a0ac133

update

b315a40

helinwang self-requested a review January 18, 2018 04:39

typhoonzero added 13 commits January 19, 2018 11:15

update using cifar10

9f50195

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

820ee78

… dist_train_benchmark_vgg16

fix style

541b42e

add fluid vgg16 dist test

d3905fb

update fluid vgg16 and add readme

cb34f6a

fix styles

b38452d

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

08b529a

… dist_train_benchmark_vgg16

fix style check

900e911

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

438d2ab

… dist_train_benchmark_vgg16

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

a28fd4e

… dist_train_benchmark_vgg16

Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho…

da3b14b

…onzero/Paddle into dist_train_benchmark_vgg16

update dist benchmark to one image

70142ae

Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho…

7aed1c1

…onzero/Paddle into dist_train_benchmark_vgg16

helinwang reviewed Jan 24, 2018

gongweibao reviewed Jan 24, 2018

typhoonzero mentioned this pull request Jan 29, 2018

Current distributed train performance is low with large batch size. #7944

Closed

update for today

bd64719

modify some

419e4c4

gongweibao reviewed Jan 31, 2018

gongweibao added 2 commits January 31, 2018 09:09

add results

38b8b7f

clean code

cfbbb98

typhoonzero added 3 commits January 31, 2018 20:09

update results

f9db562

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

8d9c3fc

… dist_train_benchmark_vgg16

update points

d6edfd0

Yancey1989 reviewed Feb 1, 2018

typhoonzero and others added 9 commits February 1, 2018 14:15

fix style check

355ecaf

follow comments

b7fbb91

clean code

c98b40e

add others

5530212

add comments

ccef94a

fix typo

00b9aed

Merge pull request #3 from gongweibao/wuyi7539_3

747df80

Add results.

update dockerfile

7c2d32b

Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho…

978396e

…onzero/Paddle into dist_train_benchmark_vgg16

Yancey1989 previously approved these changes Feb 1, 2018

Yancey1989 reviewed Feb 1, 2018

fix style

52df85f

typhoonzero dismissed Yancey1989’s stale review via 52df85f February 1, 2018 12:35

Yancey1989 reviewed Feb 2, 2018

typhoonzero added 2 commits February 2, 2018 11:05

follow comments

0bbd7bc

update docs

a5acad1

Yancey1989 approved these changes Feb 2, 2018

typhoonzero merged commit 42c98f4 into PaddlePaddle:develop Feb 2, 2018

typhoonzero mentioned this pull request Feb 5, 2018

Need a baseline performance benchmark for distributed training. #6941

Closed
