Add v2 dist benchmark vgg #7539

Merged
Commits
39 commits
373f8ba
add v2 dist benchmark vgg
typhoonzero Jan 15, 2018
27e31f6
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
typhoonzero Jan 16, 2018
bbff57e
update docker file
typhoonzero Jan 16, 2018
9ad149a
fix copyright check
typhoonzero Jan 16, 2018
311d159
add copyright for newly merged files
typhoonzero Jan 16, 2018
a0ac133
update job
typhoonzero Jan 16, 2018
b315a40
update
typhoonzero Jan 16, 2018
9f50195
update using cifar10
typhoonzero Jan 19, 2018
820ee78
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
typhoonzero Jan 19, 2018
541b42e
fix style
typhoonzero Jan 19, 2018
d3905fb
add fluid vgg16 dist test
typhoonzero Jan 19, 2018
cb34f6a
update fluid vgg16 and add readme
typhoonzero Jan 22, 2018
b38452d
fix styles
typhoonzero Jan 22, 2018
08b529a
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
typhoonzero Jan 22, 2018
900e911
fix style check
typhoonzero Jan 22, 2018
438d2ab
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
typhoonzero Jan 22, 2018
a28fd4e
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
typhoonzero Jan 23, 2018
da3b14b
Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho…
typhoonzero Jan 23, 2018
70142ae
update dist benchmark to one image
typhoonzero Jan 23, 2018
7aed1c1
Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho…
typhoonzero Jan 23, 2018
bd64719
update for today
typhoonzero Jan 29, 2018
419e4c4
modify some
gongweibao Jan 31, 2018
38b8b7f
add results
gongweibao Jan 31, 2018
cfbbb98
clean code
gongweibao Jan 31, 2018
f9db562
update results
typhoonzero Jan 31, 2018
8d9c3fc
Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…
typhoonzero Jan 31, 2018
d6edfd0
update points
typhoonzero Feb 1, 2018
355ecaf
fix style check
typhoonzero Feb 1, 2018
b7fbb91
follow comments
typhoonzero Feb 1, 2018
c98b40e
clean code
gongweibao Feb 1, 2018
5530212
add others
gongweibao Feb 1, 2018
ccef94a
add comments
gongweibao Feb 1, 2018
00b9aed
fix typo
gongweibao Feb 1, 2018
747df80
Merge pull request #3 from gongweibao/wuyi7539_3
typhoonzero Feb 1, 2018
7c2d32b
update dockerfile
typhoonzero Feb 1, 2018
978396e
Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho…
typhoonzero Feb 1, 2018
52df85f
fix style
typhoonzero Feb 1, 2018
0bbd7bc
follow comments
typhoonzero Feb 2, 2018
a5acad1
update docs
typhoonzero Feb 2, 2018
@@ -12,4 +12,4 @@ ENV LD_LIBRARY_PATH=/usr/local/lib
ADD reader.py /workspace/
RUN python /workspace/reader.py
Contributor:

This download basically never succeeds, so we need to add a hint telling users to use a proxy.


ADD vgg16.py /workspace/
ADD vgg16_fluid.py vgg16_v2.py /workspace/
58 changes: 58 additions & 0 deletions benchmark/cluster/vgg16/README.md
@@ -0,0 +1,58 @@
# Performance for distributed vgg16

## Test Result

### Single node single thread

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | 16.74 | - |
| PaddlePaddle v2 | - | - | 17.60 | - |
| TensorFlow | - | - | - | - |

### different batch size
Contributor:

different batch size
=>
Different Batch Size


- PServer Count: 10
- Trainer Count: 20
- Metrics: samples / sec

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | 247.40 | - | - |
Contributor (@helinwang, Jan 24, 2018):

It seems Fluid's throughput is 247.40 / 64 = 3.866 batches per second, while v2's is 256.14 / 128 = 2.001 batches per second.
The difference is huge; do you have an idea why? (Also, could you please check whether my math is correct?)

Contributor (author):

Sorry, wrong column. I'll update this PR with the full test results.
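The arithmetic in this exchange can be reproduced with a quick sketch; the throughput and batch-size figures are the ones quoted in the comment and the table above:

```python
# Quick check of the reviewer's arithmetic: convert throughput in
# samples/sec into mini-batches/sec by dividing by the batch size used.
def batches_per_sec(samples_per_sec: float, batch_size: int) -> float:
    return samples_per_sec / batch_size

fluid_bps = batches_per_sec(247.40, 64)   # Fluid column, batch size 64
v2_bps = batches_per_sec(256.14, 128)     # v2 column, batch size 128

print(f"Fluid: {fluid_bps:.3f} batches/sec")  # → Fluid: 3.866 batches/sec
print(f"v2: {v2_bps:.3f} batches/sec")        # → v2: 2.001 batches/sec
```

As the author's reply notes, the two figures were taken from different batch-size columns, so the comparison itself was not apples-to-apples.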

| PaddlePaddle v2 | - | - | 256.14 | - |
| TensorFlow | - | - | - | - |

### different pserver number
Contributor:

Different PServer Count


- Trainer Count: 100
- Batch Size: 64
- Metrics: mini-batch / sec

| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

### Speedup

| Trainer Count | 20 | 40 | 80 | 100 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |


## Steps to run the performance test

1. You must re-compile PaddlePaddle and enable `-DWITH_DISTRIBUTE` to build PaddlePaddle with distributed support.
1. When the build finishes, copy the output `whl` package located under `build/python/dist` to current directory.
1. Run `docker build -t [image:tag] .` to build the docker image and run `docker push [image:tag]` to push the image to the repository so kubernetes can find it.
1. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your kubernetes cluster (you must configure the `kubectl` client before this step).
1. Run `kubectl get po` to get running pods, and run `kubectl logs [podID]` to fetch the pod log of pservers and trainers.

Check the logs for the distributed training progress and analyze the performance.
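As a rough sketch of that analysis step, throughput can be estimated by parsing the per-batch `spent:` timings the trainer prints; the exact log-line format used here is an assumption for illustration, not necessarily what the scripts emit:

```python
import re

# Hypothetical log parser: pull the per-batch "spent: <seconds>"
# timings out of a trainer pod log and estimate samples/sec.
SPENT_RE = re.compile(r"spent:\s*([0-9.]+)")

def estimate_samples_per_sec(log_text: str, batch_size: int) -> float:
    times = [float(m) for m in SPENT_RE.findall(log_text)]
    if not times:
        raise ValueError("no per-batch timings found in log")
    avg_batch_time = sum(times) / len(times)
    return batch_size / avg_batch_time

log = ("Pass 0, Batch 0, Cost 2.31, spent: 0.52\n"
       "Pass 0, Batch 1, Cost 2.28, spent: 0.48\n")
print(round(estimate_samples_per_sec(log, 128), 1))  # → 256.0
```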

## Enable verbose logs

Edit `pserver.yaml` and `trainer.yaml` and add the environment variable `GLOG_v=3` to see what happened in detail.
Contributor:

I'm not sure whether we also need to add GLOG_logtostderr=1; if you have tested it, please ignore this comment.

15 changes: 0 additions & 15 deletions benchmark/cluster/vgg16/fluid/README.md

This file was deleted.

@@ -14,7 +14,7 @@ spec:
- name: job-registry-secret
containers:
- name: pserver
image: "registry.baidu.com/paddlepaddle/rawjob:vgg16_fluid"
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
imagePullPolicy: Always
ports:
- name: jobport-30236
@@ -33,7 +33,7 @@ spec:
- name: TOPOLOGY
value: ""
- name: ENTRY
value: "LD_LIBRARY_PATH=/usr/local/lib MKL_NUM_THREADS=1 python /workspace/vgg16.py --local 0"
value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0"
- name: TRAINER_PACKAGE
value: "/workspace"
- name: PADDLE_INIT_PORT
@@ -53,7 +53,7 @@ spec:
- name: PADDLE_INIT_USE_GPU
value: "0"
- name: LD_LIBRARY_PATH
value: "/usr/local/nvidia/lib64"
value: "/usr/local/lib:/usr/local/nvidia/lib64"
- name: NAMESPACE
valueFrom:
fieldRef:
@@ -15,7 +15,7 @@ spec:
hostNetwork: true
containers:
- name: trainer
image: "registry.baidu.com/paddlepaddle/rawjob:vgg16_fluid"
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
imagePullPolicy: Always
command: ["paddle_k8s", "start_fluid"]
env:
@@ -30,7 +30,7 @@ spec:
- name: TOPOLOGY
value: ""
- name: ENTRY
value: "cd /workspace && LD_LIBRARY_PATH=/usr/local/lib MKL_NUM_THREADS=1 python /workspace/vgg16.py --local 0"
value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0 --batch_size 128"
- name: TRAINER_PACKAGE
value: "/workspace"
- name: PADDLE_INIT_PORT
@@ -50,7 +50,7 @@ spec:
- name: PADDLE_INIT_USE_GPU
value: "0"
- name: LD_LIBRARY_PATH
value: "/usr/local/nvidia/lib64"
value: "/usr/local/lib:/usr/local/nvidia/lib64"
- name: NAMESPACE
valueFrom:
fieldRef:
7 changes: 0 additions & 7 deletions benchmark/cluster/vgg16/v2/Dockerfile

This file was deleted.

70 changes: 0 additions & 70 deletions benchmark/cluster/vgg16/v2/reader.py

This file was deleted.

@@ -14,7 +14,7 @@ spec:
- name: job-registry-secret
containers:
- name: pserver
image: "registry.baidu.com/paddlepaddle/rawjob:vgg16"
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
imagePullPolicy: Always
ports:
- name: jobport-30236
@@ -49,7 +49,7 @@ spec:
- name: PADDLE_INIT_USE_GPU
value: "0"
- name: LD_LIBRARY_PATH
value: "/usr/local/nvidia/lib64"
value: "/usr/local/lib:/usr/local/nvidia/lib64"
- name: NAMESPACE
valueFrom:
fieldRef:
@@ -15,40 +15,42 @@ spec:
hostNetwork: true
containers:
- name: trainer
image: "registry.baidu.com/paddlepaddle/rawjob:vgg16"
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
imagePullPolicy: Always
command: ["paddle_k8s", "start_trainer", "v2"]
env:
- name: PADDLE_JOB_NAME
value: vgg16v2job
- name: BATCH_SIZE
value: "128"
- name: TRAINERS
value: "20"
- name: PSERVERS
value: "10"
- name: TOPOLOGY
value: ""
- name: ENTRY
value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16.py"
value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16_v2.py"
Contributor:

Use `python -u` here to force unbuffered log output.

- name: TRAINER_PACKAGE
value: "/workspace"
- name: PADDLE_INIT_PORT
value: "30236"
- name: PADDLE_INIT_NICS
value: "xgbe0"
- name: PADDLE_INIT_TRAINER_COUNT
value: "1"
value: "2"
- name: PADDLE_INIT_PORTS_NUM
value: "1"
- name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
value: "1"
- name: PADDLE_INIT_NUM_GRADIENT_SERVERS
value: "20"
- name: PADDLE_INIT_NUM_PASSES
value: "1"
value: "2"
- name: PADDLE_INIT_USE_GPU
value: "0"
- name: LD_LIBRARY_PATH
value: "/usr/local/nvidia/lib64"
value: "/usr/local/lib:/usr/local/nvidia/lib64"
- name: NAMESPACE
valueFrom:
fieldRef:
@@ -16,12 +16,17 @@

import paddle.v2.dataset.cifar as cifar
import paddle.v2 as paddle
import reader
import time
import os

DATA_DIM = 3 * 32 * 32
CLASS_DIM = 10
BATCH_SIZE = 128
BATCH_SIZE = os.getenv("BATCH_SIZE")
if BATCH_SIZE:
BATCH_SIZE = int(BATCH_SIZE)
else:
BATCH_SIZE = 128
NODE_COUNT = int(os.getenv("TRAINERS"))
ts = 0


@@ -77,14 +82,15 @@ def vgg19(input, class_dim):

def main():
global ts
paddle.init(use_gpu=False, trainer_count=1)
paddle.init(use_gpu=False)
image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM))
lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(CLASS_DIM))

extra_layers = None
learning_rate = 0.01
    # NOTE: v2 distributed training needs to average updates across trainers.
learning_rate = 1e-3 / NODE_COUNT
out = vgg16(image, class_dim=CLASS_DIM)
cost = paddle.layer.classification_cost(input=out, label=lbl)
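The learning-rate change in this hunk (dividing by `NODE_COUNT` so that summed updates behave like averaged ones) can be illustrated with a standalone sketch; the gradient values and trainer count below are made up for illustration:

```python
# Sketch of the NOTE above: if N trainers each push gradients that the
# parameter server sums, scaling each trainer's learning rate by 1/N
# makes the aggregate step equal to applying the base learning rate to
# the *averaged* gradient.
def summed_update(grads, base_lr, node_count):
    per_trainer_lr = base_lr / node_count
    return sum(per_trainer_lr * g for g in grads)

base_lr = 1e-3
grads = [0.5, 0.3]  # one gradient per trainer (illustrative values)
avg_grad_update = base_lr * sum(grads) / len(grads)
assert abs(summed_update(grads, base_lr, len(grads)) - avg_grad_update) < 1e-12
```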

Expand Down Expand Up @@ -123,7 +129,9 @@ def main():

# End batch and end pass event handler
def event_handler(event):
global ts
global ts, ts_pass
if isinstance(event, paddle.event.BeginPass):
ts_pass = time.time()
if isinstance(event, paddle.event.BeginIteration):
ts = time.time()
if isinstance(event, paddle.event.EndIteration):
@@ -132,9 +140,8 @@ def event_handler(event):
event.pass_id, event.batch_id, event.cost, event.metrics,
time.time() - ts)
if isinstance(event, paddle.event.EndPass):
with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
trainer.save_parameter_to_tar(f)

print "Pass %d end, spent: %f" % (event.pass_id,
time.time() - ts_pass)
result = trainer.test(reader=test_reader)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
