Add v2 dist benchmark vgg #7539
Changes from 4 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Performance for distributed vgg16 | ||
|
||
## Test Result | ||
|
||
### Single node single thread | ||
|
||
| Batch Size | 32 | 64 | 128 | 256 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | - | - | 16.74 | - | | ||
| PaddlePaddle v2 | - | - | 17.60 | - | | ||
| TensorFlow | - | - | - | - | | ||
|
||
### different batch size | ||
|
||
|
||
- PServer Count: 10 | ||
- Trainer Count: 20 | ||
- Metrics: samples / sec | ||
|
||
| Batch Size | 32 | 64 | 128 | 256 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | - | 247.40 | - | - | | ||
Review comment: It seems fluid's performance is …
Reply: Sorry, wrong column. I'll update this PR with the full test result. |
||
| PaddlePaddle v2 | - | - | 256.14 | - | | ||
| TensorFlow | - | - | - | - | | ||
|
||
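The two distributed tables report different units (`samples / sec` here, `mini-batch / sec` in the PServer table), so a small conversion helper may make the numbers easier to compare. The following is a minimal sketch, not the benchmark's own measurement code; the function names are hypothetical, and the source does not say whether the reported throughput is per trainer or aggregated over all trainers.

```python
def samples_per_sec(batch_size, num_batches, elapsed_sec):
    """Throughput in samples/sec for a timed window of training.

    batch_size:  samples per mini-batch (e.g. 128)
    num_batches: mini-batches processed inside the window
    elapsed_sec: wall-clock duration of the window in seconds
    """
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return batch_size * num_batches / elapsed_sec


def batches_to_samples(batches_per_sec, batch_size):
    """Convert a mini-batch/sec figure into samples/sec."""
    return batches_per_sec * batch_size


def samples_to_batches(samples_sec, batch_size):
    """Convert a samples/sec figure into mini-batch/sec."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return samples_sec / batch_size
```

With these helpers, a `mini-batch / sec` result at batch size 64 can be multiplied out to `samples / sec` before it is compared with the batch-size table.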
### different pserver number | ||
Review comment: Different PServer Count |
||
|
||
- Trainer Count: 100 | ||
- Batch Size: 64 | ||
- Metrics: mini-batch / sec | ||
|
||
| PServer Count | 10 | 20 | 40 | 60 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | - | - | - | - | | ||
| PaddlePaddle v2 | - | - | - | - | | ||
| TensorFlow | - | - | - | - | | ||
|
||
### Accelerate rate | ||
|
||
| Trainer Count | 20 | 40 | 80 | 100 | | ||
| -- | -- | -- | -- | -- | | ||
| PaddlePaddle Fluid | - | - | - | - | | ||
| PaddlePaddle v2 | - | - | - | - | | ||
| TensorFlow | - | - | - | - | | ||
|
||
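The README does not define "accelerate rate". One common reading is the speedup relative to the smallest trainer count, optionally normalized into a scaling efficiency. The sketch below assumes that interpretation; the function names are hypothetical and no measured values are implied.

```python
def speedup(throughput, baseline_throughput):
    """Ratio of a run's throughput to the baseline run's throughput."""
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return throughput / baseline_throughput


def scaling_efficiency(throughput, trainers,
                       baseline_throughput, baseline_trainers):
    """Fraction of ideal linear scaling that was actually achieved.

    Ideal scaling assumes throughput grows in proportion to the
    number of trainers, starting from the baseline run.
    """
    if baseline_trainers <= 0 or trainers <= 0:
        raise ValueError("trainer counts must be positive")
    ideal = baseline_throughput * trainers / baseline_trainers
    return throughput / ideal
```

An efficiency of 1.0 would mean perfectly linear scaling. Values well below 1.0 as the trainer count grows usually point to a communication or parameter-server bottleneck.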
|
||
## Steps to run the performance test | ||
|
||
1. Recompile PaddlePaddle with `-DWITH_DISTRIBUTE` enabled to build it with distributed support. | ||
1. When the build finishes, copy the output `whl` package located under `build/python/dist` to the current directory. | ||
1. Run `docker build -t [image:tag] .` to build the Docker image, then run `docker push [image:tag]` to push the image to a repository so Kubernetes can find it. | ||
1. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your Kubernetes cluster (you must configure the `kubectl` client before this step). | ||
1. Run `kubectl get po` to list the running pods, and run `kubectl logs [podID]` to fetch the logs of the pservers and trainers. | ||
|
||
Check the logs for the distributed training progress and analyze the performance. | ||
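The Docker and `kubectl` steps above can be scripted. The following is a minimal sketch that only assembles the commands quoted in the steps and, by default, prints them without executing anything. The helper names `build_commands` and `run_all` and the example image tag are hypothetical. The build and `whl` copy steps are left out because their exact commands are not given here.

```python
import shlex
import subprocess


def build_commands(image, pod_id=None):
    """Return the docker/kubectl commands from the steps, in order."""
    commands = [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        ["kubectl", "create", "-f", "pserver.yaml"],
        ["kubectl", "create", "-f", "trainer.yaml"],
        ["kubectl", "get", "po"],
    ]
    if pod_id is not None:
        # Fetch the log of one pserver or trainer pod.
        commands.append(["kubectl", "logs", pod_id])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$ " + " ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder image name; replace with your own [image:tag].
    run_all(build_commands("example.registry/benchmark:vgg16"))
```

Keeping `dry_run=True` as the default lets you review the command sequence before anything touches the cluster.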
|
||
## Enable verbose logs | ||
|
||
Edit `pserver.yaml` and `trainer.yaml` and add the environment variable `GLOG_v=3` to see what happened in detail. | ||
Review comment: I'm not sure whether we need to add … |
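As an illustration of the `GLOG_v` edit, the sketch below adds the variable to every container in a pod spec that has already been loaded into plain Python dictionaries. The function name `with_glog_verbosity` is hypothetical. Reading and writing the YAML files is left out, since that would need a YAML library that this example does not assume.

```python
import copy


def with_glog_verbosity(pod_spec, level=3):
    """Return a copy of pod_spec with GLOG_v set on every container.

    pod_spec is expected to look like the `spec.template.spec` part
    of the job YAML: {"containers": [{"name": ..., "env": [...]}]}.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's dict untouched
    for container in spec.get("containers", []):
        env = container.setdefault("env", [])
        for entry in env:
            if entry.get("name") == "GLOG_v":
                # Already present: just update the level.
                entry["value"] = str(level)
                break
        else:
            # Kubernetes env values are strings, so quote the number.
            env.append({"name": "GLOG_v", "value": str(level)})
    return spec
```

In the YAML files themselves this corresponds to one more `- name: GLOG_v` / `value: "3"` pair under `env:`, in the same style as the existing entries.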
This file was deleted.
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,40 +15,42 @@ spec: | |
hostNetwork: true | ||
containers: | ||
- name: trainer | ||
image: "registry.baidu.com/paddlepaddle/rawjob:vgg16" | ||
image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16" | ||
imagePullPolicy: Always | ||
command: ["paddle_k8s", "start_trainer", "v2"] | ||
env: | ||
- name: PADDLE_JOB_NAME | ||
value: vgg16v2job | ||
- name: BATCH_SIZE | ||
value: "128" | ||
- name: TRAINERS | ||
value: "20" | ||
- name: PSERVERS | ||
value: "10" | ||
- name: TOPOLOGY | ||
value: "" | ||
- name: ENTRY | ||
value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16.py" | ||
value: "cd /workspace && MKL_NUM_THREADS=1 python /workspace/vgg16_v2.py" | ||
|
||
- name: TRAINER_PACKAGE | ||
value: "/workspace" | ||
- name: PADDLE_INIT_PORT | ||
value: "30236" | ||
- name: PADDLE_INIT_NICS | ||
value: "xgbe0" | ||
- name: PADDLE_INIT_TRAINER_COUNT | ||
value: "1" | ||
value: "2" | ||
- name: PADDLE_INIT_PORTS_NUM | ||
value: "1" | ||
- name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE | ||
value: "1" | ||
- name: PADDLE_INIT_NUM_GRADIENT_SERVERS | ||
value: "20" | ||
- name: PADDLE_INIT_NUM_PASSES | ||
value: "1" | ||
value: "2" | ||
- name: PADDLE_INIT_USE_GPU | ||
value: "0" | ||
- name: LD_LIBRARY_PATH | ||
value: "/usr/local/nvidia/lib64" | ||
value: "/usr/local/lib:/usr/local/nvidia/lib64" | ||
- name: NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
|
Review comment: This basically cannot be downloaded, so a note should be added telling users to use a proxy.