Add v2 dist benchmark vgg #7539

Merged

39 commits
373f8ba  add v2 dist benchmark vgg (typhoonzero, Jan 15, 2018)
27e31f6  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into… (typhoonzero, Jan 16, 2018)
bbff57e  update docker file (typhoonzero, Jan 16, 2018)
9ad149a  fix copyright check (typhoonzero, Jan 16, 2018)
311d159  add copyright for newly merged files (typhoonzero, Jan 16, 2018)
a0ac133  update job (typhoonzero, Jan 16, 2018)
b315a40  update (typhoonzero, Jan 16, 2018)
9f50195  update using cifar10 (typhoonzero, Jan 19, 2018)
820ee78  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into… (typhoonzero, Jan 19, 2018)
541b42e  fix style (typhoonzero, Jan 19, 2018)
d3905fb  add fluid vgg16 dist test (typhoonzero, Jan 19, 2018)
cb34f6a  update fluid vgg16 and add readme (typhoonzero, Jan 22, 2018)
b38452d  fix styles (typhoonzero, Jan 22, 2018)
08b529a  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into… (typhoonzero, Jan 22, 2018)
900e911  fix style check (typhoonzero, Jan 22, 2018)
438d2ab  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into… (typhoonzero, Jan 22, 2018)
a28fd4e  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into… (typhoonzero, Jan 23, 2018)
da3b14b  Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho… (typhoonzero, Jan 23, 2018)
70142ae  update dist benchmark to one image (typhoonzero, Jan 23, 2018)
7aed1c1  Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho… (typhoonzero, Jan 23, 2018)
bd64719  update for today (typhoonzero, Jan 29, 2018)
419e4c4  modify some (gongweibao, Jan 31, 2018)
38b8b7f  add results (gongweibao, Jan 31, 2018)
cfbbb98  clean code (gongweibao, Jan 31, 2018)
f9db562  update results (typhoonzero, Jan 31, 2018)
8d9c3fc  Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into… (typhoonzero, Jan 31, 2018)
d6edfd0  update points (typhoonzero, Feb 1, 2018)
355ecaf  fix style check (typhoonzero, Feb 1, 2018)
b7fbb91  follow comments (typhoonzero, Feb 1, 2018)
c98b40e  clean code (gongweibao, Feb 1, 2018)
5530212  add others (gongweibao, Feb 1, 2018)
ccef94a  add comments (gongweibao, Feb 1, 2018)
00b9aed  fix typo (gongweibao, Feb 1, 2018)
747df80  Merge pull request #3 from gongweibao/wuyi7539_3 (typhoonzero, Feb 1, 2018)
7c2d32b  update dockerfile (typhoonzero, Feb 1, 2018)
978396e  Merge branch 'dist_train_benchmark_vgg16' of https://github.com/typho… (typhoonzero, Feb 1, 2018)
52df85f  fix style (typhoonzero, Feb 1, 2018)
0bbd7bc  follow comments (typhoonzero, Feb 2, 2018)
a5acad1  update docs (typhoonzero, Feb 2, 2018)
15 changes: 15 additions & 0 deletions benchmark/cluster/vgg16/Dockerfile
@@ -0,0 +1,15 @@
#FROM paddlepaddle/paddlecloud-job
#RUN mkdir -p /workspace
#ADD reader.py /workspace/
#RUN python /workspace/reader.py
FROM python:2.7.14
Review comment (Contributor): Since this is for testing, it would be better not to use this image and use paddle:dev instead:

  • no other dependencies need to be installed;
  • when debugging, you can enter the container and inspect the system state with various commands.

ADD paddle_k8s /usr/bin
ADD k8s_tools.py /root
RUN pip install -U kubernetes opencv-python && apt-get update -y && apt-get install -y iputils-ping libgtk2.0-dev
ADD *.whl /
RUN pip install /*.whl && rm -f /*.whl
ENV LD_LIBRARY_PATH=/usr/local/lib
ADD reader.py /workspace/
RUN python /workspace/reader.py
Review comment (Contributor): This download usually fails, so a hint should be added telling users to use a proxy.

ADD vgg16_fluid.py vgg16_v2.py /workspace/
58 changes: 58 additions & 0 deletions benchmark/cluster/vgg16/README.md
@@ -0,0 +1,58 @@
# Performance for distributed vgg16

## Test Result

### Single node single thread

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | 16.74 | - |
| PaddlePaddle v2 | - | - | 17.60 | - |
| TensorFlow | - | - | - | - |

### different batch size
Review comment (Contributor): different batch size => Different Batch Size


- PServer Count: 10
- Trainer Count: 20
- Metrics: samples / sec

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | 247.40 | - | - |
Review comment (@helinwang, Contributor, Jan 24, 2018): It seems fluid's performance is 247.40/64 = 3.866 batches per second, and v2's performance is 256.14/128 = 2.001 batches per second. The difference seems huge, do you have an idea why? (Also, could you please check whether my math is correct?)

Reply (Contributor, Author): Sorry, wrong column. I'll update this PR with the full test result.

| PaddlePaddle v2 | - | - | 256.14 | - |
| TensorFlow | - | - | - | - |
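
The throughput conversion done by the reviewer above (samples/sec divided by batch size gives mini-batches/sec) can be checked with a short sketch. This is a Python 3 illustration added for this writeup, not part of the PR:

```python
def batches_per_sec(samples_per_sec, batch_size):
    """Convert a samples/sec throughput figure to mini-batches/sec."""
    return samples_per_sec / batch_size

# Figures quoted in the review thread: Fluid at batch size 64,
# v2 at batch size 128.
fluid = batches_per_sec(247.40, 64)
v2 = batches_per_sec(256.14, 128)

print(round(fluid, 3))  # → 3.866
print(round(v2, 3))     # → 2.001
```

As the reply notes, the numbers came from the wrong column, so the ratio itself should not be read as a Fluid-vs-v2 comparison.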

### different pserver number
Review comment (Contributor): different pserver number => Different PServer Count

- Trainer Count: 100
- Batch Size: 64
- Metrics: mini-batch / sec

| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

### Acceleration rate

| Trainer Count | 20 | 40 | 80 | 100 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |


## Steps to run the performance test

1. You must re-compile PaddlePaddle and enable `-DWITH_DISTRIBUTE` to build PaddlePaddle with distributed support.
1. When the build finishes, copy the output `whl` package located under `build/python/dist` to current directory.
1. Run `docker build -t [image:tag] .` to build the docker image, then run `docker push [image:tag]` to push the image to a registry that your Kubernetes cluster can pull from.
1. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your Kubernetes cluster (you must configure the `kubectl` client before this step).
1. Run `kubectl get po` to list the running pods, and run `kubectl logs [podID]` to fetch the logs of the pserver and trainer pods.

Check the logs for the distributed training progress and analyze the performance.
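
The last step lists pods with `kubectl get po` and then fetches logs by pod name. Extracting the pod names from that tabular output can be sketched with a small helper (a hypothetical Python 3 snippet for illustration, not part of this PR):

```python
def pod_names(kubectl_get_po_output):
    """Extract pod names from the tabular output of `kubectl get po`.

    The first line is the NAME/READY/STATUS/... header; the first
    column of every following non-empty line is the pod name.
    """
    lines = kubectl_get_po_output.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.split()]

# Example output shape (pod names here are made up):
sample = """\
NAME                     READY   STATUS    RESTARTS   AGE
vgg16job-pserver-abc12   1/1     Running   0          5m
vgg16job-trainer-xyz89   1/1     Running   0          4m
"""
print(pod_names(sample))  # → ['vgg16job-pserver-abc12', 'vgg16job-trainer-xyz89']
```

Each returned name can then be passed to `kubectl logs [podID]` as described above.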

## Enable verbose logs

Edit `pserver.yaml` and `trainer.yaml` and add an environment variable `GLOG_v=3` to see what happened in detail.
Review comment (Contributor): I'm not sure whether we need to add GLOG_logtostderr=1; if you have tested it, please ignore this comment.

72 changes: 72 additions & 0 deletions benchmark/cluster/vgg16/fluid_pserver.yaml
@@ -0,0 +1,72 @@
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: vgg16job-pserver
spec:
  replicas: 10
  template:
    metadata:
      labels:
        paddle-job-pserver: vgg16job
    spec:
      hostNetwork: true
      imagePullSecrets:
      - name: job-registry-secret
      containers:
      - name: pserver
        image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
        imagePullPolicy: Always
        ports:
        - name: jobport-30236
          containerPort: 30236
        env:
        - name: PADDLE_JOB_NAME
          value: vgg16job
        - name: MKL_NUM_THREADS
          value: "1"
        - name: TRAINING_ROLE
          value: "PSERVER"
        - name: TRAINERS
          value: "20"
        - name: PSERVERS
          value: "10"
        - name: TOPOLOGY
          value: ""
        - name: ENTRY
          value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0"
        - name: TRAINER_PACKAGE
          value: "/workspace"
        - name: PADDLE_INIT_PORT
          value: "30236"
        - name: PADDLE_INIT_NICS
          value: "xgbe0"
        - name: PADDLE_INIT_TRAINER_COUNT
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
          value: "1"
        - name: PADDLE_INIT_NUM_GRADIENT_SERVERS
          value: "20"
        - name: PADDLE_INIT_NUM_PASSES
          value: "1"
        - name: PADDLE_INIT_USE_GPU
          value: "0"
        - name: LD_LIBRARY_PATH
          value: "/usr/local/lib:/usr/local/nvidia/lib64"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: "metadata.namespace"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: "status.podIP"
        command: ["paddle_k8s", "start_fluid"]
        resources:
          requests:
            memory: 10Gi
            cpu: 4
          limits:
            memory: 10Gi
            cpu: 4
69 changes: 69 additions & 0 deletions benchmark/cluster/vgg16/fluid_trainer.yaml
@@ -0,0 +1,69 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: vgg16job-trainer
spec:
  parallelism: 20
  completions: 20
  template:
    metadata:
      labels:
        paddle-job: vgg16job
    spec:
      imagePullSecrets:
      - name: job-registry-secret
      hostNetwork: true
      containers:
      - name: trainer
        image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
        imagePullPolicy: Always
        command: ["paddle_k8s", "start_fluid"]
        env:
        - name: PADDLE_JOB_NAME
          value: vgg16job
        - name: TRAINING_ROLE
          value: "TRAINER"
        - name: TRAINERS
          value: "20"
        - name: PSERVERS
          value: "10"
        - name: TOPOLOGY
          value: ""
        - name: ENTRY
          value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0 --batch_size 128"
        - name: TRAINER_PACKAGE
          value: "/workspace"
        - name: PADDLE_INIT_PORT
          value: "30236"
        - name: PADDLE_INIT_NICS
          value: "xgbe0"
        - name: PADDLE_INIT_TRAINER_COUNT
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
          value: "1"
        - name: PADDLE_INIT_NUM_GRADIENT_SERVERS
          value: "20"
        - name: PADDLE_INIT_NUM_PASSES
          value: "1"
        - name: PADDLE_INIT_USE_GPU
          value: "0"
        - name: LD_LIBRARY_PATH
          value: "/usr/local/lib:/usr/local/nvidia/lib64"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: "metadata.namespace"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: "status.podIP"
        resources:
          requests:
            memory: 40Gi
            cpu: 2
          limits:
            memory: 40Gi
            cpu: 2
      restartPolicy: Never
94 changes: 94 additions & 0 deletions benchmark/cluster/vgg16/k8s_tools.py
@@ -0,0 +1,94 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/env python
import os
import sys
import time
import socket
from kubernetes import client, config

PADDLE_JOB_NAME = os.getenv("PADDLE_JOB_NAME")
NAMESPACE = os.getenv("NAMESPACE")
PORT = os.getenv("PSERVER_PORT")
if os.getenv("KUBERNETES_SERVICE_HOST", None):
    config.load_incluster_config()
else:
    config.load_kube_config()
v1 = client.CoreV1Api()


def fetch_pods_info(label_selector):
    api_response = v1.list_namespaced_pod(
        namespace=NAMESPACE, pretty=True, label_selector=label_selector)
    pod_list = []
    for item in api_response.items:
        pod_list.append((item.status.phase, item.status.pod_ip))
    return pod_list


def wait_pods_running(label_selector, desired):
    print "label selector: %s, desired: %s" % (label_selector, desired)
    while True:
        count = count_pods_by_phase(label_selector, 'Running')
        # NOTE: pods may be scaled.
        if count >= int(desired):
            break
        print 'current cnt: %d sleep for 5 seconds...' % count
        time.sleep(5)


def count_pods_by_phase(label_selector, phase):
    pod_list = fetch_pods_info(label_selector)
    filtered_pod_list = filter(lambda x: x[0] == phase, pod_list)
    return len(filtered_pod_list)


def fetch_pserver_ips():
    label_selector = "paddle-job-pserver=%s" % PADDLE_JOB_NAME
    pod_list = fetch_pods_info(label_selector)
    pserver_ips = [item[1] for item in pod_list]
    return ",".join(pserver_ips)


def fetch_master_ip():
    label_selector = "paddle-job-master=%s" % PADDLE_JOB_NAME
    pod_list = fetch_pods_info(label_selector)
    master_ips = [item[1] for item in pod_list]
    return master_ips[0]


def fetch_trainer_id():
    label_selector = "paddle-job=%s" % PADDLE_JOB_NAME
    pod_list = fetch_pods_info(label_selector)
    trainer_ips = [item[1] for item in pod_list]
    trainer_ips.sort()
    local_ip = socket.gethostbyname(socket.gethostname())
    for i in xrange(len(trainer_ips)):
        if trainer_ips[i] == local_ip:
            return i
    return None


if __name__ == "__main__":
    command = sys.argv[1]
    if command == "fetch_pserver_ips":
        print fetch_pserver_ips()
    elif command == "fetch_trainer_id":
        print fetch_trainer_id()
    elif command == "fetch_master_ip":
        print fetch_master_ip()
    elif command == "count_pods_by_phase":
        print count_pods_by_phase(sys.argv[2], sys.argv[3])
    elif command == "wait_pods_running":
        wait_pods_running(sys.argv[2], sys.argv[3])
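
The rank-assignment idea in `fetch_trainer_id` above (each trainer sorts all trainer pod IPs and takes the index of its own IP, so every pod derives a unique, stable rank without any coordination) can be demonstrated in isolation. A Python 3 sketch with made-up IPs, not part of the PR; note that, like the original, it sorts IPs lexicographically as strings:

```python
def trainer_id(trainer_ips, local_ip):
    """Rank assignment as in fetch_trainer_id: sort the trainer pod IPs
    (as strings) and return the index of this pod's IP, or None if the
    local IP is not in the list."""
    for i, ip in enumerate(sorted(trainer_ips)):
        if ip == local_ip:
            return i
    return None

# Hypothetical pod IPs; every trainer computes the same sorted order,
# so the ranks are consistent across pods.
ips = ["10.1.0.7", "10.1.0.3", "10.1.0.12"]
print(trainer_id(ips, "10.1.0.7"))  # → 2 ("10.1.0.12" < "10.1.0.3" < "10.1.0.7" as strings)
```

Because the sort is lexicographic rather than numeric, "10.1.0.12" orders before "10.1.0.3"; that is harmless here since only uniqueness and consistency of the ranks matter.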