Add v2 dist benchmark vgg #7539
Dockerfile
@@ -0,0 +1,15 @@
#FROM paddlepaddle/paddlecloud-job
#RUN mkdir -p /workspace
#ADD reader.py /workspace/
#RUN python /workspace/reader.py
FROM python:2.7.14
ADD paddle_k8s /usr/bin
ADD k8s_tools.py /root
RUN pip install -U kubernetes opencv-python && apt-get update -y && apt-get install -y iputils-ping libgtk2.0-dev
ADD *.whl /
RUN pip install /*.whl && rm -f /*.whl
ENV LD_LIBRARY_PATH=/usr/local/lib
ADD reader.py /workspace/
RUN python /workspace/reader.py
Review comment (on `RUN python /workspace/reader.py`): This download basically never succeeds, so a note should be added telling users to go through a proxy.
ADD vgg16_fluid.py vgg16_v2.py /workspace/
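
`reader.py` is added and run at image build time but is not included in this diff; it pre-downloads the training data into the image, and that download is what the review comment above is about. A minimal sketch of what such a pre-download script could look like, assuming the benchmark reads CIFAR-10 via `paddle.v2.dataset` (the dataset actually used by `vgg16_fluid.py`/`vgg16_v2.py` is not shown in this PR):

```python
# Hypothetical reader.py sketch -- the real script is not part of this diff.
# Pre-fetch the dataset at `docker build` time so trainer pods need no
# network access; if the download keeps failing, export http_proxy /
# https_proxy in the build environment, as the review comment suggests.
import paddle.v2.dataset.cifar as cifar

if __name__ == "__main__":
    # Constructing the readers triggers the download into
    # ~/.cache/paddle/dataset inside the image being built.
    cifar.train10()
    cifar.test10()
```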
README.md
@@ -0,0 +1,58 @@
# Performance for distributed vgg16

## Test Result

### Single node, single thread

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | 16.74 | - |
| PaddlePaddle v2 | - | - | 17.60 | - |
| TensorFlow | - | - | - | - |

### Different batch sizes

- PServer Count: 10
- Trainer Count: 20
- Metrics: samples / sec

| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | 247.40 | - | - |
| PaddlePaddle v2 | - | - | 256.14 | - |
| TensorFlow | - | - | - | - |

Review comment (on the PaddlePaddle Fluid row): It seems Fluid's performance is …
Reply: Sorry, wrong column. I'll update this PR with the full test result.

### Different PServer count

Review comment: Different PServer Count

- Trainer Count: 100
- Batch Size: 64
- Metrics: mini-batch / sec

| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
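
Note that this table uses mini-batch / sec while the batch-size table above uses samples / sec; the two units differ only by the batch size. A tiny conversion helper (not part of this PR, purely illustrative):

```python
# Throughput unit conversion: samples/sec = mini-batches/sec * batch_size.
def minibatches_to_samples_per_sec(minibatches_per_sec, batch_size):
    return minibatches_per_sec * batch_size


def samples_to_minibatches_per_sec(samples_per_sec, batch_size):
    return float(samples_per_sec) / batch_size


# Example: 247.40 samples/sec at batch size 64 is about 3.87 mini-batches/sec.
print samples_to_minibatches_per_sec(247.40, 64)
```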

### Acceleration rate

| Trainer Count | 20 | 40 | 80 | 100 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

## Steps to run the performance test

1. Re-compile PaddlePaddle with `-DWITH_DISTRIBUTE` enabled to build it with distributed support.
1. When the build finishes, copy the output `whl` package located under `build/python/dist` to the current directory.
1. Run `docker build -t [image:tag] .` to build the Docker image, then run `docker push [image:tag]` to push the image to a registry so Kubernetes can find it.
1. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your Kubernetes cluster (you must configure the `kubectl` client before this step).
1. Run `kubectl get po` to list the running pods, and run `kubectl logs [podID]` to fetch the logs of pservers and trainers.

Check the logs for the distributed training progress and analyze the performance.

## Enable verbose logs

Edit `pserver.yaml` and `trainer.yaml` and add an environment variable `GLOG_v=3` to see what happened in detail.
Review comment: I'm not sure whether we need to add …
pserver.yaml
@@ -0,0 +1,72 @@
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: vgg16job-pserver
spec:
  replicas: 10
  template:
    metadata:
      labels:
        paddle-job-pserver: vgg16job
    spec:
      hostNetwork: true
      imagePullSecrets:
      - name: job-registry-secret
      containers:
      - name: pserver
        image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
        imagePullPolicy: Always
        ports:
        - name: jobport-30236
          containerPort: 30236
        env:
        - name: PADDLE_JOB_NAME
          value: vgg16job
        - name: MKL_NUM_THREADS
          value: "1"
        - name: TRAINING_ROLE
          value: "PSERVER"
        - name: TRAINERS
          value: "20"
        - name: PSERVERS
          value: "10"
        - name: TOPOLOGY
          value: ""
        - name: ENTRY
          value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0"
        - name: TRAINER_PACKAGE
          value: "/workspace"
        - name: PADDLE_INIT_PORT
          value: "30236"
        - name: PADDLE_INIT_NICS
          value: "xgbe0"
        - name: PADDLE_INIT_TRAINER_COUNT
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
          value: "1"
        - name: PADDLE_INIT_NUM_GRADIENT_SERVERS
          value: "20"
        - name: PADDLE_INIT_NUM_PASSES
          value: "1"
        - name: PADDLE_INIT_USE_GPU
          value: "0"
        - name: LD_LIBRARY_PATH
          value: "/usr/local/lib:/usr/local/nvidia/lib64"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: "metadata.namespace"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: "status.podIP"
        command: ["paddle_k8s", "start_fluid"]
        resources:
          requests:
            memory: 10Gi
            cpu: 4
          limits:
            memory: 10Gi
            cpu: 4
trainer.yaml
@@ -0,0 +1,69 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: vgg16job-trainer
spec:
  parallelism: 20
  completions: 20
  template:
    metadata:
      labels:
        paddle-job: vgg16job
    spec:
      imagePullSecrets:
      - name: job-registry-secret
      hostNetwork: true
      containers:
      - name: trainer
        image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
        imagePullPolicy: Always
        command: ["paddle_k8s", "start_fluid"]
        env:
        - name: PADDLE_JOB_NAME
          value: vgg16job
        - name: TRAINING_ROLE
          value: "TRAINER"
        - name: TRAINERS
          value: "20"
        - name: PSERVERS
          value: "10"
        - name: TOPOLOGY
          value: ""
        - name: ENTRY
          value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0 --batch_size 128"
        - name: TRAINER_PACKAGE
          value: "/workspace"
        - name: PADDLE_INIT_PORT
          value: "30236"
        - name: PADDLE_INIT_NICS
          value: "xgbe0"
        - name: PADDLE_INIT_TRAINER_COUNT
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM
          value: "1"
        - name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
          value: "1"
        - name: PADDLE_INIT_NUM_GRADIENT_SERVERS
          value: "20"
        - name: PADDLE_INIT_NUM_PASSES
          value: "1"
        - name: PADDLE_INIT_USE_GPU
          value: "0"
        - name: LD_LIBRARY_PATH
          value: "/usr/local/lib:/usr/local/nvidia/lib64"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: "metadata.namespace"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: "status.podIP"
        resources:
          requests:
            memory: 40Gi
            cpu: 2
          limits:
            memory: 40Gi
            cpu: 2
      restartPolicy: Never
k8s_tools.py
@@ -0,0 +1,94 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/env python
import os
import sys
import time
import socket
from kubernetes import client, config

PADDLE_JOB_NAME = os.getenv("PADDLE_JOB_NAME")
NAMESPACE = os.getenv("NAMESPACE")
PORT = os.getenv("PSERVER_PORT")
if os.getenv("KUBERNETES_SERVICE_HOST", None):
    config.load_incluster_config()
else:
    config.load_kube_config()
v1 = client.CoreV1Api()


def fetch_pods_info(label_selector):
    api_response = v1.list_namespaced_pod(
        namespace=NAMESPACE, pretty=True, label_selector=label_selector)
    pod_list = []
    for item in api_response.items:
        pod_list.append((item.status.phase, item.status.pod_ip))
    return pod_list


def wait_pods_running(label_selector, desired):
    print "label selector: %s, desired: %s" % (label_selector, desired)
    while True:
        count = count_pods_by_phase(label_selector, 'Running')
        # NOTE: pods may be scaled.
        if count >= int(desired):
            break
        print 'current cnt: %d sleep for 5 seconds...' % count
        time.sleep(5)


def count_pods_by_phase(label_selector, phase):
    pod_list = fetch_pods_info(label_selector)
    filtered_pod_list = filter(lambda x: x[0] == phase, pod_list)
    return len(filtered_pod_list)


def fetch_pserver_ips():
    label_selector = "paddle-job-pserver=%s" % PADDLE_JOB_NAME
    pod_list = fetch_pods_info(label_selector)
    pserver_ips = [item[1] for item in pod_list]
    return ",".join(pserver_ips)


def fetch_master_ip():
    label_selector = "paddle-job-master=%s" % PADDLE_JOB_NAME
    pod_list = fetch_pods_info(label_selector)
    master_ips = [item[1] for item in pod_list]
    return master_ips[0]


def fetch_trainer_id():
    label_selector = "paddle-job=%s" % PADDLE_JOB_NAME
    pod_list = fetch_pods_info(label_selector)
    trainer_ips = [item[1] for item in pod_list]
    trainer_ips.sort()
    local_ip = socket.gethostbyname(socket.gethostname())
    for i in xrange(len(trainer_ips)):
        if trainer_ips[i] == local_ip:
            return i
    return None


if __name__ == "__main__":
    command = sys.argv[1]
    if command == "fetch_pserver_ips":
        print fetch_pserver_ips()
    elif command == "fetch_trainer_id":
        print fetch_trainer_id()
    elif command == "fetch_master_ip":
        print fetch_master_ip()
    elif command == "count_pods_by_phase":
        print count_pods_by_phase(sys.argv[2], sys.argv[3])
    elif command == "wait_pods_running":
        wait_pods_running(sys.argv[2], sys.argv[3])
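
These helpers are driven by the `paddle_k8s` launcher that the Dockerfile installs to `/usr/bin` and that both YAML files invoke via `command: ["paddle_k8s", "start_fluid"]`; the launcher itself is not part of this diff. A rough sketch, under that assumption, of how a launcher could use `k8s_tools.py` to wait for the pservers and work out this pod's settings before running the `ENTRY` command; the `ip:port` endpoint format and the exact environment variables read here are illustrative, not taken from this PR:

```python
# Hypothetical use of k8s_tools.py by a launcher script (paddle_k8s itself is
# not shown in this diff). Must run inside a pod of the job, with the same
# environment variables set as in pserver.yaml / trainer.yaml.
import os
from k8s_tools import wait_pods_running, fetch_pserver_ips, fetch_trainer_id

job_name = os.getenv("PADDLE_JOB_NAME")        # e.g. "vgg16job"
pserver_count = os.getenv("PSERVERS", "1")     # e.g. "10"
port = os.getenv("PADDLE_INIT_PORT", "30236")

# Block until every pserver pod of this job reports phase == Running.
wait_pods_running("paddle-job-pserver=%s" % job_name, pserver_count)

# Comma-separated "ip:port" list for the trainers to connect to.
endpoints = ",".join("%s:%s" % (ip, port)
                     for ip in fetch_pserver_ips().split(","))

# Rank of this pod among the trainer pods (None if this pod is not a trainer).
trainer_id = fetch_trainer_id()

print "pserver endpoints:", endpoints
print "trainer id:", trainer_id
```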
Review comment: Since this is for testing, I think it would be better not to use this image and to use paddle:dev instead.