Add demo about Click-Through-Rate distributed training with PaddlePad… #434

sivanzcw · 2019-09-04T12:25:42Z

Add a demo of distributed training with PaddlePaddle on Volcano. The source demo is taken from https://github.com/PaddlePaddle/edl/tree/develop/example/ctr

k82cn · 2019-09-04T13:11:14Z

example/integrations/paddlepaddle/ctr-paddlepaddle-on-volcano.yaml

+            type: ""
+          name: seqdata
+        containers:
+        - image: sivanzcw/edlctr:v1


Let's move this image to volcanosh.

Already replaced with the volcanosh image.

k82cn · 2019-09-04T13:11:58Z

/approve

volcano-sh-bot · 2019-09-04T13:12:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, sivanzcw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [k82cn]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hzxuzhonghu · 2019-09-05T01:54:28Z

example/integrations/paddlepaddle/ctr-paddlepaddle-on-volcano.yaml

+          - name: TRAINER_PACKAGE
+            value: /workspace
+          - name: PADDLE_INIT_NICS
+            value: eth2


I am not familiar with paddlepaddle, but what's this?

PADDLE_INIT_NICS passes the nics parameter, which specifies the network card, to the paddle pserver or paddle train command, as in this script:

start_pserver() {
    stdbuf -oL paddle pserver \
        --use_gpu=0 \
        --port=$PADDLE_INIT_PORT \
        --ports_num=$PADDLE_INIT_PORTS_NUM \
        --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
        --nics=$PADDLE_INIT_NICS \
        --comment=paddle_process_k8s \
        --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
}

The default value is taken from the relevant configuration in the PaddlePaddle EDL demo here

What if eth2 does not exist in a container?

After testing, for this demo, setting a non-existent NIC for PaddlePaddle does not affect the training process. Based on the original Baidu demo, this parameter was removed. In the demo's implementation, each pod selects the server and trainer components by the preset PServer pod label and Trainer pod label, and so obtains the IP lists of the server and trainer components. When the server IP list is obtained, the system also assigns a port to each server. Once a server gets its assigned port, it starts listening on that port and provides services. In the end, each server or trainer component in the computing cluster knows the full communication address of the other components in the cluster, so the components can communicate directly without relying on the network card configuration.
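To make the discovery step concrete, here is a minimal sketch of how a component might build the endpoint list from labeled pods. It is not the demo's actual code: the label key `paddle-role`, the base port `30000`, the sequential port assignment, and the helper `endpoints_for_role` are all hypothetical, and the pod list is plain data rather than the result of a real Kubernetes API call.

```python
from typing import Dict, List

# Hypothetical label key and base port; the real demo's values may differ.
ROLE_LABEL = "paddle-role"
BASE_PORT = 30000


def endpoints_for_role(pods: List[Dict], role: str,
                       base_port: int = BASE_PORT) -> List[str]:
    """Return "ip:port" strings for every pod whose role label matches."""
    # Keep only the pods that carry the requested role label.
    matching = [p for p in pods if p.get("labels", {}).get(ROLE_LABEL) == role]
    # Sort by pod name so every component derives the same ordering.
    matching.sort(key=lambda p: p["name"])
    # Assumption: one port per pod, assigned in order starting at base_port.
    return [f"{p['ip']}:{base_port + i}" for i, p in enumerate(matching)]


# Stand-in for the pod list a component would obtain from the cluster.
pods = [
    {"name": "ctr-pserver-1", "ip": "10.0.0.12", "labels": {ROLE_LABEL: "pserver"}},
    {"name": "ctr-trainer-0", "ip": "10.0.0.21", "labels": {ROLE_LABEL: "trainer"}},
    {"name": "ctr-pserver-0", "ip": "10.0.0.11", "labels": {ROLE_LABEL: "pserver"}},
]

pserver_endpoints = endpoints_for_role(pods, "pserver")
print(",".join(pserver_endpoints))
```

Because every component computes the same ordered list from the same labels, each one ends up with explicit ip:port addresses for its peers, which is why the NIC setting does not come into play in this sketch.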


hzxuzhonghu · 2019-09-05T07:09:27Z

/lgtm

volcano-sh-bot requested review from hzxuzhonghu and k82cn September 4, 2019 12:25

volcano-sh-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Sep 4, 2019

sivanzcw changed the title ~~add demo about Click-Throuth-Rate distributed training with PaddlePad…~~ Add demo about Click-Throuth-Rate distributed training with PaddlePad… Sep 4, 2019

k82cn reviewed Sep 4, 2019


volcano-sh-bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Sep 4, 2019

hzxuzhonghu changed the title ~~Add demo about Click-Throuth-Rate distributed training with PaddlePad…~~ Add demo about Click-Through-Rate distributed training with PaddlePad… Sep 5, 2019

hzxuzhonghu reviewed Sep 5, 2019


sivanzcw force-pushed the paddlepaddle branch from adf9388 to 9e93105 Compare September 5, 2019 03:09

add demo about Click-Throuth-Rate distributed training with PaddlePad…

3d10085

…dle on Volcano

sivanzcw force-pushed the paddlepaddle branch from 9e93105 to 3d10085 Compare September 5, 2019 06:55

volcano-sh-bot assigned hzxuzhonghu Sep 5, 2019

volcano-sh-bot added the lgtm label (indicates that a PR is ready to be merged) Sep 5, 2019

volcano-sh-bot merged commit 18750d1 into volcano-sh:master Sep 5, 2019
