
Add demo about Click-Through-Rate distributed training with PaddlePad… #434

Merged: 1 commit, Sep 5, 2019

Conversation

sivanzcw
Contributor

@sivanzcw sivanzcw commented Sep 4, 2019

Add a demo about distributed training with PaddlePaddle on Volcano; the source demo is taken from https://github.com/PaddlePaddle/edl/tree/develop/example/ctr

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 4, 2019
@sivanzcw sivanzcw changed the title add demo about Click-Throuth-Rate distributed training with PaddlePad… Add demo about Click-Throuth-Rate distributed training with PaddlePad… Sep 4, 2019
type: ""
name: seqdata
containers:
- image: sivanzcw/edlctr:v1
Member

let's move this image to volcanosh

Contributor Author

Already replaced with the volcanosh image

@k82cn
Member

k82cn commented Sep 4, 2019

/approve

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, sivanzcw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2019
@hzxuzhonghu hzxuzhonghu changed the title Add demo about Click-Throuth-Rate distributed training with PaddlePad… Add demo about Click-Through-Rate distributed training with PaddlePad… Sep 5, 2019
- name: TRAINER_PACKAGE
value: /workspace
- name: PADDLE_INIT_NICS
value: eth2
Collaborator

I am not familiar with PaddlePaddle, but what's this?

Contributor Author

PADDLE_INIT_NICS is used to pass the --nics parameter, which specifies the network interface card, to the paddle pserver or paddle train command here:

start_pserver() {
  stdbuf -oL paddle pserver \
    --use_gpu=0 \
    --port=$PADDLE_INIT_PORT \
    --ports_num=$PADDLE_INIT_PORTS_NUM \
    --ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
    --nics=$PADDLE_INIT_NICS \
    --comment=paddle_process_k8s \
    --num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
}

The default value is taken from the relevant configuration in the PaddlePaddle EDL demo here
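As a concrete sketch of how the container env feeds the script above, the variables expand into a single command line. All values below are illustrative assumptions, not the demo's exact defaults:

```shell
# Sketch: expanding the env vars into the pserver command line.
# Every value here is an assumption chosen for illustration.
PADDLE_INIT_PORT=30236
PADDLE_INIT_PORTS_NUM=1
PADDLE_INIT_PORTS_NUM_FOR_SPARSE=1
PADDLE_INIT_NICS=eth2
PADDLE_INIT_NUM_GRADIENT_SERVERS=2

# Backslash-newline inside double quotes is a POSIX line continuation,
# so CMD ends up as one flat command string.
CMD="paddle pserver --use_gpu=0 --port=$PADDLE_INIT_PORT \
--ports_num=$PADDLE_INIT_PORTS_NUM \
--ports_num_for_sparse=$PADDLE_INIT_PORTS_NUM_FOR_SPARSE \
--nics=$PADDLE_INIT_NICS --comment=paddle_process_k8s \
--num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS"
echo "$CMD"
```

Changing the env entry in the pod spec (as in the YAML fragment above) is all it takes to change the NIC the pserver binds to.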

Collaborator

What if eth2 does not exist in a container?

Contributor Author

After testing, for this demo, setting a non-existent NIC for PaddlePaddle does not affect the training process, so this parameter, carried over from the original Baidu demo, was removed. During the implementation of the demo, each pod filters out the pserver and trainer components through the preset PServer and Trainer pod labels, and thereby obtains the IP lists of both components. When the pserver IP list is obtained, the system also assigns a port to each pserver; once a pserver receives its assigned port, it starts listening on that port and provides services. In the end, every pserver and trainer component in the computing cluster knows the exact communication addresses of the other components, so the components can communicate directly without relying on the network card configuration.
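The discovery step described above can be sketched in shell. The IPs and port below are hard-coded assumptions so the sketch runs stand-alone; in the demo they would come from the Kubernetes API by matching the preset PServer label:

```shell
# Sketch: building the pserver endpoint list from discovered pod IPs.
# IPs and port are assumptions; in the demo they are discovered via
# the PServer pod label rather than hard-coded.
PSERVER_IPS="10.0.0.5 10.0.0.6"
PSERVER_PORT=30236   # port assigned alongside the IP list

ENDPOINTS=""
for ip in $PSERVER_IPS; do
  # Append "ip:port", comma-separated after the first entry.
  ENDPOINTS="${ENDPOINTS:+$ENDPOINTS,}$ip:$PSERVER_PORT"
done

# Each trainer receives the full endpoint list, so components address
# one another directly and no NIC configuration is needed.
echo "$ENDPOINTS"   # prints 10.0.0.5:30236,10.0.0.6:30236
```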

@hzxuzhonghu
Collaborator

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 5, 2019
@volcano-sh-bot volcano-sh-bot merged commit 18750d1 into volcano-sh:master Sep 5, 2019