Add demo about Click-Through-Rate distributed training with PaddlePad… #434
type: ""
name: seqdata
containers:
- image: sivanzcw/edlctr:v1
Let's move this image to volcanosh.
Already replaced with the volcanosh image.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: k82cn, sivanzcw
The full list of commands accepted by this bot can be found here. The pull request process is described here.
- name: TRAINER_PACKAGE
  value: /workspace
- name: PADDLE_INIT_NICS
  value: eth2
I am not familiar with PaddlePaddle. What is this variable for?
PADDLE_INIT_NICS is used to pass the nics parameter, which specifies the network card, to the paddle pserver or paddle train command, as here:
start_pserver() {
    stdbuf -oL paddle pserver \
        --use_gpu=0 \
        --port="$PADDLE_INIT_PORT" \
        --ports_num="$PADDLE_INIT_PORTS_NUM" \
        --ports_num_for_sparse="$PADDLE_INIT_PORTS_NUM_FOR_SPARSE" \
        --nics="$PADDLE_INIT_NICS" \
        --comment=paddle_process_k8s \
        --num_gradient_servers="$PADDLE_INIT_NUM_GRADIENT_SERVERS"
}
The default value is taken from the corresponding configuration in the PaddlePaddle EDL demo.
What if eth2 does not exist in a container?
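One way to make a launcher tolerant of a missing interface is to emit the flag only when the interface is present. The snippet below is a minimal sketch, not part of this PR: the helper name `nics_flag` and its optional second argument are hypothetical, and it assumes a Linux container where each interface appears as an entry under `/sys/class/net`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the demo): print "--nics=<name>" only when
# the named interface exists, so a launcher can omit the flag otherwise.
# Assumes Linux, where each interface has an entry under /sys/class/net.
nics_flag() {
    local nic="$1"
    local sysfs_root="${2:-/sys/class/net}"   # overridable, e.g. for testing
    if [ -n "$nic" ] && [ -e "$sysfs_root/$nic" ]; then
        printf '%s\n' "--nics=$nic"
    fi
}
```

A launcher could then call `paddle pserver $(nics_flag "$PADDLE_INIT_NICS") ...` with the expansion left unquoted, so that an empty result adds no argument at all.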
After testing, for this demo, setting a non-existent NIC for PaddlePaddle does not affect the training process. Based on the original Baidu demo, this parameter has been removed. In the demo's implementation, each pod identifies the server and trainer components through the preset PServer pod label and Trainer pod label, and so obtains the IP lists of the server and trainer components. When the server IP list is obtained, the system also assigns a port to each server. After a server gets its assigned port, it starts listening on that port and provides services. In the end, every server or trainer component in the computing cluster knows the full communication address of the other components in the cluster, so the components can communicate directly without using the network card configuration.
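The discovery flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the demo's actual code: the function names, the `PSERVER_SELECTOR` label, and the single shared `PSERVER_PORT` are hypothetical, and the demo may assign ports differently. Only `kubectl get pods -l ... -o jsonpath=...` is a standard kubectl invocation.

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration; the real label and port come from the job spec.
PSERVER_SELECTOR="${PSERVER_SELECTOR:-role=pserver}"
PSERVER_PORT="${PSERVER_PORT:-7164}"

# Print the IPs of all pods matching a label selector, space-separated.
# Requires kubectl and permission to list pods in the current namespace.
pod_ips_by_label() {
    kubectl get pods -l "$1" -o jsonpath='{.items[*].status.podIP}'
}

# Join a list of IPs into a comma-separated "ip:port" endpoint list.
# Assumption: every server listens on the same port.
build_endpoints() {
    local port="$1"; shift
    local out="" ip
    for ip in "$@"; do
        out="${out:+$out,}${ip}:${port}"
    done
    printf '%s\n' "$out"
}
```

For example, `build_endpoints "$PSERVER_PORT" $(pod_ips_by_label "$PSERVER_SELECTOR")` would produce the endpoint list that each trainer needs. Because every address is explicit, nothing in this path depends on the interface name.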
adf9388 to 9e93105
9e93105 to 3d10085
/lgtm
Add a demo of distributed training with PaddlePaddle on Volcano. The source demo is taken from https://github.com/PaddlePaddle/edl/tree/develop/example/ctr