Add distributed training overview doc #9937
Conversation
LG overall, please sync with @Yancey1989.
@@ -0,0 +1,57 @@
## distributed training overview doc
Capitalize the first letter.
I think this file can be renamed to README.md
Done
The training process of synchronous training is:

![lookup table training](./src/sync_distributed_training.png)
Better to use a "cross functional flowchart" to represent both procedure and communications.
LGTM with a few minor comments.
The training process of synchronous training is:

![lookup table training](./src/sync_distributed_training.png)
lookup table training
=> synchronous distributed training.
done
1. Pserver:
   1. Each parameter has a queue to receive its gradient from trainers.
   1. Each parameter has a thread to read data from the queue and run optimize block, using the gradient to optimize the parameter.
   1. Use a independente thread to handle RPC call `GetVariable` for trainers to get parameters back.(Maybe here we should use a thread pool to speed up the parameter fetch.)
Use a independente thread => Using an independent thread
speed up the parameter fetch => speed up fetching the parameters.
done
The training process of synchronous training is:

![synchronous distributed training](./src/sync_distributed_training.png)
In the figure, param_1@GRAD.shard_n.trainer_0 => param_1@GRAD.block_n.trainer_0.
fixed, thanks! @tonyyang-svail
LGTM again.
refs: #9932
refs: https://docs.google.com/document/d/1-W3HHC0yNYolGuYL7L0Crxs3Us8rZfGjlSe0o5X7lQA/edit#heading=h.td8lf867c0rd
task list: #9941