
Add distributed training overview doc #9937

Conversation

@jacquesqiao (Member) commented Apr 16, 2018

@jacquesqiao changed the title from "Add asynchronous training design doc" to "Add distributed training overview doc" on Apr 16, 2018
@panyx0718 previously approved these changes on Apr 16, 2018
@panyx0718 (Contributor) left a comment:

LG overall, please sync with @Yancey1989.

@@ -0,0 +1,57 @@
## distributed training overview doc
Contributor:

Capitalize the first letter.

Contributor:

I think this file can be renamed to README.md

Member Author:

Done


The training process of synchronous training is:

![lookup table training](./src/sync_distributed_training.png)
Contributor:

Better to use a "cross functional flowchart" to represent both procedure and communications.

@Yancey1989 previously approved these changes on Apr 16, 2018
@Yancey1989 (Contributor) left a comment:

LGTM with little comments.


The training process of synchronous training is:

![lookup table training](./src/sync_distributed_training.png)
Contributor:

lookup table training => synchronous distributed training.

Member Author:

done

1. Pserver:
1. Each parameter has a queue to receive its gradient from trainers.
1. Each parameter has a thread to read data from the queue and run optimize block, using the gradient to optimize the parameter.
1. Use a independente thread to handle RPC call `GetVariable` for trainers to get parameters back.(Maybe here we should use a thread pool to speed up the parameter fetch.)
Contributor:

Use a independente thread => Using an independent thread

Contributor:

speed up the parameter fetch => speed up fetching the parameters.

Member Author:

done
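
The pserver structure quoted above (a gradient queue and an optimize thread per parameter, plus an independent thread or thread pool serving `GetVariable`) can be pictured with a minimal in-process sketch. This is plain Python for illustration only, not PaddlePaddle code; the names `apply_sgd` and `handle_get_variable` are hypothetical stand-ins.

```python
# Minimal sketch of the pserver structure described in the doc:
# one gradient queue and one optimizer thread per parameter, plus a
# thread pool standing in for the `GetVariable` RPC handler.
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

params = {"param_1": [0.0] * 4, "param_2": [0.0] * 4}    # parameter store
grad_queues = {name: queue.Queue() for name in params}   # one queue per parameter

def apply_sgd(name, lr=0.01):
    """Per-parameter optimize loop: pop a gradient, run one SGD step."""
    while True:
        grad = grad_queues[name].get()                    # blocks until a trainer sends a gradient
        params[name] = [w - lr * g for w, g in zip(params[name], grad)]

def handle_get_variable(name):
    """Hypothetical stand-in for the `GetVariable` RPC trainers call to fetch parameters."""
    return list(params[name])

# One optimizer thread per parameter, as the doc describes.
for name in params:
    threading.Thread(target=apply_sgd, args=(name,), daemon=True).start()

# A thread pool for parameter fetches, as the doc suggests for speeding up GetVariable.
fetch_pool = ThreadPoolExecutor(max_workers=4)

# Toy trainer interaction: send a gradient, then fetch the parameter back.
grad_queues["param_1"].put([1.0, 1.0, 1.0, 1.0])
time.sleep(0.1)   # give the optimizer thread a moment (toy example only)
print(fetch_pool.submit(handle_get_variable, "param_1").result())
```

The per-parameter queue decouples receiving gradients from running the optimize block, so the `GetVariable` path can be served from a separate thread or pool without waiting on optimization.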

The training process of synchronous training is:

![synchronous distributed training](./src/sync_distributed_training.png)

@tonyyang-svail commented:

In the figure, param_1@GRAD.shard_n.trainer_0 => param_1@GRAD.block_n.trainer_0.

Member Author:

fixed, thanks! @tonyyang-svail

@Yancey1989 (Contributor) left a comment:

LGTM again.

@jacquesqiao merged commit 91f0e3f into PaddlePaddle:develop on Apr 17, 2018