Add distributed training overview doc #9937
Conversation
LG overall, please sync with @Yancey1989.
@@ -0,0 +1,57 @@
## distributed training overview doc
Capitalize the first letter.
I think this file can be renamed to README.md
Done
The training process of synchronous training is:

![lookup table training](./src/sync_distributed_training.png)
Better to use a "cross functional flowchart" to represent both procedure and communications.
LGTM with a few minor comments.
The training process of synchronous training is:

![lookup table training](./src/sync_distributed_training.png)
lookup table training
=> synchronous distributed training.
done
1. Pserver:
   1. Each parameter has a queue to receive its gradient from trainers.
   1. Each parameter has a thread to read data from the queue and run optimize block, using the gradient to optimize the parameter.
   1. Use a independente thread to handle RPC call `GetVariable` for trainers to get parameters back.(Maybe here we should use a thread pool to speed up the parameter fetch.)
Use a independente thread => Using an independent thread
speed up the parameter fetch => speed up fetching the parameters.
done
The training process of synchronous training is:

![synchronous distributed training](./src/sync_distributed_training.png)
In the figure, param_1@GRAD.shard_n.trainer_0 => param_1@GRAD.block_n.trainer_0.
fixed, thanks! @tonyyang-svail
LGTM again.
refs: #9932
refs: https://docs.google.com/document/d/1-W3HHC0yNYolGuYL7L0Crxs3Us8rZfGjlSe0o5X7lQA/edit#heading=h.td8lf867c0rd
task list: #9941