Add async update design doc #9932

Merged 5 commits into PaddlePaddle:develop on Apr 17, 2018

Conversation

@Yancey1989 (Contributor) commented Apr 16, 2018

Fixed #9925
Tasks: #9941


In the synchronously distributed training, there should be a `Barrier` to synchronise the
parameters after the optimizing stage. The performance of a distributed training job
depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
Member:

lowest => slowest

Member:

thousand => thousands of

Contributor Author:

Done.

In the synchronously distributed training, there should be a `Barrier` to synchronise the
parameters after the optimizing stage. The performance of a distributed training job
depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
the performance of synchronously distributed training might be very slow because of
Member:

performance slow => performance poor

Contributor Author:

Thanks, done.


### Parameter Server

<img src="./src/async_pserver.png" width="750"/>
Member:

I think we do not need an aggregation stage; we can just read a variable from the queue and apply it to the parameter.

Contributor Author:

Yes, you're right, we won't aggregate the received gradients; I will update it.

Contributor Author:

Done.


1. A Trainer will compute the gradients and SEND them to the Parameter
Server(PServer) nodes.
1. After the PServer node received gradients came from all the Trainers, it would apply the gradient to the respective variables, and using an optimize algorithms(SGD,
Member @jacquesqiao, Apr 16, 2018:

It will aggregate the gradients for the same parameter into one gradient and then apply the aggregated gradient to the respective parameter.

Contributor Author:

Done.
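For illustration only, here is a minimal sketch of the aggregate-then-apply step described in the comment above. It is not code from this PR; `aggregate`, `sgd_update`, and `sync_step` are hypothetical names, and plain SGD is assumed as the optimizer.

```python
import numpy as np

def aggregate(grads):
    # Combine the gradients received for the same parameter into one gradient.
    return np.mean(grads, axis=0)

def sgd_update(param, grad, lr=0.01):
    # Apply one plain SGD step to the parameter.
    return param - lr * grad

def sync_step(w1, grads_from_trainers, lr=0.01):
    # One synchronous step for `w1`: gradients from all trainers are
    # aggregated first, then the aggregated gradient is applied.
    assert len(grads_from_trainers) > 0
    return sgd_update(w1, aggregate(grads_from_trainers), lr)

w1 = np.zeros(3)
w1 = sync_step(w1, [np.ones(3), 3 * np.ones(3)])  # aggregated gradient is 2.0 per element
```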


As the figure above, we describe a global view of asynchronously update process and use
the parameter `w1` as an example to introduce the steps:
1. For each gradient variables, they may distribute on different GPU card and aggregate
Member:

I think we only need to consider the final gradient on one trainer and do not need to consider how it is produced; for example, we do not need to consider whether the gradient is aggregated from multi-threaded or multi-GPU training on that machine.

Contributor Author:

Maybe not; we need to clarify the relationship between multiple devices and multiple machines. For this design doc, training is synchronous inside a trainer and asynchronous between machines.

Member:

Oh, yes, it seems that single-process multi-threaded async training is an independent project.

Contributor Author:

Done.

Contributor @panyx0718 left a comment:


As the figure above, we describe a global view of asynchronously update process and use
the parameter `w1` as an example to introduce the steps:
1. For each gradient variables, they may distribute on different GPU card and aggregate
Contributor:

Do you mean aggregating multiple gradients from GPUs in a single machine? Have you considered sending gradients to the pserver directly, without aggregation? Each GPU is an independent training instance.

Contributor Author:

> Do you mean aggregating multiple gradients from GPUs in a single machine?

Yes. Because communication between GPU cards in the same machine is much faster than between different nodes, aggregating first and then sending may achieve a higher training speed.

But I think "each GPU is an independent training instance" is also a feasible approach; maybe we can implement it in the future and run some experiments to compare the speed and accuracy.

instances and sent them.
1. PServer would run an `Optimize Block` to use a specified optimize algorithm to update
the specified parameter, such as `w1`.
1. The trainer will fetch the latest parameter after PServer finished the optimize stage.
Contributor:

Does the trainer wait for the "latest" parameter?

Member:

No, the trainer does not need to wait; it can just get and use the current parameter on the parameter server.
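As a toy illustration of this "no waiting" behavior (not PaddlePaddle code; `ParamTable` and its methods are invented for this sketch), a fetch simply returns whatever value the pserver currently holds:

```python
import threading
import numpy as np

class ParamTable:
    # Hypothetical pserver-side parameter store: fetches never wait for pending updates.
    def __init__(self, params):
        self._params = dict(params)
        self._mu = threading.Lock()

    def apply_grad(self, name, grad, lr=0.01):
        with self._mu:
            self._params[name] = self._params[name] - lr * grad

    def fetch(self, name):
        # Return the current value immediately; no barrier, no waiting.
        with self._mu:
            return self._params[name].copy()

table = ParamTable({"w1": np.zeros(4)})
table.apply_grad("w1", np.ones(4))
print(table.fetch("w1"))  # whatever is current right now, possibly "stale" for other trainers
```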


### Trainer

- We need a new Operator named `RemoteOptimize` to send gradients to multiple PServer
Contributor:

Does the logic need to be implemented in a single Op? Can the send and fetch operations be run independently and scheduled on demand?

Contributor Author:

Running the send and fetch ops independently would lead to nondeterministic behavior. For example, the op execution pipeline on the PServer could be:
send(w1'_trainer0)->send(w1'_trainer1)->fetch(w1_trainer0)->fetch(w1_trainer1)

Here trainer0 fetches the result of sgd(w1'_trainer0)->sgd(w1'_trainer1), but the expected result after send(w1'_trainer0) is sgd(w1'_trainer0).
The benefit is that we can achieve a much faster training speed.

I also discussed this with @jacquesqiao just now, and he reminded me that we can use a lock-free approach to implement async distributed training, which may achieve a much faster speed. The reference paper: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf

Maybe we can implement the lock-free async distributed training first and then run some experiments.

Member:

Yes, the Adam project (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf) from Microsoft uses multi-threaded model parameter updates without locks to improve model accuracy. We can implement it and run a test.
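For reference, a toy sketch of the lock-free, multi-threaded update style discussed here (in the spirit of the cited paper, not code from this PR). The assumption is that updates are sparse enough that occasional races are tolerable:

```python
import threading
import numpy as np

w = np.zeros(1000)  # shared parameter vector, updated without any lock

def worker(seed, steps=1000, lr=0.01):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = int(rng.integers(0, w.size))  # each step touches one coordinate
        grad = rng.normal()               # stand-in for a real gradient component
        w[i] -= lr * grad                 # lock-free update; concurrent writes may race

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```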

@panyx0718 previously approved these changes Apr 16, 2018

Contributor @panyx0718 left a comment:

LG overall. Please sync with @jacquesqiao on the final design.


### Parameter Server

<img src="./src/async_pserver.png" width="750"/>
Contributor:

"No Pserver" -> "On Pserver"

Contributor Author:

Done.

1. For each gradient variables, they may distribute on different GPU card and aggregate
them while they are all calculated.
1. Split the gradient variable into multiple blocks according to the number of PServer
instances and then sent them.
Contributor:

sent -> send

Contributor Author:

done.
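As a rough illustration of the splitting step quoted above (my own sketch; the block boundaries used by the actual implementation may differ), a gradient can be flattened and split into roughly equal blocks, one per pserver instance:

```python
import numpy as np

def split_gradient(grad, num_pservers):
    # Flatten the gradient and split it into roughly equal blocks,
    # one block per pserver instance.
    return np.array_split(grad.reshape(-1), num_pservers)

g_w1 = np.arange(10, dtype=np.float32)
blocks = split_gradient(g_w1, num_pservers=3)
for i, block in enumerate(blocks):
    print("send block", i, "to pserver", i, ":", block)
```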

instances and then sent them.
1. PServer would run an `Optimize Block` using a specified optimize algorithm to update
the specified parameter.
1. The trainer will fetch the parameter before running forward Op depends on the specified
Contributor:

don't quite get the point

Contributor:

ah, I think you may want to add "which" between 'OP' and 'depends'

Contributor Author:

Done.
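A rough trainer-side sketch of the ordering discussed in this thread: fetch the parameters the forward ops depend on, then run the forward pass. `fetch_from_pserver` and `run_forward` are hypothetical placeholders, not framework APIs:

```python
def fetch_from_pserver(name):
    # Placeholder RPC: return the pserver's current value of one parameter.
    ...

def run_forward(batch, params):
    # Placeholder: run the forward ops with the fetched parameter values.
    ...

def train_loop(batches, param_names):
    for batch in batches:
        # Fetch first, so every forward op sees the freshly fetched values.
        params = {name: fetch_from_pserver(name) for name in param_names}
        run_forward(batch, params)
        # backward pass and gradient sends would follow here (omitted)
```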

- Schedule `FetchVars` operator to fetch the latest parameter from PServer before running
the forward ops.
- There could be a large number of gradient variables to be sent, so we need to use another
thread pool(IO Threadpool) which a number of the schedulable threads is larger than the
Contributor:

which -> whose

Contributor Author:

Done.

the forward ops.
- There could be a large number of gradient variables to be sent, so we need to use another
thread pool(IO Threadpool) which a number of the schedulable threads is larger than the
computing thread pool to avoid competitive the thread resources with computing.
Contributor:

competitive -> competing

Contributor Author:

Done.
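A small sketch of the two-pool idea (illustrative only, using Python's standard library rather than the framework's internal thread pools; `send_grad` is a placeholder for the actual RPC):

```python
from concurrent.futures import ThreadPoolExecutor

compute_pool = ThreadPoolExecutor(max_workers=4)  # runs computing ops
io_pool = ThreadPoolExecutor(max_workers=16)      # larger pool reserved for sends/fetches

def send_grad(name, grad):
    # Placeholder for the RPC that ships one gradient block to a pserver.
    ...

def send_all_gradients(grads):
    # Gradient sends go to the IO pool so they never compete with computing threads.
    return [io_pool.submit(send_grad, name, g) for name, g in grads.items()]
```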

<img src="./src/async_pserver.png" width="750"/>

- There should be multiple trainer instances want to optimize the same parameter at
the same time, to avoid the pollution, we need one `BlockingQueue` for each gradient
Contributor:

pollution -> racing
maybe?

Contributor Author:

Thanks. If there is no locking for updates to the same parameter, the data will be polluted by multiple threads, but I think using "racing" here is clearer. :)
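To make the per-parameter queue idea concrete, here is a minimal sketch with Python's `queue.Queue` standing in for the `BlockingQueue`; the names and the plain SGD update are illustrative, not the actual pserver code:

```python
import queue
import threading
import numpy as np

params = {"w1": np.zeros(4)}
grad_queues = {"w1": queue.Queue()}  # one blocking queue per gradient/parameter

def optimize_worker(name, lr=0.01):
    # A single consumer per parameter serializes the updates, so there is no racing.
    q = grad_queues[name]
    while True:
        grad = q.get()      # blocks until some trainer's gradient arrives
        if grad is None:    # shutdown signal for this toy example
            break
        params[name] -= lr * grad

w1_thread = threading.Thread(target=optimize_worker, args=("w1",))
w1_thread.start()

# Trainers (possibly many, at any time) just enqueue their gradients:
grad_queues["w1"].put(np.ones(4))
grad_queues["w1"].put(None)
w1_thread.join()
```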

Member @jacquesqiao left a comment:

LGTM!

@Yancey1989 merged commit 7a2297d into PaddlePaddle:develop on Apr 17, 2018
@Yancey1989 deleted the async_update_design branch on April 17, 2018 at 03:05

For the typical synchronous distributed training, some significant steps are as follows:

1. A Trainer will compute the gradients and SEND them to the Parameter
Collaborator:

A Trainer => A trainer process


For the typical synchronous distributed training, some significant steps are as follows:

1. A Trainer will compute the gradients and SEND them to the Parameter
Collaborator:

SEND => send

Use * if you want to emphasize the text.


For the typical synchronous distributed training, some significant steps are as follows:

1. A Trainer will compute the gradients and SEND them to the Parameter
Collaborator:

the Parameter Server(PServer) nodes => the parameter server (pserver) processes

  1. If we use Parameter Server, then the shorthand should be PS instead of PServer.
  2. In English, there needs to be a space before the left parenthesis.

parameters after the optimizing stage. The performance of a distributed training job
depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
the performance of synchronously distributed training might be very slow because of
the slow node. So this design doc would introduce an approach to implement
Collaborator:

The "so" here does not hold: since the preceding text is about slow workers, the fix should be backup workers. A well-known example is MapReduce's backup workers: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Collaborator:

My understanding is that in deep learning, asynchronous updating is mainly for scalability, more specifically for failover: if some trainers crash (rather than merely being slow), the job can still continue.

Contributor Author:

Thanks @wangkuiyi

> The "so" here does not hold: since the preceding text is about slow workers, the fix should be backup workers.

Agreed. In synchronous training, a barrier is used to keep the parameters consistent across all trainers, and slow workers make training take longer, so the cause here should be the barrier rather than the slow workers.

> My understanding is that in deep learning, asynchronous updating is mainly for scalability, more specifically for failover: if some trainers crash (rather than merely being slow), the job can still continue.

To add to that, asynchronous updating is also meant to improve system throughput; for example, https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf also describes experiments that use async updates to improve throughput and even accuracy.

Contributor Author @Yancey1989:

Thanks, @wangkuiyi, I will update this design doc according to your comments in the next PR.
