Add async update design doc #9932

Merged 5 commits into PaddlePaddle:develop on Apr 17, 2018

Conversation

@Yancey1989 (Contributor) commented Apr 16, 2018

Fixed #9925
Tasks: #9941


In the synchronously distributed training, there should be a `Barrier` to synchronise the
parameters after the optimizing stage. The performance of a distributed training job
depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
Member:

lowest => slowest

Member:

thousand => thousands of

Contributor Author:

Done.

In the synchronously distributed training, there should be a `Barrier` to synchronise the
parameters after the optimizing stage. The performance of a distributed training job
depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
the performance of synchronously distributed training might be very slow because of
Member:

performance slow => performance poor

Contributor Author:

Thanks, done.


### Parameter Server

<img src="./src/async_pserver.png" width="750"/>
Member:

I think we do not need an aggregation stage; we can just read a variable from the queue and apply it to the parameter.

Contributor Author:

Yes, you're right, we won't aggregate the received gradients; I will update it.

Contributor Author:

Done.


1. A Trainer will compute the gradients and SEND them to the Parameter
Server(PServer) nodes.
1. After the PServer node received gradients came from all the Trainers, it would apply the gradient to the respective variables, and using an optimize algorithms(SGD,
Member @jacquesqiao, Apr 16, 2018:

It will aggregate the gradients for the same parameter into one gradient and then apply the aggregated gradient to the respective parameter.

Contributor Author:

Done.
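For illustration only, here is a minimal sketch of the aggregate-then-apply step described in the comment above. It is not code from this PR; `aggregate`, `sgd_update`, and `sync_step` are hypothetical names, and plain SGD is assumed as the optimizer.

```python
import numpy as np

def aggregate(grads):
    # Combine the gradients received for the same parameter into one gradient.
    return np.mean(grads, axis=0)

def sgd_update(param, grad, lr=0.01):
    # Apply one plain SGD step to the parameter.
    return param - lr * grad

def sync_step(w1, grads_from_trainers, lr=0.01):
    # One synchronous step for `w1`: gradients from all trainers are
    # aggregated first, then the aggregated gradient is applied.
    assert len(grads_from_trainers) > 0
    return sgd_update(w1, aggregate(grads_from_trainers), lr)

w1 = np.zeros(3)
w1 = sync_step(w1, [np.ones(3), 3 * np.ones(3)])  # aggregated gradient is 2.0 per element
```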


As the figure above, we describe a global view of asynchronously update process and use
the parameter `w1` as an example to introduce the steps:
1. For each gradient variables, they may distribute on different GPU card and aggregate
Member:

I think we only need to consider the final gradient on one trainer and do not need to consider how it is produced; for example, we do not need to consider whether the gradient is aggregated from multi-threaded or multi-GPU training on that machine.

Contributor Author:

Maybe not; we need to clarify the relationship between multiple devices and multiple machines. For this design doc, training is synchronous inside a trainer and asynchronous between machines.

Member:

Oh, yes, it seems that single-process multi-threaded async training is an independent project.

Contributor Author:

Done.

Contributor @panyx0718 left a comment:


As the figure above, we describe a global view of asynchronously update process and use
the parameter `w1` as an example to introduce the steps:
1. For each gradient variables, they may distribute on different GPU card and aggregate
Contributor:

Do you mean aggregating multiple gradients from GPUs in a single machine? Have you considered sending gradients to the pserver directly, without aggregation? Each GPU is an independent training instance.

Contributor Author:

> Do you mean aggregating multiple gradients from GPUs in a single machine?

Yes. Because communication between GPU cards in the same machine is much faster than between different nodes, aggregating first and then sending may achieve a higher training speed.

But I think "each GPU is an independent training instance" is also a feasible approach; maybe we can implement it in the future and run some experiments to compare the speed and accuracy.

instances and sent them.
1. PServer would run an `Optimize Block` to use a specified optimize algorithm to update
the specified parameter, such as `w1`.
1. The trainer will fetch the latest parameter after PServer finished the optimize stage.
Contributor:

Does the trainer wait for the "latest" parameter?

Member:

No, the trainer does not need to wait; it can just get and use the current parameter on the parameter server.
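As a toy illustration of this "no waiting" behavior (not PaddlePaddle code; `ParamTable` and its methods are invented for this sketch), a fetch simply returns whatever value the pserver currently holds:

```python
import threading
import numpy as np

class ParamTable:
    # Hypothetical pserver-side parameter store: fetches never wait for pending updates.
    def __init__(self, params):
        self._params = dict(params)
        self._mu = threading.Lock()

    def apply_grad(self, name, grad, lr=0.01):
        with self._mu:
            self._params[name] = self._params[name] - lr * grad

    def fetch(self, name):
        # Return the current value immediately; no barrier, no waiting.
        with self._mu:
            return self._params[name].copy()

table = ParamTable({"w1": np.zeros(4)})
table.apply_grad("w1", np.ones(4))
print(table.fetch("w1"))  # whatever is current right now, possibly "stale" for other trainers
```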


### Trainer

- We need a new Operator named `RemoteOptimize` to send gradients to multiple PServer
Contributor:

Does the logic need to be implemented in a single Op? Can the send and fetch operations be run independently and scheduled on demand?

Contributor Author:

Running the send and fetch ops independently would lead to nondeterministic behavior. For example, the op execution pipeline on the PServer could be:
send(w1'_trainer0)->send(w1'_trainer1)->fetch(w1_trainer0)->fetch(w1_trainer1)

Here trainer0 fetches the result of sgd(w1'_trainer0)->sgd(w1'_trainer1), but the expected result after send(w1'_trainer0) is sgd(w1'_trainer0).
The benefit is that we can achieve a much faster training speed.

I also discussed this with @jacquesqiao just now, and he reminded me that we can use a lock-free approach to implement async distributed training, which may achieve a much faster speed. The reference paper: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf

Maybe we can implement the lock-free async distributed training first and then run some experiments.

Member:

Yes, the Adam project (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf) from Microsoft uses multi-threaded model parameter updates without locks to improve model accuracy. We can implement it and run a test.
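For reference, a toy sketch of the lock-free, multi-threaded update style discussed here (in the spirit of the cited paper, not code from this PR). The assumption is that updates are sparse enough that occasional races are tolerable:

```python
import threading
import numpy as np

w = np.zeros(1000)  # shared parameter vector, updated without any lock

def worker(seed, steps=1000, lr=0.01):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = int(rng.integers(0, w.size))  # each step touches one coordinate
        grad = rng.normal()               # stand-in for a real gradient component
        w[i] -= lr * grad                 # lock-free update; concurrent writes may race

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```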

@panyx0718 previously approved these changes Apr 16, 2018

Contributor @panyx0718 left a comment:

LG overall. Please sync with @jacquesqiao on the final design.


### Parameter Server

<img src="./src/async_pserver.png" width="750"/>
Contributor:

"No Pserver" -> "On Pserver"

Contributor Author:

Done.

1. For each gradient variables, they may distribute on different GPU card and aggregate
them while they are all calculated.
1. Split the gradient variable into multiple blocks according to the number of PServer
instances and then sent them.
Contributor:

sent -> send

Contributor Author:

done.
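As a rough illustration of the splitting step quoted above (my own sketch; the block boundaries used by the actual implementation may differ), a gradient can be flattened and split into roughly equal blocks, one per pserver instance:

```python
import numpy as np

def split_gradient(grad, num_pservers):
    # Flatten the gradient and split it into roughly equal blocks,
    # one block per pserver instance.
    return np.array_split(grad.reshape(-1), num_pservers)

g_w1 = np.arange(10, dtype=np.float32)
blocks = split_gradient(g_w1, num_pservers=3)
for i, block in enumerate(blocks):
    print("send block", i, "to pserver", i, ":", block)
```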

instances and then sent them.
1. PServer would run an `Optimize Block` using a specified optimize algorithm to update
the specified parameter.
1. The trainer will fetch the parameter before running forward Op depends on the specified
Contributor:

don't quite get the point

Contributor:

ah, I think you may want to add "which" between 'OP' and 'depends'

Contributor Author:

Done.
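A rough trainer-side sketch of the ordering discussed in this thread: fetch the parameters the forward ops depend on, then run the forward pass. `fetch_from_pserver` and `run_forward` are hypothetical placeholders, not framework APIs:

```python
def fetch_from_pserver(name):
    # Placeholder RPC: return the pserver's current value of one parameter.
    ...

def run_forward(batch, params):
    # Placeholder: run the forward ops with the fetched parameter values.
    ...

def train_loop(batches, param_names):
    for batch in batches:
        # Fetch first, so every forward op sees the freshly fetched values.
        params = {name: fetch_from_pserver(name) for name in param_names}
        run_forward(batch, params)
        # backward pass and gradient sends would follow here (omitted)
```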

- Schedule `FetchVars` operator to fetch the latest parameter from PServer before running
the forward ops.
- There could be a large number of gradient variables to be sent, so we need to use another
thread pool(IO Threadpool) which a number of the schedulable threads is larger than the
Contributor:

which -> whose

Contributor Author:

Done.

the forward ops.
- There could be a large number of gradient variables to be sent, so we need to use another
thread pool(IO Threadpool) which a number of the schedulable threads is larger than the
computing thread pool to avoid competitive the thread resources with computing.
Contributor:

competitive -> competing

Contributor Author:

Done.
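A small sketch of the two-pool idea (illustrative only, using Python's standard library rather than the framework's internal thread pools; `send_grad` is a placeholder for the actual RPC):

```python
from concurrent.futures import ThreadPoolExecutor

compute_pool = ThreadPoolExecutor(max_workers=4)  # runs computing ops
io_pool = ThreadPoolExecutor(max_workers=16)      # larger pool reserved for sends/fetches

def send_grad(name, grad):
    # Placeholder for the RPC that ships one gradient block to a pserver.
    ...

def send_all_gradients(grads):
    # Gradient sends go to the IO pool so they never compete with computing threads.
    return [io_pool.submit(send_grad, name, g) for name, g in grads.items()]
```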

<img src="./src/async_pserver.png" width="750"/>

- There should be multiple trainer instances want to optimize the same parameter at
the same time, to avoid the pollution, we need one `BlockingQueue` for each gradient
Contributor:

pollution -> racing
maybe?

Contributor Author:

Thanks. If there is no locking for updates to the same parameter, the data will be polluted by multiple threads, but I think using "racing" here is clearer. :)
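To make the per-parameter queue idea concrete, here is a minimal sketch with Python's `queue.Queue` standing in for the `BlockingQueue`; the names and the plain SGD update are illustrative, not the actual pserver code:

```python
import queue
import threading
import numpy as np

params = {"w1": np.zeros(4)}
grad_queues = {"w1": queue.Queue()}  # one blocking queue per gradient/parameter

def optimize_worker(name, lr=0.01):
    # A single consumer per parameter serializes the updates, so there is no racing.
    q = grad_queues[name]
    while True:
        grad = q.get()      # blocks until some trainer's gradient arrives
        if grad is None:    # shutdown signal for this toy example
            break
        params[name] -= lr * grad

w1_thread = threading.Thread(target=optimize_worker, args=("w1",))
w1_thread.start()

# Trainers (possibly many, at any time) just enqueue their gradients:
grad_queues["w1"].put(np.ones(4))
grad_queues["w1"].put(None)
w1_thread.join()
```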

Member @jacquesqiao left a comment:

LGTM!

@Yancey1989 merged commit 7a2297d into PaddlePaddle:develop on Apr 17, 2018
@Yancey1989 deleted the async_update_design branch on April 17, 2018 at 03:05

For the typical synchronous distributed training, some significant steps are as follows:

1. A Trainer will compute the gradients and SEND them to the Parameter
Collaborator:

A Trainer => A trainer process


For the typical synchronous distributed training, some significant steps are as follows:

1. A Trainer will compute the gradients and SEND them to the Parameter
Collaborator:

SEND => send

Use * if you want to emphasize the text.


For the typical synchronous distributed training, some significant steps are as follows:

1. A Trainer will compute the gradients and SEND them to the Parameter
Collaborator:

the Parameter Server(PServer) nodes => the parameter server (pserver) processes

  1. If we use Parameter Server, then the shorthand should be PS instead of PServer.
  2. In English, there needs to be a space before the left parenthesis.

parameters after the optimizing stage. The performance of a distributed training job
depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
the performance of synchronously distributed training might be very slow because of
the slow node. So this design doc would introduce an approach to implement
Collaborator:

The "so" here does not hold: since the preceding text is about slow workers, the fix should be backup workers. A well-known example is MapReduce's backup workers: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Collaborator:

My understanding is that in deep learning, asynchronous updating is mainly for scalability, more specifically for failover: if some trainers crash (rather than merely being slow), the job can still continue.

Contributor Author:

Thanks @wangkuiyi

> The "so" here does not hold: since the preceding text is about slow workers, the fix should be backup workers.

Agreed. In synchronous training, a barrier is used to keep the parameters consistent across all trainers, and slow workers make training take longer, so the cause here should be the barrier rather than the slow workers.

> My understanding is that in deep learning, asynchronous updating is mainly for scalability, more specifically for failover: if some trainers crash (rather than merely being slow), the job can still continue.

To add to that, asynchronous updating is also meant to improve system throughput; for example, https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf also describes experiments that use async updates to improve throughput and even accuracy.

Contributor Author @Yancey1989:

Thanks, @wangkuiyi, I will update this design doc according to your comments in the next PR.
