Add async update design doc #9932
> In the synchronously distributed training, there should be a `Barrier` to synchronise the
> parameters after the optimizing stage. The performance of a distributed training job
> depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
lowest => slowest
thousand => thousands of
Done.
> In the synchronously distributed training, there should be a `Barrier` to synchronise the
> parameters after the optimizing stage. The performance of a distributed training job
> depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
> the performance of synchronously distributed training might be very slow because of
performance slow => performance poor
Thanks, done.
> ### Parameter Server
>
> <img src="./src/async_pserver.png" width="750"/>
I think we do not need an aggregation stage; we can just read a variable from the queue and apply it to the parameter.
Yes, you're right, we don't need to aggregate the received gradients; I will update it.
Done.
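
To illustrate the agreed change, here is a minimal sketch in plain Python (the names `params`, `grad_queue`, and `apply_sgd` are hypothetical, not Fluid operators) of a PServer loop that pops one gradient from a blocking queue and applies it directly, with no aggregation stage:

```python
import queue
import threading

# Hypothetical in-memory state for one parameter shard on the PServer.
params = {"w1": 0.0}
grad_queue = queue.Queue()  # one blocking queue per gradient variable

def apply_sgd(name, grad, lr=0.01):
    # Update the parameter in place with plain SGD; no aggregation step.
    params[name] -= lr * grad

def pserver_loop():
    while True:
        name, grad = grad_queue.get()  # blocks until a trainer sends a gradient
        if name is None:               # sentinel to stop the loop
            break
        apply_sgd(name, grad)

# Trainers would push (name, gradient) pairs as soon as they are computed.
worker = threading.Thread(target=pserver_loop)
worker.start()
grad_queue.put(("w1", 0.5))
grad_queue.put((None, None))
worker.join()
print(params["w1"])  # 0.0 - 0.01 * 0.5 = -0.005
```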
> 1. A Trainer will compute the gradients and SEND them to the Parameter
> Server(PServer) nodes.
> 1. After the PServer node received gradients came from all the Trainers, it would apply the gradient to the respective variables, and using an optimize algorithms(SGD,
It will aggregate the gradient for the same parameter into one gradient and then apply the aggregated gradient to the respective parameter.
Done.
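
For contrast, a toy sketch (plain Python, illustrative names only) of the synchronous case described above, where the PServer aggregates the gradients for one parameter from all trainers before applying a single SGD step:

```python
def sync_update(param, grads, lr=0.01):
    """Aggregate gradients from all trainers for one parameter, then apply SGD.

    `grads` holds one gradient per trainer; the barrier guarantees that every
    trainer has contributed before this runs.  (Illustrative only.)
    """
    aggregated = sum(grads) / len(grads)
    return param - lr * aggregated

# Example: three trainers send gradients for w1 in the same step.
w1 = 1.0
w1 = sync_update(w1, [0.2, 0.4, 0.6])
print(w1)  # 1.0 - 0.01 * 0.4 = 0.996
```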
> As the figure above, we describe a global view of asynchronously update process and use
> the parameter `w1` as an example to introduce the steps:
> 1. For each gradient variables, they may distribute on different GPU card and aggregate
I think we only need to consider the final gradient on one trainer and not how it is produced; for example, we do not need to consider whether the gradient is aggregated from multi-threaded or multi-GPU training on that trainer.
Maybe not; we need to clarify the relation between multiple devices and multiple machines. For this design doc, the process is synchronous inside a trainer and asynchronous between machines.
Oh, yes, it seems that single-process multi-threaded async training is an independent project.
Done.
This answer describes how asynchronous training works in distributed TensorFlow: https://stackoverflow.com/questions/43147435/how-does-asynchronous-training-work-in-distributed-tensorflow
> As the figure above, we describe a global view of asynchronously update process and use
> the parameter `w1` as an example to introduce the steps:
> 1. For each gradient variables, they may distribute on different GPU card and aggregate
Do you mean aggregating multiple gradients from GPUs on a single machine? Have you considered sending gradients to the PServer directly without aggregation? Each GPU would then be an independent training instance.
> Do you mean aggregating multiple gradients from GPUs on a single machine?

Yes. Because the bandwidth between GPU cards is much higher than between nodes, aggregating first and then sending may achieve a higher training speed.
But I think treating each GPU as an independent training instance is also feasible; maybe we can implement it in the future and run some experiments to compare speed and accuracy.
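
A small sketch of the first option, assuming NumPy and a hypothetical helper `aggregate_local_gradients`: the per-GPU gradients of one trainer are averaged locally (intra-machine links being faster than the network) before a single gradient is sent to the PServer:

```python
import numpy as np

def aggregate_local_gradients(per_gpu_grads):
    """Average the gradients computed on each GPU of one trainer before
    sending a single gradient to the PServer (assumed behavior, per the
    discussion above)."""
    return np.mean(np.stack(per_gpu_grads), axis=0)

# Example: gradients for w1 from 4 GPU cards on one trainer.
local = [np.array([0.1, 0.2]), np.array([0.3, 0.2]),
         np.array([0.2, 0.2]), np.array([0.2, 0.2])]
grad_to_send = aggregate_local_gradients(local)
print(grad_to_send)  # [0.2 0.2]
```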
> instances and sent them.
> 1. PServer would run an `Optimize Block` to use a specified optimize algorithm to update
> the specified parameter, such as `w1`.
> 1. The trainer will fetch the latest parameter after PServer finished the optimize stage.
Does the trainer wait for the "latest" parameter?
No, the trainer does not need to wait; it can just fetch and use the current parameter on the parameter server.
> ### Trainer
>
> - We need a new Operator named `RemoteOptimize` to send gradients to multiple PServer
Does the logic need to be implemented in a single Op? Can the send and fetch operations be run independently and scheduled on demand?
Running the send and fetch ops independently would cause nondeterministic behavior. For example, the op execution pipeline on the PServer could be:
`send(w1'_trainer0) -> send(w1'_trainer1) -> fetch(w1_trainer0) -> fetch(w1_trainer1)`
Trainer0 then fetches the result of `sgd(w1'_trainer0) -> sgd(w1'_trainer1)`, but the expected result of `send(w1_trainer0)` is `sgd(w1'_trainer0)`.
The benefit, though, is that we could achieve a much faster training speed.
I also discussed this with @jacquesqiao just now, and he reminded me that we can use a lock-free approach to implement async distributed training, which may achieve a much higher speed. The reference paper: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf
Maybe we can implement the lock-free async distributed training first and then run some experiments.
Yes, the Adam project (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf) from Microsoft uses multi-threaded, lock-free model parameter updates to improve model accuracy. We can implement it and run a test.
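
A toy sketch of the lock-free idea (Hogwild-style updates in plain Python with NumPy; illustrative only, not the proposed Fluid implementation): several threads apply SGD steps to a shared parameter vector without taking a lock, trading occasional interleaved updates for throughput:

```python
import threading
import numpy as np

# Shared parameter vector updated by several threads without a lock.
w = np.zeros(4)

def worker(grads, lr=0.01):
    global w  # ndarray "-=" updates the shared array in place
    for g in grads:
        # No lock: concurrent updates may interleave, which the lock-free
        # scheme tolerates in exchange for higher throughput.
        w -= lr * g

threads = [threading.Thread(target=worker,
                            args=([np.random.randn(4) for _ in range(100)],))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(w)
```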
Looks good overall. Please sync with @jacquesqiao on the final design.
> ### Parameter Server
>
> <img src="./src/async_pserver.png" width="750"/>
"No Pserver" -> "On Pserver"
Done.
> 1. For each gradient variables, they may distribute on different GPU card and aggregate
> them while they are all calculated.
> 1. Split the gradient variable into multiple blocks according to the number of PServer
> instances and then sent them.
sent -> send
done.
> instances and then sent them.
> 1. PServer would run an `Optimize Block` using a specified optimize algorithm to update
> the specified parameter.
> 1. The trainer will fetch the parameter before running forward Op depends on the specified
I don't quite get the point.
Ah, I think you may want to add "which" between "Op" and "depends".
Done.
> - Schedule `FetchVars` operator to fetch the latest parameter from PServer before running
> the forward ops.
> - There could be a large number of gradient variables to be sent, so we need to use another
> thread pool(IO Threadpool) which a number of the schedulable threads is larger than the
which -> whose
Done.
> the forward ops.
> - There could be a large number of gradient variables to be sent, so we need to use another
> thread pool(IO Threadpool) which a number of the schedulable threads is larger than the
> computing thread pool to avoid competitive the thread resources with computing.
competitive -> competing
Done.
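
A sketch of the separate IO thread pool idea, using Python's `concurrent.futures` (the pool sizes and helper names are assumptions for illustration): gradients are computed on a small compute pool and handed off to a larger IO pool for sending, so the send RPCs do not compete with the computing threads:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Separate pools: a small pool for compute and a larger IO pool for sending
# gradients.  (Sizes are illustrative, not values from the design doc.)
compute_pool = ThreadPoolExecutor(max_workers=4)
io_pool = ThreadPoolExecutor(max_workers=16)

def compute_gradient(batch):
    time.sleep(0.01)          # stand-in for the backward pass
    return {"w1": 0.1 * batch}

def send_to_pserver(grad):
    time.sleep(0.05)          # stand-in for the RPC to a PServer instance
    return grad

for batch in range(8):
    grad_future = compute_pool.submit(compute_gradient, batch)
    # Hand the finished gradient to the IO pool without blocking compute.
    grad_future.add_done_callback(
        lambda f: io_pool.submit(send_to_pserver, f.result()))

compute_pool.shutdown(wait=True)
io_pool.shutdown(wait=True)
```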
> <img src="./src/async_pserver.png" width="750"/>
>
> - There should be multiple trainer instances want to optimize the same parameter at
> the same time, to avoid the pollution, we need one `BlockingQueue` for each gradient
pollution -> racing, maybe?
Thanks. If there is no locking for the update operator on the same parameter, the data will be polluted by multiple threads, but I think using "racing" here is clearer. :)
LGTM!
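
A toy sketch (hypothetical Python names, not the Fluid operators) of the per-gradient `BlockingQueue` idea: each gradient variable gets its own queue and its own updater thread, so two trainers sending gradients for the same parameter at the same time cannot race on the update:

```python
import queue
import threading

params = {"w1": 0.0, "w2": 0.0}
lr = 0.01
# One blocking queue per gradient variable.
queues = {name: queue.Queue() for name in params}

def updater(name):
    # Only this thread touches params[name], so updates cannot race.
    q = queues[name]
    while True:
        grad = q.get()
        if grad is None:       # sentinel to stop
            break
        params[name] -= lr * grad

threads = [threading.Thread(target=updater, args=(n,)) for n in params]
for t in threads:
    t.start()

# Two "trainers" send gradients for the same parameter concurrently.
queues["w1"].put(0.5)
queues["w1"].put(0.3)
queues["w2"].put(1.0)
for q in queues.values():
    q.put(None)
for t in threads:
    t.join()
print(params)  # {'w1': -0.008, 'w2': -0.01}
```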
> For the typical synchronous distributed training, some significant steps are as follows:
>
> 1. A Trainer will compute the gradients and SEND them to the Parameter
A Trainer => A trainer process
> For the typical synchronous distributed training, some significant steps are as follows:
>
> 1. A Trainer will compute the gradients and SEND them to the Parameter
SEND => send
Use * if you want to emphasize the text.
> For the typical synchronous distributed training, some significant steps are as follows:
>
> 1. A Trainer will compute the gradients and SEND them to the Parameter
the Parameter Server(PServer) nodes => the parameter server (pserver) processes
- If we use Parameter Server, then the shorthand should be PS instead of PServer.
- In English, there should be a space before the left parenthesis.
> parameters after the optimizing stage. The performance of a distributed training job
> depends on the lowest node, if there were hundreds or thousand training nodes in a Job,
> the performance of synchronously distributed training might be very slow because of
> the slow node. So this design doc would introduce an approach to implement
The "So" here does not hold: the problem described above is slow workers, so the remedy should be backup workers. A well-known example is MapReduce's backup workers: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
My understanding is that in deep learning, asynchronous updating is mainly for scalability, specifically fault tolerance: if some trainers crash (rather than merely run slowly), the job can continue.
Thanks @wangkuiyi

> The "So" here does not hold: the problem described above is slow workers, so the remedy should be backup workers.

Agreed. In synchronous training, a barrier is used to keep the parameters consistent across all trainers, and a slow worker makes the whole training take longer, so the cause here is the barrier rather than the slow worker itself.

> In deep learning, asynchronous updating is mainly for scalability, specifically fault tolerance: if some trainers crash (rather than merely run slowly), the job can continue.

One addition: asynchronous updating is also meant to improve system throughput. For example, https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf reports experiments where async updates improve throughput and even accuracy.
Thanks, @wangkuiyi, I will update this design doc according to your comments in the next PR.
Fixed #9925
Tasks: #9941