Design doc: operator-based parameter server. #3747
Conversation
doc/design/ops/dist_train.md
## Abstract

We propose an approach to implment the parameter server. In this
implement
Fixed, thanks!
doc/design/ops/dist_train.md
Below is an example of converting the user defined graph to the
sub-graphs for the trainer and the parameter server:

<img src="src/local-graph.png" width="300"/>
W is also an input of the parameter update op.
If I understand correctly, the SEND op only sends the graph and the RECEIVE op only receives the gradient. Any given worker sees only part of the whole graph, but how would the training data travel through the whole graph? Would some part of the graph sit idle until data reaches its parent?
doc/design/ops/dist_train.md
1. The parameter variable W and its optimizer sub-graph are placed on the parameter server.
1. Operators are added to the sub-graphs.
   - *send* operator sends data and the sender's address to the destination.
Question: if there are multiple parameters (or variables) to send to the parameter server, do we:

- create multiple `Send` operators, one for each variable, or
- create one `Hash` operator to divide the parameters equally and one `Sender` operator to do the send?
And also, maybe add some description about the `send` and `recv` operators, like:

- Send:
  - Inputs:
  - Outputs:
  - Description:
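As an illustration of the suggested description format, here is a minimal sketch in Python. The `OpSpec` class and the `X`/`Out` input and output names are assumptions for illustration only, not the framework's actual operator registration API:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, illustration-only specs for the two operators discussed
# above; the real operator definitions in the framework will differ.
@dataclass
class OpSpec:
    name: str
    inputs: List[str]
    outputs: List[str]
    description: str

SEND_SPEC = OpSpec(
    name="send",
    inputs=["X"],     # e.g. a gradient variable computed on the trainer
    outputs=[],       # nothing flows onward on the sender's sub-graph
    description="Sends the input variable and the sender's address "
                "to the destination (the parameter server).",
)

RECV_SPEC = OpSpec(
    name="recv",
    inputs=[],        # the data arrives over the network, not via the graph
    outputs=["Out"],  # e.g. the received gradient on the parameter server
    description="Waits for a variable from a sender, then makes it "
                "available to downstream operators.",
)

for spec in (SEND_SPEC, RECV_SPEC):
    print(f"{spec.name}: {spec.inputs} -> {spec.outputs}")
```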
I have the same confusion as @typhoonzero; maybe we need an operator to shard parameters.

On the other point from @typhoonzero:

> create multiple Send operators for each variable

Maybe we only need one `Sender` operator. If we have too many parameters, too many `Send` operators will cause too many connections to the parameter server.
@typhoonzero @Yancey1989 Sorry, the PR could be clearer:

In short, the answer is "create multiple Send operators, one for each variable".

From the graph's perspective, there is one `Send` and one `Recv` OP for each variable (but not one per replica: different (trainer) replicas share the same `Send` and `Recv` for each variable).

As an implementation detail, we could group the send implementations into a single port handler.

In this design the variable placement is done by the graph converter before the graph is sent to the workers, so it is not a runtime concept like a `Hash` operator. I think the "Hash" solution is for the simplest element-wise optimization case. If we want the parameter server to do more than element-wise operations, we need to decide the parameter variable and OP placement before the graph is sent to the workers.
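To make the "placement is decided by the converter, one Send/Recv pair per variable" point concrete, here is a rough sketch. The `Op`, `SubGraph`, and `split_graph` names, the placement rule, and the `@GRAD` suffix convention are assumptions for illustration; this is not the actual converter code:

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Op:
    type: str
    inputs: List[str]
    outputs: List[str]

@dataclass
class SubGraph:
    ops: List[Op] = field(default_factory=list)

def split_graph(ops: List[Op], param_vars: Set[str]) -> Tuple[SubGraph, SubGraph]:
    """Hypothetical converter: place each parameter's optimizer sub-graph on
    the parameter server and add one send/recv pair per variable (shared by
    all trainer replicas), before the graph is dispatched to the workers."""
    trainer, pserver = SubGraph(), SubGraph()
    for op in ops:
        # Crude placement rule for illustration: an op that writes a
        # parameter variable (e.g. the SGD update) goes to the pserver.
        if any(out in param_vars for out in op.outputs):
            pserver.ops.append(op)
        else:
            trainer.ops.append(op)
    # One send/recv pair per parameter gradient, decided at conversion time.
    for w in sorted(param_vars):
        grad = w + "@GRAD"
        trainer.ops.append(Op("send", inputs=[grad], outputs=[]))
        pserver.ops.insert(0, Op("recv", inputs=[], outputs=[grad]))
    return trainer, pserver

# Tiny example: forward/backward stay on the trainer, the update on the pserver.
ops = [
    Op("mul",      ["X", "W"],           ["Y"]),
    Op("mul_grad", ["X", "W", "Y@GRAD"], ["W@GRAD"]),
    Op("sgd",      ["W", "W@GRAD"],      ["W"]),
]
trainer, pserver = split_graph(ops, {"W"})
print([op.type for op in trainer.ops])  # ['mul', 'mul_grad', 'send']
print([op.type for op in pserver.ops])  # ['recv', 'sgd']
```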
Understood. You mean that people who develop graphs do not need to look into the implementation details of how we actually send and recv variables; the graph describes how the computation flows logically. But when we build and optimize the graph, we can make the actual send operation one per trainer.

Will you add some implementation thoughts in this PR or in another one?
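Purely as an illustration of the "logically one Send per variable, physically one connection per trainer" idea, a runtime could multiplex every per-variable send over a single shared channel. All names below (such as `TrainerChannel`) are hypothetical and not part of the design:

```python
import queue

class TrainerChannel:
    """Hypothetical runtime object: every send op on one trainer shares a
    single connection (modeled here as an in-process queue), so the number
    of connections stays one per trainer no matter how many per-variable
    Send operators the graph contains."""
    def __init__(self):
        self._q = queue.Queue()

    def send(self, var_name, value):
        self._q.put((var_name, value))

    def recv(self):
        return self._q.get()

channel = TrainerChannel()

# Two logical Send ops (one per variable) reuse the same channel.
channel.send("W1@GRAD", [0.1, 0.2])
channel.send("W2@GRAD", [0.3])

# A single handler on the "parameter server" side drains the channel.
for _ in range(2):
    print(channel.recv())
```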
@typhoonzero Thanks for the reminder about the implementation thoughts! That's a good idea. I will probably not cover implementation details in this PR, but will create a separate issue to discuss them. After receiving your comments, I have some points to re-think; I will update this PR and create the implementation-detail issue at that time.
@putcn Sorry, my PR could be more clear, the
From @Superjom: after the sub-graphs are stitched back together, the result should still be a usable graph.
LGTM!
According to the current design, there are more concepts that need to be clarified in this design doc.
LGTM
It could be easier to review here.