Add ExecutionPlan design. #6078
9a038d1 to 617b8f6
paddle/framework/framework.proto
Outdated
message ExecutionPlan {
  optional ProgramDesc program = 1;
  repeated OpPlacement op_placement = 2;
}
How do we find the correspondence between an OpPlacement in ExecutionPlan and an OpDesc in ProgramDesc?
Are the number and order of operators in ExecutionPlan and ProgramDesc the same?
The number will be the same: each OP will have exactly one placement. The order does not have to be the same; otherwise the "name" field in OpPlacement would not be necessary.
Program{Block{Op}}. A Program has many blocks. A block has many ops.
However, the Program has many operator placements. We cannot get a one-to-one map by this data structure.
> However, the Program has many operator placements. We cannot get a one-to-one map by this data structure.
Sorry, I don't fully get this point. I thought different OPs have different names?
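To make the name-based correspondence discussed in this thread concrete, here is a minimal Python sketch. It assumes every OP carries a unique `name` across all blocks (a field the replies say still has to be added), and the classes below are hypothetical stand-ins for the protobuf messages, not the real generated types.

```python
from dataclasses import dataclass, field


@dataclass
class OpDesc:          # hypothetical stand-in; assumes an OP has a unique name
    name: str
    type: str


@dataclass
class BlockDesc:       # hypothetical stand-in for a block holding many OPs
    ops: list = field(default_factory=list)


@dataclass
class OpPlacement:     # mirrors the proposed message: OP name plus device
    name: str
    device: str


def match_placements(blocks, placements):
    """Return {op name: device}, raising if the mapping is not one-to-one."""
    op_names = [op.name for block in blocks for op in block.ops]
    if len(set(op_names)) != len(op_names):
        raise ValueError("OP names are not unique across blocks")
    device_of = {}
    for placement in placements:
        if placement.name in device_of:
            raise ValueError(f"duplicate placement for {placement.name}")
        device_of[placement.name] = placement.device
    if set(device_of) != set(op_names):
        raise ValueError("placements and OPs do not match one-to-one")
    return device_of


blocks = [
    BlockDesc([OpDesc("fc_0", "fc"), OpDesc("relu_0", "relu")]),
    BlockDesc([OpDesc("sgd_0", "sgd")]),
]
# The placement order differs from the OP order on purpose.
placements = [
    OpPlacement("sgd_0", "cpu"),
    OpPlacement("fc_0", "gpu0"),
    OpPlacement("relu_0", "gpu0"),
]
device_of = match_placements(blocks, placements)
```

The lookup works across blocks only because names are assumed to be globally unique; if names were unique only within a block, the placement would also need a block index.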
doc/design/program.md
Outdated
@@ -2,7 +2,7 @@

## Compile and Execution

- A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
+ A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.
- ExecutionPlan does not depend on the optimizer. An inference ProgramDesc can also have an ExecutionPlan.
- At what time can we decide the device on which an operator runs? In the current code, an operator has a CPU kernel and a GPU kernel, and the kernel is chosen at run time by the place of the DeviceContext.
Since ExecutionPlan is a proto message that stores device information, the device must be decided at compile time.
In short, we still have two parts, compile time and run time. At compile time we generate two proto messages: the first is ProgramDesc and the second is ExecutionPlan.
The ExecutionPlan is set by the user's configuration and Paddle's own automatic device placement policy. If users switch to another hardware environment and do not want to provide an ExecutionPlan, Paddle can generate one under its own automatic device placement policy.
An interface could be:
void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);
The user_config could be null.
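The behaviour proposed for this interface can be sketched in Python as follows. This is only an illustration: the `available_devices` parameter and the "prefer the first GPU, else CPU" default policy are assumptions made for the example, not Paddle's actual placement policy, and the plan is modelled as a plain dict rather than a protobuf message.

```python
def generate_execution_plan(op_names, user_config=None, available_devices=("cpu",)):
    """Return {op name: device}.

    user_config may be None, in which case every OP is placed by the
    default policy. The default policy here is hypothetical: use the
    first GPU if one is available, otherwise the CPU.
    """
    user_config = user_config or {}
    default_device = next(
        (d for d in available_devices if d.startswith("gpu")), "cpu"
    )
    plan = {}
    for name in op_names:
        device = user_config.get(name, default_device)
        if device not in available_devices:
            # The user pinned the OP to a device this environment lacks,
            # so fall back to the automatic choice.
            device = default_device
        plan[name] = device
    return plan


ops = ["fc_0", "sgd_0"]
auto_plan = generate_execution_plan(ops, None, ("cpu", "gpu0"))
pinned_plan = generate_execution_plan(ops, {"sgd_0": "cpu"}, ("cpu", "gpu0"))
cpu_only_plan = generate_execution_plan(ops, {"fc_0": "gpu0"}, ("cpu",))
```

The last call shows the case raised in the comment: the same user configuration still yields a valid plan after moving to hardware without a GPU.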
> ExecutionPlan does not depend on the optimizer. An inference ProgramDesc can also have an ExecutionPlan.
Agreed, I will find a better name for the optimizer.
> the kernel is chosen at run time by the place of the DeviceContext.
Understood, but I think deciding at run time leaves us no way to control where to place the OP. Being able to control it is very important.
> In short, we still have two parts, compile time and run time. At compile time we generate two proto messages: the first is ProgramDesc and the second is ExecutionPlan.
Agreed.
> An interface could be:
> void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);
The `user_config` should be part of `ProgramDesc`, since `ProgramDesc` describes what the user wants.
We'd better switch `Optimizer` to another term. Our Python code already has an `Optimizer`.
Same as @dzhwinter. Just my personal view: this Optimizer does not actually optimize. Compare the four steps for running a C program: COMPILER -> ASSEMBLER -> LINKER -> LOADER. How about renaming `optimizer` to `assembler`?
Generating `ExecutionPlan` is exactly like gcc's `-O` option.
We probably do not need a separate C++ class to optimize the graph, since we can just add a member function `OptimizeProgram` to the `Executor` class. The `ExecutionPlan` object should also be a member of `Executor`, so that calling `executor.run` executes the optimized graph.
Also, `ExecutionPlan` does not need to be a protobuf.
Please refer to #6141. I think that `ProgramDesc` is not enough to run a network. We also need to provide the device type and data type for each operator. Exposing these interfaces to users is necessary, even though the Paddle framework could provide a solution.
> We'd better switch `Optimizer` to another term. Our Python code already has an `Optimizer`.
Agreed, we need a better name. Thanks @Yancey1989! Assembler is a good candidate.
@typhoonzero I think whoever generates the `ExecutionPlan` from `ProgramDesc` should have the global view: the global program desc and the number of devices. Different `ExecutionPlans` are sent to different nodes. The `Executor`, on the other hand, runs locally and does not know about the devices on other nodes.
@QiJune thanks. I agree that we will need to enable the user's manual placement configuration, and that configuration should be in `ProgramDesc`. At the same time, `ExecutionPlan` should have placement information too. `ProgramDesc` and `ExecutionPlan` are two different things with different focuses; it's fine for them to have similar fields, and it's not duplication.
@Yancey1989 @helinwang Sorry, I did not quite get the point of the name Assembler. If this name is used, what are the Compiler/Linker/Loader in PaddlePaddle?
@zealoct thank you! I have changed the name to `Planner`. Do you think it conveys the meaning correctly?
doc/design/program.md
Outdated
@@ -15,7 +15,7 @@ optimize(cost)
train(cost, reader=mnist.train())
```

- The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line runs it.
+ The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line optimizes and runs it.
Maybe `optimizes` => `transform`?
doc/design/program.md
Outdated
### Optimizer

The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
Typo: `ExcutionPlan`.
doc/design/program.md
Outdated
The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.

For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.
It may be clearer if we add: `ProgramDesc` describes the device-independent computing process, while `ExecutionPlan` describes the device-related computing process.
> indicate which device the OP will run
Missing an `on` before `which` (=
  optional string name = 1;
  optional string device = 2;
}
Maybe we can add a detailed example in a comment.
message OpPlacement {
// pserver:gpu0
optional string name = 1;
optional string device = 2;
}
I think the "pserver" in "pserver:gpu0" is not necessary, the executor does not need to know what role (e.g., pserver) it takes. Maybe only "gpu0" is sufficient.
I'm a bit confused about what the `name` and `device` values would be at runtime. Can you give an example?
`name` should be the name of the OP (every OP should have a name); I will add this to the PR.
`device` should be something like "gpu0" or "cpu".
doc/design/program.md
Outdated
### Optimizer

The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.
Typos: `prgram` and `avaiable`.
doc/design/program.md
Outdated
The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.
1. Optimizes the program according to the avaiable devices.
I am a little confused about `the avaiable devices`.
Which part should own the `Optimizer` module: the cluster or the client program?
Especially in elastic deep learning, if the user requests nodes in a range of 5-10, how should we generate the `ExecutionPlan`?
The `Optimizer` module (not a good name; maybe assembler/transformer/? would be better) should be in a binary running in the cluster for distributed training. For local training, the module should be compiled locally.
doc/design/program.md
Outdated
For example, add data parallelism by spliting the input mini-batches and replicating the OPs onto different GPUs. Note that even if the OPs are replicated on different GPUs, there is still only **one** execution plan. One executor runs and only runs one `ExecutionPlan`.
1. Place each OP onto available devices, the placement information is written in the `ExecutionPlan`.
1. In distributed training, split the `ExecutionPlan` into multiple `ExecutionPlans` and add send/recv OP between them. For local training, this step is not necessary since there is only one executor.
1. Send the `ExecutionPlan` to the executor for execution.
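The distributed split step in the list above (one plan per executor, with send/recv OPs at the boundaries) could look roughly like the Python sketch below. The tuple representation of an OP, the `node_of` mapping, and the `send_*`/`recv_*` naming are all assumptions made for illustration; the design text does not specify them.

```python
from collections import defaultdict


def split_plan(ops, node_of):
    """Split one ordered OP list into per-node lists with send/recv OPs.

    ops:     list of (name, inputs, outputs) tuples in execution order.
    node_of: {op name: node id}, i.e. which executor runs each OP.
    Returns {node id: list of (name, inputs, outputs)}.
    """
    producer_node = {}          # variable -> node that last produced it
    plans = defaultdict(list)
    received = set()            # (variable, node) pairs already transferred
    for name, inputs, outputs in ops:
        node = node_of[name]
        for var in inputs:
            src = producer_node.get(var)
            crosses = src is not None and src != node
            if crosses and (var, node) not in received:
                # The producer has already been appended to the source
                # plan, so the send OP lands after it.
                plans[src].append((f"send_{var}_to_{node}", [var], []))
                plans[node].append((f"recv_{var}_from_{src}", [], [var]))
                received.add((var, node))
        plans[node].append((name, inputs, outputs))
        for var in outputs:
            producer_node[var] = node
    return dict(plans)


ops = [
    ("forward_backward", ["x", "w"], ["w_grad"]),
    ("sgd", ["w", "w_grad"], ["w"]),
]
node_of = {"forward_backward": "trainer0", "sgd": "pserver0"}
plans = split_plan(ops, node_of)
trainer_ops = [name for name, _, _ in plans["trainer0"]]
pserver_ops = [name for name, _, _ in plans["pserver0"]]
```

For local training every OP maps to the same node, so no send/recv OP is inserted and the function returns a single plan, which matches the note that the split is unnecessary with one executor.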
Still the same question as above. On a local machine with multiple GPUs, which module should send the `ExecutionPlan`?
Please see https://github.com/PaddlePaddle/Paddle/pull/6078/files#r154270688 , does it answer your question?
message OpPlacement {
  optional string name = 1;
  optional string device = 2;
I am also wondering whether device info for operators alone is enough.
In TensorFlow, tf.Variable is actually an operator, and tf.Tensor has an operator data member. TensorFlow is a graph of operators, so device info on operators is enough.
But we have both variables and operators. Do we need device info for variables? Do we need another `VarPlacement`?
- In the most common cases (add/sub/relu...), the output variable's device is the same as the operator's device.
- For control-related operators and LoD-related operators, the operator device is always CPU, and the output variable is on CPU too.
- If we get training data/parameters using a load operator/initialize operator, the variable's device is the same as that of the load operator/initialize operator.
- If we get training data/parameters using a Python reader, the variable's device needs to be set manually.
So there is only one case in which we must set the device for a variable. In the other cases, the variable's device can be decided by the operator's device info.
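The four rules above can be written down as a small inference pass. The following Python sketch is an illustration only: the `kind` labels, the dict layout of an OP, and the `reader_vars` argument are hypothetical names chosen for the example.

```python
# Hypothetical OP kinds that always run on the CPU (control and LoD OPs).
CPU_ONLY_OP_KINDS = {"control", "lod"}


def infer_var_devices(ops, reader_vars=None):
    """Derive each variable's device from the OP that produces it.

    ops:         list of dicts with keys "kind", "device" and "outputs".
    reader_vars: {variable: device} for data fed by a Python reader,
                 the one case where the device must be set manually.
    """
    var_device = dict(reader_vars or {})
    for op in ops:
        # Control/LoD OPs are forced to the CPU; every other OP,
        # including load/initialize OPs, passes its own device on.
        device = "cpu" if op["kind"] in CPU_ONLY_OP_KINDS else op["device"]
        for var in op["outputs"]:
            var_device[var] = device
    return var_device


ops = [
    {"kind": "load", "device": "gpu0", "outputs": ["w"]},
    {"kind": "compute", "device": "gpu0", "outputs": ["h"]},
    {"kind": "control", "device": "gpu0", "outputs": ["cond"]},
]
var_device = infer_var_devices(ops, reader_vars={"image": "cpu"})
```

Under these rules only the reader-fed variable needs an explicit entry, which is the argument for not adding a general `VarPlacement` message.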
@QiJune thanks, great question. I guess we need `VarPlacement` only if we use an explicit OP for copying data from CPU to GPU?
> If we get training data/parameters using a Python reader, the variable's device needs to be set manually.
Isn't the data initially on CPU and copied to GPU implicitly when needed? Since we don't do explicit copies, maybe we don't need `VarPlacement`?
doc/design/program.md
Outdated
The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.

For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.
Maybe we should also update the description of OP placement: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/parameter_server.md#graph-converter. In the newer design, OP placement includes device and trainer/pserver information.
Thanks! Will do.
Do we have two independent concepts, trainer and pserver, or only one concept, worker, whose role is decided by the subgraph it receives?
We have only one concept, worker, and the role is decided by the subgraph it receives.
@helinwang @typhoonzero Is that Another concern is if we do not want a general plan for every kind executor. The execution plan might be the internal data structure of the executor. To use |
@reyoung Thanks for reviewing!
The executor here is the cpp implementation, not the Python executor (the Python executor is more like TensorFlow session, it's a gateway to the cpp executor that runs the The Python executor we probably can have different kind executors, local executor and remote executor. I think we should just have one cpp executor implementation, multiple nodes should run the same executor implementation as single node. Having multiple executor probably makes code very hard to maintain and optimize (e.g., need to update all executors when a fix/optimization is needed), and I don't see much benefit. The reason for using protobuf is just for the convenience of serialization when sending the |
22fc63c to ab3e54c
If we keep one |
paddle/framework/framework.proto
Outdated
}

message ExecutionPlan {
  optional ProgramDesc program = 1;
Pserver and trainer may use different `ProgramDesc`s, so it seems one field is not enough?
In the new design the pserver and trainer will be exactly the same binary (executor); the only thing that differs is the `ExecutionPlan` they run. The `Planner` will know the roles of the different executors (e.g., pserver role, trainer role) to help generate the `ExecutionPlans`.
message OpPlacement {
  optional string name = 1;
  optional string device = 2;
Why not put the `device` field in `ProgramDesc` directly?
Maybe we also need to allow users to specify the device information in two ways:
- A device ID such as `CPU:0/GPU:0`.
- The maximal device count such as `CPU:{5}`.
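A parser for the two notations suggested above might look like the following Python sketch. It assumes `CPU:0/GPU:0` lists two separate examples (`CPU:0` and `GPU:0`) rather than one combined spec, and the returned dict layout is hypothetical.

```python
import re

# KIND:N selects one device by ID; KIND:{N} gives a maximal device count.
_DEVICE_SPEC = re.compile(r"([A-Za-z]+):(?:(\d+)|\{(\d+)\})")


def parse_device_spec(spec):
    """Parse "GPU:0" into a device ID, or "CPU:{5}" into a maximal count."""
    match = _DEVICE_SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized device spec: {spec!r}")
    kind, device_id, max_count = match.groups()
    if device_id is not None:
        return {"kind": kind, "id": int(device_id)}
    return {"kind": kind, "max_count": int(max_count)}


by_id = parse_device_spec("GPU:0")
by_count = parse_device_spec("CPU:{5}")
```

Keeping the two forms distinct in the parsed result lets a later placement step treat an exact ID as a hard constraint and a count as an upper bound.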
The `ProgramDesc` is used to specify the information from the user. Since we currently don't have an API to do that, we probably should not put that information into `ProgramDesc`.
In the future, when we have that API, we can add it to `ProgramDesc`.
It's fine to add it to `ProgramDesc` since we are not using this field for now; then we won't need further changes to the protobuf files.
@typhoonzero I think `ProgramDesc` and `ExecutionPlan` are used for different purposes. `ProgramDesc` is the output from Python, specifying what the user needs. `ExecutionPlan` is the input and output for IR optimizers, and the input for the executor. So they had better be two separate entities.
Since there are two entities, `ProgramDesc` and `ExecutionPlan`, and device placement is about optimization, not about what the user specified, it probably should be in `ExecutionPlan` rather than `ProgramDesc`.
In the future, when we want to enable the user to configure which device an OP runs on, we can put the field indicating the device in `ProgramDesc`.
Maybe I need to change `ExecutionPlan` to the following (no longer depending on `ProgramDesc`):
message ExecutionPlan {
  repeated BlockDesc blocks = 1;
  repeated OpPlacement op_placement = 2;
}
What do you think?
The sequence of optimizers to generate the final ExecutionPlan is a good idea, we can also add |
There is one more concern. |
@dzhwinter Yes, I think we need a unify solution, otherwise there are too much code path to develop / maintain. The |
(CPU/single GPU/multiple GPU/multiple nodes), with the following
requirements:

1. It should be programming language agnostic. Currently, we have a
Should there be a way of exporting ProgramDesc so that users can share it, like `export(cost, SAVE_TO_PATH)`? How are we going to differentiate saving the algorithm (ProgramDesc) from saving the model?
I think a model should be saved as two separate parts: `ProgramDesc` and the weights, so that the weights can be re-used for different `ProgramDescs`.
Maybe saving the model is not strictly related to this PR; we can discuss more in a separate issue if we wish :)
agree, thanks:)
The `ExecutionPlan` contains all the details of running the program,
including which device each OP is placed on. One `Executor` could have
multiple devices (e.g, CPU, GPUs), but it runs only one
`ExecutionPlan`. In distributed training there will be `n`
The available devices for distributed training are dynamic. Should this plan be regenerated every time the available devices change (device added/removed/updated)? How are we going to deploy it efficiently?
Yes, this should be regenerated every time the available devices change. Currently in distributed training we can have a constant number of trainers/pservers; I think it's a good starting point.
After several discussions, we reached the conclusion that we no longer need the execution plan; the internal representation and the input for the executor will be ProgramDesc.