Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ExecutionPlan design. #6078

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 18 additions & 2 deletions doc/design/program.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

## Compile and Execution

A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.
Copy link
Member

@QiJune QiJune Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • ExecutionPlan is not dependent on optimizer. In an inference ProgramDesc, we can also have a ExecutionPlan.
  • In which time we can decide the device where an operator runs? In current code, an operator has CPU kernel and GPU kernel. At running time, the kernel is decided by the place of DeviceContext. Actually, it's decided in running time.
    Since we have ExecutionPlan which is a proto message storing device information, the device must be decided at compile time.

In a word, we still have two parts, compile-time and run-time. At compile-time, we will generate two proto message, the first is ProgramDesc and the second is ExecutionPlan.
The ExecutionPlan is set by users' configuration and Paddle's own auto device placement policy. If user switch to another hardware environment, and he/she do not want to provide a ExecutionPlan, Paddle can generate a ExecutionPlan under Paddle's own auto device placement policy.

An interface could be:

void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);

The user_config could be null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecutionPlan is not dependent on optimizer. In an inference ProgramDesc, we can also have a ExecutionPlan.

Agree, will find a better name for optimizer.

the kernel is decided by the place of DeviceContext. Actually, it's decided in running time.

Understand, but I think deciding at runtime make us no way to control where to place the OP. Being able to control it is very important.

In a word, we still have two parts, compile-time and run-time. At compile-time, we will generate two proto message, the first is ProgramDesc and the second is ExecutionPlan.

Agree.

An interface could be: void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);

The user_config should be part of ProgramDesc, since ProgramDesc describes what the user wants.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better switch Optimizer to another term. Our python already have the Optimizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as @dzhwinter , just my personal view, this Optimizer does not do the optimize, like the four steps which run a C program, COMPILER -> ASSEMBLER -> LINKER -> LOADER. How about convert optimizer -> assembler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating ExecutionPlan is exactly like gcc's -O option.

We probably do not need a single C++ class to optimize the graph since we can just create a member function OptimizeProgram in the Executor class. ExecutionPlan object should also be the member of Executor, so that we call executor.run is executing optimized graph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, ExecutionPlan is no need to be a protobuf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to #6141 , I think that ProgramDesc is not enough to run a network. We also need to provide Device Type and Data Type for each operator. Exposing these interface to users is necessary, even though paddle framework could provide a solution.

Copy link
Contributor Author

@helinwang helinwang Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dzhwinter @Yancey1989

We'd better switch Optimizer to another term. Our python already have the Optimizer.

Agree we need a better naming, thank @Yancey1989 ! Assembler is a good name candidate!

@typhoonzero I think whoever generates the ExecutionPlan from ProgramDesc should have the global view: the global program desc, and the number of devices. And different ExecutionPlans are sent to different nodes. On the other hand, Executor runs locally, it does not know the devices on other nodes.

@QiJune thanks, agree that we will need enable user's manual placement configuration, and that configuration should be in ProgramDesc. At the same time, ExecutionPlan should have placement information too. ProgramDesc and ExecutionPlan are two different things with different focus, it's fine for them to have similar fields, it's not duplication.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yancey1989 @helinwang Sorry that I did not quite get the point of the name Assembler, if this name is to be used, what is Compiler/Linker/Loader in PaddlePaddle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zealoct thank you! I have changed the name to Planner, do you think it conveys the means correctly?


A simple example PaddlePaddle program can be found in [graph.md](./graph.md):

Expand All @@ -15,7 +15,7 @@ optimize(cost)
train(cost, reader=mnist.train())
```

The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line runs it.
The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line optimizes and runs it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe optimizes => transform ?


## Programs and Blocks

Expand Down Expand Up @@ -120,6 +120,22 @@ message AttrDesc {
}
```

## ProgramDesc and ExecutionPlan

The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.

For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be more clear if we add
ProgramDesc describe the device independent computing process, but the ExecutionPlan describe the device related computing process

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indicate which device the OP will run

missing an on before which (=

Copy link
Contributor

@Yancey1989 Yancey1989 Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also updte the describe of Op placement: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/parameter_server.md#graph-converter, in the newer design, op placement includ device and trainer/pserver information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will do.

Copy link
Member

@jacquesqiao jacquesqiao Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have two independent concepts: trainer & pserver, or we only have one concept worker and the role is decided by the subgraph it receives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only have one concept worker and the role is decided by the subgraph it receives.


### Optimizer

The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo ExcutionPlan

1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo prgram and avaiable

1. Optimizes the program according to the avaiable devices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused at the avaiable devices.
Which part should own the Optimizer module? The cluster or the client program?

Especially in the Elastic DeepLearning, if the user request for nodes in a range 5-10, how should we generate the ExecutionPlan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimizer module (not a good name, maybe assembler/transformer/? would be better) should be in a binary running in the cluster for distributed training. For local training, the module should be compiled locally.

For example, add data parallelism by spliting the input mini-batches and replicating the OPs onto different GPUs. Note that even if the OPs are replicated on different GPUs, there is still only **one** execution plan. One executor runs and only runs one `ExecutionPlan`.
1. Place each OP onto available devices, the placement information is written in the `ExecutionPlan`.
1. In distributed training, split the `ExecutionPlan` into multiple `ExecutionPlans` and add send/recv OP between them. For local training, this step is not necessary since there is only one executor.
1. Send the `ExecutionPlan` to the executor for execution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still the same question above. In a local machine with Multi-GPUs, which module should send the ExecutionPlan ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see https://github.com/PaddlePaddle/Paddle/pull/6078/files#r154270688 , does it answer your question?


## InferShape

With this design, the InferShape function should take the following parameters:
Expand Down
10 changes: 10 additions & 0 deletions paddle/framework/framework.proto
Original file line number Diff line number Diff line change
Expand Up @@ -143,3 +143,13 @@ message BlockDesc {
// https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/program.md
// for more details.
message ProgramDesc { repeated BlockDesc blocks = 1; }

message OpPlacement {
optional string name = 1;
optional string device = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also wondering if device info for Operator is enough.
In Tensorflow, tf.Variable is actually an operator, and tf.Tensor has a operator data member. Tensorflow is a graph of operator, so device info in operator is enough.
But we have both variable and operator. Do we need device info for Variable? Do we need another VarPlacement?

Copy link
Member

@QiJune QiJune Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • In most common case(add/sub/relu...), the output variable device is the same with operator device.
  • For control related operators and LoD related operators, the operator device is always CPU. And the output variable is in CPU too.
  • If we get training data/parameter using load operator/initialize operator , the variable device is the same with load operator/initialize operator.
  • If we get training data/parameter using python reader, the variable device need to be set manually.

So, this a only one case which we should set device for variable. For other cases, the variable device can be decided by operator's device info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@QiJune thanks, great question, I guess we need VarPlacement only if we will use explicit OP for copying data from CPU to GPU?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get training data/parameter using python reader, the variable device need to be set manually.

Isn't the data initially CPU, and copied to GPU implicitly when needed, since we don't do explicit copies, maybe we don't need VarPlacement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put device field in ProgramDesc directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also need to allow users the specify the device information by two approaches:

  • device ID such as CPU:0/GPU:0.
  • The maximal device count such as CPU:{5}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ProgramDesc is used to specify the information from the user, since currently we don't have API to do that, we probably should not put that information into ProgramDesc.

In the future when we have that API we can add it to ProgramDesc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to add it to ProgramDesc since we are not using this field for now, so then we don't need further changes to the protobuf files.

Copy link
Contributor Author

@helinwang helinwang Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@typhoonzero I think ProgramDesc and ExecutionPlan are used for different purposes, ProgramDesc is the output from Python, specifying what the user need. ExecutionPlan is the input and output for IR optimizers, and input for executor. So they better be two separate entities.

Since there are two entities: ProgramDesc and ExecutionPlan, and the device placement is about optimization, not about what the user specified, it probably should be in ExecutionPlan but not ProgramDesc.

In the future when we want enable the user to configure which device an OP runs, we can put the field indicating device in ProgramDesc.

Maybe I need to change ExecutionPlan to (not depend on ProgramDesc anymore):

message ExecutionPlan {
  repeated BlockDesc blocks = 1; 
  repeated OpPlacement op_placement = 2;
}

What do you think?

}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a detail example in comment.

message OpPlacement {
   // pserver:gpu0
   optional string name = 1;
   optional string device = 2;
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "pserver" in "pserver:gpu0" is not necessary, the executor does not need to know what role (e.g., pserver) it takes. Maybe only "gpu0" is sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit confused how would name and device values be at runtime, can you give an example?

Copy link
Contributor Author

@helinwang helinwang Dec 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name should be the name of the OP (every OP should have a name), will add this into the PR.
device should be something like: "gpu0", "cpu".

message ExecutionPlan {
optional ProgramDesc program = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pserver and trainer may use different ProgramDesc, seems one field is not enough?

Copy link
Contributor Author

@helinwang helinwang Dec 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new design the pserver and trainer will be exactly same binary (executor), only thing that is different is the ExecutionPlan they run. The Planner will know the roles of different executors (e.g., pserver role, trainer role) to help generating the ExecutionPlans.

repeated OpPlacement op_placement = 2;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, how to find the correspondence between OpPlacement in ExecutionPlan and OpDesc in ProgramDesc?
Are the number and order of operators in ExecutionPlace and ProgramDesc the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number will be the same, each OP will have one placement. The order does not have to be the same, otherwise the "name" field in OpPlacement is not necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Program{Block{Op}}. A Program has many blocks. A block has many ops.

However, the Program has many operator placements. We cannot get a one-to-one map by this data structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyoung

However, the Program has many operator placements. We cannot get a one-to-one map by this data structure.

Sorry I don't fully get this point, I thought different OPs have different names?