Add Fluid Compiler design doc #7178

Merged (2 commits) on Jan 21, 2018
10 changes: 1 addition & 9 deletions doc/design/fluid.md
@@ -105,18 +105,10 @@ There are two ways to execute a Fluid program. When a program is executed, it c

There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.

- Fluid is moving towards the direction of a compiler, which is explain in more detail later in this article.
+ Fluid is moving towards the direction of a compiler, which is explain in [fluid_compiler.md](fluid_compiler.md).

**Contributor** commented: explain -> explained

## Backward Compatibility of Fluid

Despite all the advantages from removing the concept of a *model*, hardware manufacturers might still prefer the existence of such a concept, because it would make it easier for them to support multiple frameworks at once and to run a trained model during inference. For example, Nervana, a startup company acquired by Intel, has been working on an XPU that reads models in the format known as [n-graph](https://github.com/NervanaSystems/ngraph). Similarly, [Movidius](https://www.movidius.com/) is producing a mobile deep learning chip that reads and runs graphs of operators. The well-known [ONNX](https://github.com/onnx/onnx) is also a file format of graphs of operators.

For Fluid, we can write a converter that extracts the parts in the `ProgramDesc` protobuf message, converts them into a graph of operators, and exports the graph into the ONNX or n-graph format.

## Towards a Deep Learning Language and the Compiler

We can change the `if-then-else` and loop structures in the above Fluid example programs a little bit, to turn them into a new programming language, different from Python.

Even if we do not invent a new language, as long as we get the `ProgramDesc` message filled in, we can write a transpiler that translates each invocation of an operator into a C++ call to a kernel function of that operator. For example, a transpiler that weaves in the CUDA kernels outputs an NVIDIA-friendly C++ program, which can be built using `nvcc`. Another transpiler could generate MKL-friendly code that should be built using `icc` from Intel. More interestingly, we can translate a Fluid program into its distributed version of two `ProgramDesc` messages, one running on the trainer process and the other on the parameter server. For more details on the last example, the [concurrent programming design](concurrent_programming.md) document is a good pointer. The following figure explains the proposed two-stage process:

![](fluid-compiler.png)
110 changes: 110 additions & 0 deletions doc/design/fluid_compiler.md
@@ -0,0 +1,110 @@
# PaddlePaddle Fluid: Towards a Compiled Programming Language

As described in [fluid.md](fluid.md), when a Fluid application program
runs, it generates a `ProgramDesc` protobuf message as an intermediate
representation of itself. The C++ class `Executor` can run this
protobuf message as an interpreter. This article describes the Fluid
compiler.
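
To make the interpreter analogy concrete, here is a minimal sketch of
what "running a `ProgramDesc`" means; the structs below are simplified
stand-ins used only for illustration, not Paddle's real
`paddle::framework` classes:

```c++
#include <string>
#include <vector>

// Simplified stand-ins; the real classes carry much more information.
struct OpDesc { std::string type; /* inputs, outputs, attributes ... */ };
struct BlockDesc { std::vector<OpDesc> ops; };
struct ProgramDesc { std::vector<BlockDesc> blocks; };
struct Scope { /* maps variable names to variables */ };

void RunOperator(const OpDesc& op, Scope* scope) {
  // look up the kernel registered for op.type and run it on scope's variables
}

// The interpreter: execute block 0 of the program, one operator at a time.
void RunProgram(const ProgramDesc& program, Scope* scope) {
  for (const OpDesc& op : program.blocks[0].ops) {
    RunOperator(op, scope);
  }
}
```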

![](fluid-compiler.png)

**@helinwang** (Contributor) commented on Jan 4, 2018: Do we need `Executor` in this graph? It seems unrelated to this PR.


## ProgramDesc

Before we go deeper into the idea of compiled language, let us take a
look at a simple example Fluid application.

```python
import "fluid"

func paddlepaddle() {
  X = fluid.read(...)
  W = fluid.Tensor(...)
  Y = fluid.mult(X, W)
}
```

This program consists of a [block](block.md) of three operators --
`read`, `assign`, and `mult`. Its `ProgramDesc` message looks like
the following:

```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, W, Y],
    ops = [
      read(output = X)
      assign(input = ..., output = W)
      mult(input = {X, W}, output = Y)
    ],
  }
}
```

**@helinwang** (Contributor) commented on Jan 4, 2018:

From my understanding, the `ProgramDesc` is an intermediate representation (IR); currently we have Python as a frontend that generates the IR, and this PR discusses a C++ code backend.

I think having Python as a frontend is a huge pain. In my opinion, the benefit of Python in the machine learning field is:

1. Libraries such as numpy.
2. Python's native control-flow primitives, such as `for` and `if`, to control the training process; researchers are familiar with them.

In our case we benefit from neither of these two points:

1. The transpiled code cannot use numpy.
2. The transpiled code cannot use Python's native control-flow primitives.

And we are trapped in the Python grammar.

I think a better way is to invent our own language.

**Collaborator Author** replied: I agree and I believe that a new language is the future.

## Transpilers

We can write a transpiler program that takes a `ProgramDesc`, e.g.,
the above one, and outputs another `ProgramDesc`. Let us take some
examples:

1. *Memory optimization transpiler*: We can write a transpiler that
   inserts some `FreeMemoryOp`s in the above example `ProgramDesc`, so
   as to free memory early, before the end of an iteration, and keep
   the memory footprint small (see the sketch at the end of this
   section).

1. *Distributed training transpiler*: We can write a transpiler that
   converts a `ProgramDesc` into its distributed version of two
   `ProgramDesc`s -- one to be run by the trainer processes and the
   other by the parameter server.

In the rest of this article, we talk about a special kind of
transpiler, *Native code generator*, which takes a `ProgramDesc` and
generates a `.cu` (or `.cc`) file, which could be built by C++
compilers (gcc, nvcc, icc) into binaries.
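
To make the memory optimization transpiler in item 1 above concrete,
the following is a minimal sketch under simplified assumptions:
`OpDesc` and `BlockDesc` here are stand-in structs rather than Paddle's
real classes, and `free_memory` is a hypothetical operator name. A
transpiler is then just a pure function from one block to another.

```c++
#include <string>
#include <vector>

struct OpDesc {
  std::string type;
  std::vector<std::string> inputs;   // names of the variables this op reads
  std::vector<std::string> outputs;  // names of the variables this op writes
};

struct BlockDesc { std::vector<OpDesc> ops; };

// Return a new block in which each input variable is freed right after
// the last operator that reads it, keeping the peak memory footprint small.
BlockDesc MemoryOptimize(const BlockDesc& block) {
  BlockDesc out;
  for (size_t i = 0; i < block.ops.size(); ++i) {
    out.ops.push_back(block.ops[i]);
    for (const std::string& var : block.ops[i].inputs) {
      bool read_later = false;
      for (size_t j = i + 1; j < block.ops.size() && !read_later; ++j) {
        for (const std::string& later : block.ops[j].inputs) {
          if (later == var) read_later = true;
        }
      }
      if (!read_later) {
        out.ops.push_back(OpDesc{"free_memory", {var}, {}});
      }
    }
  }
  return out;
}
```

The distributed training transpiler in item 2 would have the same
shape: a function from one `ProgramDesc` to two.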

## Native Code Generator

For the above example, the native code generator transpiler, say, the
CUDA code generator, should generate a `main` function:

**@jacquesqiao** (Member) commented on Jan 4, 2018:

In most cases, a user may need a library such as a .so or .a to use in some other program, such as a face recognition app. So do we also need to be able to generate these libraries at the same time?

**@wangkuiyi** (Collaborator Author) replied on Jan 4, 2018:

Good point!

I prefer that our transpiler generates source code only, not reusing some .a/.so files, so as to simplify the building process.

To be precise, if the transpiler generates source code only, the general workflow would be:

                   PaddlePaddle
                   operator/kernel
                   source code
                      |
                      v
    ProgramDesc -> transpiler -> .cc file -> nvcc -> binary file

Otherwise, if we try to reuse the .a/.so files:

                   PaddlePaddle
                   operator/kernel -> nvcc(1) -> .a/.so
                   source code                   |
                      |                          |
                      v                          v
    ProgramDesc -> transpiler -> .cc file -> nvcc(2) -> binary file

It is error-prone because there is a chance we are using different compilers for nvcc(1) and nvcc(2), or we build in different environments with mutually exclusive configurations.

It is true that the generated code might depend on third-party libraries, so our transpiler might also need to generate build commands, including dependencies.

```c++
void main() {
  auto X = fluid_cuda_read(...);
  auto W = fluid_cuda_create_tensor(...);
  auto Y = fluid_cuda_mult(X, W);
}
```

and the definitions of functions `fluid_cuda_read`,
`fluid_cuda_create_tensor`, and `fluid_cuda_mult`. Please be aware
that each function could just define a C++ instance of an operator and
run it. For example:

```c++
paddle::Tensor fluid_cuda_read(...) {
  paddle::Tensor t;
  paddle::operator::Read r(&t, ...);
  r.Run();
  return t;
}
```

**Collaborator** commented: Should we copy the operator::Read files (read_op.cc/read_op.cu/read_op.h) to the generated project, or just read_op.a/.so?

**@wangkuiyi** (Collaborator Author) replied on Jan 4, 2018: I think we need to copy the source files instead of reusing .a/.so files, due to the reasons in #7178 (comment).

**@sidgoyal78** (Contributor) commented on Jan 4, 2018: This might be a very basic question, but from this example I see `r.Run()`. Is this bringing execution into the picture along with the compilation? (Maybe we instead have a `Run()` method which is called by the executor later? Or maybe I misunderstood.)

**Contributor** replied: Oh, never mind, I mixed up the two things. I think that after this code generation, the executor will indeed run these as usual, I guess.

**Contributor** commented: Should we also consider the possibility of having a default fallback device in case some operator cannot run on a device? For example, we might have some CPU-only operators; in that case our transpiler should make sure that it generates CPU code for that op even though the rest of the native code might be CUDA code.

**Contributor** replied: @abhinavarora Yes. We provide the kernel selection and kernel fallback mechanism:

    void DataTransform(const OpKernelType& expected_kernel_type,
                       const OpKernelType& kernel_type_for_var,
                       const Tensor& input_tensor, Tensor* out);
    void CopyVariableWithTensor(const Variable& in_var, const Tensor& tensor,
                                Variable& out_var);
    }  // namespace framework

If the target machine does not have the target device, it will try to use a naive implementation kernel (say, a CPU kernel) instead of terminating. In the Fluid overview design, there will be a runtime .a linked to the target program. In my view, the runtime library will solve the fallback problem, not the transpiler.

For computational operators that have multiple *kernels*, each for a
specific hardware platform, for example, the `mult` operator, the
generated code should call its CUDA kernel:

```c++
paddle::Tensor fluid_cuda_mult(const paddle::Tensor& a,
                               const paddle::Tensor& b) {
  paddle::Tensor t;
  paddle::operator::Mult m(a, b, ...);
  m.Run(cuda_context);
  return t;
}
```

where `cuda_context` could be a global variable of type
`paddle::CUDADeviceContext`.
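
Purely as an illustration of that pattern, here is a self-contained
sketch in which a stub stands in for the real `paddle::CUDADeviceContext`
(the actual class lives in Paddle's platform code and wraps a CUDA
stream and the cuBLAS/cuDNN handles of one GPU):

```c++
namespace paddle {
// Stub for illustration only; not the real CUDADeviceContext.
struct CUDADeviceContext {
  explicit CUDADeviceContext(int device_id) : device_id(device_id) {}
  int device_id;
};
}  // namespace paddle

// One shared context, passed by every generated fluid_cuda_* function
// to the Run() method of its operator.
static paddle::CUDADeviceContext cuda_context(/* GPU 0 */ 0);
```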

## Multi-Block Code Generation

Most Fluid application programs may have more than one block. To
execute them, we need to trace [scopes](scope.md).
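
As a rough sketch of what multi-block code generation might look like
(the `Scope` below is a simplified stand-in for the one described in
[scope.md](scope.md), and the `fluid_cuda_block_*` function names are
hypothetical), the code generated for a nested block would create a
child scope so that the sub-block's temporaries do not clash with its
parent's variables:

```c++
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Variable { /* holds a tensor, a scalar, ... */ };

// Simplified stand-in for the Scope described in scope.md.
struct Scope {
  Scope* parent = nullptr;
  std::map<std::string, Variable> vars;
  std::vector<std::unique_ptr<Scope>> children;

  // Create a child scope whose variable lookups fall back to this scope.
  Scope* NewScope() {
    children.emplace_back(new Scope);
    children.back()->parent = this;
    return children.back().get();
  }
};

// Hypothetical generated code for a program whose block[0] invokes
// block[1] (e.g., the body of a while or if operator).
void fluid_cuda_block_1(Scope* scope) {
  // ... generated calls for the operators of block[1], reading and
  // writing variables in `scope` ...
}

void fluid_cuda_block_0(Scope* scope) {
  // ... operators of block[0] ...
  Scope* child = scope->NewScope();  // one child scope per sub-block run
  fluid_cuda_block_1(child);
  // ... remaining operators of block[0] ...
}
```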