diff --git a/doc/design/fluid.md b/doc/design/fluid.md
index 585dc8ef39c0c..2acc168007d25 100644
--- a/doc/design/fluid.md
+++ b/doc/design/fluid.md
@@ -105,18 +105,10 @@ There are two ways to execute a Fluid program. When a program is executed, it c
 There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.

-Fluid is moving towards the direction of a compiler, which is explain in more detail later in this article.
+Fluid is moving towards the direction of a compiler, which is explained in [fluid_compiler.md](fluid_compiler.md).

 ## Backward Compatibility of Fluid

 Given all the advantages from the removal of the concept of a *model*, hardware manufacturers might still prefer the existence of the concept of a model, so it would be easier for them to support multiple frameworks all at once and could run a trained model during inference. For example, Nervana, a startup company acquired by Intel, has been working on an XPU that reads the models in the format known as [n-graph](https://github.com/NervanaSystems/ngraph). Similarly, [Movidius](https://www.movidius.com/) is producing a mobile deep learning chip that reads and runs graphs of operators. The well-known [ONNX](https://github.com/onnx/onnx) is also a file format of graphs of operators.

 For Fluid, we can write a converter that extracts the parts in the `ProgramDesc` protobuf message, converts them into a graph of operators, and exports the graph into the ONNX or n-graph format.
-
-## Towards a Deep Learning Language and the Compiler
-
-We can change the `if-then-else` and loop structure a little bit in the above Fluid example programs, to make it into a new programming language, different than Python.
-
-Even if we do not invent a new language, as long as we get the `ProgramDesc` message filled in, we can write a transpiler, which translates each invocation to an operator, into a C++ call to a kernel function of that operator. For example, a transpiler that weaves the CUDA kernels outputs an NVIDIA-friendly C++ program, which can be built using `nvcc`. Another transpiler could generate MKL-friendly code that should be built using `icc` from Intel. More interestingly, we can translate a Fluid program into its distributed version of two `ProgramDesc` messages, one for running on the trainer process, and the other one for the parameter server. For more details of the last example, the [concurrent programming design](concurrent_programming.md) document would be a good pointer. The following figure explains the proposed two-stage process:
-
-![](fluid-compiler.png)
diff --git a/doc/design/fluid_compiler.md b/doc/design/fluid_compiler.md
new file mode 100644
index 0000000000000..2a6beafc52e81
--- /dev/null
+++ b/doc/design/fluid_compiler.md
@@ -0,0 +1,110 @@
+# PaddlePaddle Fluid: Towards a Compiled Programming Language
+
+As described in [fluid.md](fluid.md), when a Fluid application program
+runs, it generates a `ProgramDesc` protobuf message as an intermediate
+representation of itself. The C++ class `Executor` can run this
+protobuf message as an interpreter. This article describes the Fluid
+compiler.
+
+![](fluid-compiler.png)
+
+## ProgramDesc
+
+Before we go deeper into the idea of a compiled language, let us take a
+look at a simple example Fluid application.
+
+```python
+import "fluid"
+
+func paddlepaddle() {
+  X = fluid.read(...)
+  W = fluid.Tensor(...)
+  Y = fluid.mult(X, W)
+}
+```
+
+This program consists of a [block](block.md) of three operators --
+`read`, `assign`, and `mult`. Its `ProgramDesc` message looks like
+the following:
+
+```protobuf
+message ProgramDesc {
+  block[0] = Block {
+    vars = [X, W, Y],
+    ops = [
+      read(output = X)
+      assign(input = ..., output = W)
+      mult(input = {X, W}, output = Y)
+    ],
+  }
+}
+```
+
+## Transpilers
+
+We can write a transpiler program that takes a `ProgramDesc`, e.g.,
+the above one, and outputs another `ProgramDesc`. Let us look at some
+examples:
+
+1. *Memory optimization transpiler*: We can write a transpiler that
+   inserts some `FreeMemoryOp`s into the above example `ProgramDesc`
+   so as to free memory early, before the end of an iteration, and
+   thus keep the memory footprint small.
+
+1. *Distributed training transpiler*: We can write a transpiler that
+   converts a `ProgramDesc` into its distributed version of two
+   `ProgramDesc`s -- one to be run by the trainer processes and the
+   other by the parameter server.
+
+In the rest of this article, we talk about a special kind of
+transpiler, the *native code generator*, which takes a `ProgramDesc`
+and generates a `.cu` (or `.cc`) file that can be built by a C++
+compiler (gcc, nvcc, icc) into a binary.
+
+## Native Code Generator
+
+For the above example, the native code generator transpiler, say, the
+CUDA code generator, should generate a `main` function:
+
+```c++
+int main() {
+  auto X = fluid_cuda_read(...);
+  auto W = fluid_cuda_create_tensor(...);
+  auto Y = fluid_cuda_mult(X, W);
+}
+```
+
+and the definitions of the functions `fluid_cuda_read`,
+`fluid_cuda_create_tensor`, and `fluid_cuda_mult`. Note that each of
+these functions could simply define a C++ instance of the
+corresponding operator and run it. For example:
+
+```c++
+paddle::Tensor fluid_cuda_read(...) {
+  paddle::Tensor t;
+  paddle::operators::Read r(&t, ...);
+  r.Run();
+  return t;
+}
+```
+
+For a computational operator that has multiple *kernels*, each for a
+specific hardware platform, such as the `mult` operator, the generated
+code should call its CUDA kernel:
+
+```c++
+paddle::Tensor fluid_cuda_mult(const paddle::Tensor& a,
+                               const paddle::Tensor& b) {
+  paddle::Tensor t;
+  paddle::operators::Mult m(a, b, ...);
+  m.Run(cuda_context);
+  return t;
+}
+```
+
+where `cuda_context` could be a global variable of type
+`paddle::CUDADeviceContext`.
+
+## Multi-Block Code Generation
+
+Most Fluid application programs may have more than one block. To
+execute them, we need to trace [scopes](scope.md).
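+
+To make the scope tracing concrete, below is a minimal, standalone
+sketch of the idea. It does not use Paddle's actual `Scope` class or
+any real generated operator functions; the `Scope` and `run_sub_block`
+names here are hypothetical and only illustrate how code generated for
+a nested block (say `block[1]`) might create a child scope per run and
+resolve variables through the parent chain:
+
+```c++
+#include <iostream>
+#include <map>
+#include <string>
+
+// Illustrative stand-in for the scope concept in scope.md: a child
+// scope falls back to its parent when a variable is not found locally.
+struct Scope {
+  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}
+  void Set(const std::string& name, float value) { vars_[name] = value; }
+  float Get(const std::string& name) const {
+    auto it = vars_.find(name);
+    if (it != vars_.end()) return it->second;
+    return parent_ ? parent_->Get(name) : 0.0f;
+  }
+
+ private:
+  std::map<std::string, float> vars_;
+  const Scope* parent_;
+};
+
+// Hypothetical code generated from a sub-block: it reads X from the
+// enclosing scope and writes its own temporary into the child scope.
+void run_sub_block(Scope* sub_scope) {
+  float x = sub_scope->Get("X");
+  sub_scope->Set("tmp", x * 2.0f);
+}
+
+int main() {
+  Scope global;           // scope of block[0]
+  global.Set("X", 3.0f);  // produced by an op in block[0]
+  Scope sub(&global);     // a fresh scope for each run of block[1]
+  run_sub_block(&sub);
+  std::cout << sub.Get("tmp") << "\n";  // prints 6
+  return 0;
+}
+```
+
+A real native code generator would use the `Scope` abstraction
+described in [scope.md](scope.md) and emit one function per block, but
+the parent-chain lookup above illustrates how generated code for a
+nested block could share variables with its enclosing blocks.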