new design of Parallel Do #8631
The background is that we ran into some problems when trying to implement parallel execution with parallel_op, and when we started to think about the go/select operators, we found that the backward/optimize/memory analysis of these ops is complex to implement, because they need special logic in the current framework to analyze and transpile the program_desc (#8592).
I like the general idea of separating model logic from the execution plan. There are some ideas to completely remove the requirement for the user to specify device placement (i.e., partitioning and parallelism). It's very difficult for users to reason about the performance of a multi-machine, multi-device, multi-thread environment.
Agree. As you mentioned, parallel do belongs to execution/schedule. If we want full separation, parallel do should not be part of the model expression/algorithm; the user should not need to specify it, and the executor should be able to automatically parallelize whatever can be parallelized. In this sense, parallel do should not be added by the transpiler; instead, a parallelizable "model expression/algorithm" should be generated by the transpiler, and the executor should automatically parallelize it. We may not need the parallel do operator at all. I think similar logic applies to the memory optimizer: if the executor were given all the information, we would not need to transpile a memory-optimized program; instead, the executor should be able to automatically release unnecessary memory.
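To make the last point concrete, here is a minimal, hypothetical sketch of an executor that releases memory on its own instead of relying on a memory-optimization transpiler. The `block`, `op`, and `scope` interfaces are simplified stand-ins for illustration, not the real PaddlePaddle API.

```python
# Hypothetical sketch: the executor frees each variable as soon as its last
# consumer has run, so no transpiled memory-optimized program is needed.

def run_block(block, scope):
    # Pre-compute the index of the last op that touches each variable.
    last_use = {}
    for idx, op in enumerate(block.ops):
        for name in op.input_names + op.output_names:
            last_use[name] = idx

    for idx, op in enumerate(block.ops):
        op.run(scope)
        # Release every non-persistable variable whose last consumer has
        # just finished (parameters and fetch targets are kept alive).
        for name in op.input_names + op.output_names:
            if last_use[name] == idx and not block.var(name).persistable:
                scope.erase(name)
```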
Hello, this issue has had no updates for nearly a month, so we will close it today. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closing. Thank you for your support of PaddlePaddle!
Actually, we want to implement a new language which is differentiable. Please refer to https://medium.com/@maxbendick/designing-a-differentiable-language-for-deep-learning-1812ee480ff1.
The differentiable programming language should focus on expressing data and functions. We should separate expression from execution.
At the operator level, it has become a trend that kernel code is generated automatically. Please refer to TVM from the MXNet community and TensorComprehensions from Facebook Research. Users only need to write what an operator does, and how the operator runs on particular hardware is generated automatically.
Both of these works learn a lot from the Halide language, which aims to separate algorithm description from schedule.
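As a small illustration of that operator-level separation, here is a TVM example (using TVM's `te` API; details may vary across versions): the user only describes what the operator computes, and the schedule and hardware kernel are derived separately.

```python
import tvm
from tvm import te

# Algorithm description: what the operator computes (element-wise add).
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule: how the computation is executed on a given target.
s = te.create_schedule(C.op)

# The kernel for the chosen hardware is generated automatically.
f = tvm.build(s, [A, B, C], target="llvm")
```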
At the Graph/Program level, I think we should also separate algorithm description from schedule.
Currently, in our design, the parallel do operator and the go/select operators are actually solving the execution of a program. These operators execute a block. It's hard to say what the backward of a parallel_do/go/select operator is.
I propose that we separate our model expression/algorithm description from its execution/schedule, and keep our ProgramDesc differentiable. Otherwise, we have to write a lot of if/else code in our transpilers to distinguish which operators are for scheduling and which are for describing the algorithm.
This problem has already come up in the memory-optimization transpiler: it analyzes neural network operators and their related variables, not the parallel do operator, so we have to skip that operator explicitly, as sketched below.
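For illustration only, the special-casing looks roughly like this (hypothetical code; the op-type set and the `analyze_liveness` helper are illustrative names, not the actual transpiler implementation):

```python
# Scheduling ops that hold sub-blocks must be skipped before the liveness
# analysis that only makes sense for algorithm-describing operators.
SCHEDULING_OPS = {"parallel_do", "go", "select"}

def analyze_block(block):
    for op in block.ops:
        if op.type in SCHEDULING_OPS:
            # Not a neural-network operator; its variables live in a
            # sub-block, so the analysis below does not apply to it.
            continue
        analyze_liveness(op)  # illustrative helper, not a real API
```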
There are several ways to separate the description and schedule:
In the third way, the parallel do and nccl init operators are inserted into block 0, while block 1 contains the forward/backward/nccl allreduce/sgd operators. The overall logic becomes very clean: the parallel do operator in block 0 launches four threads, each bound to one of four GPU cards, and each thread runs the same block 1. The backward/optimize transpiler and the memory-optimize transpiler then only need to focus on block 1. This avoids a lot of if/else logic for the parallel do operator; a rough sketch of the layout follows.
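The sketch below shows that layout as plain Python data, not the real ProgramDesc API; the op names and attributes are illustrative, and four GPU cards are assumed.

```python
# Block 0 holds only scheduling; block 1 holds the differentiable algorithm.
program = {
    "block_0": [
        {"type": "nccl_init",   "attrs": {"gpus": [0, 1, 2, 3]}},
        # Launches one thread per GPU card; each thread runs block 1.
        {"type": "parallel_do", "attrs": {"sub_block": 1}},
    ],
    "block_1": [
        # forward
        {"type": "mul"}, {"type": "elementwise_add"}, {"type": "mean"},
        # backward
        {"type": "mean_grad"}, {"type": "elementwise_add_grad"}, {"type": "mul_grad"},
        # gradient synchronization and parameter update
        {"type": "nccl_all_reduce"},
        {"type": "sgd"},
    ],
}
```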
Users prefer to focus on the algorithm description and would like to write code that only takes a single GPU into consideration. We can write a parallel do transpiler to transpile a user's original ProgramDesc into a parallel do ProgramDesc.
The logic of this parallel do transpiler is very simple: all OpDescs and VarDescs in the original ProgramDesc are moved into block 1, and an additional parallel do OpDesc is inserted into block 0.
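A minimal sketch of that transpiler, assuming a hypothetical ProgramDesc interface (`append_block`, `clone_var`, `clone_op`, and `append_op` are illustrative names, not the actual framework API):

```python
# Hypothetical sketch of the proposed parallel do transpiler.
def parallel_do_transpile(src_prog, places):
    dst_prog = ProgramDesc()                       # empty target program
    block0 = dst_prog.block(0)
    block1 = dst_prog.append_block(parent=block0)

    # Move every variable and op of the user's single-device program
    # into block 1 unchanged.
    for var in src_prog.block(0).all_vars():
        block1.clone_var(var)
    for op in src_prog.block(0).all_ops():
        block1.clone_op(op)

    # Block 0 only schedules: a single parallel do op that runs block 1
    # once per place (e.g. one thread per GPU card).
    block0.append_op(type="parallel_do",
                     attrs={"sub_block": block1.idx, "places": places})
    return dst_prog
```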