
new design of Parallel Do #8631

Closed
QiJune opened this issue Feb 28, 2018 · 4 comments
QiJune (Member) commented Feb 28, 2018

Actually, we want to implement a new language which is differentiable. Please refer to https://medium.com/@maxbendick/designing-a-differentiable-language-for-deep-learning-1812ee480ff1.

Using a deep learning framework like TensorFlow requires users to create a graph of symbolic tensors connected by operations like layers. It feels weird because you’re not writing the program, you’re writing a program that constructs a program (you write the Python code that constructs the computation graph, which is then interpreted by TensorFlow).

In differentiable programming, computation graphs are the implicit substrate of the language. You could read a lot of differentiable code before even realizing it’s differentiable.

Because the graphs are implicit, differentiable languages are much more expressive for building complex models. First-class lists handle variable-length data. First-class conditionals make control flow easy. Likewise, deep learning models, like stacked convolutional layers, are first-class functions. Any function in a differentiable language is a model, because you can necessarily run backprop through any of them.

The differentiable programming language should focus on expressing data and functions. We should separate expression from execution.

At the operator level, it has become a trend that kernel code is generated automatically. Please refer to TVM from the MXNet community and TensorComprehensions from Facebook Research. Users only need to write what an operator does; how the operator runs on particular hardware is generated automatically.

Both of these works learn a lot from the Halide language, which aims at separating algorithm description from schedule.
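
To make the algorithm/schedule split concrete, here is a minimal vector-add sketch written against the TVM Python API of that era (the example itself is ours and only illustrative): the description part says what the operator computes, the schedule part decides how it runs on a GPU, and the kernel is generated from the two.

```python
import tvm

# Algorithm description: *what* the operator computes (a toy vector add).
n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.placeholder((n,), name="B")
C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule: *how* it runs on a particular device (split the loop and bind to CUDA threads).
s = tvm.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
s[C].bind(tx, tvm.thread_axis("threadIdx.x"))

# The CUDA kernel code is generated automatically from the two parts.
fadd = tvm.build(s, [A, B, C], target="cuda")
```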

At the graph/program level, I think we should also separate algorithm description from schedule.

Currently, in our design, the parallel_do operator and the go/select operators actually solve the execution of a program: these operators execute a block. It is hard to say what the backward of a parallel_do/go/select operator is.

I propose that we separate our model expression/algorithm description from its execution/schedule, and keep our ProgramDesc differentiable. Otherwise, we have to write a lot of if/else code in our transpilers to distinguish which operators are for scheduling and which are for describing the algorithm.

We have already met this problem in the memory optimization transpiler. The memory optimization transpiler analyzes neural network operators and related variables, not the parallel_do operator, so we have to skip it.

There are several ways to separate the description and schedule:

In the third way, the parallel_do and NCCL init operators are inserted into block 0, and block 1 contains the forward/backward/NCCL allreduce/SGD operators. The overall logic is very clean: the parallel_do operator in block 0 launches four threads bound to four GPU cards, and each thread runs the same block 1. The backward/optimize transpiler and the memory optimize transpiler only need to focus on block 1. This avoids a lot of if/else logic for the parallel_do operator.

Users prefer to focus on algorithm description and would like to write code that only takes a single GPU into consideration. We can write a parallel_do transpiler that transpiles the user's original ProgramDesc into a parallel_do ProgramDesc.

The logic of this parallel_do transpiler is very simple: every OpDesc and VarDesc in the original ProgramDesc is moved into block 1, and an additional parallel_do OpDesc is inserted into block 0.
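
A rough sketch of what such a transpiler could look like (all names here — append_block, add_var, append_op, etc. — are hypothetical, not the actual ProgramDesc API):

```python
def transpile_to_parallel_do(program_desc, num_gpus=4):
    """Hypothetical sketch: wrap a single-GPU ProgramDesc in a parallel_do block."""
    block0 = program_desc.block(0)
    # Create block 1 as a child of block 0; it will hold the user's original program.
    block1 = program_desc.append_block(parent=block0)

    # Move every user-written VarDesc and OpDesc into block 1.
    for var in list(block0.all_vars()):
        block1.add_var(var)
    for op in list(block0.all_ops()):
        block1.add_op(op)
    block0.clear_ops()

    # Block 0 now only schedules: init NCCL once, then launch one thread per
    # GPU, each executing the same block 1 (forward/backward/allreduce/sgd).
    block0.append_op(type="ncclInit", attrs={"gpus": list(range(num_gpus))})
    block0.append_op(type="parallel_do",
                     attrs={"sub_block": block1, "places": list(range(num_gpus))})
    return program_desc
```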

jacquesqiao (Member) commented
The background is that we met some problems when trying to implement parallel execution with parallel_op, and when we started to think about the go/select operators, we found that the backward/optimize/memory analysis of these ops is complex to implement, because they need special logic in the current framework to analyze and transpile the program_desc (#8592).

panyx0718 (Contributor) commented

I like the general idea of separating model logic from the execution plan. There are some ideas for completely removing the requirement for users to specify device placement (i.e. partitioning, parallelism). It is very difficult for users to reason about the performance of a multi-machine, multi-device, multi-thread environment.
https://arxiv.org/abs/1706.04972

helinwang (Contributor) commented Feb 28, 2018

“I propose that we separate our model expression/algorithm description from its execution/schedule”

Agree.

As you mentioned, parallel do belongs to execution/schedule. If we want full separation, parallel do should not be part of the model expression/algorithm; the user would not need to specify it, and the executor should be able to automatically parallelize whatever can be parallelized. In this sense, parallel do should not be added by the transpiler; instead, a parallelizable "model expression/algorithm" should be generated by the transpiler, and the executor will automatically parallelize it. We may not need the parallel do operator at all. One way to picture such an executor is sketched below.
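
A toy sketch of an executor that "automatically parallelizes whatever can be parallelized" as a ready-list scheduler over the op dependency graph (execute_op and the op tuple layout are made up for illustration, not an existing API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(ops, feed_vars, execute_op, num_threads=4):
    """Toy sketch: run ops concurrently as soon as all of their inputs exist.

    ops is a list of (op, input_names, output_names); execute_op(op) runs one op.
    """
    produced = set(feed_vars)
    pending = list(ops)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while pending:
            # Ops whose inputs are all produced do not depend on each other
            # at this point, so they can run in parallel.
            ready = [item for item in pending if set(item[1]) <= produced]
            assert ready, "cycle or missing input in the op graph"
            for f in [pool.submit(execute_op, item[0]) for item in ready]:
                f.result()
            for item in ready:
                produced.update(item[2])
                pending.remove(item)
```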

I think similar logic can apply to the memory optimizer: if the executor were given all the information, we would not need to transpile a memory-optimized program; instead, the executor should be able to automatically release unnecessary memory.
To be more precise, one way to do it is to use reference-counted inputs and outputs, so an unneeded tensor can be released automatically once its reference count reaches 0. We do not need to reuse tensors in the next step to save malloc time; that should be optimized by the memory allocator. In this sense, we do not need a map as the scope, because if we use a map, the map will always keep a reference to the variables.
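
A minimal sketch of the reference-counting idea (names are illustrative): each tensor records how many downstream ops still read it, and the executor frees the buffer as soon as that count reaches zero, instead of a transpiler rewriting the program to reuse variables.

```python
class RefCountedTensor:
    """Toy sketch: release the buffer when the last consumer has read it."""

    def __init__(self, data, num_consumers):
        self.data = data
        self.refcount = num_consumers  # precomputed from the ProgramDesc

    def consume(self):
        # Called by the executor after each op that reads this tensor finishes.
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None  # drop the buffer; the allocator can recycle it
```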

shanyi15 (Collaborator) commented

Hello, this issue has not been updated in the past month, so we will close it later today. If you still need to follow up after it is closed, feel free to reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
