Parallel do #8

Merged
merged 51 commits
Jan 5, 2018
Commits (51)
2bdd3e4
Update the version of openblas.
Xreki Dec 7, 2017
6dc0e66
Merge branch 'develop' into fix_build_android_openblas
Xreki Dec 21, 2017
d13d787
Refine the cross-compiling documentations.
Xreki Dec 21, 2017
9b3f2c3
Add a simple example for fluid to do inference in C++ code.
Xreki Dec 28, 2017
0c5202c
Tiny enhance of while_op
reyoung Jan 2, 2018
cd7d0f8
Merge branch 'develop' into core_inference_example
Xreki Jan 2, 2018
f3851fe
auto pybind when *_op.cc contains several operators
luotao1 Jan 2, 2018
e4e95be
manually pybind some specific operators
luotao1 Jan 2, 2018
f0e797e
Doc fix and enhancement for lstm_unit python wrapper.
pkuyym Jan 3, 2018
d6ec963
Minor correction.
pkuyym Jan 3, 2018
c0f6f49
Add shape info for arguments.
pkuyym Jan 3, 2018
42a0603
Merge branch 'develop' into core_inference_example
Xreki Jan 3, 2018
5974c1b
refine comments in CMakelists.txt of operator
luotao1 Jan 3, 2018
90a5a55
Expose some activations
reyoung Jan 3, 2018
60fecce
Fix unit test for lstm_unit.
pkuyym Jan 3, 2018
0590967
Add init glog
reyoung Jan 3, 2018
0c16f4f
Update
reyoung Jan 3, 2018
2b3d946
Update init.cc
reyoung Jan 3, 2018
63e3150
Update code
reyoung Jan 3, 2018
5b3cf4e
Use gflags to parse arguments from command-line.
Xreki Jan 3, 2018
5a4367b
Update
reyoung Jan 3, 2018
907e6d0
Fix bug in SetAttrDescVisitor (#7165)
QiJune Jan 3, 2018
2d2b633
add more comments in CMakelists.txt of operator
luotao1 Jan 3, 2018
231e2ee
Merge pull request #7148 from luotao1/op_make
luotao1 Jan 3, 2018
1954146
"fix frigled test gradient of rnn" (#7166)
dzhwinter Jan 3, 2018
89bbc4f
Merge pull request #7157 from pkuyym/fix-7156
pkuyym Jan 4, 2018
042f352
add flag use_mkl_packed
tensor-tang Jan 4, 2018
d3f867e
enable mkl_packed_recurrent python interface
tensor-tang Jan 4, 2018
dd8ffe1
Merge pull request #7131 from reyoung/feature/tiny_enhance_of_while_op
reyoung Jan 4, 2018
a893f15
fix layout transform (#7149)
dzhwinter Jan 4, 2018
8ae84a5
Async to drop kid
reyoung Jan 4, 2018
cd5fad1
Merge pull request #7160 from reyoung/feature/expose_activations
reyoung Jan 4, 2018
24181fd
Merge branch 'develop' of github.com:baidu/Paddle into feature/async_…
reyoung Jan 4, 2018
e138bcf
Update cmake of scope
reyoung Jan 4, 2018
7e10b81
Fix style check
reyoung Jan 4, 2018
3b5e4e0
default disable use_mkl_packed
tensor-tang Jan 4, 2018
b585c93
Default use one thread in fluid
reyoung Jan 4, 2018
a4024a5
"remove cudnn devicecontext" (#7207)
dzhwinter Jan 4, 2018
ee341ef
Merge pull request #7183 from tensor-tang/use_mkl_packed
luotao1 Jan 4, 2018
6f347fa
Merge pull request #6401 from Xreki/fix_build_android_openblas
luotao1 Jan 4, 2018
c7bd777
Support the link of inference library on mac.
Xreki Jan 4, 2018
2b259bf
Merge pull request #7208 from reyoung/feature/default_omp_num_threads_1
reyoung Jan 4, 2018
564dba1
Merge pull request #7196 from reyoung/feature/async_drop_kid
reyoung Jan 4, 2018
040dc59
Correctly handle image operators
reyoung Jan 4, 2018
f3c42f6
Add doc for gru_unit op (in fluid) (#7151)
sidgoyal78 Jan 4, 2018
a8b3996
Merge pull request #7219 from reyoung/feature/correctly_handle_lod_in…
reyoung Jan 5, 2018
809122c
Merge pull request #7097 from Xreki/core_inference_example
luotao1 Jan 5, 2018
7508d52
add memory optimization design doc (#7206)
QiJune Jan 5, 2018
e5fe893
send_recv variables (#7161)
Jan 5, 2018
60e27d1
Merge branch 'develop' of github.com:baidu/Paddle into parallel_do
reyoung Jan 5, 2018
8496b2e
Refine parallel_do
reyoung Jan 5, 2018
6 changes: 4 additions & 2 deletions CMakeLists.txt
@@ -20,8 +20,10 @@ set(PADDLE_BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR})
include(system)

project(paddle CXX C Go)
message(STATUS "CXX compiler: " ${CMAKE_CXX_COMPILER} ", version: " ${CMAKE_CXX_COMPILER_VERSION})
message(STATUS "C compiler: " ${CMAKE_C_COMPILER} ", version: " ${CMAKE_C_COMPILER_VERSION})
message(STATUS "CXX compiler: ${CMAKE_CXX_COMPILER}, version: "
"${CMAKE_CXX_COMPILER_ID} ${CMAKE_CXX_COMPILER_VERSION}")
message(STATUS "C compiler: ${CMAKE_C_COMPILER}, version: "
"${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}")

find_package(Sphinx)
if(NOT CMAKE_CROSSCOMPILING)
2 changes: 1 addition & 1 deletion cmake/external/eigen.cmake
@@ -19,7 +19,7 @@ ExternalProject_Add(

if (${CMAKE_VERSION} VERSION_LESS "3.3.0")
set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/eigen3_dummy.c)
file(WRITE ${dummyfile} "const char * dummy_eigen3 = \"${dummyfile}\";")
file(WRITE ${dummyfile} "const char *dummy_eigen3 = \"${dummyfile}\";")
add_library(eigen3 STATIC ${dummyfile})
else()
add_library(eigen3 INTERFACE)
10 changes: 3 additions & 7 deletions cmake/external/openblas.cmake
@@ -30,23 +30,21 @@ IF(NOT ${CBLAS_FOUND})
CACHE FILEPATH "openblas library." FORCE)

SET(OPENBLAS_CC "${CMAKE_C_COMPILER} -Wno-unused-but-set-variable -Wno-unused-variable")
SET(OPENBLAS_COMMIT "v0.2.20")

IF(CMAKE_CROSSCOMPILING)
SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER})
GET_FILENAME_COMPONENT(CROSS_SUFFIX ${CMAKE_C_COMPILER} DIRECTORY)
SET(CROSS_SUFFIX ${CROSS_SUFFIX}/)
IF(ANDROID)
# arm_soft_fp_abi branch of OpenBLAS to support softfp
# https://github.com/xianyi/OpenBLAS/tree/arm_soft_fp_abi
SET(OPENBLAS_COMMIT "b5c96fcfcdc82945502a2303116a64d89985daf5")
IF(ANDROID_ABI MATCHES "^armeabi(-v7a)?$")
# use softfp
SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV7 ARM_SOFTFP_ABI=1 USE_THREAD=0)
ELSEIF(ANDROID_ABI STREQUAL "arm64-v8a")
SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV8 BINARY=64 USE_THREAD=0)
ENDIF()
ELSEIF(IOS)
IF(CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
SET(OPENBLAS_COMMIT "b5c96fcfcdc82945502a2303116a64d89985daf5")
SET(OPENBLAS_CC "${OPENBLAS_CC} ${CMAKE_C_FLAGS} -isysroot ${CMAKE_OSX_SYSROOT}")
SET(OPENBLAS_CC "${OPENBLAS_CC} -arch arm64")
SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV8 BINARY=64 USE_THREAD=0 CROSS_SUFFIX=${CROSS_SUFFIX})
@@ -56,14 +54,12 @@ IF(NOT ${CBLAS_FOUND})
ENDIF()
ELSEIF(RPI)
# use hardfp
SET(OPENBLAS_COMMIT "v0.2.20")
SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} TARGET=ARMV7 USE_THREAD=0)
ENDIF()
ELSE()
IF(APPLE)
SET(OPENBLAS_CC "${CMAKE_C_COMPILER} -isysroot ${CMAKE_OSX_SYSROOT}")
ENDIF()
SET(OPENBLAS_COMMIT "v0.2.20")
SET(OPTIONAL_ARGS "")
IF(CMAKE_SYSTEM_PROCESSOR MATCHES "^x86(_64)?$")
SET(OPTIONAL_ARGS DYNAMIC_ARCH=1 NUM_THREADS=64)
@@ -113,7 +109,7 @@ INCLUDE_DIRECTORIES(${CBLAS_INC_DIR})
# FIXME(gangliao): generate cblas target to track all high performance
# linear algebra libraries for cc_library(xxx SRCS xxx.c DEPS cblas)
SET(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/cblas_dummy.c)
FILE(WRITE ${dummyfile} "const char * dummy = \"${dummyfile}\";")
FILE(WRITE ${dummyfile} "const char *dummy_cblas = \"${dummyfile}\";")
ADD_LIBRARY(cblas STATIC ${dummyfile})
TARGET_LINK_LIBRARIES(cblas ${CBLAS_LIBRARIES})

6 changes: 3 additions & 3 deletions cmake/generic.cmake
@@ -120,7 +120,7 @@ function(merge_static_libs TARGET_NAME)
DEPENDS ${libs})

# Generate dummy staic lib
file(WRITE ${target_SRCS} "const char *dummy = \"${target_SRCS}\";")
file(WRITE ${target_SRCS} "const char *dummy_${TARGET_NAME} = \"${target_SRCS}\";")
add_library(${TARGET_NAME} STATIC ${target_SRCS})
target_link_libraries(${TARGET_NAME} ${libs_deps})

@@ -160,7 +160,7 @@ function(merge_static_libs TARGET_NAME)
DEPENDS ${libs} ${target_OBJS})

# Generate dummy staic lib
file(WRITE ${target_SRCS} "const char *dummy = \"${target_SRCS}\";")
file(WRITE ${target_SRCS} "const char *dummy_${TARGET_NAME} = \"${target_SRCS}\";")
add_library(${TARGET_NAME} STATIC ${target_SRCS})
target_link_libraries(${TARGET_NAME} ${libs_deps})

@@ -324,7 +324,7 @@ function(go_library TARGET_NAME)
)

# Add dummy code to support `make target_name` under Terminal Command
file(WRITE ${dummyfile} "const char * dummy = \"${dummyfile}\";")
file(WRITE ${dummyfile} "const char *dummy_${TARGET_NAME} = \"${dummyfile}\";")
if (go_library_SHARED OR go_library_shared)
add_library(${TARGET_NAME} SHARED ${dummyfile})
else()
6 changes: 6 additions & 0 deletions doc/api/v2/fluid/layers.rst
@@ -307,6 +307,12 @@ sequence_expand
:noindex:


gru_unit
--------
.. autofunction:: paddle.v2.fluid.layers.gru_unit
:noindex:


lstm_unit
---------
.. autofunction:: paddle.v2.fluid.layers.lstm_unit
Binary file added doc/design/images/control_flow_graph.png
Binary file added doc/design/images/dataflow_equations.png
Binary file added doc/design/images/deep_learning.png
217 changes: 217 additions & 0 deletions doc/design/memory_optimization.md
@@ -0,0 +1,217 @@
# Memory Optimization


## Problem

In a lecture, Andrew Ng attributes the recent success of AI to a combination of:

- availability of Big Data
- supercomputing power to process this Big Data over very large neural networks
- modern algorithms

The following graph shows the details:

![](images/deep_learning.png)

A larger model usually brings better performance. However, GPU memory is limited; for example, a GTX TITAN X has only 12GB of memory. To train complex and large models, we have to pay attention to memory usage. Memory optimization is also necessary for both online and mobile inference.

## Solution

### Basic Strategy

There are some basic strategies for memory optimization, including in-place operation and memory sharing.

#### In-place Operation
In a ReLU activation operator:

$y = \max(x, 0)$

If variable x is not used by any other operator, we can perform the operation in place; in other words, variable y and variable x share the same memory block. An in-place operation immediately saves 50% of the memory for this operator.
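
As a minimal illustration (using NumPy rather than Fluid's actual kernels, so this is only a sketch of the idea), the in-place variant writes the result back into the buffer of x instead of allocating a new one:

```python
import numpy as np

x = np.random.randn(4, 5).astype(np.float32)

y = np.maximum(x, 0)           # out-of-place: allocates a new buffer for y
y = np.maximum(x, 0, out=x)    # in-place: y and x now refer to the same memory block
assert y is x
```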

#### Memory Sharing

Not all operators support in-place operations. Memory sharing is a more general strategy.

The following is an example:

```
a = op1(b, c);
d = op2(a)
e = op3(d, f)
```

In this case, variable a is no longer needed after op2, but op2 does not support in-place operation. After op2 finishes, we can return the memory of variable a to a memory pool. Then, variable e can share the memory of variable a from the pool.


### Live Variable Analysis

Having some basic strategies is not enough. The prerequisite of memory optimization is knowing whether a variable is still "live" after an operation.

In our design, the neural network topology is defined as a program. Luckily, [live variable analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a classic problem in compilers, and its results are used in many stages, such as register allocation.

In compilers, the front end translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register if a and b are never "in use" at the same time. Thus, many temporaries can fit in few registers; if they don't all fit, the excess temporaries can be kept in memory.

Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is "live" if it holds a value that may be needed in the future, so this analysis is called liveness analysis.

We can learn these techniques from compilers. Live variable analysis mainly consists of two stages:

- construct a control flow graph
- solve the dataflow equations


#### Control Flow Graph
To perform analyses on a program, it is often useful to make a control flow graph. A [control flow graph](https://en.wikipedia.org/wiki/Control_flow_graph) (CFG) in computer science is a representation, using graph notation, of all paths that might be traversed through a program during its execution. Each statement in the program is a node in the flow graph; if statement x can be followed by statement y, there is an edge from x to y.

The following is the flow graph for a simple loop.

![](images/control_flow_graph.png)

#### Dataflow Analysis

The liveness of a variable "flows" along the edges of the control flow graph; determining the live range of each variable is an example of a dataflow problem. [Dataflow analysis](https://en.wikipedia.org/wiki/Data-flow_analysis) is a technique for gathering information about the possible set of values calculated at various points in a computer program.

A simple way to perform data-flow analysis of programs is to set up dataflow equations for each node of the control flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes.

- Flow Graph Terminology

A flow graph node has out-edges that lead to successor nodes and in-edges that come from predecessor nodes. The set *pred[n]* is the set of all predecessors of node n, and *succ[n]* is the set of successors.
In the control flow graph above, the out-edges of node 5 are 5 --> 6 and 5 --> 2, so *succ[5]* = {2, 6}. The in-edges of node 2 are 5 --> 2 and 1 --> 2, so *pred[2]* = {1, 5}.

- Uses and Defs

An assignment to a variable or temporary defines that variable. An occurrence of a variable on the right-hand side of an assignment (or in another expression) uses the variable. We can speak of the *def* of a variable as the set of graph nodes that define it, or the *def* of a graph node as the set of variables that it defines, and similarly for the *use* of a variable or graph node. In the control flow graph above, *def(3)* = {c} and *use(3)* = {b, c}.

- Liveness

A variable is *live* on an edge if there is a directed path from that edge to a *use* of the variable that does not go through any *def*. A variable is *live-in* at a node if it is live on any of the in-edges of that node; it is *live-out* at a node if it is live on any of the out-edges of the node.


The calculation of liveness can be solved by iteration until a fixed point is reached. The following is the recursive formula:

![](images/dataflow_equations.png)
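
For reference, the standard liveness dataflow equations (as given in Appel's book listed in the references) are, in the notation used above:

$live\_in(n) = use(n) \cup (live\_out(n) - def(n))$

$live\_out(n) = \bigcup_{s \in succ(n)} live\_in(s)$

All sets start empty, and the two equations are applied repeatedly over every node until nothing changes.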

### Memory optimization transpiler

Finally, we take the basic strategies and the liveness analysis techniques learned from compilers to implement our memory optimization transpiler.

#### add in-place attribute

In-place is a built-in attribute of an operator. Since we treat in-place operators and other operators differently, we have to add an in-place attribute to every operator.


#### construct control flow graph

The following is the ProgramDesc protobuf of the [machine translation](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book/test_machine_translation.py) example.

- Block0:

```
lookup_table
mul
...
while(sub-block idx 1)
...
array_to_lod_tensor
cross_entropy
...
while_grad(sub-block idx 2)
read_from_array
array_to_lod_tensor
...
```

- Block1

```
read_from_array
read_from_array
...
write_to_array
increment
write_to_array
less_than
```

- Block2

```
read_from_array
increment
...
write_to_array
write_to_array
```

We can traverse all the operators and variables in the ProgramDesc to build a control flow graph.

```python
from collections import defaultdict


class ControlFlowGraph(object):
    def __init__(self, program):
        # Edges between operator nodes.
        self._successors = defaultdict(set)
        self._predecessors = defaultdict(set)
        # use/def sets and liveness sets, keyed by operator node.
        self._uses = defaultdict(set)
        self._defs = defaultdict(set)
        self._live_in = defaultdict(set)
        self._live_out = defaultdict(set)
        self._program = program

    def build(self):
        pass

    def dataflow_analysis(self):
        pass

    def memory_optimization(self):
        pass

    def get_program(self):
        return self._program
```
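
A rough sketch of how `build` could derive the *use*/*def* sets and the graph edges is shown below. It uses a simplified, hypothetical representation in which each operator is just a pair of input and output variable names, since the exact ProgramDesc accessors are not spelled out here:

```python
from collections import defaultdict


def build_cfg(ops):
    # `ops` is a list of (input variable names, output variable names) pairs,
    # a simplified stand-in for the operators of one ProgramDesc block.
    uses, defs = defaultdict(set), defaultdict(set)
    successors, predecessors = defaultdict(set), defaultdict(set)
    for i, (inputs, outputs) in enumerate(ops):
        uses[i] = set(inputs)     # variables read by operator i
        defs[i] = set(outputs)    # variables written by operator i
        if i + 1 < len(ops):
            successors[i].add(i + 1)      # straight-line flow: op i -> op i+1
            predecessors[i + 1].add(i)
    return uses, defs, successors, predecessors


# The running example: a = op1(b, c); d = op2(a); e = op3(d, f)
ops = [(["b", "c"], ["a"]), (["a"], ["d"]), (["d", "f"], ["e"])]
uses, defs, succ, pred = build_cfg(ops)
```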

#### perform dataflow analysis

We follow the approach used in compilers and solve the dataflow equations to get the liveness of every variable. If the live-in set of an operator node differs from its live-out set, some variables die at that node and their memory can be reused.

For example:

```
a = op1(b, c);
d = op2(a)
e = op3(d, f)
```

The dataflow analysis result is:

```
live_in(op1) = {b, c, f}
live_out(op1) = {a, f}

live_in(op2) = {a, f}
live_out(op2) = {d, f}

live_in(op3) = {d, f}
live_out(op3) = {}
```

After op1, we can release variables b and c; after op2, we can release variable a; and after op3, we can release variables d and f.
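
A minimal, self-contained sketch of the iterative solver, using the same simplified straight-line representation as above rather than the real ProgramDesc API, reproduces this result:

```python
def liveness(ops):
    # `ops` is a list of (input variable names, output variable names) pairs.
    n = len(ops)
    use = [set(inputs) for inputs, _ in ops]
    define = [set(outputs) for _, outputs in ops]
    live_in = [set() for _ in range(n)]
    live_out = [set() for _ in range(n)]
    changed = True
    while changed:                       # repeat until a fixed point is reached
        changed = False
        for i in reversed(range(n)):     # a backward pass converges quickly
            out = set(live_in[i + 1]) if i + 1 < n else set()
            new_in = use[i] | (out - define[i])
            if out != live_out[i] or new_in != live_in[i]:
                live_out[i], live_in[i] = out, new_in
                changed = True
    return live_in, live_out


# a = op1(b, c); d = op2(a); e = op3(d, f)
ops = [(["b", "c"], ["a"]), (["a"], ["d"]), (["d", "f"], ["e"])]
live_in, live_out = liveness(ops)
# live_in[0] == {"b", "c", "f"} and live_out[0] == {"a", "f"}, matching the result above.
```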

#### memory sharing policy

A memory pool is maintained during the memory optimization stage. Each operator node is scanned to determine whether memory optimization can be applied to it. If an operator satisfies the requirements, the following policy is used to handle its input/output variables.

```
if op.support_inplace():
i --> pool
pool --> o
else:
pool --> o
i --> pool
```
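
The following toy sketch is one interpretation of the pseudocode above, not the actual transpiler; it assumes straight-line operators and the liveness result computed earlier, returns dead input variables to the pool, and lets outputs draw from the pool when possible:

```python
def apply_memory_sharing(ops, live_in, live_out, supports_inplace):
    pool = []     # variables whose memory blocks are free for reuse
    reuse = {}    # maps an output variable to the variable whose memory it takes over
    for i, (_, outputs) in enumerate(ops):
        dead = sorted(live_in[i] - live_out[i])   # inputs that die at this operator
        if supports_inplace[i]:
            pool.extend(dead)             # i --> pool first, ...
            for o in outputs:             # ... then pool --> o, so an output may
                if pool:                  # take over a dying input's memory block
                    reuse[o] = pool.pop()
        else:
            for o in outputs:             # pool --> o first, so outputs never alias
                if pool:                  # this operator's own dying inputs
                    reuse[o] = pool.pop()
            pool.extend(dead)             # then i --> pool
    return reuse


# With the liveness result above and only op1 supporting in-place,
# apply_memory_sharing(ops, live_in, live_out, [True, False, False])
# lets variable e take over the memory of variable a, as described earlier.
```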



## Reference

- [Lecture Notes From Artificial Intelligence Is The New Electricity By Andrew Ng](https://manavsehgal.com/lecture-notes-from-artificial-intelligence-is-the-new-electricity-by-andrew-ng-4712dcbf26e5)
- Modern compiler implementation in ML, by Andrew W. Appel
- [Optimizing Memory Consumption in Deep learning](https://mxnet.incubator.apache.org/architecture/note_memory.html)
14 changes: 2 additions & 12 deletions doc/design/support_new_device.md
@@ -48,8 +48,8 @@ Fluid uses class [DeviceContext](https://github.com/PaddlePaddle/Paddle/blob/dev


```
/-> CPUDeviceContext --> MKLDeviceContext
DeviceContext ----> CUDADeviceContext --> CUDNNDeviceContext
/-> CPUDeviceContext
DeviceContext ----> CUDADeviceContext
\-> FPGADeviceContext
```

@@ -79,16 +79,6 @@ private:
};
```

- CUDNNDeviceContext

```
class CUDNNDeviceContext : public CUDADeviceContext {
private:
cudnnHandle_t cudnn_handle_;
};
```


### Memory and Tensor

