Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Faster pointwise fusion graph pass #19269

Merged
merged 11 commits into from
Oct 19, 2020

Conversation

ptrendx
Copy link
Member

@ptrendx ptrendx commented Oct 1, 2020

Description

This PR greatly increases the speed of pointwise fusion graph pass. As a test case I used XLNet from Gluon-NLP which did expose slowness in this graph pass before: #17105 - the total time of fwd+bwd fusion after the fix from that issue was ~8s, after this PR the time is ~11 ms, with 3 ms of that taken by IndexedGraph construction from the original graph (so the actual fusion graph pass takes ~8ms for that case - about 1000x improvement).

The motivation of this PR is getting the pointwise fusion graph pass to be lightweight enough to possibly be run every time the shapes change in the graph, enabling the fusion to be shape-aware. This is important as in 2.0 NumPy semantics make multiple operations (like add, sub etc.) broadcast by default, even if in the end they are simple elementwise operations. This PR does not fully get us there (I would much like it being <5ms for the network like XLNet to be sure it is lightweight enough to not be bottleneck for all usecases), although there are some parts that would not be needed anymore if we could make this pass after infershape is already done (e.g. we would not need to insert FusedOpHelper/FusedOpOutHelper, which currently takes a little over 1ms). That said, it is a big step in that direction.

The main problem that the fusion graph pass needs to solve is not allowing the cycles to be formed: nodes which both consume the output of and provide the input to a single subgraph. The original pointwise fusion graph pass' algorithm to avoid cycles was to first construct a mapping, which nodes are excluded to be in the same subgraph with a given node. The construction of the mapping was simple but inefficient (and then improved in #17114, but still pretty slow):

for each node n that does not qualify to be in a subgraph:
    outputs = nodes reached by DFS from n in the direction of n's outputs
    inputs = nodes reached by DFS from n in the direction of n's inputs
    for each output in outputs:
        for each input in inputs:
            put (input, output) and (output, input) pair into the exclusion mapping

This was O(n^3) algorithm (improved in #17114 to be O(n^2) on average) and required a separate BidirectionalGraph datastructure that enabled DFS in both directions.

The new algorithm for pointwise fusion graph pass is designed based on 2 observations:

  • most of the entries in the exclusion mappings are not useful, as those nodes are not considered to be part of the same subgraph
  • we can traverse the graph in topological order, which the original algorithm did not take advantage of

In the new algorithm the graph is traversed in the topological order (and so we are sure that all the inputs of the current node were already processed) and each node has its own exclusion set of subgraphs that it can't be a part of. The exclusion set of node is constructed as the union of the exclusion sets of its inputs + all the subsets that its inputs are part of if the node itself is ineligible to be in a subgraph. Because subsets can merge (e.g. when you have operation a + b where a is part of subgraph s1 and b is part of subgraph s2, then s1 and s2 need to be merged into a single subgraph containing everything from s1, s2 and the + operator), the mapping is maintained to know which other subsets the current subset is merged with.

There are number of additional optimizations:

  • because of the topological ordering of the graph traversal, subset ids in the exclusion set are typically consecutive number (or a small group of consecutive numbers) -> therefore in the exclusion set we actually keep intervals of numbers instead of the numbers themselves
  • in most cases the union of exclusion sets is equal to one of the sets -> in order to avoid costly unnecessary memory allocations, we share the exclusion sets between nodes
  • the fwd and bwd fusions are done together in a single pass to avoid overheads

The second part of the PR is the overhaul of the actual subgraph substitution - previously it was done 1 subgraph at a time, with multiple DFSVisit calls per subgraph. Unfortunately DFSVisit is costly and most of that work was wasted (as the DFS over the entire graph was needed for the substitution of a few nodes. In the new approach a new graph is created based on the subgraph assignment generated in the previous part, which requires only a single pass over the graph to apply all subgraphs.

@samskalicky @Caenorst @mk-61

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@ptrendx ptrendx added the pr-work-in-progress PR is still work in progress label Oct 1, 2020
@mxnet-bot
Copy link

Hey @ptrendx , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, sanity, windows-cpu, edge, unix-gpu, windows-gpu, centos-gpu, miscellaneous, unix-cpu, centos-cpu, clang]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@ptrendx
Copy link
Member Author

ptrendx commented Oct 2, 2020

Seems cpplint that we use does not like structured binding from C++17.

@szha szha requested a review from mseth10 October 3, 2020 21:13
@ptrendx
Copy link
Member Author

ptrendx commented Oct 6, 2020

Ok, issue #19264 came at a perfect time - it actually exposed a corner case that breaks both the original and the new algorithm. The minimal example looks like this:
Unfused graph
where a0, a1, b0 and b1 are fusable, while xa and xb are not. According to the original algorithm a0 and a1 can be fused together and b0 and b1 can be fused together. But if we do that we get:
Fused graph
It happens because the algorithm does not take into account that the exclusion sets of the nodes connected to a subgraph need to contain the excluded nodes from all nodes in that subgraph, even if they were added later in the subgraph creation process.

I am working on a solution that will fix that.

@ptrendx ptrendx added pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress labels Oct 9, 2020
@ptrendx
Copy link
Member Author

ptrendx commented Oct 10, 2020

I fixed the problem with the cycle - every time when I add a new node to a subset, I check whether it's exclusion set is fully included in the future fusion node exclusion set. If so, I update all descendants of the fusion node (with the assumption that typically fusion subsets are not very big and their nodes are quite close together in the indexed graph, so the cost of that is not very high). Still, the check is quite expensive compared to the rest of the algorithm and now the cost of the graph pass for XLNet is ~10-11 ms (still a huge improvement compared to the previous algorithm).

@mseth10
Copy link
Contributor

mseth10 commented Oct 12, 2020

Thanks for the optimization @ptrendx . A couple of questions based on the PR description:

Because subsets can merge (...), the mapping is maintained to know which other subsets the current subset is merged with.

  1. What do you mean by subsets? Is it the same as subgraphs?
  2. What mapping are you referring to here?

@mseth10
Copy link
Contributor

mseth10 commented Oct 12, 2020

@ptrendx In regards to this edge case #19269 (comment), considering a0 -> xa -> b0 -> b1 -> xb -> a1 as the topological graph, before reaching node xb, we should have already combined b0 and b1 into subset b and updated exclusion set of b to contain a0. If that's the case, a0 should automatically be included in the exclusion set for xb. Why then do we need to "update all descendants of the fusion node"?

@ptrendx
Copy link
Member Author

ptrendx commented Oct 12, 2020

Hi @mseth10, I think all of your 3 questions are quite connected. I agree that the nomenclature here is slightly confusing. I use "subsets" because in this algorithm there are actually 2 parts:

  • the first one just finds the sets of nodes that can be fused together but does not do any actual fusion
  • the second one turns those subsets into the actual subgraphs and puts them inside the fusion node
    About the mapping - consider a following graph:
    example
    where a and b are the outputs of some previously identified subgraphs (let's call those subgraphs A and B). Node c (which can't be fused) has A in its exclusion list. Then + node is being considered, which can be fused. It merges subgraphs A and B together under one of them (let's say B). Then subgraph A needs to have a mapping that says that it is a part of B now (and B should also have an inverse mapping that it contains subgraphs A and B in order to be able to check against nodes like c which only saw A but did not see B).

@ptrendx
Copy link
Member Author

ptrendx commented Oct 13, 2020

@mxnet-bot run ci [unix-cpu, unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu, unix-gpu]

@ptrendx
Copy link
Member Author

ptrendx commented Oct 13, 2020

Unix-gpu failed due to test_countsketch (issue #10988)
Unix-cpu failed due to instability in test_np_average (issue #19071)

@mxnet-bot run ci [unix-gpu, unix-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-cpu, unix-gpu]

@szha
Copy link
Member

szha commented Oct 14, 2020

@mseth10 any more comments on the PR?

src/imperative/simple_partition_pass.cc Show resolved Hide resolved
src/imperative/simple_partition_pass.cc Outdated Show resolved Hide resolved
src/imperative/simple_partition_pass.cc Outdated Show resolved Hide resolved
src/imperative/pointwise_fusion_pass.cc Outdated Show resolved Hide resolved
node_id));
new_nodes[i]->control_deps.insert(new_nodes[i]->control_deps.begin() + dep_num,
std::move(helper_node));
} else if (their_subgraph_id != subgraph_id &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can subgraph_id be -1 here? or should we add that check here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can. The role of FusedOpHelper is to get information (for shape/type inference) from inside the FusedOp to the outside world.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if your dependency is going to be fused, then you need to have this helper instead.

Copy link
Contributor

@mseth10 mseth10 Oct 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. In general, what are the control flow dependencies of a node? Is it just the dependency of backward op node on its corresponding forward op node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is the dependency we are changing as it is used for infershape/infertype. Going forward I would like to get rid of this and never do the inferattr on the fused graph (instead do it on the original graph and map the results to the fused one), but it is outside the scope of this PR.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020
@ptrendx
Copy link
Member Author

ptrendx commented Oct 16, 2020

@mxnet-bot run ci [all]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [windows-gpu, unix-gpu, sanity, edge, windows-cpu, miscellaneous, centos-cpu, clang, website, unix-cpu, centos-gpu]

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020
@ptrendx
Copy link
Member Author

ptrendx commented Oct 16, 2020

@ChaiBapchya FYI - there seem to be some connection issues in CI - first sanity test failed because of connection reset, then centos-cpu test failed with

[2020-10-16T21:36:45.894Z] ERROR: for centos7_cpu  ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

(https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/centos-cpu/job/PR-19269/8/display/redirect)

@mxnet-bot run ci [centos-cpu]

@mxnet-bot
Copy link

Undefined action detected.
Permissible actions are : run ci [all], run ci [job1, job2]
Example : @mxnet-bot run ci [all]
Example : @mxnet-bot run ci [centos-cpu, clang]

@ptrendx
Copy link
Member Author

ptrendx commented Oct 16, 2020

@mxnet-bot run ci [centos-cpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [centos-cpu]

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020
Copy link
Contributor

@mseth10 mseth10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sxjscience sxjscience merged commit a0fd1fe into apache:master Oct 19, 2020
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Oct 22, 2020
* Faster pointwise fusion graph pass

* Fix lint

* Fix lint 2

* Fixes

* Fixing slice parameter handling in fusion

* Fixing the slice fix

* Fix the cycle bug

* Added test

* Fix lint

* Fix merging of subgraphs

* Fixes from review
chinakook pushed a commit to chinakook/mxnet that referenced this pull request Nov 17, 2020
* Faster pointwise fusion graph pass

* Fix lint

* Fix lint 2

* Fixes

* Fixing slice parameter handling in fusion

* Fixing the slice fix

* Fix the cycle bug

* Added test

* Fix lint

* Fix merging of subgraphs

* Fixes from review
ptrendx added a commit that referenced this pull request Nov 17, 2020
* Faster pointwise fusion graph pass (#19269)

* Faster pointwise fusion graph pass

* Fix lint

* Fix lint 2

* Fixes

* Fixing slice parameter handling in fusion

* Fixing the slice fix

* Fix the cycle bug

* Added test

* Fix lint

* Fix merging of subgraphs

* Fixes from review

* Use std::tie instead of C++17 structured binding

* More fixes for lack of c++17

* Fix
josephevans pushed a commit to josephevans/mxnet that referenced this pull request Dec 8, 2020
…che#19413)

* Faster pointwise fusion graph pass (apache#19269)

* Faster pointwise fusion graph pass

* Fix lint

* Fix lint 2

* Fixes

* Fixing slice parameter handling in fusion

* Fixing the slice fix

* Fix the cycle bug

* Added test

* Fix lint

* Fix merging of subgraphs

* Fixes from review

* Use std::tie instead of C++17 structured binding

* More fixes for lack of c++17

* Fix
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
6 participants