Parallel graph walk #531

ejholmes · 2018-02-07T02:51:47Z

Fixes #279
Closes #357

Now that the DAG has been merged, the single biggest performance improvement we can make to stacker build is to switch the graph walk to a parallel walk.

This PR isn't quite ready to merge (though it works well; I've used it internally to make changes), so I'm primarily opening it to start talking about the implementation and any changes we need to make. I don't think we should include this in the 1.2 release of stacker, which gives us time to polish it and test it internally first.

This is a multi-threaded implementation of the graph walk, which will walk the graph as fast as the graph allows. I think multi-threading is ultimately easier to do than multi-processing since, because of the nature of the graph, there's actually not very much we need to make thread safe within stacker itself. There was some talk in the past about using async IO, which would be more resource efficient but, I think, would complicate the implementation. If someone wants to give that a try, be my guest.
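To make the idea concrete, here is a minimal sketch of a threaded walk, not the code in this PR. It assumes the graph is a plain dict mapping each node to the set of nodes it depends on, and that every dependency also appears as a key. The name `walk_threaded` and the `visit` callback are illustrative only.

```python
import threading


def walk_threaded(graph, visit):
    """Visit every node of a DAG, starting each node once its dependencies finish.

    `graph` maps a node name to the set of node names it depends on
    (an assumed shape for this sketch).
    """
    # One event per node, set when that node has been visited.
    done = {node: threading.Event() for node in graph}

    def run(node):
        # Block until every dependency has been visited.
        for dep in graph[node]:
            done[dep].wait()
        try:
            visit(node)
        finally:
            # Always signal completion so dependents are never left waiting.
            # Note: this sketch does not propagate failures to dependents.
            done[node].set()

    threads = [threading.Thread(target=run, args=(node,)) for node in graph]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Independent branches of the graph run at the same time, while each node still waits for everything it depends on.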

Note that I'm basing this branch on another branch that removes the loop logger and moves to a simple sequential logger. It makes parallelism a lot easier to deal with.

Perf

I tested this against our internal stacker config (153 stacks) and it drops execution time from ~8 minutes to ~2 minutes. I think there's still room for a lot of optimization here.

before

$ time stacker build
real    8m32.462s

after

$ time stacker build
real    2m23.987s

FWIW, I have not yet run into throttling on DescribeStacks after #529 (which this PR includes) and the change to set_outputs.

Prerequisites

Simplified logging #532

TODO

We should add something like a --max-parallel flag to specify the maximum allowed parallelism. This would just control a semaphore that wraps a build/destroy.
In interactive mode, we'd need a lock around asking for approval.
Think about thread safety more.
Handle SIGTERM/SIGINT gracefully.
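As a rough illustration of the first TODO item, a semaphore that wraps a build/destroy step could look like the sketch below. The helper name `limit_parallelism` is hypothetical, and treating a value of 0 as "no limit" is an assumption of this sketch rather than something the PR states.

```python
import threading


def limit_parallelism(fn, max_parallel):
    """Wrap `fn` so that at most `max_parallel` calls run at the same time.

    Assumption for this sketch: 0 (or a negative value) means "no limit".
    """
    if max_parallel <= 0:
        return fn

    semaphore = threading.BoundedSemaphore(max_parallel)

    def wrapper(*args, **kwargs):
        # Callers beyond the limit block here until a slot frees up.
        with semaphore:
            return fn(*args, **kwargs)

    return wrapper
```

Wrapping the step function, rather than the walker, keeps the limit independent of how the graph is traversed.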

ejholmes · 2018-02-07T02:54:39Z

stacker/actions/build.py

@@ -255,6 +255,7 @@ def _launch_stack(self, stack, **kwargs):

                return FailedStatus(reason)
            elif self.provider.is_stack_completed(provider_stack):
+                self.provider.set_outputs(stack.fqn, provider_stack)


This is a simple change to ensure that output lookups are parallelized. I'm not sure this is really the best implementation for this, which is why there's no comment explaining this yet.

In a nutshell, when a stack is "UPDATE_COMPLETE", this will store the outputs for the stack on the provider (which it already handles). When another stack references an output from a dependency, this ensures that there are no sequential, blocking DescribeStacks calls to look up outputs from dependencies, since they're already cached. It also makes outputs thread safe.
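A minimal sketch of that caching idea, under stated assumptions: `OutputCache` and its `describe_stack` callback are hypothetical names standing in for the provider and its DescribeStacks call, and this is not the provider code in stacker.

```python
import threading


class OutputCache:
    """Caches stack outputs so dependents do not each trigger a remote lookup."""

    def __init__(self, describe_stack):
        # `describe_stack` is a hypothetical callable: fqn -> dict of outputs.
        self._describe_stack = describe_stack
        self._outputs = {}
        self._lock = threading.Lock()

    def set_outputs(self, fqn, outputs):
        # Called when a stack is known to be complete.
        with self._lock:
            self._outputs[fqn] = dict(outputs)

    def get_outputs(self, fqn):
        with self._lock:
            if fqn in self._outputs:
                return self._outputs[fqn]
        # Slow path: perform the remote call outside the lock so other
        # threads reading cached outputs are not blocked behind it.
        outputs = dict(self._describe_stack(fqn))
        self.set_outputs(fqn, outputs)
        return outputs
```

If the outputs are stored as soon as a stack completes, a dependent stack that starts later should hit the cache instead of making its own call.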

ejholmes · 2018-02-07T02:56:59Z

stacker/dag/__init__.py

+                logger.debug("cancelling %s. "
+                             "Some dependencies "
+                             "were not satisfied", n)
+                return False


Note to self: this should just continue to walk the graph instead of stopping. Conditional execution of the node is handled further up the call stack, which is a better place to abort.

ejholmes · 2018-02-09T09:12:17Z

Alright, this is getting awesome. I've added a few commits:

Added a -j/--max-parallel flag to control concurrency.
Interactive mode is handled. So if two stacks request changes, only 1 thread can ask the user for input at a time.
SIGTERM/SIGINT are handled gracefully. Stacker will finish whatever it's currently working on, then exit.
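For readers who want a picture of the last two items, here is a hedged sketch rather than the code from these commits. The names `UI`, `install_cancel_handlers`, and `skip_when_cancelled` are illustrative; the only library calls used are `threading.Lock`, `threading.Event`, and `signal.signal`.

```python
import signal
import threading


class UI:
    """Serializes prompts so only one thread talks to the user at a time."""

    def __init__(self, prompt=input):
        self._lock = threading.Lock()
        self._prompt = prompt

    def ask(self, message):
        # A second thread asking for approval waits here until the
        # first prompt has been answered.
        with self._lock:
            return self._prompt(message)


def install_cancel_handlers(cancel):
    """Set the `cancel` event on SIGINT/SIGTERM.

    signal.signal() may only be called from the main thread.
    """
    def handler(signum, frame):
        cancel.set()

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)


def skip_when_cancelled(fn, cancel):
    """Wrap `fn` so steps that have not started yet are skipped after a cancel.

    Steps that are already running are left to finish.
    """
    def wrapper(*args, **kwargs):
        if cancel.is_set():
            return None
        return fn(*args, **kwargs)

    return wrapper
```

In this sketch the signal handler only flips a flag, so in-flight work is never interrupted halfway through.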

Here's an example of two stacks requesting changes in interactive mode:

And an example of ^C (SIGINT/SIGTERM):

I'll get us dogfooding this internally, but I think implementation wise, this is looking pretty solid.

codecov-io · 2018-02-10T02:36:29Z

Codecov Report

Merging #531 into master will decrease coverage by 0.17%.
The diff coverage is 87.06%.

@@            Coverage Diff             @@
##           master     #531      +/-   ##
==========================================
- Coverage   87.72%   87.54%   -0.18%     
==========================================
  Files          93       94       +1     
  Lines        6003     6072      +69     
==========================================
+ Hits         5266     5316      +50     
- Misses        737      756      +19

Impacted Files	Coverage Δ
stacker/util.py	`62.92% <ø> (-2.07%)`	⬇️
stacker/tests/test_util.py	`92.16% <ø> (-7.36%)`	⬇️
stacker/logger/__init__.py	`0% <0%> (ø)`	⬆️
stacker/tests/test_plan.py	`99.34% <100%> (ø)`	⬆️
stacker/tests/factories.py	`96.55% <100%> (+0.39%)`	⬆️
stacker/plan.py	`91.01% <100%> (+0.2%)`	⬆️
stacker/tests/actions/test_destroy.py	`100% <100%> (ø)`	⬆️
stacker/tests/test_dag.py	`100% <100%> (ø)`	⬆️
stacker/status.py	`100% <100%> (ø)`	⬆️
stacker/tests/actions/test_build.py	`96.77% <100%> (+0.01%)`	⬆️
... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 77172bf...6d7b228.

ejholmes · 2018-02-10T02:36:52Z

stacker/commands/stacker/base.py

+    return cancel
+
+
+def build_semaphore(concurrency):


Might be cleaner to change this to build_walker and return an object for walking the graph (stacker.dag.walk/stacker.dag.walk_threaded), with the semaphore built-in.

This way, when --max-parallel=1 we can disable threaded execution entirely, which may be useful for debugging in some cases.
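One way this could look, as a sketch only: `build_walker` returns a plain sequential walk when the limit is 1 and a threaded walk with the semaphore built in otherwise. The graph shape (a dict mapping each node to the set of nodes it depends on, with every dependency present as a key), the `ThreadedWalker` class, and the "0 means unlimited" convention are assumptions of this example. It uses `graphlib`, which requires Python 3.9 or newer.

```python
import threading
from graphlib import TopologicalSorter  # Python 3.9+


def walk(graph, visit):
    """Visit nodes one at a time on the calling thread, dependencies first."""
    for node in TopologicalSorter(graph).static_order():
        visit(node)


class ThreadedWalker:
    """Threaded walk with an optional built-in concurrency limit."""

    def __init__(self, max_parallel=0):
        # Assumption for this sketch: 0 means "no limit".
        self._semaphore = (
            threading.BoundedSemaphore(max_parallel) if max_parallel > 0 else None
        )

    def walk(self, graph, visit):
        done = {node: threading.Event() for node in graph}

        def run(node):
            for dep in graph[node]:
                done[dep].wait()
            try:
                if self._semaphore is None:
                    visit(node)
                else:
                    # Take a slot only after dependencies are done, so a
                    # waiting node never holds a slot its dependency needs.
                    with self._semaphore:
                        visit(node)
            finally:
                done[node].set()

        threads = [threading.Thread(target=run, args=(n,)) for n in graph]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()


def build_walker(max_parallel):
    """Return a callable `(graph, visit)` suited to the requested limit."""
    if max_parallel == 1:
        # No threads at all, which keeps tracebacks and debuggers simple.
        return walk
    return ThreadedWalker(max_parallel).walk
```

Callers then only ever see a single `(graph, visit)` callable and do not need to know whether threads are involved.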

ejholmes · 2018-02-10T04:07:54Z

stacker/commands/stacker/build.py

@@ -31,6 +31,12 @@ def add_arguments(self, parser):
                                 "dependencies. Can be specified more than "
                                 "once. If not specified then stacker will "
                                 "work on all stacks in the config file.")
+        parser.add_argument("-j", "--max-parallel", action="store", type=int,
+                            default=0,
+                            help="The maximum number of stacks to execute in "


Should expand the docs on this.

phobologic

This looks great - a lot simpler than I thought it would end up being. Love the UI module/class. I'm good merging this - I'll let you hit merge when you feel up to it!

ejholmes · 2018-02-14T02:31:40Z

A couple of people here have hit throttling while using this. I think there are two things that can be done:

Better exponential backoff on API calls
Less aggressive polling during updates.

ejholmes · 2018-02-14T04:20:33Z

Added 2 additional commits to help with throttling:

Added a 30 second wait time between the DescribeStacks calls that are used to poll a stack that's updating/creating.
Replaced stacker's internal throttling retry with botocore's exponential backoff and a larger max_attempts.

With these two changes, 1) throttling should be less likely on DescribeStacks and 2) stacker should fall back more gracefully when throttling is hit.
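To illustrate how the two changes interact, here is a standard-library sketch. It is a stand-in, not botocore's retry implementation and not the code in these commits. `Throttled`, `call_with_backoff`, and `wait_for_stack` are hypothetical names, and the `describe` callable is assumed to return a CloudFormation-style status string.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical stand-in for a throttling error returned by the API."""


def call_with_backoff(fn, max_attempts=10, base=1.0, cap=60.0, sleep=time.sleep):
    """Call `fn`, retrying throttled calls with capped, jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles with each attempt, up to `cap`;
            # the random jitter spreads out retries from parallel threads.
            delay = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, delay))


def wait_for_stack(describe, poll_interval=30.0, sleep=time.sleep):
    """Poll `describe` until the stack leaves an *_IN_PROGRESS status."""
    while True:
        status = call_with_backoff(describe, sleep=sleep)
        if not status.endswith("_IN_PROGRESS"):
            return status
        # Fixed wait between polls keeps the steady-state request rate low.
        sleep(poll_interval)
```

The fixed poll interval lowers the baseline request rate, and the backoff only comes into play when a call is throttled anyway.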

ejholmes · 2018-02-15T01:02:22Z

This is working well for us internally. I'm going to merge this into master and plan on doing a release candidate tomorrow to get more external people testing it.

Parallel graph walk

ejholmes requested a review from a team February 7, 2018 02:51

ejholmes force-pushed the dag-concurrent branch from 1469d39 to d2f3a87 Compare February 9, 2018 03:35

ejholmes changed the base branch from simplified-logging to simple-logging February 9, 2018 03:35

ejholmes force-pushed the dag-concurrent branch from d2f3a87 to 0214217 Compare February 9, 2018 03:54

cloudtools deleted a comment from codecov-io Feb 9, 2018

russellballestrini approved these changes Feb 9, 2018

troyready mentioned this pull request Feb 9, 2018

add raw json/yaml template support #530

Merged

ejholmes force-pushed the simple-logging branch 2 times, most recently from 97db788 to 1a81dd4 Compare February 10, 2018 01:56

ejholmes force-pushed the dag-concurrent branch from 7acacbb to 487140a Compare February 10, 2018 02:10

ejholmes changed the base branch from simple-logging to master February 10, 2018 02:10

cloudtools deleted a comment from codecov-io Feb 10, 2018

ejholmes force-pushed the dag-concurrent branch from 7c0f252 to e7b90d0 Compare February 10, 2018 02:57

phobologic approved these changes Feb 11, 2018

ejholmes added 7 commits February 13, 2018 17:33

Threaded parallel walk

675cb6c

Add -j/--max-parallel flag to build/destroy commands

7194239

Wrap UI input with a lock

729b36c

Handle SIGTERM/SIGINT gracefully

db21192

Set outputs on stack if there's no change

6ac9549

Include threadName in DEBUG format

e55a1e9

Make -j=1 disable multithreading

680d0ca

ejholmes force-pushed the dag-concurrent branch from c1a3c99 to 680d0ca Compare February 14, 2018 01:35

Use exponential retry from botocore

b9fee47

ejholmes mentioned this pull request Feb 14, 2018

Use exponential backoff from botocore #533

Merged

Wait 30 seconds between DescribeStacks calls during updates.

6d7b228

ejholmes merged commit 5e16948 into master Feb 15, 2018

ejholmes deleted the dag-concurrent branch February 15, 2018 01:02

ejholmes mentioned this pull request Feb 21, 2018

Diff improvements #539

Merged

ejholmes added this to the 1.2 milestone Mar 1, 2018

phrohdoh pushed a commit to phrohdoh/stacker that referenced this pull request Dec 18, 2018

Merge pull request cloudtools#531 from remind101/dag-concurrent

5780051

Parallel graph walk
