Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel graph walk #531

Merged
merged 9 commits into from
Feb 15, 2018
Merged

Parallel graph walk #531

merged 9 commits into from
Feb 15, 2018

Conversation

ejholmes
Copy link
Contributor

@ejholmes ejholmes commented Feb 7, 2018

Fixes #279
Closes #357

Now that the DAG has been merged, the single biggest performance improvement we can make to stacker build is to switch the graph walk to a parallel walk.

This PR isn't quite ready to merge (but works well, I've used it internally to make changes), so I'm primarily opening this up to start talking about the implementation, and possible changes we need to make. I don't think that we should include this in the 1.2 release of stacker, so it gives us time to polish this and test it internally first.

This is a multi-threaded implementation of the graph walk, which will walk the graph as fast as the graph allows. I think multi-threading is ultimately easier to do than multi-processing, since there's actually not very much we need to make thread safe within stacker itself, because of the nature of the graph. There was some talk in the past about using async io, which would be more resource efficient, but I think would complicate the implementation, but if someone wants to give that a try, be my guest.

Note that, I'm basing this branch on another branch that removes the loop logger, and moves to a simple sequential logger. It makes parallelism a lot easier to deal with.

Perf

I tested this against our internal stacker config (153 stacks) and it drops execution time from ~8 minutes to ~2 minutes. I think there's still room for a lot of optimization here.

before

$ time stacker build
real    8m32.462s

after

$ time stacker build
real    2m23.987s

FWIW, I have not yet run into throttling on DescribeStacks after #529 (which this PR includes) and the change to set_outputs.

Prerequsites

TODO

  • We should add something like a --max-parallel flag to specify the maximum allowed parallelism. This would just control a semaphore that wraps a build/destroy.
  • In interactive mode, we'd need a lock around asking for approval.
  • Think about thread safety more.
  • Handle SIGTERM/SIGINT gracefully.

@ejholmes ejholmes requested a review from a team February 7, 2018 02:51
@@ -255,6 +255,7 @@ def _launch_stack(self, stack, **kwargs):

return FailedStatus(reason)
elif self.provider.is_stack_completed(provider_stack):
self.provider.set_outputs(stack.fqn, provider_stack)
Copy link
Contributor Author

@ejholmes ejholmes Feb 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simple change to ensure that output lookups are parallelized. I'm not sure this is really the best implementation for this, which is why there's no comment explaining this yet.

In a nutshell, when a stack is "UPDATE_COMPLETE", this will store the outputs for the stack on the provider (which it already handles). When another stack references an output from a dependency, this ensures that there's no sequential blocking DescribeStacks calls to do the output lookups from dependencies, since it's already cached, and also makes outputs thread safe.

logger.debug("cancelling %s. "
"Some dependencies "
"were not satisfied", n)
return False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self. This should just continue to walk to the graph, instead of stopping. Conditional execution of the node is handled further up the call stack, which is a better place to abort.

@ejholmes ejholmes changed the base branch from simplified-logging to simple-logging February 9, 2018 03:35
@cloudtools cloudtools deleted a comment from codecov-io Feb 9, 2018
@ejholmes
Copy link
Contributor Author

ejholmes commented Feb 9, 2018

Alright, this is getting awesome. I've added a few commits:

  1. Added a -j/--max-parallel flag to control concurrency.
  2. Interactive mode is handled. So if two stacks request changes, only 1 thread can ask the user for input at a time.
  3. SIGTERM/SIGINT are handled gracefully. Stacker will finish whatever it's currently working on, then exit.

Here's an example of two stacks requesting changes in interactive mode:

asciicast

And an example of ^C (SIGINT/SIGTERM):

asciicast

I'll get us dogfooding this internally, but I think implementation wise, this is looking pretty solid.

@ejholmes ejholmes force-pushed the simple-logging branch 2 times, most recently from 97db788 to 1a81dd4 Compare February 10, 2018 01:56
@ejholmes ejholmes changed the base branch from simple-logging to master February 10, 2018 02:10
@cloudtools cloudtools deleted a comment from codecov-io Feb 10, 2018
@codecov-io
Copy link

codecov-io commented Feb 10, 2018

Codecov Report

Merging #531 into master will decrease coverage by 0.17%.
The diff coverage is 87.06%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #531      +/-   ##
==========================================
- Coverage   87.72%   87.54%   -0.18%     
==========================================
  Files          93       94       +1     
  Lines        6003     6072      +69     
==========================================
+ Hits         5266     5316      +50     
- Misses        737      756      +19
Impacted Files Coverage Δ
stacker/util.py 62.92% <ø> (-2.07%) ⬇️
stacker/tests/test_util.py 92.16% <ø> (-7.36%) ⬇️
stacker/logger/__init__.py 0% <0%> (ø) ⬆️
stacker/tests/test_plan.py 99.34% <100%> (ø) ⬆️
stacker/tests/factories.py 96.55% <100%> (+0.39%) ⬆️
stacker/plan.py 91.01% <100%> (+0.2%) ⬆️
stacker/tests/actions/test_destroy.py 100% <100%> (ø) ⬆️
stacker/tests/test_dag.py 100% <100%> (ø) ⬆️
stacker/status.py 100% <100%> (ø) ⬆️
stacker/tests/actions/test_build.py 96.77% <100%> (+0.01%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77172bf...6d7b228. Read the comment docs.

return cancel


def build_semaphore(concurrency):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be cleaner to change this to build_walker and return an object for walking the graph (stacker.dag.walk/stacker.dag.walk_threaded), with the semaphore built-in.

This way, when --max-parallel=1 we can disable threaded execution entirely, which may be useful for debugging in some cases.

@@ -31,6 +31,12 @@ def add_arguments(self, parser):
"dependencies. Can be specified more than "
"once. If not specified then stacker will "
"work on all stacks in the config file.")
parser.add_argument("-j", "--max-parallel", action="store", type=int,
default=0,
help="The maximum number of stacks to execute in "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should expand the docs on this.

Copy link
Member

@phobologic phobologic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - a lot simpler than I thought it would end up being. Love the UI module/class. I'm good merging this - I'll let you hit merge when you feel up to it!

@ejholmes
Copy link
Contributor Author

A couple people here have hit throttling while using this. I think there's 2 things that can be done:

  1. Better exponential backoff on API calls
  2. Less aggressive polling during updates.

@ejholmes
Copy link
Contributor Author

Added 2 additional commits to help with throttling:

  1. Added a 30 second wait time between the DescribeStacks calls that are used to poll a stack that's updating/creating.
  2. Replaced stackers internal throttling retry with botocores exponential backoff and a larger max_attempts.

With these two changes it should be 1) less likely that throttling is hit on DescribeStacks and 2) make stacker fallback more gracefully when throttling is hit.

@ejholmes
Copy link
Contributor Author

This is working well for us internally. I'm going to merge this into master and plan on doing a release candidate tomorrow to get more external people testing it.

@ejholmes ejholmes merged commit 5e16948 into master Feb 15, 2018
@ejholmes ejholmes deleted the dag-concurrent branch February 15, 2018 01:02
@ejholmes ejholmes mentioned this pull request Feb 21, 2018
@ejholmes ejholmes added this to the 1.2 milestone Mar 1, 2018
phrohdoh pushed a commit to phrohdoh/stacker that referenced this pull request Dec 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants