DAG #523

ejholmes · 2018-01-26T05:17:57Z

Third times the charm :)

Fixes #300
Fixes #186
Fixes #384

This is an internal refactor to change the Plan implementation to be backed by a Directed Acyclic Graph. Most of this work was already done a long time ago in #406, but this just brings it up to speed with the master branch, with a few important caveats:

Not parallelized

This implementation is not parallelized. This PR paves the way to make it easy to implement a 100% parallelized graph walk, but I didn't implement it here, primarily to make review easier, but also because it doesn't matter much based on the current implementation (more on that below). I think we should get the graph in master, and then add a parallelized walk after the fact since that will need special attention to deal with concurrency issues (logging, user input, etc).

In theory, this sounds like it should reduce performance, but ironically it has little impact. I benchmarked before and after on our internal stacker graph, which is pretty big:

before (stacker 1.1.3)

$ time stacker build

real    8m31.716s
user    0m0.139s
sys     0m0.100s

after (dag)

$ time stacker build

real    8m33.819s
user    0m0.142s
sys     0m0.094s

However, because the --stacks flag now works like you'd expect it to, this branch can actually make most workflows a lot faster, even without concurrency.

Notes for reviewer

For the time being, I'd like to focus on functionality, and making sure everything is feature complete, and all of the integration tests pass. There's probably some missing doc's, and I'll get to those things later.

Also, to make review easier, instead of making changes to stacker.plan I made a new stacker.plan2, and swapped out call sites to use the new graph based plan implementation. This will likely replace stacker.plan before merging, unless we want to keep the old stacker.plan around for API compatibility.

The stacker.dag module probably doesn't need much attention, since it's already been reviewed previously, and is stilled based on py-dag. Feel free to review it if you want.

TODO

Bring back the check pointed loop logger (I just changed this to a simplified logger to make refactoring easier)
Bring back some removed tests.
Re-implement the --tail flag.
Re-implement dump

phobologic

Skipped the tests for now, let me know when you're ready for me to take a look at those.

phobologic · 2018-01-26T23:00:18Z

stacker/dag/__init__.py

+
+    def add_node(self, node_name, graph=None):
+        """ Add a node if it does not exist yet, or error out. """
+        if not graph:


Do you have any idea why this library does this so often? I haven't seen this pattern used really ever before. I thought maybe for testing, but it seems like a messy way to handle that.

No clue. It seems unnecessary. I'll try to add a commit that removes this.

Thanks, yeah, I couldn't figure it out - I thought it'd be a test thing, but just now looking at tests I don't think it is. Strange.

phobologic · 2018-01-26T23:02:51Z

stacker/dag/__init__.py

+            raise KeyError('node %s does not exist' % dep_node)
+        test_graph = deepcopy(graph)
+        test_graph[ind_node].add(dep_node)
+        is_valid, message = self.validate(test_graph)


It seems like the validate method should throw the exception itself, and have anything that calls it deal with the results if they want to catch them.

phobologic · 2018-01-26T23:03:52Z

stacker/dag/__init__.py

+        if not graph:
+            graph = self.graph
+        if dep_node not in graph.get(ind_node, []):
+            raise KeyError('this edge does not exist in graph')


Can you add more info to this exception? I actually think his overuse of KeyError, instead of custom Exceptions, is a little sloppy. In any case, I think it's going to be necessary to at least have the two nodes in the exception, so we have a useful error message.

phobologic · 2018-01-26T23:06:19Z

stacker/dag/__init__.py

+            graph = self.graph
+        return [key for key in graph if node in graph[key]]
+
+    def downstream(self, node, graph=None):


descendants maybe to match predecessors more?

When I was cleaning this up I called them: ancestors (recursive parents) and descendants (recursive children) - I also had children and parents, as well as properties for has_children, has_descendants, etc. has_children and has_descendants were both the same call, because you couldn't have descendants if you didn't have children.

Oh, sorry, this is what I woul dhave called children - all_downstream is what I would have called descendants

phobologic · 2018-01-26T23:09:10Z

stacker/dag/__init__.py

+
+        # Now, rebuild the graph for each node that's present.
+        for node, edges in graph.items():
+            if node in dag.graph:


This if seems dangerous, since we silently throw away any nodes that aren't in dag.graph, which really shouldn't ever happen, right?

That's what should happen. It's iterating through the full graph of nodes, and only re-adding edges for nodes that exist in the filtered graph.

Ahh, my bad - mixing dag & graph in this method made it confusing. Maybe having it be original_graph and filtered_graph would make it less confusing. Not a huge deal for now.

phobologic · 2018-01-26T23:13:52Z

stacker/dag/__init__.py

+        if graph is None:
+            graph = self.graph
+
+        dependent_nodes = set(


I'm having a really hard time reading this - any way to dumb it down that isn't going to destroy performance?

phobologic · 2018-01-26T23:14:39Z

stacker/dag/__init__.py

+        else:
+            raise ValueError('graph is not acyclic')
+
+    def size(self):


You could also give this class a __len__ method so that calling len() on it would return the size.

phobologic · 2018-01-26T23:15:34Z

stacker/exceptions.py

+        self.stack = stack
+        self.dependency = dependency
+        self.exception = exception
+        message = ("Error detected when adding '%s' "


This message will be broken if stack & dependency aren't set.

Where this is used, those things are never not present.

Cool - you should make them required then, so we don't hit an edge case.

phobologic · 2018-01-26T23:18:19Z

stacker/plan2.py

+    return graph
+
+
+class Graph(object):


Curious if this should just be combined with the Plan class?

I originally started with that, but I think it's cleaner to keep these things separate:

Graph encapsulates a graph of Step objects, and has operations to make working on the graph itself easy (transposition, walking, etc).

Plan is more like a convenience class for Actions to consume (dump, outline, etc).

Plan could probably eventually get removed, as most of the methods on it may be better somewhere else, but I primarily kept it around to keep API compatibility with the existing Plan implementation.

phobologic · 2018-01-26T23:18:49Z

stacker/plan2.py

+        return self.step_names
+
+
+class UnlimitedSemaphore(object):


Is this just planning ahead for multi-process/threading?

Yeah, this is leftover from multi-threading. I'll remove this since it's unnecessary right now.

codecov-io · 2018-01-27T01:52:41Z

Codecov Report

Merging #523 into master will increase coverage by 0.41%.
The diff coverage is 92.33%.

@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
+ Coverage   88.72%   89.14%   +0.41%     
==========================================
  Files          92       96       +4     
  Lines        5926     6401     +475     
==========================================
+ Hits         5258     5706     +448     
- Misses        668      695      +27

Impacted Files	Coverage Δ
stacker/tests/test_context.py	`99.05% <ø> (-0.03%)`	⬇️
stacker/commands/stacker/build.py	`83.33% <ø> (ø)`	⬆️
stacker/tests/test_stacker.py	`97.82% <ø> (-0.26%)`	⬇️
stacker/tests/actions/test_build.py	`96.75% <100%> (-0.34%)`	⬇️
stacker/context.py	`97.26% <100%> (-0.08%)`	⬇️
stacker/tests/test_plan.py	`100% <100%> (ø)`	⬆️
stacker/exceptions.py	`90.43% <100%> (+0.61%)`	⬆️
stacker/util.py	`64.98% <100%> (+0.28%)`	⬆️
stacker/tests/test_dag.py	`100% <100%> (ø)`
stacker/tests/actions/test_destroy.py	`100% <100%> (+1.12%)`	⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2c51b9...82edab0. Read the comment docs.

ejholmes · 2018-01-27T02:20:45Z

Functional tests are all passing, so the only next steps to get this mergeable should be:

Implement --tail flag.
Implement --dump flag.
Re-implement the loop logger.

ejholmes · 2018-01-27T04:56:56Z

@phobologic when you wanna review this, this should be 100% feature complete and pretty close to being mergeable.

A lot of your original comments were around the implementation of py-dag. I'm not really inclined to change it very much, since it's algorithmically correct and has good test coverage. Would you rather pull in py-dag as a third party dependency, or leave it vendored internally as is?

phobologic · 2018-01-29T23:19:31Z

A few of my comments are around things that look dangerous or error prone (duplicate key comment, the comment about ignoring mising edges)- if you think they aren't and can explain it, I'd be fine ignoring those.

Others are just from general usefulness. Better errors, easier to read code, etc.

I'm fine with skipping the stuff about children/descendants, since I know that's mostly just graph theory terms being used instead.

All in all, I'll take a small performance hit if it means better maintainability in the future. That said, I don't expect my suggestions/questions to actually result in much of a hit at the scale we're operating and due to the fact that most of the time is spent in talking with AWS.

I think we should go ahead and vendor - though we should give the Author credit at the top of the code most likely.

Ok, now going to review the latest.

phobologic

A lot of docstrings need updating, and a few comments/concerns.

phobologic · 2018-01-29T23:20:18Z

stacker/__init__.py

+         tail=None,
+         stacks=None, stack_names=None,
+         reverse=False):
+    """A simple helper that builds a graph based plan from a set of stacks."""


Can you do the full Args/Returns docstring here? Also, is plan a temporary thing until we remove the old plan?

phobologic · 2018-01-29T23:22:12Z

stacker/plan2.py

+
+def build_plan(description=None, steps=None,
+               step_names=None, reverse=False):
+    graph = build_graph(steps)


if steps is None, it'll break build_graph. Not sure what the right behavior should be.

I can't think of a scenario where that could actually happen from the stacker CLI, since there's config validation to ensure that stacks isn't empty, but I'll add a test for that here.

phobologic · 2018-01-29T23:23:21Z

stacker/plan2.py

+
+class Step(object):
+    """State machine for executing generic actions related to stacks.
+    Args:


Needs it's args updated.

phobologic · 2018-01-29T23:23:32Z

stacker/plan2.py

+            with this step
+    """
+
+    def __init__(self, stack, fn=None, watch_func=None,


Is fn really optional at this point?

phobologic · 2018-01-29T23:24:16Z

stacker/plan2.py

+        self.set_status(SUBMITTED)
+
+
+def build_plan(description=None, steps=None,


description seems like it should be required, as well as steps?

phobologic · 2018-01-29T23:34:51Z

stacker/tests/test_plan2.py

+
+        plan = build_plan(description="Test", steps=[Step(vpc), Step(bastion)])
+
+        self.assertEqual(plan.graph.dag.graph, {


These kind of lines are part of my concern with having a separate graph class. It's a little confusing. Probably not a huge deal, just not straight forward.

This is really just a bad test on my part. dag should be private within the Graph class, but no way to enforce that with python :(.

I'll add a to_dict method to Graph, to make testing this cleaner.

Cool - I also just realized the name of this test is probably wrong.

phobologic · 2018-01-29T23:43:34Z

stacker/plan2.py

+    def walk(self):
+        """Walks each step in the underlying graph, in topological order."""
+
+        def walk_func(step):


We should probably do something here to catch any errors that are raised, and prepend/append info about what step caused the issue - either here or on Graph.

I think we definitely should make that change, since it's annoying when a stack throws an exception, and you have no context about what stack has the issue. But we should probably make that change separate from this PR, since that's not existing behavior.

We'll also need to think about how we handle exceptions in a parallelized world, since the current method of just throwing an exception and aborting won't work with multi-threads/processes.

phobologic · 2018-01-30T00:11:10Z

stacker/tests/test_plan2.py

+            build_plan(
+                description="Test",
+                steps=[Step(vpc, None), Step(db, None), Step(app, None)])
+        message = ("Error detected when adding 'namespace-db.1' "


You know, you probably wouldn't need this message test if we had different exceptions ;) Just saying.

phobologic · 2018-01-30T00:12:05Z

tests/suite.bats

@@ -629,10 +629,7 @@ EOF
  assert "$status" -eq 1


Curious why we removed the asserts here?

In the implementation in this PR, if a node in the graph fails (returns False), the entire graph walk will halt. It's not actually necessary to halt the entire graph walk, only nodes that depend on the failed node. I can update the implementation of the graph walk to support the old behavior.

phobologic · 2018-01-30T00:13:00Z

stacker/__init__.py

 __version__ = "1.1.4"
+
+
+def plan(description=None, action=None,


What's the plan going forward with this? We probably can't name this function "plan" if we move plan2 to plan.

The plan (no pun intended) would be to replace plan.py with plan2.py before merging, unless we want to keep the original plan.py around for backwards compatibility. I only went with plan2.py because overwriting plan.py in the PR would make reviewing it pretty hard, since it's a re-write.

Right, but what about the plan method? If you have a plan method in stacker/__init__.py and a stacker/plan directory, you're going to have conflicts when someone does something like: from stacker import plan or from stacker.plan import * (maybe not on the last one, I honestly don't know!)

Ah, gotcha. I think this plan method should actually be moved to stacker.actions.base, since that's where it's intended to be used.

ejholmes · 2018-01-30T03:13:17Z

Alright @phobologic. I think I've addressed a majority of the comments here.

I also reverted back to the original behavior for failed child stacks, so that the entire graph of stacks is executed, but a given stack will not execute if any of it's dependencies are in a "not ok" state (see c0f3714)

ejholmes · 2018-01-31T01:45:00Z

@Lowercases found a bug where CancelExecution wasn't handled (pressing n in interactive mode). Added a commit that handles this flow, and also an integration test, since we didn't have one.

phobologic

Looks good - let me submit my PR, it's simple quality of life changes.

phobologic · 2018-01-31T16:34:40Z

stacker/actions/base.py

+        description (str): a description of the plan.
+        action (func): a function to call for each stack.
+        stacks (list): a list of :class:`stacker.stack.Stack` objects to build.
+        targets (list): an optional list of targets to filter the graph to.


You're right, targets is definitely a better name for this. Thanks for thinking of that.

phobologic · 2018-01-31T23:51:54Z

stacker/plan2.py

+
+
+class Graph(object):
+    """Graph represents a graph of steps.


This is super helpful - thanks!

ejholmes · 2018-02-01T02:00:06Z

Cool. I'll update us internally to run this branch before merging.

ejholmes · 2018-02-02T00:22:05Z

There's some talk about whether or not the --stacks flag should stop at the given stack, or continue to execute stacks that transitively depend on the provided stack.

Let's say you have a graph of stacks like the following:

In the current behavior (this pr), if you specify --stacks db, stacker will effectively filter the graph to the following, only including nodes in the graph that are transitive dependencies of db:

If there's a change to the db stack (like a new instance url or w/e), then the app stack won't see it.

We could change the behavior here so that the graph includes the dependencies of db, as well as anything that transitively depends on it. So the graph would be filtered to:

In some configurations, this could easily blow up the size of the graph (e.g. if a parent also depends on a bunch of stacks), but makes it more "correct".

acmcelwee · 2018-02-02T03:33:56Z

Personally, the expected behavior of the --stacks flag is as you described it in the first example, and how it's currently implemented in the PR. That said, I think the second behavior could be useful in other occasions. It might be worth thinking about how both could coexist with different flags or modes.

russellballestrini · 2018-02-02T16:08:49Z

The version is this PR is faster but I'm pretty sure the version in the 3rd chart is safer (and puts the system as a whole in a more correct state).

I would prefer correctness over speed.

I like @acmcelwee point, if we can support both, (--stacks-fast) I think that could be really cool.

ejholmes · 2018-02-02T23:22:24Z

Yeah, that makes sense to me. I think I'll update this PR to the more complete version.

If there's a need for the old behavior, we can add new flags for that.

ejholmes · 2018-02-03T02:50:24Z

Aaaaand...I think I'm gonna flipflop my preference back to the fast version. I tried the complete version internally, and it's pretty easy for the graph to explode making --stacks mostly useless.

I'll leave out specific stack names, but on our internal graph, when I targeted a stack that builds out an ECS cluster, the fast versions results in 4 stacks (3 dependencies). With the full version, the graph explodes to 100 stacks as a result of the parents transitive dependencies.

This changes the internals of stacker plan's to first generate a Directed Acyclic Graph, and uses this as a means of walking the graph.

ejholmes · 2018-02-06T01:18:28Z

Thanks everyone for the comments and feedback. Super excited about what this is gonna enable for stacker moving forward.

Generate a DAG internally

ejholmes mentioned this pull request Jan 26, 2018

DAG #406

Closed

9 tasks

phobologic requested changes Jan 26, 2018

View reviewed changes

ejholmes force-pushed the dag-1.1 branch from 1040cd6 to d03e832 Compare January 27, 2018 01:50

ejholmes force-pushed the dag-1.1 branch 2 times, most recently from 1fba1c8 to f9ba218 Compare January 27, 2018 04:55

phobologic requested changes Jan 30, 2018

View reviewed changes

ejholmes force-pushed the dag-1.1 branch from cd0c0f6 to c0f3714 Compare January 30, 2018 03:17

phobologic approved these changes Jan 31, 2018

View reviewed changes

ejholmes added 2 commits February 5, 2018 17:13

Generate a DAG internally

87cbc8e

This changes the internals of stacker plan's to first generate a Directed Acyclic Graph, and uses this as a means of walking the graph.

Update CHANGELOG.md

82edab0

ejholmes force-pushed the dag-1.1 branch from f6dea16 to 82edab0 Compare February 6, 2018 01:14

ejholmes merged commit 464123b into master Feb 6, 2018

ejholmes deleted the dag-1.1 branch February 6, 2018 01:18

ejholmes mentioned this pull request Feb 7, 2018

Parallel graph walk #531

Merged

4 tasks

ejholmes added this to the 1.2 milestone Mar 1, 2018

phrohdoh pushed a commit to phrohdoh/stacker that referenced this pull request Dec 18, 2018

Merge pull request cloudtools#523 from remind101/dag-1.1

70603cd

Generate a DAG internally

		self.set_status(SUBMITTED)


		def build_plan(description=None, steps=None,


		plan = build_plan(description="Test", steps=[Step(vpc), Step(bastion)])

		self.assertEqual(plan.graph.dag.graph, {

		__version__ = "1.1.4"


		def plan(description=None, action=None,

DAG #523

DAG #523

Conversation

ejholmes commented Jan 26, 2018 • edited Loading

Not parallelized

Notes for reviewer

TODO

phobologic left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Jan 27, 2018 • edited Loading

Codecov Report

ejholmes commented Jan 27, 2018

ejholmes commented Jan 27, 2018 • edited Loading

phobologic commented Jan 29, 2018

phobologic left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ejholmes commented Jan 30, 2018 • edited Loading

ejholmes commented Jan 31, 2018

phobologic left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ejholmes commented Feb 1, 2018

ejholmes commented Feb 2, 2018 • edited Loading

acmcelwee commented Feb 2, 2018

russellballestrini commented Feb 2, 2018

ejholmes commented Feb 2, 2018

ejholmes commented Feb 3, 2018

ejholmes commented Feb 6, 2018

ejholmes commented Jan 26, 2018 •

edited

Loading

codecov-io commented Jan 27, 2018 •

edited

Loading

ejholmes commented Jan 27, 2018 •

edited

Loading

ejholmes commented Jan 30, 2018 •

edited

Loading

ejholmes commented Feb 2, 2018 •

edited

Loading