Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAG #523

Merged
merged 2 commits into from
Feb 6, 2018
Merged

DAG #523

merged 2 commits into from
Feb 6, 2018

Conversation

ejholmes
Copy link
Contributor

@ejholmes ejholmes commented Jan 26, 2018

Third times the charm :)

Fixes #300
Fixes #186
Fixes #384

This is an internal refactor to change the Plan implementation to be backed by a Directed Acyclic Graph. Most of this work was already done a long time ago in #406, but this just brings it up to speed with the master branch, with a few important caveats:

Not parallelized

This implementation is not parallelized. This PR paves the way to make it easy to implement a 100% parallelized graph walk, but I didn't implement it here, primarily to make review easier, but also because it doesn't matter much based on the current implementation (more on that below). I think we should get the graph in master, and then add a parallelized walk after the fact since that will need special attention to deal with concurrency issues (logging, user input, etc).

In theory, this sounds like it should reduce performance, but ironically it has little impact. I benchmarked before and after on our internal stacker graph, which is pretty big:

before (stacker 1.1.3)

$ time stacker build

real    8m31.716s
user    0m0.139s
sys     0m0.100s

after (dag)

$ time stacker build

real    8m33.819s
user    0m0.142s
sys     0m0.094s

However, because the --stacks flag now works like you'd expect it to, this branch can actually make most workflows a lot faster, even without concurrency.

Notes for reviewer

For the time being, I'd like to focus on functionality, and making sure everything is feature complete, and all of the integration tests pass. There's probably some missing doc's, and I'll get to those things later.

Also, to make review easier, instead of making changes to stacker.plan I made a new stacker.plan2, and swapped out call sites to use the new graph based plan implementation. This will likely replace stacker.plan before merging, unless we want to keep the old stacker.plan around for API compatibility.

The stacker.dag module probably doesn't need much attention, since it's already been reviewed previously, and is stilled based on py-dag. Feel free to review it if you want.

TODO

  • Bring back the check pointed loop logger (I just changed this to a simplified logger to make refactoring easier)
  • Bring back some removed tests.
  • Re-implement the --tail flag.
  • Re-implement dump

@ejholmes ejholmes mentioned this pull request Jan 26, 2018
Closed
9 tasks
Copy link
Member

@phobologic phobologic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped the tests for now, let me know when you're ready for me to take a look at those.


def add_node(self, node_name, graph=None):
""" Add a node if it does not exist yet, or error out. """
if not graph:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any idea why this library does this so often? I haven't seen this pattern used really ever before. I thought maybe for testing, but it seems like a messy way to handle that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clue. It seems unnecessary. I'll try to add a commit that removes this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yeah, I couldn't figure it out - I thought it'd be a test thing, but just now looking at tests I don't think it is. Strange.

raise KeyError('node %s does not exist' % dep_node)
test_graph = deepcopy(graph)
test_graph[ind_node].add(dep_node)
is_valid, message = self.validate(test_graph)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the validate method should throw the exception itself, and have anything that calls it deal with the results if they want to catch them.

if not graph:
graph = self.graph
if dep_node not in graph.get(ind_node, []):
raise KeyError('this edge does not exist in graph')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more info to this exception? I actually think his overuse of KeyError, instead of custom Exceptions, is a little sloppy. In any case, I think it's going to be necessary to at least have the two nodes in the exception, so we have a useful error message.

graph = self.graph
return [key for key in graph if node in graph[key]]

def downstream(self, node, graph=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

descendants maybe to match predecessors more?

When I was cleaning this up I called them: ancestors (recursive parents) and descendants (recursive children) - I also had children and parents, as well as properties for has_children, has_descendants, etc. has_children and has_descendants were both the same call, because you couldn't have descendants if you didn't have children.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, this is what I woul dhave called children - all_downstream is what I would have called descendants


# Now, rebuild the graph for each node that's present.
for node, edges in graph.items():
if node in dag.graph:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if seems dangerous, since we silently throw away any nodes that aren't in dag.graph, which really shouldn't ever happen, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what should happen. It's iterating through the full graph of nodes, and only re-adding edges for nodes that exist in the filtered graph.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, my bad - mixing dag & graph in this method made it confusing. Maybe having it be original_graph and filtered_graph would make it less confusing. Not a huge deal for now.

if graph is None:
graph = self.graph

dependent_nodes = set(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a really hard time reading this - any way to dumb it down that isn't going to destroy performance?

else:
raise ValueError('graph is not acyclic')

def size(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also give this class a __len__ method so that calling len() on it would return the size.

self.stack = stack
self.dependency = dependency
self.exception = exception
message = ("Error detected when adding '%s' "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message will be broken if stack & dependency aren't set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where this is used, those things are never not present.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - you should make them required then, so we don't hit an edge case.

return graph


class Graph(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if this should just be combined with the Plan class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally started with that, but I think it's cleaner to keep these things separate:

  1. Graph encapsulates a graph of Step objects, and has operations to make working on the graph itself easy (transposition, walking, etc).
  2. Plan is more like a convenience class for Actions to consume (dump, outline, etc).

Plan could probably eventually get removed, as most of the methods on it may be better somewhere else, but I primarily kept it around to keep API compatibility with the existing Plan implementation.

stacker/plan2.py Outdated
return self.step_names


class UnlimitedSemaphore(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just planning ahead for multi-process/threading?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is leftover from multi-threading. I'll remove this since it's unnecessary right now.

@codecov-io
Copy link

codecov-io commented Jan 27, 2018

Codecov Report

Merging #523 into master will increase coverage by 0.41%.
The diff coverage is 92.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
+ Coverage   88.72%   89.14%   +0.41%     
==========================================
  Files          92       96       +4     
  Lines        5926     6401     +475     
==========================================
+ Hits         5258     5706     +448     
- Misses        668      695      +27
Impacted Files Coverage Δ
stacker/tests/test_context.py 99.05% <ø> (-0.03%) ⬇️
stacker/commands/stacker/build.py 83.33% <ø> (ø) ⬆️
stacker/tests/test_stacker.py 97.82% <ø> (-0.26%) ⬇️
stacker/tests/actions/test_build.py 96.75% <100%> (-0.34%) ⬇️
stacker/context.py 97.26% <100%> (-0.08%) ⬇️
stacker/tests/test_plan.py 100% <100%> (ø) ⬆️
stacker/exceptions.py 90.43% <100%> (+0.61%) ⬆️
stacker/util.py 64.98% <100%> (+0.28%) ⬆️
stacker/tests/test_dag.py 100% <100%> (ø)
stacker/tests/actions/test_destroy.py 100% <100%> (+1.12%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2c51b9...82edab0. Read the comment docs.

@ejholmes
Copy link
Contributor Author

Functional tests are all passing, so the only next steps to get this mergeable should be:

  1. Implement --tail flag.
  2. Implement --dump flag.
  3. Re-implement the loop logger.

@ejholmes ejholmes force-pushed the dag-1.1 branch 2 times, most recently from 1fba1c8 to f9ba218 Compare January 27, 2018 04:55
@ejholmes
Copy link
Contributor Author

ejholmes commented Jan 27, 2018

@phobologic when you wanna review this, this should be 100% feature complete and pretty close to being mergeable.

A lot of your original comments were around the implementation of py-dag. I'm not really inclined to change it very much, since it's algorithmically correct and has good test coverage. Would you rather pull in py-dag as a third party dependency, or leave it vendored internally as is?

@phobologic
Copy link
Member

A few of my comments are around things that look dangerous or error prone (duplicate key comment, the comment about ignoring mising edges)- if you think they aren't and can explain it, I'd be fine ignoring those.

Others are just from general usefulness. Better errors, easier to read code, etc.

I'm fine with skipping the stuff about children/descendants, since I know that's mostly just graph theory terms being used instead.

All in all, I'll take a small performance hit if it means better maintainability in the future. That said, I don't expect my suggestions/questions to actually result in much of a hit at the scale we're operating and due to the fact that most of the time is spent in talking with AWS.

I think we should go ahead and vendor - though we should give the Author credit at the top of the code most likely.

Ok, now going to review the latest.

Copy link
Member

@phobologic phobologic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of docstrings need updating, and a few comments/concerns.

tail=None,
stacks=None, stack_names=None,
reverse=False):
"""A simple helper that builds a graph based plan from a set of stacks."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do the full Args/Returns docstring here? Also, is plan a temporary thing until we remove the old plan?

stacker/plan2.py Outdated

def build_plan(description=None, steps=None,
step_names=None, reverse=False):
graph = build_graph(steps)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if steps is None, it'll break build_graph. Not sure what the right behavior should be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a scenario where that could actually happen from the stacker CLI, since there's config validation to ensure that stacks isn't empty, but I'll add a test for that here.


class Step(object):
"""State machine for executing generic actions related to stacks.
Args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs it's args updated.

stacker/plan2.py Outdated
with this step
"""

def __init__(self, stack, fn=None, watch_func=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is fn really optional at this point?

stacker/plan2.py Outdated
self.set_status(SUBMITTED)


def build_plan(description=None, steps=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description seems like it should be required, as well as steps?


plan = build_plan(description="Test", steps=[Step(vpc), Step(bastion)])

self.assertEqual(plan.graph.dag.graph, {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kind of lines are part of my concern with having a separate graph class. It's a little confusing. Probably not a huge deal, just not straight forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really just a bad test on my part. dag should be private within the Graph class, but no way to enforce that with python :(.

I'll add a to_dict method to Graph, to make testing this cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - I also just realized the name of this test is probably wrong.

def walk(self):
"""Walks each step in the underlying graph, in topological order."""

def walk_func(step):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably do something here to catch any errors that are raised, and prepend/append info about what step caused the issue - either here or on Graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we definitely should make that change, since it's annoying when a stack throws an exception, and you have no context about what stack has the issue. But we should probably make that change separate from this PR, since that's not existing behavior.

We'll also need to think about how we handle exceptions in a parallelized world, since the current method of just throwing an exception and aborting won't work with multi-threads/processes.

build_plan(
description="Test",
steps=[Step(vpc, None), Step(db, None), Step(app, None)])
message = ("Error detected when adding 'namespace-db.1' "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, you probably wouldn't need this message test if we had different exceptions ;) Just saying.

tests/suite.bats Outdated
@@ -629,10 +629,7 @@ EOF
assert "$status" -eq 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why we removed the asserts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the implementation in this PR, if a node in the graph fails (returns False), the entire graph walk will halt. It's not actually necessary to halt the entire graph walk, only nodes that depend on the failed node. I can update the implementation of the graph walk to support the old behavior.

__version__ = "1.1.4"


def plan(description=None, action=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the plan going forward with this? We probably can't name this function "plan" if we move plan2 to plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan (no pun intended) would be to replace plan.py with plan2.py before merging, unless we want to keep the original plan.py around for backwards compatibility. I only went with plan2.py because overwriting plan.py in the PR would make reviewing it pretty hard, since it's a re-write.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but what about the plan method? If you have a plan method in stacker/__init__.py and a stacker/plan directory, you're going to have conflicts when someone does something like: from stacker import plan or from stacker.plan import * (maybe not on the last one, I honestly don't know!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha. I think this plan method should actually be moved to stacker.actions.base, since that's where it's intended to be used.

@ejholmes
Copy link
Contributor Author

ejholmes commented Jan 30, 2018

Alright @phobologic. I think I've addressed a majority of the comments here.

I also reverted back to the original behavior for failed child stacks, so that the entire graph of stacks is executed, but a given stack will not execute if any of it's dependencies are in a "not ok" state (see c0f3714)

@ejholmes
Copy link
Contributor Author

@Lowercases found a bug where CancelExecution wasn't handled (pressing n in interactive mode). Added a commit that handles this flow, and also an integration test, since we didn't have one.

Copy link
Member

@phobologic phobologic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - let me submit my PR, it's simple quality of life changes.

description (str): a description of the plan.
action (func): a function to call for each stack.
stacks (list): a list of :class:`stacker.stack.Stack` objects to build.
targets (list): an optional list of targets to filter the graph to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, targets is definitely a better name for this. Thanks for thinking of that.



class Graph(object):
"""Graph represents a graph of steps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super helpful - thanks!

@ejholmes
Copy link
Contributor Author

ejholmes commented Feb 1, 2018

Cool. I'll update us internally to run this branch before merging.

@ejholmes
Copy link
Contributor Author

ejholmes commented Feb 2, 2018

There's some talk about whether or not the --stacks flag should stop at the given stack, or continue to execute stacks that transitively depend on the provided stack.

Let's say you have a graph of stacks like the following:

In the current behavior (this pr), if you specify --stacks db, stacker will effectively filter the graph to the following, only including nodes in the graph that are transitive dependencies of db:

If there's a change to the db stack (like a new instance url or w/e), then the app stack won't see it.

We could change the behavior here so that the graph includes the dependencies of db, as well as anything that transitively depends on it. So the graph would be filtered to:

In some configurations, this could easily blow up the size of the graph (e.g. if a parent also depends on a bunch of stacks), but makes it more "correct".

@acmcelwee
Copy link
Member

Personally, the expected behavior of the --stacks flag is as you described it in the first example, and how it's currently implemented in the PR. That said, I think the second behavior could be useful in other occasions. It might be worth thinking about how both could coexist with different flags or modes.

@russellballestrini
Copy link
Member

The version is this PR is faster but I'm pretty sure the version in the 3rd chart is safer (and puts the system as a whole in a more correct state).

I would prefer correctness over speed.

I like @acmcelwee point, if we can support both, (--stacks-fast) I think that could be really cool.

@ejholmes
Copy link
Contributor Author

ejholmes commented Feb 2, 2018

Yeah, that makes sense to me. I think I'll update this PR to the more complete version.

If there's a need for the old behavior, we can add new flags for that.

@ejholmes
Copy link
Contributor Author

ejholmes commented Feb 3, 2018

Aaaaand...I think I'm gonna flipflop my preference back to the fast version. I tried the complete version internally, and it's pretty easy for the graph to explode making --stacks mostly useless.

I'll leave out specific stack names, but on our internal graph, when I targeted a stack that builds out an ECS cluster, the fast versions results in 4 stacks (3 dependencies). With the full version, the graph explodes to 100 stacks as a result of the parents transitive dependencies.

This changes the internals of stacker plan's to first generate a
Directed Acyclic Graph, and uses this as a means of walking the graph.
@ejholmes
Copy link
Contributor Author

ejholmes commented Feb 6, 2018

Thanks everyone for the comments and feedback. Super excited about what this is gonna enable for stacker moving forward.

@ejholmes ejholmes mentioned this pull request Feb 7, 2018
4 tasks
@ejholmes ejholmes added this to the 1.2 milestone Mar 1, 2018
phrohdoh pushed a commit to phrohdoh/stacker that referenced this pull request Dec 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants