Skip to content
This repository has been archived by the owner on Feb 3, 2018. It is now read-only.

Refactor sourceMgr #196

Merged
merged 43 commits into from
Apr 4, 2017
Merged

Refactor sourceMgr #196

merged 43 commits into from
Apr 4, 2017

Conversation

sdboyer
Copy link
Owner

@sdboyer sdboyer commented Mar 20, 2017

This PR refactors the sourceMgr in several areas:

The changes here also pave the way for persistent caching - #130.

Smaller TODOs:

  • Rename update() to fetch() in our vcs repo impl

Copy link
Contributor

@spenczar spenczar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of big anonymous closures like this is dangerous. They're tricky to read, because you have to scan all the surrounding code to understand which things are free variables and which are created within. That leads to shadowing bugs like #187 quite quickly.

It also complicates tests and makes it difficult to follow logic while debugging.

I'd really encourage you to consider breaking those anonymous closures out, making them named functions or methods, so it's clearer what their inputs and outputs are. My experience with this was in debugging that error shadowing case, and it was only hard because of the use of closures and callbacks.

Ultimately, that's a style thing, not a correctness thing, so it's your call, of course. But I think correctness (or incorrectness!) would be easier to see.

deduce.go Outdated
callMgr: cm,
ctx: cm.getLifetimeContext(),
rootxt: radix.New(),
deducext: pathDeducerTrie(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a line for action: make(chan func()) here

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

deduce.go Outdated
for {
select {
case <-dc.ctx.Done():
close(dc.action)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This risks either a double close or a read of a nil action in the next loop iteration. Do you need to return in this branch?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, totally, thank you

deduce.go Outdated
@@ -690,6 +695,253 @@ func (sm *SourceMgr) deduceFromPath(path string) (deductionFuture, error) {
}, nil
}

type deductionCoordinator struct {
ctx context.Context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context package warns against using contexts this way: "Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it."

I think it's good advice. Do you have a reason you need the context to be a struct field?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really follow most of that discussion, but at the very least, this is unusual, which in the Go community is a strong knock against it :)

It appears that the stored context is only really used in deduceRootPath to check whether the coordinator has already been terminated. The fact that it's done with a stored context (but other times a context is passed in to the function call - why? what's the difference?) will make it harder to see how you go about canceling calls, I think. I could be wrong! There might be more inherent complexity here than I see, so maybe that difficulty is going to be here no matter what.

But I agree with Peter that I've regretted doing things this way every time. Letting the context flow down through method invocations has never confused me in the same way.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so, I realize I actually conflated two things. THAT stored context doesn't necessarily need to be there; it could be grabbed from the callManager on the fly in the actor loop. But that doesn't really change much, as there's still a context on the callManager struct.

The basic issue here comes down to there being two different possible origins for cancellation of any given SourceManager method call:

  1. A cancellation of the context passed directly into the method - the standard Go way of doing it
  2. A cancellation from the long-lived context (e.g., a signal is sent to a dep process, which then needs to tell the sourceMgr to shut down via a context that's injected in the NewSourceManager() call), which must propagate to all open calls

There are different possible approaches to this, but that's the essential reality, and dealing with it entails combining cancelation paths somehow.

Copy link
Contributor

@spenczar spenczar Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think vanilla contexts handle that pretty well with just one tree of contexts, I don't see how you need to combine cancelation paths. I imagine it looking something like this:

func main() {
  ctx, cancel := context.WithCancel(context.Background())
  go cancelOnSignal(cancel) // yknow, set up a loop on a chan signal.Signal, blah blah

  sm := NewSourceManager(ctx, [...]) 
  sm.doStuff(ctx)
}

If the root context is canceled because of a signal, that gets propagated down to all the children, which will be all the open calls, like you want.

For the second case, if you want a 'locally' cancelable function, you call context.WithCancel again. The newly generated context will cancel if the old papa context made in main gets canceled, or if the new local cancel method is called. But the local cancel method won't close any of the context's ancestors. This seems to be what you want, right?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The little code block is fine for that set of calls - but in a dep run that includes a solve the vast majority of SourceManager calls are coming from the solver, not dep itself. And yes, we could thread a context through that as well (in fact, doing so might be quite useful to establish running time limits on the algorithm), and that could be subctx'd down into the calls the solver makes to the SourceManager.

But while that's a fine guarantee for an implementation to provide, within gps, we have no idea if the caller has wired the long-lived context together with the ephemeral call contexts. To ensure correctness, we still have to internally combine cancelation.

For the second case, if you want a 'locally' cancelable function, you call context.WithCancel again. The newly generated context will cancel if the old papa context made in main gets canceled, or if the new local cancel method is called. But the local cancel method won't close any of the context's ancestors. This seems to be what you want, right?

Yes, and this is mostly the counterargument that Peter made on reddit. (I also did up what you're describing in some pseudocode).

I'm not saying it's not possible to do that way. I'm saying it's fundamentally equivalent to the constext approach. And, while constext is a new thing to learn, it's also a reasonably well-defined formalism. Given that we have to solve this problem, I'm interested in exploring the formalism, rather than just following the guideline...for now 😄 .

Copy link
Contributor

@spenczar spenczar Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could thread a context through that as well

Yep! That's exactly what I'd expect to see naively.

But while that's a fine guarantee for an implementation to provide, within gps, we have no idea if the caller has wired the long-lived context together with the ephemeral call contexts. To ensure correctness, we still have to internally combine cancelation.

I don't mean to be dense, but I'm still not totally getting this. If the calls are coming from the solver, not dep itself, then we seem to be sure of the provenance of the contexts, right? We know they came as children made by the solver off of the original one from dep (or some other implementation using the solver).

Are you concerned that there are multiple entrypoints to this code, and you want to be sure the caller is using the same context for them? If so, I think the right approach is to shrink the entrypoints. Do you have a concrete case here? That might help me understand.

Yes, and this is mostly the counterargument that Peter made on reddit. (I also did up what you're describing in some pseudocode).

Hm, this pseudocode isn't what I was thinking of. There's no need for quitchan there - you'd just call cancel() at the end of the function directly. Each spawned goroutine is checking whether the context is closed before (or while) it does significant work like network calls. If you want to only exit once all spawned goroutines have exited, you keep track of that with sync.WaitGroup.

I agree that both approaches eventually work, of course. No question there! This is more about how easy those approaches are to keep up and maintain. I just think it's simpler to think about the one flow of contexts than this idea of merging two.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rogpeppe has also responded on reddit, with a hopefully-clearer explanation :)

Are you concerned that there are multiple entrypoints to this code, and you want to be sure the caller is using the same context for them? If so, I think the right approach is to shrink the entrypoints.

Yes. I don't see how it would be possible to shrink the entrypoints - there's the creation of the sourceManager, and there's calling methods on it - the context for the former cancels ALL running activities, and the context for the latter cancels only the current method.

These contexts are probably from the same tree, but we cannot know if they are from within the sourceMgr, so we have to take precautions.

There's no need for quitchan there - you'd just call cancel() at the end of the function directly.

Without the quitchan, we don't know to cancel this running call if the long-lived context was canceled. We'd have to make the assumption that the two contexts are part of the same tree.

If you want to only exit once all spawned goroutines have exited, you keep track of that with sync.WaitGroup.

Yep, but this is entirely separate.

I just think it's simpler to think about the one flow of contexts than this idea of merging two.

I agree, if we could make that assumption, it would be the simpler way to do it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! That makes a lot more sense. The root cause here is that you don't trust the caller to responsibly use the right context value.

So, the design tradeoff you're making is that you can be more confident that this code will work for callers who do weird things with the package (and using different contexts would be weird), but you are sacrificing simplicity to gain that confidence.

I don't think I agree that that's the right design decision here since gps isn't ever going to have, like, tons of direct users. It's likely that dep is the only user for quite a while. I think clear documentation on the source manager's constructor (use the same context, or children of the original context, whenever you call its methods) would be sufficient.

But! That's just my feeling. I think we've actually gotten to the core of things now, and we have the tradeoff much more clearly on the table, which is good enough for me. 👍

deducers.go Outdated

// The rest of the work needs its own goroutine, the results of which will
// be re-joined to this call via the return chans.
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making these anonymous closures into named functions or methods. That way, it's clear what the inputs and outputs are.

In this case, I think it's something like

func (sc *sourceCoordinator) computeSourceGateway(ctx context.Context, name string, rc srcReturnChans)

and the invocation becomes go sc.computeSourceGateway(ctx, normalizedName, rc).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Yeah, I've been leaning in that direction; I went heavy pseudo-actor model initially, and am slowly walking it back. The general issue I have with it is distinguishing between things that should be called by the wrapper object, vs those that should not. That's a smaller design consideration, though - like I said, I'm walking it back :)

This one isn't shouldn't be a problem to make its own method for, but the other one (in deductionCoordinator.deduce() is absolutely not threadsafe unless it's being serialized through that channel.

deducers.go Outdated
sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]
sc.srcmut.RUnlock()
if has {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code might be quite a bit simpler if it used sync.Cond to rendezvous the goroutines instead of channels. As it is, there are lots of opportunities for races in this code. For example, here it looks to me like we're unlocking too early - this should be

sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]
if has {
  sc.srcmut.RUnlock()
  doReturn(srcGate, nil)
  return
}
sc.srcmut.RUnlock()

but of course this is all a sad state to be in - we've wasted time running deduceRootPath twice. Instead of guarding the map with a lock, I think we'd be better off guarding the function invocation. sync.Cond is the simplest way to do that.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...well, while this is certainly my first, untested pass at the code, I'm not seeing the opportunity for races here. And I don't see how it would be possible for deduceRootPath to be run multiple times for the same path.

The big reason there's so much staged locking/unlocking here is because we have to keep both the deduceRootPath() AND sourceURL() calls on a non-blocking path, as both may involve I/O, especially the latter.

That said, I'd totally forgotten that sync.Cond even exists. It may well be a much better way of coordinating these fold-in behaviors. I'll have a look!

Copy link
Contributor

@spenczar spenczar Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how it would be possible for deduceRootPath to be run multiple times for the same path.

Suppose getSourceGatewayFor is called twice with the same arguments in two different goroutines, G1 and G2.

G1 does the L164 logic - it grabs the srcmut read lock, sees that sc.nameToURL doesn't have normalizedName, so it releases the lock. Now G2 is scheduled, and does exactly the same thing. They're both now waiting right in front of the go call at L182.

G1 starts this goroutine - the one from L182 to L272. It grabs the psrcmut write lock, sees that there is no entry in p.protoSrcs, so it creates one. It continues doing work, including running deduceRootPath, all the way to running doReturn. Again, it grabs the write lock, sends its output on the right channel, and then calls delete(sc.protoSrcs, normalizedName). It releases the lock and exits.

Now G2 is scheduled. It starts at the top of the big goroutine. It grabs the psrcmut lock. It checks if sc.protoSrcs has normalized name. It doesn't, because G1 deleted its entry. So, G2 continues - it will run deduceRootPath again.

So, there we are. Certainly not a very common scheduling, but definitely possible.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, yes, that's possible - but that's the one I'm accounting for with this additional medial check we're commenting on here 😄

In that case, we do run deduceRootPath() twice, but the results will be coming back from a trie - no regex, and definitely no HTTP call. So we're not paying a significant cost in cycles. And there's no unsynchronized data access or duplication of resources, as we return out early in the (fixed) version of this check we're commenting on.

(Also, the original code change you suggested is equivalent to what's there - I just hoisted the Unlock() out of the conditional)

Are there other potential bad interleavings that you can see?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the trie thing is good - I didn't realize that! But that kind of proves my point here, I think, which is that wowzers this is pretty tricky code. I think sync.Cond could make it simpler. I'll scan for more bad schedulings, but no promises :)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No argument here! I'd definitely like to see this decomposed into more comprehensible pieces. IMO it's already less complected than it was in the old system (where root deduction and source creation were inextricably bound up in those stacks of futures), but there's still a ways to go.

deducers.go Outdated
// scheduled. Guard against that by checking the main sources map again
// and bailing out if we find an entry.
sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're checking here for normalizedName, but later we set based on url. Is that correct?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, thank you - I initially wrote this medial check before introducing the nameToURL map, and forgot to update it when I changed the check at the top.

deducers.go Outdated
doReturn(sa, nil)
return
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This anonymous closure runs to several pages long in my editor, which makes it tough to keep track of everything. Could it be broken up into smaller function calls that have clear inputs/outputs?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure we can get it there :) Per the above response, once I've weaned off of the closure in general towards a method, splitting it up into parts would be easier.

Copy link
Contributor

@spenczar spenczar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being receptive to review! 😀

deducers.go Outdated
sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]
sc.srcmut.RUnlock()
if has {
Copy link
Contributor

@spenczar spenczar Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how it would be possible for deduceRootPath to be run multiple times for the same path.

Suppose getSourceGatewayFor is called twice with the same arguments in two different goroutines, G1 and G2.

G1 does the L164 logic - it grabs the srcmut read lock, sees that sc.nameToURL doesn't have normalizedName, so it releases the lock. Now G2 is scheduled, and does exactly the same thing. They're both now waiting right in front of the go call at L182.

G1 starts this goroutine - the one from L182 to L272. It grabs the psrcmut write lock, sees that there is no entry in p.protoSrcs, so it creates one. It continues doing work, including running deduceRootPath, all the way to running doReturn. Again, it grabs the write lock, sends its output on the right channel, and then calls delete(sc.protoSrcs, normalizedName). It releases the lock and exits.

Now G2 is scheduled. It starts at the top of the big goroutine. It grabs the psrcmut lock. It checks if sc.protoSrcs has normalized name. It doesn't, because G1 deleted its entry. So, G2 continues - it will run deduceRootPath again.

So, there we are. Certainly not a very common scheduling, but definitely possible.

@codecov
Copy link

codecov bot commented Mar 22, 2017

Codecov Report

Merging #196 into master will increase coverage by 1.04%.
The diff coverage is 72.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #196      +/-   ##
==========================================
+ Coverage   77.17%   78.22%   +1.04%     
==========================================
  Files          27       28       +1     
  Lines        3987     4018      +31     
==========================================
+ Hits         3077     3143      +66     
+ Misses        692      655      -37     
- Partials      218      220       +2
Impacted Files Coverage Δ
rootdata.go 92.1% <ø> (ø) ⬆️
hash.go 81.35% <100%> (ø) ⬆️
bridge.go 75.91% <100%> (-0.25%) ⬇️
solver.go 83.87% <100%> (+0.09%) ⬆️
maybe_source.go 63.86% <58.1%> (-16.14%) ⬇️
vcs_repo.go 54.19% <60.6%> (+8.04%) ⬆️
version.go 77.45% <62.85%> (-1.34%) ⬇️
cmd.go 79.66% <66.66%> (-8.58%) ⬇️
source_cache.go 67.01% <67.01%> (ø)
vcs_source.go 77.17% <69.56%> (+4.11%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c36f26...01e3ba5. Read the comment docs.

@sdboyer
Copy link
Owner Author

sdboyer commented Mar 22, 2017

#202 is quite related to this, but I'm gonna leave it out of this PR to kinda sorta avoid scope creep just a teensy bit.

@sdboyer
Copy link
Owner Author

sdboyer commented Mar 29, 2017

Noting this here because I'm not sure if it'll remain an issue or not, so I don't want to make it a separate issue until this is merged. But, this PR as currently written has a problematic case. (The problem exists on master, too, but is even less organized in how it deals with the issue)

Say there's a ProjectRoot, github.com/foo/bar. When attempting to get the source handler for that project, the default deduction and maybeSource path will end up pointing to https://github.com/foo/bar. That's the first scheme it'll try, anyway.

The role of the maybeSource is interesting. It's the way that the root deduction system abstracts over which exact upstream URL (scheme, mostly) might be used to fulfill an import. maybeSource.try() needs to balance two things: returning as quickly as possible, but only returning if it's reasonably sure the source it's deduced actually exists.

There's two definitions we could use for "a source exists": it exists upstream, or it exists in the cache. The latter, being local, is faster to check, but potentially wrong about the state of the world - even if the source exists in local cache, the upstream could have moved, been deleted, become inaccessible due to expired credentials, etc.

For now, I think I'm going to just force the remote check to always happen. However, it seems likely that we may need to revisit this, and rely only on the local check initially. If/when we make that shift, we'll need to add a mechanism by which to reach back into the sourceCoordinator and change its various maps to correspond to the new URLs. This could present issues for the current logic in getSourceGatewayFor(), as it's all carefully crafted around the idea that those maps are essentially append-only.

This was really always the intended model - there's no reason a
SourceManager needs to be permanently coupled with just one analyzer.
It's perfectly sufficient to provide one as an argument to the relevant
methods.

Fixes #195.
* Move baseVCSSource into vcs_source.go
* Rename source_test.go to vcs_source_test.go
* Remove sourceExistence type and remaining uses of it
This test isn't exhaustive, but it covers the basics.
This more accurately reflects how we were actually using the method for
some time, as well as what we'd more recently modified the method to do.
@sdboyer
Copy link
Owner Author

sdboyer commented Apr 4, 2017

As-is, this is very likely to cause a perceived performance regression. It will end up serializing calls from the solver that previously had been parallelized, as it reverses the change made in 4815cd5.

The fundamental problem is, we can't avoid that regression while maintaining the invariant I described as valuable earlier.

However, this PR has been open for far too long, and diverged far too much as-is. Because it's accomplished its main purpose, I'm going to merge it now, and then we can revise and modify more incrementally.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
2 participants