Feat: A more robust provider finder for sessions (for now) and soon for all bitswap #60
Conversation
Some tests are mildly unreliable; I need to work on them a bit more.
Force-pushed from 7478032 to bdaf8b3
Tests for this package should be more reliable now (still need to work on the other bitswap ones).
Force-pushed from bdaf8b3 to 6af352a
Add a manager for querying providers on blocks, in charge of managing requests, deduping, and rate limiting
Add functionality to time out find-provider requests so they don't run forever
Add debug logging for the provider query manager and make tests more reliable
Removed a minor condition check that could fail in some cases purely due to timing rather than an actual code issue
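Taken together, those commits describe a query manager that dedupes requests per key, bounds how many provider queries run at once, and times each query out. A compact sketch of that shape is below; every identifier in it (providerQueryManager, findProviderWorker, the constants, the import paths, which vary by libp2p version) is an illustrative assumption rather than the PR's actual code.

```go
// Illustrative sketch only; not the PR's code.
package providerquerymanager

import (
	"context"
	"time"

	cid "github.com/ipfs/go-cid"
	peer "github.com/libp2p/go-libp2p/core/peer"
)

const (
	maxInProcessRequests = 6                // rate limit: concurrent provider queries
	maxProviders         = 10               // stop after this many providers per key
	findProviderTimeout  = 10 * time.Second // so a query can't run forever
)

// providerNetwork is the narrow slice of the bitswap network this manager needs.
type providerNetwork interface {
	ConnectTo(ctx context.Context, p peer.ID) error
	FindProvidersAsync(ctx context.Context, k cid.Cid, max int) <-chan peer.ID
}

type providerQueryManager struct {
	ctx        context.Context
	network    providerNetwork
	queryQueue chan cid.Cid // deduped keys waiting for a free worker
}

// startWorkers launches a fixed pool, which is what bounds concurrency.
func (pqm *providerQueryManager) startWorkers() {
	for i := 0; i < maxInProcessRequests; i++ {
		go pqm.findProviderWorker()
	}
}

// findProviderWorker runs one provider query at a time, each with its own timeout,
// and connects to every provider it finds so bitswap can request the block.
func (pqm *providerQueryManager) findProviderWorker() {
	for {
		select {
		case k := <-pqm.queryQueue:
			ctx, cancel := context.WithTimeout(pqm.ctx, findProviderTimeout)
			for p := range pqm.network.FindProvidersAsync(ctx, k, maxProviders) {
				_ = pqm.network.ConnectTo(ctx, p)
			}
			cancel() // release the timeout as soon as this query is done
		case <-pqm.ctx.Done():
			return
		}
	}
}
```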
Force-pushed from c5678f1 to 56d9e3f
}

type receivedProviderMessage struct {
	k cid.Cid
nitpick/nonblocking: should we get more explicit with these names? key instead of k, and session instead of ses?
Well, ses is gone. Honestly, I struggle with what to call k, because k is actually a CID, not a key. But key or k is widely used, and cid is usually the name of the package, unfortunately. Maybe contentID. What do you use?
Yeah, I've noticed the same; I've been using key and cid mixed in different places. Maybe we can start using just key consistently? And I'll update on my end too?
Yeah, I'd be down for that, but I want us all to agree on what we use, because it's a problem throughout the bitswap codebase.
What's the motivation for breaking up the "Network" interface that way?
wg.Add(1)
go func(p peer.ID) {
	defer wg.Done()
	err := pqm.network.ConnectTo(pqm.ctx, p)
Have you considered #51 (comment)? That plus switching to sessions everywhere seems like a more robust solution.
Yes, and I'm inclined not to address that until we get sessions-everywhere merged first. This is necessary to do that, hence the chicken-and-egg problem. Arguably it could all be one big PR, but I'm generally disinclined to do things that way. Let me know if it's ok to proceed in this manner.
findProviderCtx, cancel := context.WithTimeout(pqm.ctx, pqm.findProviderTimeout)
pqm.timeoutMutex.RUnlock()
defer cancel()
providers := pqm.network.FindProvidersAsync(findProviderCtx, k, maxProviders)
This is going to keep working even if we no longer need to find any providers.
I believe I have fixed this. The context is no longer derived directly from the overall query manager; instead, each key's request gets its own context, which is itself cancelled once all sessions requesting that key cancel their requests.
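A minimal sketch of that per-key cancellation, assuming a reference count of requestors per CID (the type and method names here are hypothetical, not the PR's; the PR does this bookkeeping in its serial run loop rather than behind a mutex):

```go
// Hypothetical sketch of per-key request contexts; not the PR's actual code.
package providerquerymanager

import (
	"context"
	"sync"

	cid "github.com/ipfs/go-cid"
)

// keyRequest tracks one in-flight find-provider query and how many sessions want it.
type keyRequest struct {
	ctx        context.Context
	cancel     context.CancelFunc
	requestors int
}

type requestTracker struct {
	managerCtx context.Context // the query manager's overall context
	mu         sync.Mutex
	inProgress map[cid.Cid]*keyRequest
}

// join returns the shared context for k, starting a new per-key request if needed.
func (t *requestTracker) join(k cid.Cid) context.Context {
	t.mu.Lock()
	defer t.mu.Unlock()
	req, ok := t.inProgress[k]
	if !ok {
		// Child of the manager's context, but cancellable on its own per key.
		ctx, cancel := context.WithCancel(t.managerCtx)
		req = &keyRequest{ctx: ctx, cancel: cancel}
		t.inProgress[k] = req
	}
	req.requestors++
	return req.ctx
}

// leave is called when one session cancels; the underlying query is cancelled
// only once every session that asked for this key has left.
func (t *requestTracker) leave(k cid.Cid) {
	t.mu.Lock()
	defer t.mu.Unlock()
	req, ok := t.inProgress[k]
	if !ok {
		return
	}
	req.requestors--
	if req.requestors == 0 {
		req.cancel()
		delete(t.inProgress, k)
	}
}
```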
	k: k,
}
// clear out any remaining providers
for range incomingProviders {
Out of curiosity, why do this drain?
I don't want to block in case there are outstanding sends.
but maybe there is a better way to do this.
Thought about it a bit more, and yeah, this is necessary. The reason is this: messages get processed serially in the run loop. One of those messages might be an incoming provider-found message that gets broadcast to everyone who is waiting. We don't want that broadcast to block, so we need to make sure the channel is read, emptied, and closed (which happens once the cancel is processed) before we proceed.
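In code, the drain is just a range over the channel until the run loop closes it. A tiny sketch (the function name and import path are assumptions, not the PR's):

```go
// Hypothetical sketch: drain leftover sends so the serial run loop never blocks.
package providerquerymanager

import peer "github.com/libp2p/go-libp2p/core/peer"

// drainAfterCancel discards providers that were already in flight when the
// requestor cancelled. Reading them keeps the run loop's broadcast sends from
// blocking; the loop exits once the run loop closes the channel, i.e. once the
// cancel has actually been processed.
func drainAfterCancel(incomingProviders <-chan peer.ID) {
	for range incomingProviders {
	}
}
```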
Removed session ID use completely from providerquerymanager
Force-pushed from 23f9b87 to 92717db
Make sure if all requestors cancel their request to find providers on a peer, the overall query gets cancelled
@Stebalien Well, the session peer manager now only uses the connection manager (other network calls go through the Provider Query Manager), so I figured why not simplify even further? It makes the tests simpler, among other things.
Almost good to go.
case <-pqm.ctx.Done():
	return
case <-sessionCtx.Done():
	pqm.providerQueryMessages <- &cancelRequestMessage{
Probably needs to select on pqm.ctx.Done().
Also, I think this needs to read from incomingProviders at the same time. Otherwise, we're relying on the buffer in providerQueryMessages being large enough to fit this message.
right. I hate go. :P will fix.
fixed, I think
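One way that fix can look, sketched under the same assumptions as above (hypothetical function shape; the real code may be structured differently): the requestor keeps draining incomingProviders while it tries to enqueue the cancel, and also bails out if the whole manager shuts down.

```go
// Hypothetical sketch of a cancel that can never deadlock the run loop.
package providerquerymanager

import (
	"context"

	cid "github.com/ipfs/go-cid"
	peer "github.com/libp2p/go-libp2p/core/peer"
)

// cancelRequestMessage is a stand-in for the run loop's cancel message type.
type cancelRequestMessage struct {
	k cid.Cid
}

// cancelProviderRequest tries to hand a cancel message to the run loop. While it
// waits for that send to go through, it keeps reading incomingProviders so the run
// loop's own sends never block, and it gives up if the whole manager shuts down.
func cancelProviderRequest(
	ctx context.Context, // the manager's context (pqm.ctx)
	k cid.Cid,
	messages chan<- *cancelRequestMessage, // pqm.providerQueryMessages
	incomingProviders <-chan peer.ID,
) {
	for {
		select {
		case messages <- &cancelRequestMessage{k: k}:
			// Cancel delivered: now drain until the run loop closes the channel.
			for range incomingProviders {
			}
			return
		case _, ok := <-incomingProviders:
			if !ok {
				// Channel already closed: the request finished before our cancel.
				return
			}
			// Discard the straggler and keep trying to send the cancel.
		case <-ctx.Done():
			return
		}
	}
}
```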
pqm.timeoutMutex.RLock()
findProviderCtx, cancel := context.WithTimeout(fpr.ctx, pqm.findProviderTimeout)
pqm.timeoutMutex.RUnlock()
defer cancel()
This defer won't run until the function returns.
Is that bad? This cancel now only serves to make sure the context gets cleaned up; the cancellation that fires when all the requests cancel belongs to the parent of this context.
Deferred functions get pushed onto a stack that keeps growing until the worker returns.
Basically, it'll leak a bunch of memory.
Got it. I get a linter error if I don't name it and call it. I moved the cancel call to right after the for-range-over-the-channel loop, which seems like the first safe time to cancel it.
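For illustration, the difference looks roughly like this (an illustrative worker loop, not the PR's code): the deferred cancels pile up until the worker goroutine exits, while cancelling right after the consuming loop releases each timeout context immediately.

```go
// Illustrative only: contrasts a deferred cancel inside a long-lived worker loop
// with cancelling as soon as each job is done.
package providerquerymanager

import (
	"context"
	"time"
)

const queryTimeout = 10 * time.Second // illustrative value

// leakyWorker defers cancel inside the loop: the defers stack up and none of them
// run until the worker returns, which for a long-lived worker is effectively never.
func leakyWorker(ctx context.Context, jobs <-chan func(context.Context)) {
	for job := range jobs {
		jobCtx, cancel := context.WithTimeout(ctx, queryTimeout)
		defer cancel() // BAD: one pending defer per job, held for the worker's lifetime
		job(jobCtx)
	}
}

// fixedWorker cancels right after the job (e.g. right after ranging over the
// provider channel), so each timeout context is released immediately.
func fixedWorker(ctx context.Context, jobs <-chan func(context.Context)) {
	for job := range jobs {
		jobCtx, cancel := context.WithTimeout(ctx, queryTimeout)
		job(jobCtx)
		cancel()
	}
}
```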
Keep channels unblocked when cancelling a request (refactored to a function). Also cancel the find-provider context as soon as it can be.
Force-pushed from 44fe93e to 30f40ec
We should probably test this on the gateway, if possible. This now blocks the provider workers on actually connecting to the providers, so it could have unintended consequences.
…provider: Feat: A more robust provider finder for sessions (for now) and soon for all bitswap. This commit was moved from ipfs/go-bitswap@bb89789.
Goals
Build a robust provider finder for sessions so that we can deduplicate provider logic (#52) and soon merge GetBlock w/ Session.GetBlocks (#49)
Implementation
A manager for find-provider requests that:
- dedupes requests for the same block across requestors
- rate limits how many provider queries run at once
- times out requests so they don't run forever
- cancels the underlying query once all requestors cancel their requests
A sketch of how a session would consume this follows below.
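From a session's point of view the whole thing collapses to ranging over a channel. A sketch of that consumption pattern, with the caveat that the interface, method name, and signature here are assumptions rather than the PR's exported API:

```go
// Hypothetical consumer-side view of the query manager.
package providerquerymanager

import (
	"context"

	cid "github.com/ipfs/go-cid"
	peer "github.com/libp2p/go-libp2p/core/peer"
)

// providerFinder is how a session might see the manager: one async method.
type providerFinder interface {
	FindProvidersAsync(ctx context.Context, k cid.Cid) <-chan peer.ID
}

// findAndUseProviders just ranges over the result channel; deduping, rate
// limiting, timeouts, and cancellation all live inside the manager, and the
// channel closes when the query finishes, times out, or is cancelled.
func findAndUseProviders(ctx context.Context, finder providerFinder, k cid.Cid, use func(peer.ID)) {
	for p := range finder.FindProvidersAsync(ctx, k) {
		use(p)
	}
}
```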
For discussion
Will write up the general concurrency architecture tomorrow