improve query performance by limiting query width to KValue peers #291
Conversation
Force-pushed from 72782a9 to b9edb2c
I think if some of these can be dealt with in stand-alone PRs it will make it much more digestible. While you're poking around in this area, it always irks me that queries appear to be initialized with only the alpha closest peers, yet can expand to many more once closer peers come back in replies. If all of those initial alpha peers fail, the entire query fails. There's no reason not to start with a lot more peers while only actively querying alpha to begin with.
@anacrolix rationale here: #192 (comment). But I agree we need to recover from a poisoned start -- in fact, I'd say that's urgent.
Honestly, we could probably just seed with KValue peers. That should just work. I can break this up into separate PRs for commenting, but it should be merged all at once. I've broken it into reasonably logical commits that can be reviewed separately but, well, GitHub reviews don't play well with that workflow.
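To make that proposal concrete, here is a minimal, self-contained sketch (not code from this PR; the peer names and the query function are stand-ins): seed the query with KValue candidates so a few dead seeds can't poison it, while a channel semaphore keeps only AlphaValue lookups in flight at once.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	KValue     = 20 // bucket size: number of seed candidates
	AlphaValue = 3  // number of lookups allowed in flight at once
)

// query stands in for a single FIND_NODE RPC to a peer.
func query(p string) {
	fmt.Println("querying", p)
}

func main() {
	// Seed with KValue candidates from the routing table...
	seeds := make([]string, 0, KValue)
	for i := 0; i < KValue; i++ {
		seeds = append(seeds, fmt.Sprintf("peer-%d", i))
	}

	// ...but only AlphaValue queries run at any given moment.
	sem := make(chan struct{}, AlphaValue)
	var wg sync.WaitGroup
	for _, p := range seeds {
		sem <- struct{}{} // blocks once AlphaValue queries are in flight
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			query(p)
		}(p)
	}
	wg.Wait()
}
```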
@Stebalien I'd say keep it all in one PR, group commits into logical changesets, and post a list of diff ranges like this to facilitate review: https://github.com/libp2p/go-libp2p-kad-dht/pull/291/files/2006602434583ea06634813330437f31df9300a1..1fcc9db35d65c32914c1b5bed4c8825437b697fe
Force-pushed from 4ac000f to f7005bd
This is actually still incorrect. We _should_ be limiting our query to AlphaValue peers and then expanding to KValue peers once we run out of peers to query. However, this is still much better and we can do that in a followup commit.

Considerations: We may not want to merge this until we get the multipath lookup patch. It turns out, our current DHT effectively explores _all_ paths.

fixes #290
Returning early is _not_ valid, ever, for any reason. Note: `query.Run` now returns the final peers. Any other values should be exported via channels, etc. (it's a _lot_ simpler).
Unfortunately, while returning early isn't valid, FindPeer would block for _ages_ if we didn't. We should switch to a progressive FindPeerAsync but this'll have to do for now.
Force-pushed from f7005bd to 2f9e67e
 	// setup concurrency rate limiting
-	for i := 0; i < r.query.concurrency; i++ {
+	for len(r.rateLimit) < cap(r.rateLimit) {
 		r.rateLimit <- struct{}{}
what is this doing exactly? Are we just trying to fill up the channel?
Yes (unnecessary but I was already messing with this code).
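For readers wondering why filling the channel works: a buffered channel pre-loaded with tokens acts as a counting semaphore, where taking a token grants permission to run one query. A minimal illustration (not the PR's code; the names are placeholders):

```go
package main

import "fmt"

func main() {
	concurrency := 3
	rateLimit := make(chan struct{}, concurrency)

	// Pre-fill the buffered channel: one token per permitted worker.
	// This is what the loop in the diff above is doing.
	for len(rateLimit) < cap(rateLimit) {
		rateLimit <- struct{}{}
	}

	<-rateLimit // a worker takes a token before issuing a query...
	fmt.Println("working; tokens left:", len(rateLimit))
	rateLimit <- struct{}{} // ...and returns it when finished
}
```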
routing.go (Outdated)

@@ -323,7 +319,7 @@ func (dht *IpfsDHT) getValues(ctx context.Context, key string, nvals int) (<-cha
 	switch err {
 	case routing.ErrNotFound:
 		// in this case, they responded with nothing,
-		// still send a notification so listeners can know the
+		// still send a routingication so listeners can know the
lol
(oops)
Force-pushed from 2f9e67e to 78a2de6
This was causing us to do _slightly_ more work than necessary.
I.e., stop testing something that shouldn't work.
	for i := 0; i < nDHTs; i++ {
		dhts[i].BootstrapOnce(ctx, DefaultBootstrapConfig)
	}
So, I spent a while trying to avoid having to fix this test without complicating the query logic too much. Then I realized that was just stupid.
So, I wanted this to be the "ultimate patch that fixed everything". Yeah... This now correctly implements Kademlia, that's it.
TODO: Consider increasing Alpha to 6 (or more). The recurse step is taking ages and I believe trying more paths may help us find a faster path.
@jacobheun did some interesting experimentation with DHT configuration parameters for js-libp2p-kad-dht, including changing alpha to 6, but ended up reverting to 3: libp2p/js-libp2p-kad-dht#107. @jacobheun, could you summarize why an alpha of 3 worked better in the end?
A higher alpha will probably work fine for go. The major issue I hit was performance of the js-ipfs node with an alpha of 6. I ended up using 4 for js-ipfs in Node.js and 3 in the browser, ipfs/js-ipfs#1994. When I tested an alpha of 3 vs 6, 6 yielded a more normalized, lower range for the query times. Originally we hit problems with the higher alphas due to dialing timeouts of the peers. If the query to peers isn't fairly aggressively limited, a higher alpha can result in a path taking a long time to finish because it basically trickles responses. This is also why we ended up with a "sloppy" approach to ending queries, which significantly improved query times: for a given path, if we finish one of the concurrent queries and there are no closer peers queued, we complete that path, even if there are queries in progress. This approach was still able to consistently find the top closest peers; see libp2p/js-libp2p-kad-dht#107 (comment).
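A rough sketch of that "sloppy" termination rule (an illustration of the idea described above, not the js-libp2p-kad-dht code; distances are simplified to plain uint64 XOR distances):

```go
package main

import "fmt"

// pathComplete reports whether a path can stop early: once one of its
// concurrent queries finishes, the path is done if no queued candidate
// is strictly closer to the target than the best peer seen so far --
// even while other queries on the path are still in flight.
func pathComplete(bestSeen uint64, queued []uint64) bool {
	for _, d := range queued {
		if d < bestSeen {
			return false // a closer candidate is still waiting
		}
	}
	return true
}

func main() {
	fmt.Println(pathComplete(42, []uint64{57, 100})) // true: nothing queued is closer
	fmt.Println(pathComplete(42, []uint64{7, 100}))  // false: 7 is closer, keep going
}
```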
Ah, it looks like this doesn't include disjoint paths, so the stuff I linked isn't going to help a lot here. Bumping the alpha to 6 here is only going to give us 6 RPC calls concurrently, iiuc. With disjoint paths in JS, a concurrency of 4 is going to get us 40 concurrent calls: (kValue / 2) * 4. If this is going to stay single path, the alpha is going to need to increase pretty significantly I think, maybe to 20 or more until disjoint paths are added.
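Spelling out that arithmetic (assuming kValue = 20, the default bucket size):

```go
package main

import "fmt"

func main() {
	const kValue = 20 // assumed default bucket size
	const alpha = 4   // per-path concurrency used in js-ipfs on Node.js

	paths := kValue / 2         // disjoint paths: 10
	concurrent := paths * alpha // concurrent RPCs across all paths: 40
	fmt.Println(paths, concurrent)
}
```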
I'm going to run this with my enhanced logging and I'll report how it performs.
@Stebalien the results seem iffy in my case. Two tests managed to finish "get closer peers" and start providing in 1 minute or less:

- 1 minute:
- 10 seconds:
- 10 minutes (and counting):
I like the separation between the recurse function and the finish function. It wouldn't hurt us to have a PING message (even if a reflexive FIND_NODE moonlights as that right now).

I was hoping we could take this opportunity to simplify various aspects of the implementation. For example, the separation between dhtQuery and dhtQueryRunner seems redundant.

Our peer management is all over the place. I was thinking KPeerSet could encapsulate the state of peer traversal. We'd instantiate it with a target count (e.g. 16, KValue), and as we traverse the network, we would notify it of the state of each peer via methods: Failed(peer.ID), OK(peer.ID), Querying(peer.ID). We could also condense the todocounter functionality into it.

Whenever a worker was ready, it'd ask for the next peer via Next() peer.ID. It would also expose a channel via Done() chan struct{} to signal when the target was met, and we'd fetch the resulting peers via Peers() and execute the finish function on them.

Dunno whether I should take a stab at putting this approach together? WDYT, @Stebalien?
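For concreteness, one possible shape for the KPeerSet described above -- an assumed interface, not code that exists in this repo (the import path and the []peer.ID return type of Peers() are guesses):

```go
package kpeerset

import "github.com/libp2p/go-libp2p/core/peer"

// KPeerSet encapsulates the state of peer traversal. It is created
// with a target count of successful peers (e.g. KValue).
type KPeerSet interface {
	// Workers report per-peer state transitions as they traverse.
	Querying(p peer.ID) // a query to p is in flight
	OK(p peer.ID)       // p responded successfully
	Failed(p peer.ID)   // the query to p failed or timed out

	// Next returns the next candidate a ready worker should query.
	Next() peer.ID
	// Done yields a channel that closes once the target is met.
	Done() chan struct{}
	// Peers returns the resulting peers to run the finish function on.
	Peers() []peer.ID
}
```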
@Stebalien – I added an extra TODO point in the description for the Kademlia distance logging we discussed today.
I tested out this PR in combination with ipfs/go-ipfs-provider#8, and it seems to be giving substantial performance improvements. My DHT provide announcements seem to be happening faster; additionally, it seems like my
I'd like to leave further refactors to a future PR unless this one makes things worse. I agree the query system has gotten a bit confusing, but I wanted to save a larger refactor for a PR that only includes that refactor.
License: MIT
Signed-off-by: Raúl Kripalani <raul@protocol.ai>
Change bucket size to be configurable
This makes the initial step consistent with recursive steps. Strictly speaking, we should only _need_ alpha peers but some of those peers may not respond.
This has been replaced by #436
This is actually still incorrect. We should be limiting our query to AlphaValue peers and then expanding to KValue peers once we run out of peers to query. However, this is still much better and we can do that in a followup commit (maybe in this PR, maybe in a new PR, we'll see).

This now correctly implements Kademlia (everywhere).
TODO:

- Multipath: punted.
- Play with slop and dial parallelism: punted. I can get some pretty fast (relatively, ~10s) queries this way. However, this "slop" is definitely not proper Kademlia.

fixes #290