improve query performance by limiting query width to KValue peers #291
Conversation
Force-pushed from 72782a9 to b9edb2c
I think if some of these can be dealt with in stand-alone PRs it will make it much more digestible. While you're poking around in this area, it always irks me that queries appear to be initialized with only the alpha closest peers, yet can expand to many more once closer peers come back in replies. If all of those initial alpha peers fail, the entire query fails. There's no reason not to start with a lot more peers while only actively querying alpha to begin with.
@anacrolix rationale here: #192 (comment). But I agree we need to recover from a poisoned start -- in fact, I'd say that's urgent.
Honestly, we could probably just seed with KValue peers. That should just work. I can break this up into separate PRs for commenting, but it should be merged all at once. I've broken it into reasonably logical commits that can be reviewed separately but, well, GitHub reviews don't play well with that workflow.
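To make that proposal concrete, here is a minimal, self-contained sketch (not code from this PR; the peer names and the query function are stand-ins): seed the query with KValue candidates so a few dead seeds can't poison it, while a channel semaphore keeps only AlphaValue lookups in flight at once.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	KValue     = 20 // bucket size: number of seed candidates
	AlphaValue = 3  // number of lookups allowed in flight at once
)

// query stands in for a single FIND_NODE RPC to a peer.
func query(p string) {
	fmt.Println("querying", p)
}

func main() {
	// Seed with KValue candidates from the routing table...
	seeds := make([]string, 0, KValue)
	for i := 0; i < KValue; i++ {
		seeds = append(seeds, fmt.Sprintf("peer-%d", i))
	}

	// ...but only AlphaValue queries run at any given moment.
	sem := make(chan struct{}, AlphaValue)
	var wg sync.WaitGroup
	for _, p := range seeds {
		sem <- struct{}{} // blocks once AlphaValue queries are in flight
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			query(p)
		}(p)
	}
	wg.Wait()
}
```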
@Stebalien I'd say keep it all in one PR, group commits into logical changesets, and post a list of diff ranges like this to facilitate review: https://github.com/libp2p/go-libp2p-kad-dht/pull/291/files/2006602434583ea06634813330437f31df9300a1..1fcc9db35d65c32914c1b5bed4c8825437b697fe
Force-pushed from 4ac000f to f7005bd
This is actually still incorrect. We _should_ be limiting our query to AlphaValue peers and then expanding to KValue peers once we run out of peers to query. However, this is still much better and we can do that in a followup commit.

Considerations: We may not want to merge this until we get the multipath lookup patch. It turns out, our current DHT effectively explores _all_ paths.

fixes #290
Returning early is _not_ valid, ever, for any reason. Note: `query.Run` now returns the final peers. Any other values should be exported via channels, etc. (it's a _lot_ simpler).
Unfortunately, while returning early isn't valid, FindPeer would block for _ages_ if we didn't. We should switch to a progressive FindPeerAsync but this'll have to do for now.
Force-pushed from f7005bd to 2f9e67e
 	// setup concurrency rate limiting
-	for i := 0; i < r.query.concurrency; i++ {
+	for len(r.rateLimit) < cap(r.rateLimit) {
 		r.rateLimit <- struct{}{}
what is this doing exactly? Are we just trying to fill up the channel?
Yes (unnecessary but I was already messing with this code).
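For readers wondering why filling the channel works: a buffered channel pre-loaded with tokens acts as a counting semaphore, where taking a token grants permission to run one query. A minimal illustration (not the PR's code; the names are placeholders):

```go
package main

import "fmt"

func main() {
	concurrency := 3
	rateLimit := make(chan struct{}, concurrency)

	// Pre-fill the buffered channel: one token per permitted worker.
	// This is what the loop in the diff above is doing.
	for len(rateLimit) < cap(rateLimit) {
		rateLimit <- struct{}{}
	}

	<-rateLimit // a worker takes a token before issuing a query...
	fmt.Println("working; tokens left:", len(rateLimit))
	rateLimit <- struct{}{} // ...and returns it when finished
}
```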
routing.go (Outdated)

@@ -323,7 +319,7 @@ func (dht *IpfsDHT) getValues(ctx context.Context, key string, nvals int) (<-cha
 	switch err {
 	case routing.ErrNotFound:
 		// in this case, they responded with nothing,
-		// still send a notification so listeners can know the
+		// still send a routingication so listeners can know the
lol
(oops)
Force-pushed from 2f9e67e to 78a2de6
This was causing us to do _slightly_ more work than necessary.
I.e., stop testing something that shouldn't work.
	for i := 0; i < nDHTs; i++ {
		dhts[i].BootstrapOnce(ctx, DefaultBootstrapConfig)
	}
So, I spent a while trying to avoid having to fix this test without complicating the query logic too much. Then I realized that was just stupid.
So, I wanted this to be the "ultimate patch that fixed everything". Yeah... This now correctly implements Kademlia, that's it.
TODO: Consider increasing Alpha to 6 (or more). The recurse step is taking ages and I believe trying more paths may help us find a faster path.
@jacobheun did some interesting experimentation with DHT configuration parameters for js-libp2p-kad-dht, including changing alpha to 6, but ended up reverting to 3: libp2p/js-libp2p-kad-dht#107. @jacobheun, could you summarize why an alpha of 3 worked better in the end?
A higher alpha will probably work fine for go. The major issue I hit was performance of the js-ipfs node with an alpha of 6. I ended up using 4 for js-ipfs in Node.js and 3 in the browser, ipfs/js-ipfs#1994. When I tested an alpha of 3 vs 6, 6 yielded a more normalized, lower range for the query times. Originally we hit problems with the higher alphas due to dialing timeouts of the peers. If the query to peers isn't fairly aggressively limited, a higher alpha can result in a path taking a long time to finish because it basically trickles responses. This is also why we ended up with a "sloppy" approach to ending queries, which significantly improved query times: for a given path, if we finish one of the concurrent queries and there are no closer peers queued, we complete that path, even if there are queries in progress. This approach was still able to consistently find the top closest peers; see libp2p/js-libp2p-kad-dht#107 (comment).
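A rough sketch of that "sloppy" termination rule (an illustration of the idea described above, not the js-libp2p-kad-dht code; distances are simplified to plain uint64 XOR distances):

```go
package main

import "fmt"

// pathComplete reports whether a path can stop early: once one of its
// concurrent queries finishes, the path is done if no queued candidate
// is strictly closer to the target than the best peer seen so far --
// even while other queries on the path are still in flight.
func pathComplete(bestSeen uint64, queued []uint64) bool {
	for _, d := range queued {
		if d < bestSeen {
			return false // a closer candidate is still waiting
		}
	}
	return true
}

func main() {
	fmt.Println(pathComplete(42, []uint64{57, 100})) // true: nothing queued is closer
	fmt.Println(pathComplete(42, []uint64{7, 100}))  // false: 7 is closer, keep going
}
```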
Ah, it looks like this doesn't include disjoint paths, so the stuff I linked isn't going to help a lot here. Bumping the alpha to 6 here is only going to give us 6 RPC calls concurrently, iiuc. With disjoint paths in JS, a concurrency of 4 is going to get us 40 concurrent calls: (kValue / 2) * 4. If this is going to stay single path, the alpha is going to need to increase pretty significantly I think, maybe to 20 or more until disjoint paths are added.
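Spelling out that arithmetic (assuming kValue = 20, the default bucket size):

```go
package main

import "fmt"

func main() {
	const kValue = 20 // assumed default bucket size
	const alpha = 4   // per-path concurrency used in js-ipfs on Node.js

	paths := kValue / 2         // disjoint paths: 10
	concurrent := paths * alpha // concurrent RPCs across all paths: 40
	fmt.Println(paths, concurrent)
}
```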
I'm going to run this with my enhanced logging and I'll report how it performs.
@Stebalien the results seem iffy in my case. Two tests managed to finish "get closer peers" and start providing in 1 minute or less:

- 1 minute:
- 10 seconds:
- 10 minutes (and counting):
I like the separation between the recurse function and the finish function. It wouldn't hurt us to have a PING message (even if a reflexive FIND_NODE moonlights as that right now).

I was hoping we could take this opportunity to simplify various aspects of the implementation. For example, the separation between dhtQuery and dhtQueryRunner seems redundant.

Our peer management is all over the place. I was thinking KPeerSet could encapsulate the state of peer traversal. We'd instantiate it with a target count (e.g. 16, KValue), and as we traverse the network, we would notify it of the state of each peer via methods: Failed(peer.ID), OK(peer.ID), Querying(peer.ID). We could also condense the todocounter functionality into it.

Whenever a worker was ready, it'd ask for the next peer via Next() peer.ID. It would also expose a channel via Done() chan struct{} to signal when the target was met, and we'd fetch the resulting peers via Peers() and execute the finish function on them.

Dunno whether I should take a stab at putting this approach together? WDYT, @Stebalien?
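For concreteness, one possible shape for the KPeerSet described above -- an assumed interface, not code that exists in this repo (the import path and the []peer.ID return type of Peers() are guesses):

```go
package kpeerset

import "github.com/libp2p/go-libp2p/core/peer"

// KPeerSet encapsulates the state of peer traversal. It is created
// with a target count of successful peers (e.g. KValue).
type KPeerSet interface {
	// Workers report per-peer state transitions as they traverse.
	Querying(p peer.ID) // a query to p is in flight
	OK(p peer.ID)       // p responded successfully
	Failed(p peer.ID)   // the query to p failed or timed out

	// Next returns the next candidate a ready worker should query.
	Next() peer.ID
	// Done yields a channel that closes once the target is met.
	Done() chan struct{}
	// Peers returns the resulting peers to run the finish function on.
	Peers() []peer.ID
}
```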
@Stebalien – I added an extra TODO point in the description for the Kademlia distance logging we discussed today.
I tested out this PR in combination with ipfs/go-ipfs-provider#8, and it seems to be giving substantial performance improvements. My DHT provide announcements seem to be happening faster; additionally, it seems like my
I'd like to leave further refactors to a future PR unless this one makes things worse. I agree the query system has gotten a bit confusing, but I wanted to save a larger refactor for a PR that only includes that refactor.
License: MIT
Signed-off-by: Raúl Kripalani <raul@protocol.ai>
Change bucket size to be configurable
This makes the initial step consistent with recursive steps. Strictly speaking, we should only _need_ alpha peers but some of those peers may not respond.
This has been replaced by #436
This is actually still incorrect. We should be limiting our query to AlphaValue peers and then expanding to KValue peers once we run out of peers to query. However, this is still much better and we can do that in a followup commit (maybe in this PR, maybe in a new PR, we'll see).

This now correctly implements Kademlia (everywhere).
TODO:

- Multipath: punted.
- Play with slop and dial parallelism: punted. I can get some pretty fast (relatively, ~10s) queries this way. However, this "slop" is definitely not proper Kademlia.

fixes #290