feat: ping peers on routing table refresh #810

dennis-tra · 2023-02-03T12:13:03Z

We have seen in the past that there are peers in the IPFS DHT that let you connect to them but then refuse to speak any protocol. This was mainly due to the resource manager killing the connection if limits were exceeded. We have seen that such peers are already pushed to the edge of the DHT - meaning they get pruned from lower buckets. However, they won't get pruned from higher ones because we only try to connect to them and not speak anything on that connection.

This change adds a ping message to the liveness check on routing table refreshes.

Jorropo

LGTM
It would be better with a test, but this is very annoying to test (annoying to write and annoying to maintain because of the cost of updating mocks). This will most likely work, so I'm fine without a test.

guillaumemichel · 2023-02-06T10:10:56Z

IMO we don't need to periodically check whether nodes still answer to DHT queries as expected. Preventing unresponsive nodes from being added to the RT should be sufficient.

See #811

dennis-tra · 2023-02-06T11:34:50Z

Periodically interacting with peers beyond just connecting to them would detect if they have become overwhelmed with requests (e.g., reached their resource manager limits). This can happen over time, so I don't think checking just once upon inserting them into our routing table is enough.
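
To illustrate the difference, here is a toy sketch (the `routingTable` and `probeFunc` names are hypothetical, not the go-libp2p-kbucket API) in which a peer passes the check on insertion, becomes overloaded later, and is only caught by a subsequent refresh:

```go
package main

import (
	"fmt"
	"sort"
)

// probeFunc is a hypothetical liveness probe; it returns false when the
// peer does not answer.
type probeFunc func(peer string) bool

// routingTable is a toy stand-in for a real routing table.
type routingTable struct{ peers map[string]struct{} }

// add inserts a peer only if it passes the probe (check on insertion).
func (rt *routingTable) add(peer string, probe probeFunc) bool {
	if !probe(peer) {
		return false
	}
	rt.peers[peer] = struct{}{}
	return true
}

// refresh re-probes every peer and evicts the ones that stopped answering
// (periodic check). It returns the evicted peers in sorted order.
func (rt *routingTable) refresh(probe probeFunc) []string {
	var evicted []string
	for p := range rt.peers {
		if !probe(p) {
			evicted = append(evicted, p)
			delete(rt.peers, p) // deleting during range is safe in Go
		}
	}
	sort.Strings(evicted)
	return evicted
}

func main() {
	overloaded := map[string]bool{}
	probe := func(p string) bool { return !overloaded[p] }

	rt := &routingTable{peers: map[string]struct{}{}}
	rt.add("peerA", probe)
	rt.add("peerB", probe)

	// peerB later hits its resource limits and stops answering.
	overloaded["peerB"] = true

	fmt.Println("evicted:", rt.refresh(probe)) // evicted: [peerB]
	fmt.Println("remaining:", len(rt.peers))   // remaining: 1
}
```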

dennis-tra · 2023-02-06T11:43:16Z

Another remark:

I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol because @Jorropo, you said we wanted to get rid of the ping message from the /ipfs/kad/1.0.0 protocol. What was the reasoning here again? Is it just because we have a dedicated ping package/protocol for that?

Just wanted to point out that this means we require all peers in the network to speak that other ping protocol which makes the /ipfs/kad/1.0.0 protocol dependent on it. Not sure if I like such inter-protocol dependencies.

Opinions @guseggert ?

Jorropo · 2023-02-06T13:33:39Z

Actually, using the DHT ping endpoint lets us check that the DHT protocol stream limits have not been hit: a peer might have very high ping protocol limits but very low DHT limits. I think the deprecation has something to do with rust-libp2p that implement it because it is unused or something? (cc @mxinden)

If using the DHT ping endpoint is actually fine, we should revert my request to use the ping protocol (this will allow us to test the correct per-protocol stream limits).

guillaumemichel · 2023-02-06T14:06:01Z

Why not directly use a DHT findpeers request?

Jorropo · 2023-02-06T14:26:04Z

> Why not directly use a DHT findpeers request?

Why do a more expensive request when a simpler one does the job? The Kademlia ping and a findpeer will exercise mostly the same code paths through the stack (the only differences should be in the Kademlia handler, where you would switch on the message type).
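
As a rough sketch of that point (the `Message` type and `handle` function are simplified assumptions, not the actual go-libp2p-kad-dht handler), both requests enter through the same function and diverge only in the switch:

```go
package main

import "fmt"

// MessageType and Message are simplified, hypothetical versions of a
// Kademlia RPC message.
type MessageType int

const (
	Ping MessageType = iota
	FindNode
)

type Message struct {
	Type        MessageType
	Key         string
	CloserPeers []string
}

// handle is the single entry point for incoming requests. Stream setup,
// decoding, and encoding would be shared; only the switch arm differs.
func handle(req Message, closest func(key string) []string) (Message, error) {
	switch req.Type {
	case Ping:
		// Cheapest possible reply: echo the type, no routing table lookup.
		return Message{Type: Ping}, nil
	case FindNode:
		// Slightly more work: look up the peers closest to the key.
		return Message{Type: FindNode, Key: req.Key, CloserPeers: closest(req.Key)}, nil
	default:
		return Message{}, fmt.Errorf("unknown message type %d", req.Type)
	}
}

func main() {
	closest := func(key string) []string { return []string{"peerA", "peerB"} }

	pong, _ := handle(Message{Type: Ping}, closest)
	found, _ := handle(Message{Type: FindNode, Key: "some-key"}, closest)

	fmt.Println(len(pong.CloserPeers), len(found.CloserPeers)) // 0 2
}
```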

dennis-tra · 2023-02-07T17:31:59Z

From our discussion a few minutes ago:

- It doesn't really matter whether we use ping or find_node.
- ping saves minimal CPU cycles.
- As long as we don't do anything with the find_node response, we could just use ping.
- If we plan to add further peer verification, we could easily change this to a find_node request.

mxinden · 2023-02-08T16:02:29Z

> I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol

Yes, please don't use the deprecated Kademlia Ping. See specification:

> PING: Deprecated message type replaced by the dedicated ping protocol. Implementations may still handle incoming PING requests for backwards compatibility. Implementations must not actively send PING requests.

https://github.com/libp2p/specs/tree/master/kad-dht

> Is it just because we have a dedicated ping package/protocol for that?

I don't recall the exact reasoning. This was way before my time. Though this is my intuition, yes. #31 might help a bit.

> think the deprecation has something to do with rust-libp2p that implement it because it is unused or something? (cc @mxinden)

I am not aware of any way the deprecation is related to rust-libp2p.

mxinden · 2023-02-08T16:03:22Z

dht.go

@@ -365,10 +365,15 @@ func makeRtRefreshManager(dht *IpfsDHT, cfg dhtcfg.Config, maxLastSuccessfulOutb
 		return err
 	}

+	pingFnc := func(ctx context.Context, p peer.ID) error {
+		return dht.protoMessenger.Ping(ctx, p)


> I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol

Not deeply familiar with the codebase. Just double checking, is this really not using the Kademlia Ping mechanism?

dennis-tra · 2023-02-08

After yesterday's discussion, we changed it back to the Kademlia PING message. So this is indeed the Kademlia Ping.

dennis-tra · 2023-02-08T16:20:47Z

> Yes, please don't use the deprecated Kademlia Ping. See specification:

Nice, thanks for the clarification. Great to have the information from somewhere authoritative, although it's still unclear what the reasoning there was. Now, we have three options:

1. Use the ping package.
   - It wouldn't exercise the DHT protocol resource manager limits (however, that's partially exactly what I would want to test here).
   - It puts a dependency from the DHT protocol on the ping package. Every peer speaking the DHT protocol must now speak the ping protocol as well.

OR

2. Just probe with a FIND_NODE instead of a PING message. Yesterday, we said it doesn't really matter what we use. We chose PING to just save some CPU cycles.

OR

3. Change the spec :D

I'd vote for 2.

mxinden · 2023-02-10T09:49:14Z

First off, do we have agreement that this is a temporary hack? I.e. that this is to work around existing nodes with a misconfigured resource manager? And that the long term solution is to somehow upgrade these nodes?

> However, they won't get pruned from higher ones because we only try to connect to them and not speak anything on that connection.

Do I understand correctly that they would be pruned from the routing table whenever we send an RPC (e.g. FINDNODE) to them AND they don't respond? If so, the periodic test RPC would just speed up this pruning process, correct?

> Change the spec :D

If indeed this is a temporary fix only, I am reluctant to change the Kademlia specification for it.

dennis-tra · 2023-02-10T12:32:40Z

> First off, do we have agreement that this is a temporary hack?

For me, this is not a temporary hack. It's another safeguard against misconfigured nodes. We're not preventing any attacks here.

However, I totally agree that the priority is fixing the root cause. We have already put things in motion to do that. Provably, the network has significantly picked up on our proposed changes: https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-5/ipfs/README.md#agents, and we expect things to improve in the near future even more.

As a follow-up step, I'm also in favour of doing a similar check upon insertion of a peer to the routing table as proposed by @guillaumemichel in #811. With both of these changes, we could have notably mitigated the current performance hit to DHT lookup latencies.

> Do I understand correctly that they would be pruned from the routing table whenever we send an RPC (e.g. FINDNODE) to them AND they don't respond? If so, the periodic test RPC would just speed up this pruning process, correct?

That's correct. For our current resource manager challenges, we would not only speed up this process but actually just begin to prune them at all. Right now, these unresponsive nodes we observe stay in routing tables basically forever.

guillaumemichel · 2023-02-10T12:42:51Z

> First off, do we have agreement that this is a temporary hack? I.e. that this is to work around existing nodes with a misconfigured resource manager? And that the long term solution is to somehow upgrade these nodes?

No, we want to prevent this kind of problem from happening again in the future (node misconfiguration, resource manager, implementation bugs or any other reason that we cannot predict).

> If so, the periodic test RPC would just speed up this pruning process, correct?

I wouldn't say that it just speeds up the process. If the peer IDs close to you are unresponsive to DHT queries but responsive to ping, in the current state they are never pruned. A node (almost) never sends DHT queries to remote nodes close to itself, as they probably store the same Provider Records as the node itself, and the probability that the content you try to access is provided by a remote node very close to you (in XOR distance) is very small.

> Right now, these unresponsive nodes we observe stay in routing tables basically forever.

+1

mxinden · 2023-02-10T13:26:38Z

> A node (almost) never sends DHT queries to remote nodes close to itself, as they probably store the same Provider Records as the node itself, and the probability that the content you try to access is provided by a remote node very close to you (in XOR distance) is very small.

Good point. I did not consider this.

> Just probe with a FIND_NODE instead of a PING message. Yesterday, we said it doesn't really matter what we use. We chose PING to just save some CPU cycles.

For what my opinion is worth, this sounds reasonable to me. Long term I would still wish for this to no longer be needed, i.e. I would wish for the majority of nodes to properly answer Kademlia requests when they advertise support for the Kademlia protocol. Though that might just be wishful thinking.

Thanks for expanding here @dennis-tra and @guillaumemichel.

Jorropo · 2023-02-10T13:32:21Z

It seems that everyone is fine with the current patch. I'll merge by the end of the day unless someone complains.

BigLep · 2023-02-13T18:54:41Z

@dennis-tra : are you getting this bubbled up into Kubo?

BigLep · 2023-02-13T18:55:51Z

I was surprised not to see a corresponding update to https://github.com/libp2p/go-libp2p-kad-dht/commits/master/version.json but I see @Jorropo did this. Now we just make sure this bubbles up.

Jorropo · 2023-02-14T08:14:50Z

@BigLep version v0.21.0 has been created for go-libp2p v0.25

I'll bubble

dennis-tra force-pushed the ping-on-refresh branch from 758f56d to eb6d2f7 Compare February 3, 2023 12:16

use ping protocol instead of DHT ping

16823c3

Jorropo reviewed Feb 5, 2023

extract global ping timeout constant

d5ea78f

Jorropo approved these changes Feb 5, 2023

guillaumemichel mentioned this pull request Feb 6, 2023

Don't add unresponsive DHT servers to the RT #811

Closed

Revert "use ping protocol instead of DHT ping"

5e68507

This reverts commit 16823c3.

dennis-tra force-pushed the ping-on-refresh branch from 397805c to 5e68507 Compare February 7, 2023 17:15

Jorropo approved these changes Feb 7, 2023

mxinden reviewed Feb 8, 2023

refactor: use the FIND_NODE RPC to ping peer

62c832e

Jorropo merged commit 81e7325 into libp2p:master Feb 11, 2023

dennis-tra deleted the ping-on-refresh branch February 13, 2023 19:06

yiannisbot mentioned this pull request Feb 21, 2023

Milestone: DHT Optimisations probe-lab/roadmap#14

Closed

dennis-tra mentioned this pull request Mar 16, 2023

Optimistic Provide #783

Merged

guillaumemichel mentioned this pull request Jun 27, 2023

Optimize Routing Table Module probe-lab/go-kademlia#2

Closed

guillaumemichel mentioned this pull request Jul 6, 2023

Only add reachable peers to the routing table probe-lab/go-kademlia#53

Closed

dennis-tra mentioned this pull request Jul 11, 2023

Implement Provider Store Module probe-lab/go-kademlia#3

Closed

dennis-tra mentioned this pull request Aug 17, 2023

Protocol interface refinement probe-lab/go-kademlia#98

Closed
