litep2p/kad: Configure periodic network bootstrap for better throughput #4942

lexnv · 2024-07-04T11:42:22Z

This PR modifies the periodic network bootstrap process (submitting kad FIND_NODE commands with random peerIDs) to improve the number of connected peers:

- Initially, kad queries are submitted at exponentially increasing intervals, starting at 5 seconds and converging to 2 minutes
- If a query fails, the timer is reset to 5 seconds
- A maximum of 16 kad queries are allowed to exist at a given time (although in practice we have a maximum of 1 query in flight)
- Queries are not initiated if we are connected to a healthy number of peers (this is similar to libp2p)
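
For illustration, the schedule described in the list above could look roughly like the following. This is a minimal sketch and not the code from this PR: `DiscoverySchedule`, its methods, and the constant names are hypothetical, and the "healthy" peer count is passed in as a parameter because its value is not stated here.

```rust
use std::time::Duration;

// Hypothetical constants mirroring the numbers in the description.
const INITIAL_DELAY: Duration = Duration::from_secs(5);
const MAX_DELAY: Duration = Duration::from_secs(120);
const MAX_INFLIGHT_QUERIES: usize = 16;

/// Hypothetical helper that decides when a random-walk FIND_NODE query may start.
#[derive(Debug)]
struct DiscoverySchedule {
    /// Delay to wait after the next query that gets started.
    next_delay: Duration,
    /// Queries that were started and have not finished yet.
    inflight: usize,
}

impl DiscoverySchedule {
    fn new() -> Self {
        Self { next_delay: INITIAL_DELAY, inflight: 0 }
    }

    /// Returns `None` if no query should start (enough peers, or too many
    /// queries in flight). Otherwise records a new in-flight query and returns
    /// the delay to wait before the next attempt, doubling it up to `MAX_DELAY`.
    fn try_start_query(&mut self, connected_peers: usize, healthy_peers: usize) -> Option<Duration> {
        if connected_peers >= healthy_peers || self.inflight >= MAX_INFLIGHT_QUERIES {
            return None;
        }
        self.inflight += 1;
        let delay = self.next_delay;
        self.next_delay = (self.next_delay * 2).min(MAX_DELAY);
        Some(delay)
    }

    /// Marks one query as finished; a failure resets the delay to `INITIAL_DELAY`.
    fn on_query_finished(&mut self, succeeded: bool) {
        self.inflight = self.inflight.saturating_sub(1);
        if !succeeded {
            self.next_delay = INITIAL_DELAY;
        }
    }
}
```

With this shape, successive successful queries wait 5, 10, 20, 40, 80 and then 120 seconds, and a single failure drops the delay back to 5 seconds.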

The old behavior:

- only one query in flight
- query interval: exponential from 5 seconds to 1 minute (so we submitted roughly 1 query per minute)

For my full node started on Kusama, I observed that queries finish in under 1 minute.
What I expect is happening for a long-running node:

- the node is losing peers (probably due to the peerset banning and disconnecting; to be investigated / addressed in the future)
- kad queries can take significantly longer to finish (even over 2 minutes in a toy app using standalone litep2p)

Testing Done

Started 2 nodes with --in-peers 50 --out-peers 50:

- the green line represents the number of connected peers with litep2p (and this PR)
- the yellow line is the equivalent for libp2p (current backend)

Part of:

network/litep2p: Investigate low peer count for long-running node #4925

cc @paritytech/networking

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

bkchr · 2024-07-08T20:51:51Z

substrate/client/network/src/litep2p/mod.rs

+	/// Number of connected peers over the block announce protocol.
+	///
+	/// This is used to update metrics and network status.
+	num_sync_connected: Arc<AtomicUsize>,


Looks like a metric that should be moved to the sync crate.

Yep that sounds good, have created an issue for this:

network: Move number of connected peers metric to the sync component #5024

Since this was a bit involved for the peerstore metrics :D

dmitry-markin · 2024-07-16T11:09:32Z

Is the algorithm the same as what libp2p is doing? I don't mind merging this, but it's possible the root issue is somewhere else and would be masked once we merge this PR. Also, comparing libp2p and litep2p over a 2h time interval can be influenced by random factors, so it's not necessarily evident that the litep2p approach from this PR is significantly better than what libp2p is doing.

dmitry-markin · 2024-07-16T14:09:50Z

substrate/client/network/src/litep2p/mod.rs

@@ -439,6 +441,9 @@ impl<B: BlockT + 'static, H: ExHashT> NetworkBackend<B, H> for Litep2pNetworkBac
 		let peer_store_handle = params.network_config.peer_store_handle();
 		let executor = Arc::new(Litep2pExecutor { executor: params.executor });

+		let limit_discovery_under =
+			params.network_config.network_config.default_peers_set.out_peers as usize + 15;


What happens if in_peers is lower than 15?

Do you think the discovery should be guided by the number of sync peers connected, and not by the number of known peers?

Also, I would make sure we do a random walk at least once per minute as it was before, even if we have enough peers connected (make discovery_only_if_under_num slow down queries, but not stop them completely).
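
To make the suggestion concrete, here is a minimal sketch of "slow down, but do not stop". It is only an illustration under assumptions: `discovery_limit`, `next_walk_interval` and `SLOW_WALK_INTERVAL` are hypothetical names, and the backoff value is assumed to come from whatever timer the discovery code already keeps.

```rust
use std::time::Duration;

/// Hypothetical upper bound between random walks once enough peers are connected.
const SLOW_WALK_INTERVAL: Duration = Duration::from_secs(60);

/// Mirrors the `out_peers + 15` computation from the diff above.
fn discovery_limit(out_peers: u32) -> usize {
    out_peers as usize + 15
}

/// Picks the delay before the next random walk.
/// Under the limit, the current backoff is used (never slower than once per
/// minute); at or above the limit, walks continue at the slow interval
/// instead of stopping.
fn next_walk_interval(connected_peers: usize, limit: usize, backoff: Duration) -> Duration {
    if connected_peers < limit {
        backoff.min(SLOW_WALK_INTERVAL)
    } else {
        SLOW_WALK_INTERVAL
    }
}
```

The point of the design is that the peer count only changes how often a walk runs, so the routing table keeps being refreshed even on a well-connected node.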

lexnv · 2024-07-16T16:35:46Z

Nice catch indeed, we can remove the limit under config. The extra load on the network should be negligible and at the same time remove some extra logic from our code.

Initially, I used sync peers but can't remember why I transitioned to what libp2p is doing.

After I make the adjustment and remove the limit, I think we'll have the benefit of not waiting for slow queries to finish before starting a new one. This should be a bit better than before. We could keep the PR around until we find the root cause (I also think it's safe to merge, but let's be extra sure here) 👍

A few ideas to tackle this next:

- We could probably report back the total number of connected peers in metrics (not only the sync peers)
- Litep2p kademlia tables or kademlia queries might have some wrong logic
- Maybe we could try to evict some peers that are not responding to the sync protocol (could correlate with the metric)

This is mainly done to keep a healthy subset of the network in the node's memory and routing table. Otherwise, we may risk trading off discoverability with protocol performance, which is not entirely desirable. Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv · 2024-09-17T16:31:38Z

I changed the algorithm for kademlia discovery on the litep2p side to:

- check every 2 minutes that we have a healthy number of peers (double the number of peers libp2p considers healthy); if we are under this threshold, we perform a kademlia query
- perform a mandatory kademlia query every 30 minutes
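
The two rules above can be sketched as a small decision function. This is a hedged illustration rather than the actual implementation: `PeriodicDiscovery`, `Decision` and the constant names are hypothetical, libp2p's healthy peer count is taken as a parameter, and the sketch assumes the 30-minute clock is only reset by mandatory queries.

```rust
use std::time::{Duration, Instant};

// Hypothetical constants mirroring the intervals in the comment.
const CHECK_INTERVAL: Duration = Duration::from_secs(2 * 60);
const MANDATORY_QUERY_INTERVAL: Duration = Duration::from_secs(30 * 60);

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Enough peers and no mandatory query due: do nothing.
    Skip,
    /// Connected peers are under the healthy threshold.
    QueryLowPeers,
    /// 30 minutes elapsed since the last mandatory query.
    QueryMandatory,
}

struct PeriodicDiscovery {
    /// Twice the number of peers libp2p considers healthy.
    healthy_threshold: usize,
    /// Time of the last mandatory query (assumption: other queries do not reset it).
    last_mandatory: Instant,
}

impl PeriodicDiscovery {
    fn new(libp2p_healthy_peers: usize, now: Instant) -> Self {
        Self { healthy_threshold: 2 * libp2p_healthy_peers, last_mandatory: now }
    }

    /// Intended to be called once per `CHECK_INTERVAL`.
    fn on_tick(&mut self, connected_peers: usize, now: Instant) -> Decision {
        if now.duration_since(self.last_mandatory) >= MANDATORY_QUERY_INTERVAL {
            self.last_mandatory = now;
            Decision::QueryMandatory
        } else if connected_peers < self.healthy_threshold {
            Decision::QueryLowPeers
        } else {
            Decision::Skip
        }
    }
}
```

A well-connected node under this scheme issues one query per 30 minutes, while a node under the threshold issues at most one query per 2-minute check.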

The main goal was to improve the network throughput as discussed in this forum post.

Will provide some more details about performance in a bit; however, short-term 6-8h data looks extremely promising.

Offhand, it looks like we were performing too many KAD queries, leading to a more dynamic view of the network. That costs resources to dial, submit queries, handle responses, evict peers from the routing table, etc.

lexnv added 8 commits July 4, 2024 11:20

net/litep2p: Update connected peers from network backend sooner

a126147

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/litep2p: Propagate connected peers to the discovery mechanism

d9b57ee

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/litep2p: Start random walks if under connected peers limit

3b2c5ac

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/kad: Ensure timer moves forward

b61e973

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/kad: Keep record of inflight kad queries

837f196

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/kad: Reset kad query timer on failure

5610b95

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/kad: Downgrade logs to trace and debug

f5aee82

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

net/kad: Bound maximum number of queries

6ac1bff

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv self-assigned this Jul 4, 2024

bkchr reviewed Jul 4, 2024

lexnv added 4 commits July 8, 2024 14:39

litep2p: Rename num connected peers to num sync peers

4983b89

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

litep2p: Rename function to illustrate sync peers

32c9257

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

litep2p: Improve comment wrt num_sync_connected

2a6d231

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

litep2p: Extract number of distinct connected peers

c71fa9e

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

bkchr approved these changes Jul 8, 2024

litep2p: Remove fetch sync peers method

24bafe3

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv mentioned this pull request Jul 16, 2024

network: Move number of connected peers metric to the sync component #5024

Open

dmitry-markin approved these changes Jul 16, 2024

lexnv added 5 commits July 16, 2024 19:41

litep2p: Periodically perform kademlia queries instead if under number

b6c9901

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

Revert "litep2p: Periodically perform kademlia queries instead if und…

7bf91eb

…er number" This reverts commit b6c9901.

litep2p/discovery: Fallback to 2x factor for discovery

f676446

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

Merge remote-tracking branch 'origin/master' into lexnv/litep2p-aggr-…

0124309

…queries

lexnv added 3 commits September 17, 2024 16:14

discovery: Introduce mandatory query for discovery every 16 minutes

a8e1cae

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

litep2p/discovery: Better logs for mandatory queries

fa3ce93

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

litep2p/discovery: Periodic kad queries every 30 minutes

2ab14ee

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv changed the title ~~litep2p/kad: Configure periodic network bootstrap for higher connected peers~~ litep2p/kad: Configure periodic network bootstrap for better throughput Sep 17, 2024

lexnv mentioned this pull request Sep 24, 2024

ProtocolSet: failed to register opened substream paritytech/litep2p#252

Open
