Upgrade libp2p from 0.52.4 to 0.54.1 #6248

nazar-pc · 2024-10-27T00:25:28Z

Description

https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.53.0
https://github.com/libp2p/rust-libp2p/blob/master/CHANGELOG.md

Integration

Nothing special is needed, just note that yamux_window_size is no longer applicable to libp2p (litep2p seems to still have it though).

Review Notes

There are a few simplifications and improvements done in libp2p 0.53 regarding swarm interface, I'll list a few key/applicable here.

libp2p/rust-libp2p#4788 removed write_length_prefixed function, so I inlined its code instead.

libp2p/rust-libp2p#4120 introduced new libp2p::SwarmBuilder instead of now deprecated libp2p::swarm::SwarmBuilder, the transition is straightforward and quite ergonomic (can be seen in tests).

libp2p/rust-libp2p#4581 is the most annoying change I have seen that basically makes many enums #[non_exhaustive]. I mapped some, but those that couldn't be mapped I dealt with by printing log messages once they are hit (the best solution I could come up with, at least with stable Rust).

libp2p/rust-libp2p#4306 makes connection close as soon as there are no handler using it, so I had to replace KeepAlive::Until with an explicit future that flips internal boolean after timeout, achieving the old behavior, though it should ideally be removed completely at some point.

yamux_window_size is no longer used by libp2p thanks to libp2p/rust-libp2p#4970 and generally Yamux should have a higher performance now.

I have resolved and cleaned up all deprecations related to libp2p except BandwidthSinks. Libp2p deprecated it (though it is still present in 0.54.1, which is why I didn't handle it just yet). Ideally Substrate would finally switch to the official Prometheus client, in which case we'd get metrics for free. Otherwise a bit of code will need to be copy-pasted to maintain current behavior with BandwidthSinks gone, which I left a TODO about.

The biggest change in 0.54.0 is libp2p/rust-libp2p#4568 that changed transport APIs and enabled unconditional potential port reuse, which can lead to very confusing errors if running two Substrate nodes on the same machine without changing listening port explicitly.

Overall nothing scary here, but testing is always appreciated.

Checklist

My PR includes a detailed description as outlined in the "Description" and its two subsections above.
My PR follows the labeling requirements of this project (at minimum one label for T required)
- External contributors: ask maintainers to put the right label on your PR.

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

nazar-pc · 2024-10-27T00:30:22Z

Looks like something called zombienet-orchestrator still pulls old version of libp2p in, so whoever maintains that will have work to do after this

dmitry-markin

Nothing scary indeed. Thanks a lot for upgrading libp2p for us!

substrate/client/network/sync/src/engine.rs

dmitry-markin · 2024-11-04T13:21:56Z

substrate/client/network/src/discovery.rs

-			// Populate kad with both the legacy and the new protocol names.
-			// Remove the legacy protocol:
-			// https://github.com/paritytech/polkadot-sdk/issues/504
-			let kademlia_protocols = if let Some(legacy_protocol) = kademlia_legacy_protocol {
-				vec![kademlia_protocol.clone(), legacy_protocol]
-			} else {
-				vec![kademlia_protocol.clone()]
-			};


Looks good! We should finally do this)

substrate/client/network/src/discovery.rs

substrate/client/network/src/peer_info.rs

substrate/client/network/src/protocol/notifications/behaviour.rs

substrate/client/network/src/protocol/notifications/handler.rs

substrate/client/network/src/request_responses.rs

rockbmb · 2024-11-04T17:37:36Z

/tip small

substrate-tip-bot · 2024-11-04T17:37:44Z

Only members of paritytech/tip-bot-approvers have permission to request the creation of the tip referendum from the bot.

However, you can create the tip referendum yourself using Polkassembly or PolkadotJS Apps.

nazar-pc · 2024-11-05T05:21:08Z

Looks like in your CI request_responses::tests::max_response_size_exceeded runs even slower than I anticipated, would it be okay if I set an even larger connection timeout for this test?

dmitry-markin · 2024-11-05T07:22:54Z

Looks like in your CI request_responses::tests::max_response_size_exceeded runs even slower than I anticipated, would it be okay if I set an even larger connection timeout for this test?

Looking at the test, it doesn't seem to do anything time consuming. And a longer timeout was not needed with old libp2p?

nazar-pc · 2024-11-05T07:27:44Z

Looking at the test, it doesn't seem to do anything time consuming. And a longer timeout was not needed with old libp2p?

It runs quickly on its own, but when running all network tests at once it suddenly takes much longer and always seems to finish later than most tests. Didn't debug why very deeply though, primarily because as comment states it is not really applicable to how it is actually used in Substrate, just test-specific behavior. Pumping to 5 minutes fixed the test and shouldn't be particularly flaky.

Not sure about older version, didn't run tests in a loop to try and reproduce, but keep-alive behavior has certainly changed.

substrate/client/network/src/request_responses.rs

substrate/client/network/src/protocol/notifications/handler.rs

# Conflicts: # Cargo.lock

nazar-pc · 2024-11-07T10:21:47Z

I remember with previous libp2p upgrades there was some kind of burn-in testing done, can it be triggered for this PR, hopefully making it into stable2412 if all goes well 🤞 ?

dmitry-markin · 2024-11-07T10:33:06Z

I remember with previous libp2p upgrades there was some kind of burn-in testing done, can it be triggered for this PR, hopefully making it into stable2412 if all goes well 🤞 ?

Yes, that was the plan — I was going to do a versi burn-in once it is available after syncing refactoring testing. Last time the issues with libp2p upgrade popped a week after running the nodes, so we need to run a burn-in for at least a week or so.

Unfortunately, the branch-off for stable2412 was planned for yesterday and is going to happen today, so I don't think libp2p upgrade will make it into December release.

nazar-pc · 2024-11-11T11:21:14Z

I'll probably be backporting it then, so any feedback that can be collected until then would be helpful regardless

lexnv · 2024-11-13T14:56:09Z

substrate/client/network/src/peer_info.rs

-					return Poll::Ready(ToSwarm::RemoveListener { id }),
+				Poll::Ready(event) => {
+					return Poll::Ready(event.map_in(Either::Left).map_out(|_| {
+						unreachable!("`GenerateEvent` is handled in a branch above; qed")


nit: Im a bit afraid we might be missing events in cases where they are generated a few hours / days after starting the node. Could we instead make this an error log?

This assertion is statically correct. event is non-exhaustive, so they added .map_in() and .map_out() methods to exhaustively process it.

In this case .map_out()'s argument corresponds to ev from ToSwarm::GenerateEvent(ev) above and since we did match on ToSwarm::GenerateEvent(ev) already, there is no way .map_out() will ever be called, hence unreachable.

lexnv · 2024-11-13T14:57:42Z

substrate/client/network/src/discovery.rs


 			config.set_replication_factor(kademlia_replication_factor);
-			// Populate kad with both the legacy and the new protocol names.
-			// Remove the legacy protocol:
-			// https://github.com/paritytech/polkadot-sdk/issues/504


I think we should wait a bit and scrape the network. I believe everything should be fine, but let's double check first.

I guess we can merge #6109 before this one as it also handles legacy_protocol removal, but shouldn't matter too much.

I'll have a look this week and investigate versions, generally I think its safe for now

Yes, I specifically used kademlia_legacy_protocol: _, above instead of full removal of the property to reduce the inevitable conflict with #6109 whichever ends up being first.

lexnv · 2024-11-13T14:59:34Z

substrate/client/network/src/request_responses.rs

+			SwarmConfig::with_executor(TokioExecutor(runtime))
+				// This is taken care of by notification protocols in non-test environment
+				// It is very slow in test environment for some reason, hence larger timeout
+				.with_idle_connection_timeout(Duration::from_secs(300)),


dq: Was this working ok in the past?

It might indicate some issues with the debug build in terms of degrading libp2p performance? I don't believe we changed CI machines recently

It worked differently due to timeouts as described in libp2p/rust-libp2p#4306

I did not investigate deeply why this particular test takes so long to run with others sometimes (it runs quickly on its own though), someone should definitely look into it at some point. However this test only tests a single behavior in isolation, while the actual Substrate network notification protocol keeps connection alive, see substrate/client/network/src/protocol/notifications/handler.rs.

lexnv · 2024-11-13T15:01:23Z

substrate/client/network/src/service.rs

-						)))) => "sync-notifications-clogged",
-						Some(ConnectionError::Handler(Either::Left(Either::Left(
-							Either::Right(Either::Left(_)),
-						)))) => "ping-timeout",


It looks like we don't propagate the ping-timeout and sync-clogged labels anymore, are they merged into actively-closed?

I suspect they were merged into Io, ConnectionError::Handler variant doesn't exist upstream anymore

lexnv · 2024-11-13T15:02:47Z

substrate/client/network/src/transport.rs

@@ -29,28 +29,21 @@ use libp2p::{
 };
 use std::{sync::Arc, time::Duration};

+// TODO: Create a wrapper similar to upstream `BandwidthTransport` that tracks sent/received bytes


Is this blocked by prometheus-client crate update? Would be good to have an issue in substrate to not forget about this, thanks

Not really, this tracker is ultimately used to display in/out bandwidth in the informer, which will be needed regardless of prometheus library used. Though we will be able to avoid this for metrics purposes with prometheus library upgrade indeed.

lexnv · 2024-11-13T15:10:36Z

substrate/client/network/src/service.rs

-			transport::build_transport(
-				local_identity.clone().into(),
-				config_mem,
-				network_config.yamux_window_size,


We should deprecate this CLI flag with the next litep2p release

dq: @nazar-pc Did you observe a significant improvement in terms of performance when switching to the autotuning impl? I know the synthetic benchmarks look solid, but for litep2p this did not translate into other improvements

IIRC we saw some improvements, but nothing groundbreaking. Although our DSN networking stack works substantially differently from Substrate to draw any conclusions here.

lexnv

As always @nazar-pc thanks for the outstanding work here! 🙏

Before merging this I would like to sort out the legacy KAD protocol deprecation, as it might possibly affect the discovery of nodes that have not updated for a while (cc #6109)

It would be good to have it running in our test stack versi or versi-net and have a collected report in terms of logs and a few dashboards (mostly interested in performance side and connection stability -- I remember that 0.52.4 had a few regressions) 🙏

lexnv · 2024-11-13T15:24:17Z

As part of our testing and performance comparison we start 2 nodes side by side, it might not be ideal but I'm wondering if this will break the testing environment:

The biggest change in 0.54.0 is libp2p/rust-libp2p#4568 that changed transport APIs and enabled unconditional potential port reuse, which can lead to very confusing errors if running two Substrate nodes on the same machine without changing listening port explicitly.

@nazar-pc we'll just need to provide a different --listen-addr port for libp2p for the installed protocols and everything will work as before? (ie we can have libp2p and litep2p side by side or even 2 libp2p nodes without interferences)

nazar-pc

@nazar-pc we'll just need to provide a different --listen-addr port for libp2p for the installed protocols and everything will work as before? (ie we can have libp2p and litep2p side by side or even 2 libp2p nodes without interferences)

Yes, exactly. We basically no longer have fallback to a random listening port if 2 libp2p instances are claiming the same one.

nazar-pc · 2024-11-15T07:31:17Z

substrate/client/network/src/peer_info.rs

-					return Poll::Ready(ToSwarm::RemoveListener { id }),
+				Poll::Ready(event) => {
+					return Poll::Ready(event.map_in(Either::Left).map_out(|_| {
+						unreachable!("`GenerateEvent` is handled in a branch above; qed")


This assertion is statically correct. event is non-exhaustive, so they added .map_in() and .map_out() methods to exhaustively process it.

In this case .map_out()'s argument corresponds to ev from ToSwarm::GenerateEvent(ev) above and since we did match on ToSwarm::GenerateEvent(ev) already, there is no way .map_out() will ever be called, hence unreachable.

nazar-pc · 2024-11-15T07:32:34Z

substrate/client/network/src/discovery.rs


 			config.set_replication_factor(kademlia_replication_factor);
-			// Populate kad with both the legacy and the new protocol names.
-			// Remove the legacy protocol:
-			// https://github.com/paritytech/polkadot-sdk/issues/504


Yes, I specifically used kademlia_legacy_protocol: _, above instead of full removal of the property to reduce the inevitable conflict with #6109 whichever ends up being first.

nazar-pc · 2024-11-15T07:36:01Z

substrate/client/network/src/request_responses.rs

+			SwarmConfig::with_executor(TokioExecutor(runtime))
+				// This is taken care of by notification protocols in non-test environment
+				// It is very slow in test environment for some reason, hence larger timeout
+				.with_idle_connection_timeout(Duration::from_secs(300)),


It worked differently due to timeouts as described in libp2p/rust-libp2p#4306

I did not investigate deeply why this particular test takes so long to run with others sometimes (it runs quickly on its own though), someone should definitely look into it at some point. However this test only tests a single behavior in isolation, while the actual Substrate network notification protocol keeps connection alive, see substrate/client/network/src/protocol/notifications/handler.rs.

nazar-pc · 2024-11-15T07:39:35Z

substrate/client/network/src/service.rs

-			transport::build_transport(
-				local_identity.clone().into(),
-				config_mem,
-				network_config.yamux_window_size,


IIRC we saw some improvements, but nothing groundbreaking. Although our DSN networking stack works substantially differently from Substrate to draw any conclusions here.

nazar-pc · 2024-11-15T07:41:21Z

substrate/client/network/src/service.rs

-						)))) => "sync-notifications-clogged",
-						Some(ConnectionError::Handler(Either::Left(Either::Left(
-							Either::Right(Either::Left(_)),
-						)))) => "ping-timeout",


I suspect they were merged into Io, ConnectionError::Handler variant doesn't exist upstream anymore

nazar-pc · 2024-11-15T07:44:43Z

substrate/client/network/src/transport.rs

@@ -29,28 +29,21 @@ use libp2p::{
 };
 use std::{sync::Arc, time::Duration};

+// TODO: Create a wrapper similar to upstream `BandwidthTransport` that tracks sent/received bytes


Not really, this tracker is ultimately used to display in/out bandwidth in the informer, which will be needed regardless of prometheus library used. Though we will be able to avoid this for metrics purposes with prometheus library upgrade indeed.

# Conflicts: # Cargo.lock # substrate/client/authority-discovery/src/worker/tests.rs # substrate/client/network/src/event.rs # substrate/client/network/src/litep2p/discovery.rs # substrate/client/network/src/litep2p/service.rs # substrate/client/network/src/service.rs # substrate/client/network/src/service/traits.rs

nazar-pc · 2024-11-15T08:21:36Z

Merged master, mostly imports conflicts + had to update explicit libp2p-kad dependency to the latest version to match this PR after #5842

dmitry-markin · 2024-11-18T08:17:31Z

Merged master, mostly imports conflicts + had to update explicit libp2p-kad dependency to the latest version to match this PR after #5842

Thanks @nazar-pc .

Versi burn-in revealed spamming by messages

WARN tokio-runtime-worker sub-libp2p: Libp2p => Unhandled Kademlia event: Bootstrap(Ok(BootstrapOk { peer: PeerId("*"), num_remaining: * }))

Also, the peer count seems to be less stable than before, but I need to check if it's due to libp2p PR or some other change.

nazar-pc mentioned this pull request Oct 27, 2024

network: Update libp2p to 0.54.1 #5996

Open

nazar-pc changed the title ~~Upgrade libp2p from 0.52.4 to 0.53.2~~ Upgrade libp2p from 0.52.4 to 0.54.1 Oct 27, 2024

nazar-pc added 4 commits October 27, 2024 22:44

Initial upgrade libp2p to 0.53.2 (some deprecations are not fixed yet)

de76b66

Resolve many libp2p deprecations, suppress others

99432cc

Add prdoc

aa14790

Upgrade libp2p to 0.54.1

7903b11

nazar-pc force-pushed the libp2p-0.53.x branch from 1d112cd to 7903b11 Compare October 27, 2024 20:45

dmitry-markin assigned nazar-pc Oct 28, 2024

dmitry-markin added the T0-node This PR/Issue is related to the topic “node”. label Oct 28, 2024

lexnv self-requested a review October 30, 2024 11:07

dmitry-markin self-requested a review November 3, 2024 19:07

dmitry-markin mentioned this pull request Nov 4, 2024

[network] Confirm and get rid of keep-alives in Notifications protocol #6350

Open

dmitry-markin approved these changes Nov 4, 2024

View reviewed changes

nazar-pc added 2 commits November 4, 2024 18:52

Address review feedback

46e722b

Merge remote-tracking branch 'upstream/master' into libp2p-0.53.x

5591612

nazar-pc added 2 commits November 5, 2024 07:21

Increase test timeout further to make sure it passes in CI

614bb83

Merge remote-tracking branch 'upstream/master' into libp2p-0.53.x

56e40c6

dmitry-markin reviewed Nov 5, 2024

View reviewed changes

substrate/client/network/src/request_responses.rs Outdated Show resolved Hide resolved

substrate/client/network/src/request_responses.rs Outdated Show resolved Hide resolved

minor: revert string literal formatting changes

6c2b051

dmitry-markin reviewed Nov 5, 2024

View reviewed changes

substrate/client/network/src/protocol/notifications/handler.rs Show resolved Hide resolved

nazar-pc added 2 commits November 5, 2024 21:17

Add TODO

28f1849

Merge remote-tracking branch 'upstream/master' into libp2p-0.53.x

ba7a246

# Conflicts: # Cargo.lock

lexnv reviewed Nov 13, 2024

View reviewed changes

lexnv mentioned this pull request Nov 13, 2024

network: Remove CLI flag for yamux_window_size #6468

Open

lexnv reviewed Nov 13, 2024

View reviewed changes

lexnv approved these changes Nov 13, 2024

View reviewed changes

nazar-pc commented Nov 15, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade libp2p from 0.52.4 to 0.54.1 #6248

Upgrade libp2p from 0.52.4 to 0.54.1 #6248

nazar-pc commented Oct 27, 2024 •

edited

Loading

nazar-pc commented Oct 27, 2024

dmitry-markin left a comment

dmitry-markin Nov 4, 2024

rockbmb commented Nov 4, 2024

substrate-tip-bot bot commented Nov 4, 2024

nazar-pc commented Nov 5, 2024

dmitry-markin commented Nov 5, 2024

nazar-pc commented Nov 5, 2024

nazar-pc commented Nov 7, 2024

dmitry-markin commented Nov 7, 2024

nazar-pc commented Nov 11, 2024

lexnv Nov 13, 2024

nazar-pc Nov 15, 2024

lexnv Nov 13, 2024

nazar-pc Nov 15, 2024

lexnv Nov 13, 2024

nazar-pc Nov 15, 2024

lexnv Nov 13, 2024

nazar-pc Nov 15, 2024

lexnv Nov 13, 2024

nazar-pc Nov 15, 2024

lexnv Nov 13, 2024

nazar-pc Nov 15, 2024

lexnv left a comment

lexnv commented Nov 13, 2024

nazar-pc left a comment

nazar-pc Nov 15, 2024

nazar-pc Nov 15, 2024

nazar-pc Nov 15, 2024

nazar-pc Nov 15, 2024

nazar-pc Nov 15, 2024

nazar-pc Nov 15, 2024

nazar-pc commented Nov 15, 2024

dmitry-markin commented Nov 18, 2024

Upgrade libp2p from 0.52.4 to 0.54.1 #6248

Are you sure you want to change the base?

Upgrade libp2p from 0.52.4 to 0.54.1 #6248

Conversation

nazar-pc commented Oct 27, 2024 • edited Loading

Description

Integration

Review Notes

Checklist

nazar-pc commented Oct 27, 2024

dmitry-markin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rockbmb commented Nov 4, 2024

substrate-tip-bot bot commented Nov 4, 2024

nazar-pc commented Nov 5, 2024

dmitry-markin commented Nov 5, 2024

nazar-pc commented Nov 5, 2024

nazar-pc commented Nov 7, 2024

dmitry-markin commented Nov 7, 2024

nazar-pc commented Nov 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lexnv left a comment

Choose a reason for hiding this comment

lexnv commented Nov 13, 2024

nazar-pc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nazar-pc commented Nov 15, 2024

dmitry-markin commented Nov 18, 2024

nazar-pc commented Oct 27, 2024 •

edited

Loading