
Upgrade libp2p from 0.52.4 to 0.54.1 #6248

Open · wants to merge 12 commits into master

Conversation

@nazar-pc (Contributor) commented Oct 27, 2024

Description

Fixes #5996

https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.53.0
https://github.com/libp2p/rust-libp2p/blob/master/CHANGELOG.md

Integration

Nothing special is needed; just note that yamux_window_size is no longer applicable to libp2p (litep2p seems to still have it, though).

Review Notes

libp2p 0.53 made a few simplifications and improvements to the swarm interface; I'll list the key/applicable ones here.

libp2p/rust-libp2p#4788 removed the write_length_prefixed function, so I inlined its code instead.
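For reference, the removed helper just wrote an unsigned-varint length prefix followed by the payload. A minimal sketch of the inlined logic, assuming the futures and unsigned-varint crates (not necessarily the exact code in this PR):

use futures::{AsyncWrite, AsyncWriteExt};
use std::io;

// Roughly what the removed `write_length_prefixed` did: write the payload
// length as an unsigned varint, then the payload itself, then flush.
async fn write_length_prefixed(
    socket: &mut (impl AsyncWrite + Unpin),
    data: impl AsRef<[u8]>,
) -> io::Result<()> {
    let data = data.as_ref();
    let mut len_buf = unsigned_varint::encode::usize_buffer();
    socket.write_all(unsigned_varint::encode::usize(data.len(), &mut len_buf)).await?;
    socket.write_all(data).await?;
    socket.flush().await?;
    Ok(())
}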

libp2p/rust-libp2p#4120 introduced the new libp2p::SwarmBuilder to replace the now-deprecated libp2p::swarm::SwarmBuilder. The transition is straightforward and quite ergonomic (as can be seen in the tests).
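To illustrate the new pattern, a minimal hypothetical sketch (not the exact code from this PR; assumes the tokio, tcp, noise, yamux and ping features are enabled):

use libp2p::{noise, ping, tcp, yamux, SwarmBuilder};

// libp2p 0.53+: transport, security, multiplexer and behaviour are supplied
// step by step instead of being assembled by hand.
fn build_swarm() -> Result<libp2p::Swarm<ping::Behaviour>, Box<dyn std::error::Error>> {
    Ok(SwarmBuilder::with_new_identity()
        .with_tokio()
        .with_tcp(tcp::Config::default(), noise::Config::new, yamux::Config::default)?
        .with_behaviour(|_keypair| ping::Behaviour::default())?
        .build())
}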

libp2p/rust-libp2p#4581 is the most annoying change here: it makes many enums #[non_exhaustive]. I mapped some variants, and for those that couldn't be mapped I print log messages once they are hit (the best solution I could come up with, at least on stable Rust).
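The pattern for the unmappable ones is roughly the following (a schematic fragment; ExternalEvent, handle_known and the log target are illustrative names, not the exact ones from the PR):

// `#[non_exhaustive]` on an upstream enum forces a wildcard arm, so variants
// that can't be meaningfully mapped are logged instead of silently dropped.
match event {
    ExternalEvent::Known(value) => handle_known(value),
    // Also covers variants added by future libp2p versions.
    event => log::error!(target: "sub-libp2p", "Unhandled event: {event:?}"),
}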

libp2p/rust-libp2p#4306 makes connections close as soon as no handler is using them, so I had to replace KeepAlive::Until with an explicit future that flips an internal boolean after a timeout, recovering the old behavior; ideally this should be removed completely at some point.
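The replacement looks roughly like this (a sketch with hypothetical names, assuming the futures-timer crate):

use futures::FutureExt;
use futures_timer::Delay;
use std::{task::Context, time::Duration};

// Emulates the removed `KeepAlive::Until`: a timer future flips a boolean
// once it fires, and `connection_keep_alive()` reports that boolean.
struct KeepAliveState {
    keep_alive: bool,
    timeout: Delay,
}

impl KeepAliveState {
    fn new(duration: Duration) -> Self {
        Self { keep_alive: true, timeout: Delay::new(duration) }
    }

    // Driven from the handler's `poll()`.
    fn poll_timeout(&mut self, cx: &mut Context<'_>) {
        if self.keep_alive && self.timeout.poll_unpin(cx).is_ready() {
            // The deadline passed with no activity: allow the connection to close.
            self.keep_alive = false;
        }
    }

    // `ConnectionHandler::connection_keep_alive()` now returns a plain `bool`.
    fn connection_keep_alive(&self) -> bool {
        self.keep_alive
    }
}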

yamux_window_size is no longer used by libp2p thanks to libp2p/rust-libp2p#4970, and Yamux should generally have higher performance now.

I have resolved and cleaned up all libp2p-related deprecations except BandwidthSinks. libp2p deprecated it, though it is still present in 0.54.1, which is why I didn't handle it just yet. Ideally, Substrate would finally switch to the official Prometheus client, in which case we'd get these metrics for free; otherwise, a bit of code will need to be copy-pasted to preserve the current behavior once BandwidthSinks is gone. I left a TODO about this.

The biggest change in 0.54.0 is libp2p/rust-libp2p#4568, which changed transport APIs and enabled unconditional potential port reuse. This can lead to very confusing errors when running two Substrate nodes on the same machine without explicitly changing the listening port.

Overall nothing scary here, but testing is always appreciated.

Checklist

  • My PR includes a detailed description as outlined in the "Description" and its two subsections above.
  • My PR follows the labeling requirements of this project (at minimum one label for T required)
    • External contributors: ask maintainers to put the right label on your PR.

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

@nazar-pc:

Looks like something called zombienet-orchestrator still pulls in an old version of libp2p, so whoever maintains that will have some work to do after this.

@nazar-pc changed the title from "Upgrade libp2p from 0.52.4 to 0.53.2" to "Upgrade libp2p from 0.52.4 to 0.54.1" on Oct 27, 2024
@dmitry-markin (Contributor) left a comment

Nothing scary indeed. Thanks a lot for upgrading libp2p for us!

Resolved review thread: substrate/client/network/sync/src/engine.rs
Comment on lines -225 to -232
// Populate kad with both the legacy and the new protocol names.
// Remove the legacy protocol:
// https://github.com/paritytech/polkadot-sdk/issues/504
let kademlia_protocols = if let Some(legacy_protocol) = kademlia_legacy_protocol {
vec![kademlia_protocol.clone(), legacy_protocol]
} else {
vec![kademlia_protocol.clone()]
};
Contributor

Looks good! We should finally do this)

Resolved review threads: substrate/client/network/src/discovery.rs (×2), substrate/client/network/src/peer_info.rs, substrate/client/network/src/request_responses.rs (×3)
@rockbmb (Contributor) commented Nov 4, 2024

/tip small


Only members of paritytech/tip-bot-approvers have permission to request the creation of the tip referendum from the bot.

However, you can create the tip referendum yourself using Polkassembly or PolkadotJS Apps.

@nazar-pc commented Nov 5, 2024

Looks like in your CI request_responses::tests::max_response_size_exceeded runs even slower than I anticipated. Would it be okay if I set an even larger connection timeout for this test?

@dmitry-markin (Contributor):

Looks like in your CI request_responses::tests::max_response_size_exceeded runs even slower than I anticipated. Would it be okay if I set an even larger connection timeout for this test?

Looking at the test, it doesn't seem to do anything time-consuming. And a longer timeout was not needed with the old libp2p?

@nazar-pc commented Nov 5, 2024

Looking at the test, it doesn't seem to do anything time-consuming. And a longer timeout was not needed with the old libp2p?

It runs quickly on its own, but when running all network tests at once it suddenly takes much longer and always seems to finish later than most tests. I didn't debug why very deeply, though, primarily because, as the comment states, this is not really applicable to how it is actually used in Substrate; it's test-specific behavior. Bumping the timeout to 5 minutes fixed the test and shouldn't be particularly flaky.

Not sure about the older version; I didn't run the tests in a loop to try to reproduce it, but the keep-alive behavior has certainly changed.

@nazar-pc commented Nov 7, 2024

I remember that with previous libp2p upgrades some kind of burn-in testing was done. Can it be triggered for this PR, hopefully making it into stable2412 if all goes well 🤞?

@dmitry-markin (Contributor):

I remember that with previous libp2p upgrades some kind of burn-in testing was done. Can it be triggered for this PR, hopefully making it into stable2412 if all goes well 🤞?

Yes, that was the plan: I was going to do a Versi burn-in once it is available after the syncing refactoring testing. Last time, issues with a libp2p upgrade popped up a week after running the nodes, so we need to run the burn-in for at least a week or so.

Unfortunately, the branch-off for stable2412 was planned for yesterday and is going to happen today, so I don't think the libp2p upgrade will make it into the December release.

@nazar-pc:

I'll probably be backporting it then, so any feedback that can be collected until then would be helpful regardless.

return Poll::Ready(ToSwarm::RemoveListener { id }),
Poll::Ready(event) => {
return Poll::Ready(event.map_in(Either::Left).map_out(|_| {
unreachable!("`GenerateEvent` is handled in a branch above; qed")
Contributor

nit: I'm a bit afraid we might be missing events in cases where they are generated a few hours/days after starting the node. Could we instead make this an error log?

Contributor Author

This assertion is statically correct. event is non-exhaustive, so they added the .map_in() and .map_out() methods to process it exhaustively.

In this case, .map_out()'s argument corresponds to ev from ToSwarm::GenerateEvent(ev) above, and since we already matched on ToSwarm::GenerateEvent(ev), there is no way .map_out() will ever be called, hence unreachable.
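For context, the surrounding code has roughly this shape (simplified sketch; `inner` stands for the wrapped behaviour):

// `GenerateEvent` is consumed by the first arm, so the `map_out` closure in
// the second arm can never be invoked.
match inner.poll(cx) {
    Poll::Ready(ToSwarm::GenerateEvent(ev)) => {
        // ... `ev` is handled here ...
    },
    Poll::Ready(event) => {
        return Poll::Ready(event.map_in(Either::Left).map_out(|_| {
            unreachable!("`GenerateEvent` is handled in a branch above; qed")
        }));
    },
    Poll::Pending => {},
}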


config.set_replication_factor(kademlia_replication_factor);
// Populate kad with both the legacy and the new protocol names.
// Remove the legacy protocol:
// https://github.com/paritytech/polkadot-sdk/issues/504
Contributor

I think we should wait a bit and scrape the network. I believe everything should be fine, but let's double check first.

I guess we can merge #6109 before this one, as it also handles the legacy_protocol removal, but it shouldn't matter too much.

I'll have a look this week and investigate versions; generally, I think it's safe for now.

Contributor Author

Yes, I specifically used kademlia_legacy_protocol: _, above instead of fully removing the property, to reduce the inevitable conflict with #6109, whichever ends up landing first.

SwarmConfig::with_executor(TokioExecutor(runtime))
// This is taken care of by notification protocols in non-test environment
// It is very slow in test environment for some reason, hence larger timeout
.with_idle_connection_timeout(Duration::from_secs(300)),
Contributor

dq: Was this working ok in the past?

It might indicate some issues with the debug build in terms of degraded libp2p performance? I don't believe we changed CI machines recently.

Contributor Author

It worked differently due to timeouts as described in libp2p/rust-libp2p#4306

I did not investigate deeply why this particular test sometimes takes so long when run together with the others (it runs quickly on its own); someone should definitely look into it at some point. However, this test only exercises a single behavior in isolation, while the actual Substrate network notification protocol keeps the connection alive; see substrate/client/network/src/protocol/notifications/handler.rs.

)))) => "sync-notifications-clogged",
Some(ConnectionError::Handler(Either::Left(Either::Left(
Either::Right(Either::Left(_)),
)))) => "ping-timeout",
Contributor

It looks like we don't propagate the ping-timeout and sync-clogged labels anymore. Are they merged into actively-closed?

Contributor Author

I suspect they were merged into Io; the ConnectionError::Handler variant doesn't exist upstream anymore.
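If that's right, the mapping presumably collapses to something like this (a guess against the upstream enum, not verified):

// `ConnectionError::Handler` is gone upstream; handler-initiated closes now
// surface as plain I/O errors.
let reason = match cause {
    Some(ConnectionError::IO(_)) => "transport-error",
    Some(ConnectionError::KeepAliveTimeout) => "keep-alive-timeout",
    None => "actively-closed",
};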

@@ -29,28 +29,21 @@ use libp2p::{
};
use std::{sync::Arc, time::Duration};

// TODO: Create a wrapper similar to upstream `BandwidthTransport` that tracks sent/received bytes
Contributor

Is this blocked by the prometheus-client crate update? It would be good to have an issue in Substrate so we don't forget about this, thanks.

Contributor Author

Not really; this tracker is ultimately used to display in/out bandwidth in the informer, which will be needed regardless of the Prometheus library used. Though with the Prometheus library upgrade we will indeed be able to avoid this for metrics purposes.
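If BandwidthSinks does go away, the copy-pasted replacement would be conceptually small, e.g. (hypothetical names; a transport wrapper would bump these counters on every read/write):

use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the totals `BandwidthSinks` maintains today; the informer reads
// them to display in/out bandwidth.
#[derive(Default)]
pub struct BandwidthCounters {
    inbound: AtomicU64,
    outbound: AtomicU64,
}

impl BandwidthCounters {
    pub fn note_read(&self, bytes: u64) {
        self.inbound.fetch_add(bytes, Ordering::Relaxed);
    }
    pub fn note_written(&self, bytes: u64) {
        self.outbound.fetch_add(bytes, Ordering::Relaxed);
    }
    pub fn total_inbound(&self) -> u64 {
        self.inbound.load(Ordering::Relaxed)
    }
    pub fn total_outbound(&self) -> u64 {
        self.outbound.load(Ordering::Relaxed)
    }
}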

transport::build_transport(
local_identity.clone().into(),
config_mem,
network_config.yamux_window_size,
Contributor

We should deprecate this CLI flag with the next litep2p release

dq: @nazar-pc Did you observe a significant improvement in terms of performance when switching to the autotuning impl? I know the synthetic benchmarks look solid, but for litep2p this did not translate into other improvements.

Contributor Author

IIRC we saw some improvements, but nothing groundbreaking. Then again, our DSN networking stack works so differently from Substrate's that it's hard to draw any conclusions here.

@lexnv (Contributor) left a comment

As always @nazar-pc thanks for the outstanding work here! 🙏

Before merging this I would like to sort out the legacy KAD protocol deprecation, as it might affect the discovery of nodes that have not updated for a while (cc #6109).

It would be good to have it running in our test stack (versi or versi-net) and to collect a report in terms of logs and a few dashboards (mostly interested in the performance side and connection stability; I remember that 0.52.4 had a few regressions) 🙏

@lexnv (Contributor) commented Nov 13, 2024

As part of our testing and performance comparison we start 2 nodes side by side. It might not be ideal, but I'm wondering if this will break the testing environment:

The biggest change in 0.54.0 is libp2p/rust-libp2p#4568, which changed transport APIs and enabled unconditional potential port reuse. This can lead to very confusing errors when running two Substrate nodes on the same machine without explicitly changing the listening port.

@nazar-pc we'll just need to provide a different --listen-addr port for libp2p for the installed protocols and everything will work as before? (i.e. we can have libp2p and litep2p side by side, or even 2 libp2p nodes, without interference)

@nazar-pc left a comment

@nazar-pc we'll just need to provide a different --listen-addr port for libp2p for the installed protocols and everything will work as before? (i.e. we can have libp2p and litep2p side by side, or even 2 libp2p nodes, without interference)

Yes, exactly. We basically no longer have a fallback to a random listening port when two libp2p instances claim the same one.
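In code terms, each instance simply needs its own explicit listen address now (illustrative fragment; the ports are examples):

// With unconditional port reuse there is no fallback to a random port, so two
// instances on one machine must be given distinct ports explicitly.
swarm.listen_on("/ip4/0.0.0.0/tcp/30333".parse()?)?;
// ... and on the second node:
// swarm.listen_on("/ip4/0.0.0.0/tcp/30334".parse()?)?;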


# Conflicts:
#	Cargo.lock
#	substrate/client/authority-discovery/src/worker/tests.rs
#	substrate/client/network/src/event.rs
#	substrate/client/network/src/litep2p/discovery.rs
#	substrate/client/network/src/litep2p/service.rs
#	substrate/client/network/src/service.rs
#	substrate/client/network/src/service/traits.rs
@nazar-pc:

Merged master; mostly import conflicts, plus I had to update the explicit libp2p-kad dependency to the latest version to match this PR after #5842.

@dmitry-markin (Contributor):

Merged master; mostly import conflicts, plus I had to update the explicit libp2p-kad dependency to the latest version to match this PR after #5842.

Thanks @nazar-pc.

The Versi burn-in revealed log spam with messages like:

WARN tokio-runtime-worker sub-libp2p: Libp2p => Unhandled Kademlia event: Bootstrap(Ok(BootstrapOk { peer: PeerId("*"), num_remaining: * }))

Also, the peer count seems to be less stable than before, but I need to check whether it's due to the libp2p PR or some other change.
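A likely shape of the fix for the spam is to match the bootstrap progress explicitly instead of letting it fall through to the catch-all arm (sketch only; kad::Event and kad::QueryResult names as in the libp2p-kad version this PR targets):

// Handle periodic bootstrap progress explicitly so it no longer hits the
// generic "Unhandled Kademlia event" warning.
kad::Event::OutboundQueryProgressed {
    result: kad::QueryResult::Bootstrap(result),
    ..
} => {
    log::trace!(target: "sub-libp2p", "Libp2p => Kademlia bootstrap: {result:?}");
},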

Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Code in review 🧐
Development

Successfully merging this pull request may close these issues.

network: Update libp2p to 0.54.1
4 participants