-
Notifications
You must be signed in to change notification settings - Fork 959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core/: Concurrent connection attempts #2248
Conversation
Thanks @AgeManning for testing! That is great to hear. Thanks @thomaseizinger and @elenaf9 for the reviews. Much appreciated! Good to have additional input on these changes. I addressed all your comments. Let me know in case you would like to see any additional changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
With libp2p#2248 a connection task `await`s sending an event to the behaviour before polling for new events from the behaviour [1]. When `Swarm::poll` is unable to deliver an event to a connection task it returns `Poll::Pending` even though (a) polling `Swarm::network` might be able to make progress (`network_not_ready` being `false`) and (b) it does not register a waker to be woken up [2]. In combination this can lead to a deadlock where a connection task waits to send an event to the behaviour and `Swarm::poll` returns `Poll::Pending` failing to send an event to the connection task, not registering a waiker in order to be polled again. With this commit `Swarm::poll` will only return `Poll::Pending`, when failing to deliver an event to a connection task, if the network is unable to make progress (i.e. `network_not_ready` being `true`). In the long-run `Swarm::poll` should likely be redesigned, prioritizing the behaviour over the network, given the former is the control plane and the latter potentially yields new work from the outside. [1]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/core/src/connection/pool/task.rs#L224-L232 [2]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/swarm/src/lib.rs#L756-L783
) With #2248 a connection task `await`s sending an event to the behaviour before polling for new events from the behaviour [1]. When `Swarm::poll` is unable to deliver an event to a connection task it returns `Poll::Pending` even though (a) polling `Swarm::network` might be able to make progress (`network_not_ready` being `false`) and (b) it does not register a waker to be woken up [2]. In combination this can lead to a deadlock where a connection task waits to send an event to the behaviour and `Swarm::poll` returns `Poll::Pending` failing to send an event to the connection task, not registering a waiker in order to be polled again. With this commit `Swarm::poll` will only return `Poll::Pending`, when failing to deliver an event to a connection task, if the network is unable to make progress (i.e. `network_not_ready` being `true`). In the long-run `Swarm::poll` should likely be redesigned, prioritizing the behaviour over the network, given the former is the control plane and the latter potentially yields new work from the outside. [1]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/core/src/connection/pool/task.rs#L224-L232 [2]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/swarm/src/lib.rs#L756-L783
Since libp2p#2248 dial attempts are no longer reported per address, but instead reported for all addresses of a single dial at once. This commit updates the comment accordingly.
Since #2248 dial attempts are no longer reported per address, but instead reported for all addresses of a single dial at once. This commit updates the comment accordingly.
Since libp2p#2248 dial attempts are no longer reported per address, but instead reported for all addresses of a single dial at once. This commit updates the comment accordingly.
…304) With libp2p/rust-libp2p#2248 a connection task `await`s sending an event to the behaviour before polling for new events from the behaviour [1]. When `Swarm::poll` is unable to deliver an event to a connection task it returns `Poll::Pending` even though (a) polling `Swarm::network` might be able to make progress (`network_not_ready` being `false`) and (b) it does not register a waker to be woken up [2]. In combination this can lead to a deadlock where a connection task waits to send an event to the behaviour and `Swarm::poll` returns `Poll::Pending` failing to send an event to the connection task, not registering a waiker in order to be polled again. With this commit `Swarm::poll` will only return `Poll::Pending`, when failing to deliver an event to a connection task, if the network is unable to make progress (i.e. `network_not_ready` being `true`). In the long-run `Swarm::poll` should likely be redesigned, prioritizing the behaviour over the network, given the former is the control plane and the latter potentially yields new work from the outside. [1]: https://github.com/libp2p/rust-libp2p/blob/20e22ae696237ccd4e32f044d508ab1a88f53a9b/core/src/connection/pool/task.rs#L224-L232 [2]: https://github.com/libp2p/rust-libp2p/blob/20e22ae696237ccd4e32f044d508ab1a88f53a9b/swarm/src/lib.rs#L756-L783
Since libp2p/rust-libp2p#2248 dial attempts are no longer reported per address, but instead reported for all addresses of a single dial at once. This commit updates the comment accordingly.
Main feature
Concurrently dial candidates within a single dial attempt.
For the concrete interface changes, see the changelog entries in
core/CHANGELOG.md
andswarm/CHANGELOG.md
.Main motivation for this feature is to increase success rate on hole punching (see #1896 (comment) for details). Though, as a nice side effect, as one would expect, it does improve connection establishment time.
95th percentile duration of a Kademlia GetClosestPeers request:
50th percentile duration of a Kademlia GetClosestPeers request:
You can explore this data and more yourself at the URL below. On the left you have a Kademlia Exporter running with this feature, on the right you have a Kademlia Exporter running without.
https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&var-data_source=Prometheus&var-instance=kademlia-exporter-ipfs-concurrent:8080&var-instance=kademlia-exporter-ipfs:8080&from=now-3h&to=now&refresh=10s
For now, the concurrency factor is not configurable for the user. I think a value of 5 is a sane default. In the future I am fine exposing this value as a configuration parameter, though I would suggest we stick with this value as a first iteration.See #2248 (comment).
Cleanups and fixes done along the way
Merge
pool.rs
andmanager.rs
.Instead of manually implementing state machines in
task.rs
useasync/await
.Fix bug where
NetworkBehaviour::inject_connection_closed
is called without a previousNetworkBehaviour::inject_connection_established
(see Invalid Swarm behaviour #2242).Return handler to behaviour on incoming connection limit error. I missed this case in Invalid Swarm behaviour #2242.
Additional notes
I am sorry this ended up as such a large change. In my eyes it simplifies many of the somewhat dated structures in
libp2p-core
.