Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/: Concurrent connection attempts #2248

Merged
merged 74 commits into from
Oct 14, 2021
Merged

Conversation

mxinden
Copy link
Member

@mxinden mxinden commented Sep 26, 2021

Main feature

Concurrently dial candidates within a single dial attempt.

For the concrete interface changes, see the changelog entries in core/CHANGELOG.md and swarm/CHANGELOG.md.

Main motivation for this feature is to increase success rate on hole punching (see #1896 (comment) for details). Though, as a nice side effect, as one would expect, it does improve connection establishment time.

95th percentile duration of a Kademlia GetClosestPeers request:

image

50th percentile duration of a Kademlia GetClosestPeers request:

image

You can explore this data and more yourself at the URL below. On the left you have a Kademlia Exporter running with this feature, on the right you have a Kademlia Exporter running without.

https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&var-data_source=Prometheus&var-instance=kademlia-exporter-ipfs-concurrent:8080&var-instance=kademlia-exporter-ipfs:8080&from=now-3h&to=now&refresh=10s

For now, the concurrency factor is not configurable for the user. I think a value of 5 is a sane default. In the future I am fine exposing this value as a configuration parameter, though I would suggest we stick with this value as a first iteration.

See #2248 (comment).

Cleanups and fixes done along the way

  • Merge pool.rs and manager.rs.

  • Instead of manually implementing state machines in task.rs use async/await.

  • Fix bug where NetworkBehaviour::inject_connection_closed is called without a previous NetworkBehaviour::inject_connection_established (see Invalid Swarm behaviour #2242).

  • Return handler to behaviour on incoming connection limit error. I missed this case in Invalid Swarm behaviour #2242.

Additional notes

I am sorry this ended up as such a large change. In my eyes it simplifies many of the somewhat dated structures in libp2p-core.

@mxinden
Copy link
Member Author

mxinden commented Oct 13, 2021

So far this hasn't had any of the issues I was seeing in #2242

Thanks @AgeManning for testing! That is great to hear.


Thanks @thomaseizinger and @elenaf9 for the reviews. Much appreciated! Good to have additional input on these changes.

I addressed all your comments. Let me know in case you would like to see any additional changes.

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

core/CHANGELOG.md Outdated Show resolved Hide resolved
@mxinden mxinden merged commit 40c5335 into libp2p:master Oct 14, 2021
mxinden added a commit to mxinden/rust-libp2p that referenced this pull request Oct 20, 2021
With libp2p#2248 a connection task
`await`s sending an event to the behaviour before polling for new events
from the behaviour [1].

When `Swarm::poll` is unable to deliver an event to a connection task it
returns `Poll::Pending` even though (a) polling `Swarm::network` might be
able to make progress (`network_not_ready` being `false`) and (b) it
does not register a waker to be woken up [2].

In combination this can lead to a deadlock where a connection task waits
to send an event to the behaviour and `Swarm::poll` returns
`Poll::Pending` failing to send an event to the connection task, not
registering a waiker in order to be polled again.

With this commit `Swarm::poll` will only return `Poll::Pending`, when
failing to deliver an event to a connection task, if the network is
unable to make progress (i.e. `network_not_ready` being `true`).

In the long-run `Swarm::poll` should likely be redesigned, prioritizing
the behaviour over the network, given the former is the control plane
and the latter potentially yields new work from the outside.

[1]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/core/src/connection/pool/task.rs#L224-L232
[2]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/swarm/src/lib.rs#L756-L783
@iduartgomez iduartgomez mentioned this pull request Oct 24, 2021
10 tasks
mxinden added a commit that referenced this pull request Oct 26, 2021
)

With #2248 a connection task
`await`s sending an event to the behaviour before polling for new events
from the behaviour [1].

When `Swarm::poll` is unable to deliver an event to a connection task it
returns `Poll::Pending` even though (a) polling `Swarm::network` might be
able to make progress (`network_not_ready` being `false`) and (b) it
does not register a waker to be woken up [2].

In combination this can lead to a deadlock where a connection task waits
to send an event to the behaviour and `Swarm::poll` returns
`Poll::Pending` failing to send an event to the connection task, not
registering a waiker in order to be polled again.

With this commit `Swarm::poll` will only return `Poll::Pending`, when
failing to deliver an event to a connection task, if the network is
unable to make progress (i.e. `network_not_ready` being `true`).

In the long-run `Swarm::poll` should likely be redesigned, prioritizing
the behaviour over the network, given the former is the control plane
and the latter potentially yields new work from the outside.

[1]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/core/src/connection/pool/task.rs#L224-L232
[2]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/swarm/src/lib.rs#L756-L783
mxinden added a commit to mxinden/rust-libp2p that referenced this pull request Jan 12, 2022
Since libp2p#2248 dial attempts are
no longer reported per address, but instead reported for all addresses
of a single dial at once.

This commit updates the comment accordingly.
mxinden added a commit that referenced this pull request Jan 14, 2022
Since #2248 dial attempts are
no longer reported per address, but instead reported for all addresses
of a single dial at once.

This commit updates the comment accordingly.
mxinden added a commit to mxinden/rust-libp2p that referenced this pull request Jan 15, 2022
Since libp2p#2248 dial attempts are
no longer reported per address, but instead reported for all addresses
of a single dial at once.

This commit updates the comment accordingly.
santos227 pushed a commit to santos227/rustlib that referenced this pull request Jun 20, 2022
…304)

With libp2p/rust-libp2p#2248 a connection task
`await`s sending an event to the behaviour before polling for new events
from the behaviour [1].

When `Swarm::poll` is unable to deliver an event to a connection task it
returns `Poll::Pending` even though (a) polling `Swarm::network` might be
able to make progress (`network_not_ready` being `false`) and (b) it
does not register a waker to be woken up [2].

In combination this can lead to a deadlock where a connection task waits
to send an event to the behaviour and `Swarm::poll` returns
`Poll::Pending` failing to send an event to the connection task, not
registering a waiker in order to be polled again.

With this commit `Swarm::poll` will only return `Poll::Pending`, when
failing to deliver an event to a connection task, if the network is
unable to make progress (i.e. `network_not_ready` being `true`).

In the long-run `Swarm::poll` should likely be redesigned, prioritizing
the behaviour over the network, given the former is the control plane
and the latter potentially yields new work from the outside.

[1]: https://github.com/libp2p/rust-libp2p/blob/20e22ae696237ccd4e32f044d508ab1a88f53a9b/core/src/connection/pool/task.rs#L224-L232
[2]: https://github.com/libp2p/rust-libp2p/blob/20e22ae696237ccd4e32f044d508ab1a88f53a9b/swarm/src/lib.rs#L756-L783
santos227 added a commit to santos227/rustlib that referenced this pull request Jun 20, 2022
Since libp2p/rust-libp2p#2248 dial attempts are
no longer reported per address, but instead reported for all addresses
of a single dial at once.

This commit updates the comment accordingly.
@elenaf9 elenaf9 mentioned this pull request Aug 19, 2022
24 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants