core/: Concurrent connection attempts #2248

mxinden · 2021-09-26T10:37:30Z

Main feature

Concurrently dial candidates within a single dial attempt.

For the concrete interface changes, see the changelog entries in core/CHANGELOG.md and swarm/CHANGELOG.md.

The main motivation for this feature is to increase the success rate of hole punching (see #1896 (comment) for details). As a nice side effect, and as one would expect, it also improves connection establishment time.

95th percentile duration of a Kademlia GetClosestPeers request:

50th percentile duration of a Kademlia GetClosestPeers request:

You can explore this data and more yourself at the URL below. On the left you have a Kademlia Exporter running with this feature, on the right you have a Kademlia Exporter running without.

https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&var-data_source=Prometheus&var-instance=kademlia-exporter-ipfs-concurrent:8080&var-instance=kademlia-exporter-ipfs:8080&from=now-3h&to=now&refresh=10s

For now, the concurrency factor is not configurable by the user. I think a value of 5 is a sane default. In the future I am fine with exposing this value as a configuration parameter, though I would suggest we stick with this value for a first iteration.

See #2248 (comment).
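As a rough illustration of dialing with a bounded concurrency factor, the sketch below uses only the Rust standard library (threads and a channel) instead of the futures-based implementation in this pull request. The names `concurrent_dial`, `fake_dial`, `DialErrors`, and the plain string addresses are hypothetical and chosen for the example; they are not part of the libp2p API.

```rust
use std::sync::mpsc;
use std::thread;

/// Errors collected during one dial, as (address, error message) pairs.
/// Hypothetical type for this sketch only.
type DialErrors = Vec<(String, String)>;

/// Hypothetical stand-in for a transport dial: addresses starting with "ok" succeed.
fn fake_dial(addr: &str) -> Result<(), String> {
    if addr.starts_with("ok") {
        Ok(())
    } else {
        Err(format!("unreachable: {}", addr))
    }
}

/// Dial `addrs` with at most `factor` attempts in flight at any time.
///
/// Returns the first address that succeeds together with the errors seen
/// up to that point, or all errors if every attempt fails.
fn concurrent_dial(
    addrs: Vec<String>,
    factor: usize,
    dial: fn(&str) -> Result<(), String>,
) -> Result<(String, DialErrors), DialErrors> {
    assert!(factor > 0, "concurrency factor must be at least 1");

    let (tx, rx) = mpsc::channel();
    let mut pending = addrs.into_iter();
    let mut in_flight = 0;
    let mut errors = DialErrors::new();

    loop {
        // Start new attempts until `factor` are in flight or no addresses remain.
        while in_flight < factor {
            match pending.next() {
                Some(addr) => {
                    let tx = tx.clone();
                    thread::spawn(move || {
                        let outcome = dial(&addr);
                        // The receiver may be gone if another attempt already won.
                        let _ = tx.send((addr, outcome));
                    });
                    in_flight += 1;
                }
                None => break,
            }
        }

        // Nothing in flight and nothing left to start: every attempt failed.
        if in_flight == 0 {
            return Err(errors);
        }

        // Wait for the next attempt to finish. `tx` is still alive here,
        // so the channel cannot be disconnected.
        let (addr, outcome) = rx.recv().expect("sender is kept alive by `tx`");
        in_flight -= 1;
        match outcome {
            // First success wins. Losing attempts are detached threads that
            // simply run to completion in the background in this sketch.
            Ok(()) => return Ok((addr, errors)),
            Err(e) => errors.push((addr, e)),
        }
    }
}
```

Calling `concurrent_dial(addrs, 5, fake_dial)` would mirror the default factor of 5 mentioned above. Returning the collected errors on success as well as on failure is a design choice of this sketch, so that a caller can see which addresses failed before one succeeded.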

Cleanups and fixes done along the way

Merge pool.rs and manager.rs.
Instead of manually implementing state machines in task.rs use async/await.
Fix bug where NetworkBehaviour::inject_connection_closed is called without a previous NetworkBehaviour::inject_connection_established (see Invalid Swarm behaviour #2242).
Return handler to behaviour on incoming connection limit error. I missed this case in Invalid Swarm behaviour #2242.

Additional notes

I am sorry this ended up as such a large change. In my eyes it simplifies many of the somewhat dated structures in libp2p-core.

mxinden · 2021-10-13T17:12:39Z

So far this hasn't had any of the issues I was seeing in #2242

Thanks @AgeManning for testing! That is great to hear.

Thanks @thomaseizinger and @elenaf9 for the reviews. Much appreciated! Good to have additional input on these changes.

I addressed all your comments. Let me know in case you would like to see any additional changes.

thomaseizinger

Thank you!

core/CHANGELOG.md

With libp2p#2248 a connection task `await`s sending an event to the behaviour before polling for new events from the behaviour [1].

When `Swarm::poll` is unable to deliver an event to a connection task it returns `Poll::Pending` even though (a) polling `Swarm::network` might be able to make progress (`network_not_ready` being `false`) and (b) it does not register a waker to be woken up [2].

In combination this can lead to a deadlock where a connection task waits to send an event to the behaviour and `Swarm::poll` returns `Poll::Pending`, failing to send an event to the connection task and not registering a waker in order to be polled again.

With this commit `Swarm::poll` will only return `Poll::Pending`, when failing to deliver an event to a connection task, if the network is unable to make progress (i.e. `network_not_ready` being `true`).

In the long run `Swarm::poll` should likely be redesigned, prioritizing the behaviour over the network, given the former is the control plane and the latter potentially yields new work from the outside.

[1]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/core/src/connection/pool/task.rs#L224-L232
[2]: https://github.com/libp2p/rust-libp2p/blob/ca1b7cf043b4264c69b19fe75de488330a7a1f2f/swarm/src/lib.rs#L756-L783
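The decision described in this commit message can be modelled as a small pure function. This is a simplified sketch, not the actual `Swarm::poll` code: the `Next` enum and the functions `old_next_step` and `next_step` are hypothetical names, and the two booleans stand in for "the event was delivered to the connection task" and the `network_not_ready` flag from the message.

```rust
/// What the poll loop does after trying to deliver a behaviour event
/// to a connection task. Hypothetical type for this sketch only.
#[derive(Debug, PartialEq)]
enum Next {
    /// The event was delivered; keep looping.
    Continue,
    /// Delivery is blocked, but the network may still make progress,
    /// so poll the network again instead of returning.
    PollNetwork,
    /// Neither side can make progress; return `Poll::Pending`.
    ReturnPending,
}

/// Behaviour before the fix, as described above: any failed delivery
/// returns `Poll::Pending`, even when the network could make progress.
fn old_next_step(delivered: bool, _network_not_ready: bool) -> Next {
    if delivered {
        Next::Continue
    } else {
        Next::ReturnPending
    }
}

/// Behaviour after the fix: a failed delivery only returns `Poll::Pending`
/// when the network is also unable to make progress.
fn next_step(delivered: bool, network_not_ready: bool) -> Next {
    match (delivered, network_not_ready) {
        (true, _) => Next::Continue,
        (false, false) => Next::PollNetwork,
        (false, true) => Next::ReturnPending,
    }
}
```

The two functions differ only in the case where delivery fails while the network is ready (`delivered == false`, `network_not_ready == false`), which is the situation the commit message identifies as the source of the deadlock.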


Since libp2p#2248 dial attempts are no longer reported per address, but instead reported for all addresses of a single dial at once. This commit updates the comment accordingly.



mxinden added 13 commits August 31, 2021 19:00

*: Implement concurrent dialing

75292bb

core/: Track pending connection via Endpoint

10d941b

core/: Add PendingPoint

f01b323

swarm/src/lib: Adjust

d6da039

*: Associate Multiaddr with transport error

5887d69

protocols: Update

08c196f

core/src/: Remove printlns

6c1ce85

misc/metrics: Update

9070bd2

core/: Remove printlns

4043970

core/src/network/concurrent: Catch address error

d594208

Merge branch 'libp2p/master' into concurrent-dial

bdfd5d1

core/src/connection: Remove meta file

a705bb0

Merge branch 'libp2p/master' into concurrent-dial

436b876

This was referenced Sep 26, 2021

NAT traversal #2052

Open

swarm/: Enable dialing a specific fixed set of addresses for a single peer #2249

Closed

Merge branch 'master' into concurrent-dial

db3fba7

mxinden mentioned this pull request Sep 28, 2021

core/: Merge pending and established connection limits #2253

Closed

mxinden added 13 commits September 29, 2021 17:42

core/: Stage connection task in pending and established

63c4a32

core/src/connection: Send pending result through events channel

eb86b41

core/src/connection: Handle connection error

ae21f9e

core/src/connection: Use oneshot Void for pending connection

7d3114b

core/src/connection: Remove task

c43496d

core/: Bubble dial errors on success up

e06131c

*: Pass dial success errors to behaviours

7c6a0fe

core/src/network/concurrent_dial: Limit concurrency factor

752bb70

core/src/network/concurrent_dial: Remove bound on Debug

0dcd1cf

core/src/connection: Remove task module

2392249

core/src/connection: Revive Pool::disconnect

b27c735

core/src/connection: Fold manager into pool

1703d40

core/src/connection: Use different channel for pending and established

794ca73

mxinden added 9 commits October 13, 2021 17:55

core/src/connection: Fix doc comment on to_pending_point

ecdb812

core/src/connection: Rename trait bound TransportError to TTransErr

2a6dc54

core/src/connection/pool: Impl start_close on EstablishedConnInfo

e7beec4

core/src/network: Fix trait bounds on DialingOpts

34daf90

core/src/network/peer: Fix doc comment on DialingAttempt

d099106

core/src/network/peer: Remove outdated comment on DialingAttempt::abort

d5e06b6

{core,swarm}/: Rename outgoing to concurrent_dial_errors

7331d90

core/src/connection/pool: Implement expect_occupied for hash_map::Entry

9fc5870

Merge branch 'libp2p/master' into concurrent-dial

e840d65

thomaseizinger approved these changes Oct 13, 2021


mxinden added 3 commits October 14, 2021 14:45

Merge branch 'master' into concurrent-dial

c35992a

{core,swarm}/CHANGELOG: Update now that concurrency is configurable

cb553cf

protocols/src/identify: Update to changed address failure reporting

04c694f

mxinden commented Oct 14, 2021


core/CHANGELOG.md: Fix typo

7aafc52

mxinden merged commit 40c5335 into libp2p:master Oct 14, 2021

This was referenced Oct 14, 2021

core/: Concurrent connection attempts - aka. happy eyeball #1896

Closed

*: Prepare v0.40.0-rc.1 release #2290

Merged

mxinden mentioned this pull request Oct 20, 2021

swarm/src/lib: Continue polling network when behaviour is blocked #2304

Merged

iduartgomez mentioned this pull request Oct 24, 2021

NAT traversal freenet/freenet-core#2

Closed

10 tasks

mxinden mentioned this pull request Jan 12, 2022

swarm/src/lib: Update outdated SwarmEvent::Dialing comment #2429

Merged

elenaf9 mentioned this pull request Aug 19, 2022

Re-design the StreamMuxer trait #2722

Closed

24 tasks
