Multiple connections per peer #1440
Conversation
After taking a look at …, the least intrusive solution that comes to mind is to let … The alternatives that come to mind involve making a …

I would go more in this direction in general.

While this is clearly where we are heading in general, I'd really like to keep the changes and especially API changes outside of …
Instead of trying to enforce a single connection per peer, which involves quite a bit of additional complexity, e.g. to prioritise simultaneously opened connections, and can have other undesirable consequences [1], we now make multiple connections per peer a feature. The gist of these changes is as follows:

The concept of a "node" with an implicit 1-1 correspondence to a connection has been replaced with the "first-class" concept of a "connection". The code from `src/nodes` has moved (with varying degrees of modification) to `src/connection`. A `HandledNode` has become a `Connection`, a `NodeHandler` a `ConnectionHandler`, the `CollectionStream` was the basis for the new `connection::Pool`, and so forth.

Conceptually, a `Network` contains a `connection::Pool` which in turn internally employs the `connection::Manager` for handling the background `connection::manager::Task`s, one per connection, as before. These are all considered implementation details. On the public API, `Peer`s are managed as before through the `Network`, except now the API has changed with the shift of focus to (potentially multiple) connections per peer. The `NetworkEvent`s have accordingly also undergone changes.

The Swarm APIs remain largely unchanged, except for the fact that `inject_replaced` is no longer called. It may now practically happen that multiple `ProtocolsHandler`s are associated with a single `NetworkBehaviour`, one per connection. If implementations of `NetworkBehaviour` rely somehow on communicating with exactly one `ProtocolsHandler`, this may cause issues, but it is unlikely.

[1]: paritytech/substrate#4272
Thanks for the detailed write-up for this patch-set.
Turns out it was necessary to do the following two small API changes on …

These changes are very straightforward in terms of code migration. See: ae6bfe9.
Thanks for doing this!
core/src/connection/manager.rs
```rust
    ///
    /// **Note:** Must only be called after `poll_ready_notify_handler`
    /// was successful and without interference by another thread, otherwise the event is discarded.
    pub fn notify_handler(&mut self, event: I) {
```
Why not replace this racy `poll_ready_notify_handler`/`notify_handler` combo with a non-racy version, e.g.
```rust
pub fn notify_handler(&mut self, event: I, cx: &mut Context) -> Option<I> {
    if self.poll_ready_notify_task(cx).is_pending() {
        return Some(event)
    }
    let cmd = task::Command::NotifyHandler(event);
    self.notify_task(cmd);
    None
}
```
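A caller of this proposed API would look roughly like the following sketch (the `flush_pending` helper and the one-event buffer are hypothetical; `notify` stands in for the `notify_handler` method above):

```rust
use std::task::Context;

// Hypothetical caller pattern: an event handed back means "not ready",
// so it must be buffered and retried on a later poll.
fn flush_pending<I>(
    pending: &mut Option<I>, // caller-side buffer for one undelivered event
    cx: &mut Context<'_>,
    mut notify: impl FnMut(I, &mut Context<'_>) -> Option<I>,
) {
    if let Some(event) = pending.take() {
        // `None` = delivered; `Some` = still pending, retry when woken.
        *pending = notify(event, cx);
    }
}
```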
This would be similar to the old `AsyncSink` enum, and it forces you to potentially store the event somewhere for later when the handler is ready. Meanwhile, with this "racy" API (it's not racy, since it requires `&mut`), you can generate the event with the guarantee that it is going to be delivered.
We had this discussion in #1373 already.
My preference would be to start with the simpler and safer API as per the suggestion of @twittner and potentially add the `poll_ready_x`/`x` combination later as an alternative if needed. That is, we may in the end offer both, just like `futures::channel::mpsc::Sender` does in the form of `try_send`, `start_send` and `poll_ready`.

To elaborate, I'm sure we all understand the motivation and trade-off behind these two styles of API, and we just have different priorities w.r.t. what we consider more important. Given that the `poll_ready_x`/`x` design is arguably both more "complex" (in that you have to use two methods instead of one) and more error-prone (in that you have to use these two methods in a certain way, subject as well to pitfalls w.r.t. multi-threading, as already discussed in #1373), I personally prefer to start with just providing `try_notify_handler` for now. The `poll_ready_notify_handler`/`notify_handler` combination could be added later if there is a strong demand or use-case (which, as far as I can tell, even the `Swarm` does not currently have).

Does that make sense and sound reasonable?
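For reference, a minimal sketch of the `Sender` triple in use (standard `futures` 0.3 API):

```rust
use futures::channel::mpsc;
use std::task::{Context, Poll};

// The two styles offered by `futures::channel::mpsc::Sender`:
fn demo(tx: &mut mpsc::Sender<u32>, cx: &mut Context<'_>) {
    // 1. One-shot, non-blocking; returns an error if the channel is full.
    let _ = tx.try_send(1);

    // 2. Two-step: check readiness first, then send.
    if let Poll::Ready(Ok(())) = tx.poll_ready(cx) {
        let _ = tx.start_send(2);
    }
}
```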
@tomaka Could you elaborate on the 👎 w.r.t. what aspect you disagree with? Do you insist on offering `poll_ready_notify_handler`/`notify_handler` from the start, or do you even have objections to offering `try_notify_handler`?
> This proposition has been refuted in #1373
It hasn't been refuted, I just gave up arguing.
> the safer alternative

The definition of "safe" in the context of Rust is "cannot trigger undefined behaviour". Misusing this API can in no way trigger undefined behaviour. The word "safe" is totally irrelevant here.
The worst that can happen is a panic resulting from us misusing the API. There are lots of places in the Rust standard library and in the Rust ecosystem where misusing the API will result in a panic. Polling a `Future` after it has returned `Ready`, to give the most prominent example.
Using this API in a correct way is as racy as `try_notify_handler` is, in the sense that sometimes it will return `Pending` and sometimes it will return `Ready`, depending on the timing of the other tasks/threads.
Misusing this API is racy in a similar way as using this API correctly, with the exception that you might get a panic (because you misused the API).
> the more dangerous
In order for me to change my point of view, I would like to get a precise definition of what "dangerous" means here for you.
My issue with `try_notify_handler` is the fact that there is no way for you to know whether your event will be accepted before generating said event.
Changing the design of `Sink` between futures 0.1 (where it looked like `try_notify_handler`) and 0.3 (where it looks like `poll_ready_notify_handler`/`notify_handler`) is to me one of the best decisions they made, as it considerably simplified lots of code. I don't want to revert to a futures-0.1-style API.
If panicking really is a big deal, then an API that I think is good is:
```rust
impl Manager {
    /// Try to give out a slot for sending an event.
    pub fn poll_ready_notify(&mut self, ...) -> Poll<EventSend> { ... }
}

impl<'a, T> EventSend<'a, T> {
    /// Sends the event, consuming `self`.
    pub fn send_event(self, event: T) { ... }
}
```
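A toy model of this slot-based design, to show the caller's side (names and types here are illustrative, not the proposed API itself):

```rust
use std::task::{Context, Poll};

// A slot that guarantees delivery of exactly one event.
struct Slot<'a> {
    queue: &'a mut Vec<u32>,
}

impl<'a> Slot<'a> {
    /// Sends the event, consuming `self`.
    fn send_event(self, event: u32) {
        self.queue.push(event);
    }
}

struct ToyManager {
    queue: Vec<u32>,
    capacity: usize,
}

impl ToyManager {
    /// Hands out a slot only if there is room in the queue.
    fn poll_ready_notify(&mut self, _cx: &mut Context<'_>) -> Poll<Slot<'_>> {
        if self.queue.len() < self.capacity {
            Poll::Ready(Slot { queue: &mut self.queue })
        } else {
            Poll::Pending // a real implementation would register the waker here
        }
    }
}

fn notify(manager: &mut ToyManager, cx: &mut Context<'_>) {
    if let Poll::Ready(slot) = manager.poll_ready_notify(cx) {
        slot.send_event(42); // the event is built only after a slot is secured
    }
}
```

The point of this design: an event is never constructed unless it can be delivered, so nothing has to be buffered or dropped.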
> The worst that can happen is a panic resulting from us misusing the API. There are lots of places in the Rust standard library and in the Rust ecosystem where misusing the API will result in a panic. Polling a `Future` after it has returned `Ready`, to give the most prominent example.
Note though that the underlying API we're talking about here, `futures::channel::mpsc::Sender`, never panics (and neither does our current API). Indeed, the `Sender` API says `poll_ready` should be called before `start_send`, and it consequently still returns an error if the channel is full. It is a bit telling, to me anyway, that they did not go with saying "`poll_ready` must be called before `start_send`" combined with panicking due to programmer error on `start_send` if the channel is full. And they also provide `try_send`, as demonstrated below.
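To illustrate the error-instead-of-panic behaviour, a minimal, self-contained demonstration of the `futures` 0.3 channel:

```rust
use futures::channel::mpsc;

fn demo() {
    let (mut tx, _rx) = mpsc::channel::<u32>(0);
    assert!(tx.try_send(1).is_ok());                // takes the sender's slot
    assert!(tx.try_send(2).unwrap_err().is_full()); // full: an error, not a panic
    assert!(tx.start_send(3).is_err());             // `start_send` without
                                                    // `poll_ready`: also an error
}
```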
> Using this API in a correct way is as racy as `try_notify_handler` is, in the sense that sometimes it will return `Pending` and sometimes it will return `Ready`, depending on the timing of the other tasks/threads.
Here we are at a point which I thought was clarified e.g. here. To recap, the crucial difference lies in the assumptions that the method that must be called second (e.g. `notify_handler`) places on the possible errors that can occur, and how it deals with them. If it assumes certain errors can never happen because of a requirement that another method must be called first, then calls to these two functions, which may be interleaved by multiple threads if e.g. `Mutex`es are used, can lead to unexpected behaviour (see the interleaving sketched below). Hence I already augmented the commentary on `notify_handler` to say that not only must `poll_ready_notify_handler` be called first, but also that no other thread must be allowed to interleave between `poll_ready_notify_handler` and `notify_handler`. Of course that is easy to satisfy if, as a user, you just call both of these `&mut` methods in sequence on the same thread without e.g. going through a `Mutex`, but nevertheless it is another thing to watch out for. We may agree to disagree on whether that justifies calling an API "racy", but in any case the `futures::channel::mpsc::Sender` API is actually not racy, whatever definition you choose, whereas the API provided by us, currently, is considered racy by myself and @twittner because it may silently (well, now it logs at least) ignore errors due to the channel being full when calling `notify_handler`.
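Schematically, the problematic interleaving (assuming, hypothetically, a manager shared behind a `Mutex` across two threads):

```rust
// Thread A: lock; poll_ready_notify_handler(cx) -> Ready(());  unlock.
// Thread B: lock; poll_ready_notify_handler(cx) -> Ready(());
//           notify_handler(event_b);   // takes the free channel slot
//           unlock.
// Thread A: lock; notify_handler(event_a);  // slot gone: event discarded!
//
// The `&mut` receivers rule out concurrent *calls*, but not this
// interleaving across separate lock acquisitions.
```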
I have added `try_notify_handler` in a new commit, in the hope that it is uncontroversial since it follows the example of the `futures::channel::mpsc::Sender` API, leaving the other functions in place. Is that acceptable?
> > This proposition has been refuted in #1373
>
> It hasn't been refuted, I just gave up arguing.
I would love to hear a new argument then. If "it's not racy, since it requires `&mut`" were true, why would we even have this discussion? Since the outcome of concurrent calls to `poll_ready_notify_handler` and `notify_handler` depends on relative timings, there clearly is a race condition here that the individual `&mut`s in each of these methods cannot prevent.
> the safer alternative
>
> The definition of "safe" in the context of Rust is "cannot trigger undefined behaviour". Misusing this API can in no way trigger undefined behaviour. The word "safe" is totally irrelevant here.
>
> [...]
>
> the more dangerous
>
> In order for me to change my point of view, I would like to get a precise definition of what "dangerous" means here for you.
Rust does not have a monopoly on the word "safe", and since you were asking for a definition of "dangerous", I am happy to elaborate: assuming there are undesirable program states, e.g. a panic, an API is safe w.r.t. those states if it can by construction guarantee that those states can never occur. Correspondingly, an API is dangerous w.r.t. those states if it cannot give those guarantees.
> Misusing this API is racy in a similar way as using this API correctly, with the exception that you might get a panic (because you misused the API).
This "exception" being an undesirable program state and the fact that the API can not guarantee its own correct use but allows for degrees of freedom that can lead to such a state makes it more dangerous in my book.
> My issue with `try_notify_handler` is the fact that there is no way for you to know whether your event will be accepted before generating said event.
We could combine both worlds by keeping `poll_ready_notify_handler` as an indication that most likely `notify_handler` will be able to accept the event, but in case it cannot (because of API misuse) we can at least return the event:

```rust
pub fn poll_ready_notify_handler(&mut self, cx: &mut Context) -> Poll<()>;
pub fn notify_handler(&mut self, event: I, cx: &mut Context) -> Option<I>;
```
> We could combine both worlds by keeping `poll_ready_notify_handler` as an indication that most likely `notify_handler` will be able to accept the event, but in case it cannot (because of API misuse) we can at least return the event:
>
>     pub fn poll_ready_notify_handler(&mut self, cx: &mut Context) -> Poll<()>;
>     pub fn notify_handler(&mut self, event: I, cx: &mut Context) -> Option<I>;
I would be ok with that as well, and it would make `try_notify_handler` superfluous (`notify_handler` would then internally use `Sender::try_send`, which is used by `Sender::start_send` internally anyway). A sketch of such an implementation follows.
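One way the combined method could look on top of `futures::channel::mpsc::Sender` (a sketch under the assumption of a single `task::Command` variant; the `TaskSender` wrapper is hypothetical):

```rust
use std::task::Context;

enum Command<I> {
    NotifyHandler(I),
}

struct TaskSender<I> {
    sender: futures::channel::mpsc::Sender<Command<I>>,
}

impl<I> TaskSender<I> {
    /// Delivers the event to the background task, or hands it back if
    /// the channel is not ready.
    pub fn notify_handler(&mut self, event: I, cx: &mut Context<'_>) -> Option<I> {
        if self.sender.poll_ready(cx).is_pending() {
            return Some(event);
        }
        match self.sender.try_send(Command::NotifyHandler(event)) {
            Ok(()) => None,
            // On a full channel, the event is recovered from the error.
            Err(e) => match e.into_inner() {
                Command::NotifyHandler(event) => Some(event),
            },
        }
    }
}
```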
Small API simplifications and code deduplication. Some more useful debug logging.
The test is repeatedly failing.
Which test do you mean? They all look green right now. Relatedly, my proposal for the substrate integration is now also ready for review (but obviously not for merging): paritytech/substrate#5066.
Hah, I restarted the tests once again right after posting my comment, and now it's green.
I see, I can look at the ipfs-kad example again, but as a first guess I would probably blame IPFS nodes. It may help if we increase the acceptable maximum payload size of Kademlia requests / responses from 4KiB to 8KiB. I remember often seeing failures due to that size limit, sometimes even from the first node. There may be other issues as well, of course.
That would also be my guess.
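For reference, raising that limit might look as follows (a sketch; `set_max_packet_size` is assumed here to be the relevant `KademliaConfig` hook, so check the current API for the exact name):

```rust
use libp2p_kad::KademliaConfig;

// Sketch: raise the Kademlia payload limit from 4 KiB to 8 KiB.
fn kad_config() -> KademliaConfig {
    let mut cfg = KademliaConfig::default();
    cfg.set_max_packet_size(8 * 1024); // the default discussed above is 4 KiB
    cfg
}
```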
* Fix regression w.r.t. reporting of dial errors.

  PR [1440] introduced a regression w.r.t. the reporting of dial errors. In particular, if a connection attempt fails due to an invalid remote peer ID, any remaining addresses for the same peer would not be tried (intentional), but the dial failure would not be reported to the behaviour, causing e.g. libp2p-kad queries to potentially stall.

  In hindsight, I figured it is better to preserve the previous behaviour of still trying alternative addresses of the peer even on invalid peer ID errors on an earlier address, in particular because in the context of libp2p-kad it is not uncommon for peers to report localhost addresses while the local node actually has e.g. an ipfs node running on that address, obviously with a different peer ID. This is the scenario causing frequent invalid peer ID (mismatch) errors when running the ipfs-kad example.

  This commit thus restores the previous behaviour w.r.t. trying all remaining addresses on invalid peer ID errors, as well as making sure `inject_dial_error` is always called when the last attempt failed.

  [1440]: #1440

* Remove an fmt::Debug requirement.
* Permit concurrent dialing attempts per peer.

  This is a follow-up to #1440 and relates to #925. This change permits multiple dialing attempts per peer. Note though that `libp2p-swarm` does not yet make use of this ability, retaining the current behaviour.

  The essence of the changes is that the `Peer` API now provides `Peer::dial()`, regardless of the state the peer is in. A dialing attempt is always made up of one or more addresses tried sequentially, as before, but now there can be multiple dialing attempts per peer. A configurable per-peer limit for outgoing connections, and thus concurrent dialing attempts, is also included.

* Introduce `DialError` in `libp2p-swarm`.

  For a cleaner API and to treat the case of no addresses for a peer as an error, such that a `NetworkBehaviourAction::DialPeer` request is always matched up with either `inject_connection_established` or `inject_dial_error`.

* Fix rustdoc link.

* Add `DialPeerCondition::Always` (sketched below).

* Adapt to master.

* Update changelog.
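As a rough illustration of the dialing condition (a sketch; the generic parameters of `NetworkBehaviourAction` and the exact variant shape are assumptions based on this follow-up PR):

```rust
use libp2p_core::PeerId;
use libp2p_swarm::{DialPeerCondition, NetworkBehaviourAction};

// A behaviour asks the swarm to dial a peer, stating under which condition
// the request should be acted upon. `DialPeerCondition::Always` permits a
// new dialing attempt even for an already-connected peer.
fn dial<TInEvent, TOutEvent>(peer_id: PeerId) -> NetworkBehaviourAction<TInEvent, TOutEvent> {
    NetworkBehaviourAction::DialPeer {
        peer_id,
        condition: DialPeerCondition::Disconnected, // or `NotDialing`, `Always`
    }
}
```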
Motivation
This is a proposal for #912. Instead of trying to enforce a single connection per peer, which involves quite a bit of additional complexity, e.g. to prioritise simultaneously opened connections, and can have other undesirable consequences (e.g. paritytech/substrate#4272) besides potentially prohibiting other valid use-cases where multiple connections per peer are desired, this PR makes multiple connections per peer a feature in `libp2p-core`.

Overview
The concept of a "node" with an implicit 1-1 correspondence to a connection has been replaced with the "first-class" concept of a "connection". The code from `src/nodes` has moved (with varying degrees of modification) to `src/connection` and `src/network`. A `HandledNode` has become a `Connection`, a `NodeHandler` a `ConnectionHandler`, the `CollectionStream` was the basis for the new `connection::Pool`, and so forth.

Conceptually, a `Network` contains a `connection::Pool` which in turn internally employs the `connection::Manager` for handling the background `connection::manager::Task`s, one per connection, as before. These are all considered implementation details. On the public API, `Peer`s are managed as before through the `Network`, except now the API has changed with the shift of focus to (potentially multiple) connections per peer. The `NetworkEvent`s have accordingly also undergone changes.
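This nesting could be pictured roughly as follows (a conceptual sketch only; fields and generic parameters are elided, and these are not the actual definitions):

```rust
// Conceptual sketch of how the new types nest.
struct Network {
    pool: Pool, // `connection::Pool`: all pending/established connections
}

struct Pool {
    manager: Manager, // `connection::Manager`: drives the background tasks
}

struct Manager {
    tasks: Vec<Task>, // one `connection::manager::Task` per connection
}

struct Task; // runs a `Connection` and its `ConnectionHandler`
```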
Backward Compatibility
The API of `libp2p-core` obviously has changed significantly in detail and semantics, though not so much in structure (there is still the `Network` emitting events and `Network::peer()` for managing peers, and almost all existing functions have been preserved, though often with changed semantics due to the nature of permitting multiple connections).

The API of `libp2p-swarm` should be backward compatible. There is the subtle semantic change that a single `NetworkBehaviour` can now be associated with many `ProtocolsHandler` instances for the same peer due to multiple connections, ~~but unless an implementation makes some hard assumptions about such a 1-1 correspondence between peers and `ProtocolsHandler` instances, which seems unlikely, existing implementations should remain largely unaffected~~ (EDIT: This is incorrect and needs to be addressed. See this comment). `libp2p-swarm` still does not by itself create multiple connections per peer (i.e. a `DialPeer` received from a behaviour only results in a connection attempt if the peer is currently disconnected, whereas `libp2p-core` now also allows additional `connect()`s for already connected peers); it may just occur as a result of simultaneous dialing. Lastly, `inject_connected` and `inject_disconnected` are only called when the first connection is established, respectively when the last connection to a peer is lost, to preserve the existing semantics (a bookkeeping sketch for connection-aware behaviours follows below). `inject_replaced` is no longer called, since connections are no longer replaced, and has thus been deprecated.
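Since `inject_connected`/`inject_disconnected` now only mark the first and last connection, a behaviour that cares about individual connections has to count them itself. A minimal sketch of such bookkeeping (the `PeerConnections` helper and the `String` stand-in for `PeerId` are hypothetical):

```rust
use std::collections::HashMap;

// Stand-in for `libp2p_core::PeerId`, to keep the sketch self-contained.
type PeerId = String;

#[derive(Default)]
struct PeerConnections {
    counts: HashMap<PeerId, usize>,
}

impl PeerConnections {
    /// Records a new connection; returns `true` if it is the first one,
    /// i.e. the case in which `inject_connected` would fire.
    fn connected(&mut self, peer: PeerId) -> bool {
        let n = self.counts.entry(peer).or_insert(0);
        *n += 1;
        *n == 1
    }

    /// Records a closed connection; returns `true` if it was the last one,
    /// i.e. the case in which `inject_disconnected` would fire.
    fn disconnected(&mut self, peer: &PeerId) -> bool {
        let last = match self.counts.get_mut(peer) {
            Some(n) => {
                *n -= 1;
                *n == 0
            }
            None => false,
        };
        if last {
            self.counts.remove(peer);
        }
        last
    }
}
```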
Dialing Concurrency

While these changes permit multiple established connections per peer, what I have preserved is the constraint to a single pending outgoing (i.e. "dialing") connection per peer at a time. I think that is still desirable, especially because, while dialing is in progress, new addresses to try can be added, e.g. as was and still is done by `libp2p-swarm`.
Additions

* Besides the existing `incoming_limit`, there is now also a configurable outgoing limit (pending outgoing connections to any peer) and a configurable limit for the established connections per peer. These limits are enforced by the connection `Pool`, i.e. the incoming limit is no longer checked on every invocation of `Network::poll()` to determine if the listeners stream should be polled, but rather at the time when connections are added to the pool and whenever a pending connection is established. The defaults are no limits for any of these options. Furthermore, these are currently hard limits, i.e. there is no throttling based on delays and no additional queuing involved. (A sketch of the three limits follows after this list.)
* A `NetworkInfo` obtained by `Network::info()`, containing some basic connection statistics from the connection pool.
* A `NetworkConfig` for configuring the task executor as well as the connection pool limits.
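The three limits and their defaults, as an illustrative sketch (the `Limits` struct here is not the real API; the actual configuration goes through setters on `NetworkConfig`):

```rust
// Illustrative only; not the actual `NetworkConfig` definition.
struct Limits {
    pending_incoming: Option<usize>,     // the existing `incoming_limit`
    pending_outgoing: Option<usize>,     // new: pending outgoing connections
    established_per_peer: Option<usize>, // new: established connections per peer
}

impl Default for Limits {
    fn default() -> Self {
        // `None` means "no limit"; all three default to unlimited.
        Limits {
            pending_incoming: None,
            pending_outgoing: None,
            established_per_peer: None,
        }
    }
}
```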
How to Review

It seems most sensible to first look through the new / adapted modules in `src/connection/**` and `src/connection.rs`, followed by `src/network/**` and `src/network.rs`, and lastly to look at the actual diffs for `libp2p-swarm`. A lot of the code and modules in `src/connection` and `src/network` will look familiar, as they stem from the previous `src/nodes` modules.

Remaining Work
At this point I'm only planning to go over the documentation again, think about new tests, and verify with the current polkadot/substrate (e.g. that it indeed addresses paritytech/substrate#4272), but there are no major changes on my backlog other than those that may result from feedback here or from issues encountered in further testing. Hence I'm opening this now for review and general feedback.