Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

libp2p-next #5278

Merged
merged 9 commits into from
Apr 8, 2020
Merged

libp2p-next #5278

merged 9 commits into from
Apr 8, 2020

Conversation

romanb
Copy link
Contributor

@romanb romanb commented Mar 17, 2020

Supersedes #5066. This PR upgrades substrate to libp2p-0.17. The following are the main libp2p changes which are visible in substrate:

libp2p/rust-libp2p#1440 and libp2p/rust-libp2p#1519

Overview

As described in the libp2p PRs, the underlying changes are primarily in libp2p-core. Realistically, a second connection to the same peer only occurs if two peers connect to each other "at the same time". As a side-effect, existing connections are also no longer closed in favour of new ones, which addresses #4272.

Details

  1. The enabled/disabled status is the same across all connections, as decided by the peerset manager.
  2. send_packet and write_notification always send all data over the same connection to preserve the ordering provided by the transport, as long as that connection is open. If it closes, a second open connection may take over, if one exists, but that case should be no different than a single connection failing and being re-established in terms of potential reordering and dropped messages. Messages can be received on any connection and thus two peers which happened to connect to each other simultaneously may each a different connection for sending data.
  3. The behaviour reports GenericProtoOut::CustomProtocolOpen when the first connection reports NotifsHandlerOut::Open.
  4. The behaviour reports GenericProtoOut::CustomProtocolClosed when the last connection reports NotifsHandlerOut::Closed.

In this way, the number of actual established connections to the peer is an implementation detail of the network behaviours. As mentioned before, in practice and at the time of this writing, there may be at most two connections to a peer and only as a result of simultaneous dialing. The network service even configures a hard limit of 2 connections per peer. However, the implementation in principle accommodates for any number of connections.

Noteworthy

During intermediate testing with the (by default disabled) integration tests test_consensus, test_sync and test_connectivity it was revealed that when run in release mode these tests were very often failing, with the common symptom that the last node to start in a round of testing would often see no other peers (i.e. empty DHT routing table) and thus make no progress while all the others keep on running, causing the tests to time out waiting for the problematic peer to reach a certain state. The tests are mainly using add_reserved_peer on the network to initialise the topology, however, add_reserved_peer ultimately results in a call to add_known_peer on the DiscoveryBehaviour which did not actually add that address to the Kademlia routing table, though it adds it to the user_defined peers which, when passed in the constructor of the behaviour, are added to the Kademlia routing table. I thus changed add_known_peer to also add the given address to the Kademlia routing table and that resolved the issues with these integration tests and the test_connectivity test seems to run notably faster (release mode). My current guess is that the tests were so far unknowingly relying on a timing assumption w.r.t. the initial discovery / connection setup and DHT queries in order for all peers to find each other, in particular when simultaneous connections attempts are in play, as often happens in release mode. Ultimately, the change of letting add_known_peer add the given address to the Kademlia routing table may be a patch worth extracting separately, because it does look like an oversight to me.

libp2p/rust-libp2p#1472

Insubstantial changes to adapt to the new APIs.

libp2p/rust-libp2p#1467

Insubstantial changes to add now necessary feature flags where there is a dependency on libp2p with default-features = false.

@romanb
Copy link
Contributor Author

romanb commented Mar 17, 2020

CC @tomaka, @twittner, @mxinden - Feel free to review at your leisure. This is obviously not mergeable until the next libp2p release but you may already want to skim the PR description.

@gavofyork gavofyork added the A3-in_progress Pull request is in progress. No review needed at this stage. label Mar 18, 2020
@romanb romanb force-pushed the libp2p-next branch 2 times, most recently from 59f7df3 to e3d6406 Compare March 18, 2020 13:47
@tomaka
Copy link
Contributor

tomaka commented Apr 2, 2020

I opened paritytech/polkadot#963, which should fix the Polkadot compilation.

Comment on lines +314 to +318
struct SpawnImpl<F>(F);
impl<F: Fn(Pin<Box<dyn Future<Output = ()> + Send>>)> Executor for SpawnImpl<F> {
fn exec(&self, f: Pin<Box<dyn Future<Output = ()> + Send>>) {
(self.0)(f)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this here from libp2p-core because I forgot to re-expose SwarmBuilder::executor_fn before 0.17, there is only SwarmBuilder::executor, and I wanted to avoid another libp2p release just for that.

@romanb romanb marked this pull request as ready for review April 3, 2020 15:31
@romanb romanb added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Apr 3, 2020
Copy link
Contributor

@tomaka tomaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, this looks good to me other than a couple of suggestions. As someone who has been battling through changes and bugs in generic_proto/behaviour.rs, this change is a bit scary, but I also can't spot anything wrong in the code.

Before merging, however, it would be great to deploy it on a validator node and checks its logs and graphs.

client/network/src/service.rs Outdated Show resolved Hide resolved
client/network/src/discovery.rs Outdated Show resolved Hide resolved
client/network/src/service.rs Outdated Show resolved Hide resolved
client/network/src/protocol/generic_proto/behaviour.rs Outdated Show resolved Hide resolved
@tomaka
Copy link
Contributor

tomaka commented Apr 6, 2020

Would you mind merging master, so that we can test Polkadot with it?

@romanb
Copy link
Contributor Author

romanb commented Apr 6, 2020

Would you mind merging master, so that we can test Polkadot with it?

Just did so. For what it is worth, I already ran polkadot with it a few weeks ago, just to check that #4272 is resolved with these changes, which seemed to be the case.

@gnunicorn gnunicorn added this to the 2.0 milestone Apr 6, 2020
@tomaka
Copy link
Contributor

tomaka commented Apr 6, 2020

Deployed both on my Google Cloud instance (with incoming connections) around 3pm and on kusama-validator-ew4-0 around 5pm.

There are a couple warnings (notably about reserved nodes getting disconnected), but that was already the case before. In particular, comparing on Kibana the logs before and after this PR, the rate of all warnings seems exactly the same.
Notably there's no log message about a protocols handler enabled/disabled multiple times.

In Grafana everything seems normal as well. The number of GrandPa notifications emitted has increased, but that is expected due to #5520.

So everything looks good to me so far. I suggest to take a look tomorrow morning again and if there's nothing weird happening, we merge this.

@tomaka
Copy link
Contributor

tomaka commented Apr 7, 2020

Well, the test is kind of inconclusive.
It seems that Parity's logs server has crashed during the night, and my Google Cloud node stopped shortly after I posted the previous message without me realizing 🤦‍♂️ (probably because I closed the SSH connection, forgetting that this would close the node (I'm not a devops)).

On Grafana, however, everything looks normal.

@tomaka
Copy link
Contributor

tomaka commented Apr 7, 2020

I'd be in favour of merging.
@twittner?

@tomaka tomaka added A8-mergeoncegreen and removed A0-please_review Pull request needs code review. A1-onice labels Apr 7, 2020
@tomaka
Copy link
Contributor

tomaka commented Apr 7, 2020

Needs a merge/rebase over master.

@tomaka
Copy link
Contributor

tomaka commented Apr 7, 2020

I think the test is failing because of the unused import warning.

warning: unused import: channel::mpsc
--> client/network/src/service.rs:39:27
|
39 | use futures::{prelude::*, channel::mpsc};
| ^^^^^^^^^^^^^
|

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants