Networking simplification #2264
Conversation
This PR isn't completely finished, but the only thing that remains is to update the code of the full node.
Automatically approving tomaka's pull requests. This auto-approval will be removed once more maintainers are active.
twiggy diff report: difference in .wasm size before and after this pull request.
As part of this PR, I've removed the
There is a high chance that this PR introduces some bugs, and I admit that I don't have the courage to chase bugs for days; I'd prefer to merge this so that I can do other networking-related changes on top of it.
@melekes Do you want to review this? You can completely say no (and actually I'd expect you to say no)
man this PR is massive 🙈 I've reviewed some of the code, but not everything. Probably okay to merge. I can review the logic later.
Also, it would be interesting to compare how introducing the global lock affected performance. Could tracing or some other instrument (similar to what `go tool pprof` provides) give insight into lock contention in Rust?
fnv::FnvBuildHasher,
>,

messages_from_connections_tx:
how can one receive messages from a sender? 😐 `messages_to_connections_tx`?
You get the messages from the `messages_from_connections_rx`; not sure I get the question.
@@ -421,29 +319,14 @@ impl NetworkService {
futures_timer::Delay::new(next_discovery).await;
next_discovery = cmp::min(next_discovery * 2, Duration::from_secs(120));

match inner
let mut lock = inner.guarded.lock().await;
Suggested change:
- let mut lock = inner.guarded.lock().await;
+ let mut guarded = inner.guarded.lock().await;
don't you think `lock` is too general? Just by looking at the line below, you probably won't understand what the lock is (because it could represent anything).
Well, `guarded` isn't much better. `guarded` here is just supposed to mean "protected by a mutex".
In the context of this module, there's only one mutex in the entire code, so to me it's not a problem to just call it `lock`.
except one can find a `Guarded` struct above, but there's no `Lock`
I don't think we can actually measure the lock contention at the moment, for what it's worth. As for the light node, it's single-threaded, so it's not measurable there either.
I think I've addressed everything 👍
This big unreviewable PR refactors the networking part of the code, more precisely the layers that coordinate all the TCP/WebSocket connections together. The code concerning individual connections has barely been touched.

The main file that has been modified is `collection.rs`. This file contains a data structure that represents a set of connections.

Before this PR, this data structure was "atomic". All the methods took `&self` as parameter, meaning that multiple methods could be called at the same time, and many of these methods were asynchronous and could be interrupted. If you needed to inject data into a connection, this had to be done through the data structure. This design made the code extremely difficult to understand, because of all the corner cases to handle, most notably around the fact that futures can be cancelled by the user before their completion.

After this PR, this data structure has been split in two: one with the set of connections, and one `ConnectionTask` for each individual connection. The set of connections has a queue of messages destined for the `ConnectionTask`s, and the `ConnectionTask`s have a queue of messages destined for the set. This is exposed in the APIs of these objects, and it is the role of the user to do the message passing. Before this PR, all the locking/multithreading strategy was handled internally by the collection; after this PR it needs to be done by the user.

Thanks to this change in paradigm, the data structures are no longer atomic and are now simple mutable state machines. You have getters and you have methods that modify the state, and that's it. This considerably simplifies the implementation.

In the same vein, `peers.rs` and `service.rs`, which are data structures built on top of `collection.rs`, have been modified in the same way.

This new paradigm is in theory slightly less optimal than the one before. Before this PR, locking was fine-grained: if multiple threads wanted to access the set of connections at the same time, they could each call a method, and if their changes didn't overlap they would actually run at the same time. After this PR, if multiple threads want to access the set, each thread needs to lock a mutex around the entire set.

However, the complexity of the previous implementation, notably around cancellable futures, led to a lot of overhead as well, where for example operations in progress needed to be buffered so that they could be resumed later in case the user interrupted the operation.

Overall I think that this change is more than worth it.