Add fake out requests system in peers.rs #2369

Merged: 3 commits into paritytech:main, Jun 14, 2022

Conversation

@tomaka (Contributor) commented Jun 10, 2022

Fix #2361

This PR solves a tricky problem in peers.rs.

A connection can be in three phases: handshaking, established, or shutting down. (A connection that is completely shut down doesn't count as a fourth phase, because once a connection is shut down it no longer exists.)
Once a connection starts its shutdown phase, it is forbidden to start new requests on it, but incoming notifications can still arrive.
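
To make this concrete, here is a minimal Rust sketch of these rules; the enum and method names are hypothetical, not the actual connection.rs types:

```rust
// Minimal sketch of the connection phases; names are hypothetical.
enum ConnectionPhase {
    Handshaking,
    Established,
    ShuttingDown,
    // A completely shut-down connection is removed entirely,
    // so it doesn't appear as a fourth phase.
}

impl ConnectionPhase {
    /// New outgoing requests are only allowed on established connections.
    fn can_start_request(&self) -> bool {
        matches!(self, ConnectionPhase::Established)
    }

    /// Incoming notifications can still arrive while shutting down.
    fn can_receive_notification(&self) -> bool {
        matches!(self, ConnectionPhase::Established | ConnectionPhase::ShuttingDown)
    }
}
```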

This is not a problem in connection.rs, because we simply indicate when a connection starts shutting down, and the upper layers of the code are aware that they shouldn't start any new requests but can still receive incoming notifications.

But in peers.rs this is problematic, because peers.rs groups together multiple connections belonging to the same peer and exposes them as just one. It is for example possible to have one connection to a peer shutting down while another connection to the same peer is still established. For this reason, peers.rs doesn't report shutdowns, and instead simply says "we are connected to a peer" or "we are no longer connected to a peer".

But this causes another problem: if you have only one connection left and it is shutting down, should you still tell the upper API layers that you're connected? The answer is yes; otherwise it would be weird to potentially receive incoming notifications from peers you are not connected to.
But if the upper API layers think that you are connected to a peer despite having only one connection that is shutting down, what happens if these upper API layers start new requests? This is the problem that this PR solves.

This PR introduces a fake requests system in peers.rs. When a request is started and all the connections to that peer are shutting down, we create a fake request and later report an event indicating that the request has failed.
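
As an illustration, here is a heavily simplified sketch of that fallback; `Peers`, `pending_fake_failures`, and the other names below are invented for the example and are not the actual peers.rs API:

```rust
type PeerId = u64;
type RequestId = u64;

enum Event {
    RequestFailed { id: RequestId },
}

#[derive(Default)]
struct Peers {
    next_request_id: RequestId,
    // Requests that were "started" while every connection to the target
    // peer was shutting down; they never touch a real connection.
    pending_fake_failures: Vec<RequestId>,
}

impl Peers {
    fn all_connections_shutting_down(&self, _peer: &PeerId) -> bool {
        // The real code inspects the state of every connection belonging
        // to the peer; hard-coded here to keep the sketch short.
        true
    }

    fn start_request(&mut self, peer: PeerId) -> RequestId {
        let id = self.next_request_id;
        self.next_request_id += 1;
        if self.all_connections_shutting_down(&peer) {
            // Every connection is shutting down: no real request can be
            // started, so register a fake one whose failure will be
            // reported by the next call to `next_event`.
            self.pending_fake_failures.push(id);
        }
        // (Otherwise the request would be dispatched on an established
        // connection; omitted here.)
        id
    }

    fn next_event(&mut self) -> Option<Event> {
        // Fake requests fail exactly as if a real request had been started
        // and its connection had then shut down, so the upper layers see a
        // uniform API and never observe the shutdown directly.
        self.pending_fake_failures
            .pop()
            .map(|id| Event::RequestFailed { id })
    }
}
```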

@mergify (bot) left a comment:

Automatically approving tomaka's pull requests. This auto-approval will be removed once more maintainers are active.

@tomaka (Contributor, Author) commented Jun 10, 2022

I've also opened #2370 which is an alternative approach, but #2370 would be longer and more difficult to do, and I might run into a wall.

@github-actions (bot) commented Jun 10, 2022

twiggy diff report

Difference in .wasm size before and after this pull request.


 Delta Bytes │ Item
─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
       +4367 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::next_event::hb6f87d091bf808c8
       -4106 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::next_event::h72e9646e8873adec
       +3098 ┊ smoldot::libp2p::collection::Network<TConn,TNow>::next_event::h63f1dd0af6eee90f
       -3071 ┊ smoldot::libp2p::collection::Network<TConn,TNow>::next_event::h51253087aaabcc13
       -1071 ┊ hashbrown::raw::RawTable<T,A>::reserve_rehash::h4d4e14f2c41649db
        +974 ┊ hashbrown::raw::RawTable<T,A>::reserve_rehash::h5c506daf47dd8686
        -848 ┊ smoldot::libp2p::collection::Network<TConn,TNow>::new::h444a2d3945d47dc7
        +848 ┊ smoldot::libp2p::collection::Network<TConn,TNow>::new::hec02c7b882f29333
        -761 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::unfulfilled_desired_peers::h3c89363d5dbffd79
        +761 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::unfulfilled_desired_peers::hf1b74d0d7f9f5766
        +685 ┊ smoldot::libp2p::collection::Network<TConn,TNow>::insert::h772cdc29f66164b7
        -666 ┊ smoldot::libp2p::collection::Network<TConn,TNow>::insert::h523c30ea0f7bf119
        +618 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::new::hdc36e2f18b61d0f3
        -596 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::new::hfd6825ad9d124fbf
        -582 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::set_peer_notifications_out_desired::h82af476ca3bef8e3
        +579 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::set_peer_notifications_out_desired::hddcf377b4705d8cf
        -504 ┊ <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold::hb586dd1cf9b55e10
        +504 ┊ <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold::hdd97c756676df40e
        +489 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::try_clean_up_peer::h38158f98391a7909
        -489 ┊ smoldot::libp2p::peers::Peers<TConn,TNow>::try_clean_up_peer::h9991d477cd3a02d3
       +4114 ┊ ... and 298 more.
       +6920 ┊ Σ [318 Total Rows]

@@ -11,6 +11,7 @@
- Fix another panic in case of a carefully-crafted LEB128 length. ([#2337](https://github.com/paritytech/smoldot/pull/2337))
- Fix a panic when decoding a block header containing a large number of Aura authorities. ([#2338](https://github.com/paritytech/smoldot/pull/2338))
- Fix multiple panics when decoding network messages in case where these messages were truncated. ([#2340](https://github.com/paritytech/smoldot/pull/2340), [#2355](https://github.com/paritytech/smoldot/pull/2355))
- Fix panic when the Kademlia random discovery process initiates a request on a connection that has just started shutting down. ([#2369](https://github.com/paritytech/smoldot/pull/2369))
@tomaka (Contributor, Author) commented on this diff:

I'm mentioning Kademlia because I've only ever seen this problem happen with the Kademlia discovery. In theory it can happen with other protocols, but in practice it doesn't seem to, probably because, conveniently, we happen to never start any other kind of request during shutdown.

@melekes (Contributor) commented Jun 14, 2022

> if you have only one connection left and it is shutting down, should you still tell the upper API layers that you're connected? The answer is yes; otherwise it would be weird to potentially receive incoming notifications from peers you are not connected to.

It's not immediately obvious to me why the answer is yes. If we've reported that the connection is shutting down in peers.rs and upper layers are aware, then it's okay for them to receive notifications (same as in connection.rs).

> But in peers.rs this is problematic, because peers.rs groups together multiple connections belonging to the same peer and exposes them as just one.

I think we've talked a little about this before, but this bug is again a "reminder" that the underlying design is too complex and buggy; it would be far simpler to have a single connection to a peer instead of peers.rs pretending to have a single connection when in fact there are many.

@melekes (Contributor) left a comment:

👍

@tomaka (Contributor, Author) commented Jun 14, 2022

> If we've reported that the connection is shutting down in peers.rs and upper layers are aware, then it's okay for them to receive notifications (same as in connection.rs).

Well, that's #2370

> it would be far simpler to have a single connection to a peer

This opens us up to many issues and even an attack:

  • Node A and B are connected to each other.
  • Through the random discovery process, node C tells node A that there exists a node D with a certain IP address.
  • Node A tries to connect to node D. It turns out that this IP address connects to node B, and this hypothetical node D is actually B. Node A finds itself with two connections to node B.

So C managed to force A to connect twice to B.

If we add some code that closes one connection when there are two, the question is which connection do you close? The older one or the newer one? Node A knows that the newer connection is "wrong", but node B doesn't know.
Closing the older one opens the door for an attack where C can force-disconnect A and B.
Closing the newer one means that if node A has a sudden disconnect and changes its IP address, then until node B realizes that the connection is unresponsive (which will take ~40 seconds) node A will be unable to reconnect.

The only good solution to that problem is to allow 2 simultaneous connections per peer.
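
For illustration, a hedged sketch of that policy; `MAX_CONNECTIONS_PER_PEER`, `ConnectionSet`, and the surrounding names are hypothetical, not smoldot's actual code:

```rust
use std::collections::HashMap;

// Hypothetical cap implementing the "allow 2 simultaneous connections
// per peer" policy described above.
const MAX_CONNECTIONS_PER_PEER: usize = 2;

type PeerId = u64;
type ConnectionId = u64;

#[derive(Default)]
struct ConnectionSet {
    by_peer: HashMap<PeerId, Vec<ConnectionId>>,
}

impl ConnectionSet {
    /// Accepts a newly-established connection unless the peer already has
    /// two connections. Neither the older nor the newer existing connection
    /// is evicted, which sidesteps both attack scenarios above.
    fn try_insert(&mut self, peer: PeerId, conn: ConnectionId) -> Result<(), ConnectionId> {
        let conns = self.by_peer.entry(peer).or_default();
        if conns.len() >= MAX_CONNECTIONS_PER_PEER {
            // Refuse the extra connection instead of closing an old one.
            return Err(conn);
        }
        conns.push(conn);
        Ok(())
    }
}
```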

See also paritytech/substrate#4272 (comment)

@melekes (Contributor) commented Jun 14, 2022

> It turns out that this IP address connects to node B, and this hypothetical node D is actually B. Node A finds itself with two connections to node B.

Can't A drop the second connection once it finds out that the IP provided by C == B's IP?

@melekes (Contributor) commented Jun 14, 2022

> See also paritytech/substrate#4272 (comment)

Thanks for the link! I didn't know there were issues with unique connections before. It may be that multiple connections are the lesser evil; I was just pointing out that when bugs like #2361 start to appear, it could be a sign of improper design (i.e. bugs that force you to lay additional abstractions like fake requests on top so the code works).

@tomaka (Contributor, Author) commented Jun 14, 2022

> Can't A drop the second connection once it finds out that the IP provided by C == B's IP?

C could also advertise an IP that it controls, and then, when A connects, tunnel the traffic between A and this IP to B, thus making A believe that this IP is actually B. Comparing IPs is not a good way of ensuring that you don't connect to the same peer.

In addition to this, there's also the fact that libp2p has many different ways of connecting to a peer, such as using a relay. Sometimes you don't have an IP address available.

> I was just pointing out that when bugs like #2361 start to appear, it could be a sign of improper design (i.e. bugs that force you to lay additional abstractions like fake requests on top so the code works)

Well, the design is IMO not wrong by itself. The problem here is more that the abstraction of collection.rs and the abstraction of peers.rs are different, which adds complexity, and that I was too lazy to properly "bridge" the two.

This PR right now is about fixing a panic.
After #2352 is merged, I want to clean up the networking code from the ground up and move it towards its stable version.

@tomaka tomaka added the automerge Automatically merge pull request as soon as possible label Jun 14, 2022
@mergify mergify bot merged commit 4472dc1 into paritytech:main Jun 14, 2022
@tomaka tomaka deleted the fix-opened-state-mismatch branch June 14, 2022 09:40
@melekes (Contributor) commented Jun 14, 2022

> C could also advertise an IP that it controls, and then, when A connects, tunnel the traffic between A and this IP to B, thus making A believe that this IP is actually B. Comparing IPs is not a good way of ensuring that you don't connect to the same peer.

That's what peer IDs are for, correct? And the same process (C relaying data) can't be performed with peer IDs because B's certificate contains B's IP, right?

> The problem here is more that the abstraction of collection.rs and the abstraction of peers.rs are different, which adds complexity, and that I was too lazy to properly "bridge" the two.

That makes sense, thanks 🙏

@tomaka (Contributor, Author) commented Jun 14, 2022

> That's what peer IDs are for, correct? And the same process (C relaying data) can't be performed with peer IDs because B's certificate contains B's IP, right?

I don't know which certificates you're mentioning.

The original problem is that C can force A to connect to B. If A was already connected to B, then C will have forced A to open a second connection.
There's also no way for A to guarantee in advance that it will not connect to a peer it is already connected to.

Successfully merging this pull request may close these issues: thread 'tasks-pool-3' panicked at 'called Option::unwrap() on a None value'