
Failing debug_assert in Request-Response protocol #4773

Open
nathanielc opened this issue Oct 31, 2023 · 5 comments
@nathanielc (Contributor) commented Oct 31, 2023

Summary

I am hitting panics in debug mode at this debug_assert.

While the debug_assert that panics is in request-response code, I do NOT think this is a bug in request-response itself. I took a look at it and it seems to be caused by a combination of factors:

  • Multiple in-flight dial requests must be happening
  • Some of those in-flight dials must get denied

I made a smallish reproducer by hacking the file-sharing example on this branch. I added the mdns behaviour so that I can easily get multiple in-flight dial requests, and the libp2p_connection_limits behaviour so that it's easy to deny some of those requests. I set the max-per-peer limit to 2.
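For reference, here is a minimal sketch of that limit configuration, assuming the `libp2p::connection_limits` API at v0.52 (the exact code lives on the branch above):

```rust
use libp2p::connection_limits::{self, ConnectionLimits};

fn connection_limits_behaviour() -> connection_limits::Behaviour {
    // Allow at most 2 established connections per peer; any further
    // concurrent dial to the same peer is denied with
    // `Exceeded { limit: 2, kind: EstablishedPerPeer }`.
    connection_limits::Behaviour::new(
        ConnectionLimits::default().with_max_established_per_peer(Some(2)),
    )
}
```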

Finally, to reproduce, I run these commands from the examples/file-sharing directory:

Provider

RUST_LOG=info,libp2p_swarm=debug cargo run -- \
    --listen-addresses /ip4/0.0.0.0/tcp/40837 \
    --secret-key-seed 1 provide \
    --path README.md \
    --name README.md

Note that using /ip4/0.0.0.0/ binds to multiple listen addresses; I have 4-5 local interfaces, so I listen on that many addresses. As a result, when mDNS discovers a peer it attempts to dial each address separately, producing that many concurrent dial requests.

Retriever

RUST_LOG=info,libp2p_swarm=debug cargo run --  get --name README.md

It seems that the request-response behaviour is receiving an inconsistent set of events: it receives both a call to handle_established_outbound_connection and a DialFailure swarm event for the same connection id. This breaks the assumptions of the request-response internal state management and triggers the debug_assert failure.
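To illustrate the failure mode, here is a simplified paraphrase of the kind of bookkeeping involved (not the exact libp2p source; plain integers stand in for `PeerId` and `ConnectionId`):

```rust
use std::collections::HashMap;

struct Tracker {
    // One entry per connection the behaviour was told is established.
    connected: HashMap<u64 /* peer id */, Vec<u64 /* connection id */>>,
}

impl Tracker {
    fn on_connection_closed(&mut self, peer: u64, id: u64, remaining_established: u32) {
        let connections = self.connected.get_mut(&peer).expect("peer tracked");
        connections.retain(|c| *c != id);
        // This is the shape of the assert that fails with
        // `left: false, right: true`: the swarm reports zero remaining
        // connections, but the behaviour still holds entries for dials
        // that were reported as established and then denied.
        debug_assert_eq!(connections.is_empty(), remaining_established == 0);
    }
}
```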

Expected behavior

I expect debug_asserts not to panic. Additionally, I expect that a single connection id is either considered failed or established, but not both.

Actual behavior

The debug_assert panics because the request-response logic receives conflicting information about a connection.

Relevant log output

Here are logs from the modified `file-sharing` example run.

Of note is that the `request-response` behaviour receives both a `handle_established_outbound_connection` call and a `DialFailure` event from the swarm for the same connection id. This confuses the request-response internal state and causes the debug_assert failure.


[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 1
[2023-10-31T22:09:57Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/192.168.0.136/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 1.
[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 3
[2023-10-31T22:09:57Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/192.168.12.189/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 2.
[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 4
[2023-10-31T22:09:57Z INFO  libp2p_request_response] on_dial_failure 4 Denied { cause: ConnectionDenied { inner: Exceeded { limit: 2, kind: EstablishedPerPeer } } }
[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 2
[2023-10-31T22:09:57Z INFO  libp2p_request_response] on_dial_failure 2 Denied { cause: ConnectionDenied { inner: Exceeded { limit: 2, kind: EstablishedPerPeer } } }
[2023-10-31T22:10:17Z DEBUG libp2p_swarm] Connection closed with error KeepAliveTimeout: Connected { endpoint: Dialer { address: "/ip4/192.168.0.136/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }, peer_id: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") }; Total (peer): 1.
[2023-10-31T22:10:17Z INFO  libp2p_request_response] on_connection_closed 1 3 1
[2023-10-31T22:10:17Z DEBUG libp2p_swarm] Connection closed with error KeepAliveTimeout: Connected { endpoint: Dialer { address: "/ip4/192.168.12.189/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }, peer_id: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") }; Total (peer): 0.
[2023-10-31T22:10:17Z INFO  libp2p_request_response] on_connection_closed 3 2 0
thread 'async-std/runtime' panicked at /home/nathanielc/projects/rust-libp2p/protocols/request-response/src/lib.rs:640:9:
assertion `left == right` failed
  left: false
 right: true
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at examples/file-sharing/src/network.rs:125:24:
Sender not to be dropped.: Canceled

Possible Solution

One solution is to make request-response more robust to these inconsistent events from the swarm, but that seems like a fix that only hides the real problem.
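For concreteness, one hypothetical shape of that workaround, reusing the simplified `Tracker` from the sketch above (illustrative only; not necessarily what the eventual fix does):

```rust
impl Tracker {
    // If a dial fails for a connection id we already recorded as
    // established, forget the stale entry rather than letting the
    // close-time assert trip later.
    fn on_dial_failure(&mut self, peer: Option<u64>, id: u64) {
        if let Some(peer) = peer {
            if let Some(connections) = self.connected.get_mut(&peer) {
                connections.retain(|c| *c != id);
            }
        }
    }
}
```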

Version

v0.52.4

The reproducer branch is based off the libp2p-v0.52.4 tag.

Would you like to work on fixing this bug?

Yes

@thomaseizinger (Contributor)

We've landed some big internal changes to request-response recently: #3914.

Can you retest with the latest master to see whether that still happens?

@nathanielc (Contributor, Author)

I am still able to reproduce the error on master.

Here is a branch with a reproducer against master.

To reproduce, run from within the examples/file-sharing directory:

RUST_LOG=info,libp2p_swarm=debug cargo run -- \
    --listen-address /ip4/0.0.0.0/tcp/40837 \
    --secret-key-seed 1 provide \
    --path README.md \
    --name README.md

and

RUST_LOG=info,libp2p_swarm=debug cargo run --  get --name README.md

Logs:

[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 3
[2023-11-01T20:37:59Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/172.22.0.1/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 1.
[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 1
[2023-11-01T20:37:59Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/172.17.0.1/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 2.
[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 2
[2023-11-01T20:37:59Z INFO  libp2p_request_response] on_dial_failure 2
[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 4
[2023-11-01T20:37:59Z INFO  libp2p_request_response] on_dial_failure 4

@nathanielc (Contributor, Author)

I took a stab at a fix here: #4777. Open to other ideas for how to fix this.

@elmattic commented Apr 2, 2024

Hi, I am facing the same errors when running our Filecoin Rust client in debug mode (using libp2p-request-response-0.26.1 if that helps).

@thomaseizinger (Contributor)

> Hi, I am facing the same errors when running our Filecoin Rust client in debug mode (using libp2p-request-response-0.26.1 if that helps).

Have you tried re-ordering your behaviours to have all connection management ones come first? See #4777 (comment).
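The derived NetworkBehaviour forwards swarm calls to its fields in declaration order, so connection-management behaviours listed first can deny a pending connection before request-response is told about it. A rough sketch of that ordering (struct and field names illustrative; assumes the mdns, connection-limits, and cbor features are enabled):

```rust
use libp2p::{connection_limits, mdns, request_response, swarm::NetworkBehaviour};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct FileRequest(String);
#[derive(Debug, Serialize, Deserialize)]
struct FileResponse(Vec<u8>);

// Connection-management behaviours first: they can reject a connection
// before `request_response` ever sees it as established.
#[derive(NetworkBehaviour)]
struct Behaviour {
    limits: connection_limits::Behaviour,
    mdns: mdns::async_io::Behaviour,
    request_response: request_response::cbor::Behaviour<FileRequest, FileResponse>,
}
```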
