Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transports/quic: STOP_SENDING fails poll_close #3144

Closed
mxinden opened this issue Nov 18, 2022 · 8 comments · Fixed by #3164
Closed

transports/quic: STOP_SENDING fails poll_close #3144

mxinden opened this issue Nov 18, 2022 · 8 comments · Fixed by #3164
Assignees
Labels
bug priority:important The changes needed are critical for libp2p, or are blocking another project
Milestone

Comments

@mxinden
Copy link
Member

mxinden commented Nov 18, 2022

Summary

Receiving a STOP_SENDING makes poll_close fail. This is problematic in libp2p-identify where a node expects to be able to close its write side before reading.

Expected behaviour

libp2p-identify should successfully exchange identify information.

Actual behaviour

Say that node A and B connected. A opens a libp2p-identify stream to B. A expects B to send its identify information.

After having negotiated the libp2p-identify protocol via multistream-select:

  • B sends its identify information and closes its write side.

    framed_io.send(message).await?;
    framed_io.close().await?;

    B then drops the stream. Dropping the stream results in libp2p-quic sending a STOP_SENDING.
    impl Drop for Substream {
    fn drop(&mut self) {
    let mut state = self.state.lock();
    state.substreams.remove(&self.id);
    let _ = state.connection.recv_stream(self.id).stop(0u32.into());

  • In parallel A closes its write side.

    socket.close().await?;

    This results in Substream::poll_close being called in libp2p-quic.
    fn poll_close(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
    let mut inner = self.state.lock();
    if inner.unchecked_substream_state(self.id).is_write_closed {
    return Poll::Ready(Ok(()));
    }
    match inner.connection.send_stream(self.id).finish() {
    Ok(()) => {
    let substream_state = inner.unchecked_substream_state(self.id);
    substream_state.finished_waker = Some(cx.waker().clone());
    Poll::Pending
    }
    Err(err @ quinn_proto::FinishError::Stopped(_)) => {
    Poll::Ready(Err(io::Error::new(io::ErrorKind::ConnectionReset, err)))
    }

    In case the STOP_SENDING from B already arived, poll_close fails and thus the identify handshake fails.

Possible Solution

  1. Don't send a STOP_SENDING on Drop.
    • That would be unfortunate, the remote might send data which will never be received.
  2. Don't send a STOP_SENDING on DROP IFF the remote already closed the write side.
  3. Ignore the STOP_SENDING error in poll_close.
    • That would give a false sense to upstream users. While they think closing was successful and thus assume (one can never be sure) that ones data made it to the remote, there is actually a high likelihood that remote never received the data.
  4. Don't close the write side in libp2p-identify before reading the remote's identify information.
    • I would expect libp2p-identify to be an edge-case here, i.e. I would expect other protocols not to run into this issue. E.g. in a standard request response style protocol A would send a request, then close, then B would read, then B writes, then B closes and only then B drops.

I am tending towards (2) AND (4).

Version

  • libp2p version (version number, commit, or branch):

I am using 055636e on #2712 but any commit on master should work. E.g. 0c85839.

Would you like to work on fixing this bug?

Maybe. Let's find consensus on a good solution first.

//CC @elenaf9

@mxinden mxinden added the bug label Nov 18, 2022
@mxinden
Copy link
Member Author

mxinden commented Nov 18, 2022

@MarcoPolo this reminds me of a conversation we had around libp2p/specs#473 and stream close behavior related to the ping protocol. You might have an opinion here.

@mxinden mxinden added the priority:important The changes needed are critical for libp2p, or are blocking another project label Nov 18, 2022
@thomaseizinger
Copy link
Contributor

Can we replicate this problem with a failing test in the stream muxer test harness?

@mxinden
Copy link
Member Author

mxinden commented Nov 19, 2022

Can we replicate this problem with a failing test in the stream muxer test harness?

Good idea. I will take a look.

Don't close the write side in libp2p-identify before reading the remote's identify information.

This might not work when using multistream-select with V1Lazy as the multistream-select payload might not yet been flushed and thus not yet send to the remote. This might only happen with the close and the thus implied flush. Instead of closing first, we could only flush in libp2p-identify, though that would imply that upper layer protocols (libp2p-identify) need to care about lower layer details (multistream-select V1Lazy). Another alternative, not violating these layers, would be for V1Lazy to send AND flush, not just send (i.e. moving the below code snipped below the flush after the send).

Version::V1Lazy => {
log::debug!("Dialer: Expecting proposed protocol: {}", p);
let hl = HeaderLine::from(Version::V1Lazy);
let io = Negotiated::expecting(io.into_reader(), p, Some(hl));
return Poll::Ready(Ok((protocol, io)));
}

@elenaf9 elenaf9 self-assigned this Nov 22, 2022
@elenaf9
Copy link
Contributor

elenaf9 commented Nov 23, 2022

Thanks for the detailed report and proposing solutions @mxinden!

I was able to reproduce this bug with this test: https://github.com/elenaf9/rust-libp2p/blob/826f3f6ae3bf6ad0a4df16776e110b23fa12fced/transports/quic/tests/smoke.rs#L215-L230 (not using the stream muxer harness right now but happy to add it there in a different PR).

Don't send a STOP_SENDING on DROP IFF the remote already closed the write side.

But isn't the problem that we drop the stream before the remote closes the stream?
I agree that it makes sense to not send a STOP_SENDING if the remote already closed the stream, but I think this is orthogonal to this issue. Am I missing something?

This might not work when using multistream-select with V1Lazy as the multistream-select payload might not yet been flushed and thus not yet send to the remote. This might only happen with the close and the thus implied flush. Instead of closing first, we could only flush in libp2p-identify

Note that we don't have "flushing" on QUIC substreams. Stream data is send asap. Close then only waits for the ACKs for sent data.
I am not familiar with V1Lazy. Do we need the guarantee that the remote received the data before the behaviour protocol can use the stream?


I am currently looking into this, trying to think of alternative solutions.

@mxinden did you witness this happening in the real world or only on local tests?

@mxinden
Copy link
Member Author

mxinden commented Nov 23, 2022

I was able to reproduce this bug with this test: https://github.com/elenaf9/rust-libp2p/blob/826f3f6ae3bf6ad0a4df16776e110b23fa12fced/transports/quic/tests/smoke.rs#L215-L230 (not using the stream muxer harness right now but happy to add it there in a different PR).

Thank you!

Don't send a STOP_SENDING on DROP IFF the remote already closed the write side.

But isn't the problem that we drop the stream before the remote closes the stream? I agree that it makes sense to not send a STOP_SENDING if the remote already closed the stream, but I think this is orthogonal to this issue. Am I missing something?

Agreed. This is orthogonal. (In my case the two were racing each other.)

I am currently looking into this, trying to think of alternative solutions.

@mxinden did you witness this happening in the real world or only on local tests?

I connected to kademlia-exporter.max-inden.de/ (QUIC server) via my laptop using https://github.com/mxinden/libp2p-lookup (QUIC client). I would consider this a real world scenario and would expect others to see similar behavior when using libp2p-quic.

This might not work when using multistream-select with V1Lazy as the multistream-select payload might not yet been flushed and thus not yet send to the remote. This might only happen with the close and the thus implied flush. Instead of closing first, we could only flush in libp2p-identify

Note that we don't have "flushing" on QUIC substreams. Stream data is send asap. Close then only waits for the ACKs for sent data.

Ah, good point. Never mind then. (I think still relevant for non-QUIC streams.)

I am not familiar with V1Lazy. Do we need the guarantee that the remote received the data before the behaviour protocol can use the stream?

No we don't. Problem today on flushable-stream-muxers is, that the multistream-select message might never be sent, (i.e. stuck in a buffer) in case the local node never writes application data on the stream. But as you noted above, not relevant for the libp2p-quic case.

@thomaseizinger
Copy link
Contributor

Thanks for the detailed report and proposing solutions @mxinden!

I was able to reproduce this bug with this test: https://github.com/elenaf9/rust-libp2p/blob/826f3f6ae3bf6ad0a4df16776e110b23fa12fced/transports/quic/tests/smoke.rs#L215-L230 (not using the stream muxer harness right now but happy to add it there in a different PR).

Thanks and yes please :)

@mxinden
Copy link
Member Author

mxinden commented Nov 24, 2022

I'd like to get #3144 fixed before releasing QUIC. I am currently working on it. But I doesn't have to be a blocker.

#3163 (comment)

Are you still exploring potential fixes, or do you already have a fix in mind?

@elenaf9
Copy link
Contributor

elenaf9 commented Nov 24, 2022

I'd like to get #3144 fixed before releasing QUIC. I am currently working on it. But I doesn't have to be a blocker.

#3163 (comment)

Are you still exploring potential fixes, or do you already have a fix in mind?

I think I have it fixed; writing tests right now to confirm. But your proposed solution 2. Don't send a STOP_SENDING on DROP IFF the remote already closed the write side. may be the fix after all. Will explain in the PR why (if the tests pass now).

@mergify mergify bot closed this as completed in #3164 Nov 25, 2022
mergify bot pushed a commit that referenced this issue Nov 25, 2022
- Only send `STOP_SENDING` on a stream when dropping it if the remote did not finish the stream yet.
- Only call `quinn_proto::SendStream::finish` once. (A second call to it will always fail. Though I don't think this was the issue in #3144.)
- Add tests for reading and writing to streams after the remote dropped. Also adds a smoke test for backpressure.

Fixes #3144.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug priority:important The changes needed are critical for libp2p, or are blocking another project
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants