
fix: don't fail when finish is called on a stopped stream #1699

Closed
wants to merge 7 commits

Conversation

thomaseizinger
Contributor

thomaseizinger commented Oct 30, 2023

At the moment, calling finish on a stream that has been stopped by the remote fails. This can lead to weird race conditions in application protocols. I added a test that showcases this. Depending on where you place the send_stream.finish in the client, the test either succeeds or fails. In other words, if the recv_stream at the opposite side is still alive, calling finish will work but if it is not, finish will fail.

In a real-world setting, we cannot influence how fast the remote is in processing our data and/or when it drops (i.e. stops) the stream. I think quinn should not fail finish if the other party has already stopped the stream. Instead, we should assume that they've read all the data they expected and thus no longer need the stream.

Unfortunately, the only workaround for this issue is to not call finish at all. Whilst that may be fine for some use cases, I'd argue that it is not very clean and finish should in fact be idempotent.

To fix this issue, we check whether all the data we sent has been acknowledged and, if so, exit finish gracefully, pretending that the remote acked our FIN.
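
A minimal sketch of the race, using a quinn 0.10-style API (illustrative; names, sizes, and error handling are not the PR's actual test code):

```rust
use quinn::Connection;

// Client side of a request/response protocol.
async fn request(conn: &Connection) -> anyhow::Result<()> {
    let (mut send, recv) = conn.open_bi().await?;
    send.write_all(b"request").await?;

    // The server reads the request, replies, then drops (stops) its
    // RecvStream. 64 KiB is an arbitrary illustrative size limit.
    let _response = recv.read_to_end(64 * 1024).await?;

    // By now a STOP_SENDING frame may already have arrived, so finish()
    // fails with WriteError::Stopped even though every byte we wrote
    // was delivered.
    send.finish().await?;
    Ok(())
}
```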

@thomaseizinger
Contributor Author

I added two commits that make the test pass. Let me know if this fix is okay :)

@djc
Member

djc commented Oct 30, 2023

Seems reasonable to me!

@Ralith
Collaborator

Ralith commented Oct 30, 2023

Thanks for the PR! I'd like to discuss the motivation in some more depth. My knee-jerk reaction is discomfort with the presumptions this makes about the application's interests, and I wonder if your requirements could be addressed with a different pattern at the application layer instead.

if the recv_stream at the opposite side is still alive, calling finish will work but if it is not, finish will fail.

The intention of the current API is that, if you don't want to stop incoming streams, then you always read them to the end, even if application-layer reasoning tells you to expect no further data.
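
For illustration, a sketch of that pattern (the size limit is arbitrary):

```rust
// Drain the stream even when the application expects no further data,
// so the peer's FIN is consumed rather than the stream being stopped
// when our RecvStream is dropped.
let _ = recv.read_to_end(64 * 1024).await?;
```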

Whilst [not calling finish] may be fine for some use cases, I'd argue that it is not very clean

What cases do you feel finish must be called in? My intuition is that you should call it iff you want to know about the stream being stopped, and it seems like here you don't, so why call it at all?

finish should in fact be idempotent.

This seems like a separate concern from whether streams stopped during finish should raise errors.

we check whether all the data we sent has been acknowledged and, if so, exit finish gracefully

Returning an error if and only if we haven't received some ACKs for the stream is extremely fragile. Packet loss or reordering might cause the STOP_SENDING frame to be received before any outstanding ACKs, even if the peer has indeed received everything. This creates a very similar problem to the original motivation, where finish may unpredictably fail, except it cannot be prevented by changing application-layer behavior.

@thomaseizinger
Contributor Author

Thanks for the PR! I'd like to discuss the motivation in some more depth.

Thank you for the comment. I'll try to expand a bit.

Whilst [not calling finish] may be fine for some use cases, I'd argue that it is not very clean

What cases do you feel finish must be called in? My intuition is that you should call it iff you want to know about the stream being stopped, and it seems like here you don't, so why call it at all?

In rust-libp2p, we abstract over many different kinds of streams and QUIC is just one of the possible transports. They all share the AsyncRead + AsyncWrite API and we forward AsyncWrite::close to finish.

The issue we had in libp2p/rust-libp2p#4747 was resolved by not calling close (and thereby not calling finish). This works for now but it is a massive footgun for protocol developers on top of rust-libp2p.

It just doesn't seem correct to fail closing the stream even though everything worked correctly and we just had a timing issue with the remote on who sent what when.

we check whether all the data we sent has been acknowledged and, if so, exit finish gracefully

Returning an error if and only if we haven't received some ACKs for the stream is extremely fragile. Packet loss or reordering might cause the STOP_SENDING frame to be received before any outstanding ACKs, even if the peer has indeed received everything. This creates a very similar problem to the original motivation, where finish may unpredictably fail, except it cannot be prevented by changing application-layer behavior.

I am not sure this is true.

  1. Client sends some data
  2. Server receives data & responds
  3. Client receives response
  4. Client finishes stream

(4) is what I think should not fail even if the remote has already stopped the stream. There cannot be any missing ACKs because the server would not have responded if it didn't receive the entire request from (1). If we did in fact still send some data that the server hasn't acked yet, then closing should fail! In this case however, the two peers don't seem to agree on the protocol because the server apparently did not expect further data to be sent and dropped the stream without continuing to read.

@Ralith
Collaborator

Ralith commented Nov 1, 2023

It just doesn't seem correct to fail closing the stream even though everything worked correctly and we just had a timing issue with the remote on who sent what when.

Could you instead ignore all errors from finish, perhaps unconditionally in your AsyncWrite::close binding? This would also protect you from a peer that closes the entire connection rather than just the stream.
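
For example, a hypothetical binding along these lines (a sketch, not rust-libp2p's actual code):

```rust
// Hypothetical AsyncWrite::close shim: a stopped stream or a closed
// connection still means the data is out of our hands, so report success.
async fn close(send: &mut quinn::SendStream) -> std::io::Result<()> {
    let _ = send.finish().await; // ignore Stopped, ConnectionLost, ...
    Ok(())
}
```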

Note that even a successful finish is not assurance that the remote application layer has seen all data you sent, so ignoring finish errors categorically doesn't put you at increased risk of data loss. Just like in TCP, the only way to know for sure that a peer's application layer has processed your data is with an application-layer ACK.

I am not sure this is true.

Successful receipt of a server's application-layer response does not guarantee successful receipt of all ACKs covering application data that prompted that response. This information may have been carried in separate packets.

For clarity, note that your example illustrates a particular bidirectional request/response pattern. Application protocols routinely employ other patterns, which may have fewer round-trips. For example, if you open, write to, and immediately finish a unidirectional stream, and the peer stops that stream, it is impossible to reliably predict whether or not an error will be raised under this PR's changes, because packets carrying ACK frames can easily be reordered or lost when the STOP_SENDING frame is received.

@thomaseizinger
Contributor Author

thomaseizinger commented Nov 1, 2023

Thank you for the clarification!

It seems that the safe thing to do is to finish the stream as soon as we are done writing, before we start reading data from the remote. That should make this race condition go away.
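
i.e. roughly (a sketch, reusing the illustrative names from above):

```rust
send.write_all(b"request").await?;
send.finish().await?; // finish before reading, so a stop is unlikely to race the FIN
let _response = recv.read_to_end(64 * 1024).await?;
```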

Still, it leaves me with an uneasy feeling that it matters when I close my stream. My argument would be that it shouldn't matter. Perhaps finish should not fail at all and always return ()? Or maybe it should return something other than Result, to avoid users running into the trap of using ? and aborting their entire task even though nothing actually failed?

Perhaps the last part is interesting to debate: Why is it an "error" to finish a stream when the remote has closed it already?

@Ralith
Collaborator

Ralith commented Nov 1, 2023

It seems that the safe thing to do is to finish the stream as soon as we are done writing, before we start reading data from the remote. That should make this race condition go away.

That's a more conventional pattern, though if the peer stops the stream immediately after reading data, and the connection was very low latency, you could still theoretically race.

Perhaps finish should not fail at all and always return ()?

This would be less of a footgun, and it's tempting. Maybe even make it synchronous, and thereby prevent similarly error-prone application sensitivity to transport-layer ACKs. This places it in similar territory to a successful write -- you don't know how far the data's gotten, just that it's out of your hands now.

Our intention was to support application protocols in which STOP_SENDING has greater significance. Because an application can ensure that streams are never stopped in normal operation by always reading streams to the end, a stopped stream could have special meaning, whether to indicate a logic error in the peer, or other application layer significance. An asynchronous, fallible finish allows reliable detection of stops that would otherwise be lost.

Or maybe it should return something other than Result

This is a good thought. On review of the API, I think any form of async, data-bearing SendStream::finish future might actually be redundant to a synchronous, infallible finish followed by waiting on SendStream::stopped if you really want to.
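
As signatures, the shape under discussion might look like this (hypothetical at this stage, not the current API):

```rust
impl SendStream {
    /// Synchronous and infallible: queue the FIN and update local state,
    /// without waiting for (or reporting on) transport-layer ACKs.
    pub fn finish(&mut self) { /* ... */ }

    /// Await the stream's fate separately, only if you care: resolves
    /// with the peer's error code if the stream was stopped.
    pub async fn stopped(&mut self) -> Result<VarInt, StoppedError> { /* ... */ }
}
```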

@djc
Member

djc commented Nov 1, 2023

This is a good thought. On review of the API, I think any form of async, data-bearing SendStream::finish future might actually be redundant to a synchronous, infallible finish followed by waiting on SendStream::stopped if you really want to.

Sounds good!

@thomaseizinger
Contributor Author

Or maybe it should return something other than Result

This is a good thought. On review of the API, I think any form of async, data-bearing SendStream::finish future might actually be redundant to a synchronous, infallible finish followed by waiting on SendStream::stopped if you really want to.

I really like this!

So if I understand you correctly: We essentially design finish to be the same as reset: Synchronous and just changing the state of the underlying stream.

Looking at the API though, I'd have a different question: Why do we expose the UnknownStream error from SendStream::reset and other APIs? Surely that should not be hit when I am calling the function on a SendStream?

@Ralith
Collaborator

Ralith commented Nov 2, 2023

We essentially design finish to be the same as reset: Synchronous and just changing the state of the underlying stream.

Yep!

Surely that should not be hit when I am calling the function on a SendStream?

Per the doc comments, UnknownStream can arise if the stream becomes fully closed, such as after calling finish yourself. Perhaps we should rename it for clarity.

@thomaseizinger
Contributor Author

thomaseizinger commented Nov 2, 2023

Surely that should not be hit when I am calling the function on a SendStream?

Per the doc comments, UnknownStream can arise if the stream becomes fully closed, such as after calling finish yourself. Perhaps we should rename it for clarity.

But didn't we just conclude that such an error should only be returned from awaiting stopped? I guess the question is, should it be a synchronous error to reset a finished stream and vice versa?

What if we say instead:

  • Calling reset and finish just changes local state and may queue a frame to be sent.
  • Calling reset or finish on a stopped stream does nothing.
  • To learn why a stream was stopped, you must await stopped.

This means if I really care that a stream finished gracefully, I first call finish, then await stopped and check that this didn't fail.
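
Roughly (a sketch, assuming the semantics proposed above):

```rust
send.finish();                      // synchronous local state change only
let outcome = send.stopped().await; // resolves once the stream's fate is known
// `outcome` tells us whether the peer stopped the stream (and with which
// code) or it closed some other way; the exact return type is still open.
```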

I don't know the internals that well though. Will calling reset and then stopped give me a stop-code that allows me to check that it was actually reset?

@Ralith
Collaborator

Ralith commented Nov 2, 2023

But didn't we just conclude that such an error should only be returned from awaiting stopped?

To be clear, it's usually a "you're calling methods in a nonsensical order" error. stopped after finish or perhaps reset is the only time when it should arise in a non-buggy program. reset after finish is a corner case.

should it be a synchronous error to reset a finished stream and vice versa?

Maybe it's useful for finish to return an error synchronously if used on a stream that you previously reset, since that indicates a logic error by the caller. Calling reset twice is also a logic error. I like the simplicity of making these cases a no-op, but maybe we should try to make caller bugs obvious here? I could go either way.

Conversely, reset after finish is perfectly sensible (read it as "give up on any retransmits"), and we should probably not return UnknownStream errors there ever, since otherwise the behavior is racy.

Calling reset or finish on a stopped stream does nothing.

It feels a bit weird that finish() would succeed when write(&[]) would fail with Stopped, but I think I'm satisfied that this is the most constructive take regardless. We can think of finish as morally just setting a bit on the most recent preceding write.

await stopped and check that this didn't fail.

For this to work as written, we should probably replace Err(UnknownStream) with an Ok(None) case or similar. I think that's reasonable.
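
e.g. a hypothetical signature:

```rust
/// Ok(Some(code)): the peer stopped the stream with `code`.
/// Ok(None): the stream was fully closed without ever being stopped.
/// Err(_): the remaining genuinely exceptional cases, e.g. connection loss.
pub async fn stopped(&mut self) -> Result<Option<VarInt>, StoppedError> { /* ... */ }
```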

Will calling reset and then stopped give me a stop-code that allows me to check that it was actually reset?

Stream resets are advisory, and a peer that has received a RESET_STREAM frame from you probably won't bother to send STOP_SENDING, since you've already declared intent not to send. A stream is "actually reset" whenever you say it is, and what the peer makes of that is up to them.

@thomaseizinger
Contributor Author

thomaseizinger commented Nov 3, 2023

But didn't we just conclude that such an error should only be returned from awaiting stopped?

To be clear, it's usually a "you're calling methods in a nonsensical order" error.

Perhaps a debug_assert (and a warning?) is appropriate then? If I understand correctly, this can't be triggered by a remote so should be safe? It would be a no-op in production anyway though.
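
e.g. (a sketch; the state flag is hypothetical):

```rust
// Inside finish(): make the caller bug loud in debug builds; log and
// no-op in release builds instead of returning an error.
if self.is_reset {
    debug_assert!(false, "finish() called after reset()");
    tracing::warn!("finish() called after reset()");
    return;
}
```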

We can think of finish as morally just setting a bit on the most recent preceding write.

So what happens when I call finish but there are no more pending writes? Will that create an empty frame and send it? Is that a thing in QUIC? Sorry, I am not too familiar with the details 😬

@Ralith
Collaborator

Ralith commented Nov 3, 2023

So what happens when I call finish but there are no more pending writes? Will that create an empty frame and send it?

Yep, exactly!

@thomaseizinger
Contributor Author

Okay, I might have a stab at this next week :)

djc mentioned this pull request Jan 12, 2024
@Ralith
Collaborator

Ralith commented May 1, 2024

Closing in favor of #1840, where I've implemented something like what we discussed here, since the next release is very close. Feedback welcome!

Ralith closed this May 1, 2024