
Fix Chrome datachannel stuck at closing #238

Merged · 1 commit · Oct 15, 2022
Conversation

jerry-tao
Member

Description

When closing a data channel from Chrome, SCTP should respond with an outgoingResetRequest; otherwise Chrome won't fully close the data channel.
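The behavior this PR adds can be sketched as a tiny state machine (a simplified model; the type, constant, and method names below are hypothetical, not the actual pion/sctp API): receiving the remote's reset must trigger our own outgoing reset, and the stream only fully closes once that outgoing reset is acknowledged.

```go
package main

import "fmt"

// Hypothetical, simplified model of the stream teardown handshake;
// the real pion/sctp types and names differ.

type streamState int

const (
	stateOpen         streamState = iota
	stateInboundReset // remote has reset its outgoing direction
	stateClosed       // both directions reset
)

type stream struct {
	id    uint16
	state streamState
}

// handleIncomingReset applies the remote's OutgoingResetRequest to our
// stream and reports whether we must answer with our own outgoing
// reset. Without that answer (the bug this PR fixes), the remote side
// never leaves the "closing" state.
func (s *stream) handleIncomingReset() (sendOutgoingReset bool) {
	if s.state != stateOpen {
		return false
	}
	s.state = stateInboundReset
	return true
}

// handleOutgoingResetAck completes the teardown once the peer
// acknowledges our outgoing reset.
func (s *stream) handleOutgoingResetAck() {
	if s.state == stateInboundReset {
		s.state = stateClosed
	}
}

func main() {
	s := &stream{id: 1}
	fmt.Println(s.handleIncomingReset()) // true: we must respond
	s.handleOutgoingResetAck()
	fmt.Println(s.state == stateClosed) // true: channel fully closed
}
```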

Reference issue

Fixes #187

@codecov

codecov bot commented Sep 19, 2022

Codecov Report

Base: 80.86% // Head: 81.04% // Project coverage increased by +0.17% 🎉

Coverage data is based on head (e549ed0) compared to base (d0b7cf3).
Patch coverage: 96.82% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #238      +/-   ##
==========================================
+ Coverage   80.86%   81.04%   +0.17%     
==========================================
  Files          48       48              
  Lines        3962     3994      +32     
==========================================
+ Hits         3204     3237      +33     
+ Misses        619      618       -1     
  Partials      139      139              
Flag | Coverage Δ
go   | 81.04% <96.82%> (+0.17%) ⬆️
wasm | 68.05% <95.23%> (+1.21%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files  | Coverage Δ
stream.go       | 97.20% <96.42%> (+0.32%) ⬆️
association.go  | 84.79% <100.00%> (+0.10%) ⬆️


☔ View full report at Codecov.

@enobufs (Member) left a comment

@jerry-tao Thanks for this PR. You are right about the missing implementation of the outgoing reset on receipt of the same from the remote (thanks for the RFC reference!). I left one comment and hope you can address it. Thank you!

association.go: review comment marked outdated and resolved.
@jerry-tao
Member Author

@enobufs I'm not sure how to fix the test; after the change there will be one more pending reconfig chunk (the outgoingReset).

		// Verify there's no more pending reconfig
		assoc.lock.RLock()
		pendingReconfigs := len(assoc.reconfigs)
		assoc.lock.RUnlock()
		assert.Equal(t, 0, pendingReconfigs, "should be zero")

But this is the data channel spec, not SCTP; I'm not clear on how an SCTP stream should close.

@enobufs
Member

enobufs commented Oct 1, 2022

@jerry-tao Thanks for following up on the error. It's not immediately obvious to me how to resolve it. Let me see what I can find this weekend.

@jerry-tao
Member Author

@enobufs While testing the fix, I found a problem: if the client creates and closes data channels quickly, the server won't handle it and throws this error:

sctp ERROR: 2022/10/08 18:19:56 [0xc0001aa380] stream 3 not found)

It occurs in both the fixed and unfixed versions, but in the unfixed version only the closed data channels produce the error.

Simple Client Code

    for (let i = 0; i < 5; i++) {
      const dc = pc.createDataChannel("dc " + i);
      id += 2
      dc.addEventListener("open", () => {
        dc.send(req);
      });
      await delay(1000);
      dc.close();
    }

    const dc = pc.createDataChannel("dc last");
    dc.addEventListener("closing", () => {
      console.log("closing");
    });
    dc.addEventListener("open", () => {
      console.log("opened");
      console.log("send message", req);
      dc.send(req);
    });
    dc.addEventListener("message", (e) => {
      console.log("got message");
    });

In the unfixed version, the first streams (IDs 1, 3, 5, 7, 9) each log:

sctp ERROR: 2022/10/08 18:19:56 [0xc0001aa380] stream 3 not found)

But dc last receives its response correctly.

In the fixed version, dc last doesn't get the response, just:

sctp ERROR: 2022/10/08 18:19:56 [0xc0001aa380] stream 3 not found)

I think it's because when the data channel is closed, the stream is only reset, not removed.

@enobufs
Member

enobufs commented Oct 9, 2022

@jerry-tao
I was testing with a browser app, using Chrome this time, creating 6 channels, each of which, like your example, sends one message, waits for a second, then closes the data channel. I am seeing all messages received (my app does a loopback), then a close event on each data channel. The interesting thing is that the close event was received without this pull request.

    const test1 = (ep) => {
      let sess = new Session(ep);
      sess.on('dcopen', (dc) => {
        console.log(`new datachannel ${dc.label}`)
        dc.send(`Hello from ${dc.label}`)
        setTimeout(() => {
          console.log(`closing datachannel ${dc.label}`)
          dc.close()
        }, 1000)
      })
      sess.open(6); // open 6 data channels
    };

My app initiates dc.close() from the browser (just like your example); it is never initiated by the peer using pion. Looking at the pion/datachannel implementation, I noticed that what you pointed out in #187 appears to have been implemented there.
See: https://github.com/pion/datachannel/blob/v1.5.2/datachannel.go#L188-L192

When the Chrome side calls dc.close(), it sends a Reconfig to the peer, which causes stream.ReadSCTP to return io.EOF. The above code then calls stream.Close(), which conforms to what the RFC says in the section 6.7 you pointed out.

Now, the question is why you were not seeing the 'close' event on Chrome. Could you please review your test code?

@enobufs
Member

enobufs commented Oct 9, 2022

Here's my test code in case it helps.
pion-server.zip

@jerry-tao
Member Author

I may have fixed it in the wrong way.
The last time I checked, when it gets to the close at https://github.com/pion/sctp/blob/master/stream.go#L281, readErr is already io.EOF, so the reset won't be sent by the server.

@jerry-tao
Member Author

jerry-tao commented Oct 9, 2022

Here is my test code to reproduce the closing state: ice-tcp.zip
ice-udp.zip
Both TCP and UDP get stuck at closing.

@jerry-tao
Member Author

jerry-tao commented Oct 9, 2022

Here's my test code in case it helps. pion-server.zip

I ran your test code; chrome://webrtc-internals/ shows all data channels stuck at closing too.

@enobufs
Member

enobufs commented Oct 9, 2022

I ran your test code; chrome://webrtc-internals/ shows all data channels stuck at closing too.

You are right. I misread my console logs. The 'close' event never fired before the fix. Digging deeper now.

@jerry-tao
Member Author

I found the problem with stream not found; it is hard to reproduce on a loopback network. When a data channel is closed, any unsent data still in the buffer is not cleared.

  1. If the data channel is closing/closed, the server keeps reporting stream not found until it times out.
  2. If the stream ID gets reused, the server sends the old data, or nothing.

With or without this PR, problem 1 is the same.
Without this PR, the stream ID only gets reused after the client considers the reconfig timed out, and the server then sends it the buffered old data.
With this PR, the stream ID gets reused quickly, but the client doesn't receive anything.

@enobufs
Member

enobufs commented Oct 10, 2022

I found the problem with stream not found; it is hard to reproduce on a loopback network.

Thanks @jerry-tao. I have not seen that error log yet. I will try to reproduce it on my end as well.

@jerry-tao
Member Author

jerry-tao commented Oct 10, 2022

You could put a time.Sleep(time.Second) before https://github.com/pion/sctp/blob/master/association.go#L559, and remove the setInterval in your test code, like:

        const test1 = (ep) => {
            let sess = new Session(ep);
            sess.on('dcopen', (dc) => {
                console.log(`new datachannel ${dc.label}`)
                dc.send(`Hello from ${dc.label}`)
                dc.close()
            })
            sess.open(6); // open 6 data channels
        };

Send, then close; that will produce the error.

@enobufs
Member

enobufs commented Oct 10, 2022

@jerry-tao OK, I will try that. Thank you! I think I know what is going on. I believe we are removing a stream from the map a.streams too early. I can see there would be a case where, at delete(a.streams, streamIdentifier), there may still be DATA chunks in the send buffer. I will first try to repro, then find a workaround. I really appreciate you being diligent about this troubleshooting. The quality of your work is phenomenal!

@enobufs
Member

enobufs commented Oct 10, 2022

Hi @jerry-tao, Chrome does not seem to work as we expect, either.
I modified my pion-server code as you suggested, closing the data channel right after sending a message on Chrome. The server (pion) side logs showed sctp ERROR: 2022/10/10 11:30:46 [0xc000326540] stream 3 not found), and the message appears to have been dropped, as you reported. What I noticed, though, is that there were other messages that did not cause this error, meaning those messages were sent on the wire back to Chrome, yet nothing appears to have been received in Chrome's console log.

I wrote a script to do the same between two browsers (please find the attachment). The situation is even worse than with pion, because the server-side readyState goes to closing before receiving a message. When you try to call dc.send() in the closing state, it throws an exception:

Uncaught DOMException: Failed to execute 'send' on 'RTCDataChannel': RTCDataChannel.readyState is not 'open'

Firefox is even worse... the server-side data channel is left open (it never transitions to 'closing' or 'closed' as far as I can see).
Safari's behavior is identical to that of Chrome.

index.zip

Could you please review my browser script, and let me know what you think?

So, I believe the reason data appears to be dropped at the server (pion) side is not the stream x not found error, but the browsers not being able to receive data on a data channel that is in the closing state. In fact, the stream x not found message comes from the method checkPartialReliabilityStatus(), which performs the partial reliability check to determine whether a chunk should be abandoned. I am quite confident the corresponding packet was sent back to the client.

@jerry-tao
Member Author

For sending in the closing state: I think it happens when someone calls WriteSCTP while the stream is not open, so I think this is expected.
For the stream not found in checkPartialReliabilityStatus: it's called from https://github.com/pion/sctp/blob/d0b7cf3963e99337bdf09e11989cf8b0916e1179/association.go#L2052, but as you can see, even if checkPartialReliabilityStatus fails, the chunk is still moved to the inflightQueue.
I think there are two problems for now:

  • When a stream receives a reconfig, we should clear that stream's chunks left in the pendingQueue, since nobody will receive them and they should not get a TSN, and reject further WriteSCTP requests by returning an error, like Chrome does.
  • I tried clearing them by dropping them in popPendingDataChunksToSend, but when Chrome reuses the stream ID it fails to receive new messages; I'm digging into this right now.
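The first bullet could be sketched roughly like this (a hypothetical, simplified queue; the real pion/sctp pendingQueue tracks ordered/unordered chunks and byte counts, so this is an illustration of the proposal, not the actual fix):

```go
package main

import "fmt"

// Hypothetical, simplified sketch of dropping a reset stream's chunks
// from the pending queue, as proposed above.
type chunk struct {
	streamID uint16
	payload  []byte
}

// dropStreamChunks removes every pending chunk that belongs to the
// stream that was just reset; chunks for other streams are kept.
func dropStreamChunks(pending []chunk, resetID uint16) []chunk {
	kept := pending[:0]
	for _, c := range pending {
		if c.streamID != resetID {
			kept = append(kept, c) // unrelated streams keep their data
		}
	}
	return kept
}

func main() {
	q := []chunk{{streamID: 1}, {streamID: 3}, {streamID: 1}}
	fmt.Println(len(dropStreamChunks(q, 1))) // only stream 3's chunk remains
}
```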

@enobufs
Member

enobufs commented Oct 11, 2022

@jerry-tao
I think we should tackle the stream x not found issue in a separate issue thread because:

  • The issue of being stuck at closing with Chrome is resolved (Firefox does not seem to conform to RFC 8831, however).
  • RFC 8831 does not make it clear what to do with pending data in the send buffer after the inbound stream is reset (discussion below), and the fix may not be trivial.
  • We should avoid deviating the code from HEAD too much (take small steps).

As you said, it's a good idea to return an error from WriteSCTP when the stream state is not open (the PR to your fork, issue-187-rev2, has the state for this).

nobody will receive it

This is not entirely true. The SCTP association is still there to handle those chunks. It's just that RFC 8831 does not make it clear whether it intends to deliver data received during the closing state to the app level. (Chrome does not seem to receive them, but that does not mean Chrome implements it correctly.) DCEP (RFC 8831) makes use of RFC 6525 (Reconfig) as a way to close a stream. OutgoingResetRequest, as its name implies, is similar to shutdown(SHUT_WR), TCP's "graceful shutdown", where you can still receive data from the socket while the outgoing direction is closed, until read() returns EOF...
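The half-close analogy can be demonstrated with plain TCP in Go, where CloseWrite() plays the role of shutdown(SHUT_WR): the client stops sending, yet still reads the server's reply until EOF.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// halfCloseEcho demonstrates the shutdown(SHUT_WR)-style "graceful
// shutdown" described above: the client closes only its write side,
// yet still reads the server's reply until EOF.
func halfCloseEcho(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		data, _ := io.ReadAll(conn)                   // read until client half-closes
		conn.Write(append([]byte("echo: "), data...)) // reply on the still-open direction
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	// Half-close: no more outgoing data, analogous to sending an
	// OutgoingResetRequest for our direction of the stream.
	if err := conn.(*net.TCPConn).CloseWrite(); err != nil {
		return "", err
	}
	reply, err := io.ReadAll(conn) // the inbound direction still delivers data
	return string(reply), err
}

func main() {
	reply, err := halfCloseEcho("hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(reply) // prints "echo: hello"
}
```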

@jerry-tao
Member Author

Yes, I agree with you that there are two problems. But I don't think we can solve them separately.
Without this PR, Chrome gets stuck at closing, so the stream ID won't get reused; in a way, this hides the pending chunks problem.
If we merge this PR, Chrome will reuse the stream ID quickly, and during my test, once the stream ID gets reused, the new data channel won't receive anything. That makes things worse.
I understand the differences between browsers are very annoying, as is the unclear RFC, but still, being compatible with Chrome is the best idea so far.
I'm still debugging it; I hope to fix it by the end of this week.

@enobufs
Member

enobufs commented Oct 11, 2022

Ok. Let's see what we can find by next week.

Chrome will reuse the stream id quickly, and during my test if the stream id get reused, the new datachannel won't receive anything.

Interesting. How do you cause this "reuse" of stream ID by Chrome? (how can I repro it?)

@jerry-tao
Member Author

If we try to send data after the channel was closed by Chrome, Chrome throws the error:

Unchecked runtime.lastError: A listener indicated an asynchronous response by returning true, but the message channel closed before a response was received

Interesting. How do you cause this "reuse" of stream ID by Chrome? (how can I repro it?)

You could use my test code and log the dc object in the open event handler; you will find they have the same id:

    for (let i = 0; i < 5; i++) {
      const dc = pc.createDataChannel("dc " + i);
      id += 2
      dc.addEventListener("open", () => {
        console.log(dc);
        dc.send(req);
      });
      await delay(1000);
      dc.close();
    }

    const dc = pc.createDataChannel("dc last");
    dc.addEventListener("closing", () => {
      console.log("closing");
    });
    dc.addEventListener("open", () => {
      console.log("opened");
      console.log(dc);
      console.log("send message", req);
      dc.send(req);
    });
    dc.addEventListener("message", (e) => {
      console.log("got message");
    });

Ok. Let's see what we can find by next week.

I have successfully fixed it with a workaround; you can check it at https://github.com/jerry-tao/sctp/tree/wip. It works as expected but may not fulfill the RFC.
And weirdly, it worked on my original commit but not on your PR; I will dig deeper.

@jerry-tao
Member Author

The wip branch is just my rough workaround; it worked fine and as expected in my test. Maybe we can find another way to do it.
I'm sorry I haven't checked your PR yet; I will review it later and see how I can put them together and commit to this PR.

@enobufs
Member

enobufs commented Oct 11, 2022

Hi @jerry-tao, I have invited you to the pion GitHub organization as a member. Once you accept, you should be able to create branches directly for future pull requests. The GitHub Actions auto tests should kick in without needing approval! ;)

FWIW, here's my analysis of logs on a spreadsheet from pion (with my modified pion-server), which sends 10 messages on a data channel, then calls dc.close() immediately (with a 5 msec delay). I see a couple of 'stream 1 not found' errors, but all 10 messages were echoed back to Chrome, as you can see from the SACKs (selective acks). On the Chrome-side script, though, only the first 3 messages were received (7 were dropped, or ignored by Chrome for some reason). Also, I did not see any new channels created by Chrome. I used the issue-187-rev2 branch. FYI.

Dev console on Chrome:
(screenshot: Screen Shot 2022-10-11 at 2 55 43 AM)

I will try your test code next time.

@jerry-tao
Member Author

jerry-tao commented Oct 12, 2022

I tried testing it in reverse; I think Chrome stops sending buffered data once it receives a reconfig.
Here is my test code:

            channel.addEventListener("closing", () => {
                console.log("closing");
            });
            channel.addEventListener("close", () => {
                console.log("closed")
            })
            channel.onerror = (e)=>{
                console.log("error",e)
            }
            // 16k data
            let buf = new ArrayBuffer(1024 * 16);
            channel.addEventListener("open", () => {
                console.log("opened");
                // totally 16mb
                for (let i = 0; i < 1024; i++) {
                    channel.send(buf);
                }
                console.log("js done")
            });
            channel.addEventListener("message", (e) => {
                console.log("got message");
            });

When I received a message, I closed the channel:

func(msg webrtc.DataChannelMessage, channel *webrtc.DataChannel) {
		channel.Close()
}

Chrome stops sending and transitions to the closed state.

@jerry-tao
Member Author

And I found this:

void SctpDataChannel::OnClosingProcedureStartedRemotely(int sid) {
  RTC_DCHECK_RUN_ON(signaling_thread_);
  if (sid == config_.id && state_ != kClosing && state_ != kClosed) {
    // Don't bother sending queued data since the side that initiated the
    // closure wouldn't receive it anyway. See crbug.com/559394 for a lengthy
    // discussion about this.
    queued_send_data_.Clear();
    queued_control_data_.Clear();
    // Just need to change state to kClosing, SctpTransport will handle the
    // rest of the closing procedure and OnClosingProcedureComplete will be
    // called later.
    started_closing_procedure_ = true;
    SetState(kClosing);
  }
}

@enobufs
Member

enobufs commented Oct 12, 2022

@jerry-tao
I am a bit concerned about where this is going, and whether we are aligned.

First of all, I think we should let the datachannel (DCEP) layer close the outgoing stream on EOF. RFC 8831 is not about the SCTP layer. This has been fixed, along with a unit test, in this PR: jerry-tao#2

I don't think removing data chunks from the pending queue is necessary. RFC 6525 indicates all data written before the OutgoingResetRequest should be delivered. The method a.sendResetRequest() pushes an empty DATA chunk as an indicator of EOS (End of Stream) to ensure that. (The method does not immediately create an OutgoingResetRequest.)
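The EOS-marker mechanism described here can be sketched like this (hypothetical simplified types; the real a.sendResetRequest() operates on pion/sctp's internal pending queue, so this only illustrates the idea):

```go
package main

import "fmt"

// Hypothetical, simplified model of the EOS marker described above.
type chunk struct {
	streamID    uint16
	payload     []byte
	endOfStream bool // zero-length marker queued by Close()
}

// enqueueEOS appends the end-of-stream marker behind any data already
// written, so everything written before Close() still drains first.
func enqueueEOS(pending []chunk, streamID uint16) []chunk {
	return append(pending, chunk{streamID: streamID, endOfStream: true})
}

// nextToSend pops the head of the queue; when the popped chunk is the
// EOS marker, it signals that the OutgoingResetRequest for the stream
// may now be created.
func nextToSend(pending []chunk) (rest []chunk, resetStream bool, resetID uint16) {
	head := pending[0]
	return pending[1:], head.endOfStream, head.streamID
}

func main() {
	q := []chunk{{streamID: 1, payload: []byte("data")}}
	q = enqueueEOS(q, 1)

	q, reset, _ := nextToSend(q)  // the data chunk drains first
	fmt.Println(reset)            // false
	_, reset, id := nextToSend(q) // then the marker triggers the reset
	fmt.Println(reset, id)        // true 1
}
```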

Also...

Chrome will reuse the stream id quickly, and during my test if the stream id get reused, the new datachannel won't receive anything.

I still don't know exactly what the problem is. I followed your code, which creates a new data channel right after closing the previous one, repeating 5 times. I can see the stream ID being reused, and I don't see any problem with it. In each iteration (as soon as the new data channel becomes available), one message is sent and successfully echoed (by pion) back to Chrome with no problem. I can see all data channels in the 'closed' state at the end. I tried both 03f5520 and pion:issue-187-rev2.

I think we should step back a little and try to come up with code with which we both see the same problem. If you could share complete code that demonstrates the problem you are seeing, that would be a great start. I highly recommend we both base our work on pion:issue-187-rev2.

@jerry-tao
Member Author

In our use case, the problem is that we send a huge amount of data through the data channel, and it gets put in the pendingQueue.
If we wait for all the data in the pendingQueue to be sent, it takes very long and wastes a lot of network. New data won't be sent since it is all in one queue, which I thought was caused by the fix; now I see they are unrelated.
I agree that can be fixed in the future. I will check jerry-tao/sctp#2 and merge it ASAP.

@jerry-tao
Member Author

@enobufs I have reviewed your PR at jerry-tao#2 and will work on a new PR for clearing the buffer.

@enobufs
Member

enobufs commented Oct 13, 2022

And I found this:

@jerry-tao Good find! That's exactly how Chrome appears to behave from what I saw.

FYI, I asked a co-author of RFC 8831 (and 6525) some questions about section 6.7. He agrees that from RFC 8831 (based on RFC 6525), it is natural for us to assume that what section 6.7 offers is a "graceful shutdown" like we do with shutdown(socket, SHUT_WR), while the socket is still able to read data from the remote. But looking at the W3C API, he said it felt like the spec followed a different path, and RTCDataChannel.close() is like close(socket) in TCP, which closes both outbound and inbound...

The W3C API doc, though, has this text in 6.2.4 Closing procedure:

Finish sending all currently pending messages of the channel.

But then, the quote in your finding:

    // Don't bother sending queued data since the side that initiated the
    // closure wouldn't receive it anyway. See crbug.com/559394 for a lengthy
    // discussion about this.

This may be true if W3C's RTCDataChannel.close() is meant to close both directions. That is, once we have received an OutgoingResetRequest from the peer, we know the peer no longer wants to receive anything, so we could discard its pending data (not yet in flight) when it sends its OutgoingResetRequest. (NOTE: we would discard the pending data only when we have already received an OutgoingResetRequest.)

If you'd like to follow through on your work on this, I am OK with this direction, and I'd be happy to review your code. (I have added a couple of comments to pion/sctp: State Transition.) Please let me know your thoughts.

@jerry-tao
Member Author

Now that we know they are unrelated issues, I think we can finish this PR first (still working on the lint). I opened a new issue, #239, to discuss the pendingBuffer.

@jerry-tao
Member Author

@enobufs BTW, do you prefer that I squash this PR to a single commit or leave the commit history?

@enobufs
Member

enobufs commented Oct 14, 2022

@enobufs BTW, do you prefer that I squash this PR to a single commit or leave the commit history?

Sure > squash

@enobufs (Member) left a comment

Thanks @jerry-tao for your great work. Thanks also for adding the state check in the WriteSCTP() method. It looks good too.

Development

Successfully merging this pull request may close these issues.

Data channels do not fully close when close initiated from Chrome