-
Notifications
You must be signed in to change notification settings - Fork 37
fix: drop connection when stream ends unexpectedly #262
Conversation
Pull streams pass true in the error position when the sream ends. In https://github.com/multiformats/js-multistream-select/blob/5b19358b91850b528b3f93babd60d63ddcf56a99/src/select.js#L18-L21 ...we're getting lots of instances of pull-length-prefixed stream erroring early with `true` and it's passed back up to the dialer in https://github.com/libp2p/js-libp2p-switch/blob/fef2d11850379a4720bb9c736236a81a067dc901/src/dial.js#L238-L241 The `_createMuxedConnection` contains an assumption that any error that occurs when trying `_attemptMuxerUpgrade` is ok, and keeps the relveant baseConnecton in the cache. If the pull-stream has ended unexpectedly then keeping the connection arround starts causing the "already piped" errors when we try and use the it later. This PR adds a guard to avoid putting the connection back into the cache if the stream has ended. There is related work in an old PR to add a check for exactly this issue in pull-length-prefixed dignifiedquire/pull-length-prefixed#8 ...but it's still open, so this PR adds a check for true in the error position at the site where the "already piped" errors were appearing. Once the PR on pull-length-prefixed is merged this check can be removed. It's not ideal to have it in this code as it is far removed from the source, but it fixes the issue for now. Arguably anywhere that `msDialer.handle` is called should do the same check, but we're not seeing this error occur anywhere else so to keep this PR small, I've left it as the minimal changeset to fix the issue. Of note, we had to add '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star' to the swarm config to trigger the "already piped" errors. There is a minimal test app here https://github.com/tableflip/js-ipfs-already-piped-error Manual testing shows ~50 streams fail in the first 2 mins of running a node, and then things stabalise with ~90 active muxed connections after that. Fixes #235 Fixes ipfs/js-ipfs#1366 See dignifiedquire/pull-length-prefixed#8 License: MIT Signed-off-by: Oli Evans <oli@tableflip.io>
This should have a test, but it's late and I wanted to surface that there is a fix, and get some advice on how to go about testing it. Also, this PR removes an un-needed waterfall with only a single task in it from |
This looks good. I will think about the testing options a bit more, as it's late here as well. I am running your test app locally and that looks good. My initial thoughts are that this is going to be difficult to test, without significant changes, in switch's current state. I am leaning towards releasing this without them as the changes and risk are small, and there will be a clear reduction in already piped errors. The work to convert the underlying connections and switch itself into state machines, https://github.com/libp2p/js-libp2p-switch/issues/24, should make this much easier to catch, handle and test these types of errors. |
Agreed. It'd be more valuable to refactor js-multistream-select and add a smoke test for js-ipfs that just does what the demo app does... let ipfs run for a few minutes in a real browser and see if any errors occur. It's not a tidy unit test, but it would have picked this up before the 0.29 release. |
@olizilla after running your example more I ran into the error again. After looking at the stack trace I think this is indeed due to the other multistream dialers not handling the I made an update to the code locally and am testing it against your example. It adds two helper methods |
@olizilla here is the commit I am testing against, af27db9. I didn't want to push directly to your branch, but I can cheery pick that over if it looks good to you. There are a bit more changes, but I haven't encountered the error yet and am getting quite a few |
@jacobheun feel free to push to this branch. I didn't see errors occurring from the other usages, I'd have done similar to what you have. I think keeping the "ok errors" separate from the "error errors" is a good move and a reasonable fix here for now, so I'm all +1 on those changes. More generally, I'm not familiar enough with the purpose of each layer to say but it feels like the the number of errors that can be safely ignored would be finite, so we could switch things so that only a known set of errors are tested against to suggest that the connection is still ok even though a transition failed. That or MultistreamDialer should take responsibility for signalling the difference between "temporary errors" and "failed connections" to the caller. These are just musings at this point, I'm fine with fixing this at the switch layer for now, but we should follow up with @diasdavid about what (if anything) should happen to the api for MultistreamDialer to simplify this situation for calling code. |
Pull Request Test Coverage Report for Build 865
💛 - Coveralls |
Yeah, I think this is something that needs to be solved in the While this fix should help mitigate the error, we're likely to still encounter it until the underlying problem is resolved. |
Pull streams pass true in the error position when the sream ends.
In https://github.com/multiformats/js-multistream-select/blob/5b19358b91850b528b3f93babd60d63ddcf56a99/src/select.js#L18-L21
...we're getting lots of instances of pull-length-prefixed stream
erroring early with
true
and it's passed back up to the dialerin
js-libp2p-switch/src/dial.js
Lines 238 to 241 in fef2d11
The
_createMuxedConnection
contains an assumption that any errorthat occurs when trying
_attemptMuxerUpgrade
is ok, and keeps therelveant baseConnecton in the cache. If the pull-stream has ended
unexpectedly then keeping the connection arround starts causing
the "already piped" errors when we try and use the it later.
This PR adds a guard to avoid putting the connection back into the
cache if the stream has ended.
There is related work in an old PR to add a check for exactly this issue in
pull-length-prefixed dignifiedquire/pull-length-prefixed#8
...but it's still open, so this PR adds a check for true in
the error position at the site where the "already piped" errors
were appearing. Once the PR on pull-length-prefixed is merged this
check can be removed. It's not ideal to have it in this code as it
is far removed from the source, but it fixes the issue for now.
Arguably anywhere that
msDialer.handle
is called should do thesame check, but we're not seeing this error occur anywhere else so
to keep this PR small, I've left it as the minimal changeset to
fix the issue.
Of note, we had to add '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
to the swarm config to trigger the "already piped" errors. There
is a minimal test app here https://github.com/tableflip/js-ipfs-already-piped-error
Manual testing shows ~50 streams fail in the first 2 mins of
running a node, and then things stabalise with ~90 active muxed
connections after that.
Fixes #235
Fixes ipfs/js-ipfs#1366
See dignifiedquire/pull-length-prefixed#8
License: MIT
Signed-off-by: Oli Evans oli@tableflip.io