Skip to content
This repository has been archived by the owner on Aug 23, 2019. It is now read-only.

fix: drop connection when stream ends unexpectedly #262

Merged
merged 2 commits into from
May 31, 2018

Conversation

olizilla
Copy link
Contributor

Pull streams pass true in the error position when the sream ends.
In https://github.com/multiformats/js-multistream-select/blob/5b19358b91850b528b3f93babd60d63ddcf56a99/src/select.js#L18-L21
...we're getting lots of instances of pull-length-prefixed stream
erroring early with true and it's passed back up to the dialer
in

msDialer.handle(connection, (err) => {
if (err) {
return callback(new Error('multistream not supported'))
}

The _createMuxedConnection contains an assumption that any error
that occurs when trying _attemptMuxerUpgrade is ok, and keeps the
relveant baseConnecton in the cache. If the pull-stream has ended
unexpectedly then keeping the connection arround starts causing
the "already piped" errors when we try and use the it later.

This PR adds a guard to avoid putting the connection back into the
cache if the stream has ended.

There is related work in an old PR to add a check for exactly this issue in
pull-length-prefixed dignifiedquire/pull-length-prefixed#8
...but it's still open, so this PR adds a check for true in
the error position at the site where the "already piped" errors
were appearing. Once the PR on pull-length-prefixed is merged this
check can be removed. It's not ideal to have it in this code as it
is far removed from the source, but it fixes the issue for now.

Arguably anywhere that msDialer.handle is called should do the
same check, but we're not seeing this error occur anywhere else so
to keep this PR small, I've left it as the minimal changeset to
fix the issue.

Of note, we had to add '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
to the swarm config to trigger the "already piped" errors. There
is a minimal test app here https://github.com/tableflip/js-ipfs-already-piped-error

Manual testing shows ~50 streams fail in the first 2 mins of
running a node, and then things stabalise with ~90 active muxed
connections after that.

Fixes #235
Fixes ipfs/js-ipfs#1366
See dignifiedquire/pull-length-prefixed#8

License: MIT
Signed-off-by: Oli Evans oli@tableflip.io

Pull streams pass true in the error position when the sream ends.
In https://github.com/multiformats/js-multistream-select/blob/5b19358b91850b528b3f93babd60d63ddcf56a99/src/select.js#L18-L21
...we're getting lots of instances of pull-length-prefixed stream
erroring early with `true` and it's passed back up to the dialer
in https://github.com/libp2p/js-libp2p-switch/blob/fef2d11850379a4720bb9c736236a81a067dc901/src/dial.js#L238-L241

The `_createMuxedConnection` contains an assumption that any error
that occurs when trying `_attemptMuxerUpgrade` is ok, and keeps the
relveant baseConnecton in the cache. If the pull-stream has ended
unexpectedly then keeping the connection arround starts causing
the "already piped" errors when we try and use the it later.

This PR adds a guard to avoid putting the connection back into the
cache if the stream has ended.

There is related work in an old PR to add a check for exactly this issue in
pull-length-prefixed dignifiedquire/pull-length-prefixed#8
...but it's still open, so this PR adds a check for true in
the error position at the site where the "already piped" errors
were appearing. Once the PR on pull-length-prefixed is merged this
check can be removed. It's not ideal to have it in this code as it
is far removed from the source, but it fixes the issue for now.

Arguably anywhere that `msDialer.handle` is called should do the
same check, but we're not seeing this error occur anywhere else so
to keep this PR small, I've left it as the minimal changeset to
fix the issue.

Of note, we had to add '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
to the swarm config to trigger the "already piped" errors. There
is a minimal test app here https://github.com/tableflip/js-ipfs-already-piped-error

Manual testing shows ~50 streams fail in the first 2 mins of
running a node, and then things stabalise with ~90 active muxed
connections after that.

Fixes #235
Fixes ipfs/js-ipfs#1366
See dignifiedquire/pull-length-prefixed#8

License: MIT
Signed-off-by: Oli Evans <oli@tableflip.io>
@ghost ghost assigned olizilla May 30, 2018
@ghost ghost added the in progress label May 30, 2018
@olizilla
Copy link
Contributor Author

This should have a test, but it's late and I wanted to surface that there is a fix, and get some advice on how to go about testing it.

Also, this PR removes an un-needed waterfall with only a single task in it from _createMuxedConnection, because @alanshaw wanted it so.

@olizilla olizilla requested a review from jacobheun May 30, 2018 22:52
@jacobheun
Copy link
Contributor

This looks good. I will think about the testing options a bit more, as it's late here as well. I am running your test app locally and that looks good. My initial thoughts are that this is going to be difficult to test, without significant changes, in switch's current state. I am leaning towards releasing this without them as the changes and risk are small, and there will be a clear reduction in already piped errors.

The work to convert the underlying connections and switch itself into state machines, https://github.com/libp2p/js-libp2p-switch/issues/24, should make this much easier to catch, handle and test these types of errors.

@olizilla
Copy link
Contributor Author

Agreed. It'd be more valuable to refactor js-multistream-select and add a smoke test for js-ipfs that just does what the demo app does... let ipfs run for a few minutes in a real browser and see if any errors occur. It's not a tidy unit test, but it would have picked this up before the 0.29 release.

@jacobheun
Copy link
Contributor

@olizilla after running your example more I ran into the error again. After looking at the stack trace I think this is indeed due to the other multistream dialers not handling the true error.

I made an update to the code locally and am testing it against your example. It adds two helper methods handleSafe and selectSafe that check for the error, and those are used in place of the existing dialer.handle and dialer.select logic. I'm doing some final testing now and can push that up shortly for you to take a look at.

@jacobheun
Copy link
Contributor

@olizilla here is the commit I am testing against, af27db9. I didn't want to push directly to your branch, but I can cheery pick that over if it looks good to you.

There are a bit more changes, but I haven't encountered the error yet and am getting quite a few Connection dropped logs. Looking at the logs it's also interesting to see there appear to be 4 particular addresses that are problematic.

@olizilla
Copy link
Contributor Author

@jacobheun feel free to push to this branch. I didn't see errors occurring from the other usages, I'd have done similar to what you have.

I think keeping the "ok errors" separate from the "error errors" is a good move and a reasonable fix here for now, so I'm all +1 on those changes.

More generally, I'm not familiar enough with the purpose of each layer to say but it feels like the the number of errors that can be safely ignored would be finite, so we could switch things so that only a known set of errors are tested against to suggest that the connection is still ok even though a transition failed. That or MultistreamDialer should take responsibility for signalling the difference between "temporary errors" and "failed connections" to the caller. These are just musings at this point, I'm fine with fixing this at the switch layer for now, but we should follow up with @diasdavid about what (if anything) should happen to the api for MultistreamDialer to simplify this situation for calling code.

@ghost ghost assigned jacobheun May 31, 2018
@coveralls
Copy link

Pull Request Test Coverage Report for Build 865

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 13 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.4%) to 89.889%

Files with Coverage Reduction New Missed Lines %
../src/dial.js 13 88.29%
Totals Coverage Status
Change from base Build 858: -0.4%
Covered Lines: 583
Relevant Lines: 622

💛 - Coveralls

@jacobheun
Copy link
Contributor

Yeah, I think this is something that needs to be solved in the multistream. I would expect handle and select to catch and return errors in the callback, rather than those errors being thrown.

While this fix should help mitigate the error, we're likely to still encounter it until the underlying problem is resolved.

@jacobheun jacobheun merged commit 1eb28ca into master May 31, 2018
@ghost ghost removed the in progress label May 31, 2018
@jacobheun jacobheun deleted the fix/already-piped branch May 31, 2018 15:57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revenge of the Missing peerId and its sidekick, already piped Error "already piped"
3 participants