fix: drop connection when stream ends unexpectedly #262

olizilla · 2018-05-30T22:48:47Z

Pull streams pass true in the error position when the sream ends.
In https://github.com/multiformats/js-multistream-select/blob/5b19358b91850b528b3f93babd60d63ddcf56a99/src/select.js#L18-L21
...we're getting lots of instances of pull-length-prefixed stream
erroring early with true and it's passed back up to the dialer
in

js-libp2p-switch/src/dial.js

Lines 238 to 241 in fef2d11

    
           msDialer.handle(connection, (err) => { 
        
             if (err) { 
        
               return callback(new Error('multistream not supported')) 
        
             }

The _createMuxedConnection contains an assumption that any error
that occurs when trying _attemptMuxerUpgrade is ok, and keeps the
relveant baseConnecton in the cache. If the pull-stream has ended
unexpectedly then keeping the connection arround starts causing
the "already piped" errors when we try and use the it later.

This PR adds a guard to avoid putting the connection back into the
cache if the stream has ended.

There is related work in an old PR to add a check for exactly this issue in
pull-length-prefixed dignifiedquire/pull-length-prefixed#8
...but it's still open, so this PR adds a check for true in
the error position at the site where the "already piped" errors
were appearing. Once the PR on pull-length-prefixed is merged this
check can be removed. It's not ideal to have it in this code as it
is far removed from the source, but it fixes the issue for now.

Arguably anywhere that msDialer.handle is called should do the
same check, but we're not seeing this error occur anywhere else so
to keep this PR small, I've left it as the minimal changeset to
fix the issue.

Of note, we had to add '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star'
to the swarm config to trigger the "already piped" errors. There
is a minimal test app here https://github.com/tableflip/js-ipfs-already-piped-error

Manual testing shows ~50 streams fail in the first 2 mins of
running a node, and then things stabalise with ~90 active muxed
connections after that.

Fixes #235
Fixes ipfs/js-ipfs#1366
See dignifiedquire/pull-length-prefixed#8

License: MIT
Signed-off-by: Oli Evans oli@tableflip.io

Pull streams pass true in the error position when the sream ends. In https://github.com/multiformats/js-multistream-select/blob/5b19358b91850b528b3f93babd60d63ddcf56a99/src/select.js#L18-L21 ...we're getting lots of instances of pull-length-prefixed stream erroring early with `true` and it's passed back up to the dialer in https://github.com/libp2p/js-libp2p-switch/blob/fef2d11850379a4720bb9c736236a81a067dc901/src/dial.js#L238-L241 The `_createMuxedConnection` contains an assumption that any error that occurs when trying `_attemptMuxerUpgrade` is ok, and keeps the relveant baseConnecton in the cache. If the pull-stream has ended unexpectedly then keeping the connection arround starts causing the "already piped" errors when we try and use the it later. This PR adds a guard to avoid putting the connection back into the cache if the stream has ended. There is related work in an old PR to add a check for exactly this issue in pull-length-prefixed dignifiedquire/pull-length-prefixed#8 ...but it's still open, so this PR adds a check for true in the error position at the site where the "already piped" errors were appearing. Once the PR on pull-length-prefixed is merged this check can be removed. It's not ideal to have it in this code as it is far removed from the source, but it fixes the issue for now. Arguably anywhere that `msDialer.handle` is called should do the same check, but we're not seeing this error occur anywhere else so to keep this PR small, I've left it as the minimal changeset to fix the issue. Of note, we had to add '/dns4/ws-star.discovery.libp2p.io/tcp/443/wss/p2p-websocket-star' to the swarm config to trigger the "already piped" errors. There is a minimal test app here https://github.com/tableflip/js-ipfs-already-piped-error Manual testing shows ~50 streams fail in the first 2 mins of running a node, and then things stabalise with ~90 active muxed connections after that. Fixes #235 Fixes ipfs/js-ipfs#1366 See dignifiedquire/pull-length-prefixed#8 License: MIT Signed-off-by: Oli Evans <oli@tableflip.io>

olizilla · 2018-05-30T22:52:49Z

This should have a test, but it's late and I wanted to surface that there is a fix, and get some advice on how to go about testing it.

Also, this PR removes an un-needed waterfall with only a single task in it from _createMuxedConnection, because @alanshaw wanted it so.

jacobheun · 2018-05-31T00:19:30Z

This looks good. I will think about the testing options a bit more, as it's late here as well. I am running your test app locally and that looks good. My initial thoughts are that this is going to be difficult to test, without significant changes, in switch's current state. I am leaning towards releasing this without them as the changes and risk are small, and there will be a clear reduction in already piped errors.

The work to convert the underlying connections and switch itself into state machines, https://github.com/libp2p/js-libp2p-switch/issues/24, should make this much easier to catch, handle and test these types of errors.

olizilla · 2018-05-31T07:58:22Z

Agreed. It'd be more valuable to refactor js-multistream-select and add a smoke test for js-ipfs that just does what the demo app does... let ipfs run for a few minutes in a real browser and see if any errors occur. It's not a tidy unit test, but it would have picked this up before the 0.29 release.

jacobheun · 2018-05-31T13:28:57Z

@olizilla after running your example more I ran into the error again. After looking at the stack trace I think this is indeed due to the other multistream dialers not handling the true error.

I made an update to the code locally and am testing it against your example. It adds two helper methods handleSafe and selectSafe that check for the error, and those are used in place of the existing dialer.handle and dialer.select logic. I'm doing some final testing now and can push that up shortly for you to take a look at.

jacobheun · 2018-05-31T14:16:13Z

@olizilla here is the commit I am testing against, af27db9. I didn't want to push directly to your branch, but I can cheery pick that over if it looks good to you.

There are a bit more changes, but I haven't encountered the error yet and am getting quite a few Connection dropped logs. Looking at the logs it's also interesting to see there appear to be 4 particular addresses that are problematic.

olizilla · 2018-05-31T15:17:12Z

@jacobheun feel free to push to this branch. I didn't see errors occurring from the other usages, I'd have done similar to what you have.

I think keeping the "ok errors" separate from the "error errors" is a good move and a reasonable fix here for now, so I'm all +1 on those changes.

More generally, I'm not familiar enough with the purpose of each layer to say but it feels like the the number of errors that can be safely ignored would be finite, so we could switch things so that only a known set of errors are tested against to suggest that the connection is still ok even though a transition failed. That or MultistreamDialer should take responsibility for signalling the difference between "temporary errors" and "failed connections" to the caller. These are just musings at this point, I'm fine with fixing this at the switch layer for now, but we should follow up with @diasdavid about what (if anything) should happen to the api for MultistreamDialer to simplify this situation for calling code.

coveralls · 2018-05-31T15:34:27Z

Pull Request Test Coverage Report for Build 865

0 of 0 changed or added relevant lines in 0 files are covered.
13 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.4%) to 89.889%

Files with Coverage Reduction	New Missed Lines	%
../src/dial.js	13	88.29%

Totals
Change from base Build 858:	-0.4%
Covered Lines:	583
Relevant Lines:	622

💛 - Coveralls

jacobheun · 2018-05-31T15:56:44Z

Yeah, I think this is something that needs to be solved in the multistream. I would expect handle and select to catch and return errors in the callback, rather than those errors being thrown.

While this fix should help mitigate the error, we're likely to still encounter it until the underlying problem is resolved.

ghost assigned olizilla May 30, 2018

ghost added the in progress label May 30, 2018

olizilla requested a review from jacobheun May 30, 2018 22:52

This was referenced May 31, 2018

Slow response on new files AND Seeing "already piped" errors ipfs/js-ipfs#1156

Closed

feat: swap quick-lru by hashlru #259

Merged

fix: add utility methods to prevent already piped error

d7428e5

ghost assigned jacobheun May 31, 2018

jacobheun approved these changes May 31, 2018

View reviewed changes

jacobheun merged commit 1eb28ca into master May 31, 2018

ghost removed the in progress label May 31, 2018

jacobheun deleted the fix/already-piped branch May 31, 2018 15:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: drop connection when stream ends unexpectedly #262

fix: drop connection when stream ends unexpectedly #262

olizilla commented May 30, 2018

olizilla commented May 30, 2018

jacobheun commented May 31, 2018

olizilla commented May 31, 2018

jacobheun commented May 31, 2018

jacobheun commented May 31, 2018

olizilla commented May 31, 2018

coveralls commented May 31, 2018

jacobheun commented May 31, 2018

	msDialer.handle(connection, (err) => {
	if (err) {
	return callback(new Error('multistream not supported'))
	}

fix: drop connection when stream ends unexpectedly #262

fix: drop connection when stream ends unexpectedly #262

Conversation

olizilla commented May 30, 2018

olizilla commented May 30, 2018

jacobheun commented May 31, 2018

olizilla commented May 31, 2018

jacobheun commented May 31, 2018

jacobheun commented May 31, 2018

olizilla commented May 31, 2018

coveralls commented May 31, 2018

Pull Request Test Coverage Report for Build 865

💛 - Coveralls

jacobheun commented May 31, 2018