-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve ODSP socket reuse logic #4483
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
■ @fluidframework/base-host: No change
⯆ @fluid-example/bundle-size-tests: -1.26 KB
Baseline commit: f06200a |
sumedhb1995
reviewed
Nov 26, 2020
sumedhb1995
reviewed
Nov 26, 2020
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
ChumpChief
approved these changes
Dec 1, 2020
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
packages/drivers/odsp-driver/src/odspDocumentDeltaConnection.ts
Outdated
Show resolved
Hide resolved
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We have some number of errors in telemetry that are hard to track down.
One of them is hitting this assert:
Another one is related to raising "disconnect" event on socket as result of closing one connection and and what it looks like failure / closure of some other connection reusing same socket.
Examining code reveals that decisions on whether socket is alive is wrong. We use disconnected property on the socket as a way to determine if socket can be reused. However, socket might have never had a chance to connect in the first place. In such case, not only we will not reuse it, but we will fail connection that is reusing same socket.
Fundamentally, the problem here is having global map and not really well isolated lifetime management. Using keys asks for trouble, as it leaves us in position where cleanup of old connections has to happen (synchronously) before new connection is going to reuse same slot (same key). This is not good, as we will keep running into interesting cases where this requirement is violated.
As such, I'm moving away from using keys in the code, to using actual SocketReference object, and only using key when creating / finding existing SocketReference. Once SocketReference is in bad state, it's removed from the list, but existing users of SocketReference can hold it as long as they need to do their cleanup. New users of the system will create new SocketReference but it will not interfere with old users who are going through cleanup.
This also allows us to reduce global state sharing across users of socket. In particular, we raise "disconnect" event on socket only in response to "server_disconnect" event and no longer do it for any other critical events on the socket, with expectation that all users will process individually whatever event causes us to believe socket is bad.
And with this, logic moved to SocketReference - place where it should have been in the first place, out of OdspDocumentDeltaConnection.
Initial problem of figuring out if socket was ever connected is solved via inUse flag as a proxy of "socket ever connected" tracking. Flag is se to true whenever any client is done with socket reference. This would happen either if client was able to connect, or hit an error during connection. If client hit error during connection, then likely we do not want to reuse the socket, even if socket is Ok. This may not be perfect solution (client may have hit timeout of waiting for "connect_document_success" message, even maybe not getting through establishing connection), but it also does not feel right if takes so long to establish socket connection, so better to retry.
And coming back to reentrancy (assert), I was not able to fully pinpoint the problem. But couple times I was able to hit it, we had two "disconnect" handlers - one from DeltaManager, one - connection listener (later should have been removed on successful connection). I've put more telemetry logging, and made code safe in those cases, will look for more telemetry to better understand the problem.