Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ODSP socket reuse logic #4483

Merged
merged 7 commits into from
Dec 1, 2020
Merged

Improve ODSP socket reuse logic #4483

merged 7 commits into from
Dec 1, 2020

Conversation

vladsud
Copy link
Contributor

@vladsud vladsud commented Nov 25, 2020

We have some number of errors in telemetry that are hard to track down.
One of them is hitting this assert:

assert(key !== undefined, "reentrancy not supported!");

Another one is related to raising "disconnect" event on socket as result of closing one connection and and what it looks like failure / closure of some other connection reusing same socket.

Examining code reveals that decisions on whether socket is alive is wrong. We use disconnected property on the socket as a way to determine if socket can be reused. However, socket might have never had a chance to connect in the first place. In such case, not only we will not reuse it, but we will fail connection that is reusing same socket.

Fundamentally, the problem here is having global map and not really well isolated lifetime management. Using keys asks for trouble, as it leaves us in position where cleanup of old connections has to happen (synchronously) before new connection is going to reuse same slot (same key). This is not good, as we will keep running into interesting cases where this requirement is violated.

As such, I'm moving away from using keys in the code, to using actual SocketReference object, and only using key when creating / finding existing SocketReference. Once SocketReference is in bad state, it's removed from the list, but existing users of SocketReference can hold it as long as they need to do their cleanup. New users of the system will create new SocketReference but it will not interfere with old users who are going through cleanup.

This also allows us to reduce global state sharing across users of socket. In particular, we raise "disconnect" event on socket only in response to "server_disconnect" event and no longer do it for any other critical events on the socket, with expectation that all users will process individually whatever event causes us to believe socket is bad.

And with this, logic moved to SocketReference - place where it should have been in the first place, out of OdspDocumentDeltaConnection.

Initial problem of figuring out if socket was ever connected is solved via inUse flag as a proxy of "socket ever connected" tracking. Flag is se to true whenever any client is done with socket reference. This would happen either if client was able to connect, or hit an error during connection. If client hit error during connection, then likely we do not want to reuse the socket, even if socket is Ok. This may not be perfect solution (client may have hit timeout of waiting for "connect_document_success" message, even maybe not getting through establishing connection), but it also does not feel right if takes so long to establish socket connection, so better to retry.

And coming back to reentrancy (assert), I was not able to fully pinpoint the problem. But couple times I was able to hit it, we had two "disconnect" handlers - one from DeltaManager, one - connection listener (later should have been removed on successful connection). I've put more telemetry logging, and made code safe in those cases, will look for more telemetry to better understand the problem.

@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Nov 25, 2020

@fluidframework/base-host: No change
Metric NameBaseline SizeCompare SizeSize Diff
main.js 163.84 KB 163.84 KB No change
Total Size 163.84 KB 163.84 KB No change
@fluid-example/bundle-size-tests: -1.26 KB
Metric NameBaseline SizeCompare SizeSize Diff
container.js 187.28 KB 187.01 KB -278 Bytes
map.js 46.21 KB 45.94 KB -282 Bytes
matrix.js 144.81 KB 144.53 KB -282 Bytes
odspDriver.js 194.38 KB 194.23 KB -162 Bytes
sharedString.js 158.32 KB 158.04 KB -283 Bytes
Total Size 731 KB 729.75 KB -1.26 KB

Baseline commit: f06200a

Generated by 🚫 dangerJS against fc06a8b

@vladsud vladsud merged commit ae0afbe into microsoft:main Dec 1, 2020
@vladsud vladsud deleted the Disconnect branch December 1, 2020 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants