Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Message queued indefinitely when reconnecting over XHR and failing to upgrade to WS #300

Closed
mattheworiordan opened this issue Jul 5, 2016 · 1 comment
Labels
bug Something isn't working. It's clear that this does need to be fixed.

Comments

@mattheworiordan
Copy link
Member

@SimonWoolf please update this ticket with the details of what we saw. I can provide more detailed logs if you require.

@mattheworiordan mattheworiordan added bug Something isn't working. It's clear that this does need to be fixed. high priority labels Jul 5, 2016
@SimonWoolf
Copy link
Member

SimonWoolf commented Jul 7, 2016

Postmortem

Observed behaviour

A customer reported that occasionally, on a server running a node client lib instance (which was configured to renew a token every hour), outbound messages would be queued for up to an hour, then sent all at once.

Messages being queued for an hour should not be possible: messages are queued when the library is attaching to a channel, but a failed attach will time out after 10s and fail the messages. They are also queued when the library is in the connecting or disconnected states, eg if the lib loses connectivity, but that will only be the case for a maximum of 2 minutes, after which the lib will move to the suspended state and fail the messages. They are also queued while the library is in the synchronizing state, the last stage of a comet->websocket upgrade (involving a single websocket round-trip of a SYNC action with the connection serial at the point of changeover).

Analysis

Looking at the router logs: Normally the library used websockets, but during this hour, the library was connected with comet. However:

$ grep <connectionId> gistfile1.txt | grep comet | grep T18 | grep recv | wc -l
313
$ grep <connectionId> gistfile1.txt | grep comet | grep T18 | grep send | wc -l
0

The library was only doing receives, no sends. Together with the use of comet rather than websockets, this suggested that it was stuck in the synchronizing state after a failed comet->websocket upgrade.

Actions
  • A timeout was added to the synchronizing state so that if the SYNC roundtrip was not completed in under realtimeRequestTimeout, the library would cancel the upgrade (and reconnect from scratch, since both transports are then in an unknown state), so it will never stay in the synchronizing state for more than that time. Other changes were also made to the sync process for added robustness. See Sync robustness fixes #302
  • The library now remembers its preferred transport within a library instance, so that once successfully upgraded to websockets, it would go straight to that in future connections rather than going through the upgrade process again. (This was already the case with browsers, but only using localStorage, not in memory, so it didn't apply to node instances). See Store transport preference in memory, not just localStorage #303

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. It's clear that this does need to be fixed.
Development

No branches or pull requests

2 participants