Message queued indefinitely when reconnecting over XHR and failing to upgrade to WS #300

mattheworiordan · 2016-07-05T21:06:19Z

@SimonWoolf please update this ticket with the details of what we saw. I can provide more detailed logs if you need them.

SimonWoolf · 2016-07-07T13:52:50Z

Postmortem

Observed behaviour

A customer reported that occasionally, on a server running a node client lib instance (which was configured to renew a token every hour), outbound messages would be queued for up to an hour, then sent all at once.

Messages being queued for an hour should not be possible. Messages are queued while the library is attaching to a channel, but a failed attach times out after 10s and fails the messages. They are also queued while the library is in the connecting or disconnected states, e.g. if the lib loses connectivity, but only for a maximum of 2 minutes, after which the lib moves to the suspended state and fails the messages. Finally, they are queued while the library is in the synchronizing state, the last stage of a comet->websocket upgrade (a single websocket round-trip of a SYNC action carrying the connection serial at the point of changeover). At the time, this was the only one of the three with no timeout.
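To make that reasoning concrete, here is a minimal sketch that lists the queueing reasons with the bounds quoted above and asks which of them could account for a given delay. The type and function names are hypothetical and are not part of the library. The unbounded entry reflects the assumption that the synchronizing state had no timeout before the fix.

```typescript
// Hypothetical names; the bounds are the ones quoted in the postmortem.
type QueueReason = "attaching" | "connectingOrDisconnected" | "synchronizing";

// Longest time (ms) a message can stay queued for each reason.
const MAX_QUEUE_MS: Record<QueueReason, number> = {
  attaching: 10_000, // a failed attach times out after 10s
  connectingOrDisconnected: 2 * 60_000, // then the lib moves to suspended
  synchronizing: Infinity, // assumption: no timeout before the fix
};

// Returns the reasons whose bound is large enough to explain the observed delay.
function reasonsThatCanExplain(observedQueueMs: number): QueueReason[] {
  return (Object.keys(MAX_QUEUE_MS) as QueueReason[]).filter(
    (reason) => MAX_QUEUE_MS[reason] >= observedQueueMs,
  );
}
```

For a delay of one hour (3,600,000 ms), only the synchronizing entry remains, which matches the conclusion reached in the analysis.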

Analysis

Looking at the router logs: normally the library used websockets, but during this hour it was connected with comet. However, its comet traffic in that hour was entirely one-way:

$ grep <connectionId> gistfile1.txt | grep comet | grep T18 | grep recv | wc -l
313
$ grep <connectionId> gistfile1.txt | grep comet | grep T18 | grep send | wc -l
0

The library was only doing receives, no sends. Together with the use of comet rather than websockets, this suggested that it was stuck in the synchronizing state after a failed comet->websocket upgrade.
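The same check can be scripted. The sketch below mirrors the grep pipeline above, under the assumption that each log entry is one line and that the connection id, "comet", the hour marker, and "recv"/"send" appear as plain substrings. The function names are hypothetical.

```typescript
// Hypothetical helpers; they assume the same plain-substring matching as the greps above.
interface CometCounts {
  recv: number;
  send: number;
}

function countCometTraffic(
  logLines: string[],
  connectionId: string,
  hourMarker: string,
): CometCounts {
  // Equivalent of: grep <connectionId> | grep comet | grep <hourMarker>
  const relevant = logLines.filter(
    (line) =>
      line.includes(connectionId) &&
      line.includes("comet") &&
      line.includes(hourMarker),
  );
  return {
    recv: relevant.filter((line) => line.includes("recv")).length,
    send: relevant.filter((line) => line.includes("send")).length,
  };
}

// Receives without any sends is the pattern that pointed at a stuck upgrade.
function looksReceiveOnly(counts: CometCounts): boolean {
  return counts.recv > 0 && counts.send === 0;
}
```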

Actions

A timeout was added to the synchronizing state: if the SYNC round-trip does not complete within realtimeRequestTimeout, the library cancels the upgrade and reconnects from scratch (since both transports are then in an unknown state), so it can never stay in the synchronizing state for longer than that. Other changes were also made to the sync process for added robustness. See Sync robustness fixes #302.
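As an illustration of the timeout described above, here is a minimal sketch of a guard around the synchronizing state. It is not the code from #302: the class, state names, and callback are hypothetical, and the scheduler is injectable only so the behaviour can be exercised without real timers.

```typescript
// Hypothetical sketch; none of these names come from the library.
type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout.
const realSchedule: Schedule = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

type UpgradeState = "idle" | "synchronizing" | "upgraded" | "abandoned";

class UpgradeSync {
  private state: UpgradeState = "idle";
  private cancelTimer: Cancel | null = null;
  private readonly timeoutMs: number;
  private readonly onAbandon: () => void;
  private readonly schedule: Schedule;

  // timeoutMs plays the role of realtimeRequestTimeout.
  constructor(timeoutMs: number, onAbandon: () => void, schedule: Schedule = realSchedule) {
    this.timeoutMs = timeoutMs;
    this.onAbandon = onAbandon;
    this.schedule = schedule;
  }

  getState(): UpgradeState {
    return this.state;
  }

  // Call when the SYNC action is sent on the new transport.
  begin(): void {
    this.clearTimer();
    this.state = "synchronizing";
    this.cancelTimer = this.schedule(() => this.expire(), this.timeoutMs);
  }

  // Call when the SYNC reply arrives; a reply after the timeout is ignored.
  complete(): void {
    if (this.state !== "synchronizing") return;
    this.clearTimer();
    this.state = "upgraded";
  }

  // Timer callback: give up on the upgrade and let the caller reconnect from scratch.
  private expire(): void {
    if (this.state !== "synchronizing") return;
    this.cancelTimer = null;
    this.state = "abandoned";
    this.onAbandon();
  }

  private clearTimer(): void {
    if (this.cancelTimer !== null) {
      this.cancelTimer();
      this.cancelTimer = null;
    }
  }
}
```

In this sketch, the `onAbandon` callback is where a caller would drop both transports and start a fresh connection, which would then fail or flush the queued messages through the normal connection states.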
The library now remembers its preferred transport in memory within a library instance, so that once it has successfully upgraded to websockets, it goes straight to websockets on future connections rather than going through the upgrade process again. (This was already the case in browsers, but the preference was stored only in localStorage, not in memory, so it didn't apply to node instances.) See Store transport preference in memory, not just localStorage #303.
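The idea of keeping the preference in memory, with localStorage as an optional extra, might look roughly like the sketch below. It is not the code from #303: the class name, storage key, and transport string are hypothetical, and the storage interface is a minimal subset of the Web Storage API.

```typescript
// Hypothetical sketch; names and the storage key are illustrative only.
// Minimal subset of the Web Storage API needed here.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "example-transport-preference";

class TransportPreference {
  private inMemory: string | null = null;
  private readonly storage: StorageLike | null;

  // Pass null where no persistent storage exists (e.g. under node).
  constructor(storage: StorageLike | null) {
    this.storage = storage;
  }

  // Record the transport that was successfully upgraded to.
  remember(transport: string): void {
    this.inMemory = transport;
    if (this.storage !== null) {
      this.storage.setItem(STORAGE_KEY, transport);
    }
  }

  // Prefer the in-memory value; fall back to persistent storage if present.
  get(): string | null {
    if (this.inMemory !== null) return this.inMemory;
    return this.storage !== null ? this.storage.getItem(STORAGE_KEY) : null;
  }
}
```

With `null` storage, as on a node server, the preference still survives reconnections within the same instance, which is the case the original localStorage-only approach missed.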

mattheworiordan added the bug and high priority labels Jul 5, 2016

SimonWoolf mentioned this issue Jul 6, 2016

Sync robustness fixes #302

Closed

SimonWoolf closed this as completed Jul 7, 2016
