Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection recovery not working #1844

Closed
michalzaq12 opened this issue Aug 12, 2024 · 14 comments · Fixed by #1855
Closed

Connection recovery not working #1844

michalzaq12 opened this issue Aug 12, 2024 · 14 comments · Fixed by #1855
Assignees
Labels
bug Something isn't working. It's clear that this does need to be fixed.

Comments

@michalzaq12
Copy link

michalzaq12 commented Aug 12, 2024

Ably ver: 2.3.1

Description
Realtime connection (websocket, only websocket transport provided) is not restored after network connection is lost. SDK tries to connect without success (network is already available).

My observation

  1. When a network problem occurs, checkWsConnectivity() is called, which of course fails (that's fine).
  2. checkWsConnectivity() can by called only once (WHY?) , so wsCheckResult is always false
    if (this.wsCheckResult === null) {
  3. tryTransportWithFallbacks() is called when trying to reconnect
  4. shouldContinue() in tryTransportWithFallbacks() body always returns false, which call transport.dispose()

┆Issue is synchronized with this Jira Task by Unito

@sacOO7
Copy link
Collaborator

sacOO7 commented Aug 12, 2024

@michalzaq12 reconnection works in following manner

  1. Once client is disconnected, client checks if fallbacks can be used. Internet check is part of the same.
  2. When internet is available, all fallbacks ( 5 default ) are tried. If none of them succeed, connection goes into disconnected state again
  3. New connection is retried again after disconnectedRetryTimeout ( with fallbacks retried ), value for disconnectedRetryTimeout is 15 seconds.

@sacOO7
Copy link
Collaborator

sacOO7 commented Aug 12, 2024

@michalzaq12 Can you check confirm if connection recovery works after 15 seconds ?
Ideally, it should work as long as connection is in either disconnected or suspended state.
Only when connection goes into closed state, connection is not retried.

@michalzaq12
Copy link
Author

Once client is disconnected, client checks if fallbacks can be used. Internet check is part of the same.
When internet is available, all fallbacks ( 5 default ) are tried. If none of them succeed, connection goes into disconnected state again
New connection is retried again after disconnectedRetryTimeout, value is 15 seconds.

I know exactly how it works. It doesn't change the fact that there's some bug in the implementation (regression from v1).

Can you check confirm if connection recovery works after 15 seconds ?

After 15 seconds, the SDK tries to reconnect but fails every time.

@michalzaq12
Copy link
Author

Is there any template for minimal reproducible repo?

@sacOO7
Copy link
Collaborator

sacOO7 commented Aug 13, 2024

@michalzaq12 there's no template as such. It will be great if you can either post code here or best if create a separate repository with steps to reproduce the bug. That way, we can reproduce it given environment i.e. nodejs, browser etc

@sacOO7 sacOO7 added the bug Something isn't working. It's clear that this does need to be fixed. label Aug 13, 2024
@michalzaq12
Copy link
Author

michalzaq12 commented Aug 13, 2024

Environment: Windows 11, Chrome 127

Reproduction code (ably-js 2.3.1):

var client = new Ably.Realtime({
  key: '<ABLY_KEY>',
  transports: ['web_socket']
});

setInterval(() => {
  console.info('STATE: ' +  client.connection.state)
}, 5000)

Reproduction steps:

  1. Wait for connected state
  2. Disable WiFi (do not use 'offline' simulation from dev network tab)
  3. Wait for WebSocket connection to 'wss://ws-up.ably-realtime.com/' failed log
  4. Enable WiFi
  5. SDK tries to reconnect but fails every time

image

@sync-by-unito sync-by-unito bot assigned ttypic and VeskeR and unassigned ttypic Aug 14, 2024
@VeskeR
Copy link
Contributor

VeskeR commented Aug 15, 2024

Hi @michalzaq12 !

Thank you for spotting the issue and providing detailed explanations and steps to reproduce it.
I wasn't able to reproduce the issue locally as, in my case, the internet-up check completes before the startWebSocketSlowTimer timer performs its logic with checkWsConnectivity and sets this.wsCheckResult = false. However, I can see the race condition problem in the code, and artificially slowing down the resolution of the internet-up check produces the situation you're describing: shouldContinue now always returns false, and the client never reconnects.
This is indeed a bug, and we will fix it as soon as possible.

As a side note, is there something unusual about your network setup? For example, do you have a VPN program installed that might be keeping the request to https://internet-up.ably-realtime.com open longer than usual without internet access, which then causes the race condition on your machine?

@michalzaq12
Copy link
Author

Default network settings. I don't use a VPN.

@justerror
Copy link

Some of our users have encountered this reconnection issue after waking up PC. It seems that the problem is that the internet does not become available immediately after PC connecting to Wi-Fi or LAN.

I was able to reproduce this on macOS. To do so, I installed the Network Link Conditioner from Apple (https://developer.apple.com/download/more/?q=Additional%20Tools), which is part of the Additional Tools for Xcode package.

Steps:
1. Disconnect from the network.
2. Enable the Network Link Conditioner with a setting of 100% loss. You can open terminal window with ping some internet resource for monitoring, e.g. PING 8.8.8.8).
3. Reconnect to the network.
4. Wait for more than 4 seconds.
5. Disable the Network Link Conditioner.
6. The internet connection returns, but ably fails to reconnect, resulting in a loop of disconnecting -> connecting -> disconnecting -> …

As mentioned earlier, if the startWebSocketSlowTimer

webSocketSlowTimeout: 4000,
timeout triggers, the connection will not be restored due race condition problem.

Workaround

var client = new Ably.Realtime({
 ...
});

client.options.timeouts.webSocketSlowTimeout = 0; // <--- workaround

@michalzaq12
Copy link
Author

@VeskeR When will the bug be fixed?

@VeskeR
Copy link
Contributor

VeskeR commented Aug 30, 2024

Hi @michalzaq12 , fixing this will be our priority next week and I'm going to work on it on Monday and hopefully release a fix early next week.

@Gid733
Copy link

Gid733 commented Sep 5, 2024

Hello @VeskeR , same issue on the React Native, I'm not sure this workaround would work here. The problem happens when the internet connection is not quite good (especially on bad public wifi), and the app goes into an infinite loop of Disconnected -> Reconnecting -> Disconnected.

@VeskeR
Copy link
Contributor

VeskeR commented Sep 6, 2024

I was able to fix the issue locally and right now adding some additional tests for the future. We aim to review the PR and release a patch version on npm today.

VeskeR added a commit that referenced this issue Sep 6, 2024
…ycle

Fixes a regression introduced in ably-js v2 in the "no upgrade" PR [1].
Under certain network conditions (internet down) a race condition could
occur in the connection manager. Specifically, websocket connectivity
check in the `startWebSocketSlowTimer` function needed to set the
`wsCheckResult` flag to `false` before the timer was cleared by other
code branches. If that happened, it then caused an endless loop of
websocket connection retries, even after the network conditions
stabilized. A websocket transport would successfully open but then be
immediately disposed due to the `wsCheckResult` flag still being set to
`false`, without any mechanism to reset it in this scenario.

Upon closer look, it makes sense that the websocket connection health
flags (`wsCheckResult` and `abandonedWebSocket`) should be reset when
attempting a new websocket connection. The previous solution did not
explicitly reset these flags in the `connectWs` function, leading to the
described race condition and endless loop. Thus, `wsCheckResult` and
`abandonedWebSocket` are now reset to their neutral values upon entering
the `connectWs` function.

Resolves #1844

[1] #1645
@VeskeR
Copy link
Contributor

VeskeR commented Sep 6, 2024

Websocket reconnection issue has been fixed in ably-js 2.3.2 release.

@Gid733 The problem should be fixed on all platforms. Please update to ably-js 2.3.2 version and see if that fixes it in your case. If you're still experiencing reconnection issues with 2.3.2 release, please let us know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. It's clear that this does need to be fixed.
Development

Successfully merging a pull request may close this issue.

6 participants