Can't rejoin room if network connectivity is interrupted #84

thazhemadam · 2020-10-09T06:39:07Z

Issue

If one of the devices loses network connectivity after joining a room, said device does not seem to be able to re-join the same room right after.

Steps to Reproduce

Create a new room using device A, and join said room with device B
Disconnect device B's network access
Try to rejoin the same room with device B

A Connection Error that says User with same name exists in this room gets displayed.

Potential Solution

Maintaining a temporary state locally which keeps tracks of connectivity, maybe?
Periodical ACKs from receiver device might also help.

blenderskool · 2020-10-09T07:30:55Z

@thazhemadam Interesting that this issue is still occurring. I'm not able to reproduce the issue from the steps.

As of now following points have already been implemented which is similar to the potential solution.

A local state is maintained to keep track of connectivity via navigator.onLine. (NOTE: This only checks whether the system is connected to a network and not whether the network itself is active and sharing data)
The WebSocket protocol defines a ping/pong mechanism for the ACKs. When the server sends a ping, the client must respond with a pong to confirm its presence. The interval for these requests is set to 30s (i.e. after every 30s, all sockets in the network are checked whether they are alive or not) which may need to be tweaked accordingly.

thazhemadam · 2020-10-09T07:49:48Z

Hmm, that's odd.
Anyway, this should give you some more clarity.

thazhemadam · 2020-10-09T08:06:07Z

A local state is maintained to keep track of connectivity via navigator.onLine. (NOTE: This only checks whether the system is connected to a network and not whether the network itself is active and sharing data)

Another idea could be encoding the "Cool nickname" for a device, with the MAC address of the device.
That way, if a device with the same "Cool nickname" tries to access the same room, a simple check would suffice to confirm if the device is the one that was initially connected and got disconnected (which hasn't been detected yet), or a different device.
I think technically, this would fix #72 as well.

blenderskool · 2020-10-09T08:45:01Z

Hmm, that's odd.
Anyway, this should give you some more clarity.

Thanks for providing this, I'm able to reproduce the issue now.

Another idea could be encoding the "Cool nickname" for a device, with the MAC address of the device.
That way, if a device with the same "Cool nickname" tries to access the same room, a simple check would suffice to confirm if the device is the one that was initially connected and got disconnected (which hasn't been detected yet), or a different device.

Yes, this can be a potential fix too. However, MAC addresses aren't accessible from JavaScript client directly. Instead, maybe we can allot a random unique id for every client when they use Blaze for the first time.

I think technically, this would fix #72 as well.

The major reason for preventing the same nicknames in the same room was a loose attempt in preventing same client from joining same room twice. Since it is possible for multiple clients with same nicknames it would make sense to switch to a unique id based solution.
This would not really solve this issue directly as it was meant to solve a different issue (which can be modified to accommodate the MAC address solution you explained above).

@thazhemadam I'm currently marking this issue as a bug as we first need to investigate where the two existing solutions are failing, if and how it can be fixed.

RotonEvan · 2020-10-18T19:55:25Z

@blenderskool Did you try removing Device B from the room when it loses connection? That might fix the problem. How are you keeping track whether a user (Device) is present in the room or if it has left?

blenderskool · 2020-10-18T20:20:18Z

@RotonEvan I don't think the server(which maintains the list of sockets) can be notified about the loss of socket when the socket itself loses connection. It currently performs a check on all sockets at every interval to remove dead sockets.

RotonEvan · 2020-10-19T00:10:57Z

@RotonEvan I don't think the server(which maintains the list of sockets) can be notified about the loss of socket when the socket itself loses connection. It currently performs a check on all sockets at every interval to remove dead sockets.

That should help only if you decrease the interval time to 1 sec

The major reason for preventing the same nicknames in the same room was a loose attempt in preventing same client from joining same room twice. Since it is possible for multiple clients with same nicknames it would make sense to switch to a unique id based solution.

Are you setting the nickname in the session?

For a solution, as you do not want a client to join same room twice, we need to detect the network lose from client side itself and add a state of network lose in the state machine. You need to put this logic in code. Else decreasing socket heartbeat interval to 1 sec might help too I think.

I can help in fixing this. We can discuss and decide on a solution. I'll make a PR if we get a solution. :)

blenderskool · 2020-10-19T03:00:35Z

@RotonEvan

That should help only if you decrease the interval time to 1 sec

Yes, this is the most straightforward fix for now. I'm thinking if there is another way of doing this without having to check every second (which might become a time consuming operation in case of large number of sockets?)

Are you setting the nickname in the session?

Yes, nickname is being used to uniquely identify peers on the server.

For a solution, as you do not want a client to join same room twice, we need to detect the network lose from client side itself and add a state of network lose in the state machine. You need to put this logic in code. Else decreasing socket heartbeat interval to 1 sec might help too I think.

On the client side, this check is being made via the navigator.onLine listeners which takes the user back to the Rooms page in case of connection loss.
This causes unmount of FileTransfer which also tries to close the socket connection from client side.

In summary, I think the best option in terms of the fix is to decrease the duration of interval as you mentioned earlier 🙂

RotonEvan · 2020-10-19T03:06:17Z

Can I work on this? I know it's a really easy fix but might help me in hacktoberfest :p

blenderskool · 2020-10-19T03:11:41Z

@RotonEvan Yes sure go ahead, but may I know the changes you will be making to solve this? Is it decreasing the heartbeat interval to 1s?

RotonEvan · 2020-10-19T03:52:21Z

@blenderskool you tell what seems to be the best fix for this issue. Decreasing interval is a way out. Maybe the best one even as of now.

blenderskool · 2020-10-19T04:30:09Z

@RotonEvan Decreasing interval makes the most sense as this is exactly why the ping/pong mechanism is mentioned in the WebSockets spec.
My major concern is how it would scale against a large number of socket connections. I couldn't find any "optimal" interval duration for this, so we might have to apply a safer growth function like log or sqrt to subtly increase the interval duration as the number of socket connections increase.

thazhemadam · 2020-10-19T08:36:58Z

This doesn't really sound like a permanent fix to me. The overhead incurred by performing this check every 1s will be much greater. It becomes a bottleneck for scalability. Secondly, what happens if this happens when Device B is transferring a file? Does the file get only partly received? Or does the received file get deleted?
I feel it'd be a better idea, to put Device A into a "suspend" state of sorts for a fixed time interval when the file transfer is paused (if that can be accommodated), during which it could rejoin the room and continue file transfer. This could save network data usage as well.
But how do we know when to put Device A into this suspended state? There was some discussion previously about uniquely identifying a device. Using this, a unique id can be hashed with the username and generated per user. The unique id can be stored locally as well. If it's possible, implementing a check for when an ACK for the sent data isn't received, and proceeding to transition the receiver device into the suspend state might be a better option. This way, if the device that momentarily disconnected can come back into the room (we know it's the same device thanks to the unique identifier), then all we have to do is flip it's state from "suspended" to "active" (or something of the sort...).

blenderskool · 2020-10-19T10:20:23Z

The overhead incurred by performing this check every 1s will be much greater.

Yep, as I said earlier this is a major concern with decreasing the timeout interval which is why it was set to 30s in the first place.

Secondly, what happens if this happens when Device B is transferring a file? Does the file get only partly received? Or does the received file get deleted?

There are two situations for understanding what happens here:

In the case of P2P sharing with WebTorrent, the file-sharing stops immediately if no other peer in the room has received the file.
In WebSockets (fallback if P2P isn't supported), this handling hasn't been thought of yet.

Well, this is a separate issue altogether and probably not what this issue was intended to be.

If it's possible, implementing a check for when an ACK for the sent data isn't received, and proceeding to transition the receiver device into the suspend state might be a better option.

I think this is also geared towards dealing with file transfer if there's a connection loss, which can be a separate issue.
Assuming "sent data" here means the file data being transferred, with WebTorrent it is not possible as it's P2P and the server does not maintain the progress of the file transfer. When sharing over WebSockets, the sender waits at every 25% interval to receive ACK from all receivers about the status of the transfer before continuing. If we send some frames just to indicate that sender is still alive in this waiting period, it ends up becoming similar to the ping/pong mechanism

blenderskool · 2020-10-19T10:32:38Z

Another idea could be encoding the "Cool nickname" for a device, with the MAC address of the device.
That way, if a device with the same "Cool nickname" tries to access the same room, a simple check would suffice to confirm if the device is the one that was initially connected and got disconnected (which hasn't been detected yet), or a different device.

Adding to what @thazhemadam said earlier, this might be the best approach (with unique ids instead of MAC addresses) till now (if we aren't decreasing the interval of ping/pong) to solve this issue. The server would still however not know that a peer has lost connection, which gives other peers in the same room a false impression that the peer (who lost connection) is still connected.

RotonEvan · 2020-10-19T22:03:43Z

You aren't logging in any user right? So in that case a unique ID is a bad idea as you would have to save the ID in browser cookies. Bdw can you tell me where you are processing the check that whether a user is already there in the room? Might get some idea from there...

blenderskool · 2020-10-20T02:55:54Z

@RotonEvan

So in that case a unique ID is a bad idea as you would have to save the ID in browser cookies

Umm I don't understand how it's bad because we have to store the id locally. Possibly localStorage as that's where all data is stored.

Bdw can you tell me where you are processing the check that whether a user is already there in the room?

Check server/index.js lines 37-41.

RotonEvan · 2020-10-20T03:05:10Z

@blenderskool

Umm I don't understand how it's bad because we have to store the id locally. Possibly localStorage as that's where all data is stored.

Cookies is bad as cuz it's not permanent as isn't in the control of the code. But about localStorage...what data do you store?

blenderskool · 2020-10-20T03:15:34Z

Cookies is bad as cuz it's not permanent as isn't in the control of the code

We are not using cookies for storing data :)

But about localStorage...what data do you store?

Maybe you can take a look at the code from your side too 😛
It holds user nickname, rooms that they joined before.

RotonEvan · 2020-10-20T03:43:52Z

Okay so...the problem is that your system isn't getting updated instantly when there is a connection failure for one device (consistency issue). To compensate that we can decrease the heartbeat interval. Now, you say a unique ID is a good solution. Let's first clear a few things.

You do not want one person to enter a room more than once. However can another person with same name enter that room? Right now no but should they be able to? If yes, then unique ID is required.
You cannot solve Can't rejoin room if network connectivity is interrupted #84 using unique ID as the server still sees that the user with the same ID is present in the room (unless updated through heartbeat confirmation).

The server needs to be updated. The quicker the better. This is the only way you can ensure consistency during a network partition.

blenderskool · 2020-10-20T04:03:43Z

Right now no but should they be able to? If yes, then unique ID is required.

Yes, there's an issue already for the same #72

You cannot solve #84 using unique ID as the server still sees that the user with the same ID is present in the room.

Yes, this is the caveat (quoted below) I mentioned in the earlier messages.

gives other peers in the same room a false impression that the peer (who lost connection) is still connected.

But I think the approach suggested by @thazhemadam was mainly to allow a peer (who lost connection) to rejoin a room within the 30s interval without getting the User with same name exists in this room error. If the peer joins within that 30s interval, well and good, otherwise the ping/pong would remove the peer after the interval is over. Keeping in mind that this approach does not let the server/other peers know that some peer lost connection and left the room.

To compensate that we can decrease the heartbeat interval.

Any thoughts on how much this would be decreased to? We can't go with 1s intervals.

RotonEvan · 2020-10-20T06:01:07Z

Any thoughts on how much this would be decreased to? We can't go with 1s intervals.

Normally to get back network connection it takes atleast 10s. However it is necessary to update the server as early as possible. So I think we can settle somewhere between 5-10s.

7 or 8 seems fine. Tell me one, I'll update the interval in a PR. And then I can try #72 and also the 30s buffer one separately.

blenderskool · 2020-10-20T07:18:27Z

Normally to get back network connection it takes atleast 10s

@RotonEvan Ok makes sense. We can try between 10s-15s.

@RotonEvan @thazhemadam From what I understand with the discussion here and issue #72,
We can try decreasing the interval duration to something between 10s to 15s after testing with a large number of sockets connected at once.
While this interval reduction would help the server to identify the broken connection faster, it doesn't solve the original issue of User with same name exists in this room in case the peer who lost connection joins back within the interval time. This is where I think we can modify the checks the server makes while allowing a peer to join a room.

If let's say after the refactoring from issue #72, a peer (let's call it peer 1) with an id (which already exists in the room) tries to join a room, server first sends a ping to the already joined peer (let's call this peer 2) in the room. Following cases can occur here:

If the peer 2 responds with a pong, we do not accept the peer 1's connection as there's already peer 2 with the same id as peer 1
If the peer 2 does not respond with a pong, we can assume that peer 2 is dead and can destroy it from the server and accept peer 1 in the room.

RotonEvan · 2020-10-20T18:20:55Z

Got busy with some other work, will update the interval with 12s as of now and create a PR.

This is where I think we can modify the checks the server makes while allowing a peer to join a room.

@blenderskool yeah these modifications seems good. #72 needs to get solved.

Heartbeat interval decreased to 12s from 30s.

… room If a user A is trying to join a room which already has user B with the same name as A, Blaze first checks if user B is "alive" by sending PING message and anticipating a PONG within some timeframe. If a PONG is received by user B, user A is not let into the room because of the name clash. However, if the PONG is not received by user B within the alotted timeframe, it is assumed that user B has disconnected and the stray socket on server is closed. User A is also let in because there is no name clash in this scenario. Fixes #84

blenderskool · 2022-06-29T14:15:17Z

Closing this issue as it has been fixed in commit 076bdf2 and will be released in Blaze v3.0.0
Refer to the commit description on the approach taken to solve this.

blenderskool added the bug Something isn't working label Oct 9, 2020

RotonEvan added a commit to RotonEvan/blaze that referenced this issue Oct 20, 2020

fix: server gets updated early upon connection lose blenderskool#84

df2a7ce

Heartbeat interval decreased to 12s from 30s.

blenderskool mentioned this issue Jun 30, 2021

Roadmap for Blaze v3 ⚡🥁 #128

Closed

14 tasks

blenderskool added this to the 3.0 milestone Jun 29, 2022

blenderskool self-assigned this Jun 29, 2022

blenderskool closed this as completed Jun 29, 2022

blenderskool mentioned this issue Jun 29, 2022

fix: server gets updated early upon connection lose #84 #88

Closed

blenderskool mentioned this issue Jan 3, 2023

Version 3.0.0 ⚡️🚀 #157

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't rejoin room if network connectivity is interrupted #84

Can't rejoin room if network connectivity is interrupted #84

thazhemadam commented Oct 9, 2020

blenderskool commented Oct 9, 2020 •

edited

Loading

thazhemadam commented Oct 9, 2020

thazhemadam commented Oct 9, 2020

blenderskool commented Oct 9, 2020 •

edited

Loading

RotonEvan commented Oct 18, 2020

blenderskool commented Oct 18, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 19, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 19, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 19, 2020 •

edited

Loading

thazhemadam commented Oct 19, 2020 •

edited

Loading

blenderskool commented Oct 19, 2020 •

edited

Loading

blenderskool commented Oct 19, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Jun 29, 2022

Can't rejoin room if network connectivity is interrupted #84

Can't rejoin room if network connectivity is interrupted #84

Comments

thazhemadam commented Oct 9, 2020

Issue

Steps to Reproduce

Potential Solution

blenderskool commented Oct 9, 2020 • edited Loading

thazhemadam commented Oct 9, 2020

thazhemadam commented Oct 9, 2020

blenderskool commented Oct 9, 2020 • edited Loading

RotonEvan commented Oct 18, 2020

blenderskool commented Oct 18, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 19, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 19, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 19, 2020 • edited Loading

thazhemadam commented Oct 19, 2020 • edited Loading

blenderskool commented Oct 19, 2020 • edited Loading

blenderskool commented Oct 19, 2020

RotonEvan commented Oct 19, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Oct 20, 2020

RotonEvan commented Oct 20, 2020

blenderskool commented Jun 29, 2022

blenderskool commented Oct 9, 2020 •

edited

Loading

blenderskool commented Oct 9, 2020 •

edited

Loading

blenderskool commented Oct 19, 2020 •

edited

Loading

thazhemadam commented Oct 19, 2020 •

edited

Loading

blenderskool commented Oct 19, 2020 •

edited

Loading