Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't rejoin room if network connectivity is interrupted #84

Closed
thazhemadam opened this issue Oct 9, 2020 · 25 comments
Closed

Can't rejoin room if network connectivity is interrupted #84

thazhemadam opened this issue Oct 9, 2020 · 25 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@thazhemadam
Copy link

Issue

If one of the devices loses network connectivity after joining a room, said device does not seem to be able to re-join the same room right after.

Steps to Reproduce

  • Create a new room using device A, and join said room with device B
  • Disconnect device B's network access
  • Try to rejoin the same room with device B

A Connection Error that says User with same name exists in this room gets displayed.

Potential Solution

Maintaining a temporary state locally which keeps tracks of connectivity, maybe?
Periodical ACKs from receiver device might also help.

@blenderskool
Copy link
Owner

blenderskool commented Oct 9, 2020

@thazhemadam Interesting that this issue is still occurring. I'm not able to reproduce the issue from the steps.

As of now following points have already been implemented which is similar to the potential solution.

  • A local state is maintained to keep track of connectivity via navigator.onLine. (NOTE: This only checks whether the system is connected to a network and not whether the network itself is active and sharing data)
  • The WebSocket protocol defines a ping/pong mechanism for the ACKs. When the server sends a ping, the client must respond with a pong to confirm its presence. The interval for these requests is set to 30s (i.e. after every 30s, all sockets in the network are checked whether they are alive or not) which may need to be tweaked accordingly.

@thazhemadam
Copy link
Author

Hmm, that's odd.
Anyway, this should give you some more clarity.
blaze_reproduce

@thazhemadam
Copy link
Author

  • A local state is maintained to keep track of connectivity via navigator.onLine. (NOTE: This only checks whether the system is connected to a network and not whether the network itself is active and sharing data)

Another idea could be encoding the "Cool nickname" for a device, with the MAC address of the device.
That way, if a device with the same "Cool nickname" tries to access the same room, a simple check would suffice to confirm if the device is the one that was initially connected and got disconnected (which hasn't been detected yet), or a different device.
I think technically, this would fix #72 as well.

@blenderskool
Copy link
Owner

blenderskool commented Oct 9, 2020

Hmm, that's odd.
Anyway, this should give you some more clarity.

Thanks for providing this, I'm able to reproduce the issue now.

Another idea could be encoding the "Cool nickname" for a device, with the MAC address of the device.
That way, if a device with the same "Cool nickname" tries to access the same room, a simple check would suffice to confirm if the device is the one that was initially connected and got disconnected (which hasn't been detected yet), or a different device.

Yes, this can be a potential fix too. However, MAC addresses aren't accessible from JavaScript client directly. Instead, maybe we can allot a random unique id for every client when they use Blaze for the first time.

I think technically, this would fix #72 as well.

The major reason for preventing the same nicknames in the same room was a loose attempt in preventing same client from joining same room twice. Since it is possible for multiple clients with same nicknames it would make sense to switch to a unique id based solution.
This would not really solve this issue directly as it was meant to solve a different issue (which can be modified to accommodate the MAC address solution you explained above).

@thazhemadam I'm currently marking this issue as a bug as we first need to investigate where the two existing solutions are failing, if and how it can be fixed.

@blenderskool blenderskool added the bug Something isn't working label Oct 9, 2020
@RotonEvan
Copy link
Contributor

@blenderskool Did you try removing Device B from the room when it loses connection? That might fix the problem. How are you keeping track whether a user (Device) is present in the room or if it has left?

@blenderskool
Copy link
Owner

@RotonEvan I don't think the server(which maintains the list of sockets) can be notified about the loss of socket when the socket itself loses connection. It currently performs a check on all sockets at every interval to remove dead sockets.

@RotonEvan
Copy link
Contributor

@RotonEvan I don't think the server(which maintains the list of sockets) can be notified about the loss of socket when the socket itself loses connection. It currently performs a check on all sockets at every interval to remove dead sockets.

That should help only if you decrease the interval time to 1 sec

The major reason for preventing the same nicknames in the same room was a loose attempt in preventing same client from joining same room twice. Since it is possible for multiple clients with same nicknames it would make sense to switch to a unique id based solution.

Are you setting the nickname in the session?

For a solution, as you do not want a client to join same room twice, we need to detect the network lose from client side itself and add a state of network lose in the state machine. You need to put this logic in code. Else decreasing socket heartbeat interval to 1 sec might help too I think.

I can help in fixing this. We can discuss and decide on a solution. I'll make a PR if we get a solution. :)

@blenderskool
Copy link
Owner

@RotonEvan

That should help only if you decrease the interval time to 1 sec

Yes, this is the most straightforward fix for now. I'm thinking if there is another way of doing this without having to check every second (which might become a time consuming operation in case of large number of sockets?)

Are you setting the nickname in the session?

Yes, nickname is being used to uniquely identify peers on the server.

For a solution, as you do not want a client to join same room twice, we need to detect the network lose from client side itself and add a state of network lose in the state machine. You need to put this logic in code. Else decreasing socket heartbeat interval to 1 sec might help too I think.

On the client side, this check is being made via the navigator.onLine listeners which takes the user back to the Rooms page in case of connection loss.
This causes unmount of FileTransfer which also tries to close the socket connection from client side.

In summary, I think the best option in terms of the fix is to decrease the duration of interval as you mentioned earlier 🙂

@RotonEvan
Copy link
Contributor

Can I work on this? I know it's a really easy fix but might help me in hacktoberfest :p

@blenderskool
Copy link
Owner

@RotonEvan Yes sure go ahead, but may I know the changes you will be making to solve this? Is it decreasing the heartbeat interval to 1s?

@RotonEvan
Copy link
Contributor

@blenderskool you tell what seems to be the best fix for this issue. Decreasing interval is a way out. Maybe the best one even as of now.

@blenderskool
Copy link
Owner

blenderskool commented Oct 19, 2020

@RotonEvan Decreasing interval makes the most sense as this is exactly why the ping/pong mechanism is mentioned in the WebSockets spec.
My major concern is how it would scale against a large number of socket connections. I couldn't find any "optimal" interval duration for this, so we might have to apply a safer growth function like log or sqrt to subtly increase the interval duration as the number of socket connections increase.

@thazhemadam
Copy link
Author

thazhemadam commented Oct 19, 2020

This doesn't really sound like a permanent fix to me. The overhead incurred by performing this check every 1s will be much greater. It becomes a bottleneck for scalability. Secondly, what happens if this happens when Device B is transferring a file? Does the file get only partly received? Or does the received file get deleted?
I feel it'd be a better idea, to put Device A into a "suspend" state of sorts for a fixed time interval when the file transfer is paused (if that can be accommodated), during which it could rejoin the room and continue file transfer. This could save network data usage as well.
But how do we know when to put Device A into this suspended state? There was some discussion previously about uniquely identifying a device. Using this, a unique id can be hashed with the username and generated per user. The unique id can be stored locally as well. If it's possible, implementing a check for when an ACK for the sent data isn't received, and proceeding to transition the receiver device into the suspend state might be a better option. This way, if the device that momentarily disconnected can come back into the room (we know it's the same device thanks to the unique identifier), then all we have to do is flip it's state from "suspended" to "active" (or something of the sort...).

@blenderskool
Copy link
Owner

blenderskool commented Oct 19, 2020

The overhead incurred by performing this check every 1s will be much greater.

Yep, as I said earlier this is a major concern with decreasing the timeout interval which is why it was set to 30s in the first place.

Secondly, what happens if this happens when Device B is transferring a file? Does the file get only partly received? Or does the received file get deleted?

There are two situations for understanding what happens here:

  • In the case of P2P sharing with WebTorrent, the file-sharing stops immediately if no other peer in the room has received the file.
  • In WebSockets (fallback if P2P isn't supported), this handling hasn't been thought of yet.

Well, this is a separate issue altogether and probably not what this issue was intended to be.

If it's possible, implementing a check for when an ACK for the sent data isn't received, and proceeding to transition the receiver device into the suspend state might be a better option.

I think this is also geared towards dealing with file transfer if there's a connection loss, which can be a separate issue.
Assuming "sent data" here means the file data being transferred, with WebTorrent it is not possible as it's P2P and the server does not maintain the progress of the file transfer. When sharing over WebSockets, the sender waits at every 25% interval to receive ACK from all receivers about the status of the transfer before continuing. If we send some frames just to indicate that sender is still alive in this waiting period, it ends up becoming similar to the ping/pong mechanism

@blenderskool
Copy link
Owner

Another idea could be encoding the "Cool nickname" for a device, with the MAC address of the device.
That way, if a device with the same "Cool nickname" tries to access the same room, a simple check would suffice to confirm if the device is the one that was initially connected and got disconnected (which hasn't been detected yet), or a different device.

Adding to what @thazhemadam said earlier, this might be the best approach (with unique ids instead of MAC addresses) till now (if we aren't decreasing the interval of ping/pong) to solve this issue. The server would still however not know that a peer has lost connection, which gives other peers in the same room a false impression that the peer (who lost connection) is still connected.

@RotonEvan
Copy link
Contributor

You aren't logging in any user right? So in that case a unique ID is a bad idea as you would have to save the ID in browser cookies. Bdw can you tell me where you are processing the check that whether a user is already there in the room? Might get some idea from there...

@blenderskool
Copy link
Owner

@RotonEvan

So in that case a unique ID is a bad idea as you would have to save the ID in browser cookies

Umm I don't understand how it's bad because we have to store the id locally. Possibly localStorage as that's where all data is stored.

Bdw can you tell me where you are processing the check that whether a user is already there in the room?

Check server/index.js lines 37-41.

@RotonEvan
Copy link
Contributor

@blenderskool

Umm I don't understand how it's bad because we have to store the id locally. Possibly localStorage as that's where all data is stored.

Cookies is bad as cuz it's not permanent as isn't in the control of the code. But about localStorage...what data do you store?

@blenderskool
Copy link
Owner

Cookies is bad as cuz it's not permanent as isn't in the control of the code

We are not using cookies for storing data :)

But about localStorage...what data do you store?

Maybe you can take a look at the code from your side too 😛
It holds user nickname, rooms that they joined before.

@RotonEvan
Copy link
Contributor

Okay so...the problem is that your system isn't getting updated instantly when there is a connection failure for one device (consistency issue). To compensate that we can decrease the heartbeat interval. Now, you say a unique ID is a good solution. Let's first clear a few things.

  1. You do not want one person to enter a room more than once. However can another person with same name enter that room? Right now no but should they be able to? If yes, then unique ID is required.
  2. You cannot solve Can't rejoin room if network connectivity is interrupted #84 using unique ID as the server still sees that the user with the same ID is present in the room (unless updated through heartbeat confirmation).

The server needs to be updated. The quicker the better. This is the only way you can ensure consistency during a network partition.

@blenderskool
Copy link
Owner

Right now no but should they be able to? If yes, then unique ID is required.

Yes, there's an issue already for the same #72

You cannot solve #84 using unique ID as the server still sees that the user with the same ID is present in the room.

Yes, this is the caveat (quoted below) I mentioned in the earlier messages.

gives other peers in the same room a false impression that the peer (who lost connection) is still connected.

But I think the approach suggested by @thazhemadam was mainly to allow a peer (who lost connection) to rejoin a room within the 30s interval without getting the User with same name exists in this room error. If the peer joins within that 30s interval, well and good, otherwise the ping/pong would remove the peer after the interval is over. Keeping in mind that this approach does not let the server/other peers know that some peer lost connection and left the room.

To compensate that we can decrease the heartbeat interval.

Any thoughts on how much this would be decreased to? We can't go with 1s intervals.

@RotonEvan
Copy link
Contributor

Any thoughts on how much this would be decreased to? We can't go with 1s intervals.

Normally to get back network connection it takes atleast 10s. However it is necessary to update the server as early as possible. So I think we can settle somewhere between 5-10s.

7 or 8 seems fine. Tell me one, I'll update the interval in a PR. And then I can try #72 and also the 30s buffer one separately.

@blenderskool
Copy link
Owner

Normally to get back network connection it takes atleast 10s

@RotonEvan Ok makes sense. We can try between 10s-15s.

@RotonEvan @thazhemadam From what I understand with the discussion here and issue #72,
We can try decreasing the interval duration to something between 10s to 15s after testing with a large number of sockets connected at once.
While this interval reduction would help the server to identify the broken connection faster, it doesn't solve the original issue of User with same name exists in this room in case the peer who lost connection joins back within the interval time. This is where I think we can modify the checks the server makes while allowing a peer to join a room.

If let's say after the refactoring from issue #72, a peer (let's call it peer 1) with an id (which already exists in the room) tries to join a room, server first sends a ping to the already joined peer (let's call this peer 2) in the room. Following cases can occur here:

  • If the peer 2 responds with a pong, we do not accept the peer 1's connection as there's already peer 2 with the same id as peer 1
  • If the peer 2 does not respond with a pong, we can assume that peer 2 is dead and can destroy it from the server and accept peer 1 in the room.

@RotonEvan
Copy link
Contributor

Got busy with some other work, will update the interval with 12s as of now and create a PR.

This is where I think we can modify the checks the server makes while allowing a peer to join a room.

@blenderskool yeah these modifications seems good. #72 needs to get solved.

RotonEvan added a commit to RotonEvan/blaze that referenced this issue Oct 20, 2020
Heartbeat interval decreased to 12s from 30s.
@blenderskool blenderskool added this to the 3.0 milestone Jun 29, 2022
@blenderskool blenderskool self-assigned this Jun 29, 2022
blenderskool added a commit that referenced this issue Jun 29, 2022
… room

If a user A is trying to join a room which already has user B with
the same name as A, Blaze first checks if user B is "alive" by
sending PING message and anticipating a PONG within some
timeframe.
If a PONG is received by user B, user A is not let into the room
because of the name clash.
However, if the PONG is not received by user B within the alotted
timeframe, it is assumed that user B has disconnected and the
stray socket on server is closed. User A is also let in because
there is no name clash in this scenario.

Fixes #84
@blenderskool
Copy link
Owner

Closing this issue as it has been fixed in commit 076bdf2 and will be released in Blaze v3.0.0
Refer to the commit description on the approach taken to solve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants