
Rocket.Chat stops working with 1000 active users #11288

Closed
AmShaegar13 opened this issue Jun 28, 2018 · 77 comments

Comments

@AmShaegar13
Contributor

AmShaegar13 commented Jun 28, 2018

Description:

For us, Rocket.Chat does not work with more than 1000 active users. Rebooting a server, restarting Apache or restarting Rocket.Chat after an update causes all clients to face serious issues connecting to the chat.

Steps to reproduce:

  1. Set up a chat with 1000 simultaneously active users
  2. Restart all instances at once.

Expected behavior:

Clients can reconnect to the chat.

Actual behavior:

While clients are reconnecting, the server sends an enormous number of the following messages over the websocket:

{"msg":"added","collection":"users","id":"$userId","fields":{"name":"$displayName","status":"online","username":"$username","utcOffset":2}}
{"msg":"changed","collection":"users","id":"$userId","fields":{"status":"online"}}
{"msg":"removed","collection":"users","id":"$userId"}

This continues until the server closes the websocket. I assume this is due to the lack of ping/pong messages during this time. The client instantly opens a new websocket, starting the whole thing over and over again.

The only effective way to get the cluster up and working again is to force-logout all users by deleting their loginTokens from MongoDB directly.
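
For reference, this is roughly what that force-logout looks like in the mongo shell (a minimal sketch; it assumes the default Meteor accounts schema used by Rocket.Chat, where tokens live in services.resume.loginTokens):

// mongo shell sketch: clear all login tokens so every client has to authenticate again
// (assumes the default Meteor/Rocket.Chat schema; adjust to your deployment)
db.users.updateMany(
  { "services.resume.loginTokens": { $exists: true } },
  { $set: { "services.resume.loginTokens": [] } }
)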

Server Setup Information:

  • Version of Rocket.Chat Server: 0.65.2
  • Operating System: Debian 8.11
  • Deployment Method: tar with pm2
  • Number of Running Instances: 8 virtual machines with 3 instances each (24 instances)
  • DB Replicaset Oplog: On
  • NodeJS Version: 8.9.4
  • MongoDB Version: 3.4.9

Additional context

The high number of instances we operate is a direct result of this issue. When we first ran into it with about 700 users, we assumed we might need to scale the cluster accordingly, but we are not willing to add another server to the cluster for every 40 new users. We planned to support around 8000 users, approximately half of them active.

For now, we do not allow mobile clients. We would really love to, but with the current state of the cluster this won't happen soon.

@magicbelette
Contributor

Do you have Apache2 or Nginx as the frontend?
Maybe you've reached some limit (MaxClients?) on the frontend.

What about system usage (RAM, CPU, Network, FS) for the machines of the cluster?

Cheers

@AmShaegar13
Contributor Author

We are using Apache as a reverse proxy. The servers have 16 GB RAM available and only 1.5 GB is used per instance. CPU usage goes up to its limit during the reconnects.

[screenshots of CPU/RAM usage from 2018-06-28 attached]

@kaiiiiiiiii
Contributor

Sounds like you reached the maximum number of MongoDB connections (1024 by default on Linux, as far as I know).

@vynmera
Contributor

vynmera commented Jun 28, 2018

@AmShaegar13 You could try using nginx instead of Apache, and as suggested check your mongo settings?

@qchn

qchn commented Jun 28, 2018

hi @kaiiiiiiiii,
I am the admin of @AmShaegar13's Rocket.Chat setup; he kindly asked me to post this here:

001-rs:PRIMARY> db.serverStatus().connections
{ "current" : 182, "available" : 51018, "totalCreated" : 3234457 }

root@rocketchatdb:~# lsof -i | grep mongodb | wc -l
186

So this shouldn't be a thing…

Best,
qchn

@magicbelette
Contributor

Did you check the Apache2 log for MaxClients reached?

@qchn

qchn commented Jun 28, 2018

Yes, @magicbelette, thanks for the hint. We configured MaxClients to 1500 per node and we're far from reaching that.

@AmShaegar13
Contributor Author

AmShaegar13 commented Jun 28, 2018

@qchn Thanks. ;)

@magicbelette Yup, no errors regarding MaxClients. At most, rare proxy connection timeouts (about one per hour).

@vynmera Trying things out is not something I can easily do; it requires another downtime for our users. Additionally, I don't really suspect Apache to be the problem here. Node is causing the CPU load and HTTP is doing fine: I can load all scripts and assets just fine. It's just the websocket that never finishes receiving those collection update messages.

@jhermann

Sounds like a job for exponential back-off on the client side, after, say, 2-3 failed websocket reconnects.
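
Something along these lines on the client (just an illustrative sketch, not Rocket.Chat's actual DDP reconnect code; the URL is a placeholder):

// illustrative sketch of exponential back-off for websocket reconnects
let attempt = 0;

function connect() {
  const socket = new WebSocket('wss://chat.example.com/websocket');
  socket.onopen = () => { attempt = 0; };               // reset after a successful connect
  socket.onclose = () => {
    attempt += 1;
    const delay = Math.min(30000, 1000 * 2 ** attempt); // 2 s, 4 s, 8 s, ... capped at 30 s
    setTimeout(connect, delay);
  };
}

connect();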

@dmoldovan123

dmoldovan123 commented Jun 29, 2018

Hello, you can try HAProxy and use forever-service to run the nodes: HAProxy -> n nodes -> 1 MongoDB server.

@AmShaegar13
Contributor Author

@dmoldovan123 As already mentioned in my reply to vynmera, I can't just try various things. I have to maintain a stable service for 1000 active users. So if you could give me a hint as to why HAProxy with forever-service would be better than Apache with pm2, I would be really grateful. That would give me something to justify breaking the service (again) on purpose.

The thing is, I do not see how a different proxy or service manager would reduce the status-changed messages over the websocket.

@magicbelette
Contributor

I don't think that's the best idea ever, but you can easily test without pm2, directly with systemd. I don't know much about pm2, but the fact is that you add another layer and potentially a bottleneck.

Another thing, from my experience: be careful with the Apache2 config... My instance was incredibly slow (3 seconds to load each avatar). My Apache2 used mpm_prefork with a dumb copy/paste (MaxRequestsPerChild 1). The servers were consuming a lot of resources forking new processes, with a bad user experience, but there was no system load. It took me a couple of days to figure it out :/

@AmShaegar13
Contributor Author

I am using pm2 in fork mode, so no extra layer should be present. Three instances of Rocket.Chat are running, each on its own port.

Cluster mode did not work for some reason.

@dmoldovan123

https://rocket.chat/docs/installation/manual-installation/multiple-instances-to-improve-performance/
Use HAProxy, not nginx. It works very fast with HAProxy.

@AmShaegar13
Contributor Author

@dmoldovan123 This is what we already do. As you can see in the issue description, I am running 8 servers with 3 instances each behind a reverse proxy to utilize the CPU cores. I don't see how another proxy would impact the CPU load of the node processes. We are using MongoDB with a replica set, and the instances can communicate with each other because I set INSTANCE_IP.

I am pretty sure this issue is related to this one in the user-presence library Rocket.Chat uses as well.

@rodrigok @sampaiodiego Can one of you confirm this?

@magicbelette
Contributor

@AmShaegar13
Contributor Author

AmShaegar13 commented Jul 5, 2018

Thanks for all of your suggestions. We could now prove that the UserPresenceMonitor was responsible for the denial of service we faced.

We disabled it on all but two separate instances and can restart the cluster now without causing tons of status updates.

We did so by patching the source and setting a USER_PRESENCE_MONITOR environment variable:

--- rocket.chat/programs/server/app/app.js	2018-07-04 18:07:36.917547890 +0200
+++ app.js	2018-07-04 18:10:12.273401726 +0200
@@ -7753,7 +7753,10 @@
 
   InstanceStatus.registerInstance('rocket.chat', instance);
   UserPresence.start();
-  return UserPresenceMonitor.start();
+
+  if (process.env['USER_PRESENCE_MONITOR']) {
+    return UserPresenceMonitor.start();
+  }
 });
 /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 

I would still like to have an official fix for this rather than patching the source with every update.

@magicbelette We still use the slow database engine, but we do not observe high CPU load or memory usage on the database servers yet.

@elie222

elie222 commented Jul 5, 2018

I am also surprised that user presence hasn't caused more issues for more people. I raised the issue a while back.

Another question is whether pm2 is handling sessions properly. Meteor uses sticky sessions, and if they are not handled properly your servers may be doing a lot of extra work constantly logging users in again.

I'd look at adding Kadira (Meteor APM) to check app performance. NodeChef offers a solution for 50 dollars per month (as does Meteor's Galaxy hosting, but then you have to use their hosting service, which is pricey).

@magicbelette
Contributor

I'm not sure, but I have the feeling that this patch causes users to appear offline to some others. As a consequence, users receive email notifications even though they are online.

@AmShaegar13 did you notice this, or am I totally wrong?

@AmShaegar13
Contributor Author

AmShaegar13 commented Jul 19, 2018

I think you are not. Usually the status appears to be correct, but I have already had complaints about unnecessary emails. So yes, there is still something wrong with it, and I am still hoping this will be fixed. But for now, we can at least use the chat again.

Currently, 1177 users at its peak.

@AmShaegar13
Contributor Author

The same appears to happen the other way round. No email notification although the user is offline in all clients.

@AmShaegar13
Contributor Author

This is becoming a major annoyance. More and more users are complaining about broken notifications. This issue is a major drawback for acceptance in our company.

@geekgonecrazy
Contributor

How many instances are you distributing the load across?

@AmShaegar13
Contributor Author

We are running 6 servers with 3 instances each, without the UserPresenceMonitor (see the patch above), behind a load balancer. Additionally, we run 2 servers with 1 instance each with the UserPresenceMonitor enabled; these are not load-balanced, so no users reach them. Those two servers are dedicated to running the UserPresenceMonitor.

This setup keeps the cluster at least stable but causes the aforementioned problems with notifications.

@geekgonecrazy
Contributor

Just wanted to follow up here. We are working through another case like this. So this is definitely on our radar.

@nmagedman
Contributor

nmagedman commented Oct 17, 2018

We were having pretty much identical symptoms. CPU pegged at 100%. Packet storm.

We implemented @AmShaegar13’s July 5 patch and (combined with splitting the servers into two Auto-Scaling Groups with different environment variables set) it solved that problem. We then noticed that we were experiencing some of the side-effects mentioned above, including users being marked as Away even when actively using the app. User activity would mark the user as Online for a split-second but then the user would return to Away.

I was concerned that this fix completely broke the User Presence system, but the almost-immediately-Away problem turned out to be something much simpler. Not a runtime failure, but just a configuration bug. As discussed in Issue #11309 (comment), releases 0.64.0 and 0.66.0 changed the semantics of the "Idle Time Limit" user config setting, changing the units of the idle timeout from milliseconds to seconds. I don't know if the migrations were broken, or ran twice, or something else, but the end result is that the 300-second idle timeout somehow became 0.3 seconds!
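
Purely as an illustration of the arithmetic (I still don't know which migration actually misbehaved), applying the ms-to-seconds conversion one time too many is enough to produce exactly that value:

// illustrative only: a ms -> s conversion applied twice turns 300 s into 0.3 s
const idleTimeLimitMs = 300000;                 // 300 s stored in milliseconds (old semantics)
const convertedOnce   = idleTimeLimitMs / 1000; // 300  -> correct value in seconds
const convertedTwice  = convertedOnce / 1000;   // 0.3  -> the broken value we observed
console.log(convertedOnce, convertedTwice);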

Point being, be aware that there are multiple issues to manage here.

@KirbySSmith
Contributor

We see similar issues with CPU related to user status when doing blue-green deploys. It seems to be partly related to the activeUsers publication.
https://github.com/RocketChat/Rocket.Chat/blob/0.70.4/server/publications/activeUsers.js

When a server goes offline, all the client connections for that server are removed from the DB by the other online servers, or by the next server to come online.
https://github.com/Konecty/meteor-user-presence/blob/master/server/server.js#L82

When the clients reconnect to the new servers, they create new client connections.

Both of these trigger the user records to be updated: status offline, then online. Since the activeUsers publication notifies each client about changes to active users, that can be the number of active users ×2 records sent to each client to process. This causes the clients to fall behind in processing user statuses. It also seems to have a snowball effect, because each client will try to report the user's status multiple times as it struggles to sync user statuses. You can see the flood of user status updates by using the Chrome dev tools to monitor the WebSocket frames when restarting the server.
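
For context, the publication is roughly of this shape (a paraphrased Meteor sketch with illustrative field names; see the linked activeUsers.js for the real code):

// rough sketch of an "active users" style Meteor publication (illustrative, not the real file)
Meteor.publish('activeUsers', function () {
  // every subscribed client receives added/changed/removed events
  // for every user whose status is currently not offline
  return Meteor.users.find(
    { status: { $ne: 'offline' } },
    { fields: { name: 1, username: 1, status: 1, utcOffset: 1 } }
  );
});

So every status flip on any user fans out to every connected client, which is where the "active users ×2" volume above comes from.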

@tpDBL
Contributor

tpDBL commented Oct 22, 2018

Question for @geekgonecrazy:
If this is on the radar as you say, would it be a bad idea to submit the 4-line workaround from AmShaegar13 mentioned earlier as a pull request? It could be an option for user groups not using the UserPresenceMonitor in the meantime.

@AmShaegar13
Contributor Author

No. MongoDB is pretty stable and can even handle high load. Also, our MongoDB instances are on separate hosts.

@introspection3

You can try nginx with HTTP/2. Could you tell me: does pm2 work?

@AmShaegar13
Contributor Author

No, I cannot try anything. I have 2000 users working from home because of COVID-19 who rely on a stable service.

@introspection3

introspection3 commented Mar 17, 2020

Does pm2 cluster mode work, or Docker Swarm? I think we can try HTTP/2 or HTTP/3.

@introspection3

What is the gateway's function? @nmagedman @AmShaegar13

@AmShaegar13
Contributor Author

The gateway is used for push notifications from your Rocket.Chat instance to Android/iOS.

pm2 in cluster mode does not work. Every Rocket.Chat instance needs its own dedicated port. Docker should be fine as long as you pass the instance IP as the INSTANCE_IP environment variable to Rocket.Chat.

@introspection3

@AmShaegar13, thank you sir. Does it work fine with multiple instances, or are there some bugs?

@AmShaegar13
Contributor Author

Multiple instances work if you correctly set the INSTANCE_IP environment variable.

@introspection3

@AmShaegar13 Thanks. Somebody said multiple instances resolved their performance problem, but that there's a bug with notifications. Does this still exist?

@AmShaegar13
Contributor Author

As I said, the issue has been fixed.

@introspection3

@AmShaegar13 Thanks very much. Jesus bless you.


@sampaiodiego
Member

Thanks @AmShaegar13 for all your support.. nice to see your numbers and what you've accomplished.. if you get a chance, try upgrading to 3.0.x, as we removed a lot of Meteor stuff so we can be more scalable.

@564064202 if you have a different issue, please open a new issue then

@AmShaegar13
Contributor Author

@sampaiodiego Thanks to COVID-19 we had a lot of trouble scaling beyond 1650 active users. Thanks to v3.0.4 we are now at a maximum of 2250 users per day. Thank you for further improving on this issue.

@magicbelette
Contributor

7K users on 55 nodes, but still on Rocket.Chat 2.x.

My problem upgrading to Rocket.Chat 3.x is the number of nodes. It seems that MongoDB doesn't cope well with a large number of new instances broadcasting at the same time when Node.js is restarted.

Can I restart only a few nodes at a time? That implies some Rocket.Chat instances running version 3.x while others run 2.x.

Cheers

@sampaiodiego
Member

Can I restart only a few nodes at a time? That implies some Rocket.Chat instances running version 3.x while others run 2.x.

Yes you can, @magicbelette.. it is not always recommended, as at some point the version schemas might be incompatible, but you usually can do that. We actually use a rolling-upgrade strategy on k8s that does exactly the same. 😉

@magicbelette
Contributor

In our case, after migrating to v3.0.4 (today) from v2.2.1 (yesterday), every node process uses 100% of CPU :/

We keep the config DISABLE_PRESENCE_MONITOR=true on 53/55 instances.

[screenshot from 2020-03-26 attached]

@introspection3

@AmShaegar13 Sir, could you give me a hand with #17020?

Running docker-compose with multiple instances at the same time hits a bug: 'MongoError: ns not found', 'Errors like this can cause oplog processing errors.'

@introspection3

@sampaiodiego Sir, could you give me a hand with #17020?

Running docker-compose with multiple instances at the same time hits a bug: 'MongoError: ns not found', 'Errors like this can cause oplog processing errors.'

@introspection3

introspection3 commented Apr 25, 2020

We are currently running 2.4.8. Can't tell you much about the physical hardware as the servers are virtual machines in our internal VM cluster. 4 cores, 16 GB RAM. That's all I have at the moment. Also, we stopped using pm2 and use systemd services now. However, this should not have any impact.
@AmShaegar13
Dear Sir, I am so sorry to bother you, but I don't know who can tell me how to support more than 5000 people online. You have more than two thousand people online; if I want 5000 online users, how many servers do I need (or what server configuration is required)? Could you tell me, or give me some suggestions? Please help me, for God's sake. Thank you very much.

@rodrigok
Member

@564064202 for 5k online users 8 instances should be enough, but to help you correctly we would need to understand all aspects of your installation, usage, etc. If you have any kind of support contract we can do it quicker; without one you may need to wait for answers here when we have time, or for help from the community.

Some basic advice:

  • For scale we recommend a Docker-based installation (k8s, OpenShift, or, at least, docker-compose).
  • Use SSDs for the database and use replicas.
  • With the latest version you can disable some features in the Troubleshooting section of the admin area; test them and check how they affect performance.

@introspection3

introspection3 commented Apr 25, 2020

@rodrigok @AmShaegar13
We found that the 3.* versions are often unstable, so we dare not use them. What hardware configuration do you need for the 8 running instances? We also plan to use SSDs to run MongoDB. All running platforms are on the cloud.

@ankar84

ankar84 commented May 15, 2020

Our current setup is 8 servers with one instance each. Downscaling from 3 instances each helped us support more users, apparently. These NxN connections between each and every Rocket.Chat instance seem to limit scalability.

Hi, @AmShaegar13
We are having very similar issues, and #14488 in my opinion didn't help a lot.
Now we are on 3.1.1 and experiencing sporadic high resource load when presence statuses, and the notifications about them, come into play.
I still do not see a way to reboot a single instance - it will flood all the other instances and we will need to restart all instances with warm-ups like @nmagedman described here.
And I'd like to ask you @AmShaegar13 and @nmagedman about this:

They did. #12353 However, they (correctly) changed the semantics of the environment variable from opt-in to opt-out.

Is it true that the whole logic in #12353 is reversed compared to that patch?
I mean, if I set DISABLE_PRESENCE_MONITOR=yes in the docker-compose file, what happens? Will that instance start with UserPresenceMonitor.start(); or without it?

@AmShaegar13
Contributor Author

Hi @ankar84,
as the name of the environment variable DISABLE_PRESENCE_MONITOR indicates, if you set it to true or yes, the presence monitor is disabled; otherwise it is enabled. So it works like opt-out: the presence monitor is always on except when you set DISABLE_PRESENCE_MONITOR=yes.
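
In other words, the check behaves roughly like this (my paraphrase of the opt-out logic, not the literal code from #12353):

// paraphrase of the opt-out behaviour, not the literal implementation from #12353
UserPresence.start();

const disabled = ['true', 'yes'].includes(String(process.env.DISABLE_PRESENCE_MONITOR).toLowerCase());
if (!disabled) {
  UserPresenceMonitor.start(); // runs unless you explicitly opt out
}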

However, we are not using it anymore. We are on v3.0.4 at the moment which works without any problems.

@ankar84

ankar84 commented May 19, 2020

as the name of the environment variable DISABLE_PRESENCE_MONITOR indicates, if you set it to true or yes, the presence monitor is disabled; otherwise it is enabled. So it works like opt-out: the presence monitor is always on except when you set DISABLE_PRESENCE_MONITOR=yes.

I get it! Your patch was: if USER_PRESENCE_MONITOR is set, then start; but Diego implemented it the opposite way: if DISABLE_PRESENCE_MONITOR is not true or yes, then start it.

However, we are not using it anymore. We are on v3.0.4 at the moment which works without any problems.

Now we are on 3.1.1 and we sometimes experience performance issues, so I configured only 1 of our 20 instances to run the UserPresenceMonitor. I don't see any problems with how the presence status system works now. And thanks for your answer!

@ramrami
Contributor

ramrami commented Nov 5, 2020

Hi, we're trying to support 4k active users with Rocket.Chat, but we are unable to go above 1k for now.

We are using Rocket.Chat v3.6.3,
10 instances (2 CPU & 2 GB RAM each) on AWS Fargate, and
a 3-node MongoDB v4.2 cluster (8 vCPU & 32 GB RAM & 16000 max connections) on Atlas; we use retryWrites=true&w=majority&poolSize=75 in the connection string.

We are using Selenium with headless Chrome in the cloud to perform the load test.
All users are connected to the same public channel and wait a random amount of time before sending a text message.
We tried with:
up to 10 min: const time = Math.floor(Math.random() * 10 * 60) * 1000;
and up to 30 min: const time = Math.floor(Math.random() * 30 * 60) * 1000;
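
Each simulated user essentially does the following (simplified; sendMessage() here is a placeholder for the Selenium steps that type and submit the message):

// simplified load-test client: wait a random delay, then post one message
// sendMessage() is a placeholder for our Selenium "type and submit" steps
const time = Math.floor(Math.random() * 10 * 60) * 1000; // random delay of up to 10 minutes, in ms
setTimeout(() => sendMessage('load test message'), time);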

Yesterday we tried with 2370 users and the chat was unusable: I could not send messages (they stay grey and no sendMessage REST request is sent), and if I reload the page I can access the channel but the message loader spins forever.

The problem is that our monitoring does not show any big CPU load: the app instances are at ~50% CPU max and the DB is at ~40% CPU, so we're at a loss here.

We first discovered that having the Unread_Count setting set to all_messages is a big no for large channels; it was generating a lot of oplog updates on the subscription collection and was slowing down the app.

We also see a lot of this in our instance logs:
Mongodb Exception in setInterval callback: SwitchedToQuery TIMEOUT QUERY OPERATION

We would appreciate any additional hints from the experts in this thread.
@rodrigok @AmShaegar13 @magicbelette @ankar84

@rodrigok
Member

@ramrami To start supporting more users, you will need to disable notifications and presence.

@codeneno

codeneno commented Jan 9, 2021

Still happening here on 3.9.4 and 3.10.

@NgocTH

NgocTH commented Mar 24, 2021

@AmShaegar13 @magicbelette @ankar84
Hello, we are currently having an issue where the CPU of the servers and the DB rises to 100% when many users disconnect and reconnect.
Currently, with 1000 users online, we deployed (which restarts all servers); when each server starts again, users access it again, causing its CPU to rise to 100%, and it doesn't go down. We opened the dev tools in Chrome and saw that the 'stream-notify-logged' stream name and 'user-status' event name showed up a lot. Maybe this is the reason why the client works slowly.

Server Setup Information:
Version of Rocket.Chat Server: 3.5.4
Operating System: Centos7
Deployment Method: docker-compose
Number of Running Instances: 17 instances on 17 servers
DB Replicaset Oplog: On
NodeJS Version: 12.16.1
MongoDB Version: 4.2.9

Current behavior:

  • Case 1: When the service is normal, users suddenly can't use the service (e.g. send messages), and the CPU of the servers and the DB rises high.
  • Case 2: When deploying with 1000 active users (online + away), the CPU of the servers and the DB rises to 100%.
    (Our deploy process:
  • Step 1: Build the docker image and push it to the Azure docker registry.
  • Step 2: Exclude the servers from the app gateway.
  • Step 3: Access each server and restart it via docker-compose (manually).)

After investigating, we noticed that the user status is updated multiple times (as in the screenshot below) during deployment and also during normal use of the service.
[screenshot: the same user's status logged many times]

It would be really helpful if we could get some advice on how to solve this problem.
Thank you!

@ankar84

ankar84 commented Mar 24, 2021

@NgocTH Hi!
Please check these issues: #1338 #21182
And this thread on the open server: https://open.rocket.chat/channel/support?msg=3zoNiPWAQLyeNkvzR
