Rocket.Chat stops working with 1000 active users #11288
Comments
Do you have Apache2 or Nginx for the frontend? What about system usage (RAM, CPU, network, FS) on the machines of the cluster? Cheers |
Sounds like you reached the maximum MongoDB connections (1024 by default on Linux, as far as I know). |
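If you want to check whether that limit is actually being hit, a quick look at serverStatus() in the mongo shell shows current vs. available connections. This is just a suggestion from the editor, not from the thread; the exact ceiling also depends on ulimit and the server's net.maxIncomingConnections setting.

```js
// Run in the mongo shell: compare current connections against what the
// server still has available before the limit mentioned above is reached.
var conns = db.serverStatus().connections;
printjson(conns); // e.g. { "current": 980, "available": 44, "totalCreated": 123456 }
```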
@AmShaegar13 You could try using nginx instead of Apache, and as suggested check your mongo settings? |
hi @kaiiiiiiiii,
So this shouldn't be a thing… Best, |
Did you check the Apache2 log for "MaxClients reached"? |
Yes, @magicbelette, thanks for the hint. We configured MaxClients to 1500 per node and we're far from reaching that. |
@qchn Thanks. ;) @magicbelette Yup. no errors regarding max clients. At most, rare proxy connection timeouts (about once an hour). @vynmera Trying is nothing I can easily do. This requires another downtime for our users. Additionally, I don't really suspect Apache to be the problem here. Node is causing the CPU load and HTTP is doing fine. I can load all scripts and assets just fine. Just the websocket never finishes receiving those collection update messages. |
Sounds like a job for exponential back-off on the client side, after say 2-3 failed web socket reconnects. |
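For illustration, a minimal sketch of that kind of client-side back-off for a generic websocket client; this is not Rocket.Chat's actual reconnect code, and the thresholds and URL are made up.

```js
// Reconnect with exponential back-off after a few straight failures,
// instead of hammering the server in a tight reconnect loop.
function connectWithBackoff(url, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => { attempt = 0; };  // a healthy connection resets the counter
  ws.onclose = () => {
    // First 3 attempts retry quickly, then double the delay, capped at 60s.
    const delay = attempt < 3 ? 1000 : Math.min(1000 * 2 ** (attempt - 2), 60000);
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };
  return ws;
}

connectWithBackoff('wss://chat.example.com/websocket');
```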
Hello, you can try haproxy and use forever-service to run the nodes: haproxy -> n nodes -> 1 MongoDB server. |
@dmoldovan123 As already mentioned in my reply to vynmera, I can't just try various things. I have to maintain a stable service for 1000 active users. So if you could give me a hint why haproxy with forever-service would be better than Apache with pm2, I would be really grateful. That would give me something to justify breaking the service (again) on purpose. The thing is, I do not see a different proxy or service manager reducing the status-changed messages over the websocket. |
Don't think that's the best idea ever, but you can easily test without pm2, directly with systemd. I don't really know pm2, but the fact is that you add another layer and potentially a bottleneck. Another thing, from my experience: be careful with your Apache2 config... My instance was incredibly slow (3 seconds to load each avatar). My Apache2 used mpm_prefork with a dumb copy/paste (MaxRequestsPerChild 1). The servers were consuming a lot of resources forking new processes, with a bad user experience, but there was no system load. Took me a couple of days to figure it out :/ |
I am using pm2 in fork mode so no extra layer should be present. 3 instances of Rocket.Chat are running. Each with its own port. Cluster mode did not work for some reason. |
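For reference, a pm2 ecosystem file for that kind of setup could look roughly like the sketch below; the paths, ports, hostnames and INSTANCE_IP value are placeholders, not taken from this thread.

```js
// ecosystem.config.js: three Rocket.Chat processes in fork mode,
// each on its own dedicated port behind the reverse proxy.
module.exports = {
  apps: [3001, 3002, 3003].map((port) => ({
    name: `rocketchat-${port}`,
    script: '/opt/Rocket.Chat/main.js',
    exec_mode: 'fork',                       // cluster mode does not work for Rocket.Chat
    env: {
      PORT: port,
      ROOT_URL: 'https://chat.example.com',
      MONGO_URL: 'mongodb://mongo1,mongo2,mongo3/rocketchat?replicaSet=rs01',
      MONGO_OPLOG_URL: 'mongodb://mongo1,mongo2,mongo3/local?replicaSet=rs01',
      INSTANCE_IP: '10.0.0.11',              // lets the instances find each other
    },
  })),
};
```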
https://rocket.chat/docs/installation/manual-installation/multiple-instances-to-improve-performance/ |
@dmoldovan123 This is what we already do. As you can see in the issue description, I am running 8 servers with 3 instances each to utilize CPU cores behind a reverse proxy. I don't see how another proxy would impact the CPU load of the node processes. We are using MongoDB with a replicaSet and instances can communicate with each other because I set
I am pretty sure this issue is related to this one in the user-presence library Rocket.Chat uses as well. @rodrigok @sampaiodiego Can one of you confirm this? |
I think konecty/meteor-user-presence#17 is the best lead, but did you check your database engine? |
Thanks for all of your suggestions. We could now prove that the UserPresenceMonitor was responsible for the denial of service we faced. We disabled it on all but two separate instances and can restart the cluster now without causing tons of status updates. We did so by patching the source and setting the USER_PRESENCE_MONITOR environment variable:

```diff
--- rocket.chat/programs/server/app/app.js	2018-07-04 18:07:36.917547890 +0200
+++ app.js	2018-07-04 18:10:12.273401726 +0200
@@ -7753,7 +7753,10 @@
 	InstanceStatus.registerInstance('rocket.chat', instance);
 	UserPresence.start();
-	return UserPresenceMonitor.start();
+
+	if (process.env['USER_PRESENCE_MONITOR']) {
+		return UserPresenceMonitor.start();
+	}
 });
 /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
```
I would still like to have an official fix for this rather than patching the source with every update. @magicbelette We still use the slow database engine, but we do not observe high CPU load or memory usage on the database servers yet. |
I am also surprised that user presence hasn't caused more issues for more people. I raised the issue a while back. Another question is whether pm2 is handling sessions properly. Meteor uses sticky sessions, and if they are not handled properly your servers may be doing a lot of extra work constantly logging in new users. I'd look at adding Kadira (Meteor APM) to check app performance. NodeChef offers a solution for 50 dollars per month (as does Meteor Galaxy hosting, but then you have to use their hosting service, which is pricey). |
I'm not sure, but I have the feeling that this patch causes users to appear offline to some others. As a consequence, users receive email notifications even if they are online. @AmShaegar13 did you notice this, or am I totally wrong? |
I think you are not. Usually the status appears to be correct, but I have already had complaints about unnecessary emails. So yes, there is still something wrong with it and I am still hoping this will be fixed. But for now, we can at least use the chat again. Currently 1177 users at its peak. |
The same appears to happen the other way round. No email notification although the user is offline in all clients. |
This is becoming a major annoyance. More and more users complain about broken notifications. This issue is a major drawback for acceptance in our company. |
How many instances are you distributing the load across? |
We are running 6 servers with 3 instances each without UserPresenceMonitor (see the patch above) behind a load balancer. Additionally, we run 2 servers with 1 instance each with UserPresenceMonitor not balanced, so no users can reach them. Those two servers are dedicated to running the UserPresenceMonitor. This setup keeps the cluster at least stable but causes the aforementioned problems with notifications. |
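With the patch above, the only difference between the two node groups is whether USER_PRESENCE_MONITOR is set. A minimal sketch of the per-group pm2 environment; hostnames, paths and ports are placeholders, not taken from this thread.

```js
// Sketch: same app definition everywhere, only the env differs per node group.
const base = {
  name: 'rocketchat',
  script: '/opt/Rocket.Chat/main.js',
  exec_mode: 'fork',
};

// On the 6 load-balanced servers (variable unset, UserPresenceMonitor.start() is skipped):
module.exports = { apps: [{ ...base, env: { PORT: 3001 } }] };

// On the 2 dedicated servers, which are kept out of the load balancer:
// module.exports = { apps: [{ ...base, env: { PORT: 3001, USER_PRESENCE_MONITOR: '1' } }] };
```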
Just wanted to follow up here. We are working through another case like this. So this is definitely on our radar. |
We were having pretty much identical symptoms. CPU pegged at 100%. Packet storm. We implemented @AmShaegar13’s July 5 patch and (combined with splitting the servers into two Auto-Scaling Groups with different environment variables set) it solved that problem. We then noticed that we were experiencing some of the side-effects mentioned above, including users being marked as Away even when actively using the app. User activity would mark the user as Online for a split-second but then the user would return to Away. I was concerned that this fix completely broke the User Presence system, but the almost-immediately-Away problem turned out to be something much simpler. Not a runtime failure, but just a configuration bug. As discussed in Issue #11309 (comment), releases 0.64.0 and 0.66.0 changed the semantics of the "Idle Time Limit" user config setting, changing the units of the idle timeout from milliseconds to seconds. I don't know if the migrations were broken, or ran twice, or something else, but the end result is that the 300-second idle timeout somehow became 0.3 seconds! Point being, be aware that there are multiple issues to manage here. |
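If you suspect the same unit bug, it can be checked directly in MongoDB. This is only a sketch and assumes the setting is stored in rocketchat_settings under the _id Idle_Time_Limit; verify the key in your own database before changing anything.

```js
// Run in the mongo shell. Since the change described above the value is
// interpreted as seconds, so something around 300 is expected; 0.3 (or a
// leftover millisecond value like 300000) points at a broken migration.
db.rocketchat_settings.findOne({ _id: 'Idle_Time_Limit' }, { value: 1 });
```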
We see similar issues with CPU related to user status when doing blue/green deploys. It seems to be partly related to the activeUsers publication. When a server goes offline, all the client connections for that server are removed from the db by other online servers or by the next server to come online. When the clients reconnect to the new servers, they create new client connections. Both of these trigger the user records to get updated, status offline then online. Since the activeUsers publication notifies each client about changes to active users, that could be the number of active users x2 records sent to each client to process. This causes the clients to fall behind in processing user statuses. It also seems to have a snowball effect, because each client will try to report the user's status multiple times as it struggles to sync user statuses. You can see the flood of user status updates using Chrome dev tools by monitoring the WebSocket frames when restarting the server. |
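A rough way to quantify that flood without clicking through dev tools is to hook the page's WebSocket constructor and count DDP user-status frames. This is only a heuristic sketch (string matching instead of real DDP parsing); it only catches connections opened after it runs, e.g. the reconnect that happens when you restart the server as described above.

```js
// Paste into the browser console before triggering the reconnect.
(() => {
  const NativeWebSocket = window.WebSocket;
  let statusFrames = 0;
  window.WebSocket = function (...args) {
    const ws = new NativeWebSocket(...args);
    ws.addEventListener('message', (event) => {
      if (typeof event.data === 'string' &&
          event.data.includes('"msg":"changed"') &&
          event.data.includes('"status"')) {
        statusFrames += 1;
        if (statusFrames % 100 === 0) {
          console.log('user-status frames seen:', statusFrames);
        }
      }
    });
    return ws;
  };
  window.WebSocket.prototype = NativeWebSocket.prototype;
})();
```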
Question for @geekgonecrazy: |
No. MongoDB is pretty stable and can even handle high load. Also, our MongoDBs are on separate hosts. |
You can try nginx with HTTP/2. Could you tell me: does pm2 work? |
No, I cannot just try things. I have 2000 users working from home because of COVID-19 relying on a stable service. |
Does pm2 cluster mode work, or Docker Swarm? I think we can try HTTP/2 or HTTP/3. |
What is the gateway's function? @nmagedman @AmShaegar13 |
The gateway is used for push notifications from your Rocket.Chat instance to Android/iOS. pm2 in cluster mode does not work. Every Rocket.Chat instance needs its own dedicated port. Docker should be fine as long as you pass the instance IP as the INSTANCE_IP environment variable to Rocket.Chat. |
@AmShaegar13, thank you. Does it work fine with multiple instances, or are there some bugs? |
Multiple instances work if you correctly set the INSTANCE_IP environment variable. |
@AmShaegar13 thanks. Somebody said their multiple instances resolved the performance problem, but that there's a bug with notifications. Does this still exist? |
As I said. The issue has been fixed. |
@AmShaegar13 Thanks very much. Jesus bless you. |
Thanks @AmShaegar13 for all your support.. nice to see your numbers and what you've accomplished.. if you get a chance, try upgrading to 3.0.x, as we removed a lot of Meteor stuff so we can be more scalable. @564064202 if you have a different issue, please open a new issue. |
@sampaiodiego Thanks to COVID-19 we had a lot of trouble scaling beyond 1650 active users. Thanks to v3.0.4 we are now at 2250 users max per day. Thank you for further improving on this issue. |
7K users on 55 nodes, but still on Rocket.Chat 2.x. My problem upgrading to Rocket.Chat 3.x is the number of nodes. It seems that MongoDB doesn't support the broadcasting of a large number of new instances at the same time when restarting Node.js. Can I restart only a few nodes at a time? That implies some Rocket.Chat instances running version 3.x and others running 2.x. Cheers |
Yes you can @magicbelette.. it is not always recommended, as at some point version schemas might be incompatible, but you usually can do that. We actually use a rolling upgrade strategy on k8s that does exactly the same. 😉 |
@AmShaegar13 Could you give me a hand? docker-compose with multiple instances at the same time hits a bug: 'MongoError: ns not found', 'Errors like this can cause oplog processing errors.' |
@sampaiodiego Could you give me a hand? docker-compose with multiple instances at the same time hits a bug: 'MongoError: ns not found', 'Errors like this can cause oplog processing errors.' |
|
@564064202 for 5k online users 8 instances should be enough, but to help you correctly we would need to understand all the aspects of your installation, usage, etc. If you have any kind of support contract we can do it quicker; without it you may need to wait for answers here when we have time, or for help from the community. Some basic advice:
|
@rodrigok @AmShaegar13 |
Hi, @AmShaegar13
Is it true that in #12353 the whole logic is reversed, compared to that patch? |
Hi @ankar84, However, we are not using it anymore. We are on v3.0.4 at the moment which works without any problems. |
I get it! Your patch was: if USER_PRESENCE_MONITOR is set, then start it; but Diego implemented it the opposite way: if DISABLE_PRESENCE_MONITOR is not true or yes, then start it (sketched below).
Now we are on 3.1.1 and sometimes we experience performance issues, so I configured only 1 of our 20 instances to have the presence monitor started. I do not see any problems with how the presence status system works now. And thanks for your answer! |
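To make the reversal described above concrete, a simplified side-by-side sketch (not the actual Rocket.Chat source):

```js
// Patch from this thread: opt-in, the monitor only starts where the
// variable is explicitly set.
if (process.env.USER_PRESENCE_MONITOR) {
  UserPresenceMonitor.start();
}

// #12353 as merged: opt-out, the monitor starts everywhere unless it is
// explicitly disabled with the flag named above.
if (!['true', 'yes'].includes(String(process.env.DISABLE_PRESENCE_MONITOR).toLowerCase())) {
  UserPresenceMonitor.start();
}
```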
Hi, we're trying to support 4k active users with Rocket.Chat, but we are unable to go above 1k for now. We are using Rocket.Chat v3.6.3. We are using Selenium with headless Chrome on the cloud to perform the load test. Yesterday we tried with 2370 users and the chat was unusable; I could not send messages (they stay grey and no REST request
The problem is that our monitoring does not show any big CPU load, the app instances are at ~50% CPU max and the DB is at ~40% CPU, so we're at a loss here. We first discovered that having the setting
We also have a lot of this in our instance logs:
We would appreciate any additional hints from the experts in this thread. |
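For context, the kind of headless-Chrome load generator described here could look roughly like this with selenium-webdriver for Node; the URL, the ramp-up delay and the login flow are placeholders, not details from this thread.

```js
const { Builder } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function startClient(i) {
  const options = new chrome.Options().addArguments('--headless', '--disable-gpu');
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();
  await driver.get('https://chat.example.com');
  // ...log in as load-test user number `i` and start posting messages...
  return driver;
}

(async () => {
  const drivers = [];
  for (let i = 0; i < 2370; i++) {
    drivers.push(await startClient(i));
    await new Promise((resolve) => setTimeout(resolve, 500)); // ramp up gradually
  }
})().catch(console.error);
```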
@ramrami to start supporting more users you will need to disable notifications and presence. |
Still here on 3.9.4 and 3.10. |
@AmShaegar13 @magicbelette @ankar84
Server Setup Information:
Current behavior:
After investigating, we noticed that the user status update is triggered multiple times (like the image below), both during deployment and during normal use of the service. It would be really helpful if we could get some advice about how to solve this problem. |
@NgocTH Hi! |
Description:
For us, Rocket.Chat does not work with more than 1000 active users. Rebooting a server, restarting Apache or restarting Rocket.Chat after an update causes all clients to face serious issues connecting to the chat.
Steps to reproduce:
Expected behavior:
Clients can reconnect to the chat.
Actual behavior:
While reconnecting, the server sends an enormous amount of the following messages over the websocket:
This continues until the server closes the websocket. I assume this is due to the lack of ping/pong messages during this time. The client instantly requests a new websocket, starting the whole thing over and over again.
The only effective way to get the cluster up and working again is to force-logout all users by deleting their loginTokens from MongoDB directly.
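For reference, that force-logout can be done with a single update against the users collection. This is a sketch that assumes Meteor's default resume tokens live under services.resume.loginTokens; it logs out every user, so use it with care.

```js
// Run in the mongo shell against the Rocket.Chat database.
db.users.updateMany(
  {},
  { $unset: { 'services.resume.loginTokens': '' } }
);
```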
Server Setup Information:

Additional context
The high number of instances we operate is a direct result of this issue. When we first ran into it with about 700 users, we assumed we might need to scale the cluster accordingly, but we are not willing to add another server to the cluster for every 40 new users. We planned to support around 8000 users, approximately half of them active.
For now, we do not allow mobile clients. We would really love to do so, but with the current state of the cluster this won't happen soon.