Node slamming CPU since update to 3.7 #19082
Comments
@HiveMindNet Which version of RC were you upgrading from? Also, which version of MongoDB before and after?
Thanks for asking 👍 I came from 3.6.3 (I'm always up to date) and I didn't upgrade the Mongo docker (which is and was db version v4.0.20) - could that be the issue? Both run in individual dockers, if that matters. Thanks for any guidance :)
I updated from 3.6.3 too, a manual installation. After upgrading to 3.7.0, I hit the same problem. My node is 12.14.0, npm is 6.13.4 and my mongo is 4.0.2.
Has this happened with any previous upgrade? Did the number of users connected change after the upgrade? Does there seem to be any correlating factor between behaviour and the CPU spikes?
I think the answer to all your questions is no. I haven't seen this before, and there are only 4 users on my server. With previous upgrades the CPU usage would drop to about 1%-3% within 2 minutes, but this time it stays at almost 100% all the time.
One more question: can I downgrade to 3.6.3 by following the update steps in doc.rocket.chat? Is there any problem with that?
Thanks everyone, all data points help. Is anyone not running this on a time-shared cloud VPS (i.e. on your own dedicated machines)?
We're on our own AWS EC2 (m5.xlarge), just running RC within Docker for ease. Nothing we can't look at or adjust on the main infrastructure if needed. Let me know if you want me to look at anything. We're running Sematext to monitor everything across the containers and on the main EC2 too, including logs and events.
@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?
@HiveMindNet is Jitsi something you can turn the integration off for a bit and see if anything changes? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is the same too. But anything we can use to narrow in on the cause would be awesome.
Hi @Sing-Li, I have 3 servers with CentOS7 and 4 RC docker containers each - 12 instances in total.
You actually point to a PR I hadn't thought about... @HiveMindNet, would you be able to try adding USE_NATIVE_OPLOG=true? I'm curious whether the problem is in the new change stream processing, as this is a new bit of processing.
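For anyone else wanting to try the same thing: the variable only has to reach the Rocket.Chat process as an environment variable. A minimal sketch for a plain Docker deployment follows; the container name, image tag, Mongo URLs and port are placeholders for a typical setup, not values taken from this thread.

```sh
# Recreate the Rocket.Chat container with the extra variable.
# All names, URLs and the image tag below are examples; keep whatever your deployment already uses.
docker stop rocketchat && docker rm rocketchat
docker run -d --name rocketchat \
  -e MONGO_URL=mongodb://mongo:27017/rocketchat \
  -e MONGO_OPLOG_URL=mongodb://mongo:27017/local \
  -e USE_NATIVE_OPLOG=true \
  -p 3000:3000 \
  rocket.chat:3.7.0
```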
I have just tried turning off the Jitsi integration and stopping the Jitsi servers, and it made no difference. Worth a try though - so that's ruled out.
Jackpot! You can see the drop when I restart the Docker container on the graph below - the CPU drops back to normal. Is it safe for me to leave USE_NATIVE_OPLOG=true on a production environment while you find a fix?
I can confirm USE_NATIVE_OPLOG=true fixed the high CPU issue.
Same thing on my instance; adding "USE_NATIVE_OPLOG=true" to the systemd service file fixed the issue.
@mranderson56 Where should I write this?
@wolfcreative In the service file (/lib/systemd/system/rocketchat.service), in the Environment parameter.
In your systemd file (on Debian): /etc/systemd/system/multi-user.target.wants/rocketchat.service
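To make that concrete, here is a sketch of the relevant part of such a unit file. The path and the existing values on the Environment line are assumptions for a typical manual install; only the appended USE_NATIVE_OPLOG=true comes from this thread.

```ini
# /lib/systemd/system/rocketchat.service (path varies per distro/install)
[Service]
# Keep whatever Environment values your install already has
# (MONGO_URL, MONGO_OPLOG_URL, ROOT_URL, PORT below are placeholders)
# and simply append USE_NATIVE_OPLOG=true at the end of the line.
Environment=MONGO_URL=mongodb://localhost:27017/rocketchat MONGO_OPLOG_URL=mongodb://localhost:27017/local ROOT_URL=http://localhost:3000 PORT=3000 USE_NATIVE_OPLOG=true
```

After editing, run `systemctl daemon-reload` and then `systemctl restart rocketchat` so the new environment takes effect.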
Well, the CPU load dropped to 0.7-1.5%.
Can any of the main devs please let me know: is it safe for me to leave USE_NATIVE_OPLOG=true on a production environment while you find a fix? @geekgonecrazy
@HiveMindNet it's safe to keep.
Thank you :)
@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where Rocket.Chat prints the MongoDB version, engine, etc.?
@rodrigok Someone in the forums posted some info: https://forums.rocket.chat/t/rocketchat-3-7-0-high-cpu-usage/8715
I solved it by entering this in my docker-compose file: volumes:
I had the same problem, and setting USE_NATIVE_OPLOG=true fixed it.
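For docker-compose deployments (as mentioned a couple of comments above), the variable goes under the service's environment key. A minimal sketch, assuming a typical compose file; the service name, image tag and Mongo URLs are placeholders, not values from this thread.

```yaml
# docker-compose.yml fragment; service name, image tag and URLs are examples only
services:
  rocketchat:
    image: rocket.chat:3.7.0
    environment:
      - MONGO_URL=mongodb://mongo:27017/rocketchat
      - MONGO_OPLOG_URL=mongodb://mongo:27017/local
      - USE_NATIVE_OPLOG=true
```

Then recreate the container with `docker-compose up -d` so the new variable is picked up.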
Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I only stumbled upon the changed load by chance! EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.
Sorry, please disregard this; I found a missing index in our MongoDB that caused this issue.
Come now, don't leave everybody hanging. What exactly did you do to fix this? 🤔
Sorry, I think it is something unrelated to this issue, but I don't want to leave you hanging ;)
Sorry @rodrigok, we're still early in adoption, so although we have over 1000 users, only around 20-30 are concurrently active at any one time.
We have the same problem with our cluster. The workaround with USE_NATIVE_OPLOG=true works for us as well.
Will this fix get merged into the 3.7.1 release?
Unlike the majority of users here, I actually want to test change streams in a test environment.
How can I force-enable change streams in my deployment?
Thank you!
Folks, we released version 3.7.1 with a fix for this situation. Can you all please try the new version without the USE_NATIVE_OPLOG variable? Thanks
We have no issues with 3.7.1, MongoDB 3.6 running WiredTiger and no USE_NATIVE_OPLOG variable present. No experience with 3.7.0, so I cannot confirm whether the issue existed for us on that.
@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true, and at the moment it looks good.
Does this only fix the "mmapv1 database engine or when no admin access was granted to the database user" case?
No, it didn't.
It still wastes too much CPU; mongodb sits at 100%.
Because this issue discusses the effect of the oplog method, I'll add this here. Here is a weekly CPU graph of a single-server RC VMware instance with SSD disks, 8 cores and 38 GB of memory, running MongoDB 4.0 (WiredTiger) and 20 nodes on the same box. It should give some comparison of the idle load that different versions and oplog methods produce. 100% on the graph means all 8 cores running at maximum load.

Timeline: before the evening of 1 March the box was still on 3.9.7; in the evening of 1 March it was upgraded from 3.9.7 to 3.10.6 -> 3.11.2, and on the afternoon of 2 March finally to 3.12.0; in the evening of 2 March it was switched to the native oplog with USE_NATIVE_OPLOG=true.

The server is mostly idling the whole week, so daytime active use adds no considerable load compared to the background load, especially because the background load is so high; we can ignore the daytime active loads when looking at the graph. All in all, just the idle load increase from RC 3.9.7 to 3.12.0 with this number of nodes is a staggering 15%! Switching to the native oplog decreased the load by 6%. It does not show in the graph, but this also considerably decreased the load on the mongodb process on the server and moved it towards the application nodes themselves. In short: much lighter on the DB, but a bit heavier on the nodes. Because of this, if you are clustering application nodes on different servers, the native oplog seems to be the way to spread the load to those additional servers instead of having them hit a single database instance really hard.

We've also tested running with as few application nodes as possible in real-life scenarios with 300+ users online. It seems that when the underlying server has a fast enough disk, enough memory and at least two cores, even a single app node can handle those 300+ active users without any issues, as long as nothing acts up or bugs out really badly. With as few nodes as possible, the additional load created by oplog operations stays at a minimum, which saves a lot of CPU cycles in mongodb (no matter which oplog method you use, but the drop is relatively larger if you had been using the mongodb oplog stream instead of the native oplog).

We read in 2017 that the optimal users-per-application-node ratio would be approx. 50 per node, and at least that does not seem to be true anymore. We estimate that one node could manage at least 500 users without issues, as long as the underlying hardware is not the cheapest kind you get from the cloud. The one-core-per-node rule also does not seem to hold up very well with good gear; the loads of the individual node processes are pretty low even with just 8 cores for 20 nodes plus the database.
For the record, some of the extra load triggered by RC 3.12.0 eventually ended three weeks later. This makes me think it was an RC background process doing something after the upgrade. The update to 3.12.3, along with restarts of all nodes, also did not trigger it again. I'm also considering some kind of race condition in MongoDB or between the RC nodes themselves.
@Gummikavalier thank you for the very detailed information. It seems to be an edge case that we have not been able to reproduce here yet. We will put a new round of focus on this soon and I hope we can improve the idle CPU usage further.
Since updating to 3.7 from the previous version, the CPU utilisation has jumped from less than 5% to a constant, very high average.
Description:
Node running away with CPU, and the MongoDB docker is also high
Steps to reproduce:
Expected behavior:
normal behaviour / ~3% CPU
Actual behavior:
high CPU
Server Setup Information:
Client Setup Information:
N/A
Additional context:
Relevant logs: