
Node slamming CPU since update to 3.7 #19082

Closed
HiveMindNet opened this issue Sep 30, 2020 · 50 comments · Fixed by #19181

Comments

@HiveMindNet

HiveMindNet commented Sep 30, 2020

Since updating to 3.7 from the previous version, CPU utilisation has jumped from less than 5% to a constantly very high average.

Description:

Node is running away with the CPU, and the MongoDB Docker container is also high.

Screenshot 2020-09-30 at 14 36 09

Screenshot 2020-09-30 at 14 38 04

Screenshot 2020-09-30 at 14 38 15

Steps to reproduce:

  1. Restart / start Rocket.Chat

Expected behavior:

normal behaviour / ~3% CPU

Actual behavior:

high CPU

Server Setup Information:

  • Version of Rocket.Chat Server:
  • Operating System:
  • Deployment Method: latest docker
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

N/A

Additional context

Relevant logs:

@Sing-Li
Member

Sing-Li commented Sep 30, 2020

@HiveMindNet Which version of RC were you upgrading from? Also which version of MongoDB before and after?

@HiveMindNet
Author

HiveMindNet commented Sep 30, 2020

@HiveMindNet Which version of RC were you upgrading from? Also which version of MongoDB before and after?

Thanks for asking 👍

I came from 3.6.3 (I'm always up to date) and I didn't upgrade the Mongo docker (which is and was DB version v4.0.20) - could that be the issue?

Both run in individual dockers if that matters.

Thanks for any guidance :)

@xzyyyy

xzyyyy commented Sep 30, 2020

I updated from 3.6.3 too, a manual installation. After upgrading to 3.7.0, I hit the same problem. My Node is 12.14.0, npm is 6.13.4, and my MongoDB is 4.0.2.
Thanks for any suggestion.

@geekgonecrazy
Contributor

geekgonecrazy commented Sep 30, 2020

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

@xzyyyy

xzyyyy commented Sep 30, 2020

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

I think the answer to all of your questions is no. I haven't seen this before, and there are only 4 users on my server. With previous upgrades, CPU usage would drop to about 1%-3% within 2 minutes, but this time it stays at almost 100% all the time.
Thanks.

@xzyyyy

xzyyyy commented Sep 30, 2020

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

One more question: can I downgrade to 3.6.3 by following the update steps in doc.rocket.chat? Would there be any problem with that?
Thanks

@kgdemir

kgdemir commented Sep 30, 2020

I have just upgraded from 3.6.2 and I can confirm this issue. I will wait until morning; if it does not resolve itself, I will do a restore.

node -v: v12.14.0

2020-09-30_19-59-39
2020-09-30_20-00-47
2020-09-30_20-02-27

@wolfcreative

wolfcreative commented Sep 30, 2020

I have the same problem after upgrading from 3.6.3. Before the update the load was 3-4%.
node: 12.14.0
mongoDB: 4.0.20
OS: Ubuntu 20.04.1 LTS

image

@HiveMindNet
Author

HiveMindNet commented Sep 30, 2020

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries that I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

@Sing-Li
Member

Sing-Li commented Sep 30, 2020

Thanks everyone, all data points help.

Anyone not running this on time-shared cloud vps? (on your own dedicated machines?)

@HiveMindNet
Author

HiveMindNet commented Sep 30, 2020

Thanks everyone, all data points help.

Anyone not running this on time-shared cloud vps? (on your own dedicated machines?)

We're on our own AWS EC2 (m5.xlarge), just running RC within docker for ease. Nothing we can't look at or adjust if needed on the main infrastructure. Let me know if you want me to look at anything. We're running sematext for monitoring everything across containers and on the main EC2 too, including logs and events.

@xzyyyy

xzyyyy commented Sep 30, 2020

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries that I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

I don't receive any errors; there are no errors, warnings, or anything else in my log, just like before, so I don't think it's a Jitsi problem.
Thanks.

@Sing-Li
Member

Sing-Li commented Sep 30, 2020

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

@geekgonecrazy
Contributor

geekgonecrazy commented Sep 30, 2020

@HiveMindNet is the Jitsi integration something you can turn off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is also the same.

But anything we can use to narrow in on the cause would be awesome.

@ankar84

ankar84 commented Oct 1, 2020

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

Hi, @Sing-Li
Yes. I have upgraded one of my test environments from 3.6.3 to 3.7.0 and everything looks smooth and nice.
I have only a few users in that test deployment (but @xzyyyy has trouble with 4 users here).

I have 3 servers with CentOS 7 and 4 RC Docker containers each - 12 instances.
Each server has one MongoDB container, version 4.0.20.
The storage engine is wiredTiger because of the use of change streams (#18892).
Here are htop screenshots from all 3 servers:
image
image
image

@geekgonecrazy
Contributor

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding: USE_NATIVE_OPLOG=true ? I’m curious if it’s in the new change stream processing. As this is a new bit of processing
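
For a Docker-based deployment like the ones described in this thread, USE_NATIVE_OPLOG is just an ordinary environment variable passed to the Rocket.Chat container. A minimal sketch; the container name, image tag, and Mongo URLs are placeholders to adapt to your own setup:

    # sketch only: container name, image tag, and Mongo URLs are placeholders
    docker run -d --name rocketchat \
      -e USE_NATIVE_OPLOG=true \
      -e MONGO_URL=mongodb://mongo:27017/rocketchat \
      -e MONGO_OPLOG_URL=mongodb://mongo:27017/local \
      -p 3000:3000 \
      rocketchat/rocket.chat:3.7.0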

@HiveMindNet
Author

@HiveMindNet is the Jitsi integration something you can turn off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is also the same.

But anything we can use to narrow in on the cause would be awesome.

I have just tried turning off the Jitsi integration and stopping the Jitsi servers, and it made no difference. Worth a try though - so that's ruled out.

@HiveMindNet
Author

HiveMindNet commented Oct 1, 2020

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding: USE_NATIVE_OPLOG=true ? I’m curious if it’s in the new change stream processing. As this is a new bit of processing

Jackpot! You can see the drop on the graph below when I restart the Docker container - the CPU drops back to normal.

Is it safe for me to leave USE_NATIVE_OPLOG=true on a production environment while you find a fix?

Screenshot 2020-10-01 at 08 03 23

Screenshot 2020-10-01 at 08 03 35

@kgdemir

kgdemir commented Oct 1, 2020

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding: USE_NATIVE_OPLOG=true ? I’m curious if it’s in the new change stream processing. As this is a new bit of processing

I can confirm that USE_NATIVE_OPLOG=true fixed the high CPU issue.

@mranderson56

mranderson56 commented Oct 1, 2020

Same thing on my instance; adding "USE_NATIVE_OPLOG=true" to the systemd service file fixed the issue.

@wolfcreative

@mranderson56 Where should I write this?

@Stoobitweet

@wolfcreative In the service file (/lib/systemd/system/rocketchat.service), in the Environment parameter.

@mranderson56

mranderson56 commented Oct 1, 2020

In your systemd file (on Debian): /etc/systemd/system/multi-user.target.wants/rocketchat.service
You can add this to the "Environment" line.
Then: systemctl daemon-reload
And restart Rocket.Chat.
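
For reference, a minimal sketch of what the change could look like in the unit file; existing entries (MONGO_URL, ROOT_URL, PORT, etc.) stay as they are, and systemd merges multiple Environment= lines:

    [Service]
    # existing Environment= entries stay untouched; this line only adds the new variable
    Environment=USE_NATIVE_OPLOG=true

Then run systemctl daemon-reload and restart the rocketchat service as described above.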

@wolfcreative

Well, the CPU load dropped to 0.7-1.5%
Thank you all for your help!

@HiveMindNet
Author

HiveMindNet commented Oct 1, 2020

Can any of the main devs please let me know: is it safe for me to leave USE_NATIVE_OPLOG=true on a production environment while you find a fix? @geekgonecrazy

@rodrigok
Member

rodrigok commented Oct 1, 2020

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in production env, it will use the code used on all the previous versions.

@HiveMindNet
Author

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in production env, it will use the code used on all the previous versions.

Thank you :)

@rodrigok
Member

rodrigok commented Oct 1, 2020

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

@fdellwing

@rodrigok Someone in the forums posted some info: https://forums.rocket.chat/t/rocketchat-3-7-0-high-cpu-usage/8715

@netbix

netbix commented Oct 3, 2020

I solved it by adding this to my docker-compose file:

volumes:
  - ./uploads:/app/uploads
environment:
  - USE_NATIVE_OPLOG=true
  - PORT=3000

@wolbernd
Contributor

wolbernd commented Oct 5, 2020

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

@ataraxus

ataraxus commented Oct 5, 2020

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

@wolbernd
Contributor

wolbernd commented Oct 5, 2020

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

@benedikt-wegmann

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? 🤔

@wolbernd
Contributor

wolbernd commented Oct 5, 2020

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? 🤔

Sorry, I think it is something unrelated to this issue, but I don't want to leave you hanging ;)
In my case an index was missing on the collection rocketchat.rocketchat_uploads.chunks. The missing index (a compound index on the fields files_id and n) caused all requests for blobs to do a full collection scan. I'm not sure how this index got lost, but we did a database recovery right before the update to 3.7.0, so something might have gone wrong there.
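
For anyone who finds the same index missing, a sketch of how the standard GridFS chunks index could be recreated from the mongo shell; this assumes the default rocketchat database name and the usual GridFS layout, so adjust the names to your own deployment:

    // sketch only: assumes the default database and collection names
    use rocketchat
    db.rocketchat_uploads.chunks.createIndex({ files_id: 1, n: 1 }, { unique: true })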

@HiveMindNet
Author

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

Sorry @rodrigok, we're still early in adoption, so although we have over 1000 users, only around 20-30 are concurrently active at any one time.

@dpunkturban

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

We have the same problem with our cluster. The workaround with USE_NATIVE_OPLOG=true also worked for us.

@binarydad

Will this fix get merged into the 3.7.1 release?

@ankar84

ankar84 commented Oct 6, 2020

Unlike the majority of users here, I want to test change streams in a test environment.
RC server version is 3.7.0
Storage engine is wiredTiger
MongoDB version is 4.0.19
There is no oplog information on instance start.

➔ System ➔ startup
➔ +---------------------------------------------+
➔ |                SERVER RUNNING               |
➔ +---------------------------------------------+
➔ |                                             |
➔ |  Rocket.Chat Version: 3.7.0                 |
➔ |       NodeJS Version: 12.18.4 - x64         |
➔ |      MongoDB Version: 4.0.19                |
➔ |       MongoDB Engine: wiredTiger            |
➔ |             Platform: linux                 |
➔ |         Process Port: 3000                  |
➔ |             Site URL: https://rc.test.ru  |
➔ |     ReplicaSet OpLog: Enabled               |
➔ |          Commit Hash: 2c82cd9cee            |
➔ |        Commit Branch: HEAD                  |
➔ |                                             |
➔ +---------------------------------------------+

How can I force-enable change streams in a deployment?

@gbouthenot

Thank you!
The USE_NATIVE_OPLOG=true fixed the problem!

@rodrigok
Member

rodrigok commented Oct 9, 2020

Folks, we released version 3.7.1 with a fix for this situation. Can you all please try this new version without the USE_NATIVE_OPLOG=true variable?

Thanks

@Gummikavalier

We have no issues with 3.7.1, MongoDB 3.6 running WiredTiger and no USE_NATIVE_OPLOG variable present. No experience with 3.7.0, so I cannot confirm whether the issue existed for us on that.

@wolbernd
Contributor

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good. However, we do not have many users online on weekends, so we'll have to wait until Monday for a definitive answer.

@ataraxus

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good.

@codeneno

codeneno commented Jan 2, 2021

wiredTiger

Does this only fix the "mmapv1 database engine or when no admin access was granted to the database user" case?
We use version 3.3.3, and it sometimes has a high CPU load. Does this work for wiredTiger? Thank you.

@introspection3

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good.

No, it didn't.

@introspection3

Folks, we released version 3.7.1 with a fix for this situation. Can you all please try this new version without the USE_NATIVE_OPLOG=true variable?

Thanks

It wastes too much MongoDB CPU (100%).
Too many getMore operations.

@Gummikavalier

Gummikavalier commented Mar 5, 2021

Because this issue discusses the effect of the oplog method, I'll add this here.

Here is a weekly CPU graph of a single-server RC VMware instance with SSD disks, 8 cores, and 38 GB of memory, with MongoDB 4.0 (WiredTiger) and 20 nodes on the same box. I thought it would give some comparison of the idle loads that different versions and OpLog methods produce.

Screenshot from 2021-03-05 09-56-52

100% on the graph would mean the highest possible load, with all 8 cores running at maximum.

Before the evening of 1 March:
RC 3.9.7 with MongoDB OpLog streams, load approximately 6%

In the evening of 1 March, upgrade from 3.9.7 to 3.10.6 -> 3.11.2, and on the afternoon of 2 March finally to 3.12.0:
RC 3.12.0 with MongoDB OpLog streams, load approximately 21%

In the evening of 2 March, switching to native OpLog with USE_NATIVE_OPLOG=true:
RC 3.12.0 with native OpLog, load approximately 15%


The server is mostly idling the whole week, so there is no considerable load from actual use during the daytime compared to the background load, especially because the background load is so high. So we can ignore daytime active loads when looking at the graph.

All in all, just the idling-load increase from RC 3.9.7 to 3.12.0 with this number of nodes is a staggering 15%!
Most of the load comes from MongoDB when using the default Mongo oplog streaming.

Switching to the native oplog decreased the load by 6%. It does not show in the graph, but this also considerably decreased the load on the MongoDB process on the server and moved it towards the application nodes themselves. In short, much lighter on the DB, but a bit heavier for the nodes.

Because of this, if you are clustering application nodes on different servers, the native oplog seems to be the way to spread the load to additional servers instead of having them hit a single database instance really hard.


We've also tested running as few application nodes as possible in real-life scenarios with 300+ users online. It seems that when the underlying server has a fast enough disk, enough memory, and at least two cores, even a single app node can handle those 300+ active users without any issues, as long as nothing acts up or bugs out really badly.

With as few nodes as possible, the additional load created by oplog operations stays at a minimum, which saves lots of CPU cycles in MongoDB. (This holds no matter which oplog method you use, but the drop is considerably larger if you had been using the MongoDB oplog stream instead of the native oplog.)

We read in 2017 that the optimal users-per-application-node ratio would be approx. 50 per node, and at least that does not seem to be true anymore. We estimate that one node could manage at least 500 users without issues, as long as the underlying hardware is not of the cheapest kind you get from the cloud.

Also, the one-core-per-node rule does not seem to hold very well with good gear; the loads of the individual node processes are pretty low with just 8 cores for 20 nodes plus the database.

@Gummikavalier

Gummikavalier commented Mar 27, 2021

For the record, some of the extra load triggered by RC 3.12.0 eventually ended three weeks later. This makes me think it was an RC-based background process doing something after the upgrade. The update to 3.12.3, along with restarts of all nodes, also did not trigger it again. I'm also considering some kind of race condition in MongoDB or between the RC nodes themselves.

Screenshot from 2021-03-27 12-55-11

@rodrigok
Member

@Gummikavalier thank you for the very detailed information. It seems to be an edge case that we have not been able to reproduce here yet. We will put a new round of focus on this soon, and I hope we can improve the idle CPU usage further.
