
Node slamming CPU since update to 3.7 #19082

Closed
HiveMindNet opened this issue Sep 30, 2020 · 50 comments · Fixed by #19181

Comments

@HiveMindNet

HiveMindNet commented Sep 30, 2020

Since updating to 3.7 from the previous version, CPU utilisation has jumped from less than 5% to a constantly very high average.

Description:

Node is running away with the CPU, and the MongoDB Docker container is also high.

Screenshot 2020-09-30 at 14 36 09

Screenshot 2020-09-30 at 14 38 04

Screenshot 2020-09-30 at 14 38 15

Steps to reproduce:

  1. Restart / start Rocket.Chat

Expected behavior:

normal behaviour / ~3% CPU

Actual behavior:

high CPU

Server Setup Information:

  • Version of Rocket.Chat Server:
  • Operating System:
  • Deployment Method: latest docker
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

N/A

Additional context

Relevant logs:

@Sing-Li
Member

Sing-Li commented Sep 30, 2020

@HiveMindNet Which version of RC were you upgrading from? Also which version of MongoDB before and after?

@HiveMindNet
Author

HiveMindNet commented Sep 30, 2020

@HiveMindNet Which version of RC were you upgrading from? Also which version of MongoDB before and after?

Thanks for asking 👍

I came from 3.6.3 (I'm always up to date) and I didn't upgrade the Mongo docker (which is and was DB version v4.0.20) - could that be the issue?

Both run in individual dockers if that matters.

Thanks for any guidance :)

@xzyyyy

xzyyyy commented Sep 30, 2020

I updated from 3.6.3 too, a manual installation. After upgrading to 3.7.0, I hit the same problem. My Node is 12.14.0, npm is 6.13.4, and my MongoDB is 4.0.2.
Thanks for any suggestion.

@geekgonecrazy
Contributor

geekgonecrazy commented Sep 30, 2020

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

@xzyyyy

xzyyyy commented Sep 30, 2020

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

I think the answer to all of your questions is no. I haven't seen this before, and there are only 4 users on my server. With previous upgrades, CPU usage would drop to about 1%-3% within 2 minutes, but this time it stays at almost 100% all the time.
Thanks.

@xzyyyy

xzyyyy commented Sep 30, 2020

Has this happened with any previous upgrade? Did number of users connected change after the upgrade? Does there seem to be any correlating factor with behaviour and the cpu spikes?

One more question: can I downgrade to 3.6.3 by following the update steps in doc.rocket.chat? Would there be any problem with that?
Thanks

@kgdemir

kgdemir commented Sep 30, 2020

I have just upgraded from 3.6.2 and I can confirm this issue. I will wait until morning; if it does not resolve itself, I will do a restore.

node -v: v12.14.0

2020-09-30_19-59-39
2020-09-30_20-00-47
2020-09-30_20-02-27

@wolfcreative

wolfcreative commented Sep 30, 2020

I have the same problem after upgrading from 3.6.3. Before the update the load was 3-4%.
node: 12.14.0
mongoDB: 4.0.20
OS: Ubuntu 20.04.1 LTS

image

@HiveMindNet
Author

HiveMindNet commented Sep 30, 2020

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries that I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

@Sing-Li
Member

Sing-Li commented Sep 30, 2020

Thanks everyone, all data points help.

Anyone not running this on time-shared cloud vps? (on your own dedicated machines?)

@HiveMindNet
Author

HiveMindNet commented Sep 30, 2020

Thanks everyone, all data points help.

Anyone not running this on time-shared cloud vps? (on your own dedicated machines?)

We're on our own AWS EC2 (m5.xlarge), just running RC within docker for ease. Nothing we can't look at or adjust if needed on the main infrastructure. Let me know if you want me to look at anything. We're running sematext for monitoring everything across containers and on the main EC2 too, including logs and events.

@xzyyyy

xzyyyy commented Sep 30, 2020

Maybe this is a red herring, but I notice I'm getting Jitsi error log entries that I wasn't getting before. Sorry if this isn't relevant, but in case it is I thought I'd share. Is everyone else having these issues seeing anything like this / using Jitsi?

Screenshot 2020-09-30 at 18 18 59

I don't receive any errors; there are no errors, warnings, or anything else in my log, just like before, so I don't think it's a Jitsi problem.
Thanks.

@Sing-Li
Member

Sing-Li commented Sep 30, 2020

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

@geekgonecrazy
Contributor

geekgonecrazy commented Sep 30, 2020

@HiveMindNet is the Jitsi integration something you can turn off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is also the same.

But anything we can use to narrow in on the cause would be awesome.

@ankar84

ankar84 commented Oct 1, 2020

@ankar84 Have you updated to 3.7 on your cluster? And are you seeing similar "increased DB load" behavior?

Hi, @Sing-Li
Yes. I have upgraded one of my test environments from 3.6.3 to 3.7.0 and everything looks smooth and nice.
I have only a few users in that test deployment (but @xzyyyy has trouble with 4 users here).

I have 3 servers with CentOS 7 and 4 RC Docker containers each - 12 instances.
Each server has one MongoDB container, version 4.0.20.
The storage engine is wiredTiger because of the use of change streams (#18892).
Here are htop screenshots from all 3 servers:
image
image
image

@geekgonecrazy
Contributor

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding: USE_NATIVE_OPLOG=true ? I’m curious if it’s in the new change stream processing. As this is a new bit of processing
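
For a Docker-based deployment like the ones described in this thread, USE_NATIVE_OPLOG is just an ordinary environment variable passed to the Rocket.Chat container. A minimal sketch; the container name, image tag, and Mongo URLs are placeholders to adapt to your own setup:

    # sketch only: container name, image tag, and Mongo URLs are placeholders
    docker run -d --name rocketchat \
      -e USE_NATIVE_OPLOG=true \
      -e MONGO_URL=mongodb://mongo:27017/rocketchat \
      -e MONGO_OPLOG_URL=mongodb://mongo:27017/local \
      -p 3000:3000 \
      rocketchat/rocket.chat:3.7.0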

@HiveMindNet
Author

@HiveMindNet is the Jitsi integration something you can turn off for a bit to see if it changes anything? Just to rule that part out. We have the Jitsi integration enabled on open without this issue, so it doesn't seem likely. On open we're also using our Docker image, so that aspect is also the same.

But anything we can use to narrow in on the cause would be awesome.

I have just tried turning off the Jitsi integration and stopping the Jitsi servers, and it made no difference. Worth a try though - so that's ruled out.

@HiveMindNet
Author

HiveMindNet commented Oct 1, 2020

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding: USE_NATIVE_OPLOG=true ? I’m curious if it’s in the new change stream processing. As this is a new bit of processing

Jackpot! You can see the drop on the graph below when I restart the Docker container - the CPU drops back to normal.

Is it safe for me to leave USE_NATIVE_OPLOG=true on a production environment while you find a fix?

Screenshot 2020-10-01 at 08 03 23

Screenshot 2020-10-01 at 08 03 35

@kgdemir

kgdemir commented Oct 1, 2020

You actually point to a PR I hadn’t thought about...

@HiveMindNet would you be able to try adding: USE_NATIVE_OPLOG=true ? I’m curious if it’s in the new change stream processing. As this is a new bit of processing

I can confirm that USE_NATIVE_OPLOG=true fixed the high CPU issue.

@mranderson56

mranderson56 commented Oct 1, 2020

Same thing on my instance; adding "USE_NATIVE_OPLOG=true" to the systemd service file fixed the issue.

@wolfcreative

@mranderson56 Where should I write this?

@Stoobitweet

@wolfcreative In the service file (/lib/systemd/system/rocketchat.service), in the Environment parameter.

@mranderson56

mranderson56 commented Oct 1, 2020

In your systemd file (on Debian): /etc/systemd/system/multi-user.target.wants/rocketchat.service
You can add this to the "Environment" line.
Then: systemctl daemon-reload
And restart Rocket.Chat.
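
For reference, a minimal sketch of what the change could look like in the unit file; existing entries (MONGO_URL, ROOT_URL, PORT, etc.) stay as they are, and systemd merges multiple Environment= lines:

    [Service]
    # existing Environment= entries stay untouched; this line only adds the new variable
    Environment=USE_NATIVE_OPLOG=true

Then run systemctl daemon-reload and restart the rocketchat service as described above.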

@wolfcreative

Well, the CPU load dropped to 0.7-1.5%
Thank you all for your help!

@HiveMindNet
Author

HiveMindNet commented Oct 1, 2020

Can any of the main devs please let me know: is it safe for me to leave USE_NATIVE_OPLOG=true on a production environment while you find a fix? @geekgonecrazy

@rodrigok
Member

rodrigok commented Oct 1, 2020

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in production env, it will use the code used on all the previous versions.

@HiveMindNet
Author

@HiveMindNet it's safe to keep USE_NATIVE_OPLOG in production env, it will use the code used on all the previous versions.

Thank you :)

@rodrigok
Member

rodrigok commented Oct 1, 2020

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

@fdellwing

@rodrigok Someone in the forums posted some info: https://forums.rocket.chat/t/rocketchat-3-7-0-high-cpu-usage/8715

@netbix

netbix commented Oct 3, 2020

I solved it by adding this to my docker-compose file:

volumes:
  - ./uploads:/app/uploads
environment:
  - USE_NATIVE_OPLOG=true
  - PORT=3000

@wolbernd
Contributor

wolbernd commented Oct 5, 2020

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

@ataraxus

ataraxus commented Oct 5, 2020

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

@wolbernd
Contributor

wolbernd commented Oct 5, 2020

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

@benedikt-wegmann

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? 🤔

@wolbernd
Contributor

wolbernd commented Oct 5, 2020

I had the same problem, and setting USE_NATIVE_OPLOG=true does help a bit; however, the CPU load is still significantly higher since the upgrade to 3.7.0. We have 1 server with 8 vCPUs running 7 Rocket.Chat instances and 1 MongoDB. At the moment we have approx. 420 users online and the CPU load (5 min avg) is at about 7.5. Before the update the CPU load rarely exceeded 3.0.

Sorry, please disregard this, I found a missing index in our mongodb that caused this issue.

Come now, don't leave everybody hanging. What exactly did you do to fix this? 🤔

Sorry, I think it is something unrelated to this issue, but I don't want to leave you hanging ;)
In my case an index was missing on the collection rocketchat.rocketchat_uploads.chunks. The missing index (a compound index on the fields files_id and n) caused all requests for blobs to do a full collection scan. I'm not sure how this index got lost, but we did a database recovery right before the update to 3.7.0, so something might have gone wrong there.
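
For anyone who finds the same index missing, a sketch of how the standard GridFS chunks index could be recreated from the mongo shell; this assumes the default rocketchat database name and the usual GridFS layout, so adjust the names to your own deployment:

    // sketch only: assumes the default database and collection names
    use rocketchat
    db.rocketchat_uploads.chunks.createIndex({ files_id: 1, n: 1 }, { unique: true })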

@HiveMindNet
Author

@HiveMindNet how many users do you have online on your instance? And could you give me a screenshot or copy of the startup logs where rocket.chat prints the mongodb version, engine, etc?

Sorry @rodrigok, we're still early in adoption, so although we have over 1000 users, only around 20-30 are concurrently active at any one time.

@dpunkturban

Same issue here: USE_NATIVE_OPLOG=true should definitely be the default! I just stumbled upon the changed load by chance!

EDIT: To be clear, setting this flag did fix the issue for us. Load is back to <5%.

We have the same problem with our cluster. The workaround with USE_NATIVE_OPLOG=true also worked for us.

@binarydad

Will this fix get merged into the 3.7.1 release?

@ankar84

ankar84 commented Oct 6, 2020

Unlike the majority of users here, I want to test change streams in a test environment.
RC server version is 3.7.0
Storage engine is wiredTiger
MongoDB version is 4.0.19
There is no oplog information on instance start.

➔ System ➔ startup
➔ +---------------------------------------------+
➔ |                SERVER RUNNING               |
➔ +---------------------------------------------+
➔ |                                             |
➔ |  Rocket.Chat Version: 3.7.0                 |
➔ |       NodeJS Version: 12.18.4 - x64         |
➔ |      MongoDB Version: 4.0.19                |
➔ |       MongoDB Engine: wiredTiger            |
➔ |             Platform: linux                 |
➔ |         Process Port: 3000                  |
➔ |             Site URL: https://rc.test.ru  |
➔ |     ReplicaSet OpLog: Enabled               |
➔ |          Commit Hash: 2c82cd9cee            |
➔ |        Commit Branch: HEAD                  |
➔ |                                             |
➔ +---------------------------------------------+

How can I force-enable change streams in a deployment?

@gbouthenot

Thank you!
The USE_NATIVE_OPLOG=true fixed the problem!

@rodrigok
Member

rodrigok commented Oct 9, 2020

Folks, we released version 3.7.1 with a fix for this situation. Can you all please try this new version without the USE_NATIVE_OPLOG=true variable?

Thanks

@Gummikavalier

We have no issues with 3.7.1, MongoDB 3.6 running WiredTiger and no USE_NATIVE_OPLOG variable present. No experience with 3.7.0, so I cannot confirm whether the issue existed for us on that.

@wolbernd
Contributor

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good. However, we do not have many users online on weekends, so we'll have to wait until Monday for a definitive answer.

@ataraxus

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good.

@codeneno

codeneno commented Jan 2, 2021

wiredTiger

Does this only fix the "mmapv1 database engine or when no admin access was granted to the database user" case?
We use version 3.3.3, and it sometimes has a high CPU load. Does this work for wiredTiger? Thank you.

@introspection3

@rodrigok We applied the update and removed USE_NATIVE_OPLOG=true and at the moment it looks good.

No, it didn't.

@introspection3

Folks, we released version 3.7.1 with a fix for this situation. Can you all please try this new version without the USE_NATIVE_OPLOG=true variable?

Thanks

It wastes too much MongoDB CPU (100%).
Too many getMore operations.

@Gummikavalier

Gummikavalier commented Mar 5, 2021

Because this issue discusses the effect of the oplog method, I'll add this here.

Here is a weekly CPU graph of a single-server RC VMware instance with SSD disks, 8 cores, and 38 GB of memory, with MongoDB 4.0 (WiredTiger) and 20 nodes on the same box. I thought it would give some comparison of the idle loads that different versions and OpLog methods produce.

Screenshot from 2021-03-05 09-56-52

100% on the graph would mean the highest possible load, with all 8 cores running at maximum.

Before the evening of 1 March:
RC 3.9.7 with MongoDB OpLog streams, load approximately 6%

In the evening of 1 March, upgrade from 3.9.7 to 3.10.6 -> 3.11.2, and on the afternoon of 2 March finally to 3.12.0:
RC 3.12.0 with MongoDB OpLog streams, load approximately 21%

In the evening of 2 March, switching to native OpLog with USE_NATIVE_OPLOG=true:
RC 3.12.0 with native OpLog, load approximately 15%


The server is mostly idling the whole week, so there is no considerable load from actual use during the daytime compared to the background load, especially because the background load is so high. So we can ignore daytime active loads when looking at the graph.

All in all, just the idling-load increase from RC 3.9.7 to 3.12.0 with this number of nodes is a staggering 15%!
Most of the load comes from MongoDB when using the default Mongo oplog streaming.

Switching to the native oplog decreased the load by 6%. It does not show in the graph, but this also considerably decreased the load on the MongoDB process on the server and moved it towards the application nodes themselves. In short, much lighter on the DB, but a bit heavier for the nodes.

Because of this, if you are clustering application nodes on different servers, the native oplog seems to be the way to spread the load to additional servers instead of having them hit a single database instance really hard.


We've also tested running as few application nodes as possible in real-life scenarios with 300+ users online. It seems that when the underlying server has a fast enough disk, enough memory, and at least two cores, even a single app node can handle those 300+ active users without any issues, as long as nothing acts up or bugs out really badly.

With as few nodes as possible, the additional load created by oplog operations stays at a minimum, which saves lots of CPU cycles in MongoDB. (This holds no matter which oplog method you use, but the drop is considerably larger if you had been using the MongoDB oplog stream instead of the native oplog.)

We read in 2017 that the optimal users-per-application-node ratio would be approx. 50 per node, and at least that does not seem to be true anymore. We estimate that one node could manage at least 500 users without issues, as long as the underlying hardware is not of the cheapest kind you get from the cloud.

Also, the one-core-per-node rule does not seem to hold very well with good gear; the loads of the individual node processes are pretty low with just 8 cores for 20 nodes plus the database.

@Gummikavalier

Gummikavalier commented Mar 27, 2021

For the record, some of the extra load triggered by RC 3.12.0 eventually ended three weeks later. This makes me think it was an RC-based background process doing something after the upgrade. The update to 3.12.3, along with restarts of all nodes, also did not trigger it again. I'm also considering some kind of race condition in MongoDB or between the RC nodes themselves.

Screenshot from 2021-03-27 12-55-11

@rodrigok
Member

@Gummikavalier thank you for the very detailed information. It seems to be an edge case that we have not been able to reproduce here yet. We will put a new round of focus on this soon, and I hope we can improve the idle CPU usage further.
