Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Every node.js at 100%CPU usage after v3.0.4 upgrade #17025

Closed
magicbelette opened this issue Mar 26, 2020 · 15 comments
Closed

Every node.js at 100%CPU usage after v3.0.4 upgrade #17025

magicbelette opened this issue Mar 26, 2020 · 15 comments

Comments

@magicbelette
Copy link
Contributor

Description:

Migrating from 2.2.1 to 3.0.4, every node.js are stuck at 100%CPU usage and Rocket.Chat becomes unresponsive.

Actual behavior:

The first peak at midnight is node.js instances restarting for 3.0.4 upgrade. Then in the morning it starts growing as users connect, Rocket.chat becomes (very) laggy and the load never decrease. Finally go back to Rocket.Chat 2.2.1, CPU load decreases and even if high load at the become, Rocket.Chat is still usable in 2.2.1.
Capture d’écran_2020-03-26_12-28-32

Server Setup Information:

  • Version of Rocket.Chat Server: 3.0.4
  • Operating System: Debian
  • Deployment Method: tar
  • Number of Running Instances: 99
  • DB Replicaset Oplog: true
  • NodeJS Version: v12.14.0
  • MongoDB Version: 3.6

Additional context

Node.js repartition :
host 1 with 8CPU : 7 node.js instances
host 1 with 64CPU : 62 node.js instances
host 1 with 32CPU : 30 node.js instances

That's maybe a bad idea to have a lot of node.js instances on the same server and keep small 4CPU servers with 3 node.js instances each ?

Relevant logs:

A lot of differents errors but mainly, node.js instances seem to keep discovering each other :

Mar 25 18:33:41 m2-ariane-01 rocketchat3001[41025]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3001' to 'localhost:3002' with name notify-user not authorized
Mar 25 18:33:41 m2-ariane-01 rocketchat3001[41025]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3001' to 'localhost:3002' with name notify-user not authorized
Mar 25 18:33:41 m2-ariane-01 rocketchat3001[41025]: StreamBroadcast ➔ Auth.info Authorizing with localhost:3002
[...]
ar 25 18:33:51 m2-ariane-01 rocketchat3001[41025]: StreamBroadcast ➔ Auth.info broadcastAuth with localhost:3002 false
Mar 25 18:33:51 m2-ariane-01 rocketchat3001[41025]: StreamBroadcast ➔ Auth.info broadcastAuth with localhost:3002 false
[...]
Mar 25 19:07:53 m2-ariane-01 rocketchat3000[54349]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3000' to '10.167.124.87:3020' with name notify-logged not authorized
Mar 25 19:07:53 m2-ariane-01 rocketchat3000[54349]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3000' to '10.167.124.87:3016' with name notify-logged not authorized
Mar 25 19:07:53 m2-ariane-01 rocketchat3000[54349]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3000' to '10.167.124.87:3020' with name notify-user not authorized
Mar 25 19:07:53 m2-ariane-01 rocketchat3000[54349]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3000' to '10.167.124.87:3016' with name notify-user not authorized
[...]
Mar 25 18:45:27 m2-ariane-01 rocketchat3000[39494]: TIMEOUT QUERY OPERATION {
Mar 25 18:45:27 m2-ariane-01 rocketchat3000[39494]:   observeKey: '{"ordered":false,"collectionName":"users","selector":{"_id":"74Nqois8WyZ5CfBck"},"options":{"transform":null,"fields":{"profile":1,"username":1,"emails":1}}}',
Mar 25 18:45:27 m2-ariane-01 rocketchat3000[39494]:   writesToCommitWhenWeReachSteadyLength: 0,
Mar 25 18:45:27 m2-ariane-01 rocketchat3000[39494]:   cursorDescription: '{"collectionName":"users","selector":{"_id":"74Nqois8WyZ5CfBck"},"options":{"transform":null,"fields":{"profile":1,"username":1,"emails":1}}}'
Mar 25 18:45:27 m2-ariane-01 rocketchat3000[39494]: }
Mar 25 18:45:27 m2-ariane-01 rocketchat3000[39494]: Exception in setInterval callback: SwitchedToQuery {}
[...]
Mar 25 20:01:56 m2-ariane-01 rocketchat3000[61003]: (node:61003) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 exit listeners added to [process]. Use emitter.setMaxListeners() to increase limit
[...]
Mar 25 20:05:01 m2-ariane-01 rocketchat3045[66100]: StreamBroadcast ➔ Auth.info broadcastAuth with localhost:3002/ true
Mar 25 20:06:28 m2-ariane-01 rocketchat3002[61683]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3002' to 'localhost:3045' with name notify-user not authorized
@bbrauns
Copy link
Contributor

bbrauns commented Mar 27, 2020

@magicbelette would you share with us the number of online users you are serving with your setup?

Currently we have one VM with 64 cores and around 20 RC instances for around 2000 online users. Yesterday we had an issue where all instances spiked to 100% CPU. We saw the same "Stream broadcast from xxx to xxx with name xxx not authorized" errors in our logs.

I hesitate to further increase the number or RC instances because I read that every instance has to contact every other instance which could potentially create bottlenecks. Version: 2.4.8

@magicbelette
Copy link
Contributor Author

@bbrauns thanks for your answer.
We also serve 2000 online users with a peak to 7000. The problem appears with around 900 online users.
Yes you're right, every new instance annonces itself to mongodb in instances collection. Then every instances reads this collection to discover and broadcast to other instance.

@introspection3
Copy link

sir,I think,3.* is every unstable。Could you give a hand,but:
run docker-compose with multiple instances will meet a strange but ' MongoError: ns not found', 'Errors like this can cause oplog processing errors.'
#17020

@magicbelette
Copy link
Contributor Author

I noticed a difference between how instances are seen in admin page.
Rocket.Chat 2.2.1 : Auth and Connection status > Connected are unset
rocketchat2 2 1

Rocket.Chat 3.0.4 : Auth and Connection status > Connected are set to true
rocketchat3 0 4

Maybe a new auth mecanism heavily load the streambroadcast process ?

@bbrauns
Copy link
Contributor

bbrauns commented Mar 30, 2020

2.4.8 reports auth and connection status:

image

@magicbelette
Copy link
Contributor Author

OK, that's just a status report.

So.. Could it be due to this change @rodrigok ?
b95cb64#diff-067b50e42d48ca885e4b33cf54cf3124

@introspection3
Copy link

@magicbelette
Could you give a hand,but:
run docker-compose with multiple instances will meet a strange but ' MongoError: ns not found', 'Errors like this can cause oplog processing errors.'
#17020

@bbrauns
Copy link
Contributor

bbrauns commented Mar 30, 2020

We had another issue today. Starting a new instance leads to massive cpu usage of all the other rc instances. Starting multiple instances at once brings down the whole cluster.

@magicbelette
Copy link
Contributor Author

We had another issue today. Starting a new instance leads to massive cpu usage of all the other rc instances. Starting multiple instances at once brings down the whole cluster.

Got the same issue... Can you try to get back to this version ?
b95cb64

I'm setting up a new cluster to test with 64 instances.

@bbrauns
Copy link
Contributor

bbrauns commented Mar 31, 2020

Sorry I can't test in production. Customers are already mad. Why do you find b95cb64 suspicious?

@magicbelette
Copy link
Contributor Author

Sorry I can't test in production. Customers are already mad. Why do you find b95cb64 suspicious?

Just because this was the only change somehow linked to streambroadcast... But it seems that the core team was quite busy to night:
#17088
#17089

As I saw a lot of
Mar 25 19:07:53 m2-ariane-01 rocketchat3000[54349]: server.js:204 StreamBroadcast ➔ Stream.error Stream broadcast from '10.167.124.86:3000' to '10.167.124.87:3016' with name notify-user not authorized
I'll try the last v3.0.9

@bbrauns
Copy link
Contributor

bbrauns commented Mar 31, 2020

Ok makes sense. But these changes aren't in v3.0.9 or am I wrong?
Please report your findings 😄

@magicbelette
Copy link
Contributor Author

Those were included in 3.0.8
https://github.com/RocketChat/Rocket.Chat/releases/tag/3.0.8

@emikolajczak
Copy link

Hi @magicbelette
We have similar problems with high CPU load. Did you have a chance to test fixes in 3.0.8? Is there any
improvement in performance?

@magicbelette
Copy link
Contributor Author

Hi, the situation is far more comfortable in version 3.0.9. Plus, this one #17115 should improve performance too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants