Runaway RAM usage #908

Open
jgoerzen opened this issue Feb 26, 2021 · 11 comments
Labels
question: This issue is a question related to installation
upstream: This issue is related to an upstream project

Comments

@jgoerzen
Contributor

Hi folks,

I've been running this for a long time, and recently I have seen frequent out-of-memory conditions. It's running in a KVM VM, and I've increased its RAM from 3GB to 4GB, 5GB, and 6GB, yet every few days it still hangs.

This system serves only one user: me. I am in some large channels.

When there are issues, here's what top sorted by RAM looks like:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                
24405 matrix    20   0 2497780   1.1g      0 R  11.8  19.7 275:36.18 python                                                                 
24908 matrix    20   0  603748 491000 140452 R   0.9   8.0  25:01.06 postgres                                                               
24948 matrix    20   0  601776 489564 140412 R   0.9   8.0  24:45.11 postgres                                                               
25073 matrix    20   0  601228 487868 140432 R   0.9   8.0  25:14.48 postgres                                                               
25076 matrix    20   0  600704 487796 140436 R   0.9   8.0  25:05.61 postgres                                                               
25071 matrix    20   0  601028 487356 140440 R   0.9   8.0  24:52.17 postgres                                                               
24898 matrix    20   0  599480 486856 140452 R   0.6   8.0  24:58.00 postgres                                                               
25074 matrix    20   0  598920 486632 140420 R   0.9   8.0  25:19.28 postgres                                                               
24954 matrix    20   0  598736 486384 140424 R   0.6   8.0  24:48.39 postgres                                                               
24951 matrix    20   0  599796 486152 140432 R   0.9   8.0  24:57.11 postgres                                                               
24949 matrix    20   0  593780 480472 140432 R   0.6   7.9  24:44.15 postgres                                                               
 2478 matrix    20   0  193396 139992 138820 R   0.9   2.3   2:02.81 postgres                                                               
 2479 matrix    20   0  193328 139040 137936 S   0.6   2.3   2:44.74 postgres                                                               
24807 matrix    20   0  953632  68068      0 R  11.5   1.1  21:46.87 node                                                                   

I believe it's these Postgres processes that are responsible for all the growth. The Synapse process has always been at around that size.

I see #532, but it seems to target tuning for very large systems. Moreover, it doesn't seem to edit anything in the .yml files, which makes me think the changes wouldn't be persistent.

It would be great to see some documentation on how to tune for lower-RAM situations as well as how to make those changes persistent in .yml. Thanks!

@skepticalwaves
Contributor

I am in some large channels.

This is not an issue with this repo, but rather with Synapse itself: matrix-org/synapse#7339

There is of course a solution for this in the repo:
https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/master/docs/faq.md#how-do-i-optimize-this-setup-for-a-low-power-server

You can also consider implementing a restriction on room complexity, in order to prevent users from joining very heavy rooms:

matrix_synapse_configuration_extension_yaml: |
  limit_remote_rooms:
    enabled: true
    # Limits joining complex (~large) rooms; can be increased,
    # but larger values can require more RAM
    complexity: 1.0

Tl;dr Don't join large rooms.
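
If you do raise the limit, Synapse also has an optional complexity_error setting for the message shown when a join is refused. A hedged sketch extending the snippet above (the message text is purely illustrative):

matrix_synapse_configuration_extension_yaml: |
  limit_remote_rooms:
    enabled: true
    complexity: 1.0
    # Optional: message returned when a join is refused (illustrative text)
    complexity_error: "This room is too resource-intensive for this homeserver."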

@jgoerzen
Contributor Author

Thanks @skepticalwaves. I'm already following most of those recommendations, but the large rooms are part of the reason I'm using this in the first place.

I have accepted the RAM usage of Synapse. The problem I'm experiencing is related to the dramatically increased RAM usage by Postgres.

As I've researched this further, it seems the tuning mentioned in #532 is no longer in the code base. I'm not sure where it went, but I do see an option to pass some -c flags to Postgres in the .yml, so I'll experiment with some tuning there.

Thanks!

@skepticalwaves
Contributor

Perhaps you can attach to the postgres container and try analyzing what's going on with pg_top:
https://severalnines.com/database-blog/what-check-if-postgresql-memory-utilization-high

It's worth figuring out what Postgres is doing before attempting to tune it.

@spantaleev
Owner

#642 shows how you can pass additional flags to Postgres:

matrix_postgres_process_extra_arguments: [
  "-c 'max_connections=200'"
]

@jgoerzen
Contributor Author

Initially, this:

matrix_postgres_process_extra_arguments: ["-c 'shared_buffers=24MB'", "-c fsync=off", "-c synchronous_commit=off"]

seems to have made a dramatic improvement. I will keep an eye on things over the weekend and see whether it holds.

@pushytoxin
Contributor

The memory spikes appear randomly, and it's easy to fall for the placebo effect.

Keeping that in mind, I haven't seen crippling swap usage since I added the following option to my /etc/docker/daemon.json:

{
        "exec-opts": ["native.cgroupdriver=systemd"]
}

@jgoerzen
Contributor Author

jgoerzen commented Mar 6, 2021

An update...

The changes I made increased the amount of time before an OOM, but eventually that behavior returned (now after a week or two instead of a few days). I tried pg_top but was unable to figure out where the memory was being used.

I am wondering whether there's a memory leak, but I'm also tweaking some other parameters (e.g., disabling huge_pages). I will report back with what I find.
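
For anyone following along, a sketch of what disabling huge pages through the same playbook variable could look like (huge_pages=off is a standard PostgreSQL setting; the snippet is illustrative rather than a record of what was actually run here):

# Stop PostgreSQL from requesting huge pages at startup
matrix_postgres_process_extra_arguments: ["-c huge_pages=off"]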

@PC-Admin
Contributor

PC-Admin commented Mar 9, 2021

Noticed this too on perthchat.org; we currently don't have a room complexity limit.

@jgoerzen, you should be aware that disabling fsync is dangerous: it removes crash-safety guarantees, meaning an unexpected shutdown of your server could leave your DB corrupted. I like to use the much safer '-c synchronous_commit=off' setting, which keeps the database consistent but drops the requirement that DB updates be flushed to disk before they are acknowledged.
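
In concrete terms, a sketch of the safer combination being suggested here, using the same playbook variable as above (fsync simply isn't overridden, so it stays at its default of on):

# Relax only commit acknowledgement; fsync remains at its default (on)
matrix_postgres_process_extra_arguments: ["-c synchronous_commit=off"]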

@jgoerzen
Contributor Author

jgoerzen commented Mar 9, 2021

I continue to see memory usage for the Postgres processes gradually increasing over a period of days until it reaches over 300MB per process and triggers OOM. Changes to settings have sometimes slowed this behavior, but not solved it.

This thread https://www.postgresql-archive.org/BUG-16707-Memory-leak-td6161863.html mentions JIT as a possible source of leaks in PostgreSQL 12. I'll try disabling that next and see what happens.

The filesystem here is backed by ZFS and can be trivially rolled back to an earlier snapshot, but your point about fsync is a good one for me and for anyone else following along. I was willing to take the risk for diagnosis given the ZFS backing, but not everyone would be.
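
For other readers wanting to reproduce the JIT experiment, a sketch of what disabling it via the same playbook variable could look like (jit is a standard PostgreSQL 12 setting; the earlier flags are omitted here for clarity):

# Disable PostgreSQL 12's JIT compilation, suspected above of leaking memory
matrix_postgres_process_extra_arguments: ["-c jit=off"]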

@PC-Admin
Contributor

PC-Admin commented Mar 9, 2021

We've also noticed that raising the global cache factor might make it "run away" more slowly; more caching in Synapse can reduce the strain on the DB. We jumped from 2.0 to 4.0 and it wasn't as bad. (6 cores, 24GB RAM, ~100 users, and 1 worker.)
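
For reference, a hedged sketch of what raising the global cache factor could look like via the configuration-extension pattern used earlier in this thread (caches.global_factor is the standard Synapse option; 4.0 mirrors the value mentioned above rather than being a general recommendation):

matrix_synapse_configuration_extension_yaml: |
  caches:
    # Multiplies all of Synapse's internal cache sizes: fewer DB hits,
    # but more RAM used by Synapse itself
    global_factor: 4.0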

@ptman
Contributor

ptman commented Oct 27, 2021

txn_limit (matrix-org/synapse#10440) could help
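
For context, a hedged sketch of what txn_limit amounts to in Synapse's homeserver.yaml terms; the value is illustrative, and whether merging it through matrix_synapse_configuration_extension_yaml interacts cleanly with the playbook-managed database settings is an assumption to verify:

matrix_synapse_configuration_extension_yaml: |
  database:
    # Recycle each DB connection after this many transactions (illustrative value)
    txn_limit: 10000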

@luixxiul added the question and upstream labels on Nov 2, 2024