Runaway RAM usage #908

Open
jgoerzen opened this issue Feb 26, 2021 · 11 comments
Labels
question: This issue is a question related to installation
upstream: This issue is related to an upstream project

Comments

@jgoerzen
Contributor

Hi folks,

I've been running this for a long time, and recently I have seen frequent out-of-memory conditions. It's running in a KVM VM, and I've increased its RAM from 3GB to 4GB, 5GB, and 6GB, yet every few days it still hangs.

This system serves only one user: me. I am in some large channels.

When there are issues, here's what top sorted by RAM looks like:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                
24405 matrix    20   0 2497780   1.1g      0 R  11.8  19.7 275:36.18 python                                                                 
24908 matrix    20   0  603748 491000 140452 R   0.9   8.0  25:01.06 postgres                                                               
24948 matrix    20   0  601776 489564 140412 R   0.9   8.0  24:45.11 postgres                                                               
25073 matrix    20   0  601228 487868 140432 R   0.9   8.0  25:14.48 postgres                                                               
25076 matrix    20   0  600704 487796 140436 R   0.9   8.0  25:05.61 postgres                                                               
25071 matrix    20   0  601028 487356 140440 R   0.9   8.0  24:52.17 postgres                                                               
24898 matrix    20   0  599480 486856 140452 R   0.6   8.0  24:58.00 postgres                                                               
25074 matrix    20   0  598920 486632 140420 R   0.9   8.0  25:19.28 postgres                                                               
24954 matrix    20   0  598736 486384 140424 R   0.6   8.0  24:48.39 postgres                                                               
24951 matrix    20   0  599796 486152 140432 R   0.9   8.0  24:57.11 postgres                                                               
24949 matrix    20   0  593780 480472 140432 R   0.6   7.9  24:44.15 postgres                                                               
 2478 matrix    20   0  193396 139992 138820 R   0.9   2.3   2:02.81 postgres                                                               
 2479 matrix    20   0  193328 139040 137936 S   0.6   2.3   2:44.74 postgres                                                               
24807 matrix    20   0  953632  68068      0 R  11.5   1.1  21:46.87 node                                                                   

I believe it's these Postgres processes that are responsible for all the growth. The Synapse process has always been at around that size.

I see #532, but it seems to target tuning for very large systems. Moreover, it doesn't seem to edit anything in the .yml files, which makes me think the changes wouldn't be persistent.

It would be great to see some documentation on how to tune for lower-RAM situations as well as how to make those changes persistent in .yml. Thanks!

@skepticalwaves
Contributor

I am in some large channels.

This is not an issue with this repo, but rather with Synapse itself: matrix-org/synapse#7339

There is of course a solution for this in the repo:
https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/master/docs/faq.md#how-do-i-optimize-this-setup-for-a-low-power-server

You can also consider implementing a restriction on room complexity, in order to prevent users from joining very heavy rooms:

matrix_synapse_configuration_extension_yaml: |
  limit_remote_rooms:
    enabled: true
    # Limits joining complex (~large) rooms; can be increased,
    # but larger values can require more RAM
    complexity: 1.0

Tl;dr Don't join large rooms.
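
If you do raise the limit, Synapse also has an optional complexity_error setting for the message shown when a join is refused. A hedged sketch extending the snippet above (the message text is purely illustrative):

matrix_synapse_configuration_extension_yaml: |
  limit_remote_rooms:
    enabled: true
    complexity: 1.0
    # Optional: message returned when a join is refused (illustrative text)
    complexity_error: "This room is too resource-intensive for this homeserver."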

@jgoerzen
Contributor Author

Thanks @skepticalwaves. I'm already following most of those recommendations, but the large rooms are part of the reason I'm using this in the first place.

I have accepted the RAM usage of Synapse. The problem I'm experiencing is related to the dramatically increased RAM usage by Postgres.

As I've researched this further, it seems the tuning mentioned in #532 is no longer in the code base. I'm not sure where it went, but I do see an option to pass some -c flags to Postgres in the .yml, so I'll experiment with some tuning there.

Thanks!

@skepticalwaves
Contributor

Perhaps you can attach to the postgres container and try analyzing what's going on with pg_top:
https://severalnines.com/database-blog/what-check-if-postgresql-memory-utilization-high

It's worth figuring out what Postgres is doing before attempting to tune it.

@spantaleev
Owner

#642 shows how you can pass additional flags to Postgres:

matrix_postgres_process_extra_arguments: [
  "-c 'max_connections=200'"
]

@jgoerzen
Contributor Author

Initially, this:

matrix_postgres_process_extra_arguments: ["-c 'shared_buffers=24MB'", "-c fsync=off", "-c synchronous_commit=off"]

seems to have made a dramatic improvement. I will keep an eye on things over the weekend and see whether it holds.

@pushytoxin
Contributor

The memory spikes appear randomly, and it's easy to fall for the placebo effect.

Keeping that in mind, I haven't seen crippling swap usage since I added the following option to my /etc/docker/daemon.json:

{
        "exec-opts": ["native.cgroupdriver=systemd"]
}

@jgoerzen
Contributor Author

jgoerzen commented Mar 6, 2021

An update...

The changes I made increased the amount of time before an OOM, but eventually that behavior returned (now after a week or two instead of a few days). I tried pg_top but was unable to figure out where the memory was being used.

I am wondering whether there's a memory leak, but I'm also tweaking some other parameters (e.g., disabling huge_pages). I will report back with what I find.
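
For anyone following along, a sketch of what disabling huge pages through the same playbook variable could look like (huge_pages=off is a standard PostgreSQL setting; the snippet is illustrative rather than a record of what was actually run here):

# Stop PostgreSQL from requesting huge pages at startup
matrix_postgres_process_extra_arguments: ["-c huge_pages=off"]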

@PC-Admin
Contributor

PC-Admin commented Mar 9, 2021

Noticed this too on perthchat.org; we currently don't have a room complexity limit.

@jgoerzen, you should be aware that disabling fsync is dangerous: it removes crash-safety guarantees, meaning an unexpected shutdown of your server could leave your DB corrupted. I like to use the much safer '-c synchronous_commit=off' setting, which keeps the database consistent but drops the requirement that DB updates be flushed to disk before they are acknowledged.
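
In concrete terms, a sketch of the safer combination being suggested here, using the same playbook variable as above (fsync simply isn't overridden, so it stays at its default of on):

# Relax only commit acknowledgement; fsync remains at its default (on)
matrix_postgres_process_extra_arguments: ["-c synchronous_commit=off"]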

@jgoerzen
Contributor Author

jgoerzen commented Mar 9, 2021

I continue to see memory usage for the Postgres processes gradually increasing over a period of days until it reaches over 300MB per process and triggers OOM. Changes to settings have sometimes slowed this behavior, but not solved it.

This thread https://www.postgresql-archive.org/BUG-16707-Memory-leak-td6161863.html mentions JIT as a possible source of leaks in PostgreSQL 12. I'll try disabling that next and see what happens.

The filesystem here is backed by ZFS and can be trivially rolled back to an earlier snapshot, but your point about fsync is a good one for me and for anyone else following along. I was willing to take the risk for diagnosis given the ZFS backing, but not everyone would be.
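
For other readers wanting to reproduce the JIT experiment, a sketch of what disabling it via the same playbook variable could look like (jit is a standard PostgreSQL 12 setting; the earlier flags are omitted here for clarity):

# Disable PostgreSQL 12's JIT compilation, suspected above of leaking memory
matrix_postgres_process_extra_arguments: ["-c jit=off"]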

@PC-Admin
Contributor

PC-Admin commented Mar 9, 2021

We've also noticed that raising the global cache factor might make it "run away" more slowly; more caching in Synapse can reduce the strain on the DB. We jumped from 2.0 to 4.0 and it wasn't as bad. (6 cores, 24GB RAM, ~100 users, and 1 worker.)
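
For reference, a hedged sketch of what raising the global cache factor could look like via the configuration-extension pattern used earlier in this thread (caches.global_factor is the standard Synapse option; 4.0 mirrors the value mentioned above rather than being a general recommendation):

matrix_synapse_configuration_extension_yaml: |
  caches:
    # Multiplies all of Synapse's internal cache sizes: fewer DB hits,
    # but more RAM used by Synapse itself
    global_factor: 4.0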

@ptman
Contributor

ptman commented Oct 27, 2021

txn_limit (matrix-org/synapse#10440) could help
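
For context, a hedged sketch of what txn_limit amounts to in Synapse's homeserver.yaml terms; the value is illustrative, and whether merging it through matrix_synapse_configuration_extension_yaml interacts cleanly with the playbook-managed database settings is an assumption to verify:

matrix_synapse_configuration_extension_yaml: |
  database:
    # Recycle each DB connection after this many transactions (illustrative value)
    txn_limit: 10000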

@luixxiul added the question and upstream labels on Nov 2, 2024