Synapse v1.0.0rc1 significantly increases RAM usage #5395
Comments
I'm running Synapse v1.0.0rc1 on a VPS with 8 GB of RAM. Not long after restarting Synapse (less than an hour), I see it using about 2.8 GB of resident (RES) memory according to top. Unfortunately, I don't have a baseline reading from before upgrading to v1.0.0rc1.
My RAM usage also spiked immediately after upgrading to 1.0.0rc1. I usually hover around 2.1-2.8 GB used, but it takes a while to get there. After upgrading to 1.0.0rc1 my RAM usage climbed steadily until it hit 3.7 GB (out of 3.85 GB total), and the process is currently sitting at 100% CPU on both cores.
It's possible that the matrix.org outages are causing problems for people - how are extremities (#1760) looking?
Extremities are OK. The highest is 5. (This is after Synapse OOMed and I stopped it from restarting, but I assume it doesn't need to be running.)
Haven't set up Grafana yet, but this gets spammed in the logs right as memory usage is climbing rapidly. When I restart my server it goes back to roughly normal memory levels, but it spikes as soon as I say something in a big room.
I have set up Grafana, but the graphs are useless if I enable presence, since the server just crashes.
Just as an update: RAM usage did keep increasing until the host crashed, even with presence disabled; apparently that just bought this server some time. Started a
Adding our comments from the other thread: for me, as long as I don't speak in large rooms the memory usage stays stable, but it spikes as soon as I try to speak in one.
Seems to be correct.
I'm on 0.99.5.2 (current). I had activated federation_verify_certificates from the first install a few weeks ago. At 10:20 I restarted with federation_verify_certificates: false; at 12:20 I restarted with true again. Memory quickly grows to 3.4 GB, from 1.5 GB before. Maybe this is only more caching, but as you can see in the graph, it can grow up to 5.5 GB.
I'm on 0.99.5.2 and reproduced @ingothierack's results, with a procedure of starting the server and then posting in #synapse:matrix.org:
I added a stub to my synapse install that dumps a tracemalloc snapshot to a file when it receives a SIGUSR1. This time, I used a procedure of: start synapse, start typing in #synapse:matrix.org (thereby sending out typing notifications), and watch the memory usage. Again I ran it once with federation_verify_certificates off and once with it on; I sent the SIGUSR1 to the first case once it had settled, and to the second case when it was around 1.5 GB and about to kill the machine.

Surprisingly, there is very little difference between the two snapshots! The largest difference is only +18.3 MiB. I see two possibilities:
I might see if I can figure out how to debug/prove the second case.
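For reference, a minimal sketch of that kind of SIGUSR1-triggered dump, using only the standard library; the file path and handler name here are made up, not the actual stub:

```python
import signal
import tracemalloc

tracemalloc.start(25)  # record up to 25 stack frames per allocation

def _dump_tracemalloc_snapshot(signum, frame):
    # Write the current snapshot to disk; it can later be loaded with
    # tracemalloc.Snapshot.load() and diffed via Snapshot.compare_to().
    tracemalloc.take_snapshot().dump("/tmp/synapse-tracemalloc.dump")

signal.signal(signal.SIGUSR1, _dump_tracemalloc_snapshot)
```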
I can confirm the issue is with the change of default for federation_verify_certificates.
I found this likely-relevant issue: pyca/cryptography#4833, which is fixed in the recently-released 2.7 (my version, provided by the latest matrix-provided deb, is 2.6).
(I say likely-relevant because the leaks are specifically from ASN.1 parsing, and the issue relates to X509, which is ASN.1-encoded.)
@JJJollyjim: the server seeing issues on my side has
@JJJollyjim do you get backtraces in heaptrack to see where the leaky ASN.1 is being leaked from? Also, do you get the same behaviour if you upgrade to 2.7?
I do have the backtraces (of the C stack frames, not the Python ones): all the leaks (actually about 800 MB, now that I've figured out how to read this) come from the ASN.1 parsing mentioned above.

I could try and see if there is some allocation-tracing tool that can also collect Python function names from the stack frames. At a guess, though: are we loading the CA store every time we do some sort of SSL thing, and keeping it around, instead of just having one shared copy?
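A rough way to check that hypothesis locally, as a sketch only (it assumes pyOpenSSL and psutil are installed; the loop count and TLS method constant are arbitrary):

```python
import os

import psutil              # assumed available; used only to read the process RSS
from OpenSSL import SSL

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

contexts = []
for _ in range(200):
    ctx = SSL.Context(SSL.TLSv1_2_METHOD)
    ctx.set_default_verify_paths()   # load a fresh copy of the system CA store
    contexts.append(ctx)             # keep them alive, as pooled connections would

rss_after = proc.memory_info().rss
print("approx. per-context overhead: %.0f KiB"
      % ((rss_after - rss_before) / len(contexts) / 1024))
```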
Interesting. I've been trying a patch, and it seems to work well so far. There is still increased RAM usage, but not terribly so; there may still be room for improvement, but it was written in 10 minutes.

Basically, the way things are, each message sent over federation defines a new class and holds an instance of a class, and there appears to be no caching whatsoever. This is the modified file:

I'll see if I manage, time-wise, to add context caching and specifically look at the CA bits, but it'd be useful if someone could confirm that this is an improvement.

PS, the related file:
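To illustrate the caching idea (this is not the actual patch; it's a minimal sketch assuming pyOpenSSL, and the function name is made up):

```python
from functools import lru_cache

from OpenSSL import SSL

@lru_cache(maxsize=1)
def _shared_verify_context():
    # Build the verifying context once per process, so the CA bundle is
    # read from disk and parsed a single time rather than per connection.
    ctx = SSL.Context(SSL.TLSv1_2_METHOD)
    ctx.set_default_verify_paths()
    ctx.set_verify(SSL.VERIFY_PEER, lambda conn, cert, errno, depth, ok: ok)
    return ctx
```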
The patch definitely didn't fix the whole problem for me (can't say if it had any effect, as the memory usage is still at the point where the server crashes). Setting up a Synapse instance on my workstation to test further.
Thank you. I saw lower memory usage, but not by much.
Thanks to everyone who has helped investigate this so far, and sorry we didn't catch it before putting out the RC.
This is very helpful indeed. Some background here:

Every time we start an outbound federation connection, ClientTLSOptionsFactory.get_options is called. That in turn creates a new SSL context and loads the CA certificates into it.

In theory that should be fine, because the context (along with the loaded certs) gets thrown away when the connection is closed. However, in practice we are making several hundred outbound connections at once, and, thanks to HTTP keepalive and connection pooling, the connections stay open for several minutes.

What I therefore suspect is happening is that this is not a true leak, but simply massively increased memory usage thanks to attaching all of the CA certificates to each of the outbound connections. If Synapse lasted long enough to send all of the outbound requests and time out the connections, then we would see memory usage drop again. (The connection keepalive timeout is 120 seconds.)

Intuitively it seems stupid that we are reading the certs from disk and storing them in a separate memory store for every single outbound connection; however, I'm not aware of a way to avoid this with the OpenSSL interface. I don't think it is correct to share OpenSSL contexts between connections. Any thoughts or advice from those familiar with the OpenSSL API would be welcome.
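For illustration only (not Synapse's actual code, and the function name is made up), the per-connection pattern described above looks roughly like this:

```python
from OpenSSL import SSL

def options_for_new_connection():
    # Called for every outbound federation connection: each call builds a
    # brand-new context and loads its own copy of the CA certificate store,
    # which then stays in memory for as long as the pooled/keepalive
    # connection keeps the context alive (up to the 120 s timeout).
    context = SSL.Context(SSL.TLSv1_2_METHOD)
    context.set_default_verify_paths()
    return context
```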
https://www.postgresql.org/message-id/E1bsROr-0002Z2-25%40gemulon.postgresql.org implies that it /is/ possible to share contexts between connections, albeit fiddly (which is why postgres stopped doing so).
(#4673 was a bug that happened when we briefly tried sharing contexts: I'm not sure offhand if there is a better solution to that than having one context per connection.)
http://openssl.6102.n7.nabble.com/Possibility-to-cache-ca-bundle-and-reuse-it-between-SSL-sessions-td51090.html looks to give most of the answers on how the API is meant to be used for this scenario.
Right now, you need one context per connection, because you set the info callback (which is how you do SNI) on the context, not the connection. twisted/twisted#1128 will make it a little faster, though.
This node.js issue from 2011 also discusses the problem: https://paul.querna.org/articles/2011/04/05/openssl-memory-use/. They link this commit, nodejs/node-v0.x-archive@5c35dff, which does the caching across contexts, and mention that they plan to make node.crypto reuse contexts in future -- not sure if this ended up happening. The other issue they discuss in the article - allocation of zlib buffers for TLS compression - is likely no longer relevant, since TLS compression was turned off by default in OpenSSL 1.1.0 (2016) to mitigate the CRIME attack.
The conclusion of the discussion elsewhere is that you probably can share context objects, but you have to do some hoop-jumping to get the SNI and cert verification done right. Those interested can try playing with my
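As a rough sketch of that shared-context direction (assuming pyOpenSSL; the names are illustrative and the verification hook itself is omitted):

```python
from OpenSSL import SSL

# One context for the whole process: the CA store is loaded exactly once.
_SHARED_CONTEXT = SSL.Context(SSL.TLSv1_2_METHOD)
_SHARED_CONTEXT.set_default_verify_paths()

def connection_for(hostname: bytes) -> SSL.Connection:
    # Per-connection state (the SNI name, the hostname to verify against)
    # lives on the Connection rather than the Context, which is what lets
    # the context be shared safely.
    conn = SSL.Connection(_SHARED_CONTEXT)
    conn.set_tlsext_host_name(hostname)
    conn.set_app_data({"expected_hostname": hostname})
    return conn
```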
|
Me and others have indeed seen a huge improvement. For the record: this improvement also shows nearly 20% less RAM consumption on my biggest Matrix server compared to
Description
My main Synapse server had been running happily with 4 GB of RAM for some months. After upgrading to v1.0.0rc1, RAM usage increased a lot, leading to a system crash.
After disabling presence (the usual trick to reduce RAM usage), RAM consumption went down enough for the server to be usable, but it is still out of the ordinary.
Steps to reproduce
Version information
Version: v1.0.0rc1
Install method: pip
Note: I waited until another user on #synapse-admins:matrix.org confirmed this.