Synapse stops responding to incoming requests if PostgreSQL stops responding #8574
It seems that Synapse is losing its connection to Postgres. It is known that Synapse does not handle the db connection disappearing and reappearing very well. Is your Postgres overloaded or running out of memory? If you get stuck investigating postgres, then (redacted) logs from postgres and synapse with SQL debug logging turned on would be helpful.
Hmm, maybe. CPU usage of the postgres process is normally ~10%, but if I start opening a lot of rooms one after another it jumps up a lot and starts maxing out some cores. Out of memory doesn't seem to be it; the machine running the DB has some left, and it doesn't really increase if I switch between a lot of rooms quickly. It does seem like a probable cause, yes, although I then wonder why just restarting the Synapse process and workers isn't enough 🤔
@PureTryOut I tried to DM you about this on matrix (because I'm running into the exact same problem since a few days ago), but it seems your server is not responding to requests right now. If we both started experiencing this a few days ago, I'd blame it on a recent synapse update. Still, maybe it makes sense to compare our setups to help the synapse devs with reproducing? I'm running both synapse and postgres in docker containers, and traefik for routing traffic between services, both for http traffic coming from outside and for getting the postgres traffic from synapse to postgres. Any similarities in your setup? Differences I know already: I don't run any workers, and restarting the containers is enough for me to fix it.
I've found the exact same issue happening rather often at home as well, with Synapse running in Kubernetes, with workers. I'm also using traefik to get traffic in to the Synapse server and the workers, but all internal communication is done over the K8s overlay network. In there I'm running a triple-node HA postgres cluster for the database, with a proxy in front of it to redirect to the active master. (The proxy forwards all traffic and generates a connection reset if the master dies, which hasn't happened during these issues.) An example of the exceptions I'm seeing (here taken from a federation reader worker):
Well yes, that's exactly why I made this issue haha. I become unreachable for ages at a time, and I'm using an old matrix.org account alongside my normal one now just to keep up to date with some important rooms; it's really annoying. I use Nginx for proxying incoming requests to the Synapse workers, but internally Synapse has a direct connection to the PostgreSQL box. PostgreSQL does run in a Docker container, but I'm not sure if that matters. I've been experiencing this since around 1.20 actually, but I'm not sure the Synapse version is to blame in my case, as I changed my database setup around the time this started happening (from a beefy, power-hungry x86_64 machine to a lower-performance but low-power ARM64 machine), which might also be the cause of this.
I usually notice within a few hours, considering matrix is my primary means of communication, but still... I've been offline again for 3 hours this morning due to this bug. @anoadragon453 Is there anything we can do to help get this fixed?
Just had a similar issue overnight, after an automated postgres restart at 03:30, which promptly made Synapse break, with low CPU, database failures, and HTTP 500 responses.
t2bot.io also had this overnight - at roughly 08:00 UTC the database went missing on the network for 2 minutes, which sent the server into sadness. The synchrotron started consuming memory until it was eventually killed, and the workers responded with 500 errors to all other requests. This led to a 4-hour outage (though that is partially because of how I have the paging rules set up). It used to recover fine from database problems - it might have been a bit rocky and still wanted a restart, but it at least didn't require manual intervention.
Well, my homeserver has been unreachable because of this for a few days now, since I got annoyed by constantly restarting it. Element seems to be connected, but I haven't received any messages in any room for ages, even though those same rooms seem to be very active if I check them from a matrix.org account. Could this please get some dev attention? My setup is literally unusable because of this.
Happened today at exactly 03:30 again. I've now disabled my postgres restart service, but that makes it rather reproducible?
Hey all, apologies for the delay in getting this fixed and the annoyance it's caused in the meantime. We suspect this is caused by the postgres autocommit change made in #8456 that went out in v1.21.0. It's a subtle and nasty bug. We'll try to get this sorted and a fix out asap.
@clokep that title seems wrong. My PostgreSQL instance doesn't restart at any point. It just gets a bit overloaded at some points, after which Synapse stops responding. I never had the PostgreSQL service restart at any point.
Ah, that's good info. It seemed from the conversation it was only due to restarts. Thanks!
I'm running into this issue roughly every other day, and it's really annoying. One of the worst parts: it seems like Synapse still answers requests that don't need to touch the database just fine, so my syncs don't break and I often only realize I've run into this after hours of suspicious silence or when trying to send a message.
Can people try setting
We've got a PR (#8726) targeting this issue that enables it, but it'd be good to get confirmation on whether it works early on. |
I've reproduced the issue by restarting pg to force a connection loss, to make sure this works for reproducing the problem: the logs appeared as above and message sending broke. After adding
@jcgruenhage Are there any logs in postgres during the outage? I know sometimes after a restart it can take forever for postgres to start accepting connections.
@jcgruenhage If you can reproduce while having a tcpdump running, then that would be awesomely useful.
`adbapi.ConnectionPool` lets you turn on auto reconnect of DB connections. This is off by default. As far as I can tell, if it's not enabled, dead connections never get removed from the pool. Maybe helps #8574
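For readers unfamiliar with the Twisted API being referenced: `twisted.enterprise.adbapi.ConnectionPool` takes a `cp_reconnect` keyword, and with it enabled the pool reconnects connections that appear dead instead of leaving them in the pool (which, per the comment above, is what happens by default). A minimal standalone sketch, with placeholder connection details rather than Synapse's actual configuration:

```python
from twisted.enterprise import adbapi
from twisted.internet import reactor

# Placeholder connection details; Synapse builds its pool from homeserver.yaml instead.
pool = adbapi.ConnectionPool(
    "psycopg2",          # name of the DB-API module to use
    host="localhost",
    dbname="synapse",
    user="synapse_user",
    password="secret",
    cp_min=5,            # pool size limits
    cp_max=10,
    cp_reconnect=True,   # reconnect dead connections instead of leaving them in the pool
)

def on_result(rows):
    print(rows)
    reactor.stop()

def on_error(failure):
    print(failure)
    reactor.stop()

pool.runQuery("SELECT version()").addCallbacks(on_result, on_error)
reactor.run()
```

Per the comment and PR reference above, #8726 is about enabling this flag for Synapse's own pool.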
I gave that setting a shot for a few days, but after a day or so of trying to catch up after not having incoming federation for at least a month, it stopped responding again. I saw no PostgreSQL connection errors in the log this time, but the symptoms were the same.
Weird. I think we'll need TCP dumps to see what's going on. A random stab in the dark is that the connections are being dropped/black-holed without Synapse noticing; if so, the default TCP timeout settings mean it can take ages for the connection to finally time out. Fiddling with the
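The truncated suggestion above presumably refers to TCP keepalive/timeout settings; treating that as an assumption, one concrete knob is libpq's keepalive parameters, which psycopg2 passes straight through to the connection and which make a silently black-holed connection fail within seconds rather than after the kernel's defaults (which can be many minutes to hours). A minimal sketch with placeholder host, credentials, and timings:

```python
import psycopg2

# Placeholder connection details; only the keepalive parameters are the point here.
conn = psycopg2.connect(
    host="db.example.com",
    dbname="synapse",
    user="synapse_user",
    password="secret",
    keepalives=1,            # enable TCP keepalives on this connection
    keepalives_idle=10,      # seconds of idleness before the first probe is sent
    keepalives_interval=10,  # seconds between unanswered probes
    keepalives_count=3,      # unanswered probes before the connection is considered dead
)

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```

Since Synapse hands the `args` block of its `database` config to psycopg2, the same keys can in principle be set there; treat that as an experiment rather than an endorsed configuration.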
I'm seeing a bunch of new errors after changing the setting as well; here are copies from my federation sender worker:
These errors continue until I restart the container, at which point everything starts working again until the next time my postgres becomes overloaded.
I think the info from #8574 (comment) is still what is needed to help debug this further! Thanks!
Unfortunately I haven't yet run into the issue at a time when I could afford to leave federation broken, but if it happens during non-working hours I'll do my best to grab a tcpdump as well.
So I just ran into this again. I took the time to grab a tcpdump of all PostgreSQL traffic this time, but I can't really upload all of it as it's massive (it generates at over 10 MB per second) and completely full of sensitive information. There's not a single lost connection or response-less request in there, though. Apart from those, the entire rest of the capture is just a mix of
Could it be that a cache breaks when the database connection resets?
Edit:
Edit:
I've not run into this again since the update, so it seems to be fixed, at least for me.
This is most definitely still happening for me. I turned on federation again yesterday to test with Synapse 1.24, and it basically immediately stopped responding, even though Element never mentions the server being offline.
Just had this happen to me again on 1.24.0; I took another packet capture and I'm still seeing Synapse hammering the database to repeatedly get the members of all joined rooms, this time at a lesser rate though. I filtered out the queries for the Matrix HQ, Matrix News, Synapse Admins, and synapse-dev rooms into another loose pcap in case it's helpful:
Seeing the same in Kubernetes, with Synapse 1.22.1 and PostgreSQL 12.
I have a very similar setup to ananace's (with Docker Swarm instead of Kubernetes) and am seeing the same problems. That was on 1.22.1; now going to see if an update to 1.26.0 helps. Irrespective of that, especially when running synapse under a container orchestrator it would be really helpful to have a better health check endpoint which also checks whether the database connection works. Then the orchestrator could restart the container automatically (which would be an okay workaround for this problem for now).
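As far as I know, Synapse's built-in `/health` endpoint only confirms that the process is serving HTTP and says nothing about the database, so a combined check currently has to live outside Synapse. A rough sketch of such a probe that an orchestrator could run, with placeholder host names and credentials (exit code 0 only if both Synapse and PostgreSQL answer):

```python
import sys
import urllib.request

import psycopg2

SYNAPSE_HEALTH_URL = "http://localhost:8008/health"  # placeholder listener address

def synapse_alive() -> bool:
    """Return True if the Synapse /health endpoint answers with 200."""
    try:
        with urllib.request.urlopen(SYNAPSE_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def database_alive() -> bool:
    """Return True if a trivial query against PostgreSQL completes."""
    try:
        conn = psycopg2.connect(
            host="db.example.com",   # placeholder connection details
            dbname="synapse",
            user="synapse_user",
            password="secret",
            connect_timeout=5,
        )
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        finally:
            conn.close()
        return True
    except psycopg2.Error:
        return False

if __name__ == "__main__":
    sys.exit(0 if synapse_alive() and database_alive() else 1)
```

Hooked into a Kubernetes liveness probe or a Docker Swarm healthcheck, this would restart the container whenever the database connection is wedged, at the cost of also restarting it during genuine database outages.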
We seem to have dropped the ball (again) here. Sorry. :/
I'm going to assume that this is because the capture only started after the connection dropped - Still, it's interesting to hear that you're not seeing new connection attempts.
This sounds like a symptom, rather than a cause, so not worth expending much effort on. I suspect there's some sort of race between the connection dropping and our attempting to reconnect, but I really can't reproduce it or figure out what could be causing it. Something that might be worth trying, for people affected here: set
... which might help us figure out what's going on (those 'connecting' lines should appear at startup). Also: is anyone still seeing the
I added tons of resources to postgres and removed liveness probes, so it never restarts... Can't really tell.
I've personally moved to a completely different postgres setup that deals with routing in a whole other manner, so I don't have the same underlying problem anymore.
Our retry behaviour on reconnect isn't ideal: I have opened #9779 to track that, though I don't think it could be causing this problem.
Since nobody is having this problem any more, I'm going to close this. If people see it again, we can reopen and continue investigating.
Possibly the same issue as #11167, for which we have a fix in the works.
Description
For a while I've been having an issue where, after a certain amount of time following a fresh boot, Synapse stops receiving (or at least handling) federation events and often even stops responding to client connections. CPU usage drops to basically 0%, and only a machine reboot (not just restarting the service) fixes the issue. Strangely enough, the federation tester at that point reports that federation works fine, but clearly it doesn't.
Note that I have the federation reader and synchrotron workers enabled and they replicate via Redis.
I get some errors in the federation reader log which might be relevant:
Steps to reproduce
I don't know honestly. It seems to be specific to something in my setup, but I can't pinpoint what it is.
Version information
If not matrix.org:
Version: 1.21.2
Install method: The Alpine Linux package manager
Platform: Alpine Linux, bare metal. Running on a RockPro64 so ARM64, with the PostgreSQL database running on a different (also RockPro64) machine.