Synapse stops responding to incoming requests if PostgreSQL stops responding #8574
It seems that Synapse is losing its connection to Postgres. It is known that Synapse does not handle the db connection disappearing and reappearing very well. Is your Postgres overloaded or running out of memory? If you get stuck investigating postgres, then (redacted) logs from postgres and synapse with SQL debug logging turned on would be helpful.
Hmm, maybe. CPU usage of the postgres process is normally ~10%, but if I start opening a lot of rooms one after another it jumps up a lot and starts maxing out some cores. Out of memory doesn't seem to be it; the machine running the DB has some left, and it doesn't really increase if I switch between a lot of rooms quickly. It does seem like a probable cause, yes, although I then wonder why just restarting the Synapse process and workers isn't enough 🤔
@PureTryOut I tried to DM you about this on matrix (because I'm running into the exact same problem since a few days ago), but it seems your server is not responding to requests right now. If we both started experiencing this a few days ago, I'd blame it on a recent synapse update. Still, maybe it makes sense to compare our setups to help the synapse devs with reproducing? I'm running both synapse and postgres in docker containers, and traefik for routing traffic between services, both for http traffic coming from outside and for getting the postgres traffic from synapse to postgres. Any similarities in your setup? Differences I know already: I don't run any workers, and restarting the containers is enough for me to fix it.
I've found the exact same issue happening rather often at home as well, with Synapse running in Kubernetes, with workers. I'm also using traefik to get traffic in to the Synapse server and the workers, but all internal communication is done over the K8s overlay network. In there I'm running a triple-node HA postgres cluster for the database, with a proxy in front of it to redirect to the active master. (The proxy forwards all traffic and generates a connection reset if the master dies, which hasn't happened during these issues.) An example of the exceptions I'm seeing (here taken from a federation reader worker):
Well yes, that's exactly why I made this issue haha. I become unreachable for ages at a time, and I'm using an old matrix.org account alongside my normal one now just to keep up to date with some important rooms; it's really annoying. I use Nginx for proxying incoming requests to the Synapse workers, but internally Synapse has a direct connection to the PostgreSQL box. PostgreSQL does run in a Docker container, but I'm not sure if that matters. I've been experiencing this since around 1.20 actually, but I'm not sure the Synapse version is to blame in my case, as I changed my database setup around the time this started happening (from a beefy, power-hungry x86_64 machine to a lower-performance but low-power ARM64 machine), which might also be the cause of this.
I usually notice within a few hours, considering matrix is my primary means of communication, but still... I've been offline again for 3 hours this morning due to this bug. @anoadragon453 Is there anything we can do to help get this fixed?
Just had a similar issue overnight, after an automated postgres restart at 03:30, which promptly made Synapse break, with low CPU, database failures, and HTTP 500 responses.
t2bot.io also had this overnight - at roughly 08:00 UTC the database went missing on the network for 2 minutes, which sent the server into sadness. The synchrotron started consuming memory until it was eventually killed, and the workers responded with 500 errors to all other requests. This led to a 4-hour outage (though that is partially because of how I have the paging rules set up). It used to recover fine from database problems - it might have been a bit rocky and still wanted a restart, but it at least didn't require manual intervention.
Well, my homeserver has been unreachable because of this for a few days now, since I got annoyed by constantly restarting it. Element seems to be connected, but I haven't received any messages in any room for ages, even though those same rooms seem to be very active if I check them from a matrix.org account. Could this please get some dev attention? My setup is literally unusable because of this.
Happened today at exactly 03:30 again. I've now disabled my postgres restart service, but that makes it rather reproducible?
Hey all, apologies for the delay in getting this fixed and the annoyance it's caused in the meantime. We suspect this is caused by the postgres autocommit change made in #8456 that went out in v1.21.0. It's a subtle and nasty bug. We'll try to get this sorted and a fix out asap.
@clokep that title seems wrong. My PostgreSQL instance doesn't restart at any point. It just gets a bit overloaded at some points, after which Synapse stops responding. I never had the PostgreSQL service restart at any point.
Ah, that's good info. It seemed from the conversation it was only due to restarts. Thanks!
I'm running into this issue roughly every other day, and it's really annoying. One of the worst parts: it seems like Synapse still answers requests that don't need to touch the database just fine, so my syncs don't break and I often only realize I've run into this after hours of suspicious silence or when trying to send a message.
Can people try setting
We've got a PR (#8726) targeting this issue that enables it, but it'd be good to get confirmation on whether it works early on. |
I've reproduced the issue by restarting pg to force a connection loss, to make sure this works for reproducing the problem: the logs appeared as above and message sending broke. After adding
@jcgruenhage Are there any logs in postgres during the outage? I know sometimes after a restart it can take forever for postgres to start accepting connections.
@jcgruenhage If you can reproduce while having a tcpdump running, then that would be awesomely useful.
`adbapi.ConnectionPool` lets you turn on auto reconnect of DB connections. This is off by default. As far as I can tell, if it's not enabled, dead connections never get removed from the pool. Maybe helps #8574
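For readers unfamiliar with the Twisted API being referenced: `twisted.enterprise.adbapi.ConnectionPool` takes a `cp_reconnect` keyword, and with it enabled the pool reconnects connections that appear dead instead of leaving them in the pool (which, per the comment above, is what happens by default). A minimal standalone sketch, with placeholder connection details rather than Synapse's actual configuration:

```python
from twisted.enterprise import adbapi
from twisted.internet import reactor

# Placeholder connection details; Synapse builds its pool from homeserver.yaml instead.
pool = adbapi.ConnectionPool(
    "psycopg2",          # name of the DB-API module to use
    host="localhost",
    dbname="synapse",
    user="synapse_user",
    password="secret",
    cp_min=5,            # pool size limits
    cp_max=10,
    cp_reconnect=True,   # reconnect dead connections instead of leaving them in the pool
)

def on_result(rows):
    print(rows)
    reactor.stop()

def on_error(failure):
    print(failure)
    reactor.stop()

pool.runQuery("SELECT version()").addCallbacks(on_result, on_error)
reactor.run()
```

Per the comment and PR reference above, #8726 is about enabling this flag for Synapse's own pool.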
I gave that setting a shot for a few days, but after a day or so of trying to catch up after not having incoming federation for at least a month, it stopped responding again. I saw no PostgreSQL connection errors in the log this time, but the symptoms were the same.
Weird. I think we'll need TCP dumps to see what's going on. A random stab in the dark is that the connections are being dropped/black-holed without Synapse noticing; if so, the default TCP timeout settings mean it can take ages for the connection to finally time out. Fiddling with the
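The truncated suggestion above presumably refers to TCP keepalive/timeout settings; treating that as an assumption, one concrete knob is libpq's keepalive parameters, which psycopg2 passes straight through to the connection and which make a silently black-holed connection fail within seconds rather than after the kernel's defaults (which can be many minutes to hours). A minimal sketch with placeholder host, credentials, and timings:

```python
import psycopg2

# Placeholder connection details; only the keepalive parameters are the point here.
conn = psycopg2.connect(
    host="db.example.com",
    dbname="synapse",
    user="synapse_user",
    password="secret",
    keepalives=1,            # enable TCP keepalives on this connection
    keepalives_idle=10,      # seconds of idleness before the first probe is sent
    keepalives_interval=10,  # seconds between unanswered probes
    keepalives_count=3,      # unanswered probes before the connection is considered dead
)

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```

Since Synapse hands the `args` block of its `database` config to psycopg2, the same keys can in principle be set there; treat that as an experiment rather than an endorsed configuration.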
I'm seeing a bunch of new errors after changing the setting as well; here are copies from my federation sender worker:
These errors continue until I restart the container, at which point everything starts working again until the next time my postgres becomes overloaded.
I think the info from #8574 (comment) is still what is needed to help debug this further! Thanks!
Unfortunately I haven't yet run into the issue at a time when I could afford to leave federation broken, but if it happens during non-working hours I'll do my best to grab a tcpdump as well.
So I just ran into this again. I took the time to grab a tcpdump of all PostgreSQL traffic this time, but I can't really upload all of it as it's massive (it generates at over 10 MB per second) and completely full of sensitive information. There's not a single lost connection or response-less request in there, though. Apart from those, the entire rest of the capture is just a mix of
Could it be that a cache breaks when the database connection resets?
Edit:
Edit:
I've not run into this again since the update, so it seems to be fixed, at least for me.
This is most definitely still happening for me. I turned on federation again yesterday to test with Synapse 1.24, and it basically immediately stopped responding, even though Element never mentions the server being offline.
Just had this happen to me again on 1.24.0; I took another packet capture and I'm still seeing Synapse hammering the database to repeatedly get the members of all joined rooms, this time at a lesser rate though. I filtered out the queries for the Matrix HQ, Matrix News, Synapse Admins, and synapse-dev rooms into another loose pcap in case it's helpful:
Seeing the same in Kubernetes, with Synapse 1.22.1 and PostgreSQL 12.
I have a very similar setup to ananace's (with Docker Swarm instead of Kubernetes) and am seeing the same problems. That was on 1.22.1; now going to see if an update to 1.26.0 helps. Irrespective of that, especially when running synapse under a container orchestrator it would be really helpful to have a better health check endpoint which also checks whether the database connection works. Then the orchestrator could restart the container automatically (which would be an okay workaround for this problem for now).
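As far as I know, Synapse's built-in `/health` endpoint only confirms that the process is serving HTTP and says nothing about the database, so a combined check currently has to live outside Synapse. A rough sketch of such a probe that an orchestrator could run, with placeholder host names and credentials (exit code 0 only if both Synapse and PostgreSQL answer):

```python
import sys
import urllib.request

import psycopg2

SYNAPSE_HEALTH_URL = "http://localhost:8008/health"  # placeholder listener address

def synapse_alive() -> bool:
    """Return True if the Synapse /health endpoint answers with 200."""
    try:
        with urllib.request.urlopen(SYNAPSE_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def database_alive() -> bool:
    """Return True if a trivial query against PostgreSQL completes."""
    try:
        conn = psycopg2.connect(
            host="db.example.com",   # placeholder connection details
            dbname="synapse",
            user="synapse_user",
            password="secret",
            connect_timeout=5,
        )
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        finally:
            conn.close()
        return True
    except psycopg2.Error:
        return False

if __name__ == "__main__":
    sys.exit(0 if synapse_alive() and database_alive() else 1)
```

Hooked into a Kubernetes liveness probe or a Docker Swarm healthcheck, this would restart the container whenever the database connection is wedged, at the cost of also restarting it during genuine database outages.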
We seem to have dropped the ball (again) here. Sorry. :/
I'm going to assume that this is because the capture only started after the connection dropped - Still, it's interesting to hear that you're not seeing new connection attempts.
This sounds like a symptom, rather than a cause, so not worth expending much effort on. I suspect there's some sort of race between the connection dropping and our attempting to reconnect, but I really can't reproduce it or figure out what could be causing it. Something that might be worth trying, for people affected here: set
... which might help us figure out what's going on (those 'connecting' lines should appear at startup). Also: is anyone still seeing the
I added tons of resources to postgres and removed liveness probes, so it never restarts... Can't really tell.
I've personally moved to a completely different postgres setup that deals with routing in a whole other manner, so I don't have the same underlying problem anymore.
Our retry behaviour on reconnect isn't ideal: I have opened #9779 to track that, though I don't think it could be causing this problem.
Since nobody is having this problem any more, I'm going to close this. If people see it again, we can reopen and continue investigating.
Possibly the same issue as #11167, for which we have a fix in the works.
Description
For a while I've been having an issue where, after a certain amount of time following a fresh boot, Synapse stops receiving (or at least handling) federation events and often even stops responding to client connections. CPU usage drops to basically 0%, and only a machine reboot (not just restarting the service) fixes the issue. Strangely enough, the federation tester at that point reports that federation works fine, but clearly it doesn't.
Note that I have the federation reader and synchrotron workers enabled and they replicate via Redis.
I get some errors in the federation reader log which might be relevant:
Steps to reproduce
I don't know honestly. It seems to be specific to something in my setup, but I can't pinpoint what it is.
Version information
If not matrix.org:
Version: 1.21.2
Install method: The Alpine Linux package manager
Platform: Alpine Linux, bare metal. Running on a RockPro64 so ARM64, with the PostgreSQL database running on a different (also RockPro64) machine.