This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse stops responding to incoming requests if PostgreSQL stops responding #8574

Closed
PureTryOut opened this issue Oct 16, 2020 · 35 comments · Fixed by #8726
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@PureTryOut

Description

For a while I've been having an issue where, some time after a fresh boot, Synapse stops receiving (or at least handling) federation events and often even stops responding to client connections. CPU usage drops to basically 0% and only a machine reboot (not just restarting the service) fixes the issue. Strangely enough, the federation tester reports at that point that federation works fine, but clearly it doesn't.

Note that I have the federation reader and synchrotron workers enabled and they replicate via Redis.

I get some errors in the federation reader log which might be relevant:

2020-10-16 08:15:27,206 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 52, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-10-16 08:15:27,217 - synapse.util.ratelimitutils - 182 - DEBUG - PUT-52551- Ratelimit [281473023937824]: Processed req
2020-10-16 08:15:27,219 - synapse.util.async_helpers - 293 - DEBUG - PUT-52551- Releasing linearizer lock 'fed_txn_handler' for key 'en-root.org'
2020-10-16 08:15:27,222 - synapse.federation.transport.server - 428 - ERROR - PUT-52551- on_incoming_transaction failed
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/synapse/federation/transport/server.py", line 424, in on_PUT
    code, response = await self.handler.on_incoming_transaction(
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 164, in on_incoming_transaction
    return await self._transaction_resp_cache.wrap(
  File "/usr/lib/python3.8/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.8/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 190, in _on_incoming_transaction_inner
    result = await self._handle_incoming_transaction(
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 209, in _handle_incoming_transaction
    response = await self.transaction_actions.have_responded(origin, transaction)
  File "/usr/lib/python3.8/site-packages/synapse/federation/persistence.py", line 54, in have_responded
    return await self.store.get_received_txn_response(transaction_id, origin)
  File "/usr/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py", line 77, in get_received_txn_response
    return await self.db_pool.runInteraction(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 569, in runInteraction
    result = await self.runWithConnection(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 646, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python3.8/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError
2020-10-16 08:15:27,230 - synapse.http.server - 81 - ERROR - PUT-52551- Failed handle request via 'FederationSendServlet': <SynapseRequest at 0xffffa21b07f0 method='PUT' uri='/_matrix/federation/v1/send/1594808804394' clientproto='HTTP/1.0' site=8084>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.8/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 190, in _on_incoming_transaction_inner
    result = await self._handle_incoming_transaction(
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 209, in _handle_incoming_transaction
    response = await self.transaction_actions.have_responded(origin, transaction)
  File "/usr/lib/python3.8/site-packages/synapse/federation/persistence.py", line 54, in have_responded
    return await self.store.get_received_txn_response(transaction_id, origin)
  File "/usr/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py", line 77, in get_received_txn_response
    return await self.db_pool.runInteraction(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 569, in runInteraction
    result = await self.runWithConnection(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 646, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python3.8/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError

Steps to reproduce

I don't know honestly. It seems to be specific to something in my setup, but I can't pinpoint what it is.

Version information

  • Homeserver: fam-ribbers.com

If not matrix.org:

  • Version: 1.21.2

  • Install method: The Alpine Linux package manager

  • Platform: Alpine Linux, bare metal. Running on a RockPro64 so ARM64, with the PostgreSQL database running on a different (also RockPro64) machine.

@anoadragon453
Member

anoadragon453 commented Oct 19, 2020

psycopg2.InterfaceError: connection already closed

It seems that Synapse is losing its connection to Postgres. It is known that Synapse does not handle the db connection disappearing and reappearing very well. Is your Postgres overloaded or running out of memory?

If you get stuck investigating postgres, then (redacted) logs from postgres and synapse with SQL debug logging turned on would be helpful.
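
For reference, a minimal sketch of what turning on SQL debug logging might look like in log_config.yaml, assuming the standard Synapse logging config and that synapse.storage.SQL is the relevant logger (adjust to your own setup):

loggers:
    synapse.storage.SQL:
        # Beware: at DEBUG this logs query parameters, which can include access
        # tokens and message content, so redact before sharing.
        level: DEBUG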

@PureTryOut
Author

Hmm, maybe. CPU usage of the postgres process is normally ~10%, but if I open a lot of rooms one after another it jumps up a lot and starts maxing out some cores. Out of memory doesn't seem to be it; the machine running the DB has some memory left, and usage doesn't really increase if I start switching between a lot of rooms quickly.

It does seem like a probable cause yes, although I then wonder why just restarting the Synapse process and workers isn't enough 🤔

@jcgruenhage
Contributor

jcgruenhage commented Oct 20, 2020

@PureTryOut I tried to DM you about this on Matrix (because I've been running into the exact same problem since a few days ago), but it seems your server is not responding to requests right now... If we both started experiencing this a few days ago, I'd blame it on a recent Synapse update. Still, maybe it makes sense to compare our setups to help the Synapse devs with reproducing?

I'm running both Synapse and Postgres in Docker containers, with traefik routing traffic between services, both for HTTP traffic coming from outside and for the Postgres traffic from Synapse to Postgres. Any similarities in your setup? Differences I know of already: I don't run any workers, and restarting the containers is enough for me to fix it.

@ananace

ananace commented Oct 21, 2020

I've found the exact same issue happening rather often at home as well, with Synapse running in Kubernetes, with workers.
Seems to be new from around 1.21.
My setup also only needs a container recreation to get things federating again, but then that's basically the same as a reboot of a regular machine.

I'm also using traefik to get traffic into the Synapse server and the workers, but all internal communication is done over the K8s overlay network. There I'm running a triple-node HA Postgres cluster for the database, with a proxy in front of it to redirect to the active master. (The proxy forwards all traffic and generates a connection reset if the master dies, which hasn't happened during these issues.)

An example of the exceptions I'm seeing (here taken from a federation reader worker):

2020-10-21 14:16:24,458 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 52, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-10-21 14:16:24,460 - synapse.federation.transport.server - 428 - ERROR - PUT-58348- on_incoming_transaction failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/transport/server.py", line 425, in on_PUT
    origin, transaction_data
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 169, in on_incoming_transaction
    request_time,
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 191, in _on_incoming_transaction_inner
    origin, transaction, request_time
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 252, in _handle_incoming_transaction
    await self.transaction_actions.set_response(origin, transaction, 200, response)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/persistence.py", line 67, in set_response
    transaction_id, origin, code, response
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/transactions.py", line 130, in set_received_txn_response
    desc="set_received_txn_response",
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 727, in simple_insert
    await self.runInteraction(desc, self.simple_insert_txn, table, values)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 577, in runInteraction
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 647, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError
2020-10-21 14:16:24,460 - synapse.http.server - 85 - ERROR - PUT-58348- Failed handle request via 'FederationSendServlet': <SynapseRequest at 0x7efee762f6d0 method='PUT' uri='/_matrix/federation/v1/send/1602441424225' clientproto='HTTP/1.1' site=8083>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 191, in _on_incoming_transaction_inner
    origin, transaction, request_time
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 252, in _handle_incoming_transaction
    await self.transaction_actions.set_response(origin, transaction, 200, response)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/persistence.py", line 67, in set_response
    transaction_id, origin, code, response
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/transactions.py", line 130, in set_received_txn_response
    desc="set_received_txn_response",
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 727, in simple_insert
    await self.runInteraction(desc, self.simple_insert_txn, table, values)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 577, in runInteraction
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 647, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError

@PureTryOut
Author

I tried to DM you about this on matrix (because I'm running into the exact same problem since a few days ago), but it seems your server is not responding to requests right now..

Well yes, that's exactly why I made this issue, haha. I become unreachable for ages at a time, and I'm now using an old matrix.org account alongside my normal one just to keep up to date with some important rooms; it's really annoying.

I use Nginx for proxying incoming requests to the Synapse workers, but internally Synapse has a direct connection to the PostgreSQL box. PostgreSQL does run in a Docker container, but I'm not sure if that matters.

I've actually been experiencing this since around 1.20, but I'm not sure the Synapse version is to blame in my case, as I changed my database setup right when this started happening (from a beefy, power-hungry x86_64 machine to a lower-performance but low-power ARM64 machine), which might also be the cause.

@jcgruenhage
Contributor

I usually notice within a few hours, considering Matrix is my primary means of communication, but still... I've been offline again for 3 hours this morning due to this bug. @anoadragon453 Is there anything we can do to help get this fixed?

@f0x52

f0x52 commented Oct 28, 2020

psycopg2.InterfaceError: connection already closed

It seems that Synapse is losing its connection to Postgres. It is known that Synapse does not handle the db connection disappearing and reappearing very well. Is your Postgres overloaded or running out of memory?

Just had a similar issue overnight, after an automated postgres restart at 03:30, which promptly broke Synapse: low CPU, database failures and HTTP 500 responses.
Any reason why Synapse handles postgres connection loss this badly? Mastodon survives this just fine each night, and Synapse deadlocking like this has happened a few times now.

@turt2live
Member

t2bot.io also had this overnight - at roughly 08:00 UTC the database went missing from the network for 2 minutes, which sent the server into sadness. The synchrotron started consuming memory until it was eventually killed, and the workers responded with 500 errors to all other requests. This led to a 4-hour outage (though that is partially because of how I have the paging rules set up).

It used to recover fine from database problems - it might have been a bit rocky and still wanted a restart, but it at least didn't require manual intervention.

@PureTryOut
Author

Well, my homeserver has been unreachable because of this for a few days now, since I got annoyed by constantly restarting it. Element seems to be connected, but I haven't received any messages in any room for ages, even though those same rooms seem to be very active when I check them from a matrix.org account.

Could this please get some dev attention? My setup is literally unusable because of this.

@f0x52

f0x52 commented Oct 29, 2020

Happened today at exactly 03:30 again. I've now disabled my postgres restart service, but that makes it rather reproducible, doesn't it?

@anoadragon453 anoadragon453 added z-bug (Deprecated Label) p1 and removed info-needed labels Oct 29, 2020
@anoadragon453
Member

Hey all, apologies for the delay on getting this fixed and the annoyance it's caused in the meantime. We suspect this is caused by the postgres autocommit change made in #8456 that went out in v1.21.0.

It's a subtle and nasty bug. We'll try to get this sorted and a fix out asap.

@clokep clokep changed the title Synapse stops responding to incoming requests after a while Synapse stops responding to incoming requests after PostgreSQL restarts Oct 30, 2020
@PureTryOut
Author

@clokep that title seems wrong. My PostgreSQL instance doesn't restart at any point. It just gets a bit overloaded at some points after which Synapse stops responding. I never had the PostgreSQL service restart at any point.

@clokep
Member

clokep commented Nov 2, 2020

@clokep that title seems wrong. My PostgreSQL instance doesn't restart at any point. It just gets a bit overloaded at some points after which Synapse stops responding. I never had the PostgreSQL service restart at any point.

Ah, that's good info. It seemed from the conversation it was only due to restarts. Thanks!

@clokep clokep changed the title Synapse stops responding to incoming requests after PostgreSQL restarts Synapse stops responding to incoming requests if PostgreSQL stops responding Nov 2, 2020
@jcgruenhage
Contributor

I'm running into this issue roughly every other day, and it's really annoying. One of the worst parts: Synapse still seems to answer requests that don't need to touch the database just fine, so my syncs don't break and I often only realize I've run into this after hours of suspicious silence or when trying to send a message.

@anoadragon453
Member

anoadragon453 commented Nov 10, 2020

Can people try setting cp_reconnect: true under their database.args homeserver config option and see if that helps? e.g.:

database:
  name: psycopg2
  args:
    user: xxx
    password: yyy
    database: synapse
    host: localhost
    cp_min: 5
    cp_max: 10
    cp_reconnect: true

We've got a PR (#8726) targeting this issue that enables it, but it'd be good to get confirmation on whether it works early on.
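
For context, a hedged sketch of roughly how these options end up being used - illustrative only, not Synapse's exact code. The cp_* keys are pool options for Twisted's adbapi.ConnectionPool, and the remaining keys are passed through to psycopg2.connect():

from twisted.enterprise import adbapi

# With cp_reconnect=True the pool closes a connection whose transaction fails
# to roll back and opens a fresh one on the next use, instead of handing the
# dead connection back out on the next query.
pool = adbapi.ConnectionPool(
    "psycopg2",
    user="xxx",
    password="yyy",
    database="synapse",
    host="localhost",
    cp_min=5,
    cp_max=10,
    cp_reconnect=True,
)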

@jcgruenhage
Contributor

I've reproduced the issue by restarting Postgres to force a connection loss, to make sure this works for reproducing it: the logs appeared as above and message sending broke. After adding cp_reconnect: true, the "connection already closed" log lines and tracebacks disappeared, but I still can't send messages until I also restart Synapse.

@erikjohnston
Member

@jcgruenhage Are there any logs in postgres during the outage? I know sometimes after restart it can take forever for postgres to start accepting connections.

@erikjohnston
Member

@jcgruenhage If you can reproduce while having a tcpdump running, then that would be awesomely useful.

erikjohnston added a commit that referenced this issue Nov 12, 2020
`adbapi.ConnectionPool` lets you turn on auto-reconnect of DB connections. This is off by default.
As far as I can tell, if it's not enabled, dead connections never get removed from the pool.

Maybe helps #8574
@erikjohnston erikjohnston reopened this Nov 12, 2020
@PureTryOut
Author

I gave that setting a shot for a few days, but after a day or so of trying to catch up after not having incoming federation for at least a month, it stopped responding again. I saw no PostgreSQL connection errors in the log this time, but the symptoms were the same.

@erikjohnston
Member

Weird. I think we'll need TCP dumps to see what's going on.

A random stab in the dark is that the connections are being dropped/black-holed without Synapse noticing; if so, the default TCP timeout settings will mean it can take ages for the connection to finally time out. Fiddling with the tcp_keepalives_* settings to make things time out more quickly may help in such a case: https://www.postgresql.org/docs/current/runtime-config-connection.html
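
If someone wants to try this from the Synapse side as well, here is a hedged sketch (the values are illustrative, and it assumes database.args is passed straight through to psycopg2/libpq): the keepalives* keys are client-side libpq connection parameters, while the server-side tcp_keepalives_* settings from the linked docs would go in postgresql.conf instead.

database:
  name: psycopg2
  args:
    user: xxx
    password: yyy
    database: synapse
    host: localhost
    cp_min: 5
    cp_max: 10
    cp_reconnect: true
    # Client-side TCP keepalives, forwarded by psycopg2 to libpq, so the
    # client should notice a dead connection within about a minute rather
    # than waiting for the OS default timeout.
    keepalives: 1
    keepalives_idle: 10
    keepalives_interval: 10
    keepalives_count: 3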

@ananace

ananace commented Nov 18, 2020

I'm seeing a bunch of new errors after changing the setting as well; here are copies from my federation sender worker:

2020-11-18 08:37:40,582 - synapse.storage.databases.main.event_push_actions - 530 - INFO - event_push_action_stream_orderings-221- Searching for stream ordering 1 month ago
2020-11-18 08:37:40,582 - synapse.storage.txn - 512 - WARNING - event_push_action_stream_orderings-221- [TXN OPERROR] {_find_stream_orderings_for_times-f5a1a} server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
 0/5
2020-11-18 08:37:40,582 - twisted - 250 - CRITICAL - event_push_action_stream_orderings-221- Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 702, in inner_func
    return func(db_conn, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 517, in new_transaction
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 157, in rollback
    self.conn.rollback(*args, **kwargs)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 56, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-11-18 08:37:40,583 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 56, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-11-18 08:37:40,583 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 65, in rollback
    self._pool.disconnect(self._connection)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 444, in disconnect
    raise Exception("wrong connection for thread")
builtins.Exception: wrong connection for thread

2020-11-18 08:37:40,583 - synapse.metrics.background_process_metrics - 211 - ERROR - event_push_action_stream_orderings-221- Background process 'event_push_action_stream_orderings' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 505, in new_transaction
    r = func(cursor, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/event_push_actions.py", line 532, in _find_stream_orderings_for_times_txn
    txn, self._clock.time_msec() - 30 * 24 * 60 * 60 * 1000
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/event_push_actions.py", line 584, in _find_first_stream_ordering_after_ts_txn
    txn.execute("SELECT MAX(stream_ordering) FROM events")
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 266, in execute
    self._do_execute(self.txn.execute, sql, *args)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 292, in _do_execute
    return func(sql, *args)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/synapse/metrics/background_process_metrics.py", line 206, in run
    result = await result
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/event_push_actions.py", line 526, in _find_stream_orderings_for_times
    self._find_stream_orderings_for_times_txn,
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 635, in runInteraction
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 708, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 702, in inner_func
    return func(db_conn, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 517, in new_transaction
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 157, in rollback
    self.conn.rollback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 70, in rollback
    raise ConnectionLost()
twisted.enterprise.adbapi.ConnectionLost

These errors continue until I restart the container, at which point everything starts working again until the next time my postgres becomes overloaded.

@clokep
Member

clokep commented Nov 18, 2020

I think the info from #8574 (comment) is still what is needed to help debug this further! Thanks!

@ananace

ananace commented Nov 18, 2020

Unfortunately I've only run into the issue at times when I needed federation to keep working, but if it happens during non-working hours I'll do my best to grab a tcpdump as well.

@ananace

ananace commented Nov 24, 2020

So I just ran into this again. I took the time to grab a tcpdump of all PostgreSQL traffic this time; I can't really upload all of it though, as it's massive (it generates over 10 MB per second) and is completely full of sensitive information. There's not a single lost connection or response-less request in there, though.
Synapse seems to have gotten stuck constantly running a whole bunch of SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = '[...]' AND membership = 'join' queries, rapidly and repeatedly looping through all the rooms the server is part of at around twice per second. In 18 seconds of capture I can see a full 38 queries to get all members of !OGEhHVWSdvArJzumhm:matrix.org - the Matrix HQ room (with its 8k members) - for example.

Apart from those, the entire rest of the capture is just a mix of COMMIT/BEGIN ISOLATION LEVEL REPEATABLE READ/SET default_transaction_isolation TO DEFAULT, and what seems like the handling of incoming federation (inserting transactions, SELECT-ing server signatures, INSERT-ing server_keys_json, updating presence, looking at destinations), checking of group information/membership, and updating of user last-seen data.

Could it be that a cache breaks when the database connection resets?

Edit:
After a restart, another 15 seconds of capture shows a grand total of two such SELECT state_key ... queries, so it seems reasonable.

Edit:
I'm going to see if I can get a similar capture from if my federation sender worker breaks, so I can cut away any other noise.

@jcgruenhage
Contributor

I've not run into this again since the update, so it seems to be fixed, at least for me.

@PureTryOut
Author

This is most definitely still happening for me. Turned on federation again yesterday to test with Synapse 1.24, and it basically immediately stopped responding even though Element never mentions the server being offline.

@ananace

ananace commented Dec 17, 2020

Just had this happen to me again on 1.24.0, took another packet capture and I'm still seeing Synapse hammering the database to repeatedly get members of all joined rooms. This time at a lesser rate though.

Filtered out the queries for the Matrix HQ, Matrix News, Synapse Admins, and synapse-dev rooms into another loose pcap, in case it's helpful:
Synapse_bug.pcap.gz

@dimm0

dimm0 commented Dec 23, 2020

Seeing the same in Kubernetes, with Synapse 1.22.1 and Postgres 12.
Just tried tuning Postgres by giving it more memory and 15 connections; same errors.

@djmaze

djmaze commented Feb 16, 2021

I have a very similar setup to ananace's (with Docker Swarm instead of Kubernetes) and am seeing the same problems. That was on 1.22.1. Now going to see if an update to 1.26.0 helps.

Irrespective of that, especially when running Synapse under a container orchestrator it would be really helpful to have a better health check endpoint which also checks whether the database connection works. Then the orchestrator could restart the container automatically (which would be an okay workaround for this problem for now).
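
For what it's worth, a hedged sketch of wiring that up today with a Docker Compose/Swarm healthcheck, assuming curl is available in the image and the default client listener on port 8008. Note that Synapse's existing /health endpoint only confirms the HTTP listener is alive and does not touch the database, so as it stands this would not catch the failure mode described in this issue:

healthcheck:
  # Mark the container unhealthy (and let the orchestrator restart it) if the
  # listener stops answering; a DB-aware endpoint would be needed to catch
  # the hang described here.
  test: ["CMD", "curl", "-fSs", "http://localhost:8008/health"]
  interval: 30s
  timeout: 10s
  retries: 3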

@richvdh
Member

richvdh commented Apr 9, 2021

We seem to have dropped the ball (again) here. Sorry. :/

Though there's not a single lost connection or response-less request in there.

I'm going to assume that this is because the capture only started after the connection dropped - server closed the connection unexpectedly is a message from the low-level library that can only really happen when the connection drops.

Still, it's interesting to hear that you're not seeing new connection attempts.

Synapse seems to have gotten stuck constantly running a whole bunch of SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = '[...]' AND membership = 'join' queries.

This sounds like a symptom, rather than a cause, so not worth expending much effort on.

I suspect there's some sort of race between the connection dropping and our attempting to reconnect, but I really can't reproduce it or figure out what could be causing it.

Something that might be worth trying, for people affected here: set cp_noisy: true under your database.args. Also, make sure that the twisted logger is not configured to drop INFO logs in your log_config.yaml. This won't fix anything, but should give logs along the lines of:

2021-04-09 16:18:04,356 - twisted - 260 - INFO - sentinel - adbapi connecting: psycopg2
2021-04-09 16:18:04,794 - twisted - 260 - INFO - sentinel - adbapi connecting: psycopg2
2021-04-09 16:18:12,455 - twisted - 260 - INFO - sentinel - adbapi connecting: psycopg2
2021-04-09 16:18:15,044 - twisted - 260 - INFO - sentinel - adbapi connecting: psycopg2
2021-04-09 16:20:41,161 - twisted - 260 - CRITICAL - PUT-1087 - Rollback failed
2021-04-09 16:20:41,162 - twisted - 260 - INFO - PUT-1087 - adbapi closing: psycopg2
2021-04-09 16:20:41,162 - twisted - 260 - INFO - PUT-1087 - Connection lost.
2021-04-09 16:20:41,164 - twisted - 260 - CRITICAL - sentinel - Rollback failed
2021-04-09 16:20:41,164 - twisted - 260 - CRITICAL - sentinel - Rollback failed
2021-04-09 16:20:41,806 - twisted - 260 - INFO - sentinel - adbapi connecting: psycopg2

... which might help us figure out what's going on (those 'connecting' lines should appear at startup).
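
For anyone trying this, a minimal sketch of the two changes, assuming the usual homeserver.yaml and log_config.yaml layout:

# homeserver.yaml
database:
  name: psycopg2
  args:
    # ... your existing connection args ...
    cp_noisy: true

# log_config.yaml - make sure the twisted logger is at INFO or below
loggers:
    twisted:
        level: INFO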

Also: is anyone still seeing the assert not self.engine.in_transaction(conn) exception, in recent versions (since 1.24) of Synapse?

@richvdh richvdh added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter and removed z-bug (Deprecated Label) z-info-needed z-regression (Deprecated Label) labels Apr 9, 2021
@dimm0

dimm0 commented Apr 9, 2021

I added tons of resources to postgres and removed liveness probes, so it never restarts... Can't really tell

@ananace

ananace commented Apr 9, 2021

I've personally moved to a completely different postgres setup that deals with routing in a whole other manner, so I don't have the same underlying problem anymore.

@richvdh
Member

richvdh commented Apr 12, 2021

Our retry behaviour on reconnect isn't ideal: I've opened #9779 to track that, though I don't think it could be causing this problem.

@richvdh
Member

richvdh commented May 25, 2021

Since nobody is having this problem any more, I'm going to close this. If people see it again, we can reopen and continue investigating.

@richvdh richvdh closed this as completed May 25, 2021
@squahtx
Contributor

squahtx commented Nov 25, 2021

Possibly the same issue as #11167, for which we have a fix in the works.
