This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federation reader stops processing incoming requests after database crash #8470

Closed
chr-1x opened this issue Oct 6, 2020 · 5 comments
Labels
A-Federation T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@chr-1x

chr-1x commented Oct 6, 2020

Description

Following my postgres instance being OOM-killed (a presumably unrelated issue), my federation reader worker stops processing incoming events (or processes them extremely slowly).

Here's the database server's memory usage chart showing the time at which the crash occurred:
image

Stacked up with the requests-in-flight (dark red is PUT FederationSendServlet on my federation_reader worker):
image

And the age of the last processed event (the new events that do come in are probably due to local activity?):
image

(I can provide other metrics graphs for this period upon request)

Note that the rest of the server continued working fine; it could exchange local messages and sync with clients without issues.

Log excerpt from the time of the crash attached. Note that it appears to recover: the logs continue as if it were processing incoming requests, but this isn't reflected in the above graphs or in the observed behavior (messages from other servers stop coming in).
federation_reader.log.txt

Steps to reproduce

(note: I haven't attempted to reproduce this in isolation, but it has happened multiple times in situ with my current configuration)

  1. Set up the homeserver with a postgres database, a separate synapse.app.generic_worker handling the ^/_matrix/federation/v1/send/ endpoint, and redis replication.

     (My worker config: federation_reader.yaml.txt)

  2. Kill postgres
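To make the routing in step 1 concrete, here is a minimal sketch of which request paths the pattern `^/_matrix/federation/v1/send/` would hand to the federation reader. The helper name and the sample paths are made up for illustration; in a real deployment the reverse proxy applies the pattern, not Python code like this.

```python
import re

# Pattern copied from step 1. A reverse proxy would normally apply it;
# this only illustrates which paths it selects.
FEDERATION_SEND = re.compile(r"^/_matrix/federation/v1/send/")


def handled_by_federation_reader(path: str) -> bool:
    """Hypothetical helper: True if the path matches the worker's pattern."""
    return FEDERATION_SEND.match(path) is not None


# Sample paths, invented for this example.
sample_paths = [
    "/_matrix/federation/v1/send/1601942400000",
    "/_matrix/federation/v1/event/abc",
    "/_matrix/client/r0/sync",
]
routing = {path: handled_by_federation_reader(path) for path in sample_paths}
```

Only the first sample path matches, so only incoming federation transactions would reach the worker described in this report.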

Expected: possibly a few requests error out, but the worker should recover after the database comes back up

Actual: worker stops processing requests until killed and restarted
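As a rough illustration of the expected behavior, the sketch below retries a query with exponential backoff until a database comes back, and gives up with an error rather than hanging. This is not Synapse's database layer: `FlakyDatabase`, `DatabaseUnavailable`, and `run_with_reconnect` are hypothetical names invented for the example, and the outage is simulated.

```python
import time


class DatabaseUnavailable(Exception):
    """Stand-in for a driver-level connection error (hypothetical)."""


class FlakyDatabase:
    """Toy database that is down for the first `outage_calls` queries."""

    def __init__(self, outage_calls: int) -> None:
        self.remaining_outage = outage_calls

    def query(self, sql: str) -> str:
        if self.remaining_outage > 0:
            self.remaining_outage -= 1
            raise DatabaseUnavailable("connection refused")
        return f"ok: {sql}"


def run_with_reconnect(db, sql, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Retry a query with exponential backoff until the database answers."""
    for attempt in range(max_attempts):
        try:
            return db.query(sql)
        except DatabaseUnavailable:
            if attempt == max_attempts - 1:
                raise  # give up: surface the error instead of hanging
            sleep(base_delay * 2 ** attempt)


# The first three queries fail (the "outage"), the fourth succeeds.
db = FlakyDatabase(outage_calls=3)
result = run_with_reconnect(db, "SELECT 1")
```

The point of the sketch is the shape of the behavior: a few requests fail during the outage, then processing resumes without restarting the process.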

Version information

  • Homeserver: matrix.cybre.space

If not matrix.org:

  • Version:
{
   "python_version": "3.6.8",
   "server_version": "1.20.1 (b=master,86a72d1)"
}
  • Install method: pip

  • Platform: Ubuntu 18.04 VPS, not containerized.

@chr-1x
Author

chr-1x commented Oct 10, 2020

Opened an issue for the apparent cause of my database crashes here: #8516

@clokep
Member

clokep commented Oct 14, 2020

I can't seem to find an open issue for it, but @erikjohnston assures me that this is an old known issue not only with the federation reader, but in general that Synapse does not like it if the postgresql database goes away while it is running.

@clokep clokep added z-bug (Deprecated Label) z-p2 (Deprecated Label) labels Oct 14, 2020
@erikjohnston
Member

I have no idea why Synapse doesn't recover, it really really should

@MadLittleMods
Contributor

Related to #11167

@MadLittleMods MadLittleMods added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Nov 10, 2021
@MadLittleMods
Contributor

MadLittleMods commented Nov 10, 2021

Squinting at the requests-in-flight graphs from this issue and #11167, they behave in similar ways (although it's pretty hard to distinguish the handlers by color), and both issues involve database failures, so I think we can assume this issue was also solved ⏩


4 participants