This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Federation reader stops processing incoming requests after database crash #8470
Labels
A-Federation
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
z-bug (Deprecated Label)
z-p2 (Deprecated Label)
Description
After my postgres instance was OOM-killed (presumably an unrelated issue), my federation reader worker stopped processing incoming events (or processed them extremely slowly):
Here's the database server's memory usage chart showing the time at which the crash occurred:
Stacked up with the requests in flight (dark red is PUT FederationSendServlet on my federation_reader worker):
and the age of the last processed event (the new events that do come in are probably due to local activity?):
(I can provide other metrics graphs for this period upon request)
Note that the rest of the server continued working fine, it could exchange local messages and sync with clients without issues.
Log excerpt from the time of the crash attached. Note that the worker appears to recover: the logs continue as if it were processing incoming requests, but this is not reflected in the above graphs or in the observed behavior (messages from other servers stop coming in).
federation_reader.log.txt
Steps to reproduce
(note: I haven't attempted to reproduce this in isolation, but it has happened multiple times in situ with my current configuration)
A federation reader worker with the ^/_matrix/federation/v1/send/ endpoint and redis replication. (My worker config: federation_reader.yaml.txt)
Expected: possibly a few requests error out, but the worker should recover after the database comes back up
Actual: worker stops processing requests until killed and restarted
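To make the expected behaviour concrete, here is a minimal Python sketch. It is not Synapse code and does not reflect how the worker is actually implemented; `DatabaseUnavailable`, `FlakyDatabase` and `process_with_recovery` are hypothetical names invented for illustration. It only shows the pattern described above: requests that arrive during the outage may fail, but later requests are handled once the database is back.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error raised while the database connection is lost."""


def process_with_recovery(handle_request, requests, max_attempts=3, retry_delay=0.0):
    """Handle each request, retrying while the database is unavailable.

    Returns (processed, failed). A request is marked failed only after
    max_attempts tries, and a failure never blocks later requests.
    """
    processed, failed = [], []
    for request in requests:
        for _attempt in range(max_attempts):
            try:
                processed.append(handle_request(request))
                break
            except DatabaseUnavailable:
                # Wait before retrying; the database may come back.
                time.sleep(retry_delay)
        else:
            # All attempts failed: give up on this request only.
            failed.append(request)
    return processed, failed


class FlakyDatabase:
    """Stand-in for a database that is down for the first `outage` calls."""

    def __init__(self, outage):
        self.remaining_outage = outage

    def store(self, event):
        if self.remaining_outage > 0:
            self.remaining_outage -= 1
            raise DatabaseUnavailable("connection lost")
        return event
```

With `FlakyDatabase(outage=4)` and three requests, the first request exhausts its attempts during the outage and is reported as failed, while the remaining two are processed once the simulated database is reachable again.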
Version information
If not matrix.org:
Install method: pip
Platform: Ubuntu 18.04 VPS, not containerized.