Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse does not recover correctly after a database server outage #11167

Closed
richvdh opened this issue Oct 23, 2021 · 9 comments · Fixed by #11376
Closed

Synapse does not recover correctly after a database server outage #11167

richvdh opened this issue Oct 23, 2021 · 9 comments · Fixed by #11376
Assignees
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Oct 23, 2021

Earlier today our database server restarted. The database recovered itself; Synapse did not. In particular we saw "in flight requests" stacking up:

image

... and the reverse-proxy returned 429s for many requests.

@richvdh
Copy link
Member Author

richvdh commented Oct 23, 2021

I think that if the event-fetch job fails with an exception (as it did), _event_fetch_ongoing is not correctly decremented, so no new event-fetch jobs are restarted.

@squahtx squahtx added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Oct 25, 2021
@richvdh
Copy link
Member Author

richvdh commented Oct 28, 2021

also: we should export _event_fetch_ongoing as a prometheus metric

@callahad callahad added P1 S-Major Major functionality / product severely impaired, no satisfactory workaround. labels Oct 28, 2021
@squahtx squahtx self-assigned this Nov 2, 2021
@squahtx
Copy link
Contributor

squahtx commented Nov 3, 2021

Event fetches explain most of the stuck requests.

There's one class of stuck request that I can't explain yet: GET FederationUserDevicesQueryServlet requests were piling up on federation_reader-1 and don't read any events as far as I can tell. Perhaps it's something to do with the @cachedList on _get_bare_e2e_cross_signing_keys_bulk? The code for @cachedList is tricky to understand but looks sound...

@squahtx
Copy link
Contributor

squahtx commented Nov 3, 2021

In testing, @cachedList handled exceptions just fine. I can't reproduce the pile up of FederationUserDevicesQueryServlet requests.

@richvdh
Copy link
Member Author

richvdh commented Nov 3, 2021

yeah, very odd. I'm also at a loss to explain it.

@squahtx
Copy link
Contributor

squahtx commented Nov 4, 2021

Fixed by #11240, except for the FederationUserDevicesQueryServlet requests which I can't reproduce.

@richvdh
Copy link
Member Author

richvdh commented Nov 4, 2021

Let's close this for now, then.

@richvdh
Copy link
Member Author

richvdh commented Nov 17, 2021

I think the problem here is that we increment _event_fetch_ongoing in the main thread, but decrement it in the event fetch thread.

Apart from the fact that integer increment/decrement operations aren't atomic in Python (so decrementing it without holding the lock is racy), we also have the problem that if we're unable to get a connection to the database (eg, because it is shutting down...), runWithConnection will fail without calling _do_fetch - hence the counter is incremented but not decremented.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants