Synapse does not recover correctly after a database server outage #11167

richvdh · 2021-10-23T21:10:41Z

Earlier today our database server restarted. The database recovered itself; Synapse did not. In particular we saw "in flight requests" stacking up:

... and the reverse-proxy returned 429s for many requests.

richvdh · 2021-10-23T21:32:39Z

I think that if the event-fetch job fails with an exception (as it did), _event_fetch_ongoing is not correctly decremented, so no new event-fetch jobs are started.
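To illustrate the suspected failure mode, here is a minimal toy model, not Synapse's actual code. `EventFetcher`, `run_job`, `MAX_JOBS` and `demo` are hypothetical names, and `_ongoing` only stands in for `_event_fetch_ongoing`. It shows how a counter that is decremented only on the success path can wedge at its limit, and how a `finally` block avoids that.

```python
import threading


class EventFetcher:
    """Toy model of a bounded set of fetch jobs (all names are hypothetical)."""

    MAX_JOBS = 2

    def __init__(self):
        self._lock = threading.Lock()
        self._ongoing = 0  # stands in for _event_fetch_ongoing

    def run_job(self, job, safe):
        """Run `job` if below the limit; return whether it was started."""
        with self._lock:
            if self._ongoing >= self.MAX_JOBS:
                return False  # at capacity, so no new job is started
            self._ongoing += 1

        if safe:
            try:
                job()
            except Exception:
                pass  # a real implementation would report the failure
            finally:
                # Runs whether or not job() raised.
                with self._lock:
                    self._ongoing -= 1
        else:
            try:
                job()
                with self._lock:
                    self._ongoing -= 1  # skipped when job() raises
            except Exception:
                pass
        return True


def failing_job():
    raise RuntimeError("database went away")


def demo(safe):
    """Try to start four failing jobs; return (started flags, final counter)."""
    fetcher = EventFetcher()
    started = [fetcher.run_job(failing_job, safe) for _ in range(4)]
    return started, fetcher._ongoing
```

In this model, `demo(safe=False)` starts only the first two jobs and leaves the counter stuck at the limit, while `demo(safe=True)` starts all four and ends with the counter back at zero.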

richvdh · 2021-10-28T13:44:04Z

also: we should export _event_fetch_ongoing as a prometheus metric

squahtx · 2021-11-03T15:46:56Z

Event fetches explain most of the stuck requests.

There's one class of stuck request that I can't explain yet: GET FederationUserDevicesQueryServlet requests were piling up on federation_reader-1 and don't read any events as far as I can tell. Perhaps it's something to do with the @cachedList on _get_bare_e2e_cross_signing_keys_bulk? The code for @cachedList is tricky to understand but looks sound...

squahtx · 2021-11-03T18:32:39Z

In testing, @cachedList handled exceptions just fine. I can't reproduce the pile up of FederationUserDevicesQueryServlet requests.

richvdh · 2021-11-03T23:45:34Z

yeah, very odd. I'm also at a loss to explain it.

squahtx · 2021-11-04T10:39:42Z

Fixed by #11240, except for the FederationUserDevicesQueryServlet requests which I can't reproduce.

richvdh · 2021-11-04T12:30:01Z

Let's close this for now, then.

richvdh · 2021-11-17T08:30:52Z

Sadly I think we still have wedging event fetch queries. https://prometheus.matrix.org/graph?g0.expr=min_over_time(synapse_event_fetch_ongoing%5B10m%5D)%3E%200&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=17m2s852ms&g0.end_input=2021-11-17%2008%3A28%3A10&g0.moment_input=2021-11-17%2008%3A28%3A10:

richvdh · 2021-11-17T12:37:30Z

I think the problem here is that we increment _event_fetch_ongoing in the main thread, but decrement it in the event fetch thread.

Apart from the fact that integer increment/decrement operations aren't atomic in Python (so decrementing it without holding the lock is racy), we also have the problem that if we're unable to get a connection to the database (eg, because it is shutting down...), runWithConnection will fail without calling _do_fetch - hence the counter is incremented but not decremented.
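A rough sketch of that second failure mode, under stated assumptions: `FetchTracker`, `start_buggy`, `start_fixed` and `ConnectionUnavailable` are hypothetical names, `_run_with_connection` is only a stand-in for `runWithConnection`, and everything runs synchronously rather than on a separate thread to keep the example short. It is one possible way to keep the counter balanced, not necessarily what the eventual fix does.

```python
import threading


class ConnectionUnavailable(Exception):
    """Hypothetical error raised when no database connection can be obtained."""


class FetchTracker:
    def __init__(self, db_up):
        self.db_up = db_up
        self._lock = threading.Lock()
        self.ongoing = 0  # stands in for _event_fetch_ongoing

    def _run_with_connection(self, func):
        # Stand-in for runWithConnection: fails *before* func is ever called
        # when the database is unavailable.
        if not self.db_up:
            raise ConnectionUnavailable("database is shutting down")
        return func()

    def _do_fetch_buggy(self):
        try:
            pass  # the event fetching itself would happen here
        finally:
            self.ongoing -= 1  # decrement without the lock, and only if we got this far

    def start_buggy(self):
        with self._lock:
            self.ongoing += 1
        try:
            self._run_with_connection(self._do_fetch_buggy)
        except ConnectionUnavailable:
            pass  # _do_fetch_buggy never ran, so the counter is never decremented

    def _do_fetch(self):
        pass  # the event fetching itself would happen here

    def start_fixed(self):
        with self._lock:
            self.ongoing += 1
        try:
            self._run_with_connection(self._do_fetch)
        except ConnectionUnavailable:
            pass  # a real implementation would report the failure to waiters
        finally:
            # The decrement wraps connection acquisition too, and holds the lock.
            with self._lock:
                self.ongoing -= 1
```

With `db_up=False`, `start_buggy()` leaves `ongoing` at 1, whereas `start_fixed()` returns it to 0. The design point is that the increment and the decrement are paired in one place, so no failure between them can skip the decrement.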

squahtx added the T-Defect label Oct 25, 2021

callahad added P1 S-Major labels Oct 28, 2021

squahtx self-assigned this Nov 2, 2021

squahtx mentioned this issue Nov 3, 2021

Track ongoing event fetches correctly in the presence of failure #11240

Merged

richvdh closed this as completed Nov 4, 2021

MadLittleMods mentioned this issue Nov 10, 2021

Federation reader stops processing incoming requests after database crash #8470

Closed

richvdh reopened this Nov 17, 2021

squahtx mentioned this issue Nov 17, 2021

Track ongoing event fetches correctly (again) #11376

Merged

squahtx mentioned this issue Nov 25, 2021

Synapse stops responding to incoming requests if PostgreSQL stops responding #8574

Closed

squahtx closed this as completed in #11376 Nov 26, 2021

kittykat added the z-p1 label Sep 6, 2022
