Catch-up after Federation Outage #8096

reivilibre · 2020-08-14T19:22:20Z

There are a few points marked XXX REVIEW which I'd like help with.

This fixes #2528 by:

- having a 'catch-up' phase of federation sending
  - no new PDUs are sent in this phase, only the latest PDU per room (in order of least recent) that has changed since our last transmission to a destination
  - the destination can then /get_missing_events to get other events in rooms
  - this phase ends when there are no catch-up PDUs left; then we return to normal behaviour
- enabling this catch-up flag on startup and when a destination goes offline for so long that we stop retrying a normal transaction
  (N.B. when we hear from a destination that we are backing off from, we will also resume transmissions in the catch-up phase; this is what wake_destination is about)

Database notes:

- destination_rooms stores the reference to the latest PDU of each room that has been enqueued to be sent to a destination.
- destinations has sprouted last_successful_stream_ordering, which is the stream_ordering of the most recent successfully-sent PDU to the destination.
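
To make the two tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The cut-down schema, the helper names (`make_db`, `note_pdu_enqueued`, `get_catch_up_room_ids`) and the sample destination are assumptions for illustration only, not the PR's storage code. It shows one way rooms needing catch-up could be selected: compare each destination_rooms.stream_ordering against the destination's last_successful_stream_ordering and return the least recent first.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified versions of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE destinations (
            destination TEXT PRIMARY KEY,
            last_successful_stream_ordering INTEGER
        );
        CREATE TABLE destination_rooms (
            destination TEXT NOT NULL,
            room_id TEXT NOT NULL,
            stream_ordering INTEGER NOT NULL,
            PRIMARY KEY (destination, room_id)
        );
        """
    )
    return conn


def note_pdu_enqueued(conn, destination, room_id, stream_ordering):
    """Remember only the latest PDU enqueued for this (destination, room) pair."""
    conn.execute(
        "INSERT OR REPLACE INTO destination_rooms VALUES (?, ?, ?)",
        (destination, room_id, stream_ordering),
    )


def get_catch_up_room_ids(conn, destination):
    """Rooms whose latest PDU is newer than the last one the destination received."""
    rows = conn.execute(
        """
        SELECT dr.room_id
        FROM destination_rooms AS dr
        JOIN destinations AS d ON d.destination = dr.destination
        WHERE dr.destination = ?
          AND dr.stream_ordering > COALESCE(d.last_successful_stream_ordering, 0)
        ORDER BY dr.stream_ordering ASC
        """,
        (destination,),
    ).fetchall()
    return [row[0] for row in rows]


conn = make_db()
conn.execute("INSERT INTO destinations VALUES ('remote.example', 10)")
note_pdu_enqueued(conn, "remote.example", "!a:hs", 8)   # already delivered
note_pdu_enqueued(conn, "remote.example", "!b:hs", 15)
note_pdu_enqueued(conn, "remote.example", "!c:hs", 12)
note_pdu_enqueued(conn, "remote.example", "!b:hs", 20)  # replaces the row for 15
catch_up_rooms = get_catch_up_room_ids(conn, "remote.example")
```

In this sketch only one row per room survives, so a long outage costs at most one catch-up PDU per room regardless of how many events were missed.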

Limitations/Flaws:

we don't currently start catching-up a destination:
- on startup until we already intend to send a PDU or EDU there
  - would like to address but separate PR, since the current PR is already a vast incremental improvement
  - (unless we have a backoff from the destination and then hear from it, then it recovers itself)
  - I believe this can occur if you shut down your HS whilst still retrying transmissions to a destination (there won't be a backoff applied for that destination yet, but no events get through and nothing on startup kicks off a retransmission, since the queues are entirely in-memory).
    - to clarify: if you shut down your HS whilst retrying, then the destination comes online (say) and you restart your HS, then even when we hear from the destination, we don't think about triggering a catch-up — nothing happens until we feel like sending an EDU or PDU (in my tests, once someone starts typing, it sends a typing EDU which kicks that off)

reivilibre · 2020-08-14T19:40:19Z

Two tests fail, but I still believe this is in a reviewable state. I'll sort out the 2 failures on Monday; they can't be too bad (he said, knowing nothing at all).

reivilibre · 2020-08-18T07:35:00Z

I note that test https://buildkite.com/matrix-dot-org/synapse/builds/11053#98a85fe9-ce0d-4a7b-83dd-09812911bff7 (Python: 3.7 / Postgres: 11 / Workers FAILURE Test #595: Outbound federation can request missing events) failed the first time, then I re-ran it and it passed. Not sure if this is of concern or not…

For now I'll open this to review.

anoadragon453 · 2020-08-18T10:51:45Z

> Not sure if this is of concern or not…

It was noted in #synapse-dev that this is a known flakey test.

anoadragon453

I'd still like to see a cursory review of this from either @richvdh or @erikjohnston as they were involved in the initial design phase, but on the whole looks like an excellent stab at it.

Comments are particularly good 👍

anoadragon453 · 2020-08-19T16:14:45Z

synapse/federation/sender/per_destination_queue.py

+            # XXX REVIEW needs scrutiny
+            #  to note: up to 50 pdus can be lost from the
+            #  main queue by a transaction that triggers a backoff — do we
+            #  clear the main queue now? I can see arguments for and against.


What are those arguments?

reivilibre · 2020-08-20

D'oh, that'll teach me for not writing my full thoughts whilst I still remember what I was on about.

So, vaguely I may have had two wavelengths of thought here:

We're dropping 50 PDUs, isn't that bad?! Won't that mean we will forget to catch-up some rooms potentially?

If I recall, I appear to have been under the delusion that those PDUs would be lost altogether.

However, I now realise that the failed transaction will enable catch-up, and those PDUs will be eligible for catch-up if they are the most recent PDUs in their respective rooms (a destination_rooms entry will exist and we won't update our last_successful_stream_ordering on a failure), so the remote will still get a chance to hear about them when it next comes up. (If they aren't the most recent, then the more recent PDUs will be in the main queue, so the remote will hear about those rooms eventually!)

So: all OK here.

Is there any point in dropping 50 but not just dropping the whole queue?

Erik mentioned that sometimes, moving on to a different transaction can help a remote recover (i.e. just one particular one causes it to blow up), so 'skipping' is not necessarily bad.

I wonder if waiting for retries to exceed an hour before sweeping out the queue could be bad for memory usage (of course, it is no worse than present-day, but is this a chance to make it better?)

On the other hand, I don't suppose that /get_missing_events is the most efficient/high-throughput way to get events so forcing catch-up onto destinations too aggressively may be harmful.

Broadly I am tempted to think this is currently fine, but may need tweaking/limiting/aggressifying in the future in case that really does grow too large. I think Erik or Rich will have a more trustworthy sense here, though.

richvdh · 2020-08-26

I think it's fine for now, though once we have more confidence in the "catching up" behaviour, I think we may as well drop the whole queue immediately.

richvdh · 2020-08-26T15:33:53Z

synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql

+-- of the latest event to be enqueued for transmission to that destination.
+CREATE TABLE IF NOT EXISTS destination_rooms (
+  -- the destination in question
+  destination TEXT NOT NULL,


I'd kinda like to see the logic shuffled so that this can be a foreign key; however at the very least it needs a comment saying why it can't be one.

reivilibre · 2020-08-26

Sure, would it suffice to INSERT IGNORE a NULL row into the destinations table when upserting a destination_rooms row?

richvdh · 2020-09-02T10:40:47Z

Probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a PerDestinationQueue for the destination), but that might be fiddly, and is something we can do later.
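
As a rough illustration of the idea in this thread, the sketch below uses Python's sqlite3 module with a simplified schema. The column list, the helper name `store_destination_room` and the sample destination are assumptions, not the PR's code. Note that `INSERT IGNORE` is MySQL spelling; SQLite writes it as `INSERT OR IGNORE` and PostgreSQL as `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE destinations (
        destination TEXT PRIMARY KEY,
        retry_last_ts INTEGER,
        retry_interval INTEGER
    );
    CREATE TABLE destination_rooms (
        destination TEXT NOT NULL REFERENCES destinations (destination),
        room_id TEXT NOT NULL,
        stream_ordering INTEGER NOT NULL,
        PRIMARY KEY (destination, room_id)
    );
    """
)


def store_destination_room(conn, destination, room_id, stream_ordering):
    """Upsert a destination_rooms row, creating the parent row if it is missing."""
    # Creates a row whose other columns are NULL; does nothing if one exists,
    # so any retry state already stored for the destination is left untouched.
    conn.execute(
        "INSERT OR IGNORE INTO destinations (destination) VALUES (?)",
        (destination,),
    )
    conn.execute(
        "INSERT OR REPLACE INTO destination_rooms VALUES (?, ?, ?)",
        (destination, room_id, stream_ordering),
    )


store_destination_room(conn, "remote.example", "!a:hs", 5)
store_destination_room(conn, "remote.example", "!a:hs", 9)
```

The extra statement runs on every PDU in this sketch, which is the cost richvdh suggests avoiding later by doing it once per PerDestinationQueue.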

richvdh · 2020-08-26T16:23:25Z

synapse/federation/sender/per_destination_queue.py

+            self._catching_up = False
+            return
+
+        if self._catch_up_max_stream_order is None:


question: why do we need a high-water-mark at all? why not just keep going until get_catch_up_room_event_ids returns an empty list?

(doing so without care will introduce races though...)

reivilibre · 2020-08-26

I suppose we could do that if you are keen; as you say though, it needs more thought.

The advantage of this approach is that it's easy to keep the logic in my head. Was keen not to give a chance for any nasty bugs to crawl in, because it'd be nice to have confidence in this.
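
For readers following this thread, here is a small in-memory sketch of what a high-water mark buys. Everything in it (`FakeStore`, `catch_up`, `get_largest_stream_ordering`, `get_catch_up_rooms`, the batch size of 2) is hypothetical and written only to illustrate the trade-off; it is not the PR's implementation. The bound is captured once at the start, so PDUs enqueued while catch-up is running fall outside the range and the loop is guaranteed to finish.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FakeStore:
    """Stand-in for destination_rooms for one destination: room_id -> stream_ordering."""

    def __init__(self) -> None:
        self.latest: Dict[str, int] = {}

    def get_largest_stream_ordering(self) -> Optional[int]:
        return max(self.latest.values(), default=None)

    def get_catch_up_rooms(
        self, last_successful: int, max_order: int, limit: int = 2
    ) -> List[Tuple[str, int]]:
        # Rooms changed after the last success, but not beyond the high-water mark.
        rows = [
            (room_id, order)
            for room_id, order in self.latest.items()
            if last_successful < order <= max_order
        ]
        return sorted(rows, key=lambda row: row[1])[:limit]


def catch_up(
    store: FakeStore, last_successful: int, send: Callable[[str, int], None]
) -> int:
    """Send the latest PDU per room, oldest first, up to a fixed high-water mark."""
    max_order = store.get_largest_stream_ordering()  # captured once, up front
    if max_order is None:
        return last_successful
    while True:
        batch = store.get_catch_up_rooms(last_successful, max_order)
        if not batch:
            return last_successful  # caught up; normal sending can resume
        for room_id, order in batch:
            send(room_id, order)
            last_successful = order
```

In this sketch, a room that gains a new PDU during catch-up is simply left to the normal queue afterwards. Dropping the bound, as richvdh suggests, would instead keep looping until the query comes back empty, which needs care to avoid racing with newly enqueued PDUs.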

richvdh · 2020-08-26T16:26:14Z

synapse/federation/sender/per_destination_queue.py

+
+            # zip them together with their stream orderings
+            catch_up_pdus = [
+                (event, event.internal_metadata.stream_ordering) for event in events


as noted elsewhere: I think the order is redundant: it might be easier to get rid of it (in a separate PR).

reivilibre · 2020-08-26

I'll put it on my list, then :)

richvdh · 2020-08-26T16:27:39Z

synapse/federation/sender/per_destination_queue.py

+            if not catch_up_pdus:
+                break


when does this happen, and why is breaking the loop the correct behaviour?

reivilibre · 2020-08-26

I suppose this should log an ERROR, since this shouldn't happen.

break isn't the correct behaviour really, since it disables catch-up when we supposedly should have some to do.

But only break will let us make progress and go to the main loop again.

So I will log an error either way, but is it best to break or return?

Is this being unnecessarily paranoid?

There will be a foreign key constraint to link our rows to events, so this should darn well be impossible.

Better to assert events or if not events: raise AssertionError(...)?
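
One practical difference between the two spellings: `assert` statements are compiled out when Python runs with `-O`, whereas an explicit `raise AssertionError(...)` always executes. The sketch below shows both forms side by side; the function names and the error message are made up for illustration.

```python
def first_event_with_assert(events):
    # Under `python -O` this check disappears, and an empty list would
    # fail later with a less helpful IndexError.
    assert events, "catch-up query returned no events"
    return events[0]


def first_event_with_raise(events):
    # An explicit raise is kept regardless of optimisation flags.
    if not events:
        raise AssertionError("catch-up query returned no events")
    return events[0]
```

Either way the failure is loud, unlike a silent break, which fits the view in this thread that the condition should be impossible.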

reivilibre · 2020-08-27T07:36:43Z

@richvdh Thanks for the juicy review — some questions above on how to proceed

reivilibre · 2020-08-27T07:45:53Z

Gah, broken.

richvdh · 2020-09-02T10:29:35Z

synapse/storage/database.py

@@ -1092,7 +1092,7 @@ async def simple_select_one_onecol(
        self,
        table: str,
        keyvalues: Dict[str, Any],
-        retcol: Iterable[str],
+        retcol: str,


please can you take these fixes to a separate PR?
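
For context on why the old annotation could go unnoticed: a `str` is itself an iterable of one-character strings, so passing a single column name to a parameter annotated `Iterable[str]` is accepted both at runtime and by a type checker. The function name below is hypothetical and only demonstrates that behaviour.

```python
from collections.abc import Iterable as IterableABC
from typing import Iterable, List


def list_columns(retcols: Iterable[str]) -> List[str]:
    """Hypothetical helper: materialise the requested column names."""
    return list(retcols)


# A plain string satisfies Iterable[str], but it is split into characters.
as_string = list_columns("event_id")
as_list = list_columns(["event_id"])
string_is_iterable = isinstance("event_id", IterableABC)
```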

richvdh · 2020-09-02T10:41:21Z

synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql

+  stream_ordering INTEGER NOT NULL,
+  PRIMARY KEY (destination, room_id),
+  FOREIGN KEY (room_id) REFERENCES rooms (room_id)
+    ON DELETE CASCADE,


ON DELETE CASCADE sounds scary. Why do we do that?
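
To show what the clause in question does, here is a minimal sqlite3 sketch with a simplified schema; the table contents are invented for the demonstration. Deleting a row from rooms silently removes every destination_rooms row that references it, which is the behaviour being queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and therefore cascades) with this pragma on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE rooms (room_id TEXT PRIMARY KEY);
    CREATE TABLE destination_rooms (
        destination TEXT NOT NULL,
        room_id TEXT NOT NULL,
        stream_ordering INTEGER NOT NULL,
        PRIMARY KEY (destination, room_id),
        FOREIGN KEY (room_id) REFERENCES rooms (room_id) ON DELETE CASCADE
    );
    INSERT INTO rooms VALUES ('!a:hs'), ('!b:hs');
    INSERT INTO destination_rooms VALUES
        ('remote.example', '!a:hs', 5),
        ('remote.example', '!b:hs', 7);
    """
)

# Removing the room also removes its catch-up bookkeeping, with no error raised.
conn.execute("DELETE FROM rooms WHERE room_id = '!a:hs'")
remaining = conn.execute("SELECT room_id FROM destination_rooms").fetchall()
```

Without the cascade, the same DELETE would be rejected while referencing rows exist, so the choice is between blocking room deletion and quietly discarding catch-up state.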

richvdh · 2020-09-04T11:40:10Z

this is being superseded by smaller PRs.

reivilibre added 10 commits July 23, 2020 09:48

Fix wrong type annotation

18d900e

Confused me for a second there!

Little type hint

07a415c

Add data store functions and delta

5bc321a

Track destination_rooms entries

c60e259

Add catch-up logic

20c896a

Add tests!

d232200

Merge branch 'develop' into rei/2528_catchup_fed_outage

967d8c1

Track async/await and db_pool transitions

d910798

Newsfile

74a6f4f

Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>

Antilint

c1b32ae

reivilibre requested a review from a team August 14, 2020 19:22

reivilibre added 3 commits August 17, 2020 12:24

Fix up test

5de9313

Remove unused method

759e027

Use Python 3.5-friendly Collection type

6c52666

reivilibre removed the request for review from a team August 17, 2020 12:56

Fix logic bug in prior code

9ba56cb

reivilibre requested a review from a team August 18, 2020 07:35

reivilibre self-assigned this Aug 18, 2020

anoadragon453 self-requested a review August 19, 2020 14:11

anoadragon453 reviewed Aug 19, 2020

anoadragon453 requested review from erikjohnston and richvdh August 19, 2020 16:17

reivilibre and others added 3 commits August 20, 2020 09:37

Update changelog.d/8096.bugfix

af13948

Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>

Handle suggestions from review

558af38

How could we ever forget you, ./scripts-dev/lint.sh?

44765e9

richvdh suggested changes Aug 26, 2020

reivilibre and others added 13 commits August 26, 2020 20:08

Apply suggestions from Rich's code review

56aaa17

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>

Apply suggestions from Rich's code review

84dbc43

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>

Foreign key on rooms, SQL comment

2c740a7

NOT NULL, foreign key (events)

d77e444

SQL column doc

33874d4

Behaviour confirmed reasonable-seeming

c1a2b68

The early bird gets the early return

16eec5c

(hopefully with a worm as the return value)

Assertion on bug

92517e9

Last successful stream ordering is about destinations

ef4680d

Catch-up on all cases except federation denial

de5caf0

Don't explicitly store the event_id

3e308f9

Antilint

b0bdadd

Remove review question

843403f

reivilibre requested a review from richvdh August 27, 2020 07:34

reivilibre removed request for erikjohnston and richvdh August 27, 2020 07:45

reivilibre added 4 commits August 27, 2020 08:49

Merge branch 'develop' into rei/2528_catchup_fed_outage

ad7124d

Fix wrong type signatures (even if str is Iterable[str]…)

b1fd67b

Fix the tests after removing event_id column

e6890c7

Antilint

7cfecf3

reivilibre requested a review from richvdh August 27, 2020 08:02

reivilibre added 4 commits August 27, 2020 09:10

Also fix simple_select_onecol_txn

bf51d2f

Antilint again :(

7589a03

Merge remote-tracking branch 'origin/develop' into rei/2528_catchup_fed_outage

8d9f4ba

Merge remote-tracking branch 'origin/develop' into rei/2528_catchup_fed_outage

b60ad35

richvdh reviewed Sep 2, 2020

richvdh closed this Sep 4, 2020
