This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Catch-up after Federation Outage #8096

Closed
wants to merge 38 commits into from

Conversation

reivilibre
Contributor

@reivilibre reivilibre commented Aug 14, 2020

There are a few points marked XXX REVIEW which I'd like help with.

This fixes #2528 by:

  • having a 'catch-up' phase of federation sending
    • no new PDUs are sent in this phase, only the latest PDU per room (in order of least recent) that has changed since our last transmission to a destination
    • the destination can then /get_missing_events to get other events in rooms
    • this phase ends when there are no catch-up PDUs left; then we return to normal behaviour
  • enabling this catch-up flag on startup and when a destination goes offline for so long that we stop retrying a normal transaction
  • (N.B. when we hear from a destination that we are backing off from, we will also resume transmissions in the catch-up phase — this is what wake_destination is about)

Database notes:

  • destination_rooms stores, per destination and room, a reference to the latest PDU that has been enqueued for sending to that destination.
  • destinations has sprouted a last_successful_stream_ordering column: the stream_ordering of the most recent PDU successfully sent to that destination. (See the sketch below for how the two fit together.)
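
A rough sketch of how the catch-up phase might tie these two tables together. get_catch_up_room_event_ids, _catching_up, _catch_up_max_stream_order and last_successful_stream_ordering are names that appear in the PR; the other helpers and the overall control flow here are illustrative only, not the actual PerDestinationQueue code:

```python
# Illustrative sketch only -- a simplified view of the catch-up phase.
async def _catch_up(self) -> None:
    # stream_ordering of the last PDU this destination is known to have received
    last_successful = await self._store.get_last_successful_stream_ordering(
        self._destination
    )  # hypothetical accessor over the `destinations` column

    if self._catch_up_max_stream_order is None:
        # High-water-mark: catch up to "now"; anything newer goes via the main queue.
        self._catch_up_max_stream_order = self._store.get_room_max_stream_ordering()

    while self._catching_up:
        # Latest event ID per room (least recent first), as recorded in
        # destination_rooms, with last_successful < stream_ordering <= high-water-mark.
        event_ids = await self._store.get_catch_up_room_event_ids(
            self._destination, last_successful, self._catch_up_max_stream_order
        )
        if not event_ids:
            # Caught up: return to normal transaction sending.
            self._catching_up = False
            break

        for event in await self._store.get_events_as_list(event_ids):
            # Send only the latest PDU per room; the destination can call
            # /get_missing_events to fill in the gaps.
            await self._transaction_manager.send_new_transaction(
                self._destination, [event], []
            )
            # Record progress so a later failure or restart resumes from here.
            last_successful = event.internal_metadata.stream_ordering
            await self._store.set_last_successful_stream_ordering(
                self._destination, last_successful
            )
```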

Limitations/Flaws:

  • we don't currently start catching-up a destination:
    • on startup until we already intend to send a PDU or EDU there
      • would like to address but separate PR, since the current PR is already a vast incremental improvement
      • (unless we have a backoff from the destination and then hear from it, then it recovers itself)
      • I believe this can occur if you shut down your HS whilst still retrying transmissions to a destination (there won't be a backoff applied for that destination yet, but no events get through and nothing on startup kicks off a retransmission, since the queues are entirely in-memory).
        • to clarify: if you shut down your HS whilst retrying, then the destination comes online (say) and you restart your HS, then even when we hear from the destination, we don't think about triggering a catch-up — nothing happens until we feel like sending an EDU or PDU (in my tests, once someone starts typing, it sends a typing EDU which kicks that off)

@reivilibre reivilibre requested a review from a team August 14, 2020 19:22
@reivilibre
Contributor Author

Two CI failures, but I still believe it to be in a reviewable state. I'll sort the two failures on Monday; they can't be too bad (he said, knowing nothing at all).

@reivilibre reivilibre removed the request for review from a team August 17, 2020 12:56
@reivilibre
Contributor Author

I note that test https://buildkite.com/matrix-dot-org/synapse/builds/11053#98a85fe9-ce0d-4a7b-83dd-09812911bff7 (Python: 3.7 / Postgres: 11 / Workers; Test #595: Outbound federation can request missing events) failed the first time, then I re-ran it and it passed. Not sure if this is of concern or not…

For now I'll open this to review.

@reivilibre reivilibre requested a review from a team August 18, 2020 07:35
@anoadragon453
Member

Not sure if this is of concern or not…

It was noted in #synapse-dev that this is a known flaky test.

@reivilibre reivilibre self-assigned this Aug 18, 2020
@anoadragon453 anoadragon453 self-requested a review August 19, 2020 14:11
Member

@anoadragon453 anoadragon453 left a comment


I'd still like to see a cursory review of this from either @richvdh or @erikjohnston as they were involved in the initial design phase, but on the whole looks like an excellent stab at it.

Comments are particularly good 👍

synapse/storage/data_stores/main/transactions.py (outdated; resolved)
synapse/storage/data_stores/main/transactions.py (outdated; resolved)
synapse/storage/data_stores/main/transactions.py (outdated; resolved)
synapse/storage/data_stores/main/transactions.py (outdated; resolved)
synapse/federation/sender/per_destination_queue.py (outdated; resolved)
tests/federation/test_federation_catch_up.py (outdated; resolved)
tests/federation/test_federation_catch_up.py (outdated; resolved)
# XXX REVIEW needs scrutiny
# to note: up to 50 pdus can be lost from the
# main queue by a transaction that triggers a backoff — do we
# clear the main queue now? I can see arguments for and against.
Member


What are those arguments?

Contributor Author


D'oh, that'll teach me to write my full thoughts down whilst I still remember what I was on about.

So, vaguely, I may have had two trains of thought here:

  1. We're dropping 50 PDUs, isn't that bad?! Won't that mean we will potentially forget to catch up some rooms?

If I recall, I appear to have been under the delusion that those PDUs would be lost altogether.

However, now I realise that this will enable catch-up, and those PDUs will be eligible for catch-up if they are the most recent PDUs in their respective rooms (a destination_rooms entry will exist and we won't update our last_successful_stream_ordering on a failure…) — so the remote will still get a chance to hear about them when they next come up. (If they aren't, then the more recent PDUs will be in the main queue so the remote will hear eventually!)

So: all OK here.

  2. Is there any point in dropping 50 but not just dropping the whole queue?
  • Erik mentioned that sometimes, moving on to a different transaction can help a remote recover (i.e. just one particular one causes it to blow up), so 'skipping' is not necessarily bad.
  • I wonder if waiting for retries to exceed an hour before sweeping out the queue could be bad for memory usage (of course, no worse than present-day, but… is this a chance to make it better?)
  • On the other hand, I don't suppose that /get_missing_events is the most efficient/high-throughput way to get events, so forcing catch-up onto destinations too aggressively may be harmful.

Broadly I am tempted to think this is currently fine, but it may need tweaking/limiting/making more aggressive in the future in case that queue really does grow too large. I think Erik or Rich will have a more trustworthy sense here, though.

Member


I think it's fine for now, though once we have more confidence in the "catching up" behaviour, I think we may as well drop the whole queue immediately.

changelog.d/8096.bugfix (outdated; resolved)
reivilibre and others added 3 commits August 20, 2020 09:37
synapse/federation/sender/__init__.py (outdated; resolved)
-- of the latest event to be enqueued for transmission to that destination.
CREATE TABLE IF NOT EXISTS destination_rooms (
-- the destination in question
destination TEXT NOT NULL,
Member


I'd kinda like to see the logic shuffled so that this can be a foreign key; however at the very least it needs a comment saying why it can't be one.

Contributor Author

@reivilibre reivilibre Aug 26, 2020


Sure, would it suffice to INSERT IGNORE a NULL row into the destinations table when upserting a destination_rooms row?

Member


probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a PerDestinationQueue for the destination), but that might be fiddly, and is something we can do later.
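
A minimal sketch of that suggestion, using the Postgres/SQLite upsert equivalent of the INSERT IGNORE idea above (the helper name is hypothetical; it assumes the store's usual transaction wrapper and native upsert support):

```python
def _ensure_destination_row_txn(self, txn, destination: str) -> None:
    # Insert a stub row into `destinations` if none exists yet, so that
    # destination_rooms.destination can safely carry a foreign key to it.
    txn.execute(
        """
        INSERT INTO destinations (destination) VALUES (?)
        ON CONFLICT (destination) DO NOTHING
        """,
        (destination,),
    )
```

Calling it once when the PerDestinationQueue for a destination is first created, rather than on every PDU, would match the suggestion above.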

self._catching_up = False
return

if self._catch_up_max_stream_order is None:
Member


question: why do we need a high-water-mark at all? why not just keep going until get_catch_up_room_event_ids returns an empty list?

Member


(doing so without care will introduce races though...)

Contributor Author


I suppose we could do that if you are keen — as you say though, needs more thought.

The advantage of this approach is that it's easy to keep the logic in my head. I was keen not to give a chance for any nasty bugs to crawl in, because it'd be nice to have confidence in this.
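
For comparison, a sketch of the loop-until-empty shape being floated, without the high-water-mark (an illustrative fragment only; the race mentioned above is flagged in the comment):

```python
while True:
    # Ask for any rooms still behind for this destination.
    event_ids = await self._store.get_catch_up_room_event_ids(
        self._destination, last_successful_stream_ordering
    )
    if not event_ids:
        # No rooms left behind, so leave catch-up mode.
        # Race: a PDU enqueued after the query above but before this flag is
        # cleared could be missed unless the caller re-checks the main queue.
        self._catching_up = False
        break
    ...  # send the PDUs and advance last_successful_stream_ordering as usual
```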


# zip them together with their stream orderings
catch_up_pdus = [
(event, event.internal_metadata.stream_ordering) for event in events
Member


as noted elsewhere: I think the order is redundant: it might be easier to get rid of it (in a separate PR).

Contributor Author


I'll put it on my list, then :)

synapse/federation/sender/per_destination_queue.py (outdated; resolved)
Comment on lines 514 to 515
if not catch_up_pdus:
break
Member


when does this happen, and why is breaking the loop the correct behaviour?

Contributor Author


I suppose this should log an ERROR, since this shouldn't happen.

Contributor Author


break isn't really the correct behaviour, since it disables catch-up when we supposedly still have some to do.

But only break will let us make progress and go to the main loop again.

So I will log an error either way, but is it best to break or return?

Contributor Author


Is this being unnecessarily paranoid?

There will be a foreign key constraint to link our rows to events, so this should darn well be impossible.

Better to assert events or if not events: raise AssertionError(...)?
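
A sketch of the defensive variant being discussed (whether to break, return, or raise is exactly the open question above; logger is assumed to be the module-level logger):

```python
if not events:
    # "Should" be impossible: destination_rooms rows are expected to carry a
    # foreign key to events, so every catch-up event ID ought to resolve.
    # Shout loudly rather than silently dropping out of catch-up mode.
    logger.error(
        "Catch-up for %s found no events for IDs %s",
        self._destination,
        event_ids,
    )
    break  # or `return` / `raise AssertionError(...)`, per the discussion
```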

synapse/storage/databases/main/transactions.py (outdated; resolved)
@reivilibre reivilibre requested a review from richvdh August 27, 2020 07:34
@reivilibre
Contributor Author

@richvdh Thanks for the juicy review — some questions above on how to proceed

@reivilibre
Contributor Author

Gah, broken.

@reivilibre reivilibre requested a review from richvdh August 27, 2020 08:02
@@ -1092,7 +1092,7 @@ async def simple_select_one_onecol(
self,
table: str,
keyvalues: Dict[str, Any],
retcol: Iterable[str],
retcol: str,
Member


please can you take these fixes to a separate PR?

-- of the latest event to be enqueued for transmission to that destination.
CREATE TABLE IF NOT EXISTS destination_rooms (
-- the destination in question
destination TEXT NOT NULL,

stream_ordering INTEGER NOT NULL,
PRIMARY KEY (destination, room_id),
FOREIGN KEY (room_id) REFERENCES rooms (room_id)
ON DELETE CASCADE,
Member


ON DELETE CASCADE sounds scary. Why do we do that?

@richvdh
Member

richvdh commented Sep 4, 2020

this is being superseded by smaller PRs.

@richvdh richvdh closed this Sep 4, 2020

Successfully merging this pull request may close these issues.

Homeservers don't catch up with missed traffic until someone sends another event