-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Conversation
Confused me for a second there!
synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql
2 fails, but I still believe it to be in a reviewable state, I'll sort the 2 failures on monday, they can't be too bad (he said, knowing nothing at all). |
I note that test https://buildkite.com/matrix-dot-org/synapse/builds/11053#98a85fe9-ce0d-4a7b-83dd-09812911bff7 (Python: 3.7 / For now I'll open this to review. |
It was noted in #synapse-dev that this is a known flakey test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to see a cursory review of this from either @richvdh or @erikjohnston as they were involved in the initial design phase, but on the whole looks like an excellent stab at it.
Comments are particularly good 👍
synapse/storage/data_stores/main/schema/delta/58/11recovery_after_outage.sql
Outdated
Show resolved
Hide resolved
# XXX REVIEW needs scrutiny | ||
# to note: up to 50 pdus can be lost from the | ||
# main queue by a transaction that triggers a backoff — do we | ||
# clear the main queue now? I can see arguments for and against. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are those arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D'ow, that'll teach me for not writing my full thoughts whilst I still remember what I was on about.
So, vaguely I may have had two wavelengths of thought here:
- We're dropping 50 PDUs, isn't that bad?! Won't that mean we will forget to catch-up some rooms potentially?
If I recall, I appear to have been under the delusion that those PDUs would be lost altogether.
However, now I realise that this will enable catch-up, and those PDUs will be eligible for catch-up if they are the most recent PDUs in their respective rooms (a destination_rooms
entry will exist and we won't update our last_successful_stream_ordering
on a failure…) — so the remote will still get a chance to hear about them when they next come up. (If they aren't, then the more recent PDUs will be in the main queue so the remote will hear eventually!)
So: all OK here.
- Is there any point in dropping 50 but not just dropping the whole queue?
- Erik mentioned that sometimes, moving on to a different transaction can help a remote recover (i.e. just one particular one causes it to blow up), so 'skipping' is not necessarily bad.
- I wonder if waiting for retries to exceed an hour before sweeping out the queue could be bad for memory usage (of course, no worse than present-day, but …. is this a chance to make it better?)
- On the other hand, I don't suppose that /get_missing_events is the most efficient/high-throughput way to get events so forcing catch-up onto destinations too aggressively may be harmful.
Broadly I am tempted to think this is currently fine, but may need tweaking/limiting/aggressifying in the future in case that really does grow too large. I think Erik or Rich will have a more trustworthy sense here, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine for now, though once we have more confidence in the "catching up" behaviour, I think we may as well drop the whole queue immediately.
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql
Show resolved
Hide resolved
-- of the latest event to be enqueued for transmission to that destination. | ||
CREATE TABLE IF NOT EXISTS destination_rooms ( | ||
-- the destination in question | ||
destination TEXT NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd kinda like to see the logic shuffled so that this can be a foreign key; however at the very least it needs a comment saying why it can't be one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, would it suffice to INSERT IGNORE
a NULL row into the destinations table when upserting a destination_rooms
row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a PerDestinationQueue
for the destination), but that might be fiddly, and is something we can do later.
synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql
Outdated
Show resolved
Hide resolved
synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql
Outdated
Show resolved
Hide resolved
self._catching_up = False | ||
return | ||
|
||
if self._catch_up_max_stream_order is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why do we need a high-water-mark at all? why not just keep going until get_catch_up_room_event_ids
returns an empty list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(doing so without care will introduce races though...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could do that if you are keen — as you say though, needs more thought.
The advantage of this approach is that it's easy to keep the logic in my head. Was keen not to give a chance for any nasty bugs to crawl in, because it'd be nice to have confidence in this.
|
||
# zip them together with their stream orderings | ||
catch_up_pdus = [ | ||
(event, event.internal_metadata.stream_ordering) for event in events |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as noted elsewhere: I think the order
is redundant: it might be easier to get rid of it (in a separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put it on my list, then :)
if not catch_up_pdus: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when does this happen, and why is breaking the loop the correct behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this should be log an ERROR, since this shouldn't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
break
isn't the correct behaviour really, since it disables catch-up when we supposedly should have some to do.
But only break
will let us make progress and go to the main loop again.
So I will log an error either way, but is it best to break or return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being unnecessarily paranoid?
There will be a foreign key constraint to link our rows to events, so this should darn well be impossible.
Better to assert events
or if not events: raise AssertionError(...)
?
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
(hopefully with a worm as the return value)
@richvdh Thanks for the juicy review — some questions above on how to proceed |
Gah, broken. |
@@ -1092,7 +1092,7 @@ async def simple_select_one_onecol( | |||
self, | |||
table: str, | |||
keyvalues: Dict[str, Any], | |||
retcol: Iterable[str], | |||
retcol: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please can you take these fixes to a separate PR?
-- of the latest event to be enqueued for transmission to that destination. | ||
CREATE TABLE IF NOT EXISTS destination_rooms ( | ||
-- the destination in question | ||
destination TEXT NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a PerDestinationQueue
for the destination), but that might be fiddly, and is something we can do later.
stream_ordering INTEGER NOT NULL, | ||
PRIMARY KEY (destination, room_id), | ||
FOREIGN KEY (room_id) REFERENCES rooms (room_id) | ||
ON DELETE CASCADE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ON DELETE CASCADE
sounds scary. Why do we do that?
this is being superceded by smaller PRs. |
There are a few points marked
XXX REVIEW
which I'd like help with.This fixes #2528 by:
/get_missing_events
to get other events in roomswake_destination
is about)Database notes:
destination_rooms
stores the reference to the latest PDU of each room that has been enqueued to be sent to a destination.destinations
has sproutedlast_successful_stream_ordering
which is thestream_ordering
of the most recent successfully-sent PDU to the destinationLimitations/Flaws: