Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Reduce spurious replication catchup #16555

Merged
merged 5 commits into from
Oct 27, 2023
Merged

Conversation

erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Oct 26, 2023

The POSITION command includes two tokens, the "prev" and "new" tokens. The "new" token is what we should set the current token for the instance to be, and the "prev" token was where it was previously (i.e. the instance produced no rows between the "prev" and "new" tokens).

This means that we can safely relax the condition to only check for missing updates if the current token for the writer is strictly less than the "prev" token.

A common case is receiving something like POSITIONS event_persister1 events 15 16, which indicates the writer advanced their events stream token from 15 to 16 (usually as a response to another instance persisting an event with stream ID 16). Usually, the other instances would already have inferred that event_persister1 must have advanced their token, so the current_token would already be set to 16, thus triggering a lookup for missing updates between tokens 16 (cmd.new_token) and 16 (current_token), which is silly.

We also move the logging back into the loop so that we don't log in the common case.

@erikjohnston erikjohnston marked this pull request as ready for review October 26, 2023 13:55
@erikjohnston erikjohnston requested a review from a team as a code owner October 26, 2023 13:55
Comment on lines 614 to 616
# If the position token matches our current token then we're up to date
# and there's nothing to do. Otherwise, fetch all updates between then
# and now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# If the position token matches our current token then we're up to date
# and there's nothing to do. Otherwise, fetch all updates between then
# and now.
# If the incoming position is at least our current position then we're
# up to date and there's nothing to do. Otherwise, fetch all updates
# between then and now.

Copy link
Contributor

@DMRobertson DMRobertson Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: I think the stream is at position 10.
I receive POSITION 4 8.
Uh oh, I might have missed stream entries 5, 6, 7 and 8. Need to rescan for them.

(Here prev < next < current.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no. POSITION 4 8 means that the new token for that instance should be set to 8, and nothing has happened between 4 and 8 from that instance. (The reason for this is that normally we'd just get a stream of RDATA from the instance, which implicitly bumps the token. If for some reason the token has advanced without any updates to send, it sends a POSITION, hence why it gives the range where things have not happened).

Copy link
Member Author

@erikjohnston erikjohnston Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# If the position token matches our current token then we're up to date
# and there's nothing to do. Otherwise, fetch all updates between then
# and now.
# If the incoming previous position is less than our current position
# then we're up to date and there's nothing to do. Otherwise, fetch
# all updates between then and now.

Maybe?

Comment on lines 618 to 620
# Note: We have to check that `current_token` is within the range, to
# handle the case where the stream gets "reset" (e.g. for `caches` and
# `typing` after the writer's restart).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Note: We have to check that `current_token` is within the range, to
# handle the case where the stream gets "reset" (e.g. for `caches` and
# `typing` after the writer's restart).
# Note: We also have to check that `current_token` is at least the
# previous position, to handle the case where the stream gets "reset"
# (e.g. for `caches` and `typing` after the writer's restart).

Copy link
Contributor

@DMRobertson DMRobertson Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: I think the stream is at position 10.
I receive POSITION 12 15.
Uh oh, I might have missed stream entries 11, 12, 13, 14, and 15. Need to rescan for them.

(Here current < prev < next.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh oh, I might have missed stream entries 11, 12, 13, 14, and 15. Need to rescan for them.

You might have missed entries 11 and 12.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does POSITION need some better documentation somewhere (either https://matrix-org.github.io/synapse/develop/development/synapse_architecture/streams.html or maybe in the code?) describing how it works? I think I was also under the impression that it told you were it had updates, not where it jumped to without updates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not blocking this PR, but would appreciate a follow-up or at least an issue.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#16560 since its a Friday afternoon and I'm liable to forget to follow up otherwise

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Note: We have to check that `current_token` is within the range, to
# handle the case where the stream gets "reset" (e.g. for `caches` and
# `typing` after the writer's restart).
# Note: We also have to check that `current_token` is at least the
# new position, to handle the case where the stream gets "reset"
# (e.g. for `caches` and `typing` after the writer's restart).

Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct? LGTM if so. Wording tweaks are just suggestions, ignore at your discretion.

# Note: We have to check that `current_token` is within the range, to
# handle the case where the stream gets "reset" (e.g. for `caches` and
# `typing` after the writer's restart).
missing_updates = not (cmd.prev_token <= current_token <= cmd.new_token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: are the inequalities correct here? E.g. should one of them be < instead of <=?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point that it looks weird. But I think they're right as we're comparing "upper bounds"

erikjohnston and others added 2 commits October 27, 2023 13:23
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
@erikjohnston erikjohnston enabled auto-merge (squash) October 27, 2023 12:42
@erikjohnston erikjohnston merged commit 89dbbd6 into develop Oct 27, 2023
41 checks passed
@erikjohnston erikjohnston deleted the erikj/reduce_spurious_fetches branch October 27, 2023 13:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants