Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: potential infinite loop in resync #13001

Closed
richvdh opened this issue Jun 9, 2022 · 2 comments · Fixed by #13353
Closed

Faster joins: potential infinite loop in resync #13001

richvdh opened this issue Jun 9, 2022 · 2 comments · Fixed by #13353
Assignees
Labels
A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Jun 9, 2022

Resyncing the state for a room follows the following algorithm:

  • Find a batch of events with partial state. We choose the first 100 events ordered by arrival time. For each event in the batch:
    • Skip if we have prev_events with partial state
    • Request the state at any completely missing prev_events
    • Resolve the state across the prev_events and persist

It's possible for that to get into a loop, where we pull out the first 100 events, fail to make progress on any of them, and repeat.

if context.partial_state:
# this can happen if some or all of the event's prev_events still have
# partial state - ie, an event has an earlier stream_ordering than one
# or more of its prev_events, so we de-partial-state it before its
# prev_events.
#
# TODO(faster_joins): we probably need to be more intelligent, and
# exclude partial-state prev_events from consideration
# https://github.com/matrix-org/synapse/issues/13001
logger.warning(
"%s still has partial state: can't de-partial-state it yet",
event.event_id,
)
return

part of #12646

@richvdh richvdh added A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jun 9, 2022
@richvdh richvdh self-assigned this Jul 15, 2022
@richvdh
Copy link
Member Author

richvdh commented Jul 20, 2022

I don't think that this is actually a situation that can happen, even for one event, never mind 100.

To hit this "still has partial state" case, note that we must be processing an event E where:

  • we have one or more prev_events as non-outlier events in the database, and
  • one or more of those prev_events (say E') have partial state.

... which implies that E' was received after E (otherwise we would already have de-partial-stated it). Which in turn implies that, at the time we received E, we didn't have E' - ie, we were persisting an event without having previously persisted its prev_events. (Note that we pick events to de-partial-state in order of stream_ordering - ie, order that events were persisted - rather than received timestamp.)

But when we persist an event (as a non-outlier) without having its prev_events, then we always request the state at that point in the DAG - ie, E will not have partial state.

@richvdh
Copy link
Member Author

richvdh commented Jul 20, 2022

... except, if we receive an event as an outlier, we don't give it a new stream_ordering when we de-outlier it. So, if we receive 100 outliers, then an event early in the room history, then de-outlier the outliers, then we may hit this. Will test.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant