Faster joins: potential infinite loop in resync #13001

richvdh · 2022-06-09T08:29:59Z

Resyncing the state for a room uses the following algorithm:

Find a batch of events with partial state. We choose the first 100 events ordered by arrival time. For each event in the batch:
- Skip if we have prev_events with partial state
- Request the state at any completely missing prev_events
- Resolve the state across the prev_events and persist

It's possible for that to get into a loop, where we pull out the first 100 events, fail to make progress on any of them, and repeat.
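To make the failure mode concrete, here is a minimal toy model of one resync pass. It is a sketch, not Synapse's code: the `Event` class, `resync_pass`, and the in-memory `dict` are all hypothetical, the batch is ordered by `stream_ordering`, and fetching state for missing prev_events is reduced to a comment. With a batch size of 2, both events in the batch depend on a partial-state event that never makes it into the batch, so every pass makes zero progress.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    # Hypothetical stand-in for a persisted event; not a Synapse class.
    event_id: str
    stream_ordering: int
    prev_event_ids: list[str] = field(default_factory=list)
    partial_state: bool = True


def resync_pass(events: dict[str, Event], batch_size: int = 100) -> int:
    """Run one pass over the first `batch_size` partial-state events.

    Returns the number of events that were de-partial-stated.
    """
    batch = sorted(
        (e for e in events.values() if e.partial_state),
        key=lambda e: e.stream_ordering,
    )[:batch_size]

    progressed = 0
    for event in batch:
        known_prevs = [events[p] for p in event.prev_event_ids if p in events]
        if any(p.partial_state for p in known_prevs):
            # Skip: a prev_event still has partial state.
            continue
        # A real implementation would request state at any missing
        # prev_events here, then resolve and persist. Omitted in this toy.
        event.partial_state = False
        progressed += 1
    return progressed


def make_events() -> dict[str, Event]:
    # A and B were persisted before their shared prev_event C.
    return {
        "A": Event("A", stream_ordering=1, prev_event_ids=["C"]),
        "B": Event("B", stream_ordering=2, prev_event_ids=["C"]),
        "C": Event("C", stream_ordering=3),
    }


# Batch of 2: only A and B are considered, both are skipped, C is never reached.
stuck_events = make_events()
stuck_progress = [resync_pass(stuck_events, batch_size=2) for _ in range(3)]

# Batch of 3: C is cleared in the first pass, A and B in the second.
ok_events = make_events()
ok_progress = [resync_pass(ok_events, batch_size=3) for _ in range(2)]
```

In this model a caller that simply repeats `resync_pass` until no partial-state events remain would spin forever in the first case, which is the loop described above.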

synapse/synapse/handlers/federation_event.py

Lines 527 to 540 in 7c6b220

    
    if context.partial_state:
        # this can happen if some or all of the event's prev_events still have
        # partial state - ie, an event has an earlier stream_ordering than one
        # or more of its prev_events, so we de-partial-state it before its
        # prev_events.
        #
        # TODO(faster_joins): we probably need to be more intelligent, and
        #    exclude partial-state prev_events from consideration
        #    https://github.com/matrix-org/synapse/issues/13001
        logger.warning(
            "%s still has partial state: can't de-partial-state it yet",
            event.event_id,
        )
        return

part of #12646

richvdh · 2022-07-20T15:10:04Z

I don't think that this is actually a situation that can happen, even for one event, never mind 100.

To hit this "still has partial state" case, note that we must be processing an event E where:

- we have one or more prev_events as non-outlier events in the database, and
- one or more of those prev_events (say E') have partial state.

... which implies that E' was received after E (otherwise we would already have de-partial-stated it). Which in turn implies that, at the time we received E, we didn't have E' - ie, we were persisting an event without having previously persisted its prev_events. (Note that we pick events to de-partial-state in order of stream_ordering - ie, order that events were persisted - rather than received timestamp.)

But when we persist an event (as a non-outlier) without having its prev_events, then we always request the state at that point in the DAG - ie, E will not have partial state.

richvdh · 2022-07-20T17:10:47Z

... except, if we receive an event as an outlier, we don't give it a new stream_ordering when we de-outlier it. So, if we receive 100 outliers, then an event early in the room history, then de-outlier the outliers, then we may hit this. Will test.
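The scenario can be sketched with a toy store. Everything here is an assumption made for illustration (`ToyStore`, `persist`, `de_outlier`, `partial_state_batch` and the event IDs are hypothetical, not Synapse APIs), and the batch size is shrunk from 100 to 3. The one property it models is the one described above: de-outliering keeps the original `stream_ordering`.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class StoredEvent:
    # Hypothetical row in a toy event store; not a Synapse class.
    event_id: str
    stream_ordering: int
    prev_event_ids: list[str]
    outlier: bool
    partial_state: bool


class ToyStore:
    def __init__(self) -> None:
        self._next_ordering = count(1)
        self.events: dict[str, StoredEvent] = {}

    def persist(self, event_id, prev_event_ids, *, outlier, partial_state=False):
        # A stream_ordering is assigned on first persist, outlier or not.
        ev = StoredEvent(
            event_id, next(self._next_ordering), list(prev_event_ids),
            outlier, partial_state,
        )
        self.events[event_id] = ev
        return ev

    def de_outlier(self, event_id, *, partial_state):
        ev = self.events[event_id]
        ev.outlier = False
        ev.partial_state = partial_state
        # stream_ordering is deliberately left unchanged.
        return ev

    def partial_state_batch(self, batch_size):
        partial = [
            e for e in self.events.values()
            if e.partial_state and not e.outlier
        ]
        return sorted(partial, key=lambda e: e.stream_ordering)[:batch_size]


def blocked(store: ToyStore, ev: StoredEvent) -> bool:
    # True if any non-outlier prev_event we hold still has partial state.
    return any(
        p in store.events
        and not store.events[p].outlier
        and store.events[p].partial_state
        for p in ev.prev_event_ids
    )


BATCH = 3  # stands in for the real batch size of 100
store = ToyStore()

# 1. Receive a batch's worth of outliers (stream_ordering 1..BATCH).
for i in range(BATCH):
    store.persist(f"$outlier{i}", ["$early"], outlier=True)

# 2. Receive an event early in the room history, with partial state.
store.persist("$early", [], outlier=False, partial_state=True)

# 3. De-outlier the outliers; they keep their earlier stream_ordering.
for i in range(BATCH):
    store.de_outlier(f"$outlier{i}", partial_state=True)

batch = store.partial_state_batch(BATCH)
batch_ids = [ev.event_id for ev in batch]
stuck = all(blocked(store, ev) for ev in batch)
```

Here the batch contains only the former outliers, each of which is blocked on `$early`, and `$early` itself sorts after them, so it is never picked up.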

richvdh added the A-Federated-Join and T-Defect labels Jun 9, 2022

richvdh added this to the Q2 2022 ─ Faster joins phase 2: correctness milestone Jun 9, 2022

richvdh self-assigned this Jul 15, 2022

richvdh modified the milestones: Q2 2022 ─ Faster joins phase 2: correctness, Q3 2022: Faster joins: fix major known bugs for monoliths Jul 20, 2022

This was referenced Jul 21, 2022

Fix infinite loop in partial-state resync #13353

Merged

Regression test for infinite-loop-in-resync matrix-org/complement#418

Merged

richvdh closed this as completed in #13353 Jul 26, 2022
