Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 22, 2022

Addressing:

-- #10737

So Synapse is fast enough to merge this MSC2716 Complement test for importing many batches, matrix-org/complement#214 (comment)


Complement tests: matrix-org/complement#214

Dev notes

COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestImportHistoricalMessages/parallel/Federation/Backfill_still_works_after_many_contiguous_batches_are_imported -p 1

Why are we seeing some of these historical events being rejected?

I think it's because of changes to the auth-event reconciliation in #12943 (comment) which I was brought in on but didn't realize the magnitude of the change since the MSC2716 tests still passed. Although not sure I realized that it actually removed one the MSC2716 tests from the Synapse code base.

event's auth_events are different to our calculated auth_events. Claimed but not calculated: [<FrozenEventV3 event_id=$iTgahP49zNUQYuUfVnC0bkfNZxX_dJqu4KXUdL_GvBY, type=m.room.power_levels, state_key=, outlier=0>, <FrozenEventV3 event_id=$ycSMCM1-gaeUIFP_oR8FflBu8DSv5hfYBOk2SbHXsik, type=m.room.create, state_key=, outlier=0>, <FrozenEventV3 event_id=$COUGQXCMRhmShggKIuD7E2PiGbOKbMeXylf2RmYByIg, type=m.room.member, state_key=@maria:hs1, outlier=1>]. Calculated but not claimed: []

While checking auth of <FrozenEventV3 event_id=$VM325DZMXTz9mjGGG6pfAIK0R6ubNdmNIf5YIMdWk3o, type=m.room.message, state_key=None, outlier=False> against room state before the event: 403: User @maria:hs1 not in room !vXsKJXhHXQtktxuNfm:hs1 (None)

Why aren't we sorting topologically when receiving backfill events?

See #11114 (comment)

How is stream_ordering given out?

stream_id_gen
stream_ordering_manager

Persisting events

See _persist_events_and_state_updates for where we normally assign stream_ordering and continue down to _persist_events_txn.

_process_pulled_events
_process_pulled_event
_process_received_pdu
_run_push_actions_and_persist_event
persist_events_and_notify
persist_events
self._event_persist_queue
_process_event_persist_queue_task
_persist_event_batch
_persist_events_and_state_updates
_persist_events_txn
_store_event_txn

_process_pulled_event
_process_received_pdu
_check_event_auth

Pretty print list

This one is great if you're printing a JSON-like thing:

import json

json.dumps(expected_event_order, indent=4)

This one sucks because it doesn't print the first and last items on indented new lines

import pprint

pprint.pformat(expected_event_order, indent=4)

Random

\\n["|'], 
", \n\t

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

Fix #13856

`_invalidate_caches_for_event` doesn't run in monolith mode which
means we never even tried to clear the `have_seen_event` and other
caches. And even in worker mode, it only runs on the workers, not
the master (AFAICT).

Additionally there is bug with the key being wrong so
`_invalidate_caches_for_event` never invalidates the
`have_seen_event` cache even when it does run.

Wrong:
```py
self.have_seen_event.invalidate((room_id, event_id))
```

Correct:
```py
self.have_seen_event.invalidate(((room_id, event_id),))
```
Fix #13856

`_invalidate_caches_for_event` doesn't run in monolith mode which
means we never even tried to clear the `have_seen_event` and other
caches. And even in worker mode, it only runs on the workers, not
the master (AFAICT).

Additionally there is bug with the key being wrong so
`_invalidate_caches_for_event` never invalidates the
`have_seen_event` cache even when it does run.

Wrong:
```py
self.have_seen_event.invalidate((room_id, event_id))
```

Correct:
```py
self.have_seen_event.invalidate(((room_id, event_id),))
```
…rder) and persist in the oldest -> newest to get the least missing prev_event fetch thrashing
@MadLittleMods MadLittleMods changed the title Maddlittlemods/msc2716 many batches optimization Optimize backfill receiving to have less prev_event thrashing Sep 22, 2022
@MadLittleMods MadLittleMods changed the title Optimize backfill receiving to have less prev_event thrashing Optimize backfill receiving to have less missing prev_event thrashing Sep 22, 2022
…in so everyhting is valid

We are going to lose the benefit of keeping the join noise out of the timeline.
And will probably have to hide "historical" state on the client.
Comment on lines 872 to 1309
for event, _ in known_event_dict.values():
if event.internal_metadata.outlier:
self.fail("Our pristine events should not be marked as an outlier")

self.get_success(
self.hs.get_federation_event_handler()._process_pulled_events(
self.OTHER_SERVER_NAME,
[
# Make copies of events since Synapse modifies the
# internal_metadata in place and we want to keep our
# pristine copies
make_event_from_dict(pulled_event.get_pdu_json(), room_version)
for pulled_event in pulled_events
],
backfilled=True,
)
)

from_token = self.get_success(
self.hs.get_event_sources().get_current_token_for_pagination(room_id)
)
actual_events_in_room_reverse_chronological, _ = self.get_success(
main_store.paginate_room_events(
room_id, from_key=from_token.room_key, limit=100, direction="b"
)
)

# We have to reverse the list to make it chronological.
actual_events_in_room_chronological = list(
reversed(actual_events_in_room_reverse_chronological)
)

expected_event_order = [
# Beginning of room (oldest messages)
# *list(state_map.values()),
room_create_event,
as_membership_event,
pl_event,
state_map.get((EventTypes.JoinRules, "")),
state_map.get((EventTypes.RoomHistoryVisibility, "")),
event_before,
# HISTORICAL MESSAGE END
insertion_event,
historical_message_event,
batch_event,
base_insertion_event,
# HISTORICAL MESSAGE START
event_after,
# Latest in the room (newest messages)
]

event_id_diff = {event.event_id for event in expected_event_order} - {
event.event_id for event in actual_events_in_room_chronological
}
event_diff_ordered = [
event for event in expected_event_order if event.event_id in event_id_diff
]
event_id_extra = {
event.event_id for event in actual_events_in_room_chronological
} - {event.event_id for event in expected_event_order}
event_extra_ordered = [
event
for event in actual_events_in_room_chronological
if event.event_id in event_id_extra
]
assertion_message = (
"Debug info:\nActual events missing from expected list: %s\nActual events contain %d additional events compared to expected: %s\nExpected event order: %s\nActual event order: %s"
% (
json.dumps(
[_debug_event_string(event) for event in event_diff_ordered],
indent=4,
),
len(event_extra_ordered),
json.dumps(
[_debug_event_string(event) for event in event_extra_ordered],
indent=4,
),
json.dumps(
[_debug_event_string(event) for event in expected_event_order],
indent=4,
),
json.dumps(
[
_debug_event_string(event)
for event in actual_events_in_room_chronological
],
indent=4,
),
)
)

# assert (
# actual_events_in_room_chronological == expected_event_order
# ), assertion_message

self.assertEqual(
[event.event_id for event in actual_events_in_room_chronological],
[event.event_id for event in expected_event_order],
assertion_message,
)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice test for figuring out the mess with historical events from MSC2716 being rejected.

It eliminates all the federation variables when trying to do the same thing in Complement. And is so much faster to iterate on. Seconds vs minutes.

# It all works when I add a prev_event for the floating
# insertion event but the event no longer floats.
# It's able to resolve state at the prev_events though.
prev_event_ids=[event_before.event_id],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can work without connecting it to event_before but then we rely on maria_membership_event not being gossiped about during backfill because if it is, then it will be rejected and we can't use a rejected event to auth the following historical events.

_process_pulled_event
_process_pulled_event
_compute_event_context_with_maybe_missing_prevs
compute_event_context

The reason it works when it's not gossiped about is that _compute_event_context_with_maybe_missing_prevs fills in the state_ids_before_event and resolves the state magically for us without rejecting.

logger.info(
"Event %s is missing prev_events %s: calculating state for a "
"backwards extremity",
event_id,
shortstr(missing_prevs),
)
# Calculate the state after each of the previous events, and
# resolve them to find the correct state at the current event.
try:
# Determine whether we may be about to retrieve partial state
# Events may be un-partial stated right after we compute the partial state
# flag, but that's okay, as long as the flag errs on the conservative side.
partial_state_flags = await self._store.get_partial_state_events(seen)
partial_state = any(partial_state_flags.values())
# Get the state of the events we know about
ours = await self._state_storage_controller.get_state_groups_ids(
room_id, seen, await_full_state=False
)
# state_maps is a list of mappings from (type, state_key) to event_id
state_maps: List[StateMap[str]] = list(ours.values())
# we don't need this any more, let's delete it.
del ours
# Ask the remote server for the states we don't
# know about
for p in missing_prevs:
logger.info("Requesting state after missing prev_event %s", p)
with nested_logging_context(p):
# note that if any of the missing prevs share missing state or
# auth events, the requests to fetch those events are deduped
# by the get_pdu_cache in federation_client.
remote_state_map = (
await self._get_state_ids_after_missing_prev_event(
dest, room_id, p
)
)
state_maps.append(remote_state_map)
room_version = await self._store.get_room_version_id(room_id)
state_map = await self._state_resolution_handler.resolve_events_with_store(
room_id,
room_version,
state_maps,
event_map={event_id: event},
state_res_store=StateResolutionStore(self._store),
)
except Exception:
logger.warning(
"Error attempting to resolve state at missing prev_events",
exc_info=True,
)
raise FederationError(
"ERROR",
403,
"We can't get valid state history.",
affected=event_id,
)
return await self._state_handler.compute_event_context(
event, state_ids_before_event=state_map, partial_state=partial_state
)

I have a suspicion that we will also stop doing sort of thing since we also removed auth event resolving in #12943

@@ -458,6 +457,7 @@ async def handle_batch_of_events(
events_to_create=events_to_create,
room_id=room_id,
inherited_depth=inherited_depth,
state_chain_event_id_to_connect_to=state_chain_event_id_to_connect_to,
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes make this go from before -> after where we now connect historical batch to the historical state chain. And connect the state chain to the prev_event so everything has valid prev_events and auth_events to resolve organically from the DAG.

We do lose the benefit of removing the noise @mxid joined the room noise between each batch but we might have to solve this on the client by hiding historical state.

Before

flowchart BT
    A --- annotation1>"Note: older events are at the top"]
    subgraph live timeline
        marker1>m.room.marker] ----> B -----------------> A
    end
    
    subgraph batch0
        batch0-batch[[m.room.batch]] --> batch0-2(("2")) --> batch0-1((1)) --> batch0-0((0)) --> batch0-insertion[/m.room.insertion\]
    end
    subgraph batch1
        batch1-batch[[m.room.batch]] --> batch1-2(("2")) --> batch1-1((1)) --> batch1-0((0)) --> batch1-insertion[/m.room.insertion\]
    end
    
    subgraph batch2
        batch2-batch[[m.room.batch]] --> batch2-2(("2")) --> batch2-1((1)) --> batch2-0((0)) --> batch2-insertion[/m.room.insertion\]
    end
    batch0-insertion -.-> memberBob0(["m.room.member (bob)"]) --> memberAlice0(["m.room.member (alice)"])
    batch1-insertion -.-> memberBob1(["m.room.member (bob)"]) --> memberAlice1(["m.room.member (alice)"])
    batch2-insertion -.-> memberBob2(["m.room.member (bob)"]) --> memberAlice2(["m.room.member (alice)"])
    marker1 -.-> batch0-insertionBase
    batch0-insertionBase[/m.room.insertion\] ---------------> A
    batch0-batch -.-> batch0-insertionBase
    batch1-batch -.-> batch0-insertion
    batch2-batch -.-> batch1-insertion
    %% make the annotation links invisible
    linkStyle 0 stroke-width:2px,fill:none,stroke:none;
Loading

After

flowchart BT
    A --- annotation1>"Note: older events are at the top"]
    subgraph live timeline
        marker1>m.room.marker] ----> B -----------------> A
    end
    
    subgraph batch0
        batch0-batch[[m.room.batch]] --> batch0-2(("2")) --> batch0-1((1)) --> batch0-0((0)) --> batch0-insertion[/m.room.insertion\]
    end
    subgraph batch1
        batch1-batch[[m.room.batch]] --> batch1-2(("2")) --> batch1-1((1)) --> batch1-0((0)) --> batch1-insertion[/m.room.insertion\]
    end
    
    subgraph batch2
        batch2-batch[[m.room.batch]] --> batch2-2(("2")) --> batch2-1((1)) --> batch2-0((0)) --> batch2-insertion[/m.room.insertion\]
    end
    batch0-insertion --> memberBob0(["m.room.member (bob)"]) --> memberAlice0(["m.room.member (alice)"]) --> A
    batch1-insertion --> memberBob1(["m.room.member (bob)"]) --> memberAlice1(["m.room.member (alice)"]) --> A
    batch2-insertion --> memberBob2(["m.room.member (bob)"]) --> memberAlice2(["m.room.member (alice)"]) --> A
    marker1 -.-> batch0-insertionBase
    batch0-insertionBase[/m.room.insertion\] ---------------> A
    batch0-batch -.-> batch0-insertionBase
    batch1-batch -.-> batch0-insertion
    batch2-batch -.-> batch1-insertion
    %% make the annotation links invisible
    linkStyle 0 stroke-width:2px,fill:none,stroke:none;
Loading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out to #13971

MadLittleMods added a commit that referenced this pull request Sep 30, 2022
@MadLittleMods MadLittleMods changed the title Optimize backfill receiving to have less missing prev_event thrashing Optimize backfill receiving to have less missing prev_event thrashing (scratch) Sep 30, 2022
MadLittleMods added a commit that referenced this pull request Sep 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant