Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix missing sync events during historical batch imports #12319

Merged

Conversation

Fizzadar
Copy link
Contributor

@Fizzadar Fizzadar commented Mar 29, 2022

Discovered after much in-depth investigation in #12281. Adding an explicit order_by argument to change the ordering of results returned from room pagination queries and updating use in the sync & message handlers.

Closes: #12281
Closes: #3305

Signed off by: Nick Mills-Barrett nick@beeper.com

Pull Request Checklist

@Fizzadar Fizzadar requested a review from a team as a code owner March 29, 2022 07:12
This can be used to flip betweek topological ordering (the default)
and stream ordering as needed by the caller.
room_id,
end_token=stream_position.room_key,
limit=1,
order_by="stream",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a test for #3305 so it doesn't regress.

I don't have clear reproduction steps for this one and maybe we can just consider the test we write for #12281 as a subset of this problem and only need one test 🤷

This aligns with automatic selection of bounds depending on the token
passed into the pagination function.
@Fizzadar Fizzadar changed the title Recent events stream order_by argument Recent events stream order fix Mar 31, 2022
@Fizzadar Fizzadar requested review from MadLittleMods and removed request for a team March 31, 2022 07:13
@Fizzadar
Copy link
Contributor Author

Fizzadar commented Apr 4, 2022

@MadLittleMods I'm a bit confused on next steps here - this seems to have unearthed another (complex looking) bug I don't have time (currently) to work on a fix for.

My suggestion would be to roll back to the explicit argument fix which resolves this issue and create a new issue to tackle the get_messages bug (which can also remove the new argument).

@MadLittleMods
Copy link
Contributor

My suggestion would be to roll back to the explicit argument fix which resolves this issue and create a new issue to tackle the get_messages bug (which can also remove the new argument).

@Fizzadar Go for it 👍 We can get some feedback from the Synapse team whether that is acceptable

@MadLittleMods MadLittleMods requested a review from a team April 5, 2022 01:28
@MadLittleMods
Copy link
Contributor

@MadLittleMods I'm a bit confused on next steps here - this seems to have unearthed another (complex looking) bug I don't have time (currently) to work on a fix for.

As an update, I've fixed up the other complex bug via #12370

So we can probably leave this PR as-is. I've assigned it back to the Synapse core for review ⏩


assert (
"m.room.create" in state_event_types
), "Missing room full state in sync response"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and fails on develop but passes with this PR ✅ (as expected)

@erikjohnston erikjohnston self-assigned this Apr 6, 2022
@erikjohnston
Copy link
Member

erikjohnston commented Apr 6, 2022

Thanks @Fizzadar and @MadLittleMods for tracking this down!

My main concern is how this changes clients back paginating using a prev_batch token from a /sync timeline section. Those tokens are currently stream tokens, but /messages will return them in topological order with a topological next token. This change means that the first /messages will return events in stream order but will still return a topological next token, meaning that the next /messages will return events in a topological order.

I don't think that is a change we really want to make. My understanding is that that behaviour works correctly currently? Or does that also have issues with backfilled/historical events?

@erikjohnston erikjohnston removed the request for review from a team April 6, 2022 10:58
@@ -175,10 +175,6 @@ async def get_state_events(
state_filter = state_filter or StateFilter.all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is how this changes clients back paginating using a prev_batch token from a /sync timeline section. Those tokens are currently stream tokens, but /messages will return them in topological order with a topological next token. This change means that the first /messages will return events in stream order but will still return a topological next token, meaning that the next /messages will return events in a topological order.

-- @erikjohnston, #12319 (comment)

/messages could be updated to always return in topological_ordering regardless of what type of pagination token is given. That's what it was doing before this PR anyway.

I don't think that is a change we really want to make. My understanding is that that behaviour works correctly currently? Or does that also have issues with backfilled/historical events?

I think we have two assumptions that we should probably enforce:

  • /messages should always sort by topological_ordering regardless of pagination token given
  • /sync should always sort by stream_ordering regardless of pagination token given

This also aligns with what we have documented:

  • /sync returns things in the order they arrive at the server (stream_ordering).
  • /messages (and /backfill in the federation API) return them in the order determined by the event graph (topological_ordering, stream_ordering).

-- Room DAG concepts -> Depth and stream ordering

The current behavior of /messages and /sync returns events as expected but not state.

There is a bug in how incremental /sync returns state which this PR aims to fix. The best way to understand the problem is probably reading #12281 (comment) - but basically when calculating what state to return, /sync is ordering events by topological_ordering even though it should be stream_ordering.


I initially assumed we would want flexibility in how these endpoints sorted events according to the type of pagination token given but it's sounding like we actually want to enforce a given sort according to the endpoint. In which case, we can revert to @Fizzadar initial approach of plumbing a order_by argument which we can set.

Copy link
Member

@erikjohnston erikjohnston Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry for not getting to this yesterday, it sat at about the second thing on my todo list all day :/)

Thanks for the clarification. There's the added bonus that /sync will return events in pagination order for the first batch of events for the room, i.e. in an initial sync or when a user joins the room. (The rationale being that it's equivalent to getting no events then immediately doing a back pagination). This is relevant as we use get_recent_events_for_room for that case as well.

I initially assumed we would want flexibility in how these endpoints sorted events according to the type of pagination token given but it's sounding like we actually want to enforce a given sort according to the endpoint. In which case, we can revert to @Fizzadar initial approach of plumbing a order_by argument which we can set.

I wonder if we're just making this harder for ourselves by re-using the same query to do both /messages and when looking up the state for /sync? I think the query we want to run is simply:

SELECT event_id FROM events
WHERE room_id = ? AND stream_ordering <= ?
ORDER BY stream_ordering DESC
LIMIT 1

So it might be simpler to just have a separate get_last_event_in_room_before function that we call when getting the state instead? Rather than extending the already slightly horrific _paginate_room_events_txn function?


Entirely separately, and I'm not suggesting we necessarily do this now, but it occurs to me that we have a get_forward_extremities_for_room_at_stream_ordering store function that maps from room_id to the list of forward extremities in the room at the time. This can then be used to calculate the "current" state of the room at the time more accurately than the current method. The downside of this approach is that get_forward_extremities_for_room_at_stream_ordering will fail if the stream ordering is too old.

Another approach that has also literally just now occurred to me is that we could use the current_state_delta_stream to work out the changes in state between two stream orderings in the room, as the stream_id column correspond to the stream orderings of the changes (I think?).

Anyway, just wanted to record my thoughts here before I forget them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fizzadar sorry for going around the houses really slowly on this 😞

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries at all @erikjohnston - totally agree a separate function makes sense here; I've pushed that up in 9a78d14. I just wrapped an existing function get_room_event_before_stream_ordering (which doesn't actually return an event) and pulled out the event.

Have undone the pagination ordering changes too!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Will have a look after I've finished doing the v1.57.0rc1 release ❤️

@MadLittleMods MadLittleMods added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-Sync defects related to /sync labels Apr 8, 2022
This uses an existing method, `get_last_event_in_room_before_stream_ordering`,
to retrieve the most recent (stream ordered) event in a room and replaces
uses of the complex pagination txn.
@Fizzadar Fizzadar force-pushed the recent-events-stream-ordering branch from 2a5ee07 to 8e319cc Compare April 12, 2022 14:22
synapse/handlers/sync.py Show resolved Hide resolved
Co-authored-by: Erik Johnston <erikj@jki.re>
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
@erikjohnston
Copy link
Member

Thank you again for bearing with us @Fizzadar ❤️

@erikjohnston erikjohnston changed the title Recent events stream order fix Fix missing sync events during historical batch imports Apr 13, 2022
@erikjohnston erikjohnston merged commit e3a49f4 into matrix-org:develop Apr 13, 2022
@Fizzadar Fizzadar deleted the recent-events-stream-ordering branch April 13, 2022 11:59
babolivier added a commit to matrix-org/synapse-dinsic that referenced this pull request Apr 28, 2022
Synapse 1.58.0rc1 (2022-04-26)
==============================

As of this release, the groups/communities feature in Synapse is now disabled by default. See [\#11584](matrix-org/synapse#11584) for details. As mentioned in [the upgrade notes](https://github.com/matrix-org/synapse/blob/develop/docs/upgrade.md#upgrading-to-v1580), this feature will be removed in Synapse 1.61.

Features
--------

- Implement [MSC3383](matrix-org/matrix-spec-proposals#3383) for including the destination in server-to-server authentication headers. Contributed by @Bubu and @jcgruenhage for Famedly. ([\#11398](matrix-org/synapse#11398))
- Docker images and Debian packages from matrix.org now contain a locked set of Python dependencies, greatly improving build reproducibility. ([Board](https://github.com/orgs/matrix-org/projects/54), [\#11537](matrix-org/synapse#11537))
- Enable processing of device list updates asynchronously. ([\#12365](matrix-org/synapse#12365), [\#12465](matrix-org/synapse#12465))
- Implement [MSC2815](matrix-org/matrix-spec-proposals#2815) to allow room moderators to view redacted event content. Contributed by @tulir. ([\#12427](matrix-org/synapse#12427))
- Build Debian packages for Ubuntu 22.04 "Jammy Jellyfish". ([\#12543](matrix-org/synapse#12543))

Bugfixes
--------

- Prevent a sync request from removing a user's busy presence status. ([\#12213](matrix-org/synapse#12213))
- Fix bug with incremental sync missing events when rejoining/backfilling. Contributed by Nick @ Beeper. ([\#12319](matrix-org/synapse#12319))
- Fix a long-standing bug which incorrectly caused `GET /_matrix/client/v3/rooms/{roomId}/event/{eventId}` to return edited events rather than the original. ([\#12476](matrix-org/synapse#12476))
- Fix a bug introduced in Synapse 1.27.0 where the admin API for [deleting forward extremities](https://github.com/matrix-org/synapse/blob/erikj/fix_delete_event_response_count/docs/admin_api/rooms.md#deleting-forward-extremities) would always return a count of 1, no matter how many extremities were deleted. ([\#12496](matrix-org/synapse#12496))
- Fix a long-standing bug where the image thumbnails embedded into email notifications were broken. ([\#12510](matrix-org/synapse#12510))
- Fix a bug in the implementation of [MSC3202](matrix-org/matrix-spec-proposals#3202) where Synapse would use the field name `device_unused_fallback_keys`, rather than `device_unused_fallback_key_types`. ([\#12520](matrix-org/synapse#12520))
- Fix a bug introduced in Synapse 0.99.3 which could cause Synapse to consume large amounts of RAM when back-paginating in a large room. ([\#12522](matrix-org/synapse#12522))

Improved Documentation
----------------------

- Fix rendering of the documentation site when using the 'print' feature. ([\#12340](matrix-org/synapse#12340))
- Add a manual documenting config file options. ([\#12368](matrix-org/synapse#12368), [\#12527](matrix-org/synapse#12527))
- Update documentation to reflect that both the `run_background_tasks_on` option and the options for moving stream writers off of the main process are no longer experimental. ([\#12451](matrix-org/synapse#12451))
- Update worker documentation and replace old `federation_reader` with `generic_worker`. ([\#12457](matrix-org/synapse#12457))
- Strongly recommend [Poetry](https://python-poetry.org/) for development. ([\#12475](matrix-org/synapse#12475))
- Add some example configurations for workers and update architectural diagram. ([\#12492](matrix-org/synapse#12492))
- Fix a broken link in `README.rst`. ([\#12495](matrix-org/synapse#12495))
- Add HAProxy delegation example with CORS headers to docs. ([\#12501](matrix-org/synapse#12501))
- Remove extraneous comma in User Admin API's device deletion section so that the example JSON is actually valid and works. Contributed by @olmari. ([\#12533](matrix-org/synapse#12533))

Deprecations and Removals
-------------------------

- The groups/communities feature in Synapse is now disabled by default. ([\#12344](matrix-org/synapse#12344))
- Remove unstable identifiers from [MSC3440](matrix-org/matrix-spec-proposals#3440). ([\#12382](matrix-org/synapse#12382))

Internal Changes
----------------

- Preparation for faster-room-join work: start a background process to resynchronise the room state after a room join. ([\#12394](matrix-org/synapse#12394))
- Preparation for faster-room-join work: Implement a tracking mechanism to allow functions to wait for full room state to arrive. ([\#12399](matrix-org/synapse#12399))
- Remove an unstable identifier from [MSC3083](matrix-org/matrix-spec-proposals#3083). ([\#12395](matrix-org/synapse#12395))
- Run CI in the locked [Poetry](https://python-poetry.org/) environment, and remove corresponding `tox` jobs. ([\#12425](matrix-org/synapse#12425), [\#12434](matrix-org/synapse#12434), [\#12438](matrix-org/synapse#12438), [\#12441](matrix-org/synapse#12441), [\#12449](matrix-org/synapse#12449), [\#12478](matrix-org/synapse#12478), [\#12514](matrix-org/synapse#12514), [\#12472](matrix-org/synapse#12472))
- Change Mutual Rooms' `unstable_features` flag to `uk.half-shot.msc2666.mutual_rooms` which matches the current iteration of [MSC2666](matrix-org/matrix-spec-proposals#2666). ([\#12445](matrix-org/synapse#12445))
- Fix typo in the release script help string. ([\#12450](matrix-org/synapse#12450))
- Fix a minor typo in the Debian changelogs generated by the release script. ([\#12497](matrix-org/synapse#12497))
- Reintroduce the list of targets to the linter script, to avoid linting unwanted local-only directories during development. ([\#12455](matrix-org/synapse#12455))
- Limit length of `device_id` to less than 512 characters. ([\#12454](matrix-org/synapse#12454))
- Dockerfile-workers: reduce the amount we install in the image. ([\#12464](matrix-org/synapse#12464))
- Dockerfile-workers: give the master its own log config. ([\#12466](matrix-org/synapse#12466))
- complement-synapse-workers: factor out separate entry point script. ([\#12467](matrix-org/synapse#12467))
- Back out experimental implementation of [MSC2314](matrix-org/matrix-spec-proposals#2314). ([\#12474](matrix-org/synapse#12474))
- Fix grammatical error in federation error response when the room version of a room is unknown. ([\#12483](matrix-org/synapse#12483))
- Remove unnecessary configuration overrides in tests. ([\#12511](matrix-org/synapse#12511))
- Refactor the relations code for clarity. ([\#12519](matrix-org/synapse#12519))
- Add type hints so `docker` and `stubs` directories pass `mypy --disallow-untyped-defs`. ([\#12528](matrix-org/synapse#12528))
- Update `delay_cancellation` to accept any awaitable, rather than just `Deferred`s. ([\#12468](matrix-org/synapse#12468))
- Handle cancellation in `EventsWorkerStore._get_events_from_cache_or_db`. ([\#12529](matrix-org/synapse#12529))
babolivier added a commit to matrix-org/synapse-dinsic that referenced this pull request May 3, 2022
Synapse 1.58.0 (2022-05-03)
===========================

As of this release, the groups/communities feature in Synapse is now disabled by default. See [\#11584](matrix-org/synapse#11584) for details. As mentioned in [the upgrade notes](https://github.com/matrix-org/synapse/blob/develop/docs/upgrade.md#upgrading-to-v1580), this feature will be removed in Synapse 1.61.

No significant changes since 1.58.0rc2.

Synapse 1.58.0rc2 (2022-04-26)
==============================

This release candidate fixes bugs related to Synapse 1.58.0rc1's logic for handling device list updates.

Bugfixes
--------

- Fix a bug introduced in Synapse 1.58.0rc1 where the main process could consume excessive amounts of CPU and memory while handling sentry logging failures. ([\#12554](matrix-org/synapse#12554))
- Fix a bug introduced in Synapse 1.58.0rc1 where opentracing contexts were not correctly sent to whitelisted remote servers with device lists updates. ([\#12555](matrix-org/synapse#12555))

Internal Changes
----------------

- Reduce unnecessary work when handling remote device list updates. ([\#12557](matrix-org/synapse#12557))

Synapse 1.58.0rc1 (2022-04-26)
==============================

Features
--------

- Implement [MSC3383](matrix-org/matrix-spec-proposals#3383) for including the destination in server-to-server authentication headers. Contributed by @Bubu and @jcgruenhage for Famedly. ([\#11398](matrix-org/synapse#11398))
- Docker images and Debian packages from matrix.org now contain a locked set of Python dependencies, greatly improving build reproducibility. ([Board](https://github.com/orgs/matrix-org/projects/54), [\#11537](matrix-org/synapse#11537))
- Enable processing of device list updates asynchronously. ([\#12365](matrix-org/synapse#12365), [\#12465](matrix-org/synapse#12465))
- Implement [MSC2815](matrix-org/matrix-spec-proposals#2815) to allow room moderators to view redacted event content. Contributed by @tulir @ Beeper. ([\#12427](matrix-org/synapse#12427))
- Build Debian packages for Ubuntu 22.04 "Jammy Jellyfish". ([\#12543](matrix-org/synapse#12543))

Bugfixes
--------

- Prevent a sync request from removing a user's busy presence status. ([\#12213](matrix-org/synapse#12213))
- Fix bug with incremental sync missing events when rejoining/backfilling. Contributed by Nick @ Beeper. ([\#12319](matrix-org/synapse#12319))
- Fix a long-standing bug which incorrectly caused `GET /_matrix/client/v3/rooms/{roomId}/event/{eventId}` to return edited events rather than the original. ([\#12476](matrix-org/synapse#12476))
- Fix a bug introduced in Synapse 1.27.0 where the admin API for [deleting forward extremities](https://github.com/matrix-org/synapse/blob/erikj/fix_delete_event_response_count/docs/admin_api/rooms.md#deleting-forward-extremities) would always return a count of 1, no matter how many extremities were deleted. ([\#12496](matrix-org/synapse#12496))
- Fix a long-standing bug where the image thumbnails embedded into email notifications were broken. ([\#12510](matrix-org/synapse#12510))
- Fix a bug in the implementation of [MSC3202](matrix-org/matrix-spec-proposals#3202) where Synapse would use the field name `device_unused_fallback_keys`, rather than `device_unused_fallback_key_types`. ([\#12520](matrix-org/synapse#12520))
- Fix a bug introduced in Synapse 0.99.3 which could cause Synapse to consume large amounts of RAM when back-paginating in a large room. ([\#12522](matrix-org/synapse#12522))

Improved Documentation
----------------------

- Fix rendering of the documentation site when using the 'print' feature. ([\#12340](matrix-org/synapse#12340))
- Add a manual documenting config file options. ([\#12368](matrix-org/synapse#12368), [\#12527](matrix-org/synapse#12527))
- Update documentation to reflect that both the `run_background_tasks_on` option and the options for moving stream writers off of the main process are no longer experimental. ([\#12451](matrix-org/synapse#12451))
- Update worker documentation and replace old `federation_reader` with `generic_worker`. ([\#12457](matrix-org/synapse#12457))
- Strongly recommend [Poetry](https://python-poetry.org/) for development. ([\#12475](matrix-org/synapse#12475))
- Add some example configurations for workers and update architectural diagram. ([\#12492](matrix-org/synapse#12492))
- Fix a broken link in `README.rst`. ([\#12495](matrix-org/synapse#12495))
- Add HAProxy delegation example with CORS headers to docs. ([\#12501](matrix-org/synapse#12501))
- Remove extraneous comma in User Admin API's device deletion section so that the example JSON is actually valid and works. Contributed by @olmari. ([\#12533](matrix-org/synapse#12533))

Deprecations and Removals
-------------------------

- The groups/communities feature in Synapse is now disabled by default. ([\#12344](matrix-org/synapse#12344))
- Remove unstable identifiers from [MSC3440](matrix-org/matrix-spec-proposals#3440). ([\#12382](matrix-org/synapse#12382))

Internal Changes
----------------

- Preparation for faster-room-join work: start a background process to resynchronise the room state after a room join. ([\#12394](matrix-org/synapse#12394))
- Preparation for faster-room-join work: Implement a tracking mechanism to allow functions to wait for full room state to arrive. ([\#12399](matrix-org/synapse#12399))
- Remove an unstable identifier from [MSC3083](matrix-org/matrix-spec-proposals#3083). ([\#12395](matrix-org/synapse#12395))
- Run CI in the locked [Poetry](https://python-poetry.org/) environment, and remove corresponding `tox` jobs. ([\#12425](matrix-org/synapse#12425), [\#12434](matrix-org/synapse#12434), [\#12438](matrix-org/synapse#12438), [\#12441](matrix-org/synapse#12441), [\#12449](matrix-org/synapse#12449), [\#12478](matrix-org/synapse#12478), [\#12514](matrix-org/synapse#12514), [\#12472](matrix-org/synapse#12472))
- Change Mutual Rooms' `unstable_features` flag to `uk.half-shot.msc2666.mutual_rooms` which matches the current iteration of [MSC2666](matrix-org/matrix-spec-proposals#2666). ([\#12445](matrix-org/synapse#12445))
- Fix typo in the release script help string. ([\#12450](matrix-org/synapse#12450))
- Fix a minor typo in the Debian changelogs generated by the release script. ([\#12497](matrix-org/synapse#12497))
- Reintroduce the list of targets to the linter script, to avoid linting unwanted local-only directories during development. ([\#12455](matrix-org/synapse#12455))
- Limit length of `device_id` to less than 512 characters. ([\#12454](matrix-org/synapse#12454))
- Dockerfile-workers: reduce the amount we install in the image. ([\#12464](matrix-org/synapse#12464))
- Dockerfile-workers: give the master its own log config. ([\#12466](matrix-org/synapse#12466))
- complement-synapse-workers: factor out separate entry point script. ([\#12467](matrix-org/synapse#12467))
- Back out experimental implementation of [MSC2314](matrix-org/matrix-spec-proposals#2314). ([\#12474](matrix-org/synapse#12474))
- Fix grammatical error in federation error response when the room version of a room is unknown. ([\#12483](matrix-org/synapse#12483))
- Remove unnecessary configuration overrides in tests. ([\#12511](matrix-org/synapse#12511))
- Refactor the relations code for clarity. ([\#12519](matrix-org/synapse#12519))
- Add type hints so `docker` and `stubs` directories pass `mypy --disallow-untyped-defs`. ([\#12528](matrix-org/synapse#12528))
- Update `delay_cancellation` to accept any awaitable, rather than just `Deferred`s. ([\#12468](matrix-org/synapse#12468))
- Handle cancellation in `EventsWorkerStore._get_events_from_cache_or_db`. ([\#12529](matrix-org/synapse#12529))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Sync defects related to /sync T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
3 participants