Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sliding sync: various fixups to the background update #17652

Merged
merged 23 commits into from
Sep 11, 2024

Conversation

erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Sep 3, 2024

Follow-up to #17641, #17634, #17631 and #17632 to fix-up #17512

A few things are going on here.

  • Skip over existing rows
    • This saves a bunch of time ignoring work we've already done.
    • This does mean that we don't update the forgotten state for those memberships, but I don't actually think we need to. If a row already exists for that room/user then it was inserted as a new one by the event persist path, in which case it is by definition not forgotten, so we don't need to fix that path up.
  • Handle missing previous memberships
    • Turns out there are valid (though rare) reasons for us to not have a previous membership for a leave/ban outlier, so we need to gracefully handle that case
  • Ignore leave events for bg updates
    • This is probably controversial, but let's skip inserting rows for left rooms. We'll end up filtering them out anyway, so we just end up spending ages inserting them for no benefit. If we want we can add another background update to add them back in (which wouldn't block using the new tables).
    • We still insert new left rooms, to handle the case of detecting newly left rooms

Reviewable commit-by-commit.

@@ -1978,7 +1978,9 @@ def _get_sliding_sync_insert_values_from_state_map(
elif state_key == (EventTypes.Name, ""):
room_name = event.content.get(EventContentFields.ROOM_NAME)
# Scrutinize JSON values
if room_name is None or isinstance(room_name, str):
if room_name is None or (
isinstance(room_name, str) and "\0" not in room_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we care to check this? As far as I can tell, it's valid JSON. Is this a Matrix spec thing?

Perhaps it's because we have to mix with a system that is sensitive to null-terminated strings (C strings)? Doing some quick searching, it seems like Postgres does not allow null bytes in TEXT fields but it is allowed in SQLite.

The only other place we do this is in

# Ensure that the values to update are valid, they should be strings and
# not contain any null bytes.
#
# Invalid data gets overwritten with null.
#
# Note that a missing value should not be overwritten (it keeps the
# previous value).
sentinel = object()
for col in (
"join_rules",
"history_visibility",
"encryption",
"name",
"topic",
"avatar",
"canonical_alias",
"guest_access",
"room_type",
):
field = fields.get(col, sentinel)
if field is not sentinel and (not isinstance(field, str) or "\0" in field):

Why aren't we checking other content values like we do in the stats code?

Why aren't we checking this in the events code where we also insert these state values into the database? Do we disallow this with some validation somewhere to prevent this with new data?

We should explain the reason why we're doing this. I assume it's the Postgres reason.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh! Sorry yeah, postgres TEXT fields can't have \0 bytes in them (nyargghghgh why). Will update with comment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we checking other content values like we do in the stats code?

^

Why aren't we checking this in the events code where we also insert these state values into the database? Do we disallow this with some validation somewhere to prevent this with new data?

^

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I forgot that room_type is also a thing. The rest I think are fine, e.g. tombstone room ID should be a valid room ID (but we may as well check that too).

Comment on lines 1907 to 1911
AND (c.membership != ? OR c.user_id != e.sender)
ORDER BY c.room_id ASC, c.user_id ASC
LIMIT ?
""",
(last_room_id, last_user_id, batch_size),
(last_room_id, last_user_id, Membership.LEAVE, batch_size),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to exclude leave memberships. It's just extra things to think about.

We want the table to be complete in the end so that means we will need to add another background update to fill in the leave events. If we're going to split the background update, we should just do that in this PR so we don't need to remove those assertions from the tests that will probably get lost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit in tow minds about this. The problem is that this is a surprisingly huge amount of data that will just never be read. On the other hand, it feels inconsistent to not port them over but to keep future old rows.

I wonder if the right thing to do here maybe is to skip these rows, but add a background job that clears out old left rooms from the table? Which is possible since we can response with M_UNKNOWN_POS if we get a sliding sync request that has a pos from before we purged?

Copy link
Collaborator

@MadLittleMods MadLittleMods Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to accept the forever background update to clean-up leaves and keep the size of the database table down, that can work 👍

I assume we want some grace period for left rooms (a day)? That way people don't immediately get a M_UNKNOWN_POS a soon as they leave a room and allows other clients to catch-up gracefully if the room is left on another client.

Is it really worth this complexity though? We're not storing every membership ever, just the latest membership of a given user for a given room.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that we're not going to add a include_leave option like Sync v2 has?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we want some grace period for left rooms (a day)?

Yeah, probably something like a week or something, maybe longer.

Is it really worth this complexity though? We're not storing every membership ever, just the latest membership of a given user fro a given room.

It's kinda sucky to keep around things forever. It's fine when it's not causing issues, but it'll take a really long time for the background updates to run for data that we currently don't use.

Are we sure that we're not going to add a include_leave option like Sync v2 has?

That's a fair question, I guess it would be good to know if any clients actually use that option.


I guess a potential half-way house is for us to not port over the metadata for left rooms for now? Which also seems wrong but will be a lot quicker and have a chance to actually complete.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some discussion in an internal room (light discussion with people on both sides)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove this patch and open a separate PR are you happy with the rest? We probably want the actual bug fixes to go into the RC

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good 🙂 ("Ignore leave events for bg updates" moved to another PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.f. #17699

synapse/storage/databases/main/events_bg_updates.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events_bg_updates.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events_bg_updates.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events_bg_updates.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events_bg_updates.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events_bg_updates.py Outdated Show resolved Hide resolved
erikjohnston added a commit that referenced this pull request Sep 10, 2024
erikjohnston and others added 3 commits September 10, 2024 10:26
Co-authored-by: Eric Eastwood <eric.eastwood@beta.gouv.fr>
Co-authored-by: Eric Eastwood <eric.eastwood@beta.gouv.fr>
@erikjohnston erikjohnston merged commit 596b964 into develop Sep 11, 2024
39 checks passed
@erikjohnston erikjohnston deleted the erikj/ss_fixups branch September 11, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants