Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Prevent join->join membership transitions changing member count #7977

Merged
merged 10 commits into from
Aug 3, 2020

Conversation

anoadragon453
Copy link
Member

@anoadragon453 anoadragon453 commented Jul 28, 2020

Fixes #1294

StatsHandler handles updates to the current_state_delta_stream, and updates room stats such as the amount of state events, joined users, etc.

However, it counts every new join membership as a new user entering a room (and that user being in another room), whereas it's possible for a user's membership status to go from join -> join, for instance when they change their per-room profile information.

This PR adds a check for join->join membership transitions, and bails out early, as none of the further checks are necessary at that point.

Due to this bug, membership stats in many rooms have ended up being wildly larger than their true values. I am not sure if we also want to include a migration step which recalculates these statistics (possibly using the _populate_stats_process_rooms bg update).

Bug introduced in the initial implementation #4338.

@anoadragon453 anoadragon453 requested a review from a team July 28, 2020 23:20
Copy link
Contributor

@babolivier babolivier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the right fix be to change the if on line 244 into an elif? It seems to me that this is what's causing the bug (i.e. not no-oping if the membership doesn't change).

I'm not sure bailing out early buys us much at this point, but if you're still really keen on doing so it should probably be

if membership == prev_membership:
    continue

instead, and getting rid of the similar conditional clauses lower in the function.

@babolivier
Copy link
Contributor

Oh and also could you add a test please?

@anoadragon453
Copy link
Member Author

Wouldn't the right fix be to change the if on line 244 into an elif?

Oh, yes indeed! I completely missed that. Thanks 🙂

Oh and also could you add a test please?

👍

@anoadragon453
Copy link
Member Author

Seems that running

INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES ('populate_stats_process_rooms', '{}', 'current_state_events_membership');

works to fix the broken room counts, and runs in the background during operation rather than blocking startup.

@anoadragon453
Copy link
Member Author

Applied your fix and added a test @babolivier 🙂 Still not sure what to do w.r.t the background update however. It will likely take a long time on big homeservers, but it does run in the background. I'm tempted to just chuck it in migration step.

@anoadragon453 anoadragon453 requested a review from babolivier July 29, 2020 21:05
Copy link
Contributor

@babolivier babolivier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code lgtm!

Still not sure what to do w.r.t the background update however. It will likely take a long time on big homeservers, but it does run in the background. I'm tempted to just chuck it in migration step.

What exactly are you referring to with "in migration step"? A schema delta?
I'd be in favour of just making homeservers rerun the update, but it's probably better to get other opinions in #synapse-dev first.

@anoadragon453
Copy link
Member Author

anoadragon453 commented Jul 30, 2020

Conclusion from #synapse-dev is that we're going to include a delta to re-run the update. However, it's important to note that the populate_stats_process_rooms background update is already set to run if you're upgrading from Synapse <1.0.0.

Additionally, if you've upgraded to v1.18.0 (which doesn't include this fix), the earlier bg job runs, and then update to v1.19.0, you'd end up with only half of your rooms having room stats recalculated after this fix was in place.

So the proposed solution is to switch the old populate_stats_process_rooms background job to a no-op, and then kick off a bg job with a new name, but the same functionality as the old one. This effectively restarts the background job from the beginning, without running it twice in a row, supporting both upgrade usecases.

by creating a new bg job, and converting the old one to a no-op. Reasoning
behind this is explained in the PR comments
…bership_join_count

* 'develop' of github.com:matrix-org/synapse:
  Update workers docs (#7990)
  Fix invite rejection when we have no forward-extremeties (#7980)
  Fix typo in docs/workers.md (#7992)
  Convert federation client to async/await. (#7975)
  Convert appservice to async. (#7973)
  Convert some of the data store to async. (#7976)
  Fix formatting of changelog and upgrade notes
  Ensure that remove_pusher is always async (#7981)
  Add deprecation warnings
  1.18.0
  Update worker docs with recent enhancements  (#7969)
  Ensure the msg property of HttpResponseException is a string. (#7979)
  Remove from the event_relations table when purging historical events. (#7978)
  Add additional logging for SAML sessions. (#7971)
  Add MSC reference to changelog for #7736
  Re-implement unread counts (#7736)
  Various improvements to the docs (#7899)
@anoadragon453
Copy link
Member Author

I've requested the two people involved in the discussion linked above so one can check what I've done matches up.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

I'm not completely sure this is the only bug in the room lists, but it's obviously a big one!

changelog.d/7977.bugfix Outdated Show resolved Hide resolved
anoadragon453 and others added 2 commits August 3, 2020 13:12
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
@anoadragon453
Copy link
Member Author

anoadragon453 commented Aug 3, 2020

I think the title of the linked bug is a bit misleading, as the discussing therein seems to only discuss the accidental increasing room members.

If this continues to happen via another bug then we can open a new issue.

@anoadragon453 anoadragon453 merged commit 5d92a14 into develop Aug 3, 2020
@anoadragon453 anoadragon453 deleted the anoa/membership_join_count branch August 3, 2020 20:54
reivilibre added a commit that referenced this pull request Aug 13, 2020
Synapse 1.19.0rc1 (2020-08-13)
==============================

Removal warning
---------------

As outlined in the [previous release](https://github.com/matrix-org/synapse/releases/tag/v1.18.0), we are no longer publishing Docker images with the `-py3` tag suffix. On top of that, we have also removed the `latest-py3` tag. Please see [the announcement in the upgrade notes for 1.18.0](https://github.com/matrix-org/synapse/blob/develop/UPGRADE.rst#upgrading-to-v1180).

Features
--------

- Add option to allow server admins to join rooms which fail complexity checks. Contributed by @lugino-emeritus. ([\#7902](#7902))
- Add an option to purge room or not with delete room admin endpoint (`POST /_synapse/admin/v1/rooms/<room_id>/delete`). Contributed by @dklimpel. ([\#7964](#7964))
- Add rate limiting to users joining rooms. ([\#8008](#8008))
- Add a `/health` endpoint to every configured HTTP listener that can be used as a health check endpoint by load balancers. ([\#8048](#8048))
- Allow login to be blocked based on the values of SAML attributes. ([\#8052](#8052))
- Allow guest access to the `GET /_matrix/client/r0/rooms/{room_id}/members` endpoint, according to MSC2689. Contributed by Awesome Technologies Innovationslabor GmbH. ([\#7314](#7314))

Bugfixes
--------

- Fix a bug introduced in Synapse v1.7.2 which caused inaccurate membership counts in the room directory. ([\#7977](#7977))
- Fix a long standing bug: 'Duplicate key value violates unique constraint "event_relations_id"' when message retention is configured. ([\#7978](#7978))
- Fix "no create event in auth events" when trying to reject invitation after inviter leaves. Bug introduced in Synapse v1.10.0. ([\#7980](#7980))
- Fix various comments and minor discrepencies in server notices code. ([\#7996](#7996))
- Fix a long standing bug where HTTP HEAD requests resulted in a 400 error. ([\#7999](#7999))
- Fix a long-standing bug which caused two copies of some log lines to be written when synctl was used along with a MemoryHandler logger. ([\#8011](#8011), [\#8012](#8012))

Updates to the Docker image
---------------------------

- We no longer publish Docker images with the `-py3` tag suffix, as [announced in the upgrade notes](https://github.com/matrix-org/synapse/blob/develop/UPGRADE.rst#upgrading-to-v1180). ([\#8056](#8056))

Improved Documentation
----------------------

- Document how to set up a client .well-known file and fix several pieces of outdated documentation. ([\#7899](#7899))
- Improve workers docs. ([\#7990](#7990), [\#8000](#8000))
- Fix typo in `docs/workers.md`. ([\#7992](#7992))
- Add documentation for how to undo a room shutdown. ([\#7998](#7998), [\#8010](#8010))

Internal Changes
----------------

- Reduce the amount of whitespace in JSON stored and sent in responses. Contributed by David Vo. ([\#7372](#7372))
- Switch to the JSON implementation from the standard library and bump the minimum version of the canonicaljson library to 1.2.0. ([\#7936](#7936), [\#7979](#7979))
- Convert various parts of the codebase to async/await. ([\#7947](#7947), [\#7948](#7948), [\#7949](#7949), [\#7951](#7951), [\#7963](#7963), [\#7973](#7973), [\#7975](#7975), [\#7976](#7976), [\#7981](#7981), [\#7987](#7987), [\#7989](#7989), [\#8003](#8003), [\#8014](#8014), [\#8016](#8016), [\#8027](#8027), [\#8031](#8031), [\#8032](#8032), [\#8035](#8035), [\#8042](#8042), [\#8044](#8044), [\#8045](#8045), [\#8061](#8061), [\#8062](#8062), [\#8063](#8063), [\#8066](#8066), [\#8069](#8069), [\#8070](#8070))
- Move some database-related log lines from the default logger to the database/transaction loggers. ([\#7952](#7952))
- Add a script to detect source code files using non-unix line terminators. ([\#7965](#7965), [\#7970](#7970))
- Log the SAML session ID during creation. ([\#7971](#7971))
- Implement new experimental push rules for some users. ([\#7997](#7997))
- Remove redundant and unreliable signature check for v1 Identity Service lookup responses. ([\#8001](#8001))
- Improve the performance of the register endpoint. ([\#8009](#8009))
- Reduce less useful output in the newsfragment CI step. Add a link to the changelog section of the contributing guide on error. ([\#8024](#8024))
- Rename storage layer objects to be more sensible. ([\#8033](#8033))
- Change the default log config to reduce disk I/O and storage for new servers. ([\#8040](#8040))
- Add an assertion on `prev_events` in `create_new_client_event`. ([\#8041](#8041))
- Add a comment to `ServerContextFactory` about the use of `SSLv23_METHOD`. ([\#8043](#8043))
- Log `OPTIONS` requests at `DEBUG` rather than `INFO` level to reduce amount logged at `INFO`. ([\#8049](#8049))
- Reduce amount of outbound request logging at `INFO` level. ([\#8050](#8050))
- It is no longer necessary to explicitly define `filters` in the logging configuration. (Continuing to do so is redundant but harmless.) ([\#8051](#8051))
- Add and improve type hints. ([\#8058](#8058), [\#8064](#8064), [\#8060](#8060), [\#8067](#8067))
babolivier pushed a commit that referenced this pull request Sep 1, 2021
* commit '5d92a1428':
  Prevent join->join membership transitions changing member count (#7977)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Room membership count lies in many different ways. (SYN-307)
3 participants