Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add all Stream Writer worker types to configure_workers_and_start.py #14197

Merged
merged 14 commits into from
Nov 8, 2022

Conversation

realtyem
Copy link
Contributor

@realtyem realtyem commented Oct 16, 2022

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

There are a few missing types of workers that aren't included in the templating to generate configuration that is built into configure_workers_and_start.py for the Dockerfile-workers, specifically five of six kinds of stream writers. These other worker types are documented in the configuration manual. Let's add them to the scripting so if they ever need to be used, it's simple.
🔴 BIG WARNING: Typing and Receipts stream writers do not work correctly at this time as-is and fail Complement tests.

Removed the unnecessary gating from add_sharding_to_shared_config(), commit adb34bc, then rename function to process_sharding_and_stream_writers_for_shared_config as this gating is actually taken care of by worker type inside the subroutine. Unsupported worker types pass right through. A side effect of removing the gate was that a bug was possibly fixed in how pusher_instances and federation_senders_instances was processed.

Signed-off-by: Jason Little realtyem@gmail.com

@realtyem
Copy link
Contributor Author

realtyem commented Oct 16, 2022

I welcome suggestions on what to rename add_sharding_to_shared_config() since it does a little bit more now(and probably should have been all along). Perhaps add_streamwriters_and_sharding_to_shared_config()? Feels kinda long, for a function name. OTOH, have you seen Windows function names? Win64DoThisAndThatAndThoseOthersAndThenResetAllSettingsToDefaultsWithoutAsking_45927()

@anoadragon453 Since you wrote the original function, I feel like yours is the input for this.

@realtyem realtyem force-pushed the declare-all-worker-types branch from a0f1e22 to 92a1345 Compare October 16, 2022 09:35
@realtyem
Copy link
Contributor Author

realtyem commented Oct 16, 2022

If you will accept the testing from my own repo, which doesn't touch functionality of the core Synapse project, you can see example CI logs of the failing stream writers from before this fix here and the after here. Flakes are fun, but I think these are known. I'm still testing actual usage on my live server to make sure those passes on stream writers aren't a fluke.

@realtyem realtyem marked this pull request as ready for review October 16, 2022 09:44
@realtyem realtyem requested a review from a team as a code owner October 16, 2022 09:44
@realtyem realtyem marked this pull request as draft October 16, 2022 12:42
@realtyem
Copy link
Contributor Author

realtyem commented Oct 17, 2022

Strange occurrence in my Postgresql logs after live testing this.

2022-10-16 06:41:55.897 CDT [7116] ERROR:  could not serialize access due to concurrent delete
2022-10-16 06:41:55.897 CDT [7116] STATEMENT:
                        DELETE FROM stream_positions
                        WHERE
                            stream_name = 'account_data'
                            AND instance_name != ALL(ARRAY['account_data1'])
2022-10-16 05:53:23.135 CDT [6528] ERROR:  could not serialize access due to concurrent delete
2022-10-16 05:53:23.135 CDT [6528] STATEMENT:
                        DELETE FROM stream_positions
                        WHERE
                            stream_name = 'presence_stream'
                            AND instance_name != ALL(ARRAY['master'])
2022-10-16 06:52:49.281 CDT [276] ERROR:  could not serialize access due to concurrent delete
2022-10-16 06:52:49.281 CDT [276] STATEMENT:
                        DELETE FROM stream_positions
                        WHERE
                            stream_name = 'receipts'
                            AND instance_name != ALL(ARRAY['receipts1'])

Was spammed for several minutes. Is this the system resetting itself after worker reconfiguration? It did seem to stop after a while while it wasn't racing against itself.

Further exceptions looked like this in the synapse log:

********************************************************
 Error during initialisation:
    could not serialize access due to concurrent delete

 There may be more information in the logs.
********************************************************

  File "/usr/local/lib/python3.10/site-packages/synapse/storage/databases/main/media_repository.py", line 148, in __init__
    super().__init__(database, db_conn, hs)
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/databases/main/media_repository.py", line 68, in __init__
    super().__init__(database, db_conn, hs)
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/databases/main/metrics.py", line 68, in __init__
    super().__init__(database, db_conn, hs)
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/databases/main/event_push_actions.py", line 247, in __init__
    super().__init__(database, db_conn, hs)
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/databases/main/receipts.py", line 71, in __init__
    self._receipts_id_gen = MultiWriterIdGenerator(
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/util/id_generators.py", line 368, in __init__
    self._load_current_ids(db_conn, tables)
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/util/id_generators.py", line 393, in _load_current_ids
    cur.execute(sql, (self._stream_name, self._writers))
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/database.py", line 388, in execute
    self._do_execute(self.txn.execute, sql, *args)
  File "/usr/local/lib/python3.10/site-packages/synapse/storage/database.py", line 436, in _do_execute
    return func(sql, *args, **kwargs)
psycopg2.errors.SerializationFailure: could not serialize access due to concurrent delete

Which I assume to be normal for this situation.

Final testing results on live: Everything appears to function(or at least not be a show stopper) except for the receipts worker type which did bring down the whole synapse server. This may be a red herring, as Complement tests still pass and may just be an issue with migrating a live server with existing database. Can anyone provide me with a way to check this?

@realtyem realtyem mentioned this pull request Oct 17, 2022
4 tasks
@H-Shay
Copy link
Contributor

H-Shay commented Oct 18, 2022

This is still in draft, is it ready for review?

@realtyem
Copy link
Contributor Author

realtyem commented Oct 18, 2022 via email

@H-Shay
Copy link
Contributor

H-Shay commented Oct 18, 2022

I think if you're ready for feedback go ahead and remove it from draft.

…riters_for_shared_config

For the record, I think this function name is to long but seems to get the point across.
@realtyem realtyem marked this pull request as ready for review October 21, 2022 01:39
This is no longer accurate
@realtyem realtyem requested a review from H-Shay October 25, 2022 21:30
@H-Shay H-Shay requested review from a team and removed request for H-Shay October 31, 2022 18:58
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

psycopg2.errors.SerializationFailure: could not serialize access due to concurrent delete

These issues are because the workers all start up very quickly at the same time and hit the database table at the same time, especially if you're using the complement fork launcher.
Probably this case should be handled with a retry loop with a short but random backoff — that is, if we can't figure out how to prevent the race outright.
(Maybe some explicit locking could help.)

docker/configure_workers_and_start.py Outdated Show resolved Hide resolved
@realtyem realtyem requested a review from reivilibre November 8, 2022 11:18
@reivilibre reivilibre enabled auto-merge (squash) November 8, 2022 12:35
@reivilibre reivilibre merged commit d85cba1 into matrix-org:develop Nov 8, 2022
@realtyem realtyem deleted the declare-all-worker-types branch November 8, 2022 23:09
bradtgmurray added a commit to beeper/synapse-legacy-fork that referenced this pull request Nov 22, 2022
Synapse 1.72.0 (2022-11-22)
===========================

Please note that Synapse now only supports PostgreSQL 11+, because PostgreSQL 10 has reached end-of-life, c.f. our [Deprecation Policy](https://github.com/matrix-org/synapse/blob/develop/docs/deprecation_policy.md).

Bugfixes
--------

- Update forgotten references to legacy metrics in the included Grafana dashboard. ([\matrix-org#14477](matrix-org#14477))

Synapse 1.72.0rc1 (2022-11-16)
==============================

Features
--------

- Add experimental support for [MSC3912](matrix-org/matrix-spec-proposals#3912): Relation-based redactions. ([\matrix-org#14260](matrix-org#14260))
- Build Debian packages for Ubuntu 22.10 (Kinetic Kudu). ([\matrix-org#14396](matrix-org#14396))
- Add an [Admin API](https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/index.html) endpoint for user lookup based on third-party ID (3PID). Contributed by @ashfame. ([\matrix-org#14405](matrix-org#14405))
- Faster joins: include heroes' membership events in the partial join response, for rooms without a name or canonical alias. ([\matrix-org#14442](matrix-org#14442))

Bugfixes
--------

- Faster joins: do not block creation of or queries for room aliases during the resync. ([\matrix-org#14292](matrix-org#14292))
- Fix a bug introduced in Synapse 1.64.0rc1 which could cause log spam when fetching events from other homeservers. ([\matrix-org#14347](matrix-org#14347))
- Fix a bug introduced in 1.66 which would not send certain pushrules to clients. Contributed by Nico. ([\matrix-org#14356](matrix-org#14356))
- Fix a bug introduced in v1.71.0rc1 where the power level event was incorrectly created during initial room creation. ([\matrix-org#14361](matrix-org#14361))
- Fix the refresh token endpoint to be under /r0 and /v3 instead of /v1. Contributed by Tulir @ Beeper. ([\matrix-org#14364](matrix-org#14364))
- Fix a long-standing bug where Synapse would raise an error when encountering an unrecognised field in a `/sync` filter, instead of ignoring it for forward compatibility. ([\matrix-org#14369](matrix-org#14369))
- Fix a background database update, introduced in Synapse 1.64.0, which could cause poor database performance. ([\matrix-org#14374](matrix-org#14374))
- Fix PostgreSQL sometimes using table scans for queries against the `event_search` table, taking a long time and a large amount of IO. ([\matrix-org#14409](matrix-org#14409))
- Fix rendering of some HTML templates (including emails). Introduced in v1.71.0. ([\matrix-org#14448](matrix-org#14448))
- Fix a bug introduced in Synapse 1.70.0 where the background updates to add non-thread unique indexes on receipts could fail when upgrading from 1.67.0 or earlier. ([\matrix-org#14453](matrix-org#14453))

Updates to the Docker image
---------------------------

- Add all Stream Writer worker types to `configure_workers_and_start.py`. ([\matrix-org#14197](matrix-org#14197))
- Remove references to legacy worker types in the multi-worker Dockerfile. ([\matrix-org#14294](matrix-org#14294))

Improved Documentation
----------------------

- Upload documentation PRs to Netlify. ([\matrix-org#12947](matrix-org#12947), [\matrix-org#14370](matrix-org#14370))
- Add addtional TURN server configuration example based on [eturnal](https://github.com/processone/eturnal) and adjust general TURN server doc structure. ([\matrix-org#14293](matrix-org#14293))
- Add example on how to load balance /sync requests. Contributed by [aceArt](https://aceart.de). ([\matrix-org#14297](matrix-org#14297))
- Edit sample Nginx reverse proxy configuration to use HTTP/1.1. Contributed by Brad Jones. ([\matrix-org#14414](matrix-org#14414))

Deprecations and Removals
-------------------------

- Remove support for PostgreSQL 10. ([\matrix-org#14392](matrix-org#14392), [\matrix-org#14397](matrix-org#14397))

Internal Changes
----------------

- Run unit tests against Python 3.11. ([\matrix-org#13812](matrix-org#13812))
- Add TLS support for generic worker endpoints. ([\matrix-org#14128](matrix-org#14128), [\matrix-org#14455](matrix-org#14455))
- Switch to a maintained action for installing Rust in CI. ([\matrix-org#14313](matrix-org#14313))
- Add override ability to `complement.sh` command line script to request certain types of workers. ([\matrix-org#14324](matrix-org#14324))
- Enabling testing of [MSC3874](matrix-org/matrix-spec-proposals#3874) (filtering of `/messages` by relation type) in complement. ([\matrix-org#14339](matrix-org#14339))
- Concisely log a failure to resolve state due to missing `prev_events`. ([\matrix-org#14346](matrix-org#14346))
- Use a maintained Github action to install Rust. ([\matrix-org#14351](matrix-org#14351))
- Cleanup old worker datastore classes. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#14375](matrix-org#14375))
- Test against PostgreSQL 15 in CI. ([\matrix-org#14394](matrix-org#14394))
- Remove unreachable code. ([\matrix-org#14410](matrix-org#14410))
- Clean-up event persistence code. ([\matrix-org#14411](matrix-org#14411))
- Update docstring to clarify that `get_partial_state_events_batch` does not just give you completely arbitrary partial-state events. ([\matrix-org#14417](matrix-org#14417))
- Fix mypy errors introduced by bumping the locked version of `attrs` and `gitpython`. ([\matrix-org#14433](matrix-org#14433))
- Make Dependabot only bump Rust deps in the lock file. ([\matrix-org#14434](matrix-org#14434))
- Fix an incorrect stub return type for `PushRuleEvaluator.run`. ([\matrix-org#14451](matrix-org#14451))
- Improve performance of `/context` in large rooms. ([\matrix-org#14461](matrix-org#14461))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants