Move client command handling out of TCP protocol #7185

erikjohnston · 2020-03-31T12:35:15Z

The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for the Redis protocol). This PR simply moves the client paths to the new ReplicationCommandHandler; a future PR will move the server paths too.

The commits should explain what's going on at each step. "Add ReplicationCommandHandler" and "Merge replication command and client handlers" should mostly just be c+p.

Replaces #7134.


richvdh

well, it seems generally plausible, though I have some Opinions.

I went through commit-by-commit to try to follow it, so some of the comments may no longer apply (though some of the ones that github thinks are outdated certainly aren't).

I'm struggling to visualise the relationship between the classes. How do you fancy sticking an ascii art class relationship diagram in synapse/replication/tcp/__init__.py ?

richvdh · 2020-04-01T09:16:36Z

synapse/replication/tcp/client.py

+
+
+class ReplicationDataHandler:
+    """A replication data handler that calls slave data stores.


could you try to describe what a replication data handler is supposed to do? I assume it receives data from some other class and does stuff with it?

FWIW currently this is subclassed in the generic_worker module, but the intention is to fold all of the handling there into this class to remove the need to subclass.
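To make the split discussed here concrete, the following is a minimal sketch of a command handler that delegates stream data to a separate data handler, so that custom behaviour is injected rather than added by subclassing. The class bodies, the `on_rdata_command` method and the `demo` helper are illustrative assumptions and not the code in this PR; only the two class names come from the diff.

```python
import asyncio
from typing import Any, List, Tuple


class ReplicationDataHandler:
    """Illustrative stand-in: applies rows received for a stream."""

    def __init__(self) -> None:
        self.received: List[Tuple[str, int, List[Any]]] = []

    async def on_rdata(self, stream_name: str, token: int, rows: List[Any]) -> None:
        # A real handler would update local caches or data stores here.
        self.received.append((stream_name, token, rows))


class ReplicationCommandHandler:
    """Illustrative stand-in: receives parsed commands and delegates the data."""

    def __init__(self, data_handler: ReplicationDataHandler) -> None:
        # The data handler is passed in, so callers can swap it without
        # subclassing the command handler.
        self._data_handler = data_handler

    async def on_rdata_command(self, stream_name: str, token: int, row: Any) -> None:
        # Hypothetical entry point for a single RDATA row.
        await self._data_handler.on_rdata(stream_name, token, [row])


def demo() -> List[Tuple[str, int, List[Any]]]:
    data_handler = ReplicationDataHandler()
    command_handler = ReplicationCommandHandler(data_handler)
    asyncio.run(command_handler.on_rdata_command("events", 5, {"event_id": "$a"}))
    return data_handler.received
```

With this shape, a worker that needs different behaviour would pass in its own data handler object instead of overriding methods on the command handler.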


richvdh · 2020-04-01T10:55:30Z

synapse/replication/tcp/handler.py

-        while limited:
-            updates, current_token, limited = await stream.get_updates_since(
-                current_token, cmd.token
+        # We protect catching up with a linearizer in case the replicaiton


replicaiton
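For context, the loop removed in the diff above pages through missed updates until the stream stops reporting `limited`, and the new comment says that catch-up is protected by a linearizer. A rough sketch of that pattern follows. It assumes a per-stream `asyncio.Lock` as a stand-in for Synapse's Linearizer (Synapse itself runs on Twisted), and `FakeStream`, `catch_up` and `demo` are hypothetical names used only for illustration; only the `get_updates_since` call shape is taken from the diff.

```python
import asyncio
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Row = Tuple[int, str]  # (token, payload)


class FakeStream:
    """Hypothetical stream that returns at most `page_size` updates per call."""

    def __init__(self, rows: List[Row], page_size: int = 2) -> None:
        self._rows = rows  # assumed sorted by token
        self._page_size = page_size

    async def get_updates_since(self, from_token: int, upto_token: int):
        matching = [r for r in self._rows if from_token < r[0] <= upto_token]
        page = matching[: self._page_size]
        limited = len(matching) > len(page)
        # If the page was truncated, resume from the last row returned.
        new_token = page[-1][0] if limited else upto_token
        return page, new_token, limited


async def catch_up(
    locks: DefaultDict[str, asyncio.Lock],
    stream_name: str,
    stream: FakeStream,
    current_token: int,
    target_token: int,
) -> Tuple[List[Row], int]:
    collected: List[Row] = []
    # Only one catch-up per stream runs at a time.
    async with locks[stream_name]:
        limited = True
        while limited:
            updates, current_token, limited = await stream.get_updates_since(
                current_token, target_token
            )
            collected.extend(updates)
    return collected, current_token


async def demo() -> Tuple[List[Row], int]:
    locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
    rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
    return await catch_up(locks, "events", FakeStream(rows), 0, 5)
```

Keying the lock on the stream name means a second POSITION for the same stream waits for the first catch-up to finish, while other streams are unaffected.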

erikjohnston · 2020-04-02T12:44:12Z

synapse/replication/tcp/handler.py

+            # We do this by walking the list backwards, first removing any RDATA
+            # rows that are part of an uncompeted batch, then taking rows while
+            # their token is either None or greater than where we've caught up
+            # to.


Actually, am I overthinking this? Should we just pass on all RDATA commands that we get after the POSITION? I.e., clearing pending_batches when we receive a POSITION?

It looks like that's already happening (https://github.com/matrix-org/synapse/pull/7185/files/5104d1673bb4a3be3bd2a655dfde568fda01226c..534bd868e50cd1fe2efd52d4ec0ec92452ac6a6b#diff-bff709ecab561a0aa9f155333fcc1b0dR127), but I think there's a slight problem here:

1. we receive a POSITION and start a catchup
2. RDATA arrives
3. connection drops and we reconnect
4. we receive another POSITION, clear pending_batches, and start waiting for the linearizer
5. more RDATA arrives and gets added to pending_batches
6. the first catchup completes, and we process all the RDATAs which arrived since the second POSITION despite having not caught up to it.

At this point, this stuff feels very much a separate problem to "Move client command handling out of TCP protocol". I've also realised that the current impl is also racy af, which can't be helping with #7206.

Agreed, I'll revert to previous handling and we can think about this separately.
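The interleaving described in this thread can be reproduced with a toy model. The sketch below is not Synapse's implementation: `ToyHandler`, its attributes and `simulate` are hypothetical, an `asyncio.Lock` stands in for the linearizer, and a short sleep stands in for fetching missed rows. It records each queued RDATA token together with the position the handler had actually caught up to when that RDATA was processed.

```python
import asyncio
from typing import List, Tuple


class ToyHandler:
    def __init__(self) -> None:
        self.lock = asyncio.Lock()    # stand-in for the linearizer
        self.pending: List[int] = []  # tokens of RDATA queued during catch-up
        self.caught_up_to = 0
        self.processed: List[Tuple[int, int]] = []  # (rdata token, caught_up_to)

    async def on_position(self, token: int) -> None:
        # A new POSITION discards whatever was queued before it.
        self.pending.clear()
        async with self.lock:
            await asyncio.sleep(0.01)  # pretend to fetch the missed rows
            self.caught_up_to = token
            # Flush everything queued, regardless of which POSITION it followed.
            for rdata_token in self.pending:
                self.processed.append((rdata_token, self.caught_up_to))
            self.pending.clear()

    def on_rdata(self, token: int) -> None:
        self.pending.append(token)


async def simulate() -> List[Tuple[int, int]]:
    handler = ToyHandler()
    first = asyncio.create_task(handler.on_position(10))   # step 1
    await asyncio.sleep(0)   # let the first catch-up take the lock
    handler.on_rdata(11)                                    # step 2
    # step 3: the connection drops and we reconnect
    second = asyncio.create_task(handler.on_position(20))  # step 4
    await asyncio.sleep(0)   # second POSITION clears pending, waits for the lock
    handler.on_rdata(21)                                    # step 5
    await asyncio.gather(first, second)                     # step 6
    return handler.processed
```

Running `asyncio.run(simulate())` yields `[(21, 10)]`: the RDATA with token 21 is processed while the handler has only caught up to position 10, so any rows between 10 and 20 that were not yet fetched would be applied out of order or skipped in this model.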

richvdh · 2020-04-03T13:10:35Z

synapse/replication/tcp/client.py

@@ -73,7 +77,10 @@ def clientConnectionFailed(self, connector, reason):


 class ReplicationDataHandler:
-    """A replication data handler that calls slave data stores.
+    """A replication data handler handles incoming stream updates from replication.


Suggested change

-    """A replication data handler handles incoming stream updates from replication.
+    """Handles incoming stream updates from replication.

richvdh · 2020-04-03T13:21:35Z

synapse/replication/tcp/handler.py

+
+
+class ReplicationCommandHandler:
+    """Handles incoming commands from replication.


it has send_command too, so I think it does more than this.

richvdh · 2020-04-03T13:23:17Z

synapse/replication/tcp/handler.py

+        # The current connection. None if we are currently (re)connecting
+        self._connection = None


I'm assuming it is also None if the process has chosen not to connect to the replication server (eg because it is the nominated master in an old-style master/workers topology)?

Oh, I had forgotten I had split the server merge out to a separate PR and didn't review request it: #7187.

Basically, we're changing this in #7187 to be a list of connections, so the comment is currently accurate and will be fixed up there.

richvdh · 2020-04-03T13:27:44Z

synapse/replication/tcp/protocol.py

@@ -589,6 +587,7 @@ def connectionMade(self):

        # We've now finished connecting to so inform the client handler
        self.handler.update_connection(self)
+        self.handler.finished_connecting()


ok, so... can we get rid of it?

maybe in a future PR, tbf


richvdh

looks fine, modulo (a) the fact that I haven't yet looked at #7187 and (b) we still need to fix the raciness.

erikjohnston · 2020-04-06T08:56:16Z

Thanks for chugging through this epic!

Referenced from the Synapse 1.13.0 (2020-05-19) changelog, Internal Changes: Move client command handling out of TCP protocol. (#7185)


erikjohnston added 8 commits March 31, 2020 13:26

Add replication data handler concept.

699ccf3

This stops us having to subclass ReplicationClientHandler and override methods.

Remove connection closed checks.

5b1e760

This will get replaced with a Linearizer later.

Don't use POSITION to detect "finished connecting".

8f1a878

In a Redis world we won't necessarily get one POSITION per stream at the start of the connection, so we rejig our "streams connecting" logic.

Add ReplicationCommandHandler.

a0063c9

The intention is to move all command processing logic to there, out of protocol, client and resource modules. Currently we only pull out client command processing from protocol and delegate to ReplicationClientHandler.

Merge replication command and client handlers

7e2593b

Fix test

90bd170

Add linearizer to protect stream catchup

6ac1eca

Newsfile

5104d16

erikjohnston force-pushed the erikj/repl_merge_client_server2 branch from 7298ab3 to 5104d16 Compare March 31, 2020 12:57

erikjohnston mentioned this pull request Mar 31, 2020

Move command processing out of transport #7134

Closed

erikjohnston requested a review from a team March 31, 2020 13:27

erikjohnston mentioned this pull request Mar 31, 2020

Move server command handling out of TCP protocol #7187

Merged

richvdh suggested changes Apr 1, 2020


erikjohnston added 11 commits April 1, 2020 16:35

Add diagram off how the classes are laid out

730dbee

Update docstring of ReplicationDataHandler

23de3af

Remove MYPY=False hack

0d6e753

Fixup handler

e16225a

Fixup protocol.py

8503564

Fix up client factory

cf57d56

Fixup tests

dc91879

Add on_remote_server_up to ReplicationDataHandler

bf99c8e

Fixup admin handler

1ebfa39

Only accept RDATA commands if we've caught up with stream.

ca9778c

Correctly handle RDATA that comes in while we catch up with a stream

534bd86

erikjohnston requested a review from richvdh April 1, 2020 19:34

erikjohnston commented Apr 2, 2020


richvdh reviewed Apr 3, 2020


erikjohnston added 2 commits April 3, 2020 15:44

Fix up comments

99e4a99

Revert to previous (racey) handling of POSITION and RDATA, and move into linearizer

4873583

erikjohnston force-pushed the erikj/repl_merge_client_server2 branch from 2eb4141 to 4873583 Compare April 3, 2020 14:51

erikjohnston requested a review from richvdh April 3, 2020 15:12

richvdh approved these changes Apr 6, 2020


erikjohnston merged commit 5016b16 into develop Apr 6, 2020

erikjohnston deleted the erikj/repl_merge_client_server2 branch April 6, 2020 08:58

heftig mentioned this pull request May 19, 2020

GenericWorkerReplicationHandler has no attribute 'send_federation_ack' #7535

Closed
