This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Tests for IPv6 bugs / redis replication over IPv6 #11508

Closed

Conversation

evilham
Contributor

@evilham evilham commented Dec 5, 2021

This will help close some long-standing issues:
("X" means we are adding a test for it)
[ ] AS #4092
[ ] metrics EP #6644
[X] replication #7695
[X] smtp #7720
[X] redis #10694

In order to run these IPv6-related tests, the test suite must run with:
env SYNAPSE_POSTGRES=YES DEBUG_IPv6=YES trial tests

Sponsored by: ungleich.ch
Signed-off-by: @telmich Nico Schottelius foss@ungleich.ch

@evilham evilham requested a review from a team as a code owner December 5, 2021 11:45
Member

@richvdh richvdh left a comment


I appreciate the effort, but I don't really see the value of adding tests which are disabled by default. They'll just bitrot until they get removed again (see https://github.com/matrix-org/synapse/pull/11495/files#diff-cda5424a49ddae4fe5181900d1b607a9daf5c9afa539ed1fdb01622de069e499L198 for a recent example).

@evilham
Contributor Author

evilham commented Dec 6, 2021

Hey @richvdh, thanks for taking a look.

Maybe I should have elaborated a tad more:
the motivation @telmich and I had for starting with these tests is that they provide a very simple way to check that these known bugs are fixed without breaking Synapse's pipeline, and an easier path to fixing the issue with replication workers (the main motivation for this), which is indeed more involved than it looks at first.
Note that the tests are not really new: they re-use the code from existing tests, just with different fake DNS entries, and the changes in tests/server.py mean that this whole class of issues will be caught in the future, so that is also a win.

Indeed, I'd recommend that @xfk abandon #10717 (comment) and re-open it with just the fix for the SMTP bug (#7720) plus the corresponding test enabled by default; that way we both check that the bug is fixed and guard against regressions.
If they don't get to it in a reasonable time, I'll see that it happens myself, since the original fix was fairly simple and already signed off.

Also, based on your comment on #10717 (comment), I started re-working the replication bits, which are the most complex of these bugs; I will need to ask a couple of questions to finish that.

@richvdh
Member

richvdh commented Dec 6, 2021

Yes, having the tests is very useful, and your efforts in contributing them are very much appreciated. It's just that from a practical point of view I'd much rather we combine the fix and the tests into the same PR - so for example let's have a PR which fixes #7720 and lands the new test.

Perhaps we can mark this PR as a draft for now so that people can use it as inspiration for writing those PRs?

@evilham evilham marked this pull request as draft December 6, 2021 11:05
@evilham
Contributor Author

evilham commented Dec 7, 2021

OK, let's redirect this PR towards fixing #10694 (and #7695 as a particular case). If, in the meantime, @xfk wants to open a PR with the fix for #7720 and the corresponding (passing and always-enabled) test from this PR, that'd be great.

Based on previous discussion and particularly this comment #10717 (comment), I've been looking into refactoring the communication with Redis using a ClientService, which proves more challenging than I first expected.

Here are some things I've noticed; please correct me if any of these is wrong, since understanding this correctly is quite important:

  • Redis is only ever used through synapse.server.HomeServer.get_outbound_redis_connection and RedisDirectTcpReplicationClientFactory
  • The outbound_redis_connection is used in the synapse.replication.tcp.external_cache.ExternalCache which is using the RedisProtocol directly.
  • This ExternalCache is used in synapse.federation.sender.FederationSender in a subscription, read-only fashion, but it is used in a write fashion by synapse.handlers.message.EventCreationHandler. This is a tad unexpected and confusing.
  • It is also used by the synapse.replication.tcp.handler.ReplicationCommandHandler, passing it to synapse.replication.tcp.redis.RedisDirectTcpReplicationClientFactory.
  • RedisDirectTcpReplicationClientFactory creates its own connection to Redis, which is used in tandem with the outbound_redis_connection: it uses the outbound_redis_connection in a write fashion to send commands, while its own connection is used in a subscription, read-only fashion.

Now, the doubts:

  • synapse.replication.tcp.redis.lazyConnection (used for the outbound_redis_connection) claims to return a RedisProtocol, but it really returns a txredisapi.ConnectionHandler (see: return factory.handler). Which... yes and no: it wraps the behaviour of a RedisProtocol, but it also does some magic regarding connection pools. I don't think this should happen.
  • There is a periodic ping that claims to keep the connection alive; I wonder whether this would still be needed if we switched to ClientService. How exactly did txredisapi "stop"? From my understanding of Twisted, the RedisProtocol instance should be treated as something that is bound to change: if the connection is lost, you should get another one, with the proper factories and all that, ready to use. Maybe the fact that txredisapi exposes weird objects like ConnectionHandler led to it being treated as something semi-persistent, and to these issues?

My overall plan to make this happen is:

  • Stay away from txredisapi.ConnectionHandler and txredisapi.RedisFactory, they are trying to do way too much and are inflexible.
  • Create our own SimpleRedisFactory (which might be upstreamed to txredisapi in the future, but in the meantime should live in synapse as well due to packaging) which does none of the reconnection and pool magic.
  • Use this SimpleRedisFactory with a RedisClientService to have free connections with exponential backoff and reset of the timers.
  • Never pass the RedisProtocol from synapse.server.HomeServer (or ever), but instead the RedisClientService which should have ways of letting users of the service know when the connection is lost / recovered.
  • RedisDirectTcpReplicationClientFactory should use two such RedisClientService objects, the one passed by the HomeServer and its own connection service.

Does this sound OK? Am I missing something?
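As a side note on the "exponential backoff and reset of the timers" step: that is roughly what ClientService's retry policy (twisted.application.internet.backoffPolicy) computes for us. A simplified pure-Python sketch of the idea (illustrative names; the real policy also adds jitter and caps the attempt count):

```python
def backoff_policy(initial_delay=1.0, max_delay=60.0, factor=1.5):
    """Return a function mapping the number of consecutive failed
    connection attempts to a reconnection delay, capped at max_delay.
    ClientService consults this after each failure and starts again
    from zero attempts once a connection succeeds."""
    def delay_for(failed_attempts):
        return min(initial_delay * (factor ** failed_attempts), max_delay)
    return delay_for

policy = backoff_policy()
# Delays grow geometrically (1.0, 1.5, 2.25, ...) until the cap is hit.
delays = [policy(n) for n in range(4)]
```

This is exactly the behaviour we currently re-implement by hand around txredisapi's reconnection logic, which is part of why delegating it to ClientService looks attractive.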

@evilham evilham changed the title Add optional tests documenting misbehaviour in IPv6-only scenarios Tests for IPv6 bugs / redis replication over IPv6 Dec 7, 2021
@richvdh
Member

richvdh commented Dec 7, 2021

  • Redis is only ever used through synapse.server.HomeServer.get_outbound_redis_connection and RedisDirectTcpReplicationClientFactory
  • The outbound_redis_connection is used in the synapse.replication.tcp.external_cache.ExternalCache which is using the RedisProtocol directly.
  • This ExternalCache is used in synapse.federation.sender.FederationSender in a subscription, read-only fashion, but it is used in a write fashion by synapse.handlers.message.EventCreationHandler. This is a tad unexpected and confusing.
  • It [outbound_redis_connection] is also used by the synapse.replication.tcp.handler.ReplicationCommandHandler, passing it to synapse.replication.tcp.redis.RedisDirectTcpReplicationClientFactory.
  • RedisDirectTcpReplicationClientFactory creates its own connection to Redis, which is used in tandem with the outbound_redis_connection: it uses the outbound_redis_connection in a write fashion to send commands, while its own connection is used in a subscription, read-only fashion.

This is all correct.

This [use of ExternalCache] is a tad unexpected and confusing.

I don't think you need to worry too much about how ExternalCache works: for our purposes it's just a class which needs access to a redis connection where it can send commands and receive the replies. But I'm not sure which part is unexpected: the fact that EventCreationHandler writes to the cache? What would be the use of a cache which is never populated?
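To make the writer/reader split concrete: a cache is only useful if one component populates it and another reads it back. A dict-backed toy stand-in (hypothetical names; the real ExternalCache serialises values and sends them over the outbound Redis connection):

```python
class ToyExternalCache:
    """Toy stand-in for synapse's ExternalCache, backed by a dict
    instead of Redis, just to show the write/read split."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        # In synapse, EventCreationHandler plays the writer role.
        self._store[key] = value

    def get(self, key):
        # In synapse, FederationSender is (mostly) a reader.
        return self._store.get(key)

cache = ToyExternalCache()
cache.set("event_txn:abc", {"event_id": "$xyz"})
value = cache.get("event_txn:abc")
```

For the refactoring discussed here, all that matters is that both roles need a Redis connection on which commands can be sent and replies received.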

Now, the doubts:

  • synapse.replication.tcp.redis.lazyConnection (used for the outbound_redis_connection) claims to return a RedisProtocol, but it really returns a txredisapi.ConnectionHandler. Which... yes and no: it wraps the behaviour of a RedisProtocol, but it also does some magic regarding connection pools. I don't think this should happen.

Maybe. It would certainly be nice if there were a type definition for the interface exposed by RedisProtocol, so that lazyConnection can say "here is a thing which behaves the same as a RedisProtocol, but isn't one". But txredisapi doesn't define such a class.

  • There is a periodic ping that claims to keep the connection alive,

Did you link to the right line here? That's nothing to do with pings, but rather about whether txredisapi.RedisFactory reconnects after a connection failure. But I think it is specific to RedisFactory and not required for ClientService.

My overall plan to make this happen is:

  • Stay away from txredisapi.ConnectionHandler and txredisapi.RedisFactory, they are trying to do way too much and are inflexible.

+1

  • Create our own SimpleRedisFactory (which might be upstreamed to txredisapi in the future, but in the meantime should live in synapse as well due to packaging) which does none of the reconnection and pool magic.

You may find that a factory created with Factory.forProtocol is sufficient here.
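For context, Factory.forProtocol is a small convenience: it returns a factory whose buildProtocol instantiates the given protocol class, with none of the reconnection machinery of RedisFactory. A simplified pure-Python sketch of what it does (SketchFactory and DummyProtocol are illustrative stand-ins; the real implementation lives in twisted.internet.protocol and also forwards constructor arguments):

```python
class SketchFactory:
    """Simplified sketch of twisted.internet.protocol.Factory, showing
    only the forProtocol/buildProtocol machinery."""

    protocol = None

    @classmethod
    def forProtocol(cls, protocol_class):
        factory = cls()
        factory.protocol = protocol_class
        return factory

    def buildProtocol(self, addr):
        # Instantiate the configured protocol and link it back to us.
        proto = self.protocol()
        proto.factory = self
        return proto

class DummyProtocol:
    """Stands in for txredisapi.RedisProtocol in this sketch."""

factory = SketchFactory.forProtocol(DummyProtocol)
proto = factory.buildProtocol(("127.0.0.1", 6379))
```

So a hand-rolled SimpleRedisFactory may not be needed at all: Factory.forProtocol(RedisProtocol) already yields a factory with no pooling or reconnection magic.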

  • Use this SimpleRedisFactory with a RedisClientService to have free connections with exponential backoff and reset of the timers.

Are you intending to make RedisClientService a derived class of ClientService? I would avoid doing so - either wrap the ClientService inside another class, or just pass the ClientService around.

  • Never pass the RedisProtocol from synapse.server.HomeServer (or ever), but instead the RedisClientService which should have ways of letting users of the service know when the connection is lost / recovered.

I'm not sure if you actually need to know that? Could you just call ClientService.whenConnected to get the current connection?
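The pattern suggested here is to fetch the current protocol at each use site instead of caching it, so a stale instance is never held across a reconnect. With a real ClientService that would be awaiting whenConnected() before each command; a runnable sketch with stand-in classes (FakeClientService, FakeRedisProtocol and send_get are all hypothetical names for illustration):

```python
import asyncio

class FakeClientService:
    """Stand-in for twisted.application.internet.ClientService that is
    always 'connected' to one fake protocol. The real whenConnected()
    returns a Deferred firing with the currently connected protocol."""

    def __init__(self, protocol):
        self._protocol = protocol

    async def whenConnected(self):
        return self._protocol

class FakeRedisProtocol:
    """Stand-in for a connected Redis protocol instance."""

    def __init__(self):
        self.store = {"k": "v"}

    async def get(self, key):
        return self.store.get(key)

async def send_get(service, key):
    # Fetch the *current* connection every time, rather than holding
    # on to a protocol instance that may be stale after a reconnect.
    proto = await service.whenConnected()
    return await proto.get(key)

service = FakeClientService(FakeRedisProtocol())
result = asyncio.run(send_get(service, "k"))
```

Under this scheme, callers never need to be notified of disconnects: whenConnected simply does not resolve until a connection is available again.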

  • RedisDirectTcpReplicationClientFactory should use two such RedisClientService objects, the one passed by the HomeServer and its own connection service.

Does this sound OK? Am I missing something?

I think it sounds about right. To be honest, I'm struggling to figure out how it all fits together when looking at it in the abstract.

@richvdh
Member

richvdh commented Apr 21, 2022

@evilham: are you still interested in working on this?

@richvdh richvdh added the X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged label Apr 21, 2022
@evilham
Contributor Author

evilham commented Apr 21, 2022

Yep, I want to get this fixed and it's on my TODO, but things have gotten in the way >,<. Will try to re-prioritise, thanks for the ping!

@clokep
Member

clokep commented Jan 25, 2023

This has a bunch of conflicts that need to be solved; I wonder if a more reasonable way to handle this would be to:

  • Close this PR;
  • Take the tests for each item you've listed above (e.g. SMTP) and open a new PR with just those tests;
  • Fix support for whatever that item tests in the same PR;
  • Review it!

Hopefully this is pretty doable and folks can crib off this.
