
Split fed worker & sender config means destination retry sched doesn't invalidate when you receive traffic #3798

Closed
ara4n opened this issue Sep 5, 2018 · 8 comments
Assignees
Labels
A-Federation, A-Workers (Problems related to running Synapse in Worker Mode (or replication))

Comments

@ara4n
Member

ara4n commented Sep 5, 2018

No description provided.

@erikjohnston
Member

So this is a general problem for any cache that gets invalidated on the workers, since only invalidations that happen on master get correctly propagated to the workers. However, currently there should be very few caches that get invalidated on the workers.

The destination retry cache was allowed to be invalidated on the workers on the assumption that enough federation traffic would happen on master that the cache would be correctly invalidated fairly quickly. Now that we've moved more traffic off of master this is no longer true.

@hawkowl has a number of suggestions on the way forward:

  1. Remove the cache if you're a worker
  2. Make the cache have a TTL
  3. If you're a worker, have the cache backed by something like memcached

None of these is hugely palatable atm: 1) would unacceptably increase DB load, 2) feels icky (it may work for the current situation, but it's a bit of a footgun), and 3) sounds like a fair amount of effort in terms of both implementation and operations.
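
(For illustration, a very rough sketch of what option 2 could look like; the decorator and its interface below are made up and are not Synapse's actual cache machinery:)

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=300):
    """Crude per-arguments cache whose entries expire after ttl_seconds.

    Illustrative only: Synapse's real caches are the @cached descriptors,
    which have no TTL; this just shows the shape option 2 would take.
    """
    def decorator(func):
        entries = {}  # args -> (expiry, value)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = entries.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # entry is still fresh
            value = func(*args)
            entries[args] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator

# Hypothetical use on the problematic lookup:
@ttl_cache(ttl_seconds=300)
def get_destination_retry_timings(destination):
    ...  # would hit the destinations table here
```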

Another alternative may be to use Postgres's inbuilt LISTEN/NOTIFY pubsub mechanism to communicate cache invalidations to the master. This is a bit annoying as it'll take some implementing and is asymmetric with the existing master-to-worker invalidation, but it has the advantage over option 3 of being less work and less ongoing operational overhead.
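
(Hand-wavy sketch of what the LISTEN/NOTIFY route could look like using psycopg2; the channel name, payload format and the `caches` lookup are all invented for illustration:)

```python
import select
import psycopg2

# Worker side: announce an invalidation.  NOTIFY payloads are short strings,
# so "cache_name|key" is used here as a made-up wire format.
send_conn = psycopg2.connect("dbname=synapse")
send_conn.autocommit = True  # notifications are delivered on commit

def send_invalidation(cache_name, key):
    with send_conn.cursor() as cur:
        cur.execute(
            "SELECT pg_notify(%s, %s)",
            ("cache_invalidation", "%s|%s" % (cache_name, key)),
        )

# Master side: LISTEN on the channel and drop local entries as they arrive.
listen_conn = psycopg2.connect("dbname=synapse")
listen_conn.autocommit = True
with listen_conn.cursor() as cur:
    cur.execute("LISTEN cache_invalidation")

def poll_invalidations(caches, timeout=1.0):
    # `caches` is an imaginary dict of name -> cache object with an
    # invalidate() method, standing in for the real cache wrappers.
    if select.select([listen_conn], [], [], timeout) == ([], [], []):
        return
    listen_conn.poll()
    while listen_conn.notifies:
        note = listen_conn.notifies.pop(0)
        cache_name, key = note.payload.split("|", 1)
        caches[cache_name].invalidate((key,))
```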

@matrix-org/synapse-core Thoughts?

@ara4n
Member Author

ara4n commented Sep 12, 2018

For the federation mess, surely we just put a TTL on the cache for now - 5 mins or whatever - rather than having federation blackholed indefinitely?

@ara4n
Member Author

ara4n commented Sep 12, 2018

(LISTEN/NOTIFY could be fun as a future adventure tho)

@erikjohnston
Member

Possibly, but as I said, I'm disinclined to keep putting footguns in our code, given that this sort of hacky solution is exactly what led to this mess. As a stopgap measure it may be the best thing to do right now, but I'm not comfortable sticking a band-aid on this and moving on.

@hawkowl added the federation-meltdown and A-Workers labels on Sep 14, 2018
@hawkowl
Contributor

hawkowl commented Sep 14, 2018

Related to cache backing: #2123

@richvdh
Member

richvdh commented Sep 17, 2018

any reason not to communicate the cache invalidation from worker to master via the existing replication channels?

[fwiw if we do end up deciding to build something new, I'd much rather we use something like memcached than postgres pubsub, for many of the reasons mentioned in #2123, but really just because we shouldn't be building our own wheel]
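
(For concreteness, a rough sketch of the shared-memcached idea, using pymemcache with a made-up key scheme; since every process reads from and deletes against the same backing store, there is no invalidation traffic to propagate at all:)

```python
import json
from pymemcache.client.base import Client

# One memcached instance shared by master and all workers; deleting a key
# here invalidates it for every process at once.  Key scheme is invented.
mc = Client(("localhost", 11211))

def get_destination_retry_timings(destination, db_lookup):
    key = "retry_timings:" + destination
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)
    value = db_lookup(destination)  # fall through to the database
    mc.set(key, json.dumps(value), expire=300)
    return value

def invalidate_destination_retry_timings(destination):
    # Called by whichever process updates the retry timings.
    mc.delete("retry_timings:" + destination)
```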

@erikjohnston
Member

> any reason not to communicate the cache invalidation from worker to master via the existing replication channels?

The worker -> master communication is done via HTTP, which is actually probably fine if we batch it up for get_destination_retry_timings, though I wouldn't necessarily want to rely on it for other caches.
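
(Roughly what batching could look like, ignoring that Synapse is Twisted-based; the endpoint path and helpers below are invented and are not Synapse's real replication API:)

```python
import threading
import requests

_pending = set()
_lock = threading.Lock()

def queue_invalidation(destination):
    """Record that this worker changed the retry timings for `destination`."""
    with _lock:
        _pending.add(destination)

def flush_invalidations(master_url="http://localhost:8008"):
    """Called periodically (e.g. once a second) to ship the batch to master."""
    with _lock:
        batch = list(_pending)
        _pending.clear()
    if not batch:
        return
    requests.post(
        master_url + "/_synapse/replication/invalidate_retry_cache",
        json={"destinations": batch},
        timeout=10,
    )
```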

> [fwiw if we do end up deciding to build something new, I'd much rather we use something like memcached than postgres pubsub, for many of the reasons mentioned in #2123, but really just because we shouldn't be building our own wheel]

Historically the problem is that external cache services just don't match the use cases of our caches, in particular around CPU usage and latency (I believe there are places that assume lookups will be quick because things are in the cache). While I agree that we shouldn't be reinventing the wheel, requiring memcached/redis/etc just as a communication layer feels a bit like overkill.

OTOH, it might be useful for the different problem of reducing DB load.

@erikjohnston self-assigned this on Sep 19, 2018
@erikjohnston
Member

After some discussion, the conclusion was that (for now) we'll just remove the cache for get_destination_retry_timings, since on matrix.org it's only hit at ~50-100Hz.

erikjohnston added a commit that referenced this issue Sep 19, 2018
Currently we rely on the master to invalidate this cache promptly.
However, after having moved most federation endpoints off of master this
no longer happens, causing outbound federation to get blackholed.

Fixes #3798
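
(The gist of the change, sketched very loosely and not the actual diff: drop the cache descriptor from the lookup so every call is a plain DB read and no cross-process invalidation is needed.)

```python
class TransactionStore:
    # Before: cached, so a worker that updates the retry timings leaves
    # master (and the other workers) holding a stale entry indefinitely.
    #
    #     @cached()
    #     def get_destination_retry_timings(self, destination):
    #         ...
    #
    # After: uncached; every call goes to the database, which is tolerable
    # at the ~50-100Hz call rate mentioned above.
    def get_destination_retry_timings(self, destination):
        # e.g. SELECT retry_last_ts, retry_interval FROM destinations
        #      WHERE destination = ?
        ...
```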