
Split fed worker & sender config means destination retry sched doesn't invalidate when you receive traffic #3798

Closed
ara4n opened this issue Sep 5, 2018 · 8 comments
Assignees
Labels
A-Federation, A-Workers (Problems related to running Synapse in Worker Mode (or replication))

Comments

@ara4n
Member

ara4n commented Sep 5, 2018

No description provided.

@erikjohnston
Member

So this is a general problem for any cache that gets invalidated on the workers, since only invalidations that happen on master get correctly propagated to the workers. However, currently there should be very few caches that get invalidated on the workers.

The destination retry cache was allowed to be invalidated on the workers on the assumption that enough federation traffic would happen on master that the cache would be correctly invalidated fairly quickly. Now that we've moved more traffic off of master this is no longer true.

@hawkowl has a number of suggestions on the way forward:

  1. Remove the cache if you're a worker
  2. Make the cache have a TTL
  3. If you're a worker, have the cache backed by something like memcached

None of these is hugely palatable atm: 1) would unacceptably increase DB load, 2) feels icky (it may work for the current situation, but it's a bit of a footgun), and 3) sounds like a fair amount of effort in terms of both implementation and operations.
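
(For illustration, a very rough sketch of what option 2 could look like; the decorator and its interface below are made up and are not Synapse's actual cache machinery:)

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=300):
    """Crude per-arguments cache whose entries expire after ttl_seconds.

    Illustrative only: Synapse's real caches are the @cached descriptors,
    which have no TTL; this just shows the shape option 2 would take.
    """
    def decorator(func):
        entries = {}  # args -> (expiry, value)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = entries.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # entry is still fresh
            value = func(*args)
            entries[args] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator

# Hypothetical use on the problematic lookup:
@ttl_cache(ttl_seconds=300)
def get_destination_retry_timings(destination):
    ...  # would hit the destinations table here
```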

Another alternative may be to use Postgres's inbuilt LISTEN/NOTIFY pubsub mechanism to communicate cache invalidations to the master. This is a bit annoying as it'll take some implementing and is asymmetric with the existing master-to-worker invalidation, but it has the advantage over option 3 of being less work and less ongoing operational overhead.
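
(Hand-wavy sketch of what the LISTEN/NOTIFY route could look like using psycopg2; the channel name, payload format and the `caches` lookup are all invented for illustration:)

```python
import select
import psycopg2

# Worker side: announce an invalidation.  NOTIFY payloads are short strings,
# so "cache_name|key" is used here as a made-up wire format.
send_conn = psycopg2.connect("dbname=synapse")
send_conn.autocommit = True  # notifications are delivered on commit

def send_invalidation(cache_name, key):
    with send_conn.cursor() as cur:
        cur.execute(
            "SELECT pg_notify(%s, %s)",
            ("cache_invalidation", "%s|%s" % (cache_name, key)),
        )

# Master side: LISTEN on the channel and drop local entries as they arrive.
listen_conn = psycopg2.connect("dbname=synapse")
listen_conn.autocommit = True
with listen_conn.cursor() as cur:
    cur.execute("LISTEN cache_invalidation")

def poll_invalidations(caches, timeout=1.0):
    # `caches` is an imaginary dict of name -> cache object with an
    # invalidate() method, standing in for the real cache wrappers.
    if select.select([listen_conn], [], [], timeout) == ([], [], []):
        return
    listen_conn.poll()
    while listen_conn.notifies:
        note = listen_conn.notifies.pop(0)
        cache_name, key = note.payload.split("|", 1)
        caches[cache_name].invalidate((key,))
```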

@matrix-org/synapse-core Thoughts?

@ara4n
Member Author

ara4n commented Sep 12, 2018

For the federation mess, surely we just put a TTL on the cache for now - 5 mins or whatever - rather than having federation blackholed indefinitely?

@ara4n
Member Author

ara4n commented Sep 12, 2018

(LISTEN/NOTIFY could be fun as a future adventure tho)

@erikjohnston
Member

Possibly, but as I said, I'm disinclined to keep putting footguns in our code, given that this sort of hacky solution is exactly what led to this mess. As a stopgap measure it may be the best thing to do right now, but I'm not comfortable sticking a band-aid on this and moving on.

@hawkowl added the federation-meltdown and A-Workers labels on Sep 14, 2018
@hawkowl
Contributor

hawkowl commented Sep 14, 2018

Related to cache backing: #2123

@richvdh
Member

richvdh commented Sep 17, 2018

any reason not to communicate the cache invalidation from worker to master via the existing replication channels?

[fwiw if we do end up deciding to build something new, I'd much rather we use something like memcached than postgres pubsub, for many of the reasons mentioned in #2123, but really just because we shouldn't be building our own wheel]
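
(For concreteness, a rough sketch of the shared-memcached idea, using pymemcache with a made-up key scheme; since every process reads from and deletes against the same backing store, there is no invalidation traffic to propagate at all:)

```python
import json
from pymemcache.client.base import Client

# One memcached instance shared by master and all workers; deleting a key
# here invalidates it for every process at once.  Key scheme is invented.
mc = Client(("localhost", 11211))

def get_destination_retry_timings(destination, db_lookup):
    key = "retry_timings:" + destination
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)
    value = db_lookup(destination)  # fall through to the database
    mc.set(key, json.dumps(value), expire=300)
    return value

def invalidate_destination_retry_timings(destination):
    # Called by whichever process updates the retry timings.
    mc.delete("retry_timings:" + destination)
```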

@erikjohnston
Member

> any reason not to communicate the cache invalidation from worker to master via the existing replication channels?

The worker -> master communication is done via HTTP, which is actually probably fine if we batch it up for get_destination_retry_timings, though I wouldn't necessarily want to rely on it for other caches.
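
(Roughly what batching could look like, ignoring that Synapse is Twisted-based; the endpoint path and helpers below are invented and are not Synapse's real replication API:)

```python
import threading
import requests

_pending = set()
_lock = threading.Lock()

def queue_invalidation(destination):
    """Record that this worker changed the retry timings for `destination`."""
    with _lock:
        _pending.add(destination)

def flush_invalidations(master_url="http://localhost:8008"):
    """Called periodically (e.g. once a second) to ship the batch to master."""
    with _lock:
        batch = list(_pending)
        _pending.clear()
    if not batch:
        return
    requests.post(
        master_url + "/_synapse/replication/invalidate_retry_cache",
        json={"destinations": batch},
        timeout=10,
    )
```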

> [fwiw if we do end up deciding to build something new, I'd much rather we use something like memcached than postgres pubsub, for many of the reasons mentioned in #2123, but really just because we shouldn't be building our own wheel]

Historically the problem is that external cache services just don't match the use cases of our caches, in particular around CPU usage and latency (I believe there are places that assume lookups will be quick because things are in the cache). While I agree that we shouldn't be reinventing the wheel, requiring memcached/redis/etc just as a communication layer feels a bit like overkill.

OTOH, it might be useful for the different problem of reducing DB load.

@erikjohnston self-assigned this on Sep 19, 2018
@erikjohnston
Member

After some discussion, the conclusion was that (for now) we'll just remove the cache for get_destination_retry_timings, since on matrix.org it's only hit at ~50-100Hz.

erikjohnston added a commit that referenced this issue Sep 19, 2018
Currently we rely on the master to invalidate this cache promptly.
However, after having moved most federation endpoints off of master this
no longer happens, causing outbound federation to get blackholed.

Fixes #3798
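
(The gist of the change, sketched very loosely and not the actual diff: drop the cache descriptor from the lookup so every call is a plain DB read and no cross-process invalidation is needed.)

```python
class TransactionStore:
    # Before: cached, so a worker that updates the retry timings leaves
    # master (and the other workers) holding a stale entry indefinitely.
    #
    #     @cached()
    #     def get_destination_retry_timings(self, destination):
    #         ...
    #
    # After: uncached; every call goes to the database, which is tolerable
    # at the ~50-100Hz call rate mentioned above.
    def get_destination_retry_timings(self, destination):
        # e.g. SELECT retry_last_ts, retry_interval FROM destinations
        #      WHERE destination = ?
        ...
```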