Split fed worker & sender config means destination retry sched doesn't invalidate when you receive traffic #3798
So this is a general problem for any cache that gets invalidated on the workers, since only invalidations that happen on master get correctly propagated to the workers. However, currently there should be very few caches that get invalidated on the workers. The destination retry cache was allowed to be invalidated on the workers on the assumption that enough federation traffic would happen on master that the cache would be correctly invalidated fairly quickly. Now that we've moved more traffic off of master this is no longer true. @hawkowl has a number of suggestions on the way forward:
None of which are hugely palatable atm: 1) would unacceptably increase DB load, 2) feels icky and may work for the current situation but feels like a bit of a footgun, and 3) sounds like a fair amount of effort in terms of implementation and operations. Another alternative may be to use postgres's inbuilt LISTEN/NOTIFY. @matrix-org/synapse-core Thoughts?
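As a rough illustration of the LISTEN/NOTIFY idea (not anything that exists in Synapse), a process that changes a row could publish an invalidation message, and every other process could listen for it and drop its local cache entry. The channel name, payload format and psycopg2 wiring below are all assumptions made for the sketch.

```python
# Sketch only: Postgres LISTEN/NOTIFY as a cache-invalidation bus between
# processes. The channel name and payload format are made up for illustration.
import select

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=synapse")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)


def publish_invalidation(cache_name, key):
    """Called by whichever process (master or worker) just changed the row."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pg_notify(%s, %s)",
            ("cache_invalidation", "{}:{}".format(cache_name, key)),
        )


def listen_for_invalidations(on_invalidate):
    """Run by every process holding the cache; drains notifications in a loop."""
    with conn.cursor() as cur:
        cur.execute("LISTEN cache_invalidation")
    while True:
        # Block until the connection's socket is readable (or 5s passes).
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            cache_name, _, key = notify.payload.partition(":")
            on_invalidate(cache_name, key)
```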
For the federation mess surely we just put a TTL on the cache for now - 5 mins or whatever - rather than having the federation blackholed indefinitely?
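A TTL along those lines could be as simple as the sketch below (this is not the actual Synapse cache code; the 5-minute default just mirrors the number suggested above). The point is that a stale entry can only be wrong for `max_age` seconds before the next read falls through to the database.

```python
# Minimal sketch of a TTL cache: entries expire after max_age seconds, so a
# missed invalidation only blackholes a destination for a bounded time.
import time


class TTLCache:
    def __init__(self, max_age=300):  # 5 minutes, as suggested above
        self._max_age = max_age
        self._entries = {}  # key -> (inserted_at, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        inserted_at, value = entry
        if time.monotonic() - inserted_at > self._max_age:
            # Too old: drop it so the caller falls back to the database.
            del self._entries[key]
            return default
        return value

    def set(self, key, value):
        self._entries[key] = (time.monotonic(), value)

    def invalidate(self, key):
        self._entries.pop(key, None)
```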
(LISTEN/NOTIFY could be fun as a future adventure tho)
Possibly, but like I said I'm disinclined to continue putting footguns in our code given that this sort of hacky solution is exactly what has led to this mess. As a stopgap measure it may be the best thing to do right now, but I'm not comfortable sticking a band-aid on this and moving on.
Related to cache backing: #2123
any reason not to communicate the cache invalidation from worker to master via the existing replication channels? [fwiw if we do end up deciding to build something new, I'd much rather we use something like memcached than postgres pubsub, for many of the reasons mentioned in #2123, but really just because we shouldn't be building our own wheel]
The worker -> master communication is done via HTTP, which is actually probably fine if we batch it up for …
Historically the problem is that external cache services just don't match the use cases of our caches, in particular CPU usage and latency (I believe there are places that assume that things are probably going to be quick due to things being in cache). While I agree that we shouldn't be reinventing the wheel, requiring memcached/redis/etc just for a communication layer feels a bit overkill. OTOH, it might be useful for the different problem of reducing DB load.
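A rough sketch of the batched worker -> master approach (this is not Synapse's actual replication API; the endpoint path, class name and one-second window are invented for illustration): the worker queues invalidations locally and flushes them to master over HTTP, and master would then re-broadcast them down the existing replication stream so other workers invalidate too.

```python
# Sketch only: batch up cache invalidations on a worker and POST them to
# master. Endpoint path and flush interval are assumptions for illustration.
import threading

import requests


class BatchedInvalidationSender:
    def __init__(self, master_url, flush_interval=1.0):
        self._master_url = master_url
        self._flush_interval = flush_interval
        self._pending = []
        self._lock = threading.Lock()
        self._timer = None

    def invalidate(self, cache_name, key):
        """Called on the worker whenever it invalidates a local cache entry."""
        with self._lock:
            self._pending.append({"cache": cache_name, "key": key})
            if self._timer is None:
                self._timer = threading.Timer(self._flush_interval, self._flush)
                self._timer.start()

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, []
            self._timer = None
        if batch:
            # Master would apply these and push them down the replication
            # stream so every other worker invalidates as well.
            requests.post(
                self._master_url + "/_synapse/replication/invalidate_cache",
                json={"invalidations": batch},
                timeout=10,
            )
```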
After some discussion, the conclusion was that (for now) we'll just remove the cache for …
Currently we rely on the master to invalidate this cache promptly. However, after having moved most federation endpoints off of master this no longer happens, causing outbound federation to get blackholed. Fixes #3798
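In outline, the fix described above amounts to dropping the process-local cache on the retry-timing lookup so every read goes to the database. The sketch below shows that shape only: the method name, table layout and SQLite stand-in are assumptions, not Synapse's real storage layer.

```python
# Sketch of the shape of the fix (not the actual Synapse change): no
# process-local cache on the lookup, so workers always see the latest
# retry schedule. Schema and names here are illustrative stand-ins.
import sqlite3


class TransactionStore:
    def __init__(self, db_path="homeserver.db"):
        self._conn = sqlite3.connect(db_path)

    # Before: the lookup sat behind a process-local caching decorator, which
    # stayed stale on workers because their invalidations never propagated.
    #
    # After: no cache, just a direct query on every call.
    def get_destination_retry_timings(self, destination):
        row = self._conn.execute(
            "SELECT retry_last_ts, retry_interval FROM destinations"
            " WHERE destination = ?",
            (destination,),
        ).fetchone()
        if row is None:
            return None
        return {"retry_last_ts": row[0], "retry_interval": row[1]}
```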