# Split out mutable event content from event cache into new caches that are keyed by room ID #13916
Comments
Questions: …

---
I've finished a second write up which should hopefully be clearer and answer the above questions.

## Proposal

Currently we have both a …

```python
@attr.s(slots=True, auto_attribs=True)
class EventCacheEntry:
    event: EventBase
    redacted_event: Optional[EventBase]
```

Today, expiration of an event occurs when the undocumented …

Therefore, we define a new cache:

```python
class EventMetadataCacheEntry:
    redacted_event: Optional[EventBase]
    rejected_reason: Optional[str]
```

If …

## Invalidation scenarios

In each invalidation scenario today we have direct access to the room ID, therefore we can invalidate all three caches when an event changes:

- Event redaction
- Event expiration or censorship
- Event rejection (and un-rejection)
- Persisting a new event
- Partial room purge
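Since every scenario above has the room ID to hand, a room ID -> cached event IDs index would let us invalidate all the caches either per-event or per-room. The following is a hypothetical sketch only — none of these names are Synapse's real classes, and Synapse's `_event_ref` is actually a `WeakValueDictionary`:

```python
from typing import Dict, Set


class EventCaches:
    """Hypothetical sketch: the three caches plus a RoomID -> EventIDs index."""

    def __init__(self) -> None:
        self._get_event_cache: Dict[str, object] = {}   # EventID -> EventBase
        self._event_ref: Dict[str, object] = {}         # a WeakValueDictionary in Synapse
        self._event_metadata: Dict[str, Dict[str, object]] = {}  # RoomID -> EventID -> metadata
        self._room_to_events: Dict[str, Set[str]] = {}  # RoomID -> EventIDs held in memory

    def invalidate_event(self, room_id: str, event_id: str) -> None:
        # Redaction, expiration/censorship, (un-)rejection, re-persistence.
        self._get_event_cache.pop(event_id, None)
        self._event_ref.pop(event_id, None)
        self._event_metadata.get(room_id, {}).pop(event_id, None)
        self._room_to_events.get(room_id, set()).discard(event_id)

    def invalidate_room(self, room_id: str) -> None:
        # Full or partial room purge: drop every cached event for the room.
        for event_id in self._room_to_events.pop(room_id, set()):
            self._get_event_cache.pop(event_id, None)
            self._event_ref.pop(event_id, None)
        self._event_metadata.pop(room_id, None)
```

The index only tracks events currently held in memory, so a room purge never needs to enumerate every event the room has ever seen.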
---
One reason is that an event can go from … (see `synapse/storage/databases/main/events.py`, lines 451 to 455 at commit `2c237de`).
---
@MadLittleMods Ah right! I missed that it would be running after the txn, and thus after everything below it had run as well. And thanks for pointing out another invalidation case!

---
It does beg the question of whether we should also store outlier status in the proposed metadata cache. When we transition an event from outlier to ex-outlier (…). Perhaps we could store outlier status in …

---
If the retrieval is transparent then that sounds fine 🤷

---
@richvdh and I spoke out of band to discuss this. Out of that meeting came a couple of simple changes that will resolve #11521: #14161 and #14164. Assuming those merge, that will unblock me for #4720, which was the current motivation for this change. Those PRs don't allow correctly invalidating the event caches when purging a room, though; they just eliminate some of the symptoms. To fix the cause, we still need the ability to invalidate the event caches by room ID. Some notes on augmenting the above proposal as a result of our chat (full minutes):

In the end, we may just want to consider having a separate dict of `RoomID` -> `EventID` to fix the problems with invalidating the event caches by room ID. And then a separate amount of work after that to address multi-homeserver caching use cases.

---
What does "multiple homeservers" mean in this context (and other references like "multi-homeserver", "shared with other homeservers")? How can I share a cache with multiple homeservers? Multiple workers?

---
If you think of EMS' use case, where you have many homeservers together in a Kubernetes cluster, it'd be nice if all of those homeservers could share a single external event store, and on top of that an external event cache. That way you don't duplicate that information across every one of your hundreds of servers. It should be feasible as long as the data you store is consistent across all homeservers. Event content is; whether the event has been rejected is not necessarily. The latter is currently stored in the event caches (and now the …).

---
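To illustrate the split being described here (a hypothetical sketch, not Synapse's actual data model): the immutable event content is identical on every homeserver and could live in a shared store, while rejection status is a local judgement that must stay per-server:

```python
from typing import Dict, Optional

# Shareable across homeservers: event content is immutable and identical
# everywhere, so a single external store/cache can serve many servers.
shared_event_store: Dict[str, dict] = {
    "$event_a": {"type": "m.room.message", "content": {"body": "hi"}},
}

# Per-homeserver: rejection is a local decision and may legitimately differ
# between servers, so it cannot live in the shared cache.
local_rejected_reason: Dict[str, Optional[str]] = {
    "$event_a": None,  # this server accepted the event
}
```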
In the hopes of fixing #11521 and paving the way towards an immutable external event cache (#2123), a new architecture for the event cache is proposed.
## The current state
Currently there are a couple of separate data structures related to caching event contents in memory in Synapse (see the code for fetching an event from cache/db here):

- `EventsWorkerStore._get_event_cache` - an instance of `AsyncLruCache`, implemented as a map of `EventID` -> (`EventBase`, redacted `EventBase | None`).
- `EventsWorkerStore._event_ref` - a `WeakValueDictionary` which serves as a single point of reference for `EventBase`s in memory, ensuring that we don't end up with multiple, unnecessary copies of a single `EventBase` in memory. Entries here can outlive `EventsWorkerStore._get_event_cache`, as something else may still be processing the event, even if it's been removed from that cache.

## What's the problem?
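As a concrete, simplified picture of those two structures — the real `AsyncLruCache` is asynchronous and `EventBase` is Synapse's event class; the stand-ins below are only for illustration:

```python
import weakref
from collections import OrderedDict
from typing import Optional


class EventBase:
    """Stand-in for Synapse's EventBase."""

    def __init__(self, event_id: str, room_id: str) -> None:
        self.event_id = event_id
        self.room_id = room_id


class EventCacheEntry:
    """Mirrors the (EventBase, redacted EventBase | None) cache value."""

    def __init__(self, event: EventBase, redacted_event: Optional[EventBase]) -> None:
        self.event = event
        self.redacted_event = redacted_event


class EventsWorkerStoreSketch:
    """Synchronous sketch of _get_event_cache (LRU) plus _event_ref (weak refs)."""

    def __init__(self, max_size: int) -> None:
        self._get_event_cache: "OrderedDict[str, EventCacheEntry]" = OrderedDict()
        self._event_ref: "weakref.WeakValueDictionary[str, EventBase]" = (
            weakref.WeakValueDictionary()
        )
        self._max_size = max_size

    def put(self, entry: EventCacheEntry) -> None:
        event_id = entry.event.event_id
        self._get_event_cache[event_id] = entry
        self._get_event_cache.move_to_end(event_id)
        self._event_ref[event_id] = entry.event
        if len(self._get_event_cache) > self._max_size:
            # LRU eviction: the EventBase survives in _event_ref for as long
            # as anything else still holds a reference to it.
            self._get_event_cache.popitem(last=False)
```

Note that an event evicted from the LRU can still be reachable via `_event_ref` while some other code path holds it — which is exactly why both structures must be invalidated on a purge.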
See #11521; because each of these caches is keyed by `EventID` alone, it becomes tricky to invalidate them when all you have is a `RoomID` (i.e. when purging a room completely). We could query all known events for a room from the database, but that may return millions of events. Ideally we'd have some map of `RoomID` -> `EventID` which only covers the events that are actually currently held in memory. We could then use that to invalidate all three of these caches.
Additionally, as the `_get_event_cache` contains mutable `EventCacheEntry`s (comprised of `EventBase, redacted EventBase | None`), invalidating them is necessary when an event is either redacted or marked as rejected. Both can differ per homeserver, so removing this mutable component from the cache entries opens up avenues for multiple homeservers to share the same, immutable event cache.

## Proposal
After speaking with @erikjohnston, we (mostly Erik :) came up with the following idea:
- `EventsWorkerStore._get_event_cache` would simply become a map of `EventID` -> `EventBase`.
- A new metadata cache would map `RoomID` -> `EventID` -> `{rejected_status: bool, redacted_event_content: Optional[EventBase]}`.
- The metadata cache would be left untouched when an entry in `EventsWorkerStore._get_event_cache` is invalidated due to hitting the cache size.

The beauty of this is that we no longer need to invalidate the `_get_event_cache` at all (unless the size limit is hit)! Even in the room purge use case! How? Here are some examples of using this system:

### Fetch `EventID` A which is not in-memory
The event is not in the `_get_event_cache` (nor other caches), so we query the database. The event and related metadata are fetched from the DB (`event_json`, `redactions`, `rejections`) and both the `_get_event_cache` and the event metadata cache are populated.

### Fetch `EventID` A which is in-memory
The event is in the `_get_event_cache`, and presumably the metadata cache. We take the `RoomID` from the `EventBase` in the `_get_event_cache` and query the event metadata cache.

### Fetch `EventID` A which is in-memory but the event has been purged
The event is in the `_get_event_cache`, and presumably the metadata cache. We take the `RoomID` from the `EventBase` in the cache and query the event metadata cache - but uh oh, there's no matching entry in the metadata cache! The event must have been purged.

Thus when purging a room, we only need to purge entries in the metadata cache (which we can easily do by `RoomID` thanks to the metadata cache's structure). Entries in the `get_event_cache` and `event_ref` will be invalidated as they are fetched.

I'm curious for thoughts on whether this sounds reasonable from other members of the Synapse team + cc @Fizzadar.