gappy syncs appear to be slower than you might expect #10786
Comments
I don't really see an obvious step change across all synchrotrons in the last 50 days on matrix.org. I do however see one synchrotron hitting 100% CPU, before it relaxes and another one jumps to 100% CPU. It looks as if one account is causing one synchrotron to become slow. However, people have mentioned incremental syncs becoming slow across more homeservers than just matrix.org. So I think this is just a red herring (although one still worth investigating separately).
the timeframe of change could easily be >50 days here.
Result of the discussion is that it would be helpful to have detailed client logs of the slow syncs. Separately, I see a few mentions in features/bugfixes over the past few Synapse releases regarding sync performance.
It might be worth profiling some of my catchup syncs via jaeger (which I think is enabled for my account). For instance, I just got a great instance of this on iOS after an hour of being offline - see the linked rageshake. Timezones are UTC+1. I closed the app at …
...which completes after 36 seconds?!
So, let's go marry this up with jaeger and see what on earth it was doing. My phone should be NTP synced, it was on excellent connectivity, and we have the …
From a quick look at the trace, 30s of the 36s is spent in db.get_users_in_room, which in turn is spending a bunch of time in the DB. Assuming the calling pattern of db.get_users_in_room hasn't changed, this feels like the data might just not be cached on the synchrotron... and it seems my account is moving around a lot between synchrotrons. Grepping for …
Huh, the DB requests which are chewing all the time are simply SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
...which takes 231ms?! (looking at https://jaeger.int.matrix.org/trace/9c7d5a56eb8ced14?uiFind=2c9e84f049e51f33). Perhaps this is just a DB indexing problem. It shouldn't take 231ms to pull a single state_key out of the current_state_events table, surely? EDIT: oh, it must be 231ms because the room has a bazillion users and it's just got to churn through a lot of records to get them.
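One way to sanity-check the indexing theory would be to run that exact query through EXPLAIN ANALYZE. A minimal sketch, assuming a Postgres database with the Synapse schema and psycopg2 available; the DSN and room ID below are placeholders, not real matrix.org values:

```python
# Hypothetical check of whether the per-room membership query is index-backed.
# The connection string and room_id are made-up placeholders.
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        EXPLAIN ANALYZE
        SELECT state_key FROM current_state_events
        WHERE type = 'm.room.member' AND room_id = %s AND membership = %s
        """,
        ("!someroom:example.org", "join"),
    )
    for (plan_line,) in cur.fetchall():
        print(plan_line)
conn.close()
```

A plan showing an index scan whose runtime scales with the number of matching membership rows would support the "bazillion users" explanation rather than a missing index.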
I think the 231ms is just a particularly bad one, but there are still 6328 calls to get_users_in_room, roughly twice for every room on my account, which seems completely crazy. I also can't see any part of the sync handler which actually leans hard on it; it's only ever called by … So I wonder if there's been a terrible regression in calculating newly joined/left rooms...
grafana confirms that it's the get_users_in_room queries, so something is indeed going nuts and making these syncs slow.

i've had a quick look in the code and tried to reproduce against a local non-worker synapse, but it looks like it needs something more than just a gappy sync to repro. I also tried adding in some join/parts from other users during the gap, but that didn't trigger it either (they correctly turned up on the device_list, and newly_joined_rooms and newly_parted_rooms look to have been blank). Perhaps you have to join/part something yourself during the gap.

It's worth noting that I /think/ I started noticing this around July, which is roughly when knocking landed, which made a bunch of changes around the newly_joined code paths?

Anyway, someone in backendland should probably pick this up, especially if it's causing a ~6x slowdown (30s of 36s doing unnecessary work).
bah, couldn't stop thinking about this. Reproduced it after disabling caching on my local server (which is why I couldn't see this before). I wonder whether the thing which has exposed the bug is the 30 minute expiration on caches on matrix.org (as well as the lack of stickiness). Repro steps: … The resulting sync looks like this:
```
2021-09-10 01:49:22,846 - synapse.util.metrics - 137 - DEBUG - GET-145 - Entering block _generate_sync_entry_for_device_list
2021-09-10 01:49:22,847 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_rooms_for_user_with_stream_ordering-62b}
2021-09-10 01:49:22,847 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_rooms_for_user_with_stream_ordering-62b} SELECT room_id, e.instance_name, e.stream_ordering FROM current_state_events AS c INNER JOIN events AS e USING (room_id, event_id) WHERE c.type = 'm.room.member' AND state_key = ? AND c.membership = ?
2021-09-10 01:49:22,848 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_rooms_for_user_with_stream_ordering-62b} ('@matthew:bellerophon', 'join')
2021-09-10 01:49:22,849 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_rooms_for_user_with_stream_ordering-62b} 0.000063 sec
2021-09-10 01:49:22,849 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_rooms_for_user_with_stream_ordering-62b} 0.002834 sec
2021-09-10 01:49:22,851 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62c}
2021-09-10 01:49:22,851 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62c} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,852 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62c} ('!dprKjMDJTQbVbutttb:bellerophon', 'join')
2021-09-10 01:49:22,853 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62c} 0.000041 sec
2021-09-10 01:49:22,853 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62c} 0.002855 sec
2021-09-10 01:49:22,854 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62d}
2021-09-10 01:49:22,855 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62d} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,856 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62d} ('!mQSqavcZeWYZncQwJh:bellerophon', 'join')
2021-09-10 01:49:22,857 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62d} 0.000053 sec
2021-09-10 01:49:22,857 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62d} 0.002847 sec
2021-09-10 01:49:22,858 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62e}
2021-09-10 01:49:22,859 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62e} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,860 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62e} ('!JjMuBUGyRFIHpUvqBp:bellerophon', 'join')
2021-09-10 01:49:22,860 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62e} 0.000034 sec
2021-09-10 01:49:22,861 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62e} 0.002630 sec
2021-09-10 01:49:22,862 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-62f}
2021-09-10 01:49:22,862 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-62f} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,863 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-62f} ('!nLFjWkXmrEScbXhLmI:bellerophon', 'join')
2021-09-10 01:49:22,864 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-62f} 0.000030 sec
2021-09-10 01:49:22,864 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-62f} 0.002385 sec
2021-09-10 01:49:22,865 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-630}
2021-09-10 01:49:22,866 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-630} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,867 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-630} ('!dAHSpVBCjyiitQGwRn:bellerophon', 'join')
2021-09-10 01:49:22,868 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-630} 0.000035 sec
2021-09-10 01:49:22,868 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-630} 0.003061 sec
2021-09-10 01:49:22,869 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-631}
2021-09-10 01:49:22,870 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-631} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,870 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-631} ('!DKgHddIbuazlkglUdu:bellerophon', 'join')
2021-09-10 01:49:22,871 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-631} 0.000030 sec
2021-09-10 01:49:22,871 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-631} 0.002362 sec
2021-09-10 01:49:22,872 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-632}
2021-09-10 01:49:22,873 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-632} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,874 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-632} ('!aTKzWEOvZRvouBktlN:bellerophon', 'join')
2021-09-10 01:49:22,874 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-632} 0.000030 sec
2021-09-10 01:49:22,875 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-632} 0.002457 sec
2021-09-10 01:49:22,876 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-633}
2021-09-10 01:49:22,876 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-633} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,877 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-633} ('!tvslgMmeyHGMpWGGIX:bellerophon', 'join')
2021-09-10 01:49:22,877 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-633} 0.000031 sec
2021-09-10 01:49:22,878 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-633} 0.002365 sec
2021-09-10 01:49:22,879 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-634}
2021-09-10 01:49:22,879 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-634} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,880 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-634} ('!FbTgWTlLSDXqPspmzq:bellerophon', 'join')
2021-09-10 01:49:22,881 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-634} 0.000076 sec
2021-09-10 01:49:22,882 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-634} 0.003240 sec
2021-09-10 01:49:22,883 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-635}
2021-09-10 01:49:22,884 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-635} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,884 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-635} ('!EgeuEnoxEjtDFiCGyq:bellerophon', 'join')
2021-09-10 01:49:22,885 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-635} 0.000033 sec
2021-09-10 01:49:22,885 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-635} 0.002419 sec
2021-09-10 01:49:22,886 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-636}
2021-09-10 01:49:22,887 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-636} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,887 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-636} ('!IiIuPTaEyjYtlzVpWs:bellerophon', 'join')
2021-09-10 01:49:22,888 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-636} 0.000031 sec
2021-09-10 01:49:22,889 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-636} 0.002365 sec
2021-09-10 01:49:22,890 - synapse.storage.txn - 535 - DEBUG - GET-145 - [TXN START] {get_users_in_room-637}
2021-09-10 01:49:22,890 - synapse.storage.SQL - 311 - DEBUG - GET-145 - [SQL] {get_users_in_room-637} SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = ? AND membership = ?
2021-09-10 01:49:22,891 - synapse.storage.SQL - 316 - DEBUG - GET-145 - [SQL values] {get_users_in_room-637} ('!mQSqavcZeWYZncQwJh:bellerophon', 'join')
2021-09-10 01:49:22,891 - synapse.storage.SQL - 337 - DEBUG - GET-145 - [SQL time] {get_users_in_room-637} 0.000030 sec
2021-09-10 01:49:22,892 - synapse.storage.txn - 638 - DEBUG - GET-145 - [TXN END] {get_users_in_room-637} 0.002469 sec
2021-09-10 01:49:22,893 - synapse.util.metrics - 145 - DEBUG - GET-145 - Exiting block _generate_sync_entry_for_device_list
```
Right, got to the bottom of it. The thing which is hammering get_users_in_room is _generate_sync_entry_for_device_list (as the logs above show): on an incremental sync it ends up calling it for every room on the account to work out who to notify about device list changes. So, I think this has become way heavier because we expire caches after 30 minutes, hence making things feel bad. A better solution might be to calculate the users who need to be told about devicelist changes whenever a devicelist change happens, rather than checking every possible user every time we call /sync.
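Not Synapse's actual implementation, but a minimal sketch of that "calculate at change time" idea: when a user's device list changes, work out the interested users once and store the result, so /sync only has to read it back. The store object and every name below are hypothetical stand-ins for the storage layer:

```python
# Hypothetical sketch: fan out device-list changes at write time instead of
# recomputing shared-room membership for every room on every /sync.
from dataclasses import dataclass, field

@dataclass
class DeviceListChange:
    stream_id: int
    changed_user: str
    interested_users: set[str] = field(default_factory=set)

CHANGES: list[DeviceListChange] = []

def on_device_list_change(stream_id: int, changed_user: str, store) -> None:
    # Write path, run once when changed_user updates their device list:
    # one get_users_in_room per room the *changed* user is in.
    interested: set[str] = set()
    for room_id in store.get_rooms_for_user(changed_user):
        interested.update(store.get_users_in_room(room_id))
    CHANGES.append(DeviceListChange(stream_id, changed_user, interested))

def changed_device_lists_for_sync(syncing_user: str, since_stream_id: int) -> set[str]:
    # Read path at /sync time: no per-room membership queries at all.
    return {
        c.changed_user
        for c in CHANGES
        if c.stream_id > since_stream_id and syncing_user in c.interested_users
    }
```

The cost of enumerating room members is then paid once per device-list change, rather than once per room for every syncing user on every incremental sync.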
Another storage method in the sync process, …, looks to have a similar problem.
yup, there are other bad ones too - i just picked the worst (30s in get_users_in_room).
Annoyingly …
presumably one could cache the per-room results, so that when one room's membership changes we only re-query that room, rather than hitting all the per-room queries again?
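A minimal sketch of that per-room caching idea (all names here are hypothetical; Synapse's real caches are rather more involved):

```python
# Hypothetical per-room cache: one entry per room_id, so a membership change
# in one room only invalidates that room's entry rather than everything.
class PerRoomMembershipCache:
    def __init__(self, fetch_users_in_room):
        self._fetch = fetch_users_in_room  # e.g. the underlying DB query
        self._entries: dict[str, list[str]] = {}

    def get_users_in_room(self, room_id: str) -> list[str]:
        if room_id not in self._entries:
            # Only hit the DB on a miss for this specific room.
            self._entries[room_id] = self._fetch(room_id)
        return self._entries[room_id]

    def on_membership_change(self, room_id: str) -> None:
        # Drop only the changed room; the other few thousand rooms stay cached.
        self._entries.pop(room_id, None)
```

This is essentially what a per-room-keyed cache on get_users_in_room gives you, provided the entries are not also being evicted on a timer.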
As a temporary patch, let's try exempting just the relevant cache from time-based eviction and see if that improves things on matrix.org. Assigning to Erik (who knows the code) and David (current maintainer).
Cache entries on get_users_in_room are also invalidated whenever the room's membership changes. So turning off time-based eviction might help, but only to a certain extent (because other people joining will nuke synapse's cache entries).
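A toy model (hypothetical names, not Synapse's cache code) of the two eviction paths, showing why exempting the cache from the timer only partially helps: the membership-change path still empties the entries for busy rooms.

```python
import time

# Toy model: a cache entry can be evicted by a TTL sweep *or* by an explicit
# invalidation when the room's membership changes.
class CacheEntry:
    def __init__(self, value, ttl_seconds=None):
        self.value = value
        self.created_at = time.monotonic()
        self.ttl_seconds = ttl_seconds  # None means exempt from time-based eviction

    def expired(self) -> bool:
        if self.ttl_seconds is None:
            return False
        return time.monotonic() - self.created_at > self.ttl_seconds

users_in_room_cache: dict[str, CacheEntry] = {}

def on_membership_change(room_id: str) -> None:
    # This path still drops the entry even when ttl_seconds is None, which is
    # why busy rooms stay cold regardless of the eviction timer.
    users_in_room_cache.pop(room_id, None)
```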
Note: expiry time is currently 1h on matrix.org.
If cache invalidation for …
For completeness/context, I had in mind the use of synapse/storage/databases/main/roommember.py, lines 547 to 563 at 01c88a0.
Another potential solution here is to record which rooms need to be told about a device list change, which saves us from ever having to call get_users_in_room in the sync path.
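A rough sketch of that bookkeeping, with made-up names rather than Synapse's actual schema: persist "device list changed in room X at stream position N" at write time, then answer /sync by intersecting with the syncing user's own rooms.

```python
from collections import defaultdict

# Hypothetical bookkeeping: room_id -> list of (stream_id, user whose devices changed).
changes_by_room: dict[str, list[tuple[int, str]]] = defaultdict(list)

def record_device_list_change(stream_id: int, changed_user: str, joined_rooms: list[str]) -> None:
    # Write path: one record per room the changed user is in, done once per change.
    for room_id in joined_rooms:
        changes_by_room[room_id].append((stream_id, changed_user))

def changed_users_for_sync(my_rooms: list[str], since_stream_id: int) -> set[str]:
    # Read path: no get_users_in_room calls; we only consult rooms the syncing
    # user is already known to be in.
    return {
        user
        for room_id in my_rooms
        for stream_id, user in changes_by_room.get(room_id, [])
        if stream_id > since_stream_id
    }
```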
anyone remember what we're doing about this? it seems like the subject needs an update.
The decision was to let this sit where it is for the time being: we don't have the bandwidth to dig deeper, and any investment will likely be rendered moot by Sync V3 next year.

If Sync V3 starts looking like it won't deliver its anticipated improvements, or its delivery date gets pushed too far out, then we'll re-prioritize this work. We'd similarly re-prioritize if the problem becomes significantly worse in the interim.
I have this same problem, but I developed #9798 for it and applied it to my own server. However, I soon dropped it because it required a bunch of changes in sytest, which I wasn't qualified to do.
I've updated the subject as best I can.
Just experienced this with a 6-month-old Element session (which doesn't handle it well: element-hq/element-web#23637, element-hq/element-web#9198). Some of my … The trace is https://jaeger.proxy.matrix.org/trace/657fa919e14b7196 for those that catch it in the few days before it expires. Otherwise you can drag and drop this … The rest of the long …
empirically my iOS app takes aaaages to incremental sync in the morning or after any significant time spent offline. manu and others have spotted it too. need to check average incr sync times to see when the step change was?