This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync requests immediately return with empty payload #8518

Closed
KitsuneRal opened this issue Oct 11, 2020 · 8 comments · Fixed by #9358 or #10158
Labels
A-Performance: Performance, both client-facing and admin-facing
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@KitsuneRal
Member

KitsuneRal commented Oct 11, 2020

Description

Occasionally, without any specific action on my end, a normal long-polling sync loop turns into a storm of syncs: each successive request sent from Quotient returns immediately (30-40 ms) with a JSON object that has the full sync response structure (rooms with invite/join/leave inside, account_data, to_device, etc.) and a reasonable-looking next_batch, but with empty objects (apparently because no events arrived since the last sync). Usually this lasts from 30 seconds to 2 minutes, after which requests are once again held by the server for the normal duration (seconds).
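
For illustration, such an empty-but-structurally-complete response might look roughly like this (the token is the example quoted below; the exact field set is a sketch of the shape described above, not a verbatim capture):

{
  "next_batch": "s1589639296_757284957_14250253_610667103_445506552_1585308_106571744_371537760_134971",
  "rooms": {"invite": {}, "join": {}, "leave": {}},
  "account_data": {"events": []},
  "to_device": {"events": []},
  "presence": {"events": []}
}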

Steps to reproduce

I couldn't identify a particular pattern. For reference, it occurred between 10:28 and 10:29 CEST with sync coming from my user (@kitsune), with next_batch at 10:29:07 looking like "s1589639296_757284957_14250253_610667103_445506552_1585308_106571744_371537760_134971"

  • Homeserver: matrix.org
@richvdh
Member

richvdh commented Oct 12, 2020

@KitsuneRal any idea when you first saw this? in the last couple of weeks, or longer?

@anoadragon453
Member

anoadragon453 commented Oct 12, 2020

We can see this happening clearly in the server logs for roughly 1m04s at the mentioned time. We're not entirely sure if this is a regression in v1.21.0 or not.

Here are the logs from the server side: https://gist.github.com/anoadragon453/4a54bbe3388cdc9f4d9c05feb1c028d2. We can see that requests for the same access token do indeed return immediately, even when a 30000 ms timeout is provided.

My hunch is that somehow this condition is True, meaning we don't wait for a new event before returning:

if timeout == 0 or since_token is None or full_state:
    # we are going to return immediately, so don't bother calling
    # notifier.wait_for_events.
    result = await self.current_sync_for_user(
        sync_config, since_token, full_state=full_state
    )

The parameters to the request look fine, so one of timeout, since or full_state is presumably being changed somewhere internally.

It would be helpful to add some debug logging here to help track down what exactly is going wrong, although be aware this is a fairly hot path.
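
For instance, a sketch of such logging placed just before the branch above (assuming the module-level logger that synapse handlers conventionally define):

# Log the values that control the early-return branch, so we can see
# which of them is unexpectedly zero/None/True in production:
logger.debug(
    "sync early-return check: timeout=%r since_token=%r full_state=%r",
    timeout,
    since_token,
    full_state,
)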

@anoadragon453 anoadragon453 added z-bug (Deprecated Label) z-p2 (Deprecated Label) A-Performance Performance, both client-facing and admin-facing labels Oct 12, 2020
@KitsuneRal
Member Author

@KitsuneRal any idea when you first saw this? in the last couple of weeks, or longer?

Yes, something like a couple of weeks ago.

@richvdh
Member

richvdh commented Nov 11, 2020

This seems fairly easy to reproduce with any user who sees no activity in their rooms during the sync timeout: the first /sync times out, returning the same sync token in next_batch as was passed in since. Subsequent requests (again using that same sync token) hit the cache and return immediately.

Look at the logs from that first /sync request:

2020-11-11 18:46:19,812 - synapse.access.http.8008 - 258 - DEBUG - GET-26414 - ::ffff:127.0.0.1 - 8008 - Received request: GET /_matrix/client/r0/sync?filter=1&timeout=30000&since=s206_4657_8_25_111_5_1_48_1
2020-11-11 18:46:19,813 - synapse.rest.client.v2_alpha.sync - 119 - DEBUG - GET-26414 - /sync: user=@richvdh:fred.sw1v.org, timeout=30000, since='s206_4657_8_25_111_5_1_48_1', set_presence='online', filter_id='1', device_id='RLLEKDLCTE'
2020-11-11 18:46:19,814 - synapse.util.caches.response_cache - 154 - DEBUG - GET-26414 - [sync]: no cached result for [(@richvdh:fred.sw1v.org, 30000, 's206_4657_8_25_111_5_1_48_1', '1', False, 'RLLEKDLCTE')], calculating new one
2020-11-11 18:46:19,814 - synapse.handlers.sync - 969 - DEBUG - GET-26414 - Calculating sync response for @richvdh:fred.sw1v.org between StreamToken(room_key=RoomStreamToken(topological=None, stream=206, instance_map={}), presence_key=4657, typing_key=8, receipt_key=25, account_data_key=111, push_rules_key=5, to_device_key=1, device_list_key=48, groups_key=1) and StreamToken(room_key=RoomStreamToken(topological=None, stream=206, instance_map={}), presence_key=4657, typing_key=8, receipt_key=25, account_data_key=111, push_rules_key=5, to_device_key=1, device_list_key=48, groups_key=1)
2020-11-11 18:46:19,814 - synapse.handlers.sync - 990 - DEBUG - GET-26414 - Fetching account data
2020-11-11 18:46:19,815 - synapse.handlers.sync - 996 - DEBUG - GET-26414 - Fetching room data
2020-11-11 18:46:19,815 - synapse.handlers.sync - 1392 - DEBUG - GET-26414 - no-oping sync
2020-11-11 18:46:19,815 - synapse.handlers.sync - 1008 - DEBUG - GET-26414 - Fetching presence data
2020-11-11 18:46:19,815 - synapse.handlers.presence - 1024 - DEBUG - GET-26414 - Invoked 'get_new_events' with args: self=<synapse.handlers.presence.PresenceEventSource obj..., kwargs={'is_guest': 0}, user=@richvdh:fred.sw1v.org, from_key=4657, include_offline=True, room_ids=None, explicit_room_id=None
2020-11-11 18:46:19,815 - synapse.handlers.sync - 1013 - DEBUG - GET-26414 - Fetching to-device data
2020-11-11 18:46:19,816 - synapse.handlers.sync - 1024 - DEBUG - GET-26414 - Fetching OTK data
2020-11-11 18:46:19,816 - synapse.handlers.sync - 1036 - DEBUG - GET-26414 - Fetching group data
2020-11-11 18:46:19,816 - synapse.handlers.sync - 1047 - DEBUG - GET-26414 - Sync response calculation complete
2020-11-11 18:46:49,815 - synapse.handlers.sync - 287 - DEBUG - GET-26414 - Returning sync response for @richvdh:fred.sw1v.org
2020-11-11 18:46:49,815 - synapse.rest.client.v2_alpha.sync - 195 - DEBUG - GET-26414 - Formatting events in sync response
2020-11-11 18:46:49,816 - synapse.rest.client.v2_alpha.sync - 223 - DEBUG - GET-26414 - building sync response dict
2020-11-11 18:46:49,816 - synapse.rest.client.v2_alpha.sync - 191 - DEBUG - GET-26414 - Event formatting complete
2020-11-11 18:46:49,816 - synapse.access.http.8008 - 331 - INFO - GET-26414 - ::ffff:127.0.0.1 - 8008 - {@richvdh:fred.sw1v.org} Processed request: 30.004sec/-0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 362B 200 "GET /_matrix/client/r0/sync?filter=1&timeout=30000&since=s206_4657_8_25_111_5_1_48_1 HTTP/1.1" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0" [0 dbevts]

It builds the empty response as soon as the endpoint is called, rather than waiting for the timeout. It then waits for 30 seconds and returns that pre-built empty response, so even though there has been traffic in other rooms in the meantime, the sync token is not advanced.

Compare with the debug for an empty sync response from synapse 1.20:

2020-11-11 19:00:00,993 - synapse.access.http.8008 - 246 - DEBUG - GET-17 - ::ffff:127.0.0.1 - 8008 - Received request: GET /_matrix/client/r0/sync?filter=1&timeout=30000&since=s206_4684_0_25_111_5_1_48_1
2020-11-11 19:00:00,993 - synapse.rest.client.v2_alpha.sync - 118 - DEBUG - GET-17 - /sync: user=@richvdh:fred.sw1v.org, timeout=30000, since='s206_4684_0_25_111_5_1_48_1', set_presence='online', filter_id='1', device_id='RLLEKDLCTE'
2020-11-11 19:00:00,994 - synapse.util.caches.response_cache - 148 - DEBUG - GET-17 - [sync]: no cached result for [(@richvdh:fred.sw1v.org, 30000, 's206_4684_0_25_111_5_1_48_1', '1', False, 'RLLEKDLCTE')], calculating new one
2020-11-11 19:00:30,996 - synapse.handlers.sync - 973 - DEBUG - GET-17 - Calculating sync response for @richvdh:fred.sw1v.org between StreamToken(room_key='s206', presence_key='4684', typing_key='0', receipt_key='25', account_data_key='111', push_rules_key='5', to_device_key='1', device_list_key='48', groups_key='1') and StreamToken(room_key='s206', presence_key=4684, typing_key=0, receipt_key=25, account_data_key=111, push_rules_key=5, to_device_key=1, device_list_key=48, groups_key=1)
2020-11-11 19:00:30,997 - synapse.handlers.sync - 994 - DEBUG - GET-17 - Fetching account data
2020-11-11 19:00:30,997 - synapse.handlers.sync - 1000 - DEBUG - GET-17 - Fetching room data
2020-11-11 19:00:30,998 - synapse.handlers.sync - 1391 - DEBUG - GET-17 - no-oping sync
2020-11-11 19:00:30,998 - synapse.handlers.sync - 1012 - DEBUG - GET-17 - Fetching presence data
2020-11-11 19:00:30,999 - synapse.handlers.sync - 1017 - DEBUG - GET-17 - Fetching to-device data
2020-11-11 19:00:30,999 - synapse.handlers.sync - 1028 - DEBUG - GET-17 - Fetching OTK data
2020-11-11 19:00:30,999 - synapse.handlers.sync - 1036 - DEBUG - GET-17 - Fetching group data
2020-11-11 19:00:31,000 - synapse.handlers.sync - 1047 - DEBUG - GET-17 - Sync response calculation complete
2020-11-11 19:00:31,000 - synapse.handlers.sync - 296 - DEBUG - GET-17 - Returning sync response for @richvdh:fred.sw1v.org
2020-11-11 19:00:31,001 - synapse.rest.client.v2_alpha.sync - 195 - DEBUG - GET-17 - Formatting events in sync response
2020-11-11 19:00:31,001 - synapse.rest.client.v2_alpha.sync - 223 - DEBUG - GET-17 - building sync response dict
2020-11-11 19:00:31,001 - synapse.rest.client.v2_alpha.sync - 191 - DEBUG - GET-17 - Event formatting complete
2020-11-11 19:00:31,002 - synapse.access.http.8008 - 311 - INFO - GET-17 - ::ffff:127.0.0.1 - 8008 - {@richvdh:fred.sw1v.org} Processed request: 30.009sec/-0.000sec (0.006sec, 0.000sec) (0.000sec/0.000sec/0) 305B 200 "GET /_matrix/client/r0/sync?filter=1&timeout=30000&since=s206_4684_0_25_111_5_1_48_1 HTTP/1.1" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0" [0 dbevts]

... it waits for 30 seconds before calling the synapse.handlers.sync code.

Presumably this means that the timeout at https://github.com/matrix-org/synapse/blob/v1.22.0/synapse/notifier.py#L437-L444 is not working correctly.

@richvdh
Member

richvdh commented Nov 11, 2020

Note that it only happens on matrix-org-hotfixes, which has a non-zero SYNC_RESPONSE_CACHE_MS. Presumably that implies that, even before that regression was introduced, you could end up in a similar situation on an idle homeserver with a non-zero SYNC_RESPONSE_CACHE_MS.

@ShadowJonathan
Contributor

ShadowJonathan commented Feb 9, 2021

Preliminary observations: it looks like prev_key == current_key. Is it possible that a timed-out response (where nothing happened for 30 seconds) got cached under prev_key, and that cached response, which carries next_key == prev_key, keeps being returned, thus creating a loop? If so, maybe we should have a way of marking a response as "don't cache" in that case (if the sync result is due to a timeout and, as a sanity check, prev_key == next_key)?

Edit: Just read everything, yeah, it looks like that's the case.

anoadragon453 pushed a commit that referenced this issue Feb 24, 2021
This fixes #8518 by adding a conditional check on `SyncResult`: as a sanity check, a function tests whether `prev_stream_token == current_stream_token`. In `CachedResponse.set.<remove>()`, the result is immediately popped from the cache if the conditional function returns false.

This prevents the caching of a timed-out `SyncResult` (one whose `next_key` is the same stream key that produced it). The cache can then no longer return a `SyncResult` that makes the client request the same stream key over and over again, which would leave it stuck in a loop of requesting and immediately receiving a response for as long as the cache kept those values.

Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>
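
A minimal sketch of the approach this commit describes (the method shape, the `pending_result_cache` name and the `conditional` parameter are simplified assumptions based on the description above, not the merged code; `deferred` is a twisted Deferred):

def set(self, key, deferred, conditional):
    self.pending_result_cache[key] = deferred

    def remove(result):
        # Sanity check: if the conditional rejects the result (e.g. a
        # timed-out SyncResult whose next_batch equals the since token that
        # produced it), pop it straight back out instead of keeping it.
        if not conditional(result):
            self.pending_result_cache.pop(key, None)
        return result

    deferred.addCallback(remove)
    return deferred
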
clokep added a commit that referenced this issue Mar 2, 2021

This reverts commit f5c93fc.

This is being backed out due to a regression (#9507) and additional
review feedback being provided.
@clokep
Member

clokep commented Mar 2, 2021

Re-opening as the fix has been backed out, see #9358 (comment).

@clokep clokep reopened this Mar 2, 2021
@clokep clokep added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed z-bug (Deprecated Label) z-p2 (Deprecated Label) labels Mar 2, 2021
@richvdh
Member

richvdh commented Jun 9, 2021

To summarise what was said before: the proximate cause of this is current_sync_for_user returning a result where next_batch is the same as the since param. That implies that the client will repeat an identical request next time, and hence get back the cached result they just saw.

An obvious instance of this would happen on a completely idle homeserver: if there is nothing going on, then there is nothing to advance the tokens.
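
A minimal sketch of the resulting client-side loop (Python with requests; the homeserver, access token and starting token values are placeholders):

import requests

homeserver = "https://matrix.example"  # placeholder
access_token = "MDAx..."               # placeholder
since = "s206_4657_8_25_111_5_1_48_1"  # token from a previous sync

while True:
    resp = requests.get(
        homeserver + "/_matrix/client/r0/sync",
        params={"since": since, "timeout": 30000},
        headers={"Authorization": "Bearer " + access_token},
    ).json()
    # If next_batch == since, the next request is identical to this one,
    # hits the response cache, and returns immediately: a sync storm.
    since = resp["next_batch"]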


However, that doesn't help explain what's wrong with matrix.org.

I think this was significantly exacerbated in v1.22.0 by #8439, in particular https://github.com/matrix-org/synapse/pull/8439/files#diff-d5643d4a94ee928bdd677a4ffbfbf3c571bc0a9f84d00d92dde31d91a6d53533R379. Setting cmp=False on RoomStreamToken means that no two stream tokens ever compare for equality. That means that _NotifierUserStream.new_listener always returns immediately on its first call, which is why we see the behaviour above with current_sync_for_user being called at the start of the /sync call, rather than after the timeout.
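
A minimal illustration of the cmp=False effect (fields simplified; #8439 used the older cmp= spelling of what attrs now calls eq=):

import attr

@attr.s(slots=True, frozen=True, eq=False)
class RoomStreamToken:
    topological = attr.ib()
    stream = attr.ib()

a = RoomStreamToken(None, 206)
b = RoomStreamToken(None, 206)
assert a != b  # no generated __eq__, so comparison falls back to object identity

# Hence any "has the stream moved on?" check of the form
#     if current_token != prev_token: fire the listener immediately
# is always true, even when nothing has happened since prev_token.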

However, this report predates that change. It's possible that this change to _NotifierUserStream.new_listener in #8423 (v1.21.0) also plays a part, though I'm not really sure current_sync_for_user would return next_batch==since in that case.


Anyway: we need to fix the caching behaviour in any case, to cover the idle homeserver scenario. I think we should also fix RoomStreamToken so that it correctly compares for equality; hopefully that will also bring performance improvements.

richvdh added a commit that referenced this issue Jun 14, 2021
…0157)

This is the first of two PRs which seek to address #8518. This first PR lays the groundwork by extending ResponseCache; a second PR (#10158) will update the SyncHandler to actually use it, and fix the bug.

The idea here is that we allow the callback given to ResponseCache.wrap to decide whether its result should be cached or not. We do that by (optionally) passing a ResponseCacheContext into it, which it can modify.
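
A hedged sketch of that mechanism (attribute and method shapes inferred from this description, not necessarily the merged API; the real implementation caches in-flight deferreds rather than plain values):

import attr

@attr.s(auto_attribs=True)
class ResponseCacheContext:
    cache_key: object
    # The wrapped callback can flip this to False to veto caching its result.
    should_cache: bool = True

class ResponseCache:
    def __init__(self):
        self._cache = {}

    async def wrap(self, key, callback, *args, cache_context=False, **kwargs):
        if key in self._cache:
            return self._cache[key]
        context = ResponseCacheContext(cache_key=key)
        if cache_context:
            kwargs["cache_context"] = context
        result = await callback(*args, **kwargs)
        if context.should_cache:
            self._cache[key] = result
        return result
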
richvdh added a commit that referenced this issue Jun 17, 2021

Fixes #8518 by telling the ResponseCache not to cache the /sync response if the next_batch param is the same as the since token.
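
Presumably the handler-side check looks something like this sketch (simplified from the commit description; parameter and attribute names are assumptions):

async def _wait_for_sync_for_user(
    self, sync_config, since_token, timeout, full_state, cache_context
):
    result = await self.current_sync_for_user(
        sync_config, since_token, full_state=full_state
    )
    # An empty sync hands the client back the very token it sent us; caching
    # that response would serve it straight back on the next, identical
    # request, producing the storm described in this issue.
    if since_token and result.next_batch == since_token:
        cache_context.should_cache = False
    return result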