Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Push notifications stop working when a history purge deletes the event a read receipt is pointing at #12880

Closed
Taraslysak opened this issue May 26, 2022 · 9 comments
Labels
A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) A-Push Issues related to push/notifications O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Taraslysak
Copy link

Taraslysak commented May 26, 2022

Description

We have a matrix homeserver which is configured without using additional workers. When the default message retention policy is on, and after the purge job took place, users stop to receive push notifications until they enter a room and send a message

Steps to reproduce

  • setup matrix homeserver from synapse docker container;
  • turn on message retention policy inhemoserver.yaml file:
  # The message retention policies feature is disabled by default. Uncomment the
  # following line to enable it.
  #
  enabled: true

  # Default retention policy. If set, Synapse will apply it to rooms that lack the
  # 'm.room.retention' state event. Currently, the value of 'min_lifetime' doesn't
  # matter much because Synapse doesn't take it into account yet.
  #
  default_policy:
    min_lifetime: 1m
    max_lifetime: 2d
  • write encrypted messages from user A to user B in room R. Receive push notifications as expected. Server logs for 'normal' behaviour:
synapse_1         | 2022-05-26 07:00:53,213 - synapse.push.httppusher - 192 - INFO - httppush.process-15 - Processing 1 unprocessed push actions for @user_b:my.homeserver.com}/com.chatApp/AAAA:XXXX starting at stream_ordering 24659
synapse_1         | 2022-05-26 07:00:53,222 - synapse.access.http.8008 - 427 - INFO - GET-2025 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 7.244sec/0.001sec (0.008sec, 0.002sec) (0.006sec/0.010sec/3) 1352B 200 "GET /_matrix/client/r0/sync?since=s24659_46789_0_25324_1_194_8725_1775_1&timeout=45000 HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 07:00:53,230 - synapse.access.http.8008 - 427 - INFO - GET-2026 - ::ffff:172.31.0.4 - 8008 - {@suer_b:homeserver.com} Processed request: 7.136sec/0.001sec (0.009sec, 0.001sec) (0.011sec/0.016sec/5) 1296B 200 "GET /_matrix/client/r0/sync?since=s24659_46789_0_25324_1_194_8725_1775_1&timeout=45000 HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 07:00:53,328 - synapse.http.client - 442 - INFO - httppush.process-15 - Received response to POST https://sygnal-dev.my.homeserver.com/_matrix/push/v1/notify: 200
synapse_1         | 2022-05-26 07:00:53,547 - synapse.access.http.8008 - 427 - INFO - GET-2031 - ::ffff:172.31.0.4 - 8008 - {@user_b:my.homeserver.com} Processed request: 0.013sec/0.001sec (0.006sec, 0.000sec) (0.001sec/0.007sec/4) 883B 200 "GET /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:my.homeserver.com/event/$otYFwSkMlcAr-K3GpxeL5T8d1lMHEf6262p22gtVMd4 HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
  • The DB purge process seem to go as expected, no visible errors seem to occur in logs:
synapse.handlers.pagination - 227 - INFO - purge_history_for_rooms_in_range-28 - [purge] Attempting to purge messages in room !XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li
synapse_1         | 2022-05-26 07:45:19,349 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4926 - [purge] removing events from event_auth
synapse_1         | 2022-05-26 07:45:19,351 - synapse.storage.databases.main.purge_events - 162 - INFO - _purge_history-4927 - [purge] found 0 events before cutoff, of which 0 can be deleted
synapse_1         | 2022-05-26 07:45:19,352 - synapse.storage.databases.main.purge_events - 283 - INFO - _purge_history-4925 - [purge] updating room_depth to 0
synapse_1         | 2022-05-26 07:45:19,354 - synapse.storage.databases.main.purge_events - 168 - INFO - _purge_history-4927 - [purge] Finding new backward extremities
synapse_1         | 2022-05-26 07:45:19,356 - synapse.storage.databases.main.purge_events - 124 - INFO - _purge_history-4928 - [purge] looking for events to delete
synapse_1         | 2022-05-26 07:45:19,357 - synapse.storage.databases.main.purge_events - 181 - INFO - _purge_history-4927 - [purge] replacing backward extremities: []
synapse_1         | 2022-05-26 07:45:19,358 - synapse.storage.databases.main.purge_events - 194 - INFO - _purge_history-4927 - [purge] finding state groups referenced by deleted events
synapse_1         | 2022-05-26 07:45:19,359 - synapse.storage.databases.main.purge_events - 206 - INFO - _purge_history-4927 - [purge] found 0 referenced state groups
synapse_1         | 2022-05-26 07:45:19,359 - synapse.storage.databases.main.purge_events - 210 - INFO - _purge_history-4927 - [purge] removing events from event_to_state_groups
synapse_1         | 2022-05-26 07:45:19,361 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4926 - [purge] removing events from event_edges
synapse_1         | 2022-05-26 07:45:19,361 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4927 - [purge] removing events from events
synapse_1         | 2022-05-26 07:45:19,362 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4927 - [purge] removing events from event_json
synapse_1         | 2022-05-26 07:45:19,363 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4926 - [purge] removing events from event_forward_extremities
synapse_1         | 2022-05-26 07:45:19,363 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4927 - [purge] removing events from event_auth
synapse_1         | 2022-05-26 07:45:19,363 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4926 - [purge] removing events from event_reference_hashes
synapse_1         | 2022-05-26 07:45:19,364 - synapse.storage.databases.main.purge_events - 229 - INFO - _purge_history-4926 - [purge] removing events from event_relations
synapse_1         | 2022-05-26 07:45:19,365 - synapse.storage.databases.main.purge_events - 309 - INFO - _purge_history-4925 - [purge] done
  • send a message from user A to user B after DB is purged. User B receives no notifications. Logs for those messages:
synapse_1         | 2022-05-26 07:56:16,220 - synapse.access.http.8008 - 427 - INFO - PUT-3128 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 0.074sec/0.004sec (0.004sec, 0.001sec) (0.001sec/0.005sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li/send/m.room.encrypted/85fed1c4-ade1-4735-888e-8bdbff34462d HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 07:56:16,235 - synapse.push.httppusher - 192 - INFO - httppush.process-30 - Processing 0 unprocessed push actions for @user_b:my.homeserver.com/com.myChatApp/AAA:XXXX starting at stream_ordering 24673
synapse_1         | 2022-05-26 07:56:16,237 - synapse.access.http.8008 - 427 - INFO - GET-3126 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 7.372sec/0.002sec (0.007sec, 0.001sec) (0.006sec/0.007sec/4) 1352B 200 "GET /_matrix/client/r0/sync?since=s24674_46856_0_25340_1_194_8725_1775_1&timeout=45000 HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
  • User B logs into the app, sync with the server, but still do not receive any notifications.
  • User B enters the chat. After that, synapse starts to push notifications to sygnal:
synapse_1         | 2022-05-26 10:52:07,952 - synapse.access.http.8008 - 427 - INFO - PUT-6256 - ::ffff:172.31.0.4 - 8008 - {@user_a:mt.homeserver.com} Processed request: 0.064sec/0.004sec (0.004sec, 0.002sec) (0.001sec/0.005sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li/send/m.room.encrypted/194db016-4d33-4333-8a18-b0d043a3752e HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 10:52:07,968 - synapse.access.http.8008 - 427 - INFO - GET-6249 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 11.750sec/0.001sec (0.005sec, 0.004sec) (0.009sec/0.005sec/4) 1352B 200 "GET /_matrix/client/r0/sync?since=s24678_46993_0_25342_1_194_8725_1775_1&timeout=45000 HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 10:52:07,974 - synapse.access.http.8008 - 427 - INFO - GET-6250 - ::ffff:172.31.0.4 - 8008 - {@user_b:my.homeserver.com} Processed request: 11.661sec/0.001sec (0.005sec, 0.003sec) (0.005sec/0.012sec/4) 1296B 200 "GET /_matrix/client/r0/sync?since=s24678_46993_0_25342_1_194_8725_1775_1&timeout=45000 HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 10:52:07,981 - synapse.push.httppusher - 192 - INFO - httppush.process-34 - Processing 1 unprocessed push actions for @user_b:my.homeserver.com/com.myApp/AAA:XXXXX starting at stream_ordering 24678
synapse_1         | 2022-05-26 10:52:08,077 - synapse.http.client - 442 - INFO - httppush.process-34 - Received response to POST https://sygnal-dev.my.homeserver.com/_matrix/push/v1/notify: 200
synapse_1         | 2022-05-26 10:52:08,274 - synapse.access.http.8008 - 427 - INFO - GET-6257 - ::ffff:172.31.0.4 - 8008 - {@user_b:my.homeserver.com} Processed request: 0.012sec/0.001sec (0.004sec, 0.001sec) (0.001sec/0.006sec/4) 883B 200 "GET /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li/event/$HTe3KFoTXDx7PLk2BmnJ-jDkKi0PAP-shwrOEj0nlnk HTTP/1.0" "okhttp/4.9.2" [0 dbevts]

Version information

  • Homeserver:

Homeserver is deployed inside VPN

If not matrix.org:

  • Version:

  • {
    server_version: "1.59.1",
    python_version: "3.9.13"
    }:

  • {
    server_version: "1.53.0",
    python_version: "3.8.12"
    }:

  • Install method:
    via docker-compose file

  • Platform:
    Matrix ecosystem is deployed in docker containers via docker-compose. Container versions used:

  • v1.59.1

  • v1.53.0

@squahtx
Copy link
Contributor

squahtx commented May 26, 2022

Thanks for identifying the request in the logs, it's going to be helpful.

  • send a message from user A to user B after DB is purged. User B receives no notifications. Logs for those messages:
synapse_1         | 2022-05-26 07:56:16,220 - synapse.access.http.8008 - 427 - INFO - PUT-3128 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 0.074sec/0.004sec (0.004sec, 0.001sec) (0.001sec/0.005sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li/send/m.room.encrypted/85fed1c4-ade1-4735-888e-8bdbff34462d HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 07:56:16,235 - synapse.push.httppusher - 192 - INFO - httppush.process-30 - Processing 0 unprocessed push actions for @user_b:my.homeserver.com/com.myChatApp/AAA:XXXX starting at stream_ordering 24673

Could you search your logs for all lines containing PUT-3128 or httppush.process-29, httppush.process-30, httppush.process-31?

@Taraslysak
Copy link
Author

Taraslysak commented May 27, 2022

PUT-3128:

2022-05-26 07:56:16,220 - synapse.access.http.8008 - 427 - INFO - PUT-3128 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 0.074sec/0.004sec (0.004sec, 0.001sec) (0.001sec/0.005sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:my.homeserver.com/send/m.room.encrypted/85fed1c4-ade1-4735-888e-8bdbff34462d HTTP/1.0" "okhttp/4.9.2" [0 dbevts]

httppush.process-29:

2022-05-26 07:56:08,324 - synapse.push.httppusher - 192 - INFO - httppush.process-29 - Processing 0 unprocessed push actions for @device_b:my.homeserver.com/com.myChatApp/AAA:XXXXX starting at stream_ordering 24673

httppush.process-30:

2022-05-26 07:56:16,235 - synapse.push.httppusher - 192 - INFO - httppush.process-30 - Processing 0 unprocessed push actions for @user_b:my.homeserver.com/com.myChatApp/AAA:AXXXXX starting at stream_ordering 24673

httppush.process-31:

2022-05-26 07:58:50,778 - synapse.push.httppusher - 192 - INFO - httppush.process-31 - Processing 0 unprocessed push actions for @user_b:my.homeserver.com/com.myChatApp/AAA:XXXXX starting at stream_ordering 24673

@squahtx squahtx added A-Push Issues related to push/notifications T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels May 30, 2022
@squahtx
Copy link
Contributor

squahtx commented May 30, 2022

Thanks for the logs. I think we've got enough info to figure out what's going wrong now.

synapse_1         | 2022-05-26 07:56:16,220 - synapse.access.http.8008 - 427 - INFO - PUT-3128 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 0.074sec/0.004sec (0.004sec, 0.001sec) (0.001sec/0.005sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li/send/m.room.encrypted/85fed1c4-ade1-4735-888e-8bdbff34462d HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 07:56:16,235 - synapse.push.httppusher - 192 - INFO - httppush.process-30 - Processing 0 unprocessed push actions for @user_b:my.homeserver.com/com.myChatApp/AAA:XXXX starting at stream_ordering 24673

From your logs, it can be seen that the HTTP pusher picks up 0 unprocessed push actions after a message is sent.
The list of unprocessed push actions come from EventPushActionsWorkerStore.get_unread_push_actions_for_user_in_range_for_http, which combines the result of two queries:

  • The first query gets the read receipt in each room from receipts_linearized, looks up the corresponding event in events, gets that event's stream_ordering and returns all event_push_actions with a larger stream_ordering for each room.
  • The second query gets all event_push_actions for rooms with no read receipts.

Notably, if @user_b's read receipt is on an event that has been purged, then the first query won't pick up any notifications for that room.

The reason push notifications work again when @user_b enters the room and sends a message is that their read receipt gets updated to point to an event that exists again.

To fix this, we could either:

  • Delete entries in receipts_linearized pointing at purged events during purges
  • Or update one of the queries in EventPushActionsWorkerStore.get_unread_push_actions_for_user_in_range_for_http
    • Maybe change events INNER JOIN receipts_linearized to events RIGHT JOIN receipts_linearized and sprinkle COALESCE(...)s around?

@squahtx squahtx added S-Minor Blocks non-critical functionality, workarounds exist. P4 (OBSOLETE: use S- labels.) Okay backlog: will not schedule, will accept patches labels May 30, 2022
@squahtx squahtx changed the title Push notifications stop to work after DB purge Push notifications stop working when a history purge deletes the event a read receipt is pointing to May 30, 2022
@squahtx squahtx changed the title Push notifications stop working when a history purge deletes the event a read receipt is pointing to Push notifications stop working when a history purge deletes the event a read receipt is pointing at May 30, 2022
@Taraslysak
Copy link
Author

Thanks for the logs. I think we've got enough info to figure out what's going wrong now.

synapse_1         | 2022-05-26 07:56:16,220 - synapse.access.http.8008 - 427 - INFO - PUT-3128 - ::ffff:172.31.0.4 - 8008 - {@user_a:my.homeserver.com} Processed request: 0.074sec/0.004sec (0.004sec, 0.001sec) (0.001sec/0.005sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!XYwqVywzUXnTMmIwvf:matrix-dev.kryptr.li/send/m.room.encrypted/85fed1c4-ade1-4735-888e-8bdbff34462d HTTP/1.0" "okhttp/4.9.2" [0 dbevts]
synapse_1         | 2022-05-26 07:56:16,235 - synapse.push.httppusher - 192 - INFO - httppush.process-30 - Processing 0 unprocessed push actions for @user_b:my.homeserver.com/com.myChatApp/AAA:XXXX starting at stream_ordering 24673

From your logs, it can be seen that the HTTP pusher picks up 0 unprocessed push actions after a message is sent. The list of unprocessed push actions come from EventPushActionsWorkerStore.get_unread_push_actions_for_user_in_range_for_http, which combines the result of two queries:

  • The first query gets the read receipt in each room from receipts_linearized, looks up the corresponding event in events, gets that event's stream_ordering and returns all event_push_actions with a larger stream_ordering for each room.
  • The second query gets all event_push_actions for rooms with no read receipts.

Notably, if @user_b's read receipt is on an event that has been purged, then the first query won't pick up any notifications for that room.

The reason push notifications work again when @user_b enters the room and sends a message is that their read receipt gets updated to point to an event that exists again.

To fix this, we could either:

  • Delete entries in receipts_linearized pointing at purged events during purges

  • Or update one of the queries in EventPushActionsWorkerStore.get_unread_push_actions_for_user_in_range_for_http

    • Maybe change events INNER JOIN receipts_linearized to events RIGHT JOIN receipts_linearized and sprinkle COALESCE(...)s around?

Ok, So will it be fixed in next synapse update, or I should implement some workaround on my own?

@squahtx
Copy link
Contributor

squahtx commented May 31, 2022

It's unlikely that the Synapse team will find the time to fix it for the next update, though a PR with a fix and appropriate tests would be very welcome.

In the mean time it's best to use some sort of workaround. Raising max_lifetime in the retention config might make the bug less likely to happen? Though I'm guessing you've set it that low for a reason.

@laurent-treeb
Copy link

laurent-treeb commented Nov 14, 2022

I made this PR : #14441, it can be helpfull for everyone who wants to override an existing Synapse for versions before 1.62. (I test the modification on a 1.46)

In 1.62, there is a big refactoring with #13005, and i don't know if this refactoring fix the bug.

Explanation of fix:
It's easier to modify the request that get room that have no receipt, by adding rooms that have read receipt that pointed to an event that have been purged.

I only made the modification for get_unread_push_actions_for_user_in_range_for_http but i think it should be done for get_unread_push_actions_for_user_in_range_for_email that have duplicate code....

@DMRobertson DMRobertson added A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) O-Uncommon Most users are unlikely to come across this or unexpected workflow and removed P4 (OBSOLETE: use S- labels.) Okay backlog: will not schedule, will accept patches labels Nov 17, 2022
@anoadragon453
Copy link
Member

Thank you for the excellent write-up @squahtx! I personally like the suggestion of deleting entries in receipts_linearized while doing a purge.

If we wanted to avoid similar desyncs in the future, we could also add a background update to delete any existing rows in receipts_linearized that correspond to an event that no longer exists. And on the completion of that job, modify the event_id column of receipts_linearized to have a foreign key property pointing at the event_id column of events.

@laurent-treeb
Copy link

In 1.80.0 i do not reproduce, maybe the refactoring in 1.62 (#13005) or others fixes have solved this issue. Someone else confirm ?

@erikjohnston
Copy link
Member

This looks to be fixed, I'm adding a regression test in matrix-org/sytest#1361

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) A-Push Issues related to push/notifications O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
6 participants