Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

to-device message went missing #15335

Closed
richvdh opened this issue Mar 28, 2023 · 4 comments · Fixed by #15349
Closed

to-device message went missing #15335

richvdh opened this issue Mar 28, 2023 · 4 comments · Fixed by #15349
Assignees
Labels
A-E2EE End-to-end encryption for Matrix clients T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Mar 28, 2023

We received a to-device message on matrix.org:

2023-03-23 13:52:18,439 - synapse.9533_debug - 890 - DEBUG - POST-3642069----- - Stored to-device messages with stream_id 821825860: ['@dave:matrix.org/CSMDXZIMRU (msgid 3eca6bb0-ebaf-4106-b3da-f66da959c621)', ... ]

There is no sign of this message ever being received on the client (rageshake at https://github.com/matrix-org/element-web-rageshakes/issues/20790). The server-side logging isn't completely clear, but I strongly suspect the message was never returned over /sync.

@richvdh
Copy link
Member Author

richvdh commented Mar 28, 2023

Some odd things I note:

  • On all the synchrotrons, the replication data seems to jump straight from stream id 26567797 to 26567959, missing out 821825860:
    2023-03-23 13:52:19,647 - synapse.9533_debug - 492 - DEBUG - process-replication-data-26567797 - to-device messages stream id 821825789, awaking streams for [...]
    ...
    2023-03-23 13:52:20,136 - synapse.9533_debug - 492 - DEBUG - process-replication-data-26567959 - to-device messages stream id 821825870, awaking streams for [...]
    
    Normally there is a line for each stream id.
  • On the encryption worker, I see the following:
    2023-03-23 13:52:19,452 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-1178641 - Streaming: to_device -> 821825789
    2023-03-23 13:52:19,455 - synapse.replication.tcp.resource - 245 - INFO - replication_notifier-1178641 - Sending position: to_device -> 821825860
    
    A batch of replication rows is sent at 13:52:19, but it is limited due to the number of rows to be sent. Then we immediately send a POSITION up to stream id 821825860. This is coming from the code added in Fix bug where 5s delays would occasionally happen. #15150.

According to https://github.com/matrix-org/synapse/blob/develop/synapse/replication/tcp/commands.py#L173, the recipient is meant to read any missing rows from the database, but I'm not convinced that is happening. I also suspect that the code added in #15150 should not be sending a POSITION in this case (is it maybe missing an if not limited), but I don't really understand how our replication streams are meant to work.

@richvdh
Copy link
Member Author

richvdh commented Mar 28, 2023

As @reivilibre points out, https://github.com/matrix-org/synapse/blob/develop/synapse/replication/tcp/commands.py#L164 also says of a POSITION

The tokens are chosen so that no rows were written by the instance between the prev_token and new_token.

... which is certainly not the case here.

@richvdh
Copy link
Member Author

richvdh commented Mar 28, 2023

After closer inspection:

According to https://github.com/matrix-org/synapse/blob/develop/synapse/replication/tcp/commands.py#L173, the recipient is meant to read any missing rows from the database

This is correct, but it is specifically referring to the prev_token, which — unhelpfully — we do not log. (It does actually say it's about the prev_token in the next sentence, but (a) I can't read, and (b) I was led astray by the docs here which say that POSITION only has one stream_id parameter.)

Nevertheless we have the fact that we are sending a POSITION with a new_token of 821825860, despite the fact that there are to-device messages with smaller stream ids which have not sent.

@richvdh
Copy link
Member Author

richvdh commented Mar 28, 2023

New theory is that the to-device stream updates have been limited by the min at https://github.com/matrix-org/synapse/blob/v1.80.0/synapse/storage/databases/main/deviceinbox.py#L620. If that happens (and it does not actually pull out 100 or more rows from the database), it will return a new token of current_id despite the fact we did not even look at all those rows.

(This code was introduced in #7024 and could be an explanation for many of the mysterious to-device-message-disappearances over the years; see #9533.)

@MatMaul MatMaul self-assigned this Mar 28, 2023
@MadLittleMods MadLittleMods added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-E2EE End-to-end encryption for Matrix clients labels Mar 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-E2EE End-to-end encryption for Matrix clients T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants