We don't retry E2EE to-device messages at all. #12851

Closed
Tracked by #673 ...
ara4n opened this issue Mar 24, 2020 · 10 comments · Fixed by matrix-org/matrix-js-sdk#2549
Labels
A-E2EE · O-Uncommon (Most users are unlikely to come across this or unexpected workflow) · S-Major (Severely degrades major functionality or product features, with no satisfactory workaround) · T-Defect · Z-Chronic

Comments

ara4n (Member) commented Mar 24, 2020

Looking at this series of utter failures to send E2E messages, things don't look healthy:

[Screenshot 2020-03-24 at 17 14 57: a run of failed E2EE to-device sends]

Specifically, the 500s never get retried. @uhoreg thinks this may in turn desync and wedge the Olm sessions in question.

t3chguy (Member) commented Oct 5, 2020

Does the order of the to_device messages need maintaining?

uhoreg (Member) commented Oct 5, 2020

Sort of. Weird things may happen if you try to send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message. (If you don't have the "receive a to_device message from that same device" step, then everything should be fine.)

But it would probably be fine if we just retry sending immediately one or two times.

t3chguy (Member) commented Oct 6, 2020

But it would probably be fine if we just retry sending immediately one or two times.

The issue I'm thinking of is that if one request fails, there's a decent probability an immediate retry will fail too. That might mean we need a queue with in-order retrying and back-off, persisted so that we don't forget pending messages when the app gets restarted.

t3chguy (Member) commented Oct 16, 2020

send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message.

What should happen in this case?
Should the retry be dropped?
Should the other message be queued to land after the retry?

uhoreg (Member) commented Oct 16, 2020

Hmm. The problem with that case is that if you receive a message from a device, and then send a new message, then the ratchet would be advanced in such a way that any old messages that you had intended to send will be undecryptable. But if you retry the old message (and it succeeds) before sending the new message, then all should be fine. So perhaps one way to solve this would be to put all the to-be-sent to-device messages in a queue, and don't send the next one until the current one has been sent (or we've given up)?
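A minimal TypeScript sketch of that approach, strictly serialising sends so the next message only goes out once the current one has been delivered or abandoned. The `SerialToDeviceQueue` name and the `sendToDevice` callback are illustrative assumptions, not matrix-js-sdk API:

```ts
interface QueuedToDeviceMessage {
    deviceKey: string; // "userId/deviceId" the message is destined for
    payload: object;   // already-encrypted to-device content
}

class SerialToDeviceQueue {
    private queue: QueuedToDeviceMessage[] = [];
    private sending = false;

    constructor(
        private sendToDevice: (msg: QueuedToDeviceMessage) => Promise<void>,
        private maxAttempts = 3,
    ) {}

    enqueue(msg: QueuedToDeviceMessage): void {
        this.queue.push(msg);
        void this.drain();
    }

    // Send strictly in order: never start the next message until the current
    // one has either been delivered or we have given up on it.
    private async drain(): Promise<void> {
        if (this.sending) return;
        this.sending = true;
        while (this.queue.length > 0) {
            const msg = this.queue[0];
            let delivered = false;
            for (let attempt = 1; attempt <= this.maxAttempts && !delivered; attempt++) {
                try {
                    await this.sendToDevice(msg);
                    delivered = true;
                } catch (e) {
                    // swallow and retry; give up after maxAttempts
                }
            }
            this.queue.shift(); // remove it whether delivered or abandoned
        }
        this.sending = false;
    }
}
```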

dkasak (Member) commented Apr 19, 2022

The problem with that case is that if you receive a message from a device, and then send a new message, then the ratchet would be advanced in such a way that any old messages that you had intended to send will be undecryptable.

At least vodozemac would happily handle this case since it retains a number of old chains so that it can decrypt messages encrypted with the old chain but arriving late. So things should still work out even if this scenario happens.

@t3chguy added the S-Major and O-Uncommon labels May 17, 2022
ara4n (Member, Author) commented Jun 22, 2022

send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message.

What should happen in this case?
Should the retry be dropped?
Should the other message be queued to land after the retry?

I think all this was trying to say is:

  • We need to queue and retry outbound to-device messages
  • Retry schedule can be exponential back-off
  • If we receive a message from a given device, we should probably retry our queue of messages destined to that device.

There is no need to drop to-device messages on the floor.

Separately, in future we'll need the ability to cancel to-device messages which are no longer relevant - both from being stored on the server, and being queued up to be sent to the server in the first place. For instance, MSC3401 needs this for killing off irrelevant stale VoIP events. So, if the application knows a message should be unqueued/deleted, it can do so. But the stack itself should just do reliable delivery by default.

This causes a large number of my most obnoxious UISIs, FWIW; cf. the backlinks above. It can also undermine MSC3401 pretty badly.
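A rough sketch of the retry policy in those bullet points, with per-device exponential back-off and an immediate retry when we hear from the device. The class and method names here are hypothetical, not actual matrix-js-sdk API:

```ts
class ToDeviceRetryQueue {
    private pending = new Map<string, object[]>(); // deviceKey -> queued payloads
    private attempts = new Map<string, number>();
    private inFlight = new Set<string>();

    constructor(private send: (deviceKey: string, payload: object) => Promise<void>) {}

    queue(deviceKey: string, payload: object): void {
        const list = this.pending.get(deviceKey) ?? [];
        list.push(payload);
        this.pending.set(deviceKey, list);
        void this.flush(deviceKey);
    }

    // Called when a to-device message arrives *from* a device: retry anything
    // we still have queued for it, per the third bullet point above.
    onMessageFrom(deviceKey: string): void {
        this.attempts.set(deviceKey, 0);
        void this.flush(deviceKey);
    }

    private async flush(deviceKey: string): Promise<void> {
        if (this.inFlight.has(deviceKey)) return;
        this.inFlight.add(deviceKey);
        try {
            const list = this.pending.get(deviceKey) ?? [];
            while (list.length > 0) {
                await this.send(deviceKey, list[0]); // oldest first, in order
                list.shift();
                this.attempts.set(deviceKey, 0);
            }
        } catch (e) {
            // Exponential back-off: 2s, 4s, 8s, ... capped at 5 minutes.
            const attempt = (this.attempts.get(deviceKey) ?? 0) + 1;
            this.attempts.set(deviceKey, attempt);
            const delayMs = Math.min(2000 * 2 ** attempt, 5 * 60 * 1000);
            setTimeout(() => void this.flush(deviceKey), delayMs);
        } finally {
            this.inFlight.delete(deviceKey);
        }
    }
}
```

A real implementation would also persist the queue so pending messages survive an app restart, as noted earlier in the thread.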

@ara4n removed the X-Blocked label Jun 22, 2022
@kittykat removed the P1 label Jul 20, 2022
dbkr (Member) commented Jul 22, 2022

Having looked at this for a bit, an initial plan would be to just make the to-device API match the event sending API and make it retry automatically for a period of time, then reject with a mechanism to initiate a retry manually. However, if we really make it match then this probably wouldn't help a great deal since regular messages fail after about 30 seconds, after which the user manually retries sending them. The to-device messages would therefore either need to retry for (much) longer automatically, or we'd need to automatically retry the to-device message queue when retrying a regular message.

At the moment I'm a bit torn on whether to retry them while maintaining API compatibility, or whether to make a new API altogether (the current one is very basic and not very pretty). The main problem is that the current API is a bit fire-and-forget, so we need to avoid to-device messages building up in a queue if we start retrying them without changing the API.
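For illustration, a hedged sketch of the "retry automatically for a while, then reject with a manual-retry handle" behaviour described above; `sendWithRetryWindow` and `RetriableSendError` are made-up names for this sketch, not existing API:

```ts
class RetriableSendError extends Error {
    constructor(public readonly retry: () => Promise<void>, cause: unknown) {
        super(`to-device send failed: ${String(cause)}`);
    }
}

async function sendWithRetryWindow(
    attemptSend: () => Promise<void>,
    windowMs = 30_000,   // roughly how long regular event sends are retried
    baseDelayMs = 2_000,
): Promise<void> {
    const deadline = Date.now() + windowMs;
    let delay = baseDelayMs;
    for (;;) {
        try {
            await attemptSend();
            return;
        } catch (e) {
            if (Date.now() + delay > deadline) {
                // Out of time: surface the failure, but hand the caller a way
                // to trigger a manual retry, like resending a failed room event.
                throw new RetriableSendError(
                    () => sendWithRetryWindow(attemptSend, windowMs, baseDelayMs),
                    e,
                );
            }
            await new Promise((resolve) => setTimeout(resolve, delay));
            delay *= 2;
        }
    }
}
```

As the comment above notes, to-device messages would likely need a much longer window than room events, or a hook to re-drive the queue when the user manually retries a regular message.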

fkwp (Contributor) commented Jul 27, 2022

Can we double-check that we also have reliable to-device messages on mobile?

dbkr (Member) commented Jul 29, 2022

Have created equivalent bugs on iOS & Android.

su-ex added a commit to SchildiChat/matrix-js-sdk that referenced this issue Aug 22, 2022
* Add txn_id support to sliding sync ([matrix-org#2567](matrix-org#2567)).
* Emit an event when the client receives TURN servers ([matrix-org#2529](matrix-org#2529)).
* Add support for stable prefixes for MSC2285 ([matrix-org#2524](matrix-org#2524)).
* Remove stream-replacement ([matrix-org#2551](matrix-org#2551)).
* Add support for sending user-defined encrypted to-device messages ([matrix-org#2528](matrix-org#2528)).
* Retry to-device messages ([matrix-org#2549](matrix-org#2549)). Fixes element-hq/element-web#12851.
* Sliding sync: add missing filters from latest MSC ([matrix-org#2555](matrix-org#2555)).
* Use stable prefixes for MSC3827 ([matrix-org#2537](matrix-org#2537)).
* Fix: Handle parsing of a beacon info event without asset ([matrix-org#2591](matrix-org#2591)). Fixes element-hq/element-web#23078.
* Fix finding event read up to if stable private read receipts is missing ([matrix-org#2585](matrix-org#2585)). Fixes element-hq/element-web#23027.
* Fixed a sliding sync issue where history could be interpreted as live events ([matrix-org#2583](matrix-org#2583)).
* Don't load the sync accumulator if there's already a sync persist in flight ([matrix-org#2569](matrix-org#2569)).
t3chguy added a commit that referenced this issue Oct 17, 2024
Co-authored-by: github-merge-queue <118344674+github-merge-queue@users.noreply.github.com>
Co-authored-by: Michael Telatynski <7t3chguy@gmail.com>