Ensure message send succeeds even when out of sync #917
Conversation
I can't review since I opened the PR, but the logic makes sense to me. 👍
xmtp_mls/src/groups/mod.rs (outdated)
```diff
@@ -459,7 +461,8 @@ impl MlsGroup {
         let update_interval = Some(5_000_000);
         self.maybe_update_installations(conn.clone(), update_interval, client)
             .await?;
-        self.publish_intents(conn, client).await?;
+        self.publish_intents(conn.clone(), client).await?;
+        self.sync_until_all_intents_resolved(conn, client).await?;
```
This is a lot of overhead to handle a pretty fringe-y edge case. Will have impacts on latency (now we have to wait for at least 2 sequential API requests to complete before returning on every message send) and on rate-limiting.
The last paragraph of the PR description has a justification for this - WDYT?
> Will have impacts on latency
Not to perceived latency if optimistic message sends are used. If non-optimistic sends are used, i.e. confirmation of message send is desired, I'm not sure if it's honest to return success until we have synced.
In general, best practice IMO is for the client app to separate the prepare() and publish() steps like Converse is doing, and to debounce the publish() step
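(For illustration: a minimal sketch of a debounced publish on the client side, assuming a Tokio runtime. `Group`, `prepare_message`, and `publish_messages` are hypothetical stand-ins for whatever prepare/publish split the SDK exposes, not the actual xmtp_mls API.)

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::sleep;

// Hypothetical stand-ins for the prepare/publish split; real SDK method
// names will differ.
#[derive(Clone)]
struct Group;

impl Group {
    fn prepare_message(&self, _payload: &[u8]) {
        // Store the message locally and render it optimistically in the UI.
    }

    async fn publish_messages(&self) {
        // Push all locally prepared messages to the network in one call.
    }
}

// Debounced publisher: many back-to-back prepare_message() calls collapse
// into a single publish_messages() once the sender has been idle for `window`.
async fn debounced_publisher(group: Group, mut notify: mpsc::Receiver<()>, window: Duration) {
    while notify.recv().await.is_some() {
        // At least one message is pending; wait for a quiet window, restarting
        // the timer whenever another notification arrives.
        loop {
            tokio::select! {
                more = notify.recv() => {
                    if more.is_none() {
                        // Channel closed: flush whatever is pending and stop.
                        group.publish_messages().await;
                        return;
                    }
                }
                _ = sleep(window) => {
                    group.publish_messages().await;
                    break;
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let group = Group;
    let publisher = tokio::spawn(debounced_publisher(group.clone(), rx, Duration::from_millis(250)));

    // Three back-to-back sends show up immediately in the UI but result in a
    // single publish once the quiet window elapses (or the channel closes).
    for payload in ["one", "two", "three"] {
        group.prepare_message(payload.as_bytes());
        tx.send(()).await.unwrap();
    }
    drop(tx); // closing the channel flushes the final publish
    publisher.await.unwrap();
}
```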
> on rate-limiting
I think it'll be negligible, as we have significantly smaller publish volume compared to reads
> Given that most clients are going to receive the message from a stream anyways I'm not sure it's necessary.
Receiving it from a stream won't cause the client to retry
> Not to perceived latency if optimistic message sends are used. If non-optimistic sends are used, i.e. confirmation of message send is desired, I'm not sure if it's honest to return success until we have synced.
For an inbox app, separating prepare and send makes sense. But this is also going to affect bulk senders/bots who are almost certainly not using optimistic sends.
> Receiving it from a stream won't cause the client to retry
That's fair. Updated my comment.
Maybe the better question is: if we are going to pay the price for adding a read to every write, couldn't we just sync first and prevent this problem in almost all cases? It would take a hell of a race condition to get 3 commits in between the sync and the send.
Also keep in mind that latency on the receiver side is not affected - the publish is going to take the same amount of time regardless
Sorry didn't see your last comment until just now! Funny given we're talking about stale local state.
> this is also going to affect bulk senders/bots who are almost certainly not using optimistic sends
Are bulk senders sending multiple messages back-to-back within the same conversation, or single messages to many conversations? Is it the case that they need to be up-to-date with their groups regardless?
> if we are going to pay the price for adding a read to every write, couldn't we just sync first and prevent this problem in almost all cases? It would take a hell of a race condition to get 3 commits in between the sync and the send
We could. I'm not sure it'll be that unlikely in large groups, and it seems we've already run into it in our own testing. Also, integrating apps are not guaranteed to have a subscription running, and we're not guaranteed that a running subscription isn't broken or delayed by the network, etc.
The problem is that even if we accept that it's rare, we don't have a good recovery mechanism otherwise. Those messages are pretty much lost, at least until two manual syncs (not subscriptions) happen, which could be a long time later. We could talk about a different way to recover, but it might involve a large refactor. I'm inclined to wait until our work on the new decentralized backend ships, when we will have server-side validation that can trigger the retry without another sync being required.
Also, keep in mind that when talking about the overall volume of syncs, it's not clear that getting client apps to sync every time they open a thread, or on some interval, will create a lower overall volume than syncing every time a send happens.
> it's not clear that getting client apps to sync every time they open a thread, or on some interval, will create a lower overall volume than syncing every time a send happens.
It might be the same volume, but I think that has to happen anyways to catch missing messages. So this would be purely additional
```rust
            message_epoch,
            3, // max_past_epochs, TODO: expose from OpenMLS MlsGroup
        ) {
            conn.set_group_intent_to_publish(intent.id)?;
```
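(For illustration: a minimal, self-contained sketch of the epoch-distance check this branch appears to guard, assuming epochs are plain counters. `is_message_epoch_too_old` and the exact comparison are illustrative, not the actual xmtp_mls code.)

```rust
/// Keys for this many past epochs are retained (mirrors the hard-coded `3` above).
const MAX_PAST_EPOCHS: u64 = 3;

/// Per the PR description, a message created 3 or more commits behind the
/// group's current epoch can no longer be decrypted by anyone, so its intent
/// should be set back to "publish" and retried from the current epoch.
/// (The exact boundary used by OpenMLS/xmtp_mls may differ.)
fn is_message_epoch_too_old(message_epoch: u64, group_epoch: u64) -> bool {
    group_epoch.saturating_sub(message_epoch) >= MAX_PAST_EPOCHS
}

fn main() {
    // Sent from epoch 4 while the group has moved on to epoch 7: 3 commits
    // behind, so recipients have already deleted the matching keys -> retry.
    assert!(is_message_epoch_too_old(4, 7));
    // Only one commit behind: still decryptable, no retry needed.
    assert!(!is_message_epoch_too_old(6, 7));
}
```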
I just realized there is another case we should think about:
What if the recipient is streaming messages (or receiving push notifs) and is also out of date with the current epoch? In that case the original message will decrypt successfully AND they will eventually receive a duplicate when they next sync. I think the message ID will be the same between these retries, but it's probably worth writing a test for that.
Summary of offline discussion: there's a tradeoff between reliability and performance here; for now we choose reliability. It's a reversible decision - removing the extra round-trip is a one-line change. I think the next iteration of the server will also help.
I'd also like to add a test case for the scenario Nick mentioned, but am running into some problems - going to land this first to unblock release, and put that test case up as a separate PR.
The build is broken, but it looks like there is a fix (updating the time crate) in #923 - will land regardless.
We have configured the `max_past_epochs` value to 3, which means that we keep message encryption keys around for 3 epochs before deleting them. This means that if we are 3 commits behind when we send a message, nobody else will be able to decrypt it, because they process everything sequentially and will have already deleted their encryption keys by the time they see it.

The fix is as follows (see the diffs above): after publishing intents, sync until all intents are resolved, and if a published message was created too many epochs behind the group's current epoch, set its intent back to publish so it is retried from the current epoch.
This has the following implications:

- Message sending (waiting for `send_message()` to complete) is slower: there is an extra round trip to pull down the messages afterwards (+ more if the message needs to be retried).

My justification for the slower message send is that we've already set up optimistic message sends, with separate prepare and publish steps. In the event that multiple optimistic message sends happen back-to-back, you can call a single publish at the end. Perhaps we can recommend using optimistic message sends, with debounced publishes, in the docs somewhere.
- Rich