Ensure message send succeeds even when out of sync #917
Conversation
I can't review since I opened the PR, but the logic makes sense to me. 👍
xmtp_mls/src/groups/mod.rs (outdated)
```diff
@@ -459,7 +461,8 @@ impl MlsGroup {
         let update_interval = Some(5_000_000);
         self.maybe_update_installations(conn.clone(), update_interval, client)
             .await?;
-        self.publish_intents(conn, client).await?;
+        self.publish_intents(conn.clone(), client).await?;
+        self.sync_until_all_intents_resolved(conn, client).await?;
```
This is a lot of overhead to handle a pretty fringe-y edge case. Will have impacts on latency (now we have to wait for at least 2 sequential API requests to complete before returning on every message send) and on rate-limiting.
The last paragraph of the PR description has a justification for this - WDYT?
> Will have impacts on latency
Not to perceived latency if optimistic message sends are used. If non-optimistic sends are used, i.e. confirmation of message send is desired, I'm not sure if it's honest to return success until we have synced.
In general, best practice IMO is for the client app to separate the prepare() and publish() steps like Converse is doing, and to debounce the publish() step
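(For illustration: a minimal sketch of a debounced publish on the client side, assuming a Tokio runtime. `Group`, `prepare_message`, and `publish_messages` are hypothetical stand-ins for whatever prepare/publish split the SDK exposes, not the actual xmtp_mls API.)

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::sleep;

// Hypothetical stand-ins for the prepare/publish split; real SDK method
// names will differ.
#[derive(Clone)]
struct Group;

impl Group {
    fn prepare_message(&self, _payload: &[u8]) {
        // Store the message locally and render it optimistically in the UI.
    }

    async fn publish_messages(&self) {
        // Push all locally prepared messages to the network in one call.
    }
}

// Debounced publisher: many back-to-back prepare_message() calls collapse
// into a single publish_messages() once the sender has been idle for `window`.
async fn debounced_publisher(group: Group, mut notify: mpsc::Receiver<()>, window: Duration) {
    while notify.recv().await.is_some() {
        // At least one message is pending; wait for a quiet window, restarting
        // the timer whenever another notification arrives.
        loop {
            tokio::select! {
                more = notify.recv() => {
                    if more.is_none() {
                        // Channel closed: flush whatever is pending and stop.
                        group.publish_messages().await;
                        return;
                    }
                }
                _ = sleep(window) => {
                    group.publish_messages().await;
                    break;
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let group = Group;
    let publisher = tokio::spawn(debounced_publisher(group.clone(), rx, Duration::from_millis(250)));

    // Three back-to-back sends show up immediately in the UI but result in a
    // single publish once the quiet window elapses (or the channel closes).
    for payload in ["one", "two", "three"] {
        group.prepare_message(payload.as_bytes());
        tx.send(()).await.unwrap();
    }
    drop(tx); // closing the channel flushes the final publish
    publisher.await.unwrap();
}
```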
> on rate-limiting
I think it'll be negligible, as we have significantly smaller publish volume compared to reads
> Given that most clients are going to receive the message from a stream anyways I'm not sure it's necessary.
Receiving it from a stream won't cause the client to retry
> Not to perceived latency if optimistic message sends are used. If non-optimistic sends are used, i.e. confirmation of message send is desired, I'm not sure if it's honest to return success until we have synced.
For an inbox app, separating prepare and send makes sense. But this is also going to affect bulk senders/bots who are almost certainly not using optimistic sends.
> Receiving it from a stream won't cause the client to retry
That's fair. Updated my comment.
Maybe the better question is: if we are going to pay the price for adding a read to every write, couldn't we just sync first and prevent this problem in almost all cases? It would take a hell of a race condition to get 3 commits in between the sync and the send.
Also keep in mind that latency on the receiver side is not affected - the publish is going to take the same amount of time regardless
Sorry didn't see your last comment until just now! Funny given we're talking about stale local state.
> this is also going to affect bulk senders/bots who are almost certainly not using optimistic sends
Are bulk senders sending multiple messages back-to-back within the same conversation, or single messages to many conversations? Is it the case that they need to be up-to-date with their groups regardless?
> if we are going to pay the price for adding a read to every write, couldn't we just sync first and prevent this problem in almost all cases? It would take a hell of a race condition to get 3 commits in between the sync and the send
We could. I'm not sure it'll be that unlikely in large groups, and it seems we've already run into it in our own testing. Also, integrating apps are not guaranteed to have a subscription running, and we're not guaranteed that a running subscription isn't broken or delayed by the network, etc.
The problem is that even if we accept that it's rare, we don't have a good recovery mechanism otherwise. Those messages are pretty much lost, at least until two manual syncs (not subscriptions) happen, which could be a long time later. We could talk about a different way to recover, but it might involve a large refactor. I'm inclined to wait until our work on the new decentralized backend ships, when we will have server-side validation that can trigger the retry without another sync being required.
Also, keep in mind that when talking about the overall volume of syncs, it's not clear that getting client apps to sync every time they open a thread, or on some interval, will create a lower overall volume than syncing every time a send happens.
> it's not clear that getting client apps to sync every time they open a thread, or on some interval, will create a lower overall volume than syncing every time a send happens.
It might be the same volume, but I think that has to happen anyways to catch missing messages. So this would be purely additional
```rust
            message_epoch,
            3, // max_past_epochs, TODO: expose from OpenMLS MlsGroup
        ) {
            conn.set_group_intent_to_publish(intent.id)?;
```
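(For illustration: a minimal, self-contained sketch of the epoch-distance check this branch appears to guard, assuming epochs are plain counters. `is_message_epoch_too_old` and the exact comparison are illustrative, not the actual xmtp_mls code.)

```rust
/// Keys for this many past epochs are retained (mirrors the hard-coded `3` above).
const MAX_PAST_EPOCHS: u64 = 3;

/// Per the PR description, a message created 3 or more commits behind the
/// group's current epoch can no longer be decrypted by anyone, so its intent
/// should be set back to "publish" and retried from the current epoch.
/// (The exact boundary used by OpenMLS/xmtp_mls may differ.)
fn is_message_epoch_too_old(message_epoch: u64, group_epoch: u64) -> bool {
    group_epoch.saturating_sub(message_epoch) >= MAX_PAST_EPOCHS
}

fn main() {
    // Sent from epoch 4 while the group has moved on to epoch 7: 3 commits
    // behind, so recipients have already deleted the matching keys -> retry.
    assert!(is_message_epoch_too_old(4, 7));
    // Only one commit behind: still decryptable, no retry needed.
    assert!(!is_message_epoch_too_old(6, 7));
}
```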
I just realized there is another case we should think about:
What if the recipient is streaming messages (or receiving push notifs) and is also out of date with the current epoch? In that case the original message will decrypt successfully AND they will eventually receive a duplicate when they next sync. I think the message ID will be the same between these retries, but it's probably worth writing a test for that.
Summary of offline discussion: there's a tradeoff between reliability and performance here; for now we choose reliability. It's a reversible decision - removing the extra round-trip is a one-line change. I think the next iteration of the server will also help.
I'd also like to add a test case for the scenario Nick mentioned, but am running into some problems - going to land this first to unblock release, and put that test case up as a separate PR.
The build is broken, but it looks like there is a fix (updating the time crate) in #923 - will land regardless.
We have configured the `max_past_epochs` value to 3, which means that we keep message encryption keys around for 3 epochs before deleting them. This means that if we are 3 commits behind when we send a message, nobody else will be able to decrypt it, because they process everything sequentially and will have already deleted their encryption keys by the time they see it.

The fix is as follows (see the diffs above): after publishing intents, sync until all intents are resolved, and if a published message was created too many epochs behind the group's current epoch, set its intent back to publish so it is retried from the current epoch.
This has the following implications:

- Message sending (waiting for `send_message()` to complete) is slower: there is an extra round trip to pull down the messages afterwards (+ more if the message needs to be retried).

My justification for the slower message send is that we've already set up optimistic message sends, with separate prepare and publish steps. In the event that multiple optimistic message sends happen back-to-back, you can call a single publish at the end. Perhaps we can recommend using optimistic message sends, with debounced publishes, in the docs somewhere.
- Rich