Describe the bug
@Hywan completely nerd-sniped me. Blame him. He pointed me at this UTD rageshake on EIX. It turns out that the cause lay with the Olm session between the two devices, not with the delivery of to-device events or room keys. For some reason, the Olm session had "wedged". There are known reasons why sessions may wedge, so I set about ruling out the sender being at fault.
I collected all undelivered to-device events on all sliding sync shards and analysed the message content. The ciphertext has a particular format, and there should be only one event with a given (sender key, chain index, ratchet key) triple. If there are multiple, it indicates that the same key was used to encrypt multiple events. The receiver cannot know this has happened, and so the Olm session will wedge. The results:
Out of 234,837 to-device events with `type: 1` (already established Olm sessions):
- there were 118 duplicate events (with differing ciphertexts) spanning 8 users;
- where possible, mapping the `sender_key` to a device ID and hence a display name shows the affected clients are:
  - Element iOS (4 devices)
  - Cinny Web (2 devices)
  - Element Desktop: Linux (1 device)
When this happens, the receiver will see a UTD. Wedged session recovery should fix this for subsequent messages.
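The duplicate scan described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual analysis script, and the event field names (`sender_key`, `chain_index`, `ratchet_key`, `ciphertext`) are assumptions rather than the real schema:

```javascript
// Flag (sender key, chain index, ratchet key) triples that map to more
// than one *distinct* ciphertext: the sender encrypted two different
// payloads from the same ratchet state, which wedges the receiver.
// Identical ciphertexts (e.g. network retries) are deduplicated by the
// Set and therefore not counted.
// NOTE: the field names here are hypothetical, not the real event schema.
function findReusedRatchetStates(events) {
  const byTriple = new Map();
  for (const ev of events) {
    const key = `${ev.sender_key}|${ev.chain_index}|${ev.ratchet_key}`;
    if (!byTriple.has(key)) byTriple.set(key, new Set());
    byTriple.get(key).add(ev.ciphertext);
  }
  return [...byTriple.entries()]
    .filter(([, ciphertexts]) => ciphertexts.size > 1)
    .map(([triple]) => triple);
}
```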
To Reproduce
Unsure. EI and ED don't really share any code, so this feels like a systemic problem with how the SDKs are being used.
Expected behavior
Olm messages should always be decryptable.
My working theory on this:
Both platforms have independent processes which need to talk to each other:
- Element iOS: NSE process and App process
- Element Web: multiple tabs, web workers
Faults in how these processes talk to each other could cause this by:
- Lack of a safe read-modify-write on the Session pickle. If this occurs, both processes will retrieve the Session at chain index N and encrypt messages based on N. The fix here would be to ensure there is proper locking such that this cannot happen.
- Alternatively, a lack of flushing coupled with hard crashes could cause a chain index to be reused.
- Alternatively, both processes receiving messages at the same time (?), though that would be a receiver-chain fault rather than a sender-chain one.
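The first failure mode can be shown with a toy model of the lost update: both "processes" read the pickle at chain index N before either writes N+1 back, so both encrypt at the same ratchet step. The store and pickle shape here are purely illustrative, not the SDK's real formats:

```javascript
// Toy model of an unsynchronised read-modify-write on the Session
// pickle. Process A and process B both read before either writes back,
// so both encrypt at chain index N and the same message key is used
// for two different plaintexts -- exactly the duplicate pattern seen
// in the to-device event scan. Purely illustrative, not SDK code.
function raceTwoWriters() {
  const store = { pickle: { chainIndex: 0 } };

  const a = { ...store.pickle }; // process A reads index N
  const b = { ...store.pickle }; // process B also reads index N (stale)

  const msgA = { chainIndex: a.chainIndex, body: "hello" };
  store.pickle = { chainIndex: a.chainIndex + 1 }; // A writes N+1

  const msgB = { chainIndex: b.chainIndex, body: "world" };
  store.pickle = { chainIndex: b.chainIndex + 1 }; // B clobbers with N+1 too

  return { msgA, msgB, finalIndex: store.pickle.chainIndex };
}
```

With proper locking (e.g. holding an exclusive lock across the read, encrypt, and write-back), process B's read would be forced to wait and would see index N+1 instead.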
I've looked at the JS SDK and cannot cause the read-modify-write to trip up. I cannot get indexeddb locking to fail cross-tab / web-worker on Firefox or Chrome either (with a test rig that increments an integer in a loop). Worryingly though, neither browser provides durable writes by default:
> The complete event may thus be delivered quicker than before, however, there exists a small chance that the entire transaction will be lost if the OS crashes or there is a loss of system power before the data is flushed to disk. Since such catastrophic events are rare, most consumers should not need to concern themselves further.

(Firefox, on its durability guarantees.)

> It's important to note that strict does not ensure that changes are actually written immediately to disk. After a site calls put(), there's still some finite amount of time during which a power failure could cause the change to not make it to disk and therefore be missing the next time the app runs.

(Chrome, on durability now defaulting to relaxed.)
I'm unsure how much this matters in practice, as if there is a power failure, it's reasonable to assume that the newly encrypted event did not have enough time to be sent.
AFAICT, the concern over writes not being durable is not relevant for cross-tab locking. (If a claim to exclusive ownership of the indexeddb is lost due to power failure / process crash, then we can also be sure that the now-dead process is not mutating the indexeddb, so it doesn't matter that the claim is lost.)
However, it is relevant for updates to the Olm Session in indexeddb. It seems like we should be setting durability:strict on indexeddb transactions that bump the Olm sender ratchet.
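A sketch of what that could look like, using the `durability` option accepted by `IDBDatabase.transaction()`. The store name `olmSessions` and the record shape are assumptions for illustration, and browsers treat the flag as a hint they may not fully honour (per the quotes above):

```javascript
// Sketch: persist a session pickle in a readwrite transaction with
// durability: "strict", so the `complete` event fires only once the
// ratchet bump has been flushed (to the extent the browser honours the
// hint). The "olmSessions" store name and record shape are hypothetical.
function persistSessionPickle(db, sessionId, pickle) {
  return new Promise((resolve, reject) => {
    const txn = db.transaction(["olmSessions"], "readwrite", {
      durability: "strict",
    });
    txn.objectStore("olmSessions").put({ sessionId, pickle });
    txn.oncomplete = () => resolve();
    txn.onerror = () => reject(txn.error);
  });
}
```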
Durability references:
- https://developer.mozilla.org/en-US/docs/Web/API/IDBTransaction#firefox_durability_guarantees
- https://developer.chrome.com/blog/indexeddb-durability-mode-now-defaults-to-relaxed