feat: implement time lease based locks for the CryptoStore #2140

Merged
bnjbvr merged 14 commits into matrix-org:main from lease-locks
Jun 29, 2023


bnjbvr
Member

@bnjbvr bnjbvr commented Jun 23, 2023

This implements a new time-lease-based lock for the CryptoStore that doesn't require explicit unlocking. That makes it more robust in the context of #1928, where any process may die (for instance because the device is running out of battery), or where an unexpected flow causes a lock not to be released properly in one process or the other.

//! This is a per-process lock that may be used only for very specific use
//! cases, where multiple processes might concurrently write to the same
//! database at the same time; this would invalidate crypto store caches, so
//! that should be done mindfully. Such a lock can be acquired multiple times by
//! the same process, and it remains active as long as there's at least one user
//! in a given process.
//!
//! The lock is implemented using time-based leases to values inserted in a
//! crypto store. The store maintains the lock identifier (key), who's the
//! current holder (value), and an expiration timestamp on the side; see also
//! `CryptoStore::try_take_leased_lock` for more details.
//!
//! The lock is initially acquired for a certain period of time (namely, the
//! duration of a lease, aka `LEASE_DURATION_MS`), and then a "heartbeat" task
//! renews the lease to extend its duration, every so often (namely, every
//! `EXTEND_LEASE_EVERY_MS`). Since the tokio scheduler might be busy, the
//! extension request should happen way more frequently than the duration of a
//! lease, in case a deadline is missed. The current values have been chosen to
//! reflect that, with a ratio of 1:10 as of 2023-06-23.
//!
//! Releasing the lock happens naturally, by not renewing a lease. It happens
//! automatically after the duration of the last lease, at most.
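
To make the lease mechanics described above concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration, not the SDK's implementation: `InMemoryLeaseStore`, its field names, and the explicit `now_ms` parameter are hypothetical, and the method only borrows the name `try_take_leased_lock` from the doc comment without claiming to match the real signature.

```rust
use std::collections::HashMap;

/// One row per lock key: the current holder and when its lease runs out.
/// Times are milliseconds since an arbitrary epoch.
struct Lease {
    holder: String,
    expires_at_ms: u64,
}

/// Hypothetical in-memory stand-in for the store-side bookkeeping.
#[derive(Default)]
pub struct InMemoryLeaseStore {
    leases: HashMap<String, Lease>,
}

impl InMemoryLeaseStore {
    /// Try to take (or extend) the lease on `key` for `holder`.
    /// Returns `true` if `holder` owns the lease afterwards.
    /// `now_ms` is passed in explicitly (an assumption of this sketch)
    /// so that the behaviour is deterministic.
    pub fn try_take_leased_lock(
        &mut self,
        now_ms: u64,
        lease_duration_ms: u32,
        key: &str,
        holder: &str,
    ) -> bool {
        if let Some(lease) = self.leases.get(key) {
            // Refuse only if someone else holds a lease that has not expired.
            let held_by_other = lease.holder != holder && lease.expires_at_ms > now_ms;
            if held_by_other {
                return false;
            }
        }
        // First acquisition, renewal by the same holder, or takeover of an
        // expired lease: (re)write the row with a fresh expiration.
        self.leases.insert(
            key.to_owned(),
            Lease {
                holder: holder.to_owned(),
                expires_at_ms: now_ms + u64::from(lease_duration_ms),
            },
        );
        true
    }
}
```

Note that there is no unlock operation at all: a holder that stops renewing simply loses the lock once `expires_at_ms` has passed.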

Notes

  • This is not implemented on indexeddb, because time in wasm is slightly more complicated (requires a host API). Maybe we already have that, but I haven't bothered since I'm not sure that wasm may need the lock in the first place (maybe for a Web app using the SDK, running from multiple tabs?).
  • I've kept the former implementation for insert_custom_value_if_missing/remove_custom_value at the moment, but we could remove them if this lock proves to be more robust.

@bnjbvr bnjbvr requested a review from a team as a code owner June 23, 2023 10:50
@bnjbvr bnjbvr requested review from jplatte and removed request for a team June 23, 2023 10:50
@codecov

codecov bot commented Jun 23, 2023

Codecov Report

Patch coverage: 76.50% and project coverage change: +0.22 🎉

Comparison is base (06600ac) 76.35% compared to head (21d2f0b) 76.57%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2140      +/-   ##
==========================================
+ Coverage   76.35%   76.57%   +0.22%     
==========================================
  Files         163      164       +1     
  Lines       17516    17565      +49     
==========================================
+ Hits        13374    13451      +77     
+ Misses       4142     4114      -28     
| Impacted Files | Coverage Δ |
|---|---|
| crates/matrix-sdk-crypto/src/store/mod.rs | 74.89% <0.00%> (+0.50%) ⬆️ |
| crates/matrix-sdk/src/room/joined/mod.rs | 63.41% <0.00%> (+0.91%) ⬆️ |
| crates/matrix-sdk/src/sliding_sync/list/mod.rs | 90.52% <ø> (-0.05%) ⬇️ |
| ...rix-sdk/src/sliding_sync/list/request_generator.rs | 98.24% <ø> (ø) |
| crates/matrix-sdk-ui/src/timeline/to_device.rs | 16.00% <16.00%> (ø) |
| crates/matrix-sdk-common/src/ring_buffer.rs | 92.00% <50.00%> (-8.00%) ⬇️ |
| crates/matrix-sdk/src/encryption/mod.rs | 41.56% <50.00%> (-2.44%) ⬇️ |
| crates/matrix-sdk-ui/src/encryption_sync/mod.rs | 64.81% <56.25%> (-6.02%) ⬇️ |
| crates/matrix-sdk-crypto/src/store/locks.rs | 98.30% <97.56%> (+51.53%) ⬆️ |
| ...s/matrix-sdk-crypto/src/store/integration_tests.rs | 100.00% <100.00%> (ø) |
... and 8 more

... and 2 files with indirect coverage changes


@bnjbvr bnjbvr force-pushed the lease-locks branch 4 times, most recently from 9d6b67f to 5acbac6 Compare June 23, 2023 17:03
@ara4n
Member

ara4n commented Jun 24, 2023

why do we need time leased locks? when i tested this with file based posix locks, you could acquire the lock no matter how dirtily the process holding the lock got knifed.

@poljar
Contributor

poljar commented Jun 26, 2023

> why do we need time leased locks? when i tested this with file based posix locks, you could acquire the lock no matter how dirtily the process holding the lock got knifed.

There isn't a great and obvious choice for a flock based crate and this was easier to implement in a platform-independent way in a timely manner. We certainly need to revise some parts of this as we go along.

Collaborator

@jplatte jplatte left a comment

This is just to note that I've started review. I haven't gotten very far yet, but will continue tomorrow.

Member

@Hywan Hywan left a comment

Overall it looks good to me :-).

@@ -68,6 +106,16 @@ pub struct CryptoStoreLock {
 }

 impl CryptoStoreLock {
     /// Amount of time a lease of the lock should last, in milliseconds.
     pub const LEASE_DURATION_MS: u32 = 2000;
Member

Could be 1000, or even 800, so that it cannot be really perceived by the app end-user?

Member Author

Fwiw: we chatted about it, and we'll try renew-every 50ms / lease-for 500ms: still a 1:10 ratio, and the lease should be closer to non-perceivable by the user (ideally we'd go under 300ms, but that seems more dangerous with respect to the scheduler being super busy).
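
The trade-off discussed here can be put into numbers with a small sketch. The helper below is hypothetical (it is not part of the SDK); it counts how many renewal ticks fall strictly inside one lease, which is how many chances the heartbeat gets before a freshly renewed lease runs out. The 200ms interval used for the original 2000ms lease is inferred from the 1:10 ratio stated in the doc comment, not quoted from the code.

```rust
/// Number of renewal attempts scheduled strictly before a freshly renewed
/// lease expires, assuming ticks at `renew_every_ms`, `2 * renew_every_ms`, ...
/// after the last successful renewal.
pub fn renewal_chances_per_lease(lease_ms: u32, renew_every_ms: u32) -> u32 {
    assert!(renew_every_ms > 0, "the renewal interval must be non-zero");
    // A tick landing exactly on the expiration time comes too late, hence the -1.
    lease_ms.saturating_sub(1) / renew_every_ms
}

/// Upper bound on how long another process has to wait after the holder dies:
/// the holder may have renewed right before dying, so a full lease remains.
pub fn worst_case_takeover_wait_ms(lease_ms: u32) -> u32 {
    lease_ms
}
```

Under these assumptions, both 2000ms/200ms and 500ms/50ms leave nine renewal chances per lease, so the scheduler tolerance is the same; only the worst-case takeover wait shrinks, from 2000ms to 500ms.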

// Clone data to be owned by the task.
let this = self.clone();

matrix_sdk_common::executor::spawn(async move {
Member

Are we sure we don't want to keep a handle to this spawned task, so that we can abort it in case of emergency?

Member Author

Yes, it's also useful to make sure we don't spawn it multiple times 👍 Added some code to handle that.
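
The idea of keeping the handle around, both to stop the task and to avoid spawning it twice, could look roughly like the sketch below. It is an illustration under assumptions, not the code added in the PR: it uses `std::thread` instead of the SDK's async executor so that it stays self-contained, and `LeaseRenewer`, `ensure_started` and `shutdown` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Owns at most one heartbeat thread at a time.
pub struct LeaseRenewer {
    task: Mutex<Option<JoinHandle<()>>>,
    stop: Arc<AtomicBool>,
    /// Counts renewals; a stand-in for actually extending the lease.
    pub renewals: Arc<AtomicU32>,
}

impl LeaseRenewer {
    pub fn new() -> Self {
        Self {
            task: Mutex::new(None),
            stop: Arc::new(AtomicBool::new(false)),
            renewals: Arc::new(AtomicU32::new(0)),
        }
    }

    /// Spawns the heartbeat unless one is already running.
    /// Returns `true` only if a new heartbeat was spawned.
    pub fn ensure_started(&self, every: Duration) -> bool {
        let mut task = self.task.lock().unwrap();
        if task.is_some() {
            return false; // The stored handle tells us one is already running.
        }
        self.stop.store(false, Ordering::SeqCst);
        let (stop, renewals) = (Arc::clone(&self.stop), Arc::clone(&self.renewals));
        *task = Some(thread::spawn(move || loop {
            renewals.fetch_add(1, Ordering::SeqCst);
            if stop.load(Ordering::SeqCst) {
                break;
            }
            thread::sleep(every);
        }));
        true
    }

    /// Asks the heartbeat to stop and waits for it to finish.
    pub fn shutdown(&self) {
        self.stop.store(true, Ordering::SeqCst);
        let handle = self.task.lock().unwrap().take();
        if let Some(handle) = handle {
            let _ = handle.join();
        }
    }
}
```

Calling `ensure_started` a second time is a no-op that returns `false`, which is the "don't spawn it multiple times" property mentioned in the reply.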

Contributor

@poljar poljar left a comment

I think this looks good. It would be nice if people on the EX side would test this but I guess that's what the nightly is there for.

From my point of view, we can tweak the timeouts if they end up being too long later on.

match lock.try_lock_once().await? {
Some(guard) => Ok(Some(guard)),
None => {
// We didn't get the lock on the first attempt, so that means that another
Contributor

Are we certain that we don't need to reload when we did acquire the lock on the first attempt? Isn't there a race where we might get the lock just as it was released by the other process?

Member Author

Indeed you're right! This is going to be more generally fixed by #2155 :-)
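
To illustrate the race raised in this thread: succeeding on the first attempt says nothing about whether another process wrote to the database in between. One hypothetical way to detect that (not necessarily what #2155 does) is to remember the last holder even after its lease has expired, and to ask for a cache reload whenever the previous holder was somebody else. `LockRecord` and `CacheAction` are made-up names for this sketch, and time is again passed in explicitly.

```rust
/// What the acquiring process should do with its in-memory caches.
#[derive(Debug, PartialEq)]
pub enum CacheAction {
    Keep,
    Reload,
}

/// Hypothetical lock row that keeps the last holder after expiration.
pub struct LockRecord {
    last_holder: Option<String>,
    expires_at_ms: u64,
}

impl LockRecord {
    pub fn new() -> Self {
        Self { last_holder: None, expires_at_ms: 0 }
    }

    /// Returns `None` if another holder still has a live lease, otherwise
    /// takes the lease and says whether caches may be stale.
    pub fn try_acquire(&mut self, now_ms: u64, lease_ms: u64, holder: &str) -> Option<CacheAction> {
        let previous_is_other = self.last_holder.as_deref().map_or(false, |h| h != holder);
        if previous_is_other && self.expires_at_ms > now_ms {
            return None;
        }
        // Even a first-attempt success needs a reload if someone else held
        // the lock since we last did.
        let action = if previous_is_other { CacheAction::Reload } else { CacheAction::Keep };
        self.last_holder = Some(holder.to_owned());
        self.expires_at_ms = now_ms + lease_ms;
        Some(action)
    }
}
```

In this model, a process that acquires the lock right after another process's lease expired gets `Reload` on its very first attempt, which is exactly the case the question points at.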

bnjbvr added 10 commits June 29, 2023 12:10
Signed-off-by: Benjamin Bouvier <public@benj.me>
See top comment in matrix-sdk-crypto/src/store/locks.rs
Signed-off-by: Benjamin Bouvier <public@benj.me>
… there

Signed-off-by: Benjamin Bouvier <public@benj.me>
… failed

Signed-off-by: Benjamin Bouvier <public@benj.me>
Signed-off-by: Benjamin Bouvier <public@benj.me>
Signed-off-by: Benjamin Bouvier <public@benj.me>
Signed-off-by: Benjamin Bouvier <public@benj.me>
Signed-off-by: Benjamin Bouvier <public@benj.me>
// operation
// running in a transaction.

drop(renew_task.take());
Member

Be careful, it doesn't abort the task!

Member Author

Thanks, TIL. I've added an abort() that's guarded by #[cfg(not(target_arch = "wasm32"))]; this code wouldn't work on wasm32 anyway because it's using tokio::time::sleep.
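
The pitfall in this thread can be reproduced with the standard library alone. Dropping a tokio `JoinHandle` detaches the task rather than cancelling it, and `std::thread::JoinHandle` behaves the same way for threads, so the sketch below uses a thread as an analogy. Since std threads have no `abort()`, the sketch stops the worker with a cooperative flag instead; the function name is hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns a ticking thread, drops its handle, and reports whether the
/// thread kept ticking afterwards.
pub fn keeps_ticking_after_handle_drop() -> bool {
    let ticks = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let handle = {
        let (ticks, stop) = (Arc::clone(&ticks), Arc::clone(&stop));
        thread::spawn(move || {
            while !stop.load(Ordering::SeqCst) {
                ticks.fetch_add(1, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(1));
            }
        })
    };

    // Dropping the handle only detaches the thread; it does not stop it.
    drop(handle);

    let before = ticks.load(Ordering::SeqCst);
    thread::sleep(Duration::from_millis(100));
    let after = ticks.load(Ordering::SeqCst);

    // Explicit cancellation is what actually ends the work.
    stop.store(true, Ordering::SeqCst);

    after > before
}
```

With tokio, the equivalent of setting the flag is calling `abort()` on the handle before dropping it, which is the fix described in the reply above.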

@Velin92
Member

Velin92 commented Jun 29, 2023

Tested, and this thing is bulletproof: I tried scenarios where both the NSE and the Main App did not stop the sync and did not release the lock, and everything worked as expected.
@bnjbvr

@bnjbvr bnjbvr enabled auto-merge (squash) June 29, 2023 12:51
@bnjbvr bnjbvr merged commit 73ad518 into matrix-org:main Jun 29, 2023
@bnjbvr bnjbvr deleted the lease-locks branch June 29, 2023 13:20