
Switch accounts storage lock to DashMap #12126

Merged — 9 commits merged into solana-labs:master from FixAccounts on Oct 14, 2020

Conversation

@carllin (Contributor) commented Sep 9, 2020

Problem

Accounts scans from RPC hold:

  1. the account storage lock for the entire duration of the scan, which blocks replay on create_and_insert_store() during account commit.

  2. the account index lock for the entire duration of the scan, which blocks replay in AccountsDb store() -> update_index() during account commit.

Summary of Changes

This PR experiments with switching the global accounts storage lock (part 1 above) to DashMap: https://github.com/xacrimon/dashmap, a concurrent hash map implemented by sharding the table. This removes the need to hold the global AccountStorage read lock in scan_accounts, which was blocking create_and_insert_store().
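
As a rough sketch of the shape of the change (types loosely inferred from the diff snippets quoted later in this thread, not the exact definitions in accounts_db.rs):

```rust
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Slot = u64;
type AppendVecId = usize;
struct AccountStorageEntry; // stand-in for the real per-AppendVec metadata

// Before: one global RwLock that every reader and writer contends on.
struct AccountStorageOld(HashMap<Slot, HashMap<AppendVecId, Arc<AccountStorageEntry>>>);
struct AccountsDbOld {
    storage: RwLock<AccountStorageOld>,
}

// After: DashMap shards the table internally, so a long scan only pins the
// shards it is currently reading instead of blocking the whole map, and
// create_and_insert_store() can add entries to other shards concurrently.
struct AccountStorage(DashMap<Slot, HashMap<AppendVecId, Arc<AccountStorageEntry>>>);
struct AccountsDb {
    storage: AccountStorage,
}
```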

TODO: This can also potentially be expanded to replace the accounts index lock in part 2 above

TODO: Reason about the correctness of the places where I've replaced the large account_storage read locks, specifically around cleaning and shrinking. @ryoqun, I would really appreciate a review in those areas!

Pertinent gotchas with v3 of DashMap: xacrimon/dashmap#74

Other candidates that were considered but not chosen:

  1. chashmap: No iterator support
  2. evmap: Poorer write performance under heavy read contention
  3. flurry (Java concurrent hashmap port): Not as performant as DashMap

Fixes #

codecov bot commented Sep 9, 2020

Codecov Report

Merging #12126 into master will decrease coverage by 0.0%.
The diff coverage is 96.9%.

@@            Coverage Diff            @@
##           master   #12126     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         360      360             
  Lines       84873    84899     +26     
=========================================
+ Hits        69549    69556      +7     
- Misses      15324    15343     +19     

@t-nelson (Contributor) commented Sep 9, 2020

Pertinent gotchas with v3 of DashMap: xacrimon/dashmap#74

The 4.0.0 release candidates are alleged to resolve this issue

@@ -702,10 +707,9 @@ impl AccountsDB {
// Calculate store counts as if everything was purged
// Then purge if we can
let mut store_counts: HashMap<AppendVecId, (usize, HashSet<Pubkey>)> = HashMap::new();
let storage = self.storage.read().unwrap();
@carllin (Author) commented Sep 10, 2020:

From my understanding, the extra risk that removing this large lock adds is race conditions where other slots in self.storage can now be modified concurrently. The three places this can happen (a toy sketch of point 2 follows the list):

  1. https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L970-L974. Shouldn't be possible because we hold the shrink_candidate_slots lock, and shrink_all_slots() -> do_shrink_slot_forced() is not called after startup, so the lock should be respected.

  2. https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L1256-L1259. Can store_with_hashes() -> handle_reclaims_maybe_cleanup() remove a slot that exists in purges.account_infos, such that the call below to self.storage.0.get(&slot).unwrap() panics?

  3. https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L1236-L1238. Adding a new storage entry should be ok, as that should be on some future non-rooted slot which shouldn't exist in account_infos.
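
A hypothetical toy illustration of the concern in point 2 (the function and types below are made up; the real code is the self.storage.0.get(&slot).unwrap() call referenced above):

```rust
// Toy illustration only: `slot_store_count` and its types are hypothetical.
// Without the big storage read lock, another thread can remove `slot` from
// the map between the caller's decision to purge and this lookup, so the
// unwrap() would panic — exactly the race described in point 2.
use dashmap::DashMap;

type Slot = u64;

fn slot_store_count(storage: &DashMap<Slot, usize>, slot: Slot) -> usize {
    let count = storage.get(&slot).unwrap(); // panics if a concurrent cleanup
                                             // already removed `slot`
    *count
}
```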

@ryoqun (Member) replied:

@carllin

  1. Yeah, I think your understanding is correct.
  2. Yeah, it can panic! You are addressing this in "Fix rooted accounts cleanup, simplify locking" #12194.
  3. Yeah, this is correct as well.

@carllin (Author) replied:

Awesome, thanks!

@ryoqun (Member) commented Sep 14, 2020

TODO: This can also potentially be expanded to replace the accounts index lock in part 2 above

Btw, this would need a concurrent BTreeMap, because AccountsIndex currently uses BTreeMap for predictable rent-collection scanning.
Maybe something like https://docs.rs/concread/0.2.3/concread/bptree/struct.BptreeMap.html

@@ -381,7 +386,7 @@ pub struct AccountsDB {
/// Keeps tracks of index into AppendVec on a per slot basis
pub accounts_index: RwLock<AccountsIndex<AccountInfo>>,

pub storage: RwLock<AccountStorage>,
@ryoqun (Member) commented on the diff:

cool :)

Comment on lines 1236 to 1226
- let mut stores = self.storage.write().unwrap();
- let slot_storage = stores.0.entry(slot).or_insert_with(HashMap::new);
+ let mut slot_storage = self.storage.0.entry(slot).or_insert_with(HashMap::new);
  slot_storage.insert(store.id, store_for_index);
@ryoqun (Member) commented on the diff:

yay! :)

- let stores = self.storage.read().unwrap();
- if let Some(slot_stores) = stores.0.get(&slot) {
+ let slot_stores_guard = self.storage.0.get(&slot);
@ryoqun (Member) commented on the diff:

nit: I think this isn't a lock guard anymore. So, remove the _guard suffix?

@carllin (Author) replied:

Heh, well technically I think it's still a guard on the specific DashMap shard: https://docs.rs/dashmap/3.11.10/src/dashmap/lib.rs.html#432-438, i.e. you shouldn't hold it for too long.
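
For illustration, a minimal example (toy key/value types, not Solana code) of how the returned Ref still behaves like a guard on its shard:

```rust
use dashmap::DashMap;

fn main() {
    let map: DashMap<u64, String> = DashMap::new();
    map.insert(42, "hello".to_string());

    {
        // `get` returns a Ref guard that keeps a read lock on the shard
        // containing key 42 for as long as `entry` is alive.
        let entry = map.get(&42).unwrap();
        println!("{}", *entry);
        // A concurrent write that hashes to the same shard would block until
        // `entry` is dropped, which is why the guard shouldn't be held long.
    } // shard read lock released here

    map.insert(43, "world".to_string());
}
```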

@ryoqun (Member) replied:

I see!

@ryoqun (Member) commented Sep 14, 2020

I think this is better than #12132 because it addresses the problem more directly.

Anyway, I wonder how best to fix the locks for AccountsDB...

@carllin (Author) commented Sep 14, 2020

@ryoqun I think this PR and #12132 may actually work together 😃

DashMap essentially partitions the HashMap into N shards: https://github.com/xacrimon/dashmap/blob/b2951f801bff82461759c4aa28fd59ef51919956/src/lib.rs#L53, based on the key hash, so there isn't one giant lock everyone competes for; internally, keys that hash to the same shard may still contend with each other.

#12132 should guarantee that during an accounts scan, a single shard isn't kept locked for the entire portion of the scan that touches that shard.
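
For intuition, a stripped-down sketch of the sharding idea (not DashMap's actual internals): the key's hash picks one of N independently locked sub-maps, so only operations that land in the same shard contend.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Eq + Hash, V> ShardedMap<K, V> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard_index(&self, key: &K) -> usize {
        // The key's hash selects which shard (and which lock) is touched.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: K, value: V) {
        // Only the shard that `key` hashes to is write-locked.
        let idx = self.shard_index(&key);
        self.shards[idx].write().unwrap().insert(key, value);
    }

    fn get_cloned(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        let idx = self.shard_index(key);
        self.shards[idx].read().unwrap().get(key).cloned()
    }
}
```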

@carllin (Author) commented Sep 14, 2020

TODO: This can also potentially be expanded to replace the accounts index lock in part 2 above

Btw, this needs concurrent btree map because currently AccountsIndex uses BTreeMap for predictable rent collection scanning.
Maybe like https://docs.rs/concread/0.2.3/concread/bptree/struct.BptreeMap.html

Unfortunately that one doesn't seem to implement the range() function we want from BTreeMap 😢, though it may be an easy addition.

I think a sufficient temporary bandaid is to switch AccountsIndex.account_maps to a pub account_maps: AccountMap<Pubkey, Arc<AccountMapEntry<T>>>, and clone out batches of the Arcs during the scan. It'll slow down the iteration (we can use a bigger batching factor), but at least it won't hold the lock for the entire scan.
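
A hedged sketch of that bandaid (names like AccountMapEntry, scan_in_batches, and BATCH_SIZE are illustrative, not the actual AccountsIndex API): the scan re-acquires the read lock per batch, clones out Arcs, and processes them with the lock released.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;
use std::sync::{Arc, RwLock};

type Pubkey = [u8; 32];

struct AccountMapEntry; // stand-in for the real per-account index entry

struct AccountsIndex {
    account_maps: RwLock<BTreeMap<Pubkey, Arc<AccountMapEntry>>>,
}

const BATCH_SIZE: usize = 1024;

impl AccountsIndex {
    fn scan_in_batches(&self, mut process: impl FnMut(&Pubkey, &AccountMapEntry)) {
        let mut last_key: Option<Pubkey> = None;
        loop {
            // Hold the read lock only while cloning out one batch of Arcs.
            let batch: Vec<(Pubkey, Arc<AccountMapEntry>)> = {
                let map = self.account_maps.read().unwrap();
                let iter: Box<dyn Iterator<Item = (&Pubkey, &Arc<AccountMapEntry>)>> =
                    match &last_key {
                        Some(key) => {
                            Box::new(map.range((Bound::Excluded(*key), Bound::Unbounded)))
                        }
                        None => Box::new(map.iter()),
                    };
                iter.take(BATCH_SIZE)
                    .map(|(k, v)| (*k, Arc::clone(v)))
                    .collect()
            }; // read lock released here, so writers aren't blocked for the whole scan
            if batch.is_empty() {
                break;
            }
            last_key = Some(batch.last().unwrap().0);
            for (pubkey, entry) in &batch {
                process(pubkey, &**entry);
            }
        }
    }
}
```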

@ryoqun (Member) commented Sep 15, 2020

Unfortunately that one doesn't seem to implement the range() function we want from BTreeMap 😢, though it may be an easy addition.

Sad...

I think a sufficient temporary bandaid is to switch AccountsIndex.account_maps to a pub account_maps: AccountMap<Pubkey, Arc<AccountMapEntry<T>>>, and clone out batches of the Arcs during the scan. It'll slow down the iteration (we can use a bigger batching factor), but at least it won't hold the lock for the entire scan.

Yeah, that will work. Seems like a good bandaid. :)

stale bot commented Sep 23, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the "stale" label Sep 23, 2020
@stale stale bot removed the "stale" label Sep 23, 2020
@carllin carllin force-pushed the FixAccounts branch 4 times, most recently from 6371371 to e72ce20 Compare September 29, 2020 20:04
@carllin carllin marked this pull request as ready for review September 30, 2020 01:09
@ryoqun (Member) commented Oct 2, 2020

@carllin how does this perform in various benchmarks? There are several standard benchmark problems. Also, could you try to simulate concurrent heavy writes by ReplayStage and heavy reads by RPC?

@carllin carllin force-pushed the FixAccounts branch 2 times, most recently from ed51a7b to 8479f62 Compare October 7, 2020 06:58
@carllin (Author) commented Oct 7, 2020

@ryoqun I added a benchmark simulating heavy single reads from RPC along with writes from Replay.

As expected, this case doesn't see a lot of benefit from this change because:

  1. Most of the time, store is blocked on the AccountsIndex lock. This is especially bad on Linux, as the RwLock is read-favored, so the read locks starve the write lock.

  2. store only needs the account storage write lock a couple of times to create the AccountStorageEntry. Afterwards, it just uses a read lock.

I didn't yet add a benchmark here simulating the RPC scan case, which is where I expect to see the most improvement (that benchmark should be an easy extension of this one). I'll add it with the AccountsIndex change.


#[bench]
#[ignore]
fn bench_concurrent_read_write(bencher: &mut Bencher) {
@ryoqun (Member) commented on the diff:

btw, do you have any idea how single-threaded write throughput changes from std::collections::HashMap to DashMap? I think I'm worrying too much, but I'd like to confirm how well DashMap does while supporting concurrency via sharding. Maybe it only trades off maximum throughput by a negligible margin?

@ryoqun (Member) commented Oct 13, 2020:

To add more context, our basic tenet is: batch them if we can, make it concurrent otherwise. So I guess the upper layer hits the AccountsDB in a way optimized for batching, and the single-threaded perf is moderately related to the batching perf. Thus, we're somewhat sensitive to it.

@carllin (Author) commented Oct 13, 2020:

@ryoqun that's a good point. I've run the benchmark here: https://github.com/xacrimon/conc-map-bench, which has three different work profiles: exchange (read-write), cache (read-heavy), and rapid-grow (insert-heavy); more details can be found at that link. The results for a single thread on my dev machine (https://console.cloud.google.com/compute/instancesDetail/zones/us-west1-b/instances/carl-dev?project=principal-lane-200702&authuser=1) are below:

== cache
-- MutexStd
25165824 operations across 1 thread(s) in 14.941146454s; time/op = 593ns

-- DashMap
25165824 operations across 1 thread(s) in 15.589596263s; time/op = 619ns
==

== exchange
-- MutexStd
25165824 operations across 1 thread(s) in 20.954264682s; time/op = 831ns

-- DashMap
25165824 operations across 1 thread(s) in 20.875345754s; time/op = 828ns
==

== rapid grow
-- MutexStd
25165824 operations across 1 thread(s) in 20.22593938s; time/op = 802ns

-- DashMap
25165824 operations across 1 thread(s) in 17.456807471s; time/op = 693ns
==

@ryoqun (Member) commented Oct 13, 2020:

@carllin that report is interesting; thanks for running it and sharing it. So, DashMap operations seem to be on par with non-batched Mutex<HashMap> operations, as far as I can read from the code (https://github.com/xacrimon/conc-map-bench/blob/master/src/adapters.rs#L40)? (I'm assuming our workload is more like cache or exchange, not like rapid grow.)

Ideally, I'd like to see more realistic results that reflect our batched base implementation. Of course, maybe this is a small part of the overall solana-validator runtime... Pardon me for nit-picking here.

Also, how does bench_concurrent_read_write perform before and after DashMap with a single/multi-threaded writer and no readers? I think this bench is easy enough to cherry-pick onto the merge-base commit.

Also, there is accounts-bench/src/main.rs if you have extra stamina; its AccountsDB preparation step tortures AccountsDB quite a bit :)

What I'm a bit worried about is that we want to make sure we don't introduce silent perf degradation for validators that aren't affected by RPC calls (mainnet-beta validators). Also, I'm assuming DashMap internally locks shards while updating. Does that mean we're locking/unlocking them for each read/write operation? In other words, we're moving from batched operation with infrequent locks/unlocks to non-batched operation with frequent locks/unlocks (but much less contention!).

Quite interesting benchmarking showdown. :)
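
To make the batched-versus-per-operation locking trade-off above concrete, a toy comparison (not Solana code; both functions are illustrative):

```rust
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::RwLock;

fn batched_store(map: &RwLock<HashMap<u64, u64>>, updates: &[(u64, u64)]) {
    // One lock/unlock for the whole batch; other readers and writers wait
    // for the entire batch to finish.
    let mut guard = map.write().unwrap();
    for (k, v) in updates {
        guard.insert(*k, *v);
    }
}

fn per_op_store(map: &DashMap<u64, u64>, updates: &[(u64, u64)]) {
    // One shard lock/unlock per insert; much less contention across shards,
    // but the locking overhead is paid on every operation.
    for (k, v) in updates {
        map.insert(*k, *v);
    }
}
```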

@carllin (Author) replied:

@ryoqun, good suggestions. Here are the results I saw on my MacBook Pro:

bench_concurrent_read_write with 1 writer, no readers:

DashMap:
test bench_concurrent_read_write  ... bench:   3,713,260 ns/iter (+/- 679,081)

Master: 
test bench_concurrent_read_write  ... bench:   3,773,731 ns/iter (+/- 654,643)

accounts-bench/src/main.rs:

DashMap:
clean: false
Creating 10000 accounts
created 10000 accounts in 4 slots create accounts took 148ms

Master: 
clean: false
Creating 10000 accounts
created 10000 accounts in 4 slots create accounts took 145ms

@ryoqun (Member) replied:

@carllin Perfect reporting :)

@ryoqun (Member) commented Oct 13, 2020

almost lgtm (I'd like to see a system-test-level perf report combined with #12126; also, give me a few hours to ponder on this as a final check.)

@ryoqun (Member) commented Oct 13, 2020

almost lgtm

Special thanks for addressing my bunch of nits quickly, as always. This PR matured pretty quickly because of it. :)

ryoqun previously approved these changes Oct 13, 2020

@ryoqun (Member) left a comment:

LGTM code-wise, with all nits resolved correctly!

Please check the perf concerns I wrote about.

@mergify mergify bot dismissed ryoqun’s stale review October 13, 2020 21:17

Pull request has been modified.

@carllin carllin merged commit f8d338c into solana-labs:master Oct 14, 2020
carllin added a commit to carllin/solana that referenced this pull request Oct 17, 2020
Co-authored-by: Carl Lin <carl@solana.com>
@carllin carllin added the v1.4 label Oct 28, 2020
mergify bot pushed a commit that referenced this pull request Oct 28, 2020
Co-authored-by: Carl Lin <carl@solana.com>
(cherry picked from commit f8d338c)
carllin added a commit that referenced this pull request Oct 28, 2020
Co-authored-by: Carl Lin <carl@solana.com>
(cherry picked from commit f8d338c)
mergify bot added a commit that referenced this pull request Oct 28, 2020
Co-authored-by: Carl Lin <carl@solana.com>
(cherry picked from commit f8d338c)

Co-authored-by: carllin <wumu727@gmail.com>