
Switch accounts storage lock to DashMap #12126

Merged — 9 commits merged into solana-labs:master from FixAccounts on Oct 14, 2020

Conversation

@carllin (Contributor) commented Sep 9, 2020

Problem

Accounts scans from RPC hold:

  1. the account storage lock for the entire duration of the scan, which blocks replay on create_and_insert_store() during account commit.

  2. the account index lock for the entire duration of the scan, which blocks replay in AccountsDb store() -> update_index() during account commit.

Summary of Changes

This PR experiments with switching the global accounts storage lock (part 1 above) to DashMap: https://github.com/xacrimon/dashmap, a concurrent hash map implemented by sharding the table. This removes the need to hold the global AccountStorage read lock in scan_accounts, which was blocking create_and_insert_store().
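
As a rough sketch of the shape of the change (types loosely inferred from the diff snippets quoted later in this thread, not the exact definitions in accounts_db.rs):

```rust
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Slot = u64;
type AppendVecId = usize;
struct AccountStorageEntry; // stand-in for the real per-AppendVec metadata

// Before: one global RwLock that every reader and writer contends on.
struct AccountStorageOld(HashMap<Slot, HashMap<AppendVecId, Arc<AccountStorageEntry>>>);
struct AccountsDbOld {
    storage: RwLock<AccountStorageOld>,
}

// After: DashMap shards the table internally, so a long scan only pins the
// shards it is currently reading instead of blocking the whole map, and
// create_and_insert_store() can add entries to other shards concurrently.
struct AccountStorage(DashMap<Slot, HashMap<AppendVecId, Arc<AccountStorageEntry>>>);
struct AccountsDb {
    storage: AccountStorage,
}
```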

TODO: This can also potentially be expanded to replace the accounts index lock in part 2 above

TODO: Reason about the correctness of the places where I've replaced the large account_storage read locks, specifically around cleaning and shrinking. @ryoqun, I would really appreciate a review in those areas!

Pertinent gotchas with v3 of DashMap: xacrimon/dashmap#74

Other candidates that were considered but not chosen:

  1. chashmap: No iterator support
  2. evmap: Poorer write performance under heavy read contention
  3. flurry (Java concurrent hashmap port): Not as performant as DashMap

Fixes #

codecov bot commented Sep 9, 2020

Codecov Report

Merging #12126 into master will decrease coverage by 0.0%.
The diff coverage is 96.9%.

@@            Coverage Diff            @@
##           master   #12126     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         360      360             
  Lines       84873    84899     +26     
=========================================
+ Hits        69549    69556      +7     
- Misses      15324    15343     +19     

@t-nelson (Contributor) commented Sep 9, 2020

Pertinent gotchas with v3 of DashMap: xacrimon/dashmap#74

The 4.0.0 release candidates are alleged to resolve this issue

@@ -702,10 +707,9 @@ impl AccountsDB {
// Calculate store counts as if everything was purged
// Then purge if we can
let mut store_counts: HashMap<AppendVecId, (usize, HashSet<Pubkey>)> = HashMap::new();
let storage = self.storage.read().unwrap();
@carllin (Author) commented Sep 10, 2020:

From my understanding, the extra risk that removing this large lock adds is race conditions where other slots in self.storage can now be modified concurrently. The three places this can happen (a toy sketch of point 2 follows the list):

  1. https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L970-L974. Shouldn't be possible because we hold the shrink_candidate_slots lock, and shrink_all_slots() -> do_shrink_slot_forced() is not called after startup, so the lock should be respected.

  2. https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L1256-L1259. Can store_with_hashes() -> handle_reclaims_maybe_cleanup() remove a slot that exists in purges.account_infos, such that the call below to self.storage.0.get(&slot).unwrap() panics?

  3. https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L1236-L1238. Adding a new storage entry should be ok, as that should be on some future non-rooted slot which shouldn't exist in account_infos.
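
A hypothetical toy illustration of the concern in point 2 (the function and types below are made up; the real code is the self.storage.0.get(&slot).unwrap() call referenced above):

```rust
// Toy illustration only: `slot_store_count` and its types are hypothetical.
// Without the big storage read lock, another thread can remove `slot` from
// the map between the caller's decision to purge and this lookup, so the
// unwrap() would panic — exactly the race described in point 2.
use dashmap::DashMap;

type Slot = u64;

fn slot_store_count(storage: &DashMap<Slot, usize>, slot: Slot) -> usize {
    let count = storage.get(&slot).unwrap(); // panics if a concurrent cleanup
                                             // already removed `slot`
    *count
}
```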

@ryoqun (Member) replied:

@carllin

  1. Yeah, I think your understanding is correct.
  2. Yeah, it can panic! You are addressing this in "Fix rooted accounts cleanup, simplify locking" #12194.
  3. Yeah, this is correct as well.

@carllin (Author) replied:

Awesome, thanks!

@ryoqun (Member) commented Sep 14, 2020

TODO: This can also potentially be expanded to replace the accounts index lock in part 2 above

Btw, this would need a concurrent BTreeMap, because AccountsIndex currently uses BTreeMap for predictable rent-collection scanning.
Maybe something like https://docs.rs/concread/0.2.3/concread/bptree/struct.BptreeMap.html

@@ -381,7 +386,7 @@ pub struct AccountsDB {
/// Keeps tracks of index into AppendVec on a per slot basis
pub accounts_index: RwLock<AccountsIndex<AccountInfo>>,

pub storage: RwLock<AccountStorage>,
@ryoqun (Member) commented on the diff:

cool :)

Comment on lines 1236 to 1226
- let mut stores = self.storage.write().unwrap();
- let slot_storage = stores.0.entry(slot).or_insert_with(HashMap::new);
+ let mut slot_storage = self.storage.0.entry(slot).or_insert_with(HashMap::new);
  slot_storage.insert(store.id, store_for_index);
@ryoqun (Member) commented on the diff:

yay! :)

- let stores = self.storage.read().unwrap();
- if let Some(slot_stores) = stores.0.get(&slot) {
+ let slot_stores_guard = self.storage.0.get(&slot);
@ryoqun (Member) commented on the diff:

nit: I think this isn't a lock guard anymore. So, remove the _guard suffix?

@carllin (Author) replied:

Heh, well technically I think it's still a guard on the specific DashMap shard: https://docs.rs/dashmap/3.11.10/src/dashmap/lib.rs.html#432-438, i.e. you shouldn't hold it for too long.
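
For illustration, a minimal example (toy key/value types, not Solana code) of how the returned Ref still behaves like a guard on its shard:

```rust
use dashmap::DashMap;

fn main() {
    let map: DashMap<u64, String> = DashMap::new();
    map.insert(42, "hello".to_string());

    {
        // `get` returns a Ref guard that keeps a read lock on the shard
        // containing key 42 for as long as `entry` is alive.
        let entry = map.get(&42).unwrap();
        println!("{}", *entry);
        // A concurrent write that hashes to the same shard would block until
        // `entry` is dropped, which is why the guard shouldn't be held long.
    } // shard read lock released here

    map.insert(43, "world".to_string());
}
```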

@ryoqun (Member) replied:

I see!

@ryoqun (Member) commented Sep 14, 2020

I think this is better than #12132 because it addresses the problem more directly.

Anyway, I wonder how best to fix the locks for AccountsDB...

@carllin (Author) commented Sep 14, 2020

@ryoqun I think this PR and #12132 may actually work together 😃

DashMap essentially partitions the HashMap into N shards: https://github.com/xacrimon/dashmap/blob/b2951f801bff82461759c4aa28fd59ef51919956/src/lib.rs#L53, based on the key hash, so there isn't one giant lock everyone competes for; internally, keys that hash to the same shard may still contend with each other.

#12132 should guarantee that during an accounts scan, a single shard isn't kept locked for the entire portion of the scan that touches that shard.
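
For intuition, a stripped-down sketch of the sharding idea (not DashMap's actual internals): the key's hash picks one of N independently locked sub-maps, so only operations that land in the same shard contend.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Eq + Hash, V> ShardedMap<K, V> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard_index(&self, key: &K) -> usize {
        // The key's hash selects which shard (and which lock) is touched.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: K, value: V) {
        // Only the shard that `key` hashes to is write-locked.
        let idx = self.shard_index(&key);
        self.shards[idx].write().unwrap().insert(key, value);
    }

    fn get_cloned(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        let idx = self.shard_index(key);
        self.shards[idx].read().unwrap().get(key).cloned()
    }
}
```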

@carllin (Author) commented Sep 14, 2020

TODO: This can also potentially be expanded to replace the accounts index lock in part 2 above

Btw, this needs concurrent btree map because currently AccountsIndex uses BTreeMap for predictable rent collection scanning.
Maybe like https://docs.rs/concread/0.2.3/concread/bptree/struct.BptreeMap.html

Unfortunately that one doesn't seem to implement the range() function we want from BTreeMap 😢, though it may be an easy addition.

I think a sufficient temporary bandaid is to switch AccountsIndex.account_maps to a pub account_maps: AccountMap<Pubkey, Arc<AccountMapEntry<T>>>, and clone out batches of the Arcs during the scan. It'll slow down the iteration (we can use a bigger batching factor), but at least it won't hold the lock for the entire scan.
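
A hedged sketch of that bandaid (names like AccountMapEntry, scan_in_batches, and BATCH_SIZE are illustrative, not the actual AccountsIndex API): the scan re-acquires the read lock per batch, clones out Arcs, and processes them with the lock released.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;
use std::sync::{Arc, RwLock};

type Pubkey = [u8; 32];

struct AccountMapEntry; // stand-in for the real per-account index entry

struct AccountsIndex {
    account_maps: RwLock<BTreeMap<Pubkey, Arc<AccountMapEntry>>>,
}

const BATCH_SIZE: usize = 1024;

impl AccountsIndex {
    fn scan_in_batches(&self, mut process: impl FnMut(&Pubkey, &AccountMapEntry)) {
        let mut last_key: Option<Pubkey> = None;
        loop {
            // Hold the read lock only while cloning out one batch of Arcs.
            let batch: Vec<(Pubkey, Arc<AccountMapEntry>)> = {
                let map = self.account_maps.read().unwrap();
                let iter: Box<dyn Iterator<Item = (&Pubkey, &Arc<AccountMapEntry>)>> =
                    match &last_key {
                        Some(key) => {
                            Box::new(map.range((Bound::Excluded(*key), Bound::Unbounded)))
                        }
                        None => Box::new(map.iter()),
                    };
                iter.take(BATCH_SIZE)
                    .map(|(k, v)| (*k, Arc::clone(v)))
                    .collect()
            }; // read lock released here, so writers aren't blocked for the whole scan
            if batch.is_empty() {
                break;
            }
            last_key = Some(batch.last().unwrap().0);
            for (pubkey, entry) in &batch {
                process(pubkey, &**entry);
            }
        }
    }
}
```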

@ryoqun (Member) commented Sep 15, 2020

Unfortunately that one doesn't seem to implement the range() function we want from BTreeMap 😢, though it may be an easy addition.

Sad...

I think a sufficient temporary bandaid is to switch AccountsIndex.account_maps to a pub account_maps: AccountMap<Pubkey, Arc<AccountMapEntry<T>>>, and clone out batches of the Arcs during the scan. It'll slow down the iteration (we can use a bigger batching factor), but at least it won't hold the lock for the entire scan.

Yeah, that will work. Seems like a good bandaid. :)

stale bot commented Sep 23, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the "stale" label Sep 23, 2020
@stale stale bot removed the "stale" label Sep 23, 2020
@carllin carllin force-pushed the FixAccounts branch 4 times, most recently from 6371371 to e72ce20 Compare September 29, 2020 20:04
@carllin carllin marked this pull request as ready for review September 30, 2020 01:09
@ryoqun (Member) commented Oct 2, 2020

@carllin how does this perform in various benchmarks? There are several standard benchmark problems. Also, could you try to simulate concurrent heavy writes by ReplayStage and heavy reads by RPC?

@carllin carllin force-pushed the FixAccounts branch 2 times, most recently from ed51a7b to 8479f62 Compare October 7, 2020 06:58
@carllin (Author) commented Oct 7, 2020

@ryoqun I added a benchmark simulating heavy single reads from RPC along with writes from Replay.

As expected, this case doesn't see a lot of benefit from this change because:

  1. Most of the time, store is blocked on the AccountsIndex lock. This is especially bad on Linux, as the RwLock is read-favored, so the read locks starve the write lock.

  2. store only needs the account storage write lock a couple of times to create the AccountStorageEntry. Afterwards, it just uses a read lock.

I didn't yet add a benchmark here simulating the RPC scan case, which is where I expect to see the most improvement (that benchmark should be an easy extension of this one). I'll add it with the AccountsIndex change.


#[bench]
#[ignore]
fn bench_concurrent_read_write(bencher: &mut Bencher) {
@ryoqun (Member) commented on the diff:

btw, do you have any idea how single-threaded write throughput changes from std::collections::HashMap to DashMap? I think I'm worrying too much, but I'd like to confirm how well DashMap does while supporting concurrency via sharding. Maybe it only trades off maximum throughput by a negligible margin?

@ryoqun (Member) commented Oct 13, 2020:

To add more context, our basic tenet is: batch them if we can, make it concurrent otherwise. So I guess the upper layer hits the AccountsDB in a way optimized for batching, and the single-threaded perf is moderately related to the batching perf. Thus, we're somewhat sensitive to it.

@carllin (Author) commented Oct 13, 2020:

@ryoqun that's a good point. I've run the benchmark here: https://github.com/xacrimon/conc-map-bench, which has three different work profiles: exchange (read-write), cache (read-heavy), and rapid-grow (insert-heavy); more details can be found at that link. The results for a single thread on my dev machine (https://console.cloud.google.com/compute/instancesDetail/zones/us-west1-b/instances/carl-dev?project=principal-lane-200702&authuser=1) are below:

== cache
-- MutexStd
25165824 operations across 1 thread(s) in 14.941146454s; time/op = 593ns

-- DashMap
25165824 operations across 1 thread(s) in 15.589596263s; time/op = 619ns
==

== exchange
-- MutexStd
25165824 operations across 1 thread(s) in 20.954264682s; time/op = 831ns

-- DashMap
25165824 operations across 1 thread(s) in 20.875345754s; time/op = 828ns
==

== rapid grow
-- MutexStd
25165824 operations across 1 thread(s) in 20.22593938s; time/op = 802ns

-- DashMap
25165824 operations across 1 thread(s) in 17.456807471s; time/op = 693ns
==

@ryoqun (Member) commented Oct 13, 2020:

@carllin that report is interesting; thanks for running it and sharing it. So, DashMap operations seem to be on par with non-batched Mutex<HashMap> operations, as far as I can read from the code (https://github.com/xacrimon/conc-map-bench/blob/master/src/adapters.rs#L40)? (I'm assuming our workload is more like cache or exchange, not like rapid grow.)

Ideally, I'd like to see more realistic results that reflect our batched base implementation. Of course, maybe this is a small part of the overall solana-validator runtime... Pardon me for nit-picking here.

Also, how does bench_concurrent_read_write perform before and after DashMap with a single/multi-threaded writer and no readers? I think this bench is easy enough to cherry-pick onto the merge-base commit.

Also, there is accounts-bench/src/main.rs if you have extra stamina; its AccountsDB preparation step tortures AccountsDB quite a bit :)

What I'm a bit worried about is that we want to make sure we don't introduce silent perf degradation for validators that aren't affected by RPC calls (mainnet-beta validators). Also, I'm assuming DashMap internally locks shards while updating. Does that mean we're locking/unlocking them for each read/write operation? In other words, we're moving from batched operation with infrequent locks/unlocks to non-batched operation with frequent locks/unlocks (but much less contention!).

Quite interesting benchmarking showdown. :)
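
To make the batched-versus-per-operation locking trade-off above concrete, a toy comparison (not Solana code; both functions are illustrative):

```rust
use dashmap::DashMap;
use std::collections::HashMap;
use std::sync::RwLock;

fn batched_store(map: &RwLock<HashMap<u64, u64>>, updates: &[(u64, u64)]) {
    // One lock/unlock for the whole batch; other readers and writers wait
    // for the entire batch to finish.
    let mut guard = map.write().unwrap();
    for (k, v) in updates {
        guard.insert(*k, *v);
    }
}

fn per_op_store(map: &DashMap<u64, u64>, updates: &[(u64, u64)]) {
    // One shard lock/unlock per insert; much less contention across shards,
    // but the locking overhead is paid on every operation.
    for (k, v) in updates {
        map.insert(*k, *v);
    }
}
```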

@carllin (Author) replied:

@ryoqun, good suggestions. Here are the results I saw on my MacBook Pro:

bench_concurrent_read_write with 1 writer, no readers:

DashMap:
test bench_concurrent_read_write  ... bench:   3,713,260 ns/iter (+/- 679,081)

Master: 
test bench_concurrent_read_write  ... bench:   3,773,731 ns/iter (+/- 654,643)

accounts-bench/src/main.rs:

DashMap:
clean: false
Creating 10000 accounts
created 10000 accounts in 4 slots create accounts took 148ms

Master: 
clean: false
Creating 10000 accounts
created 10000 accounts in 4 slots create accounts took 145ms

@ryoqun (Member) replied:

@carllin Perfect reporting :)

@ryoqun (Member) commented Oct 13, 2020

almost lgtm (I'd like to see a system-test-level perf report combined with #12126; also, give me a few hours to ponder on this as a final check.)

@ryoqun (Member) commented Oct 13, 2020

almost lgtm

Special thanks for addressing my bunch of nits quickly, as always. This PR matured pretty quickly because of it. :)

ryoqun previously approved these changes Oct 13, 2020

@ryoqun (Member) left a comment:

LGTM code-wise, with all nits resolved correctly!

Please check the perf concerns I wrote about.

@mergify mergify bot dismissed ryoqun’s stale review October 13, 2020 21:17

Pull request has been modified.

@carllin carllin merged commit f8d338c into solana-labs:master Oct 14, 2020
carllin added a commit to carllin/solana that referenced this pull request Oct 17, 2020
Co-authored-by: Carl Lin <carl@solana.com>
@carllin carllin added the v1.4 label Oct 28, 2020
mergify bot pushed a commit that referenced this pull request Oct 28, 2020
Co-authored-by: Carl Lin <carl@solana.com>
(cherry picked from commit f8d338c)
carllin added a commit that referenced this pull request Oct 28, 2020
Co-authored-by: Carl Lin <carl@solana.com>
(cherry picked from commit f8d338c)
mergify bot added a commit that referenced this pull request Oct 28, 2020
Co-authored-by: Carl Lin <carl@solana.com>
(cherry picked from commit f8d338c)

Co-authored-by: carllin <wumu727@gmail.com>