Introduce background stale AppendVec shrink mechanism #9219
Conversation
Cool. I thought we might need something like this.
Howa! Super exciting result... :) 30min => 30sec (60x speedup!) on the same machine! Before:
After (passed through
devnet snapshots are now too big to even load:
Codecov Report
@@ Coverage Diff @@
## master #9219 +/- ##
========================================
+ Coverage 80.9% 81.1% +0.1%
========================================
Files 276 276
Lines 60970 61188 +218
========================================
+ Hits 49346 49632 +286
+ Misses 11624 11556 -68
@sakridge I think this PR is ready for serious code review! :)
CI is unstable...
@@ -1959,6 +1959,7 @@ impl Bank {
     /// calculation and could shield other real accounts.
     pub fn verify_snapshot_bank(&self) -> bool {
         self.clean_accounts();
+        self.shrink_all_stale_slots();
What is the performance of this?
To be honest, I haven't measured it yet, although I know this shrinking operation isn't trivial. I prioritized landing this for the poor devnet bullied by the breakage...
My general analysis (heh, not validated by measurements) is like this:
In both of the following cases, I think the performance cost is justifiable to avoid the DoS this PR is trying to address.
The snapshot codepath (as you commented here): Given innocent snapshots, the shrinking process is basically a no-op, because almost all slots should already have been shrunk by the snapshot-producing validator. The no-op means exactly that we bail out rather early, after one new full scan of all AppendVecs collectively. There should be no lock contention pressure because we're the only user of AccountsDB at the time of snapshot load (= verification).
Given malicious snapshots, this adds a linear time increase to reject them, ultimately bounded by MAX_SNAPSHOT_ARCHIVE_UNPACKED_SIZE. (By the way, I noticed generate_index is quadratic in that regard while working on this, but that's a different issue...)
The non-snapshot codepath (as an additional remark): As the validator continuously does this at a rate of 1 slot/100ms, cycling over all alive slots, it should be rather rare to do actual shrinking. So this is like the snapshot codepath, but for individual slots. Even if a slot did get shrunk, that slot is rooted (no other writer) and the shrunken slot is old by definition, so there should be no lock contention.
Ultimately, this PR introduces 1 read of account data, amortized, and no read lock, by batching under normal/healthy operation.
As a knob for performance tuning, we have this magic number to reduce the count of written accounts.
Also, we could make the before-bail-out cost even cheaper by maintaining the alive account count in the AppendVec.
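The early bail-out described above could be sketched roughly like this. This is a hedged illustration, not the PR's actual code: `should_shrink`, its parameter names, and the `SHRINK_RATIO` constant are all assumptions standing in for the real accounts_db internals.

```rust
// Illustrative "magic number" knob: only shrink when fewer than 80% of the
// stored entries are still alive (the real threshold may differ).
const SHRINK_RATIO: f64 = 0.80;

// Cheap bail-out check: given the alive count from the index and the total
// number of entries stored in a slot's AppendVec, decide whether rewriting
// the slot would actually reclaim meaningful space.
fn should_shrink(alive_count: usize, stored_count: usize) -> bool {
    if stored_count == 0 {
        return false; // nothing stored, nothing to shrink
    }
    (alive_count as f64) / (stored_count as f64) < SHRINK_RATIO
}

fn main() {
    // A mostly-dead slot (10 alive out of 1000 stored) is worth shrinking;
    // a healthy slot (95 of 100 alive) is skipped almost for free.
    println!("{} {}", should_shrink(10, 1000), should_shrink(95, 100));
}
```

For an innocent snapshot, nearly every slot falls into the "skip" branch, which is why the pass degenerates into one cheap scan.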
Anyway, I'm landing this after my last-minute local testing, as this is prioritized to land, as I said and I think you're aware. :)
@sakridge ^^ Enjoy your reading :) (oops, I just forgot to mention you.)
Thanks
This seems good to me, just some minor things and a question of performance that we can address later, if you feel comfortable merging now @ryoqun
I've tested against TdS:
It got even smaller!
if next.is_some() {
    next
} else {
    let mut new_all_slots = self.all_root_slots_in_index();
After local testing, I've found this logic is too naive for a real hostile environment. But first I'll merge this...
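The diff above drains a queue of candidate slots and, when it runs dry, refills it from the full set of root slots in the index. A minimal self-contained sketch of that round-robin selection, with simplified stand-in types and names (`ShrinkCycle`, `next_slot` are hypothetical, not the real accounts_db API):

```rust
use std::collections::VecDeque;

// Pending slots for the current shrink pass.
struct ShrinkCycle {
    candidates: VecDeque<u64>,
}

impl ShrinkCycle {
    // Pop the next candidate; when the queue is exhausted, refill it from
    // the full set of root slots and start the next cycle.
    fn next_slot(&mut self, all_root_slots: &[u64]) -> Option<u64> {
        let next = self.candidates.pop_front();
        if next.is_some() {
            next
        } else {
            self.candidates = all_root_slots.iter().copied().collect();
            self.candidates.pop_front()
        }
    }
}

fn main() {
    let mut cycle = ShrinkCycle { candidates: VecDeque::new() };
    // Cycles 1, 2, then wraps around to 1 again.
    for _ in 0..3 {
        println!("{:?}", cycle.next_slot(&[1, 2]));
    }
}
```

The naivety the comment refers to is visible here: the candidate set is only recomputed once per full cycle, so a hostile workload can change which slots are worth shrinking faster than the cycle keeps up.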
For devnet:
Sad, it seems that I introduced a regression in this PR; for some reason, it stops producing snapshots.... I'm investigating...
Really odd; this bug reproduces without this PR....
Maybe affected by this? #9271 Rebasing and testing again.
I've run over a day without issues. Now, I'm really merging!
* Introduce background AppendVec shrink mechanism
* Support ledger tool
* Clean up
* save
* save
* Fix CI
* More clean up
* Add tests
* Clean up yet more
* Use account.hash...
* Fix typo....
* Add comment
* Rename accounts_cleanup_service

(cherry picked from commit b28ec43)

* Introduce background AppendVec shrink mechanism
* Support ledger tool
* Clean up
* save
* save
* Fix CI
* More clean up
* Add tests
* Clean up yet more
* Use account.hash...
* Fix typo....
* Add comment
* Rename accounts_cleanup_service

(cherry picked from commit b28ec43)

# Conflicts:
#	core/src/accounts_background_service.rs
#	core/src/lib.rs
#	core/src/tvu.rs
#	runtime/src/accounts_db.rs
#	runtime/src/bank/mod.rs
Problem
The cluster is still vulnerable to an account storage exhaustion attack.
This time, the attack scenario is create-and-forget-style accounts.
Even if we keep only a single (stale) account's data (136 bytes) for a slot, we consume 4,194,304 bytes; that's almost entirely waste.
This significantly increases snapshot processing time. In particular, much time is wasted on index reconstruction at load.
That's because, apart from that single stale account, there is usually a bunch of outdated account data in the slot's AppendVec, and we can only conclude it is unused after computing over the whole data set.
As an observation of this, the current state of devnet looks like this: there are 60,000+ AppendVecs, each containing only a few accounts (~10).
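The waste ratio in the numbers above is easy to check with a back-of-envelope calculation (the 4,194,304-byte figure is the per-slot store size from the problem statement; `utilization_pct` is just an illustrative helper):

```rust
// Fraction of a slot's AppendVec actually occupied by alive account data,
// expressed as a percentage.
fn utilization_pct(alive_bytes: u64, store_bytes: u64) -> f64 {
    (alive_bytes as f64) / (store_bytes as f64) * 100.0
}

fn main() {
    // One 136-byte stale account pinning a 4 MiB store: well under 0.01%
    // of the file holds useful data.
    let pct = utilization_pct(136, 4_194_304);
    println!("utilization: {:.4}%", pct);
}
```

Multiplied across 60,000+ such stores, nearly all of the on-disk and snapshot footprint is dead space, which is what the shrinker reclaims.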
Summary of Changes
Based on the index's aliveness information, periodically shrink slots by reading out the alive accounts and flushing them into a new, appropriately sized AppendVec.
So, from AccountsDB's view, there is no difference in access pattern between Banks and this background shrinker.
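The shrink step itself can be sketched as one read pass plus one write pass. This is a hedged simplification: `StoredAccount` and `shrink_slot` are stand-ins for the real accounts_db types, and the aliveness predicate stands in for the index lookup.

```rust
// Simplified stand-in for an account entry stored in an AppendVec.
#[derive(Clone, Debug, PartialEq)]
struct StoredAccount {
    pubkey: u64, // stand-in for a real Pubkey
    data: Vec<u8>,
}

// Read up the accounts the index still considers alive, and "flush" only
// those into a fresh store sized for the survivors; the old, mostly-dead
// store can then be dropped.
fn shrink_slot<F>(old_store: &[StoredAccount], is_alive: F) -> Vec<StoredAccount>
where
    F: Fn(&StoredAccount) -> bool,
{
    old_store.iter().filter(|a| is_alive(a)).cloned().collect()
}

fn main() {
    let old = vec![
        StoredAccount { pubkey: 1, data: vec![0; 136] },
        StoredAccount { pubkey: 2, data: vec![0; 136] },
    ];
    // Pretend the index says only pubkey 2 is still alive.
    let new_store = shrink_slot(&old, |a| a.pubkey == 2);
    println!("kept {} of {} accounts", new_store.len(), old.len());
}
```

Because the rewrite touches only rooted, old slots, it looks to AccountsDB like any other reader plus a store, which is why no new locking was needed.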
Remarks
Surprisingly, the AccountsDB design is so flexible and well-coded that, to implement this feature, we only needed to add new code; all of its foundation already exists! :)
Fixes #9196