Add unit test for ReplayStage::maybe_start_leader() #32679
Hi @brooksprumo - I wanted to get your opinion on a possible solution for a bug this PR demonstrates. Namely, I'm curious about your thoughts on proposed solution 3 above. I had the following concerns that I hope you might be able to speak to:
|
@carllin - If the AccountsDb folks say we shouldn't do 3, any thoughts on 1 vs. 2 in the proposed solutions? 1 is simpler but 2 would likely be more performant; given that this is a critical path, I'm thinking we'd want to go with 2, but I'm curious to hear your thoughts, or any other solutions that I haven't thought of. |
Adding some context for the order of events that causes this in PohRecorder. @steviez might be good to add this to the problem description:
|
I think 1 is the right fix, we can cache the result of the check in |
We can (i.e. should) purge all slot information from AccountsDb for any bank that is purged from BankForks—root or not. So if we've purged a bank from BankForks but there is still information (bank_hash_stats) leftover in AccountsDb I would think that would be a bug. Is that what's happening? |
It's interesting that a shred from our upcoming leader slot would be present in blockstore but missing from bank forks. This would not be possible through the duplicate block pathway because we exit if we ever dump our leader bank from bank forks: solana/core/src/replay_stage.rs Lines 1364 to 1371 in f4816dc
Similarly turbine/repair should run through sigverify before inserting any shreds into blockstore, which drops shreds for our leader: solana/turbine/src/sigverify_shreds.rs Lines 184 to 187 in f4816dc
It's unlikely that it was cleaned up in bank forks due to a new root being set, as we are resetting onto a fork before that slot in consensus after. |
perhaps related, we saw the opposite problem: 1 (non leader) block in blockstore generating 2 banks in bank forks here triggered the same error message in accounts db error!("set_hash: already exists; multiple forks with shared slot {slot} as child (parent: {parent_slot})!?"); |
I think my/this comment is related-ish? I tried to turn this |
This is actually what happened where the new root is NOT an ancestor of the duplicate. |
I thought you were running master in the pop cluster. No one should be submitting duplicate blocks right? Are there logs somewhere I can look into? |
@AshwinSekar it's a bug in PohRecorder. The block is made on a minor fork, gets pruned when a root is made, then gets recreated later b/c of the PohRecorder bug. See post above: #32679 (comment) |
Problem
This PR introduces a unit test that demonstrates a series of events that could lead to the panic described in #32466.
Summary of Changes
Right now, this PR only has a unit test to demonstrate the problem. Currently, there is a check that will bail if we see the slot in BankForks already:
solana/core/src/replay_stage.rs, lines 1873 to 1876 in 69336ab
Proposed Solutions
1. BankForks check above, explicitly check the blockstore to see if there is a shred for the new leader slot.
2. ReplayStage::reset_poh_recorder() resetting to an older slot. We could add logic to detect if we reset to an older slot, i.e. a boolean like reset_to_older. We could then pass reset_to_older into ReplayStage::maybe_start_leader(), and only check blockstore if reset_to_older was true.
3. AccountsDb: solana/runtime/src/accounts_db.rs, lines 4954 to 4964 in 69336ab
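To make options 1 and 2 concrete, here is a minimal sketch using toy stand-ins rather than the real Solana types. ToyBankForks, ToyBlockstore, has_shred_for, should_start_leader and the reset_to_older helper are all hypothetical names invented for illustration; in particular, how "reset to an older slot" would actually be detected is an assumption here.

```rust
use std::collections::{BTreeSet, HashSet};

type Slot = u64;

/// Toy stand-in for BankForks (hypothetical): tracks which slots have a bank.
struct ToyBankForks {
    slots: HashSet<Slot>,
}

/// Toy stand-in for the blockstore (hypothetical): tracks which slots
/// have at least one shred.
struct ToyBlockstore {
    slots_with_shreds: BTreeSet<Slot>,
}

impl ToyBlockstore {
    fn has_shred_for(&self, slot: Slot) -> bool {
        self.slots_with_shreds.contains(&slot)
    }
}

/// Assumed definition: PoH was reset onto a slot older than the slot it
/// was previously reset to.
fn reset_to_older(prev_reset_slot: Slot, new_reset_slot: Slot) -> bool {
    new_reset_slot < prev_reset_slot
}

/// Returns true if it looks safe to start a leader bank for `leader_slot`.
fn should_start_leader(
    bank_forks: &ToyBankForks,
    blockstore: &ToyBlockstore,
    leader_slot: Slot,
    reset_to_older: bool,
) -> bool {
    // Existing style of check: bail if a bank for this slot already exists.
    if bank_forks.slots.contains(&leader_slot) {
        return false;
    }
    // Option 2: only pay for the blockstore lookup after a reset to an
    // older slot. Passing `true` unconditionally gives option 1.
    if reset_to_older && blockstore.has_shred_for(leader_slot) {
        return false;
    }
    true
}
```

With a shred for slot 8 in the toy blockstore but no bank for it, should_start_leader(.., 8, false) returns true (the gap the unit test demonstrates), while passing true for the flag makes it bail.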
In the debug described in the linked issue, the error statement in the if case was observed. This is because the Bank was purged from BankForks, but the slot was still newer than the latest root so it hadn't been fully cleaned yet. Thus, this entry in the data structure somewhat serves as a tombstone that we had a Bank for this slot, and could be used to detect whether we should bail on this slot. That is, call the below function and skip the leader slot if the result is Some(...):
solana/runtime/src/accounts_db.rs, lines 7895 to 7898 in 69336ab
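The tombstone idea in option 3 can be sketched with a toy model. This is not the real AccountsDb API: ToyAccountsDb, ToyBankHashStats, stats_for_slot, purge_slot and should_skip_leader_slot are hypothetical names, and the toy set_hash returns an Err where the real code logs an error.

```rust
use std::collections::HashMap;

type Slot = u64;

/// Toy stand-in for per-slot bank hash stats (hypothetical).
#[derive(Clone, Debug, Default, PartialEq)]
struct ToyBankHashStats {
    num_updated_accounts: u64,
}

/// Toy stand-in for AccountsDb (hypothetical): only the per-slot map.
#[derive(Default)]
struct ToyAccountsDb {
    bank_hash_stats: HashMap<Slot, ToyBankHashStats>,
}

impl ToyAccountsDb {
    /// Registers a new slot. If an entry already exists, a bank for this
    /// slot was created before and its data has not been cleaned up yet.
    fn set_hash(&mut self, slot: Slot, parent_slot: Slot) -> Result<(), String> {
        if self.bank_hash_stats.contains_key(&slot) {
            return Err(format!(
                "set_hash: already exists; multiple forks with shared slot {slot} as child (parent: {parent_slot})!?"
            ));
        }
        self.bank_hash_stats.insert(slot, ToyBankHashStats::default());
        Ok(())
    }

    /// Lookup used as the tombstone check: Some(..) means a bank existed.
    fn stats_for_slot(&self, slot: Slot) -> Option<ToyBankHashStats> {
        self.bank_hash_stats.get(&slot).cloned()
    }

    /// Full cleanup of a slot; after this the tombstone is gone.
    fn purge_slot(&mut self, slot: Slot) {
        self.bank_hash_stats.remove(&slot);
    }
}

/// Option 3: skip the leader slot if a leftover entry says we already
/// had a bank for it.
fn should_skip_leader_slot(db: &ToyAccountsDb, slot: Slot) -> bool {
    db.stats_for_slot(slot).is_some()
}
```

Note that the check only holds until the slot is fully purged, which is the cleanup and race-condition concern raised about this option.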
Some notes about the options:
AccountsDb or if we're abusing something / liable to a race condition / etc. I will follow up with someone more familiar with AccountsDb to figure this out.
Fixes #32466