Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix duplicate confirmed rollup detection for descendants #34014

Merged
merged 2 commits into from
Jan 10, 2024

Conversation

AshwinSekar
Copy link
Contributor

Problem

State machine is only notified of duplicate confirmed rollup when optimistically confirmed

Summary of Changes

Make the check granular, notify when >52%
dedup requests against duplicate confirmed slots already sent to the fsm

Fixes #

bank.hash(),
));
} else if bank.is_frozen()
&& tower.is_slot_duplicate_confirmed(*slot, voted_stakes, total_stake)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously we were only calling mark_slots_confirmed if we got tower.is_slot_confirmed . this change allows us to notify the fsm as soon as we have tower.is_slot_duplicate_confirmed

Copy link

codecov bot commented Nov 10, 2023

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (1db76cf) 81.8% compared to head (be77751) 81.8%.
Report is 66 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #34014   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         822      822           
  Lines      221573   221623   +50     
=======================================
+ Hits       181341   181416   +75     
+ Misses      40232    40207   -25     

@AshwinSekar AshwinSekar added v1.17 PRs that should be backported to v1.17 v1.16 PRs that should be backported to v1.16 labels Nov 10, 2023
@@ -3877,7 +3925,25 @@ impl ReplayStage {
if bank.is_frozen() && tower.is_slot_confirmed(*slot, voted_stakes, total_stake) {
info!("validator fork confirmed {} {}ms", *slot, duration);
datapoint_info!("validator-confirmation", ("duration_ms", duration, i64));
confirmed_forks.push((*slot, bank.hash()));
confirmed_forks.push(ConfirmedSlot::new_optimistic_confirmed_slot(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't know if we should call this optimistically confirmed because optimistic confirmation doesn't roll up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open to ideas about naming, but also I think we should just fix the plumbing and make it roll up while we're looking at this.

The plumbing rn seems to be a mess:

  • cluster_info_vote_listener tracks gossip votes but also receives replay votes through crossbeam from replay_stage
  • cluster_info_vote_listener tracks gossip votes for propagated check in vote_tracker
  • replay_stage also tracks replay votes in vote_tracker and rolls up in fork_stats in compute_bank_stats
  • cluster_info_vote_listener also separately sends verified gossip votes to replay_stage through crossbeam
  • replay_stage tracks DuplicateConfirmed through fork_stats but also gets notified through crossbeam from cluster_info_vote_listener

There's a lot of back and forth propagation of votes and stake info in multiple places which seems inefficient. My vote would be to just share 1 vote tracker/fork stats between cluster_info_vote_listener and replay_stage to avoid the extra work. This would also let us rollup replay votes and include gossip votes that would get us over the key threshold.

If we don't want to do a big refactor I think there's a simpler solution for rollup:

bank_utils::find_and_send_votes(
batch.sanitized_transactions(),
&tx_results,
replay_vote_sender,
);

When sending replayed votes from replay_stage to cluster_info_vote_listener, send the slot hashes as well.

Then

// We cannot count any other slots in this vote toward optimistic confirmation because:
// 1) There may have been a switch between the earlier vote and the last vote
// 2) We do not know the hash of the earlier slot
if slot == last_vote_slot {

If we have !is_gossip we don't have to filter only on the last slot here, we can track the whole tower since we have the accompanying hash in slot hashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to clarify this^ is out of scope for this pr and should not be a blocker for merging. we can discuss in #34279.

I agree that optimistically confirmed is an overloaded name and am happy to rename if you have a suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optimistically confirmed actually can't be safely rolled up, it has to be all votes on the same slot

Copy link
Contributor

@carllin carllin Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just called it supermajority_voted?

Comment on lines 3874 to 3876
if *slot <= root_slot {
continue;
} else if let Some(prev_hash) = duplicate_confirmed_slots.insert(*slot, *frozen_hash) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good here to add a check for the ConfirmationType::DuplicateConfirmed in case we add any other cases that accidentally fall into this branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want ConfirmationType::OptimisticallyConfirmed to fall through here, but if you prefer I can make it explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think explicit check would be good

Comment on lines +3874 to +3902
if *slot <= root_slot {
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a deviation from current behavior i.e. some supermajority confirmed slots less than the root may no longer be added as duplicate confirmed and passed to the state machine. Seems ok since whatever the latest root slot was should have notified the state machine that all the ancestors are confirmed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah exactly, since we have the threshold check we must have already processed this signal. But also probably not a huge deal if we remove this check.

Copy link
Contributor

mergify bot commented Nov 17, 2023

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

Copy link
Contributor

mergify bot commented Nov 17, 2023

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

carllin
carllin previously approved these changes Nov 30, 2023
Copy link
Contributor

@carllin carllin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some nits

@AshwinSekar AshwinSekar removed v1.16 PRs that should be backported to v1.16 v1.17 PRs that should be backported to v1.17 labels Nov 30, 2023
@github-actions github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Dec 14, 2023
@AshwinSekar AshwinSekar requested a review from carllin December 21, 2023 01:13
@github-actions github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Jan 4, 2024
@AshwinSekar AshwinSekar merged commit fb97e93 into solana-labs:master Jan 10, 2024
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants