[consensus] fallback heuristics for optimistic quorum store #14346

Merged
merged 6 commits into from
Oct 11, 2024


@ibalajiarun ibalajiarun commented Aug 20, 2024

Description

This PR implements the fallback heuristics for Optimistic Quorum Store along with some refactoring.

  • Introduce additional RoundTimeoutReason enum variants to classify proposal failures, payload unavailability, and QC failures.
  • Implement heuristics based on the aggregated RoundTimeoutReason. The aggregated timeout reason is the reason backed by the most voting power, provided it is reported by peers holding at least $f+1$ voting power. If no reason meets that threshold, the aggregate is RoundTimeoutReason::Unknown.
  • The heuristic operates on a window of RoundTimeoutReason values from previous rounds. Since each node aggregates timeouts independently, their aggregates are highly likely to differ, but only the next proposer's local aggregate is used for proposal generation.
  • The heuristic is managed within the ProposalStatusTracker, which outputs OptQSPayloadPullParams { opt_batch_txns_pct, exclude_author, minimum_batch_age_usecs } based on the proposal outcome window.
    • opt_batch_txns_pct is either 0 or 50: 0 means no optimistic batches, and 50 means optimistic batches may account for up to 50% of the transactions pulled.
    • minimum_batch_age_usecs sets the minimum age of a batch to be pulled into the OptQS payload.
    • exclude_author lists the authors whose batches are excluded from the OptQS payload.
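
The aggregation rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: `TimeoutReason` is a simplified stand-in for `RoundTimeoutReason` (variant names other than `Unknown` and `PayloadUnavailable` are assumptions), `aggregate_reason` and `min_power` are hypothetical names, and breaking ties by enum order is an arbitrary choice made only to keep the sketch deterministic.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the PR's `RoundTimeoutReason` (assumed variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum TimeoutReason {
    Unknown,
    ProposalNotReceived,
    PayloadUnavailable,
    NoQC,
}

/// Returns the reason backed by the most voting power, provided that power is
/// at least `min_power` (the f+1 threshold by voting power). Falls back to
/// `Unknown` when no reason reaches the threshold.
pub fn aggregate_reason(votes: &[(TimeoutReason, u64)], min_power: u128) -> TimeoutReason {
    // Sum the voting power behind each reported reason.
    let mut power_by_reason: BTreeMap<TimeoutReason, u128> = BTreeMap::new();
    for (reason, power) in votes {
        *power_by_reason.entry(*reason).or_insert(0) += u128::from(*power);
    }
    // Keep only reasons that meet the threshold, then pick the heaviest one.
    // Ties go to the later enum variant (an arbitrary choice in this sketch).
    power_by_reason
        .into_iter()
        .filter(|(_, power)| *power >= min_power)
        .max_by_key(|(_, power)| *power)
        .map(|(reason, _)| reason)
        .unwrap_or(TimeoutReason::Unknown)
}
```

With votes of power 3 and 1 for `PayloadUnavailable` and power 1 for `NoQC`, a threshold of 2 yields `PayloadUnavailable`, while a threshold of 5 yields `Unknown`.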

Proposal Status Tracker

ProposalStatusTracker maintains a list of the most recent proposal statuses (the same as NewRoundReason). An exponential-window algorithm decides whether to go optimistic or not.

Initialize the window at 2.

  • For each proposal failure, double the window up to a MAX size and set opt_batch_txns_pct to 0.
  • If there are no failures within the window, propose optimistic batches by setting opt_batch_txns_pct to 50.
    • exclude_author is computed as the union of all the RoundTimeoutReason::PayloadUnavailable { missing_authors } values within the window.
  • If there are no failures across the last MAX proposals, reset the window to 2.
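
The window rules above can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: `WindowTracker` and `RoundStatus` are hypothetical names, statuses are collapsed to success or failure, the tracker starts non-optimistic until it has seen a full window of successes, and the `exclude_author` computation is left out.

```rust
use std::collections::VecDeque;

/// Hypothetical simplified outcome; the PR tracks `NewRoundReason` values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RoundStatus {
    Success,
    Failure,
}

pub struct WindowTracker {
    window: usize,
    max_window: usize,
    /// Most recent status at the back; holds at most `max_window` entries.
    history: VecDeque<RoundStatus>,
}

impl WindowTracker {
    pub fn new(max_window: usize) -> Self {
        Self { window: 2, max_window: max_window.max(2), history: VecDeque::new() }
    }

    /// Number of consecutive successes at the end of the history.
    fn trailing_successes(&self) -> usize {
        self.history.iter().rev().take_while(|s| **s == RoundStatus::Success).count()
    }

    pub fn push(&mut self, status: RoundStatus) {
        if self.history.len() == self.max_window {
            self.history.pop_front();
        }
        self.history.push_back(status);
        if status == RoundStatus::Failure {
            // Each failure doubles the window, capped at MAX.
            self.window = (self.window * 2).min(self.max_window);
        } else if self.trailing_successes() == self.max_window {
            // No failures across MAX proposals: reset the window.
            self.window = 2;
        }
    }

    /// 50 when the last `window` proposals all succeeded, otherwise 0.
    pub fn opt_batch_txns_pct(&self) -> u8 {
        if self.trailing_successes() >= self.window { 50 } else { 0 }
    }

    pub fn window(&self) -> usize {
        self.window
    }
}
```

In this sketch a single failure after two successes moves the window from 2 to 4, so four consecutive successes are needed before optimistic batches resume.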


trunk-io bot commented Aug 20, 2024

⏱️ 1h 15m total CI duration on this PR
Job Cumulative Duration Recent Runs
general-lints 9m 🟩🟩🟩🟩🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩
rust-cargo-deny 9m 🟩🟩🟩🟩🟩
rust-move-tests 8m 🟩
rust-move-tests 8m 🟩
rust-move-tests 8m 🟩
check-dynamic-deps 8m 🟩🟩🟩🟩🟩
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩
file_change_determinator 1m 🟩🟩🟩🟩🟩
permission-check 19s 🟩🟩🟩🟩🟩
permission-check 16s 🟩🟩🟩🟩🟩
permission-check 15s 🟩🟩🟩🟩🟩
permission-check 14s 🟩🟩🟩🟩🟩


@ibalajiarun ibalajiarun changed the base branch from main to balaji/optqs-payload August 20, 2024 15:12
@ibalajiarun ibalajiarun changed the title Balaji/optqs heuristics [optqs] basic fallback heuristic for optimistic quorum store Aug 21, 2024
Base automatically changed from balaji/optqs-payload to main August 22, 2024 03:37
@ibalajiarun ibalajiarun force-pushed the balaji/optqs-heuristics branch 2 times, most recently from d5211e9 to 6112540 Compare September 6, 2024 15:38
@ibalajiarun ibalajiarun changed the title [optqs] basic fallback heuristic for optimistic quorum store [consensus] fallback heuristics for optimistic quorum store Sep 6, 2024
@ibalajiarun ibalajiarun changed the base branch from main to balaji/vote-v2 October 1, 2024 20:50
Base automatically changed from balaji/vote-v2 to main October 2, 2024 14:16
Contributor Author

This file is mostly extracted from payload_client/mod.rs with the addition of OptQSPayloadPullParams

@ibalajiarun ibalajiarun force-pushed the balaji/optqs-heuristics branch 4 times, most recently from 40e5032 to 4a2f31a Compare October 7, 2024 22:11
@ibalajiarun ibalajiarun marked this pull request as ready for review October 7, 2024 22:13

)
.await??;
pub async fn wait_for_payload(&self, block: &Block, deadline: Duration) -> anyhow::Result<()> {
let deadline = deadline.saturating_sub(self.time_service.get_current_timestamp());
Contributor

this probably should be called duration instead of deadline
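
The naming point can be illustrated with a small sketch. It assumes, as the quoted snippet suggests, that both the deadline and the current timestamp are expressed as a `Duration` since a common epoch; `remaining` is a hypothetical helper, not a function from the PR.

```rust
use std::time::Duration;

/// Converts an absolute deadline into the time left to wait. Both arguments
/// are assumed to be durations since the same epoch. The result is a span of
/// time rather than a point in time, which is why "duration" describes it
/// better than "deadline".
pub fn remaining(deadline: Duration, now: Duration) -> Duration {
    // saturating_sub clamps to zero when the deadline has already passed.
    deadline.saturating_sub(now)
}
```

A deadline that has already passed yields `Duration::ZERO` instead of panicking on underflow.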

verifier.get_voting_power(&author).unwrap_or_default() as u128;
}
// The aggregated timeout reason is the reason with the most voting power received from
// at least f+1 peers by voting power. If such voting power does not exist, then the
Contributor

I am just concerned that slow nodes are not marked malicious under the f+1 requirement. But we can leave it for now and see how it performs.

Contributor

✅ Forge suite realistic_env_max_load success on 04a1dc8b07488163611f3e91016a37a3f4b7a2b4

two traffics test: inner traffic : committed: 13727.09 txn/s, latency: 2896.89 ms, (p50: 2700 ms, p70: 3000, p90: 3100 ms, p99: 3600 ms), latency samples: 5219380
two traffics test : committed: 100.05 txn/s, latency: 2798.13 ms, (p50: 2500 ms, p70: 2600, p90: 2900 ms, p99: 11200 ms), latency samples: 1740
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.242, avg: 0.219", "QsPosToProposal: max: 0.344, avg: 0.304", "ConsensusProposalToOrdered: max: 0.322, avg: 0.300", "ConsensusOrderedToCommit: max: 0.469, avg: 0.450", "ConsensusProposalToCommit: max: 0.770, avg: 0.750"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.07s no progress at version 2804188 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.25s no progress at version 2804186 (avg 8.25s) [limit 15].
Test Ok

Contributor

✅ Forge suite compat success on beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4

Compatibility test results for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 (PR)
1. Check liveness of validators at old version: beff51858b445401e49d5be352feadcf05652cc0
compatibility::simple-validator-upgrade::liveness-check : committed: 12285.81 txn/s, latency: 2615.76 ms, (p50: 1800 ms, p70: 1900, p90: 2500 ms, p99: 27800 ms), latency samples: 497100
2. Upgrading first Validator to new version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7151.57 txn/s, latency: 3970.61 ms, (p50: 4500 ms, p70: 4800, p90: 4900 ms, p99: 5100 ms), latency samples: 132160
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6215.31 txn/s, latency: 5198.46 ms, (p50: 5400 ms, p70: 5600, p90: 7100 ms, p99: 7600 ms), latency samples: 222380
3. Upgrading rest of first batch to new version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7291.47 txn/s, latency: 3897.17 ms, (p50: 4400 ms, p70: 4600, p90: 4800 ms, p99: 4900 ms), latency samples: 134780
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6885.83 txn/s, latency: 4631.64 ms, (p50: 4600 ms, p70: 4700, p90: 7000 ms, p99: 7200 ms), latency samples: 232260
4. upgrading second batch to new version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 9100.62 txn/s, latency: 2906.73 ms, (p50: 2700 ms, p70: 2900, p90: 6100 ms, p99: 8100 ms), latency samples: 158680
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10409.27 txn/s, latency: 2933.01 ms, (p50: 2500 ms, p70: 3400, p90: 4300 ms, p99: 6400 ms), latency samples: 347120
5. check swarm health
Compatibility test for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 passed
Test Ok

Contributor

✅ Forge suite framework_upgrade success on beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4

Compatibility test results for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 (PR)
Upgrade the nodes to version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1182.28 txn/s, submitted: 1184.51 txn/s, failed submission: 2.23 txn/s, expired: 2.23 txn/s, latency: 2684.58 ms, (p50: 2400 ms, p70: 2700, p90: 4200 ms, p99: 6700 ms), latency samples: 105920
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1438.25 txn/s, submitted: 1439.59 txn/s, failed submission: 1.33 txn/s, expired: 1.33 txn/s, latency: 2539.61 ms, (p50: 2400 ms, p70: 2700, p90: 4000 ms, p99: 5100 ms), latency samples: 107760
5. check swarm health
Compatibility test for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 passed
Upgrade the remaining nodes to version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1170.22 txn/s, submitted: 1173.29 txn/s, failed submission: 3.07 txn/s, expired: 3.07 txn/s, latency: 2492.79 ms, (p50: 2400 ms, p70: 2700, p90: 3900 ms, p99: 5400 ms), latency samples: 106620
Test Ok

@ibalajiarun ibalajiarun merged commit b2781bf into main Oct 11, 2024
49 checks passed
@ibalajiarun ibalajiarun deleted the balaji/optqs-heuristics branch October 11, 2024 04:36
Labels
CICD:run-forge-e2e-perf Run the e2e perf forge only
3 participants