[consensus] fallback heuristics for optimistic quorum store #14346

ibalajiarun · 2024-08-20T15:12:47Z

Description

This PR implements the fallback heuristics for Optimistic Quorum Store along with some refactoring.

Introduce additional RoundTimeoutReason enum variants to classify proposal failures, payload unavailability, and QC failures.
Implement heuristics based on the aggregated RoundTimeoutReason. The aggregated timeout reason is the reason backed by the most voting power, provided that it is reported by at least $f+1$ peers by voting power. If no reason reaches that threshold, the aggregated reason is RoundTimeoutReason::Unknown.
The heuristic is based on a window of RoundTimeoutReason values for the previous rounds. Since each node aggregates timeouts independently, it is highly likely that their aggregates will differ, but only the next proposer's locally aggregated reason is used for proposal generation.
The heuristic is managed within the ProposalStatusTracker, which outputs OptQSPayloadPullParams { opt_batch_txns_pct, exclude_author, minimum_batch_age_usecs } based on the proposal outcome window.
- opt_batch_txns_pct is either 0 or 50. 0 means no optimistic batches, and 50 means up to 50% optimistic batches.
- minimum_batch_age_usecs sets the minimum age a batch must have to be pulled into the OptQS payload.
- exclude_author sets the authors whose batches will be excluded from the OptQS payload.
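
The aggregation rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TimeoutReason` is a simplified stand-in for RoundTimeoutReason (only `Unknown` and `PayloadUnavailable` are named in the description; the other variant names are assumptions), and `aggregate_reason` and its `f_plus_one_power` parameter are hypothetical names.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `RoundTimeoutReason`. The variant set is an
/// assumption, and the real `PayloadUnavailable` also carries `missing_authors`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum TimeoutReason {
    Unknown,
    ProposalNotReceived,
    PayloadUnavailable,
    NoQC,
}

/// Returns the reason backed by the most voting power, provided that power
/// reaches `f_plus_one_power`; otherwise returns `Unknown`.
/// `votes` holds one (reason, voting power) pair per timeout vote received.
pub fn aggregate_reason(votes: &[(TimeoutReason, u64)], f_plus_one_power: u128) -> TimeoutReason {
    // Sum the voting power behind each distinct reason.
    let mut power_by_reason: HashMap<TimeoutReason, u128> = HashMap::new();
    for (reason, power) in votes {
        *power_by_reason.entry(*reason).or_insert(0) += u128::from(*power);
    }
    // Keep only reasons that reach the threshold, then take the strongest.
    // Ties are broken arbitrarily in this sketch.
    power_by_reason
        .into_iter()
        .filter(|(_, power)| *power >= f_plus_one_power)
        .max_by_key(|(_, power)| *power)
        .map(|(reason, _)| reason)
        .unwrap_or(TimeoutReason::Unknown)
}
```

Passing the threshold in as a parameter keeps the sketch independent of how $f$ is derived from the validator set.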

Proposal Status Tracker

ProposalStatusTracker maintains a list of the most recent proposal statuses (same as NewRoundReason). An exponential-window-based algorithm decides whether to go optimistic or not.

Initialize the window at 2.

For each proposal failure, double the window up to a MAX size and set opt_batch_txns_pct to 0.
If there are no failures within the window, propose optimistic batches by setting opt_batch_txns_pct to 50.
- The exclude_author set is computed as the union of all the RoundTimeoutReason::PayloadUnavailable { missing_authors } within the window.
If there are no failures for MAX consecutive proposals, reset the window to 2.
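
The window rules above might be sketched as follows. This is an illustrative approximation under stated assumptions, not the PR's ProposalStatusTracker: `WindowTracker`, `RoundOutcome`, and the `u16` author index are hypothetical names and types. The sketch also assumes that every timeout counts as a failure, and that the missing-author union is taken over the retained history of up to MAX rounds, since a failure-free window would itself contain no PayloadUnavailable entries.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Hypothetical, simplified outcome of one round.
#[derive(Clone, Debug)]
pub enum RoundOutcome {
    Success,
    /// A timeout; `missing_authors` is non-empty only for payload unavailability.
    Failure { missing_authors: Vec<u16> },
}

pub struct WindowTracker {
    window: usize,
    max_window: usize,
    /// Newest outcome at the back; holds at most `max_window` entries.
    history: VecDeque<RoundOutcome>,
}

impl WindowTracker {
    pub fn new(max_window: usize) -> Self {
        Self { window: 2, max_window, history: VecDeque::new() }
    }

    pub fn window(&self) -> usize {
        self.window
    }

    /// Number of consecutive successes at the end of the history.
    fn trailing_successes(&self) -> usize {
        self.history.iter().rev().take_while(|o| matches!(o, RoundOutcome::Success)).count()
    }

    pub fn push(&mut self, outcome: RoundOutcome) {
        let failed = matches!(outcome, RoundOutcome::Failure { .. });
        self.history.push_back(outcome);
        if self.history.len() > self.max_window {
            self.history.pop_front();
        }
        if failed {
            // Each failure doubles the window, capped at MAX.
            self.window = (self.window * 2).min(self.max_window);
        } else if self.trailing_successes() >= self.max_window {
            // MAX consecutive successes reset the window.
            self.window = 2;
        }
    }

    /// Returns (opt_batch_txns_pct, exclude_author set).
    pub fn params(&self) -> (u8, BTreeSet<u16>) {
        if self.trailing_successes() < self.window {
            return (0, BTreeSet::new()); // a failure is still inside the window
        }
        let mut exclude = BTreeSet::new();
        for outcome in &self.history {
            if let RoundOutcome::Failure { missing_authors } = outcome {
                exclude.extend(missing_authors.iter().copied());
            }
        }
        (50, exclude)
    }
}
```

With a MAX of 8, one failure raises the window to 4, so four consecutive successes are needed before `params` returns 50 again, and authors reported missing in that failure stay excluded until it ages out of the history.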

trunk-io · 2024-08-20T15:12:50Z

⏱️ 1h 15m total CI duration on this PR

Job	Cumulative Duration	Recent Runs
general-lints	9m	🟩 🟩 🟩 🟩 🟩
rust-move-tests	9m	🟩
rust-move-tests	9m	🟩
rust-cargo-deny	9m	🟩 🟩 🟩 🟩 🟩
rust-move-tests	8m	🟩
rust-move-tests	8m	🟩
rust-move-tests	8m	🟩
check-dynamic-deps	8m	🟩 🟩 🟩 🟩 🟩
semgrep/ci	3m	🟩 🟩 🟩 🟩 🟩 (+1 more)
file_change_determinator	1m	🟩 🟩 🟩 🟩 🟩
file_change_determinator	1m	🟩 🟩 🟩 🟩 🟩
permission-check	19s	🟩 🟩 🟩 🟩 🟩
permission-check	16s	🟩 🟩 🟩 🟩 🟩
permission-check	15s	🟩 🟩 🟩 🟩 🟩
permission-check	14s	🟩 🟩 🟩 🟩 🟩

ibalajiarun · 2024-10-07T15:34:36Z

consensus/consensus-types/src/payload_pull_params.rs

This file is mostly extracted from payload_client/mod.rs with the addition of OptQSPayloadPullParams

zekun000 · 2024-10-09T21:25:35Z

consensus/src/block_storage/block_store.rs

-        )
-        .await??;
+    pub async fn wait_for_payload(&self, block: &Block, deadline: Duration) -> anyhow::Result<()> {
+        let deadline = deadline.saturating_sub(self.time_service.get_current_timestamp());


this probably should be called duration instead of deadline
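
To illustrate the naming point: the value computed in the diff is the time remaining until the deadline, not the deadline itself. A minimal sketch, assuming both the deadline and the current time are expressed as a `Duration` since the same epoch (the helper name `remaining_until` is hypothetical):

```rust
use std::time::Duration;

/// Time left until `deadline`, given the current time `now`.
/// Both arguments are durations since the same epoch.
/// Saturates to zero when the deadline has already passed.
pub fn remaining_until(deadline: Duration, now: Duration) -> Duration {
    deadline.saturating_sub(now)
}
```

Binding the result to a name such as `duration` or `remaining` keeps the absolute deadline and the relative wait time distinct.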

danielxiangzl · 2024-10-10T18:06:55Z

consensus/src/pending_votes.rs

+                verifier.get_voting_power(&author).unwrap_or_default() as u128;
+        }
+        // The aggregated timeout reason is the reason with the most voting power received from
+        // at least f+1 peers by voting power. If such voting power does not exist, then the


I am just concerned that slow nodes are not marked malicious under the f+1 requirement. But we can leave it for now and see how it performs.

github-actions · 2024-10-11T04:32:45Z

✅ Forge suite `realistic_env_max_load` success on `04a1dc8b07488163611f3e91016a37a3f4b7a2b4`

two traffics test: inner traffic : committed: 13727.09 txn/s, latency: 2896.89 ms, (p50: 2700 ms, p70: 3000, p90: 3100 ms, p99: 3600 ms), latency samples: 5219380
two traffics test : committed: 100.05 txn/s, latency: 2798.13 ms, (p50: 2500 ms, p70: 2600, p90: 2900 ms, p99: 11200 ms), latency samples: 1740
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.242, avg: 0.219", "QsPosToProposal: max: 0.344, avg: 0.304", "ConsensusProposalToOrdered: max: 0.322, avg: 0.300", "ConsensusOrderedToCommit: max: 0.469, avg: 0.450", "ConsensusProposalToCommit: max: 0.770, avg: 0.750"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.07s no progress at version 2804188 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.25s no progress at version 2804186 (avg 8.25s) [limit 15].
Test Ok

github-actions · 2024-10-11T04:33:47Z

✅ Forge suite `compat` success on `beff51858b445401e49d5be352feadcf05652cc0` ==> `04a1dc8b07488163611f3e91016a37a3f4b7a2b4`

Compatibility test results for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 (PR)
1. Check liveness of validators at old version: beff51858b445401e49d5be352feadcf05652cc0
compatibility::simple-validator-upgrade::liveness-check : committed: 12285.81 txn/s, latency: 2615.76 ms, (p50: 1800 ms, p70: 1900, p90: 2500 ms, p99: 27800 ms), latency samples: 497100
2. Upgrading first Validator to new version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7151.57 txn/s, latency: 3970.61 ms, (p50: 4500 ms, p70: 4800, p90: 4900 ms, p99: 5100 ms), latency samples: 132160
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6215.31 txn/s, latency: 5198.46 ms, (p50: 5400 ms, p70: 5600, p90: 7100 ms, p99: 7600 ms), latency samples: 222380
3. Upgrading rest of first batch to new version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7291.47 txn/s, latency: 3897.17 ms, (p50: 4400 ms, p70: 4600, p90: 4800 ms, p99: 4900 ms), latency samples: 134780
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6885.83 txn/s, latency: 4631.64 ms, (p50: 4600 ms, p70: 4700, p90: 7000 ms, p99: 7200 ms), latency samples: 232260
4. upgrading second batch to new version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 9100.62 txn/s, latency: 2906.73 ms, (p50: 2700 ms, p70: 2900, p90: 6100 ms, p99: 8100 ms), latency samples: 158680
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10409.27 txn/s, latency: 2933.01 ms, (p50: 2500 ms, p70: 3400, p90: 4300 ms, p99: 6400 ms), latency samples: 347120
5. check swarm health
Compatibility test for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 passed
Test Ok

github-actions · 2024-10-11T04:34:32Z

✅ Forge suite `framework_upgrade` success on `beff51858b445401e49d5be352feadcf05652cc0` ==> `04a1dc8b07488163611f3e91016a37a3f4b7a2b4`

Compatibility test results for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 (PR)
Upgrade the nodes to version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1182.28 txn/s, submitted: 1184.51 txn/s, failed submission: 2.23 txn/s, expired: 2.23 txn/s, latency: 2684.58 ms, (p50: 2400 ms, p70: 2700, p90: 4200 ms, p99: 6700 ms), latency samples: 105920
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1438.25 txn/s, submitted: 1439.59 txn/s, failed submission: 1.33 txn/s, expired: 1.33 txn/s, latency: 2539.61 ms, (p50: 2400 ms, p70: 2700, p90: 4000 ms, p99: 5100 ms), latency samples: 107760
5. check swarm health
Compatibility test for beff51858b445401e49d5be352feadcf05652cc0 ==> 04a1dc8b07488163611f3e91016a37a3f4b7a2b4 passed
Upgrade the remaining nodes to version: 04a1dc8b07488163611f3e91016a37a3f4b7a2b4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1170.22 txn/s, submitted: 1173.29 txn/s, failed submission: 3.07 txn/s, expired: 3.07 txn/s, latency: 2492.79 ms, (p50: 2400 ms, p70: 2700, p90: 3900 ms, p99: 5400 ms), latency samples: 106620
Test Ok

ibalajiarun changed the base branch from main to balaji/optqs-payload August 20, 2024 15:12

ibalajiarun force-pushed the balaji/optqs-payload branch from 6d3d7b5 to fbdc550 Compare August 20, 2024 15:13

ibalajiarun force-pushed the balaji/optqs-heuristics branch from b61ab8d to 4a68f84 Compare August 21, 2024 00:12

ibalajiarun force-pushed the balaji/optqs-payload branch from 92103ca to 0c4aa6b Compare August 21, 2024 00:16

ibalajiarun force-pushed the balaji/optqs-heuristics branch from 4a68f84 to 2a9b7de Compare August 21, 2024 00:36

ibalajiarun changed the title ~~Balaji/optqs heuristics~~ [optqs] basic fallback heuristic for optimistic quorum store Aug 21, 2024

ibalajiarun force-pushed the balaji/optqs-heuristics branch from 2a9b7de to a40e083 Compare August 21, 2024 00:48

Base automatically changed from balaji/optqs-payload to main August 22, 2024 03:37

ibalajiarun force-pushed the balaji/optqs-heuristics branch from a40e083 to 8f8a0ee Compare August 22, 2024 03:55

ibalajiarun force-pushed the balaji/optqs-heuristics branch 2 times, most recently from d5211e9 to 6112540 Compare September 6, 2024 15:38

ibalajiarun changed the title ~~[optqs] basic fallback heuristic for optimistic quorum store~~ [consensus] fallback heuristics for optimistic quorum store Sep 6, 2024

ibalajiarun force-pushed the balaji/optqs-heuristics branch from a0c6e4c to cb52952 Compare September 8, 2024 17:18

ibalajiarun force-pushed the balaji/optqs-heuristics branch from cb52952 to 87c18e4 Compare October 1, 2024 20:49

ibalajiarun changed the base branch from main to balaji/vote-v2 October 1, 2024 20:50

Base automatically changed from balaji/vote-v2 to main October 2, 2024 14:16

ibalajiarun force-pushed the balaji/optqs-heuristics branch from 87c18e4 to 2b6294b Compare October 2, 2024 20:17

ibalajiarun commented Oct 7, 2024

ibalajiarun force-pushed the balaji/optqs-heuristics branch 4 times, most recently from 40e5032 to 4a2f31a Compare October 7, 2024 22:11

ibalajiarun marked this pull request as ready for review October 7, 2024 22:13

ibalajiarun requested review from bchocho, sasha8, gelash, zekun000, JoshLind and gregnazario as code owners October 7, 2024 22:13

ibalajiarun requested a review from zekun000 October 9, 2024 20:37

zekun000 approved these changes Oct 9, 2024

danielxiangzl approved these changes Oct 10, 2024

ibalajiarun force-pushed the balaji/optqs-heuristics branch from ff18407 to cbe05b0 Compare October 11, 2024 03:08

ibalajiarun added 4 commits October 10, 2024 20:11

[consensus] Fallback heuristics for optimistic quorum store

cb1fc8b

[optqs] set minimum batch age for optimistic batch proposals

e66f53c

new unit tests and existing test fixes

0b024cf

allow handling OptQS payload by default

e0279db

ibalajiarun force-pushed the balaji/optqs-heuristics branch from cbe05b0 to 9835591 Compare October 11, 2024 03:12

ibalajiarun enabled auto-merge (squash) October 11, 2024 03:12

lint

f460d05

ibalajiarun force-pushed the balaji/optqs-heuristics branch from 9835591 to ba1dc89 Compare October 11, 2024 03:14

address feedback

04a1dc8

ibalajiarun force-pushed the balaji/optqs-heuristics branch from ba1dc89 to 04a1dc8 Compare October 11, 2024 04:06

ibalajiarun merged commit b2781bf into main Oct 11, 2024
49 checks passed

ibalajiarun deleted the balaji/optqs-heuristics branch October 11, 2024 04:36
