[quorum store] new onchain config for turning on quorum store and turning off mempool broadcast #5400

bchocho · 2022-11-01T20:49:33Z

Description

Adds quorum_store_enabled to the consensus onchain config. This is read by:

Mempool, to turn off mempool broadcast.
Epoch manager, to determine whether to use quorum store in the epoch.

Note, there is a race between mempool (long-lived) and consensus + quorum store components (recreated by epoch manager) during the config transition. We believe txns can be duplicated but do not expect txns to be “lost” or “stuck”. We need to further test this when the quorum store implementation is merged into main.

Test Plan

Existing tests. In particular, test_txn_broadcast.

bchocho

The PR is larger than the actual logic change. Added some pointers in the comments above.

bchocho · 2022-11-10T00:12:04Z

mempool/src/shared_mempool/coordinator.rs

@@ -67,6 +68,17 @@ pub(crate) async fn coordinator<V>(
    let workers_available = smp.config.shared_mempool_max_concurrent_inbound_syncs;
    let bounded_executor = BoundedExecutor::new(workers_available, executor.clone());

+    let initial_reconfig = mempool_reconfig_events


Previously mempool was not reading the initial config before starting, which seems like an oversight.

This change required a lot of refactoring of the tests. MockDbReaderWriter could not handle the on chain configs.

bchocho · 2022-11-10T00:13:50Z

mempool/src/shared_mempool/types.rs

        }
    }

    pub fn broadcast_within_validator_network(&self) -> bool {
-        self.config.shared_mempool_validator_broadcast
+        *self.broadcast_within_validator_network.read()


This is essentially the logic change. Instead of taking a config value, take the onchain config value.

This should not be too expensive, because it's only read on the validators which receive broadcasted transactions in batches.

is changing broadcast_within_validator_network underneath safe from mempool's perspective?
i.e. if it was false, and we change it to true, do we need to re-broadcast some transactions in mempool that we have skipped before?

it seems odd to have onchain config modify things underneath the mempool without any coordination. it might be safe - but if it is - it requires an inline comment explaining why that is the case.

It's well-behaved going from true to false, but the other way around we can have transactions that are in mempool and are never broadcast or included in a batch.

We really don't want to go back from quorum store, so I think it's ok to have this disruption. WDYT?

I can add this as a comment if it makes sense.
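A minimal sketch of the shared-flag pattern under discussion, to make the transition behavior concrete. BroadcastFlag and on_reconfig are hypothetical names, and the sketch uses std::sync::RwLock (hence the unwrap calls), whereas the diff's `.read()` without unwrap suggests a different lock type. Only the relationship "broadcast = !quorum_store_enabled" is taken from the PR.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the mempool's shared state; only the flag is modeled.
#[derive(Clone)]
pub struct BroadcastFlag {
    // True while mempool should broadcast within the validator network.
    inner: Arc<RwLock<bool>>,
}

impl BroadcastFlag {
    pub fn new(initial: bool) -> Self {
        Self {
            inner: Arc::new(RwLock::new(initial)),
        }
    }

    /// Read path: consulted each time mempool decides whether to broadcast.
    pub fn broadcast_within_validator_network(&self) -> bool {
        *self.inner.read().unwrap()
    }

    /// Write path: called from the reconfiguration handler.
    /// Broadcast is on exactly when quorum store is off.
    pub fn on_reconfig(&self, quorum_store_enabled: bool) {
        *self.inner.write().unwrap() = !quorum_store_enabled;
    }
}
```

Nothing in this sketch revisits transactions that were skipped while the flag was false, which is the gap described above for the false-to-true direction.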

ibalajiarun · 2022-11-10T15:23:10Z

types/src/on_chain_config/consensus_config.rs

@@ -12,6 +12,7 @@ use serde::{Deserialize, Serialize};
 #[derive(Clone, Debug, Deserialize, PartialEq, Eq, Serialize)]
 pub enum OnChainConsensusConfig {
    V1(ConsensusConfigV1),
+    V2(ConsensusConfigV2),


noob question: why do we have to create a new config version in this case? Can't we add the new flag to V1 and annotate it with #[serde(default = ...)] to maintain backwards compatibility? in other words, is there a need for V2 config?

+1. cc @igor-aptos

Awesome. So if I make the change and see compat test succeed we should be good, right?

I’m not 100% sure (need to look at the code more later), which is why I want to check with @igor-aptos. The consensus config is stored on chain as bytes, so if the structure changes, deserialization can change. But yes, tests would fail if deserialization fails. I’d also recommend adding a specific test for the new config + corresponding usage.

Discussed with @zekun000 offline. Since this is using BCS, it will fail because BCS is not extendable (to be canonical).

The on chain config probably doesn't need to use BCS, but it's using BCS today.

I think it makes sense to move away from bcs encoding to json or something more flexible. the data is stored on-chain just as a vector, and the interpretation is off-chain, so we can be more flexible here. but we need to do a migration first
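A minimal sketch of why a positional encoding cannot absorb a new field. It does not use the real bcs crate: the encoder and decoder are hand-written stand-ins that mimic only the relevant property (fields are concatenated in declaration order, with no names or tags), and ConfigOld, ConfigNew and their fields are hypothetical.

```rust
/// Hypothetical two-field config, encoded positionally with no field names.
#[derive(Debug, PartialEq)]
pub struct ConfigOld {
    pub decoupled_execution: bool,
    pub back_pressure_limit: u64,
}

/// The same config with one extra trailing field.
#[derive(Debug, PartialEq)]
pub struct ConfigNew {
    pub decoupled_execution: bool,
    pub back_pressure_limit: u64,
    pub quorum_store_enabled: bool,
}

/// Encode as: 1 byte for the bool, then 8 little-endian bytes for the u64.
pub fn encode_old(c: &ConfigOld) -> Vec<u8> {
    let mut out = vec![c.decoupled_execution as u8];
    out.extend_from_slice(&c.back_pressure_limit.to_le_bytes());
    out
}

fn read_bool(bytes: &[u8], pos: &mut usize) -> Result<bool, String> {
    let b = *bytes
        .get(*pos)
        .ok_or_else(|| "unexpected end of input".to_string())?;
    *pos += 1;
    match b {
        0 => Ok(false),
        1 => Ok(true),
        _ => Err("invalid bool".to_string()),
    }
}

fn read_u64(bytes: &[u8], pos: &mut usize) -> Result<u64, String> {
    let end = *pos + 8;
    let slice = bytes
        .get(*pos..end)
        .ok_or_else(|| "unexpected end of input".to_string())?;
    *pos = end;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(slice);
    Ok(u64::from_le_bytes(buf))
}

/// Decode the new layout. Old bytes run out before the third field, and a
/// positional format has no way to say "field absent, use the default".
pub fn decode_new(bytes: &[u8]) -> Result<ConfigNew, String> {
    let mut pos = 0;
    let decoupled_execution = read_bool(bytes, &mut pos)?;
    let back_pressure_limit = read_u64(bytes, &mut pos)?;
    let quorum_store_enabled = read_bool(bytes, &mut pos)?;
    Ok(ConfigNew {
        decoupled_execution,
        back_pressure_limit,
        quorum_store_enabled,
    })
}
```

Bytes written with the old layout fail in decode_new at the third field. A self-describing format such as JSON carries field names, so a missing field can be detected and defaulted.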

If we don't plan to turn off quorum store, we could have V2 represent the quorum store, i.e. have:
V1(ConsensusConfigV1),
V2(ConsensusConfigV1),
so you don't need to create another ConsensusConfigV1 class. that's what I did for the LeaderReputationType::ProposerAndVoterV2.

not sure if that is better here, or not, though.

Oh that's interesting. So the enabling will be like below?

pub fn enable_quorum_store(&self) -> bool {
    match &self {
        OnChainConsensusConfig::V1(_) => false,
        OnChainConsensusConfig::V2(_) => true,
    }
}

I like that it avoids a lot of repeating. If in emergency we need to revert, we just push a new V1 config, right?
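A minimal sketch of the "variant as flag" idea, assuming both variants wrap the same inner struct. The inner struct is cut down to two fields for illustration, and into_v1 is a hypothetical helper that shows what "push a new V1 config" amounts to at the type level; it is not part of the PR.

```rust
/// Cut-down stand-in for the real inner config.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ConsensusConfigV1 {
    pub decoupled_execution: bool,
    pub exclude_round: u64,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum OnChainConsensusConfig {
    V1(ConsensusConfigV1),
    // Same payload as V1; the variant itself signals that quorum store is on.
    V2(ConsensusConfigV1),
}

impl OnChainConsensusConfig {
    pub fn quorum_store_enabled(&self) -> bool {
        match self {
            OnChainConsensusConfig::V1(_) => false,
            OnChainConsensusConfig::V2(_) => true,
        }
    }

    /// Hypothetical helper: rewrap the same inner config as V1,
    /// which turns quorum store off without touching other settings.
    pub fn into_v1(self) -> Self {
        match self {
            OnChainConsensusConfig::V1(c) | OnChainConsensusConfig::V2(c) => {
                OnChainConsensusConfig::V1(c)
            }
        }
    }
}
```

The trade-off is that the enum variant now carries meaning beyond the data layout, so a later layout change would need a third variant.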

JoshLind · 2022-11-13T14:27:49Z

state-sync/inter-component/test-helpers/src/lib.rs

+use std::sync::Arc;
+use storage_interface::DbReaderWriter;
+
+pub fn create_database() -> Arc<RwLock<DbReaderWriter>> {


FYI: I need to look at the details in this PR, but from a quick skim, it seems like overkill to me to create a new crate for a single test helper. I'd just copy/paste this 😄

Cool, makes things easier :) CLion was giving me shame about the copied code :P

zekun000 · 2022-11-15T00:43:56Z

mempool/src/shared_mempool/tasks.rs

        counters::VM_RECONFIG_UPDATE_FAIL_COUNT.inc();
        error!(LogSchema::event_log(LogEntry::ReconfigUpdate, LogEvent::VMUpdateFail).error(&e));
    }
+
+    let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
+    if let Err(error) = &consensus_config {


this seems a good place to do

match consensus_config { Ok() => .. Err() => .. }

The reason the match was not used: If we fail to get the onchain consensus config, we instead use the default value (so we want to still proceed with .unwrap_or_default() ) in the Err case.

This is tricky though. Does it make sense to use the default or just ignore? (In this case, if a previous config was good, it would continue to use that value.) Epoch manager uses the default when starting round manager: https://github.com/aptos-labs/aptos-core/blob/main/consensus/src/epoch_manager.rs#L720

if new config is not valid, I think it is safer to keep the old config, than to revert to default? what would be the reason to revert to default?

also, given the error! below, this is something unexpected, we want to be alerted on, correct?

Yes, this is something unexpected that we should alert on. I guess using the previous value also makes sense -- but it's really all unexpected.
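A minimal sketch of the "keep the previous value on error" option discussed here. ConsensusConfig, apply_consensus_config and the String error type are hypothetical simplifications; the real handler would log through error! and increment a counter rather than print to stderr.

```rust
use std::sync::RwLock;

/// Hypothetical cut-down config; only the flag of interest is modeled.
pub struct ConsensusConfig {
    pub quorum_store_enabled: bool,
}

/// Apply a freshly read config to the shared broadcast flag.
/// Returns true if the flag was updated, false if the old value was kept.
pub fn apply_consensus_config(
    broadcast_within_validator_network: &RwLock<bool>,
    consensus_config: Result<ConsensusConfig, String>,
) -> bool {
    match consensus_config {
        Ok(config) => {
            *broadcast_within_validator_network.write().unwrap() =
                !config.quorum_store_enabled;
            true
        }
        Err(error) => {
            // Unexpected: report it, but leave the previous value untouched
            // instead of silently falling back to a default.
            eprintln!(
                "failed to read on-chain consensus config, keeping previous value: {}",
                error
            );
            false
        }
    }
}
```

With this shape a bad config read never flips broadcast behavior, and the error path stays visible to alerting.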

igor-aptos · 2022-12-07T17:09:50Z

mempool/src/shared_mempool/tasks.rs

        counters::VM_RECONFIG_UPDATE_FAIL_COUNT.inc();
        error!(LogSchema::event_log(LogEntry::ReconfigUpdate, LogEvent::VMUpdateFail).error(&e));
    }
+
+    let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
+    if let Err(error) = &consensus_config {


if new config is not valid, I think it is safer to keep the old config, than to revert to default? what would be the reason to revert to default?

also, given the error! below, this is something unexpected, we want to be alerted on, correct?

igor-aptos · 2022-12-07T17:39:07Z

mempool/src/shared_mempool/types.rs

        }
    }

    pub fn broadcast_within_validator_network(&self) -> bool {
-        self.config.shared_mempool_validator_broadcast
+        *self.broadcast_within_validator_network.read()


is changing broadcast_within_validator_network underneath safe from mempool's perspective?
i.e. if it was false, and we change it to true, do we need to re-broadcast some transactions in mempool that we have skipped before?

it seems odd to have onchain config modify things underneath the mempool without any coordination. it might be safe - but if it is - it requires an inline comment explaining why that is the case.

igor-aptos · 2022-12-07T17:42:50Z

types/src/on_chain_config/consensus_config.rs

@@ -12,6 +12,7 @@ use serde::{Deserialize, Serialize};
 #[derive(Clone, Debug, Deserialize, PartialEq, Eq, Serialize)]
 pub enum OnChainConsensusConfig {
    V1(ConsensusConfigV1),
+    V2(ConsensusConfigV2),


If we don't plan to turn off quorum store, we could have V2 represent the quorum store, i.e. have:
V1(ConsensusConfigV1),
V2(ConsensusConfigV1),
so you don't need to create another ConsensusConfigV1 class. that's what I did for the LeaderReputationType::ProposerAndVoterV2.

not sure if that is better here, or not, though.

igor-aptos · 2022-12-07T17:48:27Z

types/src/on_chain_config/consensus_config.rs

+            back_pressure_limit: 10,
+            exclude_round: 20,
+            max_failed_authors_to_store: 10,
+            proposer_election_type: ProposerElectionType::LeaderReputation(


let's have a default for ProposerElectionType, so this is not repeated?

bchocho · 2022-12-07T19:34:20Z

@igor-aptos I added inline responses. Github is so confusing, I can only see these responses in "Files changed" :(

gelash · 2022-12-10T22:31:51Z

types/src/on_chain_config/consensus_config.rs

@@ -20,13 +21,15 @@ impl OnChainConsensusConfig {
    pub fn leader_reputation_exclude_round(&self) -> u64 {
        match &self {
            OnChainConsensusConfig::V1(config) => config.exclude_round,


ditto, pattern?
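A minimal sketch of the or-pattern being asked about, assuming both variants wrap the same inner type so that a single binding works across them. The struct is cut down to two fields for illustration.

```rust
/// Cut-down stand-in for the real inner config.
pub struct ConsensusConfigV1 {
    pub exclude_round: u64,
    pub max_failed_authors_to_store: usize,
}

pub enum OnChainConsensusConfig {
    V1(ConsensusConfigV1),
    V2(ConsensusConfigV1),
}

impl OnChainConsensusConfig {
    pub fn leader_reputation_exclude_round(&self) -> u64 {
        match self {
            // One arm covers both variants; `config` must bind the same
            // type in each alternative for this to compile.
            OnChainConsensusConfig::V1(config) | OnChainConsensusConfig::V2(config) => {
                config.exclude_round
            }
        }
    }

    pub fn max_failed_authors_to_store(&self) -> usize {
        match self {
            OnChainConsensusConfig::V1(config) | OnChainConsensusConfig::V2(config) => {
                config.max_failed_authors_to_store
            }
        }
    }
}
```

The match stays exhaustive, so adding a V3 with a different payload would still force every accessor to be revisited.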

gelash · 2022-12-10T22:31:58Z

types/src/on_chain_config/consensus_config.rs

        }
    }

    /// Decouple execution from consensus or not.
    pub fn decoupled_execution(&self) -> bool {
        match &self {
            OnChainConsensusConfig::V1(config) => config.decoupled_execution,
+            OnChainConsensusConfig::V2(config) => config.decoupled_execution,


gelash · 2022-12-10T22:32:13Z

types/src/on_chain_config/consensus_config.rs

@@ -47,13 +51,22 @@ impl OnChainConsensusConfig {
    pub fn max_failed_authors_to_store(&self) -> usize {
        match &self {
            OnChainConsensusConfig::V1(config) => config.max_failed_authors_to_store,


same, I hope I don't forget the pattern syntax

gelash · 2022-12-10T22:32:19Z

types/src/on_chain_config/consensus_config.rs

        }
    }

    // Type and configuration used for proposer election.
    pub fn proposer_election_type(&self) -> &ProposerElectionType {
        match &self {
            OnChainConsensusConfig::V1(config) => &config.proposer_election_type,
+            OnChainConsensusConfig::V2(config) => &config.proposer_election_type,


here as well

sasha8 · 2022-12-11T04:44:38Z

consensus/src/epoch_manager.rs

@@ -641,6 +641,7 @@ impl EpochManager {
        ));

        // Start QuorumStore
+        self.quorum_store_enabled = onchain_config.quorum_store_enabled();


nit: why in start_round_manager() and not before at start_new_epoch()?

sasha8 · 2022-12-11T04:59:29Z

mempool/src/shared_mempool/tasks.rs

+    let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
+    match consensus_config {
+        Ok(consensus_config) => {
+            *broadcast_within_validator_network.write() = !consensus_config.quorum_store_enabled();


It is possible that mempool will broadcast for some time after quorum_store is enabled, right?

There's a small window where this can happen. But once broadcast_within_validator_network is set, the validator won't broadcast anymore

sasha8 · 2022-12-11T05:22:11Z

types/src/on_chain_config/consensus_config.rs

+    pub fn quorum_store_enabled(&self) -> bool {
+        match &self {
+            OnChainConsensusConfig::V1(_config) => false,
+            OnChainConsensusConfig::V2(_config) => true,


So we cannot use ConsensusConfigV2 and add it there?

We could, but as Igor pointed out it adds a lot of boilerplate repeated configs.

github-actions · 2022-12-16T19:31:55Z

✅ Forge suite `land_blocking` success on `55a58ed9754f44e3323bbc690e67f553d4d5ddd1`

performance benchmark with full nodes : 5736 TPS, 6914 ms latency, 14100 ms p99 latency,(!) expired 940 out of 2450380 txns
Test Ok

Test run is land-blocking

github-actions · 2022-12-16T19:33:35Z

✅ Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `55a58ed9754f44e3323bbc690e67f553d4d5ddd1`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 55a58ed9754f44e3323bbc690e67f553d4d5ddd1 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7260 TPS, 5411 ms latency, 7900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 55a58ed9754f44e3323bbc690e67f553d4d5ddd1
compatibility::simple-validator-upgrade::single-validator-upgrade : 4490 TPS, 8962 ms latency, 12600 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 55a58ed9754f44e3323bbc690e67f553d4d5ddd1
compatibility::simple-validator-upgrade::half-validator-upgrade : 4608 TPS, 9233 ms latency, 12400 ms p99 latency,no expired txns
4. upgrading second batch to new version: 55a58ed9754f44e3323bbc690e67f553d4d5ddd1
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6540 TPS, 6032 ms latency, 9600 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 55a58ed9754f44e3323bbc690e67f553d4d5ddd1 passed
Test Ok

Test run is land-blocking

bchocho force-pushed the brian/qs-onchain-config branch from 602b919 to d1754ed Compare November 8, 2022 00:52

bchocho changed the title ~~[DRAFT] onchain config for turning on quorum store and turning off mempool broadcast~~ [quorum store] new onchain config for turning on quorum store and turning off mempool broadcast Nov 8, 2022

bchocho commented Nov 10, 2022

bchocho added the CICD:run-e2e-tests label Nov 10, 2022

bchocho marked this pull request as ready for review November 10, 2022 00:50

bchocho requested a review from JoshLind as a code owner November 10, 2022 00:50

bchocho requested review from zekun000, gelash, igor-aptos, sasha8 and danielxiangzl November 10, 2022 00:51

ibalajiarun reviewed Nov 10, 2022

JoshLind reviewed Nov 13, 2022

zekun000 reviewed Nov 15, 2022

bchocho mentioned this pull request Nov 15, 2022

[mempool] wait on init reconfig at startup #5578

Merged

bchocho force-pushed the brian/qs-onchain-config branch from b039a07 to 2130b23 Compare November 17, 2022 00:31

bchocho mentioned this pull request Nov 17, 2022

[quorum store] payload_manager #5464

Merged

igor-aptos reviewed Dec 7, 2022

bchocho added 2 commits December 8, 2022 17:01

[quorum store] create onchain config for quorum store, connect it to mempool broadcasts

09b0a73

Use a single inner config

507db70

bchocho force-pushed the brian/qs-onchain-config branch from 2130b23 to 507db70 Compare December 9, 2022 01:21

bchocho added 2 commits December 9, 2022 10:58

Resolve comments: keep old config value on error, and add comment on broadcast_within_validator_network transition behavior

7bc4ff4

fix

c91d2b8

gelash reviewed Dec 10, 2022

gelash reviewed Dec 10, 2022

sasha8 reviewed Dec 11, 2022

bchocho added 3 commits December 15, 2022 12:30

simplify match

e7eb1f8

Set self.quorum_store_enabled before starting round manager

19b71a9

Merge branch 'main' into brian/qs-onchain-config

5360cfd

sasha8 approved these changes Dec 15, 2022

igor-aptos approved these changes Dec 16, 2022

bchocho enabled auto-merge (squash) December 16, 2022 18:28

Merge branch 'main' into brian/qs-onchain-config

55a58ed

bchocho merged commit 05510f2 into main Dec 16, 2022

bchocho deleted the brian/qs-onchain-config branch December 16, 2022 19:34

Markuze mentioned this pull request Dec 26, 2022

markuze/aptos perf #5998

Closed

Markuze mentioned this pull request Jan 3, 2023

[Network] Adding AptosPerf Client #6054

Closed

thepomeranian mentioned this pull request Apr 20, 2023

[AIP-26][Discussion]Quorum Store aptos-foundation/AIPs#108

Closed
