[quorum store] new onchain config for turning on quorum store and turning off mempool broadcast #5400
Conversation
Force-pushed 602b919 to d1754ed
The PR is larger than the actual logic change. Added some pointers in the comments above.
@@ -67,6 +68,17 @@ pub(crate) async fn coordinator<V>(
    let workers_available = smp.config.shared_mempool_max_concurrent_inbound_syncs;
    let bounded_executor = BoundedExecutor::new(workers_available, executor.clone());

    let initial_reconfig = mempool_reconfig_events
Previously mempool was not reading the initial config before starting, which seems like an oversight.
This change required a lot of refactoring of the tests. MockDbReaderWriter could not handle the on-chain configs.
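A hedged, self-contained sketch of the pattern (a tokio channel stands in for the real reconfig notification stream; all names are illustrative, not the PR's code):

use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct OnChainConfig {
    quorum_store_enabled: bool,
}

async fn coordinator(mut reconfig_events: mpsc::Receiver<OnChainConfig>) {
    // Apply the initial on-chain config before processing any other events;
    // previously the loop only reacted to subsequent reconfig notifications.
    if let Some(initial) = reconfig_events.recv().await {
        apply_config(&initial);
    }
    while let Some(update) = reconfig_events.recv().await {
        apply_config(&update);
    }
}

fn apply_config(config: &OnChainConfig) {
    println!("broadcast enabled: {}", !config.quorum_store_enabled);
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    tx.send(OnChainConfig { quorum_store_enabled: false }).await.unwrap();
    drop(tx);
    coordinator(rx).await;
}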
    }
}

pub fn broadcast_within_validator_network(&self) -> bool {
-   self.config.shared_mempool_validator_broadcast
+   *self.broadcast_within_validator_network.read()
This is essentially the logic change. Instead of taking a config value, take the onchain config value.
This should not be too expensive, because it's only read on the validators which receive broadcasted transactions in batches.
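A self-contained sketch of that mechanism (std's RwLock here, so reads go through unwrap; the real code appears to use an infallible wrapper; names are assumptions):

use std::sync::{Arc, RwLock};

// Shared flag: the reconfig task writes it, the broadcast path reads it.
struct SharedMempool {
    broadcast_within_validator_network: Arc<RwLock<bool>>,
}

impl SharedMempool {
    // Replaces the old static read of config.shared_mempool_validator_broadcast
    // with the value derived from the on-chain consensus config.
    pub fn broadcast_within_validator_network(&self) -> bool {
        *self.broadcast_within_validator_network.read().unwrap()
    }
}

fn main() {
    let flag = Arc::new(RwLock::new(true));
    let smp = SharedMempool {
        broadcast_within_validator_network: flag.clone(),
    };
    assert!(smp.broadcast_within_validator_network());
    // The reconfig task flips the flag when quorum store is enabled on-chain.
    *flag.write().unwrap() = false;
    assert!(!smp.broadcast_within_validator_network());
}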
is changing broadcast_within_validator_network underneath safe from mempool's perspective?
i.e. if it was false, and we change it to true, do we need to re-broadcast some transactions in mempool that we have skipped before?
it seems odd to have onchain config modify things underneath the mempool without any coordination. it might be safe - but if it is - it requires an inline comment explaining why that is the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's well-behaved going from true to false, but the other way around we can have transactions that are in mempool and are never broadcasted or in a batch.
We really don't want to go back from quorum store, so I think it's ok to have this disruption. WDYT?
I can add this as a comment if it makes sense.
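For reference, the inline comment being offered might read roughly like this (wording summarizes this thread; it is not the PR's actual comment):

// Safety of flipping this flag under mempool via on-chain reconfig:
// - true -> false (quorum store enabled): well-behaved; we simply stop
//   broadcasting and new txns are picked up by quorum store batches.
// - false -> true (quorum store disabled again): txns accepted while the flag
//   was false may never be broadcast nor placed in a batch. We accept this
//   disruption because moving back off quorum store is not an expected path.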
@@ -12,6 +12,7 @@ use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, Deserialize, PartialEq, Eq, Serialize)]
pub enum OnChainConsensusConfig {
    V1(ConsensusConfigV1),
    V2(ConsensusConfigV2),
noob question: why do we have to create a new config version in this case? Can't we add the new flag to V1 and annotate it with #[serde(default = ...)] to maintain backwards compatibility? in other words, is there a need for a V2 config?
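For reference, the suggestion would look roughly like this (hypothetical field placement; the rest of the thread explains why this does not work for BCS):

use serde::{Deserialize, Serialize};

#[derive(Clone, Debug, Deserialize, PartialEq, Eq, Serialize)]
pub struct ConsensusConfigV1 {
    // ... existing fields elided ...
    // Hypothetical new flag: with a self-describing format such as JSON this
    // would default to false when absent, keeping old payloads readable.
    #[serde(default)]
    pub quorum_store_enabled: bool,
}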
+1. cc @igor-aptos
Awesome. So if I make the change and see the compat test succeed, we should be good, right?
I’m not 100% sure (need to look at the code more later), which is why I want to check with @igor-aptos. The consensus config is stored on chain as bytes, so if the structure changes, deserialization can change. But yes, tests would fail if deserialization fails. I’d also recommend adding a specific test for the new config + corresponding usage
Discussed with @zekun000 offline. Since this is using BCS, it will fail because BCS is not extendable (to be canonical).
The on chain config probably doesn't need to use BCS, but it's using BCS today.
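A self-contained illustration of why the BCS-encoded enum is not extendable (using the bcs crate; the types are stand-ins, not the real configs):

use serde::{Deserialize, Serialize};

// BCS encodes an enum as a ULEB128 variant index followed by the variant's
// fields, with no tags or lengths an older reader could skip over.
#[derive(Serialize)]
enum NewConfig {
    V1(u64),
    V2(u64),
}

#[derive(Deserialize, Debug)]
enum OldConfig {
    V1(u64),
}

fn main() {
    let bytes = bcs::to_bytes(&NewConfig::V2(7)).unwrap();
    // A node that only knows V1 sees variant index 1 and fails to deserialize,
    // which is why adding V2 (or appending a field) breaks old software.
    assert!(bcs::from_bytes::<OldConfig>(&bytes).is_err());
}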
I think it makes sense to move away from bcs encoding to json or something more flexible. the data is stored on-chain just as a vector of bytes, and the interpretation is off-chain, so we can be more flexible here. but we need to do a migration first
If we don't plan to turn off quorum store, we could have V2 represent the quorum store, i.e. have:
V1(ConsensusConfigV1),
V2(ConsensusConfigV1),
so you don't need to create another ConsensusConfigV2 struct. that's what I did for the LeaderReputationType::ProposerAndVoterV2.
not sure if that is better here, or not, though.
Oh that's interesting. So the enabling will be like below?
pub fn enable_quorum_store(&self) -> bool {
match &self {
OnChainConsensusConfig::V1(_) => false,
OnChainConsensusConfig::V2(_) => true,
}
}
I like that it avoids a lot of repetition. If in an emergency we need to revert, we just push a new V1 config, right?
yes
use std::sync::Arc;
use storage_interface::DbReaderWriter;

pub fn create_database() -> Arc<RwLock<DbReaderWriter>> {
FYI: I need to look at the details in this PR, but from a quick skim, it seems like overkill to me to create a new crate for a single test helper. I'd just copy/paste this 😄
Cool, makes things easier :) CLion was giving me shame about the copied code :P
mempool/src/shared_mempool/tasks.rs (Outdated)
    counters::VM_RECONFIG_UPDATE_FAIL_COUNT.inc();
    error!(LogSchema::event_log(LogEntry::ReconfigUpdate, LogEvent::VMUpdateFail).error(&e));
}

let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
if let Err(error) = &consensus_config {
this seems like a good place to do
match consensus_config {
    Ok(config) => { /* ... */ }
    Err(error) => { /* ... */ }
}
The reason the match was not used: if we fail to get the onchain consensus config, we instead use the default value in the Err case (i.e., we still want to proceed with .unwrap_or_default()).
This is tricky though. Does it make sense to use the default or just ignore? (In this case, if a previous config was good, it would continue to use that value.) Epoch manager uses the default when starting round manager: https://github.com/aptos-labs/aptos-core/blob/main/consensus/src/epoch_manager.rs#L720
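A self-contained sketch of the trade-off (OnChainConsensusConfig and fetch() are stand-ins for the real type and for config_update.get()):

#[derive(Debug, Default, Clone)]
struct OnChainConsensusConfig {
    quorum_store_enabled: bool,
}

fn fetch(fail: bool) -> anyhow::Result<OnChainConsensusConfig> {
    if fail {
        anyhow::bail!("config missing or failed to deserialize");
    }
    Ok(OnChainConsensusConfig { quorum_store_enabled: true })
}

fn main() {
    let result = fetch(true);
    if let Err(error) = &result {
        eprintln!("reconfig read failed: {error}"); // unexpected: alert on this
    }
    // What the PR does: log, then proceed with the default config.
    let config = result.unwrap_or_default();
    println!("quorum store enabled: {}", config.quorum_store_enabled);
    // The alternative raised below: on Err, skip the update entirely and keep
    // whatever config was applied previously.
}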
if new config is not valid, I think it is safer to keep the old config than to revert to the default? what would be the reason to revert to default?
also, given the error! below, this is something unexpected that we want to be alerted on, correct?
Yes, this is something unexpected that we should alert on. I guess using the previous value also makes sense -- but it's really all unexpected.
Force-pushed b039a07 to 2130b23
back_pressure_limit: 10,
exclude_round: 20,
max_failed_authors_to_store: 10,
proposer_election_type: ProposerElectionType::LeaderReputation(
let's have a default for ProposerElectionType, so this is not repeated?
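A self-contained analog of that suggestion (illustrative names and payloads, not the real aptos-types definitions):

#[derive(Clone, Debug, PartialEq, Eq)]
enum ProposerElectionType {
    RotatingProposer,
    LeaderReputation(u64), // payload stands in for the real reputation config
}

impl Default for ProposerElectionType {
    fn default() -> Self {
        ProposerElectionType::LeaderReputation(10)
    }
}

fn main() {
    // Test fixtures can now take the default instead of repeating the variant.
    assert_eq!(
        ProposerElectionType::default(),
        ProposerElectionType::LeaderReputation(10)
    );
}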
@igor-aptos I added inline responses. GitHub is so confusing, I can only see these responses in "Files changed" :(
Force-pushed 2130b23 to 507db70
…broadcast_within_validator_network transition behavior
@@ -20,13 +21,15 @@ impl OnChainConsensusConfig {
pub fn leader_reputation_exclude_round(&self) -> u64 {
    match &self {
        OnChainConsensusConfig::V1(config) => config.exclude_round,
ditto, pattern?
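Presumably the pattern meant here is an or-pattern that binds both variants' payloads to one name, collapsing the duplicated arms. A self-contained sketch (this only compiles because V2 carries the same payload type as V1):

struct ConsensusConfigV1 {
    exclude_round: u64,
}

enum OnChainConsensusConfig {
    V1(ConsensusConfigV1),
    V2(ConsensusConfigV1),
}

impl OnChainConsensusConfig {
    pub fn leader_reputation_exclude_round(&self) -> u64 {
        match &self {
            OnChainConsensusConfig::V1(config)
            | OnChainConsensusConfig::V2(config) => config.exclude_round,
        }
    }
}

fn main() {
    let config = OnChainConsensusConfig::V2(ConsensusConfigV1 { exclude_round: 20 });
    assert_eq!(config.leader_reputation_exclude_round(), 20);
}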
    }
}

/// Decouple execution from consensus or not.
pub fn decoupled_execution(&self) -> bool {
    match &self {
        OnChainConsensusConfig::V1(config) => config.decoupled_execution,
        OnChainConsensusConfig::V2(config) => config.decoupled_execution,
ditto
@@ -47,13 +51,22 @@ impl OnChainConsensusConfig {
pub fn max_failed_authors_to_store(&self) -> usize {
    match &self {
        OnChainConsensusConfig::V1(config) => config.max_failed_authors_to_store,
same, I hope I don't forget the pattern syntax
    }
}

// Type and configuration used for proposer election.
pub fn proposer_election_type(&self) -> &ProposerElectionType {
    match &self {
        OnChainConsensusConfig::V1(config) => &config.proposer_election_type,
        OnChainConsensusConfig::V2(config) => &config.proposer_election_type,
here as well
consensus/src/epoch_manager.rs (Outdated)
@@ -641,6 +641,7 @@ impl EpochManager {
));

// Start QuorumStore
self.quorum_store_enabled = onchain_config.quorum_store_enabled();
nit: why in start_round_manager() and not earlier, at start_new_epoch()?
let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
match consensus_config {
    Ok(consensus_config) => {
        *broadcast_within_validator_network.write() = !consensus_config.quorum_store_enabled();
It is possible that mempool will broadcast for some time after quorum_store is enabled, right?
There's a small window where this can happen. But once broadcast_within_validator_network is set to false, the validator won't broadcast anymore.
pub fn quorum_store_enabled(&self) -> bool {
    match &self {
        OnChainConsensusConfig::V1(_config) => false,
        OnChainConsensusConfig::V2(_config) => true,
So we cannot use ConsensusConfigV2 and add it there?
We could, but as Igor pointed out, it adds a lot of boilerplate from repeated configs.
✅ Forge suite
✅ Forge suite
Description
Adds quorum_store_enabled to the consensus onchain config. This is read by mempool (to turn off validator broadcast) and by the epoch manager (to start the quorum store components).
Note, there is a race between mempool (long-lived) and consensus + quorum store components (recreated by epoch manager) during the config transition. We believe txns can be duplicated but do not expect txns to be “lost” or “stuck”. We need to further test this when the quorum store implementation is merged into main.
Test Plan
Existing tests. In particular, test_txn_broadcast.