Always record storage when stateless validation is enabled #10859

Merged
merged 6 commits into near:master from jancionear:panopticon
Mar 26, 2024

Conversation

jancionear
Contributor

For stateless validation we need to record all trie reads performed when applying a chunk in order to prepare a PartialState that will be included in ChunkStateWitness

Initially it was only necessary to record trie reads when producing a chunk. Validators could read all the values from the provided PartialState without recording anything.

A recent change (#10703) introduced a soft limit on the size of PartialState. When applying a chunk we watch how much state was recorded, and once the amount of state exceeds the soft limit we stop applying the receipts in this chunk.
This needs to be done on both the chunk producer and chunk validator - if the chunk validator doesn't record reads and enforce the limit, it will produce a different result of chunk application, which would lead to validation failure.

This means that state should be recorded in all cases when a statelessly validated chunk is applied. Let's remove the configuration option that controls whether trie reads should be recorded (record_storage) and just record the reads on every chunk application (when statelessnet_protocol feature is enabled).
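The soft-limit behaviour can be sketched in a minimal, self-contained way. The constant, struct, and function names below (`RECORDED_STATE_SOFT_LIMIT`, `RecordingTrie`, `apply_receipts`) are illustrative stand-ins, not nearcore's actual API, and the 16 MiB value is assumed for the example:

```rust
const RECORDED_STATE_SOFT_LIMIT: usize = 16 * 1024 * 1024; // assumed 16 MiB, for illustration only

struct RecordingTrie {
    recorded_bytes: usize,
}

impl RecordingTrie {
    // Pretend each read records some number of trie-node bytes into the proof.
    fn read(&mut self, node_bytes: usize) {
        self.recorded_bytes += node_bytes;
    }
}

/// Apply receipts until the recorded state reaches the soft limit;
/// returns how many receipts were applied before stopping.
fn apply_receipts(trie: &mut RecordingTrie, receipt_costs: &[usize]) -> usize {
    let mut applied = 0;
    for &cost in receipt_costs {
        if trie.recorded_bytes >= RECORDED_STATE_SOFT_LIMIT {
            break; // defer the remaining receipts instead of applying them
        }
        trie.read(cost);
        applied += 1;
    }
    applied
}

fn main() {
    let mut trie = RecordingTrie { recorded_bytes: 0 };
    // Three receipts touching 10 MiB of state each: the second read pushes the
    // total past 16 MiB, so the third receipt is not applied.
    let applied = apply_receipts(&mut trie, &[10 << 20, 10 << 20, 10 << 20]);
    assert_eq!(applied, 2);
}
```

Because the producer and the validator run the same check, they stop at the same receipt and reach the same chunk application result.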

Refs: [zulip conversation](https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/State.20witness.20size.20limit/near/428313518)

@jancionear jancionear added the A-stateless-validation Area: stateless validation label Mar 22, 2024
@jancionear jancionear requested a review from a team as a code owner March 22, 2024 14:42
…idation is not enabled

prepare_transactions() doesn't produce a storage proof when stateless validation isn't enabled.
Let's disable tests that require this storage proof when stateless validation is disabled.
@Longarithm
Member

This is (going) to change the protocol, right? Then it needs to be gated by ProtocolFeature / protocol version, so we could smoothly roll it out to statelessnet and mainnet.

@jancionear
Contributor Author

This is (going) to change the protocol, right? Then it needs to be gated by ProtocolFeature / protocol version, so we could smoothly roll it out to statelessnet and mainnet.

I'd say that it's a fix for the protocol version introduced by #10703.
Adding the soft limit caused a protocol change, I'm just fixing the validators to also take this limit into account when applying chunks.

Using the `checked_feature!` macro covers both `statelessnet_protocol`
and `nightly`, so it's better to use it instead of checking for just
`statelessnet_protocol`.

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 6 lines in your changes missing coverage. Please review.

Project coverage is 71.63%. Comparing base (92e5938) to head (6c5dfac).

Files Patch % Lines
chain/chain/src/chain.rs 77.77% 0 Missing and 4 partials ⚠️
chain/chain/src/chain_update.rs 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10859      +/-   ##
==========================================
+ Coverage   71.61%   71.63%   +0.01%     
==========================================
  Files         758      758              
  Lines      151765   151789      +24     
==========================================
+ Hits       108692   108730      +38     
+ Misses      38537    38518      -19     
- Partials     4536     4541       +5     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.42% <0.00%> (-0.01%) ⬇️
integration-tests 37.39% <75.75%> (-0.01%) ⬇️
linux 70.09% <54.54%> (-0.15%) ⬇️
linux-nightly 71.12% <75.75%> (+0.02%) ⬆️
macos 54.53% <54.54%> (-0.18%) ⬇️
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.44% <0.00%> (-0.01%) ⬇️
unittests 67.28% <90.90%> (+0.01%) ⬆️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Comment on lines +1606 to +1607
if !checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION) {
println!("Test not applicable without StatelessValidation enabled");
Contributor

Does this mean that this test won't kick in before hitting testnet (with a stable build)? Isn't it too late?

Contributor Author

@jancionear jancionear Mar 22, 2024

This means that the test will only execute in builds where the default protocol version has stateless validation enabled.
In practice this means that the test will be enabled for builds with --features statelessnet_protocol and --features nightly. Both of those options are present in the CI, so the test will be run on every change, and it will be enabled on statelessnet builds.

Tbh I don't know what "stable" in `checked_feature` is supposed to mean; the macro just checks the protocol version:

#[macro_export]
macro_rules! checked_feature {
    ("stable", $feature:ident, $current_protocol_version:expr) => {{
        $crate::version::ProtocolFeature::$feature.protocol_version() <= $current_protocol_version
    }};

The code is here if anyone wants to take a look:

pub const PROTOCOL_VERSION: ProtocolVersion = if cfg!(feature = "statelessnet_protocol") {
    // Current StatelessNet protocol version.
    83
} else if cfg!(feature = "nightly_protocol") {
    // On nightly, pick big enough version to support all features.
    140
} else {
    // Enable all stable features.
    STABLE_PROTOCOL_VERSION
};

/// Both, outgoing and incoming tcp connections to peers, will be rejected if `peer's`
/// protocol version is lower than this.
pub const PEER_MIN_ALLOWED_PROTOCOL_VERSION: ProtocolVersion = STABLE_PROTOCOL_VERSION - 2;

#[macro_export]
macro_rules! checked_feature {
    ("stable", $feature:ident, $current_protocol_version:expr) => {{
        $crate::version::ProtocolFeature::$feature.protocol_version() <= $current_protocol_version
    }};
    ($feature_name:tt, $feature:ident, $current_protocol_version:expr) => {{
        #[cfg(feature = $feature_name)]
        let is_feature_enabled = $crate::version::ProtocolFeature::$feature.protocol_version()
            <= $current_protocol_version;
        #[cfg(not(feature = $feature_name))]
        let is_feature_enabled = {
            // Workaround unused variable warning
            let _ = $current_protocol_version;
            false
        };
        is_feature_enabled
    }};
}
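For illustration, the "stable" arm of the macro reduces to a plain version comparison. A self-contained sketch, assuming (hypothetically, for the example) that `StatelessValidationV0` carries protocol version 83, the StatelessNet version from the snippet above:

```rust
type ProtocolVersion = u32;

// Hypothetical value for this example; the real number comes from
// ProtocolFeature::StatelessValidationV0.protocol_version() in nearcore.
const STATELESS_VALIDATION_V0_VERSION: ProtocolVersion = 83;

// What checked_feature!("stable", StatelessValidationV0, v) boils down to:
// "is the feature's activation version at or below the running version?"
fn stateless_validation_enabled(current: ProtocolVersion) -> bool {
    STATELESS_VALIDATION_V0_VERSION <= current
}

fn main() {
    assert!(!stateless_validation_enabled(82)); // stable build below the feature version
    assert!(stateless_validation_enabled(83)); // statelessnet build (PROTOCOL_VERSION = 83)
    assert!(stateless_validation_enabled(140)); // nightly build (PROTOCOL_VERSION = 140)
}
```

So "stable" here just means "gated purely by protocol version, with no Cargo feature check", which matches the behaviour described in the comment above.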

Contributor Author

@jancionear jancionear Mar 22, 2024

We already have this kind of check in many other tests that don't work without stateless validation, e.g:

if !checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION) {

Contributor

@shreyan-gupta shreyan-gupta left a comment

Looks good! Could @Longarithm confirm whether it's fine to delete the `should_produce_state_witness_for_this_or_next_epoch` function? In the future, with single shard tracking, we shouldn't even require the condition `self.epoch_manager.is_chunk_producer_for_epoch(epoch_id, account_id)? || self.epoch_manager.is_chunk_producer_for_epoch(&next_epoch_id, account_id)?`?

storage_data_source: StorageDataSource::Db,
state_patch,
record_storage: self
    .should_produce_state_witness_for_this_or_next_epoch(me, block.header())?,
Contributor

`should_produce_state_witness_for_this_or_next_epoch` is used in one other place, in `prepare_transactions`. We should remove it from there as well and delete the function.

@Longarithm
Member

Good point, actually. We should still call `should_produce_state_witness_for_this_or_next_epoch`, but at the place where we save recorded storage to disk. The method still makes sense because this data is useless if the node is not a chunk producer. With single shard tracking, shard changes may still happen.

Comment on lines 869 to 871
if checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION) {
trie = trie.recording_reads();
}
Contributor Author

@jancionear jancionear Mar 25, 2024

The recording should probably also be enabled with the `shadow_chunk_validation` feature.

@jancionear
Contributor Author

v2:

  • Don't save the recorded state transition data to the database if we aren't a chunk producer for this or the next epoch. Validators don't need to persist this data, so we can save some disk space (refs Reducing hard drive validator footprint #10357).
  • Also record state transition data when the shadow_chunk_validation feature is enabled. I hope this change won't break anything with shadow validation (cc @pugachAG)
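The persistence rule in the first bullet could be sketched as follows. This is a simplified illustration: the predicate mirrors the role of `should_produce_state_witness_for_this_or_next_epoch`, and the `NodeRole` type is a hypothetical stand-in for the epoch-manager queries:

```rust
#[derive(Clone, Copy)]
struct NodeRole {
    chunk_producer_this_epoch: bool,
    chunk_producer_next_epoch: bool,
}

/// Persist recorded state transition data only if this node may need it to
/// produce a state witness, i.e. it is a chunk producer now or next epoch.
fn should_save_state_transition_data(role: NodeRole) -> bool {
    role.chunk_producer_this_epoch || role.chunk_producer_next_epoch
}

fn main() {
    let validator_only = NodeRole {
        chunk_producer_this_epoch: false,
        chunk_producer_next_epoch: false,
    };
    let upcoming_producer = NodeRole {
        chunk_producer_this_epoch: false,
        chunk_producer_next_epoch: true,
    };
    // A pure validator still records reads (to enforce the soft limit) but
    // skips the database write, saving disk space.
    assert!(!should_save_state_transition_data(validator_only));
    assert!(should_save_state_transition_data(upcoming_producer));
}
```

The key point is that recording and persisting are now decoupled: every node records, but only witness producers persist.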

@jancionear
Contributor Author

Merged master in hope that it fixes the cargo audit check.

@jancionear
Contributor Author

jancionear commented Mar 25, 2024

Merged master in hope that it fixes the cargo audit check.

Didn't help, opened #10876 about the problem.
It looks like the check is failing because the yaml-rust crate is unmaintained. I think we can safely ignore this warning. It's not a vulnerability, just an inactive maintainer.

@jancionear
Contributor Author

@Longarithm are you okay with this change?
It would be nice to get an approval because this change is needed for the next statelessnet release. And AFAIK Shreyan is OOO right now :c

Member

@Longarithm Longarithm left a comment

Sorry for the delay. I think we need to make the protocol version check stricter.

Comment on lines +871 to +873
if checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION)
|| cfg!(feature = "shadow_chunk_validation")
{
Member

PROTOCOL_VERSION is not enough; we need a check against the protocol version of the next epoch here. If we stabilize this version and some validator picks it up before the others, its performance will suffer because it will have to read all the nodes, unlike the others. And if the protocol version switch is delayed for some reason, the validator may miss blocks/chunks because of that.
We need the next epoch, not the current epoch, because state transitions must be recorded one epoch before the protocol upgrade. EpochManager has a method for that.
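The stricter gating suggested here could look roughly like this: consult the next epoch's protocol version instead of the compile-time `PROTOCOL_VERSION`. The `EpochInfo` type, the version constant, and the function are illustrative stand-ins for EpochManager's actual API:

```rust
type ProtocolVersion = u32;

const STATELESS_VALIDATION_VERSION: ProtocolVersion = 83; // illustrative value

// Stand-in for what EpochManager would return for the upcoming epoch.
struct EpochInfo {
    protocol_version: ProtocolVersion,
}

/// Record trie reads if the *next* epoch runs with stateless validation,
/// so the recorded data exists one epoch before the feature activates.
fn should_record_reads(next_epoch: &EpochInfo) -> bool {
    STATELESS_VALIDATION_VERSION <= next_epoch.protocol_version
}

fn main() {
    // Upgrade pending: the next epoch switches to the new version, so we must
    // already be recording in the current epoch.
    assert!(should_record_reads(&EpochInfo { protocol_version: 83 }));
    // No upgrade in sight: skip the recording overhead.
    assert!(!should_record_reads(&EpochInfo { protocol_version: 82 }));
}
```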

Contributor Author

Hmm, I imagined that before switching to stateless validation we would enable this feature (or make it the default), and then all nodes would start recording the data needed for stateless validation. At some point the protocol version would change and the nodes would start using this recorded data, but it's okay to start recording much earlier; without stateless validation the data would just be discarded.
Basically we don't need to start recording at the exact moment when it's needed; we can do it as soon as the updated node starts.

Is it that much of a performance hit? We have to do it in the epoch before switching to stateless validation, so it has to be performant enough for normal traffic. Anton's `shadow_chunk_validation` records data and is able to keep up with mainnet traffic.

Member

Anton's shadow_chunk_validation records data and it's able to keep up with the mainnet traffic.

Oh, that's nice.
I also realised that memtrie should be enabled no later than stateless validation, which will give the desired performance. Then it looks good!

Comment on lines +708 to +710
if checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION)
|| cfg!(feature = "shadow_chunk_validation")
{
Member

Same as below. If we want to be super precise, we can use the current epoch's protocol version, because that recorded storage is not saved anywhere; it is used immediately on chunk production.

@@ -413,6 +419,7 @@ impl<'a> ChainUpdate<'a> {
block: &Block,
block_preprocess_info: BlockPreprocessInfo,
apply_chunks_results: Vec<(ShardId, Result<ShardUpdateResult, Error>)>,
should_save_state_transition_data: bool,
Member

Can be computed here, because `postprocess_block` has `me`.

Contributor Author

That's true, but `Chain` also calls `ChainUpdate::process_apply_chunk_result`, which doesn't have a `me` argument. This means that `should_produce_state_witness_for_this_or_next_epoch` would have to be available from both `Chain` and `ChainUpdate`; I can't just do everything inside `ChainUpdate`.

I guess it would be possible to move `should_produce_state_witness_for_this_or_next_epoch` to `EpochManager` and use it from both `Chain` and `ChainUpdate`, but I'm not convinced there's any real benefit in exchange for this additional work. At this point I'm tempted to just merge it and unblock the statelessnet release.

Member

@Longarithm Longarithm Mar 26, 2024

Uhh right. Okay, makes sense, feel free to merge

@Longarithm Longarithm self-requested a review March 26, 2024 20:18
@jancionear jancionear added this pull request to the merge queue Mar 26, 2024
Merged via the queue into near:master with commit c2f9695 Mar 26, 2024
30 of 31 checks passed
@jancionear jancionear deleted the panopticon branch March 26, 2024 20:54
jancionear added a commit to jancionear/nearcore that referenced this pull request Mar 28, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 28, 2024
…10900)

Revert #10859 because it's causing trouble on statelessnet

This reverts commit c2f9695.
jancionear added a commit to jancionear/nearcore that referenced this pull request Mar 29, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 29, 2024
…ng back always recording storage (#10904)

Fixes the issue that caused #10859
to break validation.
The flag `charge_gas_for_trie_node_access` should be forwarded to the
new Trie created by `recording_reads`; otherwise the node counting won't
work correctly and there'll be invalid state.

Fix the issue and bring back #10859 which was reverted in
#10900.
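The bug and its fix can be illustrated with a simplified sketch. The `Trie` type here is a hypothetical stand-in, not nearcore's: the point is only that `recording_reads` must carry the flag over instead of resetting it:

```rust
struct Trie {
    charge_gas_for_trie_node_access: bool,
    recording: bool,
}

impl Trie {
    // Buggy version: building the recording trie silently reset the flag,
    // so node counting diverged from non-recording nodes.
    fn recording_reads_buggy(&self) -> Trie {
        Trie { charge_gas_for_trie_node_access: false, recording: true }
    }

    // Fixed version: forward the flag so gas/node counting stays identical
    // whether or not reads are being recorded.
    fn recording_reads(&self) -> Trie {
        Trie {
            charge_gas_for_trie_node_access: self.charge_gas_for_trie_node_access,
            recording: true,
        }
    }
}

fn main() {
    let trie = Trie { charge_gas_for_trie_node_access: true, recording: false };
    // The buggy path loses the flag, producing a different application result.
    assert!(!trie.recording_reads_buggy().charge_gas_for_trie_node_access);
    // The fixed path preserves it.
    let recording = trie.recording_reads();
    assert!(recording.recording && recording.charge_gas_for_trie_node_access);
}
```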