Always record storage when stateless validation is enabled #10859

Merged
merged 6 commits into near:master from jancionear:panopticon
Mar 26, 2024

Conversation

jancionear
Contributor

For stateless validation we need to record all trie reads performed when applying a chunk in order to prepare a PartialState that will be included in ChunkStateWitness

Initially it was only necessary to record trie reads when producing a chunk. Validators could read all the values from the provided PartialState without recording anything.

A recent change (#10703) introduced a soft limit on the size of PartialState. When applying a chunk we watch how much state was recorded, and once the amount of state exceeds the soft limit we stop applying the receipts in this chunk.
This needs to be done on both the chunk producer and chunk validator - if the chunk validator doesn't record reads and enforce the limit, it will produce a different result of chunk application, which would lead to validation failure.

This means that state should be recorded in all cases when a statelessly validated chunk is applied. Let's remove the configuration option that controls whether trie reads should be recorded (record_storage) and just record the reads on every chunk application (when statelessnet_protocol feature is enabled).
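The soft-limit behaviour can be sketched in a minimal, self-contained way. The constant, struct, and function names below (`RECORDED_STATE_SOFT_LIMIT`, `RecordingTrie`, `apply_receipts`) are illustrative stand-ins, not nearcore's actual API, and the 16 MiB value is assumed for the example:

```rust
const RECORDED_STATE_SOFT_LIMIT: usize = 16 * 1024 * 1024; // assumed 16 MiB, for illustration only

struct RecordingTrie {
    recorded_bytes: usize,
}

impl RecordingTrie {
    // Pretend each read records some number of trie-node bytes into the proof.
    fn read(&mut self, node_bytes: usize) {
        self.recorded_bytes += node_bytes;
    }
}

/// Apply receipts until the recorded state reaches the soft limit;
/// returns how many receipts were applied before stopping.
fn apply_receipts(trie: &mut RecordingTrie, receipt_costs: &[usize]) -> usize {
    let mut applied = 0;
    for &cost in receipt_costs {
        if trie.recorded_bytes >= RECORDED_STATE_SOFT_LIMIT {
            break; // defer the remaining receipts instead of applying them
        }
        trie.read(cost);
        applied += 1;
    }
    applied
}

fn main() {
    let mut trie = RecordingTrie { recorded_bytes: 0 };
    // Three receipts touching 10 MiB of state each: the second read pushes the
    // total past 16 MiB, so the third receipt is not applied.
    let applied = apply_receipts(&mut trie, &[10 << 20, 10 << 20, 10 << 20]);
    assert_eq!(applied, 2);
}
```

Because the producer and the validator run the same check, they stop at the same receipt and reach the same chunk application result.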

Refs: [zulip conversation](https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/State.20witness.20size.20limit/near/428313518)

@jancionear jancionear added the A-stateless-validation Area: stateless validation label Mar 22, 2024
@jancionear jancionear requested a review from a team as a code owner March 22, 2024 14:42
…idation is not enabled

prepare_transactions() doesn't produce a storage proof when stateless validation isn't enabled.
Let's disable tests that require this storage proof when stateless validation is disabled.
@Longarithm
Member

This is (going) to change the protocol, right? Then it needs to be gated by ProtocolFeature / protocol version, so we could smoothly roll it out to statelessnet and mainnet.

@jancionear
Contributor Author

This is (going) to change the protocol, right? Then it needs to be gated by ProtocolFeature / protocol version, so we could smoothly roll it out to statelessnet and mainnet.

I'd say that it's a fix for the protocol version introduced by #10703.
Adding the soft limit caused a protocol change, I'm just fixing the validators to also take this limit into account when applying chunks.

Using the `checked_feature!` macro covers both `statelessnet_protocol`
and `nightly`, so it's better to use it instead of checking for just
`statelessnet_protocol`.

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 6 lines in your changes missing coverage. Please review.

Project coverage is 71.63%. Comparing base (92e5938) to head (6c5dfac).

Files Patch % Lines
chain/chain/src/chain.rs 77.77% 0 Missing and 4 partials ⚠️
chain/chain/src/chain_update.rs 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10859      +/-   ##
==========================================
+ Coverage   71.61%   71.63%   +0.01%     
==========================================
  Files         758      758              
  Lines      151765   151789      +24     
==========================================
+ Hits       108692   108730      +38     
+ Misses      38537    38518      -19     
- Partials     4536     4541       +5     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.42% <0.00%> (-0.01%) ⬇️
integration-tests 37.39% <75.75%> (-0.01%) ⬇️
linux 70.09% <54.54%> (-0.15%) ⬇️
linux-nightly 71.12% <75.75%> (+0.02%) ⬆️
macos 54.53% <54.54%> (-0.18%) ⬇️
pytests 1.65% <0.00%> (-0.01%) ⬇️
sanity-checks 1.44% <0.00%> (-0.01%) ⬇️
unittests 67.28% <90.90%> (+0.01%) ⬆️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Comment on lines +1606 to +1607
if !checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION) {
println!("Test not applicable without StatelessValidation enabled");
Contributor

Does this mean that this test won't kick in before hitting testnet (with a stable build)? Isn't it too late?

Contributor Author

@jancionear jancionear Mar 22, 2024

This means that the test will only execute in builds where the default protocol version has stateless validation enabled.
In practice this means that the test will be enabled for builds with --features statelessnet_protocol and --features nightly. Both of those options are present in the CI, so the test will be run on every change, and it will be enabled on statelessnet builds.

Tbh I don't know what "stable" in `checked_feature` is supposed to mean; the macro just checks the protocol version:

#[macro_export]
macro_rules! checked_feature {
    ("stable", $feature:ident, $current_protocol_version:expr) => {{
        $crate::version::ProtocolFeature::$feature.protocol_version() <= $current_protocol_version
    }};

The code is here if anyone wants to take a look:

pub const PROTOCOL_VERSION: ProtocolVersion = if cfg!(feature = "statelessnet_protocol") {
    // Current StatelessNet protocol version.
    83
} else if cfg!(feature = "nightly_protocol") {
    // On nightly, pick big enough version to support all features.
    140
} else {
    // Enable all stable features.
    STABLE_PROTOCOL_VERSION
};

/// Both, outgoing and incoming tcp connections to peers, will be rejected if `peer's`
/// protocol version is lower than this.
pub const PEER_MIN_ALLOWED_PROTOCOL_VERSION: ProtocolVersion = STABLE_PROTOCOL_VERSION - 2;

#[macro_export]
macro_rules! checked_feature {
    ("stable", $feature:ident, $current_protocol_version:expr) => {{
        $crate::version::ProtocolFeature::$feature.protocol_version() <= $current_protocol_version
    }};
    ($feature_name:tt, $feature:ident, $current_protocol_version:expr) => {{
        #[cfg(feature = $feature_name)]
        let is_feature_enabled = $crate::version::ProtocolFeature::$feature.protocol_version()
            <= $current_protocol_version;
        #[cfg(not(feature = $feature_name))]
        let is_feature_enabled = {
            // Workaround unused variable warning
            let _ = $current_protocol_version;
            false
        };
        is_feature_enabled
    }};
}
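For illustration, the "stable" arm of the macro reduces to a plain version comparison. A self-contained sketch, assuming (hypothetically, for the example) that `StatelessValidationV0` carries protocol version 83, the StatelessNet version from the snippet above:

```rust
type ProtocolVersion = u32;

// Hypothetical value for this example; the real number comes from
// ProtocolFeature::StatelessValidationV0.protocol_version() in nearcore.
const STATELESS_VALIDATION_V0_VERSION: ProtocolVersion = 83;

// What checked_feature!("stable", StatelessValidationV0, v) boils down to:
// "is the feature's activation version at or below the running version?"
fn stateless_validation_enabled(current: ProtocolVersion) -> bool {
    STATELESS_VALIDATION_V0_VERSION <= current
}

fn main() {
    assert!(!stateless_validation_enabled(82)); // stable build below the feature version
    assert!(stateless_validation_enabled(83)); // statelessnet build (PROTOCOL_VERSION = 83)
    assert!(stateless_validation_enabled(140)); // nightly build (PROTOCOL_VERSION = 140)
}
```

So "stable" here just means "gated purely by protocol version, with no Cargo feature check", which matches the behaviour described in the comment above.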

Contributor Author

@jancionear jancionear Mar 22, 2024

We already have this kind of check in many other tests that don't work without stateless validation, e.g:

if !checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION) {

Contributor

@shreyan-gupta shreyan-gupta left a comment

Looks good! Could @Longarithm confirm whether it's fine to delete the `should_produce_state_witness_for_this_or_next_epoch` function? In the future, with single shard tracking, we shouldn't even require the condition `self.epoch_manager.is_chunk_producer_for_epoch(epoch_id, account_id)? || self.epoch_manager.is_chunk_producer_for_epoch(&next_epoch_id, account_id)?`?

storage_data_source: StorageDataSource::Db,
state_patch,
record_storage: self
    .should_produce_state_witness_for_this_or_next_epoch(me, block.header())?,
Contributor

`should_produce_state_witness_for_this_or_next_epoch` is used in one other place, in `prepare_transactions`. We should remove it from there as well and delete the function.

@Longarithm
Member

Good point, actually. We should still call `should_produce_state_witness_for_this_or_next_epoch`, but at the place where we save recorded storage to disk. The method still makes sense because this data is useless if the node is not a chunk producer. With single shard tracking, shard changes may still happen.

Comment on lines 869 to 871
if checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION) {
trie = trie.recording_reads();
}
Contributor Author

@jancionear jancionear Mar 25, 2024

The recording should probably also be enabled with the `shadow_chunk_validation` feature.

@jancionear
Contributor Author

v2:

  • Don't save the recorded state transition data to the database if we aren't a chunk producer for this or the next epoch. Validators don't need to persist this data, so we can save some disk space (refs Reducing hard drive validator footprint #10357).
  • Also record state transition data when the shadow_chunk_validation feature is enabled. I hope this change won't break anything with shadow validation (cc @pugachAG)
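The persistence rule in the first bullet could be sketched as follows. This is a simplified illustration: the predicate mirrors the role of `should_produce_state_witness_for_this_or_next_epoch`, and the `NodeRole` type is a hypothetical stand-in for the epoch-manager queries:

```rust
#[derive(Clone, Copy)]
struct NodeRole {
    chunk_producer_this_epoch: bool,
    chunk_producer_next_epoch: bool,
}

/// Persist recorded state transition data only if this node may need it to
/// produce a state witness, i.e. it is a chunk producer now or next epoch.
fn should_save_state_transition_data(role: NodeRole) -> bool {
    role.chunk_producer_this_epoch || role.chunk_producer_next_epoch
}

fn main() {
    let validator_only = NodeRole {
        chunk_producer_this_epoch: false,
        chunk_producer_next_epoch: false,
    };
    let upcoming_producer = NodeRole {
        chunk_producer_this_epoch: false,
        chunk_producer_next_epoch: true,
    };
    // A pure validator still records reads (to enforce the soft limit) but
    // skips the database write, saving disk space.
    assert!(!should_save_state_transition_data(validator_only));
    assert!(should_save_state_transition_data(upcoming_producer));
}
```

The key point is that recording and persisting are now decoupled: every node records, but only witness producers persist.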

@jancionear
Contributor Author

Merged master in hope that it fixes the cargo audit check.

@jancionear
Contributor Author

jancionear commented Mar 25, 2024

Merged master in hope that it fixes the cargo audit check.

Didn't help, opened #10876 about the problem.
It looks like the check is failing because the yaml-rust crate is unmaintained. I think we can safely ignore this warning. It's not a vulnerability, just an inactive maintainer.

@jancionear
Contributor Author

@Longarithm are you okay with this change?
It would be nice to get an approval because this change is needed for the next statelessnet release. And AFAIK Shreyan is OOO right now :c

Member

@Longarithm Longarithm left a comment

Sorry for the delay. I think we need to make the protocol version check stricter.

Comment on lines +871 to +873
if checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION)
|| cfg!(feature = "shadow_chunk_validation")
{
Member

PROTOCOL_VERSION is not enough; we need a check against the protocol version of the next epoch here. If we stabilize this version and some validator picks it up before the others, its performance will suffer because it will have to read all the nodes, unlike the others. And if the protocol version switch is delayed for some reason, the validator may miss blocks/chunks because of that.
We need the next epoch, not the current epoch, because state transitions must be recorded one epoch before the protocol upgrade. EpochManager has a method for that.
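The stricter gating suggested here could look roughly like this: consult the next epoch's protocol version instead of the compile-time `PROTOCOL_VERSION`. The `EpochInfo` type, the version constant, and the function are illustrative stand-ins for EpochManager's actual API:

```rust
type ProtocolVersion = u32;

const STATELESS_VALIDATION_VERSION: ProtocolVersion = 83; // illustrative value

// Stand-in for what EpochManager would return for the upcoming epoch.
struct EpochInfo {
    protocol_version: ProtocolVersion,
}

/// Record trie reads if the *next* epoch runs with stateless validation,
/// so the recorded data exists one epoch before the feature activates.
fn should_record_reads(next_epoch: &EpochInfo) -> bool {
    STATELESS_VALIDATION_VERSION <= next_epoch.protocol_version
}

fn main() {
    // Upgrade pending: the next epoch switches to the new version, so we must
    // already be recording in the current epoch.
    assert!(should_record_reads(&EpochInfo { protocol_version: 83 }));
    // No upgrade in sight: skip the recording overhead.
    assert!(!should_record_reads(&EpochInfo { protocol_version: 82 }));
}
```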

Contributor Author

Hmm, I imagined that before switching to stateless validation we would enable this feature (or make it the default), and then all nodes would start recording the data needed for stateless validation. At some point the protocol version would change and the nodes would start using this recorded data, but it's okay to start recording much earlier; without stateless validation the data would just be discarded.
Basically we don't need to start recording at the exact moment when it's needed; we can do it as soon as the updated node starts.

Is it that much of a performance hit? We have to do it in the epoch before switching to stateless validation, so it has to be performant enough for normal traffic. Anton's `shadow_chunk_validation` records data and is able to keep up with mainnet traffic.

Member

Anton's shadow_chunk_validation records data and it's able to keep up with the mainnet traffic.

Oh, that's nice.
I also realised that memtrie should be enabled no later than stateless validation, which will give the desired performance. Then it looks good!

Comment on lines +708 to +710
if checked_feature!("stable", StatelessValidationV0, PROTOCOL_VERSION)
|| cfg!(feature = "shadow_chunk_validation")
{
Member

Same as below. If we want to be super precise, we can use the current epoch's protocol version, because that recorded storage is not saved anywhere; it is used immediately on chunk production.

@@ -413,6 +419,7 @@ impl<'a> ChainUpdate<'a> {
block: &Block,
block_preprocess_info: BlockPreprocessInfo,
apply_chunks_results: Vec<(ShardId, Result<ShardUpdateResult, Error>)>,
should_save_state_transition_data: bool,
Member

Can be computed here, because `postprocess_block` has `me`.

Contributor Author

That's true, but `Chain` also calls `ChainUpdate::process_apply_chunk_result`, which doesn't have a `me` argument. This means that `should_produce_state_witness_for_this_or_next_epoch` would have to be available from both `Chain` and `ChainUpdate`; I can't just do everything inside `ChainUpdate`.

I guess it would be possible to move `should_produce_state_witness_for_this_or_next_epoch` to `EpochManager` and use it from both `Chain` and `ChainUpdate`, but I'm not convinced there's any real benefit in exchange for this additional work. At this point I'm tempted to just merge it and unblock the statelessnet release.

Member

@Longarithm Longarithm Mar 26, 2024

Uhh right. Okay, makes sense, feel free to merge

@Longarithm Longarithm self-requested a review March 26, 2024 20:18
@jancionear jancionear added this pull request to the merge queue Mar 26, 2024
Merged via the queue into near:master with commit c2f9695 Mar 26, 2024
30 of 31 checks passed
@jancionear jancionear deleted the panopticon branch March 26, 2024 20:54
jancionear added a commit to jancionear/nearcore that referenced this pull request Mar 28, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 28, 2024
…10900)

Revert #10859 because it's causing trouble on statelessnet

This reverts commit c2f9695.
jancionear added a commit to jancionear/nearcore that referenced this pull request Mar 29, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 29, 2024
…ng back always recording storage (#10904)

Fixes the issue that caused #10859
to break validation.
The flag `charge_gas_for_trie_node_access` should be forwarded to the
new Trie created by `recording_reads`; otherwise the node counting won't
work correctly and there'll be invalid state.

Fix the issue and bring back #10859 which was reverted in
#10900.
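The bug and its fix can be illustrated with a simplified sketch. The `Trie` type here is a hypothetical stand-in, not nearcore's: the point is only that `recording_reads` must carry the flag over instead of resetting it:

```rust
struct Trie {
    charge_gas_for_trie_node_access: bool,
    recording: bool,
}

impl Trie {
    // Buggy version: building the recording trie silently reset the flag,
    // so node counting diverged from non-recording nodes.
    fn recording_reads_buggy(&self) -> Trie {
        Trie { charge_gas_for_trie_node_access: false, recording: true }
    }

    // Fixed version: forward the flag so gas/node counting stays identical
    // whether or not reads are being recorded.
    fn recording_reads(&self) -> Trie {
        Trie {
            charge_gas_for_trie_node_access: self.charge_gas_for_trie_node_access,
            recording: true,
        }
    }
}

fn main() {
    let trie = Trie { charge_gas_for_trie_node_access: true, recording: false };
    // The buggy path loses the flag, producing a different application result.
    assert!(!trie.recording_reads_buggy().charge_gas_for_trie_node_access);
    // The fixed path preserves it.
    let recording = trie.recording_reads();
    assert!(recording.recording && recording.charge_gas_for_trie_node_access);
}
```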