
Per-receipt hard limit on storage proof size using upper bound estimation #11069

Merged 10 commits on Apr 19, 2024

Conversation


@jancionear jancionear commented Apr 15, 2024

During receipt execution we record all touched nodes from the pre-state trie. Those recorded nodes form the storage proof that is sent to validators, and validators use it to execute the receipts and validate the results.

In #9378 it's stated that in the worst case a single receipt can generate hundreds of megabytes of storage proof. That would be a problem: the ChunkStateWitness would also be hundreds of megabytes in size, and sending that much data over the network would be infeasible.

Because of that we need to limit the size of the storage proof. We plan to have two limits:

  • per-chunk soft limit - once a chunk has more than X MB of storage proof we stop processing new receipts, and move the remaining ones to the delayed receipt queue. This has been implemented in [stateless_validation] Add a soft size limit for state witness #10703
  • per-receipt hard limit - once a receipt generates more than X MB of storage proof we fail the receipt, similarly to what happens when a receipt goes over the allowed gas limit. This one is implemented in this PR.
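As a rough sketch, the two limits interact during chunk application roughly like this (illustrative names and simplified control flow, not the actual nearcore implementation; in nearcore the sizes come from the trie's recorded-size estimate rather than per-receipt inputs):

```rust
// Hypothetical sketch of the two storage proof limits.
const PER_CHUNK_SOFT_LIMIT: usize = 3_000_000; // stop scheduling new receipts
const PER_RECEIPT_HARD_LIMIT: usize = 4_000_000; // fail the offending receipt

#[derive(Debug, PartialEq)]
enum ReceiptOutcome {
    Executed,
    FailedHardLimit,
}

/// Returns the outcome of each processed receipt plus the receipts
/// deferred to the delayed queue by the soft limit.
fn apply_receipts(receipt_proof_sizes: &[usize]) -> (Vec<ReceiptOutcome>, Vec<usize>) {
    let mut recorded = 0usize;
    let mut outcomes = Vec::new();
    let mut delayed = Vec::new();
    for (i, &size) in receipt_proof_sizes.iter().enumerate() {
        if recorded > PER_CHUNK_SOFT_LIMIT {
            // Soft limit hit: remaining receipts go to the delayed queue.
            delayed.extend_from_slice(&receipt_proof_sizes[i..]);
            break;
        }
        if size > PER_RECEIPT_HARD_LIMIT {
            // Hard limit: the receipt itself fails, like running out of gas.
            outcomes.push(ReceiptOutcome::FailedHardLimit);
        } else {
            recorded += size;
            outcomes.push(ReceiptOutcome::Executed);
        }
    }
    (outcomes, delayed)
}

fn main() {
    let (outcomes, delayed) = apply_receipts(&[2_000_000, 5_000_000, 2_000_000, 1_000_000]);
    println!("{:?} delayed={:?}", outcomes, delayed);
}
```

Note the asymmetry: the soft limit throttles the chunk as a whole (receipts are merely postponed), while the hard limit fails an individual receipt outright.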

Most of the hard-limit code is straightforward - we need to track the size of recorded storage and fail the receipt if it goes over the limit.
But there is one ugly problem: #10890. Because of the way current TrieUpdate works we don't record all of the storage proof in real time. There are some corner cases (deleting one of two children of a branch) in which some nodes are not recorded until we do finalize() at the end of the chunk. This means that we can't really use Trie::recorded_storage_size() to limit the size, as it isn't fully accurate. If we do that, a malicious actor could prepare receipts which seem to have only 1MB of storage proof during execution, but actually record 10MB during finalize().
There is a long discussion in #10890 along with some possible solution ideas, please read that if you need more context.

This PR implements Idea 1 from #10890.
Instead of using Trie::recorded_storage_size() we'll use Trie::recorded_storage_size_upper_bound(), which estimates the upper bound of recorded storage size by assuming that every trie removal records additional 2000 bytes:

    /// Size of the recorded state proof plus some additional size added to cover removals.
    /// An upper-bound estimation of the true recorded size after finalization.
    /// See https://github.com/near/nearcore/issues/10890 and https://github.com/near/nearcore/pull/11000 for details.
    pub fn recorded_storage_size_upper_bound(&self) -> usize {
        // Charge 2000 bytes for every removal
        let removals_size = self.removal_counter.saturating_mul(2000);
        self.recorded_storage_size().saturating_add(removals_size)
    }

As long as the upper bound is below the limit we can be sure that the real recorded size is also below the limit.
It's a rough estimation, which often exaggerates the actual recorded size (even by 20+ times), but it could be a good-enough/MVP solution for now. Doing it in a better way would require a lot of refactoring in the Trie code. We're now moving fast, so I decided to go with this solution for now.
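To make the arithmetic concrete, here is a standalone mirror of that estimation (the real method lives on nearcore's Trie; this struct is only a stand-in):

```rust
// Stand-in for the trie's recording statistics, to illustrate the
// upper-bound arithmetic quoted above.
struct RecordingStats {
    recorded_storage_size: usize,
    removal_counter: usize,
}

impl RecordingStats {
    /// Upper bound: true recorded size plus 2000 bytes charged per removal.
    fn recorded_storage_size_upper_bound(&self) -> usize {
        let removals_size = self.removal_counter.saturating_mul(2000);
        self.recorded_storage_size.saturating_add(removals_size)
    }
}

fn main() {
    // A receipt that recorded 100 KB and performed 500 removals is
    // estimated at 100_000 + 500 * 2000 = 1_100_000 bytes -- an 11x
    // exaggeration if none of those removals actually records extra nodes.
    let stats = RecordingStats { recorded_storage_size: 100_000, removal_counter: 500 };
    println!("{}", stats.recorded_storage_size_upper_bound());
}
```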

The upper bound calculation has been added in a previous PR along with the metrics to see if using such a rough estimation is viable: #11000

I set up a mainnet node with shadow validation to gather some data about the size distribution with mainnet traffic: Metrics link
[shadow validation metrics screenshots]

The metrics show that:

  • For all receipts both the recorded size and the upper bound estimate are below 2MB
  • The overwhelming majority of receipts generate < 50KB of storage proof
  • For all chunks the upper bound estimate is below 6MB
  • For 99% of chunks the upper bound estimate is below 3MB

Based on this I believe that we can:

  • Set the hard per-receipt limit to 4MB. All receipts were below 2MB, but it's good to have a bit of a safety margin here. This is a hard limit, so it might break existing contracts if they turn out to generate more storage proof than the limit.
  • Set the soft per-chunk limit to 3MB. 99% of chunks will not be affected by this limit. The 1% of chunks that hit the limit will execute fewer receipts, with the rest of the receipts put into the delayed receipt queue. This slightly lowers the throughput of a single chunk, but the slowdown is small, ~1%.

Having a 4MB per-receipt hard limit and a 3MB per-chunk soft limit gives us a hard guarantee that for all chunks the total storage proof size is below 7MB: the soft limit stops scheduling new receipts once 3MB has been recorded, and the last scheduled receipt can add at most 4MB more.

It's worth noting that gas usage already limits the storage proof size quite effectively. For 98% of chunks the storage proof size is already below 2MB, so the limit isn't really needed for typical mainnet traffic. The limit matters mostly for stopping malicious actors that'd try to DoS the network by generating large storage proofs.

Fixes: #11019

@jancionear jancionear added the A-stateless-validation Area: stateless validation label Apr 15, 2024
@jancionear jancionear requested a review from a team as a code owner April 15, 2024 11:11
@jancionear jancionear requested a review from wacban April 15, 2024 11:11
@jancionear jancionear marked this pull request as draft April 15, 2024 11:12

codecov bot commented Apr 15, 2024

Codecov Report

Attention: Patch coverage is 86.79245%, with 7 lines in your changes missing coverage. Please review.

Project coverage is 71.07%. Comparing base (85d6b90) to head (8912e9a).
Report is 25 commits behind head on master.

Files Patch % Lines
...ar-vm-runner/src/logic/recorded_storage_counter.rs 81.81% 1 Missing and 3 partials ⚠️
runtime/near-vm-runner/src/logic/logic.rs 66.66% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11069      +/-   ##
==========================================
- Coverage   71.08%   71.07%   -0.01%     
==========================================
  Files         763      768       +5     
  Lines      153077   153395     +318     
  Branches   153077   153395     +318     
==========================================
+ Hits       108808   109032     +224     
- Misses      39828    39911      +83     
- Partials     4441     4452      +11     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.43% <0.00%> (+<0.01%) ⬆️
integration-tests 36.86% <81.13%> (+0.12%) ⬆️
linux 69.47% <62.26%> (-0.04%) ⬇️
linux-nightly 70.55% <86.79%> (-0.01%) ⬇️
macos 54.22% <45.28%> (-0.05%) ⬇️
pytests 1.66% <0.00%> (+<0.01%) ⬆️
sanity-checks 1.45% <0.00%> (+<0.01%) ⬆️
unittests 66.74% <66.03%> (-0.04%) ⬇️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


…time

To enforce the hard per-receipt limit we need to monitor how much storage
proof has been recorded during execution of the receipt and halt the
execution when the size of generated storage proof goes over the limit.

To achieve this the runtime needs to be able to see how much proof was
recorded, so let's expose this information to the runtime.

`recorded_storage_size()` doesn't provide the exact size of storage
proof, as it doesn't cover some corner cases (see near#10890),
so we use the `upper_bound` version to estimate how much storage proof
could've been generated by the receipt. As long as the upper bound is
under the limit we can be sure that the actual value is also
under the limit.
Add a `RecordedStorageCounter` which functions similarly to `GasCounter`.
After every trie operation (`storage_write`, `storage_read`, ...) the
runtime calls `RecordedStorageCounter::observe_size` to keep track
of the size of the recorded storage proof. If the size exceeds the allowed
limit, the receipt execution fails, just as it would if the receipt
went over the allowed gas limit.

For now the limit is `usize::MAX`; the proper limit will be set
in a follow-up commit.
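A minimal sketch of the counter described above, assuming the names from the commit message (exact signatures in nearcore may differ):

```rust
// Hedged sketch of a per-receipt recorded-storage counter, analogous to
// a gas counter. Field and method names follow the PR's description but
// are illustrative, not the actual nearcore definitions.
#[derive(Debug, PartialEq)]
struct RecordedStorageExceeded {
    limit: usize,
}

struct RecordedStorageCounter {
    /// Trie's upper-bound estimate at the start of this receipt.
    initial_size: usize,
    /// Per-receipt hard limit on recorded storage proof size.
    size_limit: usize,
}

impl RecordedStorageCounter {
    fn new(current_recorded_size: usize, size_limit: usize) -> Self {
        Self { initial_size: current_recorded_size, size_limit }
    }

    /// Called after every trie operation with the trie's current
    /// upper-bound estimate; errors once this receipt's share exceeds the limit.
    fn observe_size(&self, current_upper_bound: usize) -> Result<(), RecordedStorageExceeded> {
        let receipt_size = current_upper_bound.saturating_sub(self.initial_size);
        if receipt_size > self.size_limit {
            Err(RecordedStorageExceeded { limit: self.size_limit })
        } else {
            Ok(())
        }
    }
}

fn main() {
    let counter = RecordedStorageCounter::new(1_000_000, 4_000_000);
    assert!(counter.observe_size(4_500_000).is_ok()); // 3.5 MB recorded by this receipt
    assert!(counter.observe_size(5_500_000).is_err()); // 4.5 MB: over the hard limit
    println!("ok");
}
```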
Shadow validation metrics show that all receipts being executed on
mainnet generate less than 2MB of recorded storage proof size (and
the upper bound estimation is also less than 2MB).

Let's set the per-receipt hard limit on recorded storage proof size
to 4MB. That's 2x larger than the largest value observed in the wild,
which gives us a bit of a safety margin. Making this limit too low
could cause some existing contracts to fail if it turns out that they
generate a big storage proof, so it's better to err on the side of caution.

Adding this limit is a protocol change, so a new protocol version
is introduced - 85.
…limit

The `recorded_storage_size()` function can sometimes return a value
which is less than the actual recorded size. See near#10890
for details.
This means that a malicious actor could create a workload which would
bypass the soft size limit and e.g. generate 10x more storage proof
than allowed.

To fix this problem let's use the upper bound estimation of the total
recorded size. As long as the upper bound estimation is under the
limit we can be sure that the actual value is also under the limit.
Data collected using shadow validation on mainnet traffic indicates
that for 99% of chunks the upper bound of recorded storage proof
size is less than 3MB.

Let's set the soft per-chunk limit to 3MB. 99% of the chunks will be
unaffected. For 1% of the chunks we will execute a bit fewer
receipts in the chunk, so the chunk throughput can drop by ~1%,
but 1% isn't that terrible.

A 3MB soft limit along with a 4MB per-receipt limit gives us a solid guarantee
that the total recorded storage proof size will be less than 7MB.

jancionear commented Apr 16, 2024

I still plan to add some tests, but other than that I'd say the PR is ready for review and we can start discussing the solution. Moving out of draft.

@jancionear jancionear marked this pull request as ready for review April 16, 2024 14:00
@jancionear jancionear requested review from nagisa, pugachAG, Longarithm and shreyan-gupta and removed request for wacban April 16, 2024 14:00
@@ -243,7 +243,7 @@ const STABLE_PROTOCOL_VERSION: ProtocolVersion = 66;
 /// Largest protocol version supported by the current binary.
 pub const PROTOCOL_VERSION: ProtocolVersion = if cfg!(feature = "statelessnet_protocol") {
     // Current StatelessNet protocol version.
-    84
+    85
Contributor

Could you add a quick comment below ProtocolFeature::StatelessnetShuffleShardAssignmentsForChunkProducers => 84, saying the next protocol version just implements the hard limit and we should use 86 for the next feature?

Contributor Author

Hmm should I add a ProtocolFeature for the hard limit as well?
I don't see any use of it for now, but in the future we might want to be able to check it (??). I don't know what's the convention for protocol changes.

Contributor

Just to make sure I understand, why is there no need to check for it? Is it because it's checked in the runtime params?

I have to say it is rather confusing to see bumped protocol version without corresponding protocol feature. I would recommend adding it just for clarity or adding a comment somewhere.

Contributor Author

Just to make sure I understand, why is there no need to check for it? Is it because it's checked in the runtime params?

Yes, for previous protocol versions the limit is 9999999999 (+inf), so it will never be triggered. The limit becomes 4MB at protocol 85 and it's defined in the runtime params, so there's no need for protocol feature checks.
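This versioning logic can be sketched as follows (the lookup function is hypothetical; in nearcore the value comes from the runtime parameter config, but the numbers match the discussion above):

```rust
// Illustrative sketch of why no explicit ProtocolFeature check is needed:
// the limit itself is a runtime parameter keyed by protocol version, and
// older versions carry an effectively infinite value, so the check is a
// no-op before protocol 85.
type ProtocolVersion = u32;

fn per_receipt_storage_proof_size_limit(version: ProtocolVersion) -> u64 {
    if version >= 85 {
        4_000_000 // 4 MB hard limit from protocol 85 onward
    } else {
        9_999_999_999 // effectively +inf: the limit never triggers
    }
}

fn main() {
    println!("v84: {}", per_receipt_storage_proof_size_limit(84));
    println!("v85: {}", per_receipt_storage_proof_size_limit(85));
}
```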

I have to say it is rather confusing to see bumped protocol version without corresponding protocol feature. I would recommend adding it just for clarity or adding a comment somewhere.

I will add one 👍

size_limit: usize,
}

impl RecordedStorageCounter {
Contributor

I'm wondering if we can push this implementation to the ext.rs layer? We could have this check done for every trie operations like storage_set, storage_get, storage_remove, instead of remembering to have to call it in logic/mod.rs

Contributor

The way I see it per-receipt storage counting is very similar to what we have with gas, so it would be nice to have those in the same place to keep it somewhat consistent. Current implementation does exactly that, so I suggest to keep it.

Contributor

@pugachAG pugachAG left a comment

LGTM!
One thing to note: in the current implementation previously executed receipts affect storage proof size for the currently executed one (if the value was previously recorded then it won't increase recorded storage size). This potentially could be a problem when a contract is executed successfully after other receipts, but then could fail when executed as a first receipt in the chunk. In practice I don't think it matters and I don't suggest addressing that, I don't believe it is worth the effort and added implementation complexity.
@jancionear please make sure to get an approval from someone from the contract runtime team before merging this

@Longarithm
Copy link
Member

previously executed receipts affect storage proof size for the currently executed one (if the value was previously recorded then it won't increase recorded storage size)

Oh, right. Well, I also don't think we should make impl more complex to account for that. As a reference, we already have similar effect for chunk cache.


@@ -210,6 +210,8 @@ pub enum HostError {
     YieldPayloadLength { length: u64, limit: u64 },
     /// Yield resumption data id is malformed.
     DataIdMalformed,
+    /// Size of the recorded trie storage proof has exceeded the allowed limit.
+    RecordedStorageExceeded { size: usize, limit: usize },
Contributor

nit: best to use bytesize for type safety and human friendliness

Contributor Author

Hmm with bytesize the error message started looking a bit funky:

   0: Error: An error occurred during a `FunctionCall` Action, parameter is debug message.
      ExecutionError("Size of the recorded trie storage proof has exceeded the allowed limit: 4.0 MB > 4.0 MB")

I guess it doesn't really make much sense to print how much storage proof was recorded, as this value will always be just a little bit over the limit. I'll change it so that it prints only the limit.

Contributor Author

Yeah with a single value it looks much more reasonable:

Error: 
   0: Error: An error occurred during a `FunctionCall` Action, parameter is debug message.
      ExecutionError("Size of the recorded trie storage proof has exceeded the allowed limit (4.0 MB)")


Contributor Author

I hope it's okay to add a bytesize dependency to near-vm-runner 👀


jancionear commented Apr 17, 2024

I still plan to add some tests

I was able to construct a receipt which produces 24MB of storage proof by reading 24 values, 1MB each. Tested it manually on localnet and it seems that the hard limit works as it should. It rejected the 24MB receipt, but a 3MB receipt went through without any problems. Now I'll try to make it into a proper integration test or something like that (???). Not sure what's the proper place for a test like that.
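The shape of that workload can be sketched as follows (a deliberate simplification that ignores trie-node overhead, so it's a lower bound on the real proof size; the actual test uses functions added to test-contract-rs):

```rust
// Sketch of the manual-test workload: every value a receipt reads pulls
// roughly its own size into the storage proof, so reading 24 values of
// 1 MB each yields a ~24 MB proof.
fn proof_size_of_reads(value_sizes: &[usize]) -> usize {
    // Ignores per-node overhead of the trie path; lower bound on the
    // real recorded size.
    value_sizes.iter().sum()
}

fn main() {
    let sizes = vec![1_000_000; 24]; // 24 reads of 1 MB each
    println!("{}", proof_size_of_reads(&sizes));
}
```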

* Use bytesize to print the limit in a human-readable format
* Don't print the recorded value as it's always just over the limit.
  An error message saying "error: 4MB > 4MB" looks stupid. Printing
  only the limit is more reasonable.

jancionear commented Apr 18, 2024

Now I'll try to make it into a proper integration test or something like that

Alright added an integration test which tests both the per-receipt hard limit and the per-chunk soft limit.
The test is based on increase_storage_compute_cost.rs.

  • Added two new functions to test-contract-rs to make it easy to produce receipts with large storage proofs
  • Tested that a receipt with 3MB storage proof is accepted
  • Tested that a receipt with 20MB storage proof is rejected because of the hard limit
  • Tested that spawning 3 receipts with 2MB of storage proof will cause the third receipt to be postponed to the next block because of the soft limit

@jancionear jancionear added this pull request to the merge queue Apr 19, 2024
Merged via the queue into near:master with commit a1a01b4 Apr 19, 2024
28 of 29 checks passed
@jancionear jancionear deleted the hard-limit branch April 19, 2024 17:13
github-merge-queue bot pushed a commit that referenced this pull request Jun 6, 2024
…11507)

To limit the amount of storage proof generated during chunk application
we calculate an upper-bound estimate of how big the storage proof
will be, and stop executing receipts when this estimated size gets too
big. When estimating we assume that every trie removal generates 2000
bytes of storage proof, because this is the maximum size that a
malicious attacker could generate (#11069, #10890).

This estimation was meant to limit the size of proof generated while
executing receipts, but currently it also applies to other trie removals
performed by the runtime, for example when removing receipts from the
delayed receipt queue. This is really wasteful - removing 1000 receipts
would cause the estimation to jump by 2MB, hitting the soft limit. We
don't really need to charge this much for internal operations performed
by the runtime; they aren't malicious. Let's change it so that only
contracts are charged extra for removals. This will avoid the extra-big
estimation caused by normal queue manipulation.
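The change described above can be sketched like this (illustrative names; the point is that only contract-issued removals increment the charged counter):

```rust
// Sketch of the follow-up change: runtime-internal removals (e.g. popping
// the delayed receipt queue) no longer carry the 2000-byte upper-bound
// surcharge; only removals issued by contract code do.
enum RemovalSource {
    Contract,
    RuntimeInternal,
}

struct UpperBoundEstimator {
    charged_removals: usize,
    recorded_size: usize,
}

impl UpperBoundEstimator {
    fn on_removal(&mut self, source: RemovalSource) {
        if let RemovalSource::Contract = source {
            self.charged_removals += 1;
        }
    }

    fn upper_bound(&self) -> usize {
        self.recorded_size + self.charged_removals * 2000
    }
}

fn main() {
    let mut est = UpperBoundEstimator { charged_removals: 0, recorded_size: 100_000 };
    // Popping 1000 delayed receipts no longer inflates the estimate by 2 MB...
    for _ in 0..1000 {
        est.on_removal(RemovalSource::RuntimeInternal);
    }
    // ...while contract-issued removals are still charged.
    for _ in 0..10 {
        est.on_removal(RemovalSource::Contract);
    }
    println!("{}", est.upper_bound()); // 100_000 + 10 * 2000
}
```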

Refs:
https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Large.20number.20of.20delayed.20receipts.20in.20statelessnet/near/442878068
Labels
A-stateless-validation Area: stateless validation
Development

Successfully merging this pull request may close these issues.

[Stateless validation] per-receipt hard limit on recorded trie storage proof size
6 participants