Per-receipt hard limit on storage proof size using upper bound estimation #11069
Conversation
Codecov Report. Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #11069      +/-   ##
==========================================
- Coverage   71.08%   71.07%   -0.01%
==========================================
  Files         763      768       +5
  Lines      153077   153395     +318
  Branches   153077   153395     +318
==========================================
+ Hits       108808   109032     +224
- Misses      39828    39911      +83
- Partials     4441     4452      +11
```
…time To enforce the hard per-receipt limit we need to monitor how much storage proof has been recorded during execution of the receipt and halt the execution when the size of the generated storage proof goes over the limit. To achieve this the runtime needs to be able to see how much proof was recorded, so let's expose this information so that it's available from the runtime. `recorded_storage_size()` doesn't provide the exact size of the storage proof, as it doesn't cover some corner cases (see near#10890), so we use the `upper_bound` version to estimate how much storage proof could've been generated by the receipt. As long as the upper bound is under the limit we can be sure that the actual value is also under the limit.
Add a `RecordedStorageCounter` which functions similarly to `GasCounter`. After every trie operation (`storage_write`, `storage_read`, ...) the runtime calls `RecordedStorageCounter::observe_size` to keep track of the size of the recorded storage proof. If the size exceeds the allowed limit, the receipt execution fails, just like it would if the receipt went over the allowed gas limit. For now the limit is `usize::MAX`; the proper limit will be set in a follow-up commit.
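To illustrate the mechanism, here is a minimal sketch of such a counter. The struct and method names follow the PR's description, but the exact fields and signatures are assumptions, not the real nearcore API:

```rust
// Hypothetical sketch of RecordedStorageCounter; real nearcore code differs.
#[derive(Debug)]
pub enum HostError {
    /// Size of the recorded trie storage proof has exceeded the allowed limit.
    RecordedStorageExceeded { limit: usize },
}

pub struct RecordedStorageCounter {
    /// Upper-bound estimate of the storage proof recorded so far.
    observed_size: usize,
    /// Per-receipt hard limit (usize::MAX until the real limit lands).
    size_limit: usize,
}

impl RecordedStorageCounter {
    pub fn new(size_limit: usize) -> Self {
        Self { observed_size: 0, size_limit }
    }

    /// Called after every trie operation with the current upper-bound
    /// estimate; fails the receipt once the limit is exceeded.
    pub fn observe_size(&mut self, current_upper_bound: usize) -> Result<(), HostError> {
        self.observed_size = current_upper_bound;
        if self.observed_size > self.size_limit {
            Err(HostError::RecordedStorageExceeded { limit: self.size_limit })
        } else {
            Ok(())
        }
    }
}

fn main() {
    let mut counter = RecordedStorageCounter::new(4_000_000);
    assert!(counter.observe_size(3_999_999).is_ok());
    assert!(counter.observe_size(4_000_001).is_err());
    println!("receipt fails once the upper bound crosses the limit");
}
```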
Shadow validation metrics show that all receipts being executed on mainnet generate less than 2MB of recorded storage proof size (and the upper bound estimation is also less than 2MB). Let's set the per-receipt hard limit on recorded storage proof size to 4MB; it's 2x larger than the largest value observed in the wild, which gives us a bit of a safety margin. Making this limit too low could cause some existing contracts to fail if it turns out that they generate a big storage proof, so it's better to err on the side of caution. Adding this limit is a protocol change, so a new protocol version is introduced: 85.
…limit The `recorded_storage_size()` function can sometimes return a value which is less than the actual recorded size. See near#10890 for details. This means that a malicious actor could create a workload which would bypass the soft size limit and e.g. generate 10x more storage proof than allowed. To fix this problem let's use the upper bound estimation of the total recorded size. As long as the upper bound estimation is under the limit we can be sure that the actual value is also under the limit.
Data collected using shadow validation on mainnet traffic indicates that for 99% of chunks the upper bound of recorded storage proof size is less than 3MB. Let's set the soft per-chunk limit to 3MB. 99% of the chunks will be unaffected. For 1% of the chunks we will execute a bit fewer receipts in the chunk, so the chunk throughput can drop by ~1%, but 1% isn't that terrible. A 3MB soft limit along with a 4MB per-receipt limit gives us a solid guarantee that the total recorded storage proof size will be less than 7MB.
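The soft-limit behavior described above can be sketched roughly as follows. This is an illustration of the idea (receipts are executed until the upper-bound estimate crosses the soft limit, and the rest are deferred), not nearcore's actual chunk-application code; the function name and the per-receipt size inputs are hypothetical:

```rust
// 3 MB per-chunk soft limit, per the shadow-validation data above.
const PER_CHUNK_SOFT_LIMIT: usize = 3_000_000;

/// Returns how many receipts get executed in this chunk, given each
/// receipt's (hypothetical) upper-bound proof-size contribution.
/// The check happens *before* each receipt, so the last executed receipt
/// may push the total past the soft limit -- that is why the 4 MB
/// per-receipt hard limit bounds the total at 3 MB + 4 MB = 7 MB.
fn receipts_that_fit(proof_sizes: &[usize]) -> usize {
    let mut recorded_upper_bound = 0;
    for (i, size) in proof_sizes.iter().enumerate() {
        if recorded_upper_bound > PER_CHUNK_SOFT_LIMIT {
            // Soft limit hit: defer this and all later receipts.
            return i;
        }
        recorded_upper_bound += size;
    }
    proof_sizes.len()
}

fn main() {
    let sizes: [usize; 3] = [2_000_000, 2_000_000, 1_000_000];
    // The second receipt pushes the total to 4 MB, so the third is deferred.
    assert_eq!(receipts_that_fit(&sizes), 2);
    println!("2 of 3 receipts executed in this chunk");
}
```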
I still plan to add some tests, but other than that I'd say the PR is ready for review and we can start discussing the solution. Moving out of draft.
```diff
@@ -243,7 +243,7 @@ const STABLE_PROTOCOL_VERSION: ProtocolVersion = 66;
 /// Largest protocol version supported by the current binary.
 pub const PROTOCOL_VERSION: ProtocolVersion = if cfg!(feature = "statelessnet_protocol") {
     // Current StatelessNet protocol version.
-    84
+    85
```
Could you add a quick comment below `ProtocolFeature::StatelessnetShuffleShardAssignmentsForChunkProducers => 84,` saying the next protocol version just implements the hard limit and we should use 86 for the next feature?
Hmm, should I add a `ProtocolFeature` for the hard limit as well? I don't see any use for it right now, but in the future we might want to be able to check it (??). I don't know what the convention is for protocol changes.
Just to make sure I understand, why is there no need to check for it? Is it because it's checked in the runtime params?
I have to say it is rather confusing to see bumped protocol version without corresponding protocol feature. I would recommend adding it just for clarity or adding a comment somewhere.
> Just to make sure I understand, why is there no need to check for it? Is it because it's checked in the runtime params?
Yes, for previous protocol versions the limit is 9999999999 (+inf), so it will never be triggered. The limit becomes 4MB at protocol 85 and it's defined in the runtime params, so there's no need for protocol feature checks.
> I have to say it is rather confusing to see bumped protocol version without corresponding protocol feature. I would recommend adding it just for clarity or adding a comment somewhere.
I will add one 👍
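The version-gating described above (the limit lives in the runtime parameters and only changes value at protocol 85, so no `ProtocolFeature` check is needed in code) could be sketched roughly like this. The lookup function is hypothetical; in nearcore the values come from the runtime parameter config, not a match on the version:

```rust
// Illustrative only: nearcore reads this from runtime params config files.
fn per_receipt_storage_proof_size_limit(protocol_version: u32) -> u64 {
    if protocol_version >= 85 {
        4_000_000 // 4 MB hard limit from protocol 85 onward
    } else {
        9_999_999_999 // effectively +inf: the check can never trigger
    }
}

fn main() {
    // Before protocol 85 the limit is unreachable, so no feature check
    // is needed anywhere else in the code.
    assert_eq!(per_receipt_storage_proof_size_limit(84), 9_999_999_999);
    assert_eq!(per_receipt_storage_proof_size_limit(85), 4_000_000);
    println!("limit only becomes enforceable at protocol 85");
}
```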
```rust
    size_limit: usize,
}

impl RecordedStorageCounter {
```
I'm wondering if we can push this implementation to the ext.rs layer? We could have this check done for every trie operation like `storage_set`, `storage_get`, `storage_remove`, instead of remembering to call it in logic/mod.rs
The way I see it, per-receipt storage counting is very similar to what we have with gas, so it would be nice to have those in the same place to keep it somewhat consistent. The current implementation does exactly that, so I suggest keeping it.
LGTM!
One thing to note: in the current implementation previously executed receipts affect storage proof size for the currently executed one (if the value was previously recorded then it won't increase recorded storage size). This potentially could be a problem when a contract is executed successfully after other receipts, but then could fail when executed as a first receipt in the chunk. In practice I don't think it matters and I don't suggest addressing that, I don't believe it is worth the effort and added implementation complexity.
@jancionear please make sure to get an approval from someone from the contract runtime team before merging this
Oh, right. Well, I also don't think we should make the impl more complex to account for that. As a reference, we already have a similar effect with the chunk cache.
```diff
@@ -210,6 +210,8 @@ pub enum HostError {
     YieldPayloadLength { length: u64, limit: u64 },
     /// Yield resumption data id is malformed.
     DataIdMalformed,
+    /// Size of the recorded trie storage proof has exceeded the allowed limit.
+    RecordedStorageExceeded { size: usize, limit: usize },
```
nit: best to use `bytesize` for type safety and human friendliness
Hmm, with `bytesize` the error message started looking a bit funky:

```
0: Error: An error occurred during a `FunctionCall` Action, parameter is debug message.
ExecutionError("Size of the recorded trie storage proof has exceeded the allowed limit: 4.0 MB > 4.0 MB")
```

I guess it doesn't really make much sense to print how much storage proof was recorded, as this value will always be just a little bit over the limit. I'll change it so that it prints only the limit.
Yeah, with a single value it looks much more reasonable:

```
Error:
0: Error: An error occurred during a `FunctionCall` Action, parameter is debug message.
ExecutionError("Size of the recorded trie storage proof has exceeded the allowed limit (4.0 MB)")
```
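For context, the "4.0 MB" rendering comes from the `bytesize` crate. Below is a small hand-rolled sketch of that SI-style formatting, just to illustrate why it is friendlier than a raw byte count in the error message; it is not the crate's actual code:

```rust
// Minimal SI-style byte formatter, illustrating bytesize-like output.
fn human_size(bytes: u64) -> String {
    const UNITS: [&str; 4] = ["B", "KB", "MB", "GB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    // Divide by 1000 (SI units) until the value fits under the next unit.
    while value >= 1000.0 && unit < UNITS.len() - 1 {
        value /= 1000.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{} {}", bytes, UNITS[unit])
    } else {
        format!("{:.1} {}", value, UNITS[unit])
    }
}

fn main() {
    // The 4 MB limit renders as "4.0 MB" instead of "4000000".
    assert_eq!(human_size(4_000_000), "4.0 MB");
    assert_eq!(human_size(500), "500 B");
    println!("limit: {}", human_size(4_000_000));
}
```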
I hope it's okay to add a `bytesize` dependency to `near-vm-runner` 👀
I was able to construct a receipt which produces 24MB of storage proof by reading 24 values, 1MB each. Tested it manually on localnet and it seems that the hard limit works as it should. It rejected the 24MB receipt, but a 3MB receipt went through without any problems. Now I'll try to make it into a proper integration test or something like that (???). Not sure what's the proper place for a test like that.
* Use `bytesize` to print the limit in a human-readable format
* Don't print the recorded value, as it's always just over the limit. An error message saying "error: 4MB > 4MB" looks stupid. Printing only the limit is more reasonable.
Alright added an integration test which tests both the per-receipt hard limit and the per-chunk soft limit.
…11507) To limit the amount of storage proof generated during chunk application we calculate the upper bound estimation of how big the storage proof will be, and stop executing receipts when this estimated size gets too big. When estimating we assume that every trie removal generates 2000 bytes of storage proof, because this is the maximum size that a malicious attacker could generate (#11069, #10890). This estimation was meant to limit the size of proof generated while executing receipts, but currently it also applies to other trie removals performed by the runtime, for example when removing receipts from the delayed receipt queue. This is really wasteful - removing 1000 receipts would cause the estimation to jump by 2MB, hitting the soft limit. We don't really need to charge this much for internal operations performed by the runtime, they aren't malicious. Let's change it so that only contracts are charged extra for removals. This will avoid the extra-big estimation caused by normal queue manipulation. Refs: https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Large.20number.20of.20delayed.20receipts.20in.20statelessnet/near/442878068
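The fix described above can be sketched as follows. The enum and function names are hypothetical, introduced only to show the idea of charging the 2000-byte removal surcharge for contract-driven removals while exempting runtime-internal work like delayed-queue manipulation:

```rust
// Hypothetical names; nearcore's actual mechanism differs in detail.
enum TrieCaller {
    /// Removal requested by contract code (potentially adversarial).
    Contract,
    /// Removal performed by the runtime itself, e.g. popping the
    /// delayed receipt queue (trusted, not attacker-controlled).
    RuntimeInternal,
}

/// Extra bytes added to the storage-proof upper bound per trie removal.
fn removal_surcharge(caller: TrieCaller) -> usize {
    match caller {
        TrieCaller::Contract => 2000, // worst-case deferred proof per removal
        TrieCaller::RuntimeInternal => 0, // no surcharge for internal work
    }
}

fn main() {
    assert_eq!(removal_surcharge(TrieCaller::Contract), 2000);
    // Removing 1000 delayed receipts no longer inflates the estimate by 2 MB.
    assert_eq!(removal_surcharge(TrieCaller::RuntimeInternal), 0);
    println!("only contract-driven removals are charged the surcharge");
}
```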
During receipt execution we record all touched nodes from the pre-state trie. Those recorded nodes form the storage proof that is sent to validators, and validators use it to execute the receipts and validate the results.
In #9378 it's stated that in a worst-case scenario a single receipt can generate hundreds of megabytes of storage proof. That would cause the `ChunkStateWitness` to also be hundreds of megabytes in size, and there would be problems with sending this much data over the network. Because of that we need to limit the size of the storage proof. We plan to have two limits: a hard per-receipt limit and a soft per-chunk limit.
Most of the hard-limit code is straightforward - we need to track the size of recorded storage and fail the receipt if it goes over the limit.
But there is one ugly problem: #10890. Because of the way the current `TrieUpdate` works we don't record all of the storage proof in real time. There are some corner cases (deleting one of two children of a branch) in which some nodes are not recorded until we do `finalize()` at the end of the chunk. This means that we can't really use `Trie::recorded_storage_size()` to limit the size, as it isn't fully accurate. If we did that, a malicious actor could prepare receipts which seem to have only 1MB of storage proof during execution, but actually record 10MB during `finalize()`. There is a long discussion in #10890 along with some possible solution ideas, please read that if you need more context.
This PR implements Idea 1 from #10890.
Instead of using `Trie::recorded_storage_size()` we'll use `Trie::recorded_storage_size_upper_bound()`, which estimates the upper bound of the recorded storage size by assuming that every trie removal records an additional 2000 bytes. As long as the upper bound is below the limit we can be sure that the real recorded size is also below the limit.
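The estimation itself is simple enough to sketch. This is an illustration of the formula (real-time recorded bytes plus a flat 2000-byte charge per removal), with hypothetical struct and field names rather than the actual `Trie` internals:

```rust
// Flat per-removal charge: the most a deferred (finalize-time) node
// record could add, per the discussion in #10890.
const REMOVAL_UPPER_BOUND_BYTES: usize = 2000;

// Hypothetical stand-in for the Trie's recording state.
struct ProofSizeEstimator {
    recorded_bytes: usize, // nodes recorded in real time
    removals: usize,       // trie removals seen so far
}

impl ProofSizeEstimator {
    /// Upper bound on the final recorded size: whatever was recorded so
    /// far, plus the worst case for every removal whose nodes may only
    /// be recorded at finalize().
    fn recorded_storage_size_upper_bound(&self) -> usize {
        self.recorded_bytes + self.removals * REMOVAL_UPPER_BOUND_BYTES
    }
}

fn main() {
    let est = ProofSizeEstimator { recorded_bytes: 1_000_000, removals: 500 };
    // 1 MB recorded + 500 * 2000 B = 2 MB upper bound.
    assert_eq!(est.recorded_storage_size_upper_bound(), 2_000_000);
    println!("upper bound: {} bytes", est.recorded_storage_size_upper_bound());
}
```

This over-charging of removals is exactly why the estimate can exaggerate the real size, sometimes by 20+ times, as noted below.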
It's a rough estimation, which often exaggerates the actual recorded size (even by 20+ times), but it could be a good-enough/MVP solution for now. Doing it in a better way would require a lot of refactoring in the Trie code. We're now moving fast, so I decided to go with this solution for now.
The upper bound calculation has been added in a previous PR along with the metrics to see if using such a rough estimation is viable: #11000
I set up a mainnet node with shadow validation to gather some data about the size distribution with mainnet traffic: Metrics link
The metrics show that:

* all receipts executed on mainnet generate less than 2MB of recorded storage proof (and the upper bound estimation is also below 2MB)
* for 99% of chunks the upper bound of the recorded storage proof size is below 3MB

Based on this I believe that we can:

* set the per-receipt hard limit to 4MB
* set the per-chunk soft limit to 3MB
Having a 4MB per-receipt hard limit and a 3MB per-chunk soft limit would give us a hard guarantee that for all chunks the total storage proof size is below 7MB.
It's worth noting that gas usage already limits the storage proof size quite effectively. For 98% of chunks the storage proof size is already below 2MB, so the limit isn't really needed for typical mainnet traffic. The limit matters mostly for stopping malicious actors that'd try to DoS the network by generating large storage proofs.
Fixes: #11019