-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not re-prepare PVFs if not needed #4211
Do not re-prepare PVFs if not needed #4211
Conversation
|
||
let mut enc = b"prep".to_vec(); | ||
for param in &self.0 { | ||
if matches!(param, StackLogicalMax(_) | PvfPrepTimeout(..) | WasmExtBulkMemory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks brittle wrt adding new parameters. I would rewrite and use !matches
such that any new parameter added is considered preparation related unless we explicitly name it here as non-preparation related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an obvious improvement I've overseen, thank you for the suggestion!
Another question I'm trying to answer: is PvfPrepTimeout
a parameter that affects preparation indeed?
That may be a bit counterintuitive, but I'm more inclined to say "no" than "yes". We have a strict pre-checking timeout, and we never re-pre-check, even if executor environment params change. The preparation timeout is lenient, and anyway, those timeouts are non-deterministic. So I don't really see much sense in re-preparing artifacts if this timeout changes (especially given that we're not going to ever decrease them). WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just try to see what would happen if we say no. Let's first pick the likely more troublesome situation, we for some reason reduce the timeout:
Artifacts are not being recompiled, but a node that restarts will and may now hit a timeout it did not hit before. If I am not mistaken, the node will not dispute, but simply don't vote - right? Now, if that happens on just a few nodes, not too much harm done.
Now, if this is a general problem, everytime a node restarts, we have another machine not voting. So we might run into finality issues.
Problems:
- Potential finality stall.
- The issue will only show up gradually over time and might be hard to debug. Real issues might only appear weeks later.
Now, let's flip it and let's say "yes". Now all validators will recompile immediately, including backers. I am assuming that anything that got backed with the old artifact, will also be approved and disputed with the old artifact (this is true, right?). In that case the situation would be way better, because now we only have a single parachain that will completely cease to make progress, but all other paras and the relay chain won't be affected at all. Only the backers for that para will try to prepare and fail, as long as they are assigned.
If we assume we will only ever increase that timeout, then of course this should be fine, but is this a sound assumption? Why would we change the timeout at all?
- Either, we are being attacked and someone abuses those long timeouts - we would need to reduce.
- We have legitimate PVFs that run into the limit ... we would need to increase or fix otherwise.
Couldn't say, which scenario is more likely. That we are never ever re-prechecking is more a bug than anything. What we should actually be doing is re-prechecking and only enacting the new paras, once this was successful. That would also avoid the finality issues on changing the parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the node will not dispute, but simply don't vote - right?
Yes, I believe so. Although we treat the preparation timeout as a deterministic error (as it hasn't failed during the pre-check), we don't dispute.
I am assuming that anything that got backed with the old artifact, will also be approved and disputed with the old artifact (this is true, right?).
We should always use execution parameters from the session where the candidate was produced, so yes, they should always be executed with the same artifact.
we only have a single parachain that will completely cease to make progress
Why do you think it's only a single parachain? The timeout is per-network per-session, so after passing the session boundary where this parameter change is enacted, all the artifacts should be re-prepared, as we saw during the Kusama and Polkadot incidents.
If we assume we will only ever increase that timeout, then of course this should be fine, but is this a sound assumption?
I remember we talked about decreasing the timeouts as not being safe, as it may make PVFs that are already on-chain fail. At the very least, we should have some tooling that could be used to check all the already-existing PVFs not to fail with the new set of executor parameters on the reference hardware (I think @ordian was working on that one?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least, we should have some tooling that could be used to check all the already-existing PVFs not to fail with the new set of executor parameters on the reference hardware (I think @ordian was working on that one?)
The tooling we do have, but currently it's not integrated into the release process. But even if it could be used to check the current PVFs at the time of the release, one could imagine an attacker uploading a new PVF (with on-demand core) right after the release that will be on the edge of the old time limit. So it doesn't solve the problem by itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it doesn't solve the problem by itself.
Rerunning pre-checking before enactment would.
Why do you think it's only a single parachain
Not necessarily a single one, but most certainly not all of them. Otherwise we would have been really stupid with the parameters.
I remember we talked about decreasing the timeouts as not being safe, as it may make PVFs that are already on-chain fail.
True, but I think this should be fixed. Without pre-checking we get a finality stall, that alone is reason enough to do it properly and not enact parameters that have not been checked. Like, assuming we want to change these parameters at all: Why would we only ever want to increase? Most likely we need to increase to mitigate some issue, but would actually want to decrease again later, once it is fixed. Machines also get faster, compilers get better, so the chances that we actually want to decrease at some point are likely higher that we would want to increase.
In other words, by saying "no" we dig ourselves deeper into a solution that is actually not sound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should always use execution parameters from the session where the candidate was produced, so yes, they should always be executed with the same artifact.
But that's not true, is it ?
On approval voting we fetch executor params based on the block the candidate is included: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/core/approval-voting/src/lib.rs#L2967, .
On backing we fetch them based on the relay_parent:
https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/core/backing/src/lib.rs#L666
I don't think there is anything preventing the two relay chain blocks being in different sessions at the boundary, especially with async backing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that's not true, is it ?
Hmmm, that may be an important find... If that does not hold, it should be fixed. The very purport of the executor parameters is to always use the same set of parameters with the same candidate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Currently, PVFs are re-prepared if any execution environment parameter changes. As we've recently seen on Kusama and Polkadot, that may lead to a severe finality lag because every validator has to re-prepare every PVF. That cannot be avoided altogether; however, we could cease re-preparing PVFs when a change in the execution environment can't lead to a change in the artifact itself. For example, it's clear that changing the execution timeout cannot affect the artifact. In this PR, I'm introducing a separate hash for the subset of execution environment parameters that changes only if a preparation-related parameter changes. It introduces some minor code duplication, but without that, the scope of changes would be much bigger. TODO: - [x] Add a test to ensure the artifact is not re-prepared if non-preparation-related parameter is changed - [x] Add a test to ensure the artifact is re-prepared if a preparation-related parameter is changed - [x] Add comments, warnings, and, possibly, a test to ensure a new parameter ever added to the executor environment parameters will be evaluated by the author of changes with respect to its artifact preparation impact and added to the new hash preimage if needed. Closes paritytech#4132
Currently, PVFs are re-prepared if any execution environment parameter changes. As we've recently seen on Kusama and Polkadot, that may lead to a severe finality lag because every validator has to re-prepare every PVF. That cannot be avoided altogether; however, we could cease re-preparing PVFs when a change in the execution environment can't lead to a change in the artifact itself. For example, it's clear that changing the execution timeout cannot affect the artifact.
In this PR, I'm introducing a separate hash for the subset of execution environment parameters that changes only if a preparation-related parameter changes. It introduces some minor code duplication, but without that, the scope of changes would be much bigger.
TODO:
Closes #4132