-
Notifications
You must be signed in to change notification settings - Fork 699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF: Add a new class of "possibly invalid" errors #660
Comments
Hey @mrcnski, I'm looking for another issue to work on. Do you still recommend this one or is there another that you think would be better? |
Yeah, it would be good to have! Just note that for your other PR, we will need to make a fix so that the "1. Timeout due to exceeded CPU time" case works correctly. I'll add a comment there. BTW, I'm at a work retreat right now, so may be slow to respond. |
enum ValidationError {
// preparation issue that are deterministic
Deterministic,
// may-be-transient preparation issue caused by internal conditions
Internal,
// vote-against execution issue
Invalid,
// fail-once-more-vote-against execution issue
PossiblyInvalid,
} What do you think? @mrcnski |
That looks great to me @eagr! I would just make a few changes: enum ValidationError {
// preparation issue that is deterministic
Preparation,
// may-be-transient issue with preparation or execution, caused by internal conditions
Internal,
// vote-against execution issue
Invalid,
// fail-once-more-vote-against execution issue
PossiblyInvalid,
} And FYI, per discussions with @Overkillus we may not want to retry in backing. Only in approval. Can be a separate issue from this one though. |
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
* extract common parts of relay loops: begin * merge client impls * backoff in exchange loop * reconnect without clone
ISSUE
Overview
PVF execution plays a critical role in our dispute process. If we execute a candidate block and find it is invalid, we vote against it. If some validators' votes do not agree, a dispute is initiated.
Now, in the real world we may try to execute a candidate and it fails due to a hardware fault, operator error, some rare bug, etc. In this case, voting against would initiate a dispute, which is not what we want -- it may get the validator slashed!
So, right now we have two main failure states:
InvalidCandidate
, where we always vote against.InternalError
, where we never vote against, but do retry once (since PVF: Don't dispute on missing artifact polkadot#7011) to let transient error conditions clear.And we do actually have a third state:
InvalidCandidate::AmbiguousWorkerDeath
, which happens when the worker process dies for an unknown reason. In this case we retry once, but if it still fails, then we do vote against. This would happen if the PVF execution takes up a large amount of memory, causing OOM, in a way that is reproducible.Proposal
So, I am proposing that we extend this third category, giving it a catch-all name like
PossiblyInvalid
. We would retry these, and only on continued failure deem the candidate invalid.Other errors that we could treat this way:
InvalidCandidate
. But, under conditions of high local load we may do very little work, while still counting CPU time, eventually timing out for a valid candidate. This can be retried after a delay, in the hopes that the load died down. We have to be careful though. It may lead to yet more increase to load, on the other hand we may prevent disputes by retrying!a. (We may also want to detect conditions of high load and deal with it somehow.)
b. NOTE: There may not be enough time for a retry in candidate validation from backing.
RuntimeConstruction
this way. Right now it always results in Invalid, though it should be a local issue. PVF: Consider treatingRuntimeConstruction
as an internal execution error #661 would be the full fix, but in the meantime it would make sense to retry in this case before voting against.Having this separate variant would also just make the code easier to reason about, and easier to add more "retry-then-vote-against" cases in the future.
Also, it would be nice to make these three distinct categories clear in the documentation!
The text was updated successfully, but these errors were encountered: