
PVF: Vote invalid on panics in execution thread (after a retry) #7155

Merged

29 commits merged from mrcnski/pvf-catch-panics-in-execution into master on May 16, 2023

Conversation

mrcnski (Contributor) commented May 1, 2023

PULL REQUEST

Overview

See title. We also make sure to kill the worker process on panic errors and internal errors, to potentially clear any error state that is independent of the candidate.
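
Roughly, the intended behavior is sketched below with placeholder names (the real error enums in `node/core/pvf` differ in shape; this is illustrative only): vote invalid for an actually invalid candidate, retry once on a panic before voting invalid, and never dispute on internal errors.

```rust
// Illustrative sketch only: `ExecuteError`, `Action` and `on_execute_error`
// are placeholders, not the actual types in the codebase.
enum ExecuteError {
    /// The candidate itself is invalid (e.g. the PVF deterministically trapped).
    InvalidCandidate,
    /// The execution thread panicked; the cause may be unrelated to the candidate.
    Panic(String),
    /// An internal error in the validation host, unrelated to the candidate.
    Internal(String),
}

enum Action {
    VoteInvalid,
    KillWorkerAndRetry,
    KillWorkerAndVoteInvalid,
    KillWorkerAndDoNotVote,
}

fn on_execute_error(err: ExecuteError, already_retried: bool) -> Action {
    match err {
        ExecuteError::InvalidCandidate => Action::VoteInvalid,
        // A panic may be a transient local problem, so retry once before
        // concluding that the candidate is at fault.
        ExecuteError::Panic(_) if !already_retried => Action::KillWorkerAndRetry,
        ExecuteError::Panic(_) => Action::KillWorkerAndVoteInvalid,
        // Internal errors say nothing about the candidate, so do not dispute;
        // just kill the worker to clear any bad state.
        ExecuteError::Internal(_) => Action::KillWorkerAndDoNotVote,
    }
}
```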

Related

Closes #7045
Addresses some TODOs from #7153, which is set as the target branch.

1. We were using `rayon` to spawn a superfluous thread to do execution, so that thread was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop (a simplified sketch follows below).

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODOs related to panics which I'm going to address soon as part of #7045.
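
For items (2) and (3), the shape of the change is roughly as follows. This is a simplified sketch, not the actual worker code: the names, the stack size, and the single `recv` standing in for the real loop are all placeholders.

```rust
use std::{sync::mpsc::channel, thread, time::Duration};

// Placeholder identifiers for which thread finished first.
enum WaitOutcome {
    JobFinished,
    CpuTimedOut,
}

fn run_job_with_cpu_timeout(cpu_timeout: Duration) -> WaitOutcome {
    let (tx, rx) = channel();

    // Job thread, spawned with `std::thread` so the stack size can be set per
    // thread (something rayon previously did per pool). 2 MiB is a placeholder.
    let job_tx = tx.clone();
    let _job = thread::Builder::new()
        .stack_size(2 * 1024 * 1024)
        .spawn(move || {
            // ... execute the candidate here ...
            let _ = job_tx.send(WaitOutcome::JobFinished);
        })
        .expect("failed to spawn job thread");

    // CPU-time monitor thread (sketched here as a simple wall-clock sleep).
    let _monitor = thread::spawn(move || {
        thread::sleep(cpu_timeout);
        let _ = tx.send(WaitOutcome::CpuTimedOut);
    });

    // Naive replacement for `select!`: take whichever message arrives first.
    rx.recv().expect("at least one sender sends a message")
}
```

The real code also joins on the threads and handles panics and timeouts explicitly; point (4) above concerns the order in which those results are checked once `select!` is gone.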
Comment on lines 715 to 716
// TODO: Should we stop retrying after some time has passed?
// We would need to know if we came from backing or approval.
mrcnski (Contributor, Author):

@s0me0ne-unkn0wn IIRC you've looked into this, WDYT?

Contributor:

Why would a panic resolve itself after a retry? Maybe in the case of memory pressure -> OOM? But still, if we retry and wait too long, malicious validators might have enough time to vote valid, and the including block will be finalized. We currently have a delay of 2 relay chain blocks on finality, which should help, but we'd have to be careful here.

Contributor:

You mean that in the case of backing you want the strict timeout to be applied to both retries cumulatively, not to a single one? Looks like a valid idea. There's no dedicated field yet to tell which subsystem the request is coming from, but you can check the execution timeout kind in the incoming `CandidateValidationMessage::ValidateFromExhaustive` or `CandidateValidationMessage::ValidateFromChainState`; it will be either `PvfExecTimeoutKind::Backing` or `PvfExecTimeoutKind::Approval`, so you can tell backing from the other stuff. Sounds a bit hacky, probably, but it will work.
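
A rough sketch of that check (the enum mirrors the variant names above, but the retry policy itself is only a placeholder, not the actual logic):

```rust
// Sketch only: decide whether a panicked execution should be retried, based on
// the execution timeout kind carried by the validation request.
enum PvfExecTimeoutKind {
    Backing,
    Approval,
}

fn can_retry_after_panic(kind: PvfExecTimeoutKind, retries_so_far: u32) -> bool {
    match kind {
        // Backing is time-critical: placeholder policy of no retries there
        // (or, alternatively, bound the cumulative time spent on retries).
        PvfExecTimeoutKind::Backing => false,
        // Approval can afford a retry; a slow check just no-shows and other
        // checkers cover for us.
        PvfExecTimeoutKind::Approval => retries_so_far == 0,
    }
}
```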

mrcnski (Contributor, Author):

> Why would a panic resolve itself after a retry? Maybe in the case of memory pressure -> OOM?

Maybe, or also just a hardware glitch, a spurious OS glitch, etc. We can't really tell why a panic happened, or whether it was an issue with the candidate or not.[^2] Retrying lets us be more sure about our vote.

[^2]: Well, we do get a stringified error, but we can't match on it or anything.

mrcnski (Contributor, Author):

Thanks @s0me0ne-unkn0wn! Pushed some changes.

Member:

> Why would a panic resolve itself after a retry? Maybe in the case of memory pressure -> OOM? But still, if we retry and wait too long, malicious validators might have enough time to vote valid, and the including block will be finalized. We currently have a delay of 2 relay chain blocks on finality, which should help, but we'd have to be careful here.

No, this should not be possible. If it takes too long, we no-show and more validators will cover for us.

Contributor:

But still, if those new checkers also retry, they will also no-show, and malicious votes come in?

Member:

They will then be replaced by even more checkers. One no-show will be replaced by more than one new checker, and those will again be 2/3 honest. Anyhow, this perfect DoS of all honest nodes is certainly the absolute worst case, so yes, we have to be careful about when not to raise a dispute. In general: if in doubt, raise one.

As for hardware faults being the cause, I think we should look into having error checking. E.g. we already suggest/require ECC memory; we should have something similar for the disk/db. The node should be able to detect a corrupted db, rather than just reading garbage data and then disputing stuff.

Contributor:

CPU hardware faults are extremely rare (I'm closely familiar with a single case, though a really interesting one), and memory faults are not so rare (I've come across two cases in my life) but still not something that should be seriously taken into account (that's something that could be handled by governance reverting the slashes on a case-by-case basis). So do I understand correctly that we are left with a single really concerning case, OOM? Can we tell OOM apart from the other cases? AFAIR Linux's `oom_kill_task()` sends SIGKILL to the process instead of SIGSEGV; could we use that to detect the condition?
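
If we went that route, detecting a SIGKILLed worker from the host side could look roughly like this (sketch only; the worker binary path is a placeholder):

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::Command;

// Sketch: on Unix, a child killed by the OOM killer is terminated via SIGKILL
// (signal 9), which shows up in its exit status rather than as an exit code.
fn worker_looks_oom_killed() -> std::io::Result<bool> {
    let status = Command::new("./pvf-execute-worker") // placeholder path
        .status()?;
    Ok(status.signal() == Some(9 /* SIGKILL */))
}
```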

mrcnski (Contributor, Author):

> Can we tell OOM apart from the other cases?

Someone (I believe @burdges?) had the idea of setting a flag before each allocation and unsetting it after; that would be easy to do if we had the allocator wrapper that's been discussed. And I agree that we should focus on OOM, since we've seen it happen in the wild.
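
A very rough sketch of that idea, assuming a global allocator wrapper (names and mechanism are hypothetical; in practice the flag would need to live somewhere the host can inspect after the worker dies, e.g. a shared-memory region):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical flag: set while an allocation is in flight. If the process is
// killed between the two stores (e.g. by the OOM killer while servicing this
// allocation), the flag stays set and a supervisor could treat the death as
// an OOM rather than a candidate bug.
static IN_ALLOCATION: AtomicBool = AtomicBool::new(false);

struct FlaggingAllocator;

unsafe impl GlobalAlloc for FlaggingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        IN_ALLOCATION.store(true, Ordering::SeqCst);
        let ptr = System.alloc(layout);
        IN_ALLOCATION.store(false, Ordering::SeqCst);
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: FlaggingAllocator = FlaggingAllocator;
```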


mrcnski and others added 8 commits May 2, 2023 03:36
…on.md

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.
Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs.

Also considered using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, while `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it instead of the worker thread. So it was simpler to leave this part as-is.
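
For reference, the condvar pattern referred to above looks roughly like this (simplified sketch; `true` means the job is still pending):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn wait_for_worker() {
    // The pair starts out "pending" (true); the worker flips it to false and
    // notifies once its job is done, as in the std `Condvar` docs.
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let worker_pair = Arc::clone(&pair);

    thread::spawn(move || {
        // ... do the actual work here ...
        let (lock, cvar) = &*worker_pair;
        *lock.lock().unwrap() = false;
        cvar.notify_one();
    });

    // Block until the worker signals completion (no timeout in this sketch).
    let (lock, cvar) = &*pair;
    let _finished = cvar
        .wait_while(lock.lock().unwrap(), |pending| *pending)
        .unwrap();
}
```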
…ution' into mrcnski/pvf-catch-panics-in-execution
@mrcnski mrcnski requested a review from sandreim May 2, 2023 15:32
@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels May 2, 2023
eskimor (Member) left a comment:

What other internal errors are left and can we be more specific about them?

mrcnski (Contributor, Author) commented May 3, 2023:

> What other internal errors are left and can we be more specific about them?

@eskimor That is a great question. I enumerated the internal errors here (the link should jump to `InternalValidationError`):

660dadd (#7155)

node/core/pvf/src/error.rs (review thread on an outdated diff; resolved)
@mrcnski mrcnski self-assigned this May 15, 2023
Base automatically changed from mrcnski/pvf-remove-rayon-tokio to master May 16, 2023 18:35
mrcnski (Contributor, Author) commented May 16, 2023:

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 9420d7d into master May 16, 2023
@paritytech-processbot paritytech-processbot bot deleted the mrcnski/pvf-catch-panics-in-execution branch May 16, 2023 21:01
ordian added a commit that referenced this pull request May 23, 2023
* master: (60 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
ordian added a commit that referenced this pull request May 23, 2023
…slashing-client

* ao-past-session-slashing-runtime: (61 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...