Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

CLI: Restrict os/arch for secure validators, add flag for insecure mode #7073

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Apr 13, 2023

PULL REQUEST

Overview

Summary: Running --validator on a platform other than Linux x86-64 requires the --insecure-validator-i-know-what-i-do flag.

Due to paritytech/polkadot-sdk#882 becoming high-priority for parathreads, we are now forced to provide a secure validator mode only for Linux x86-64 (to start).

We will still support other platforms with an --insecure-validator-i-know-what-i-do flag. (Naming follows interpreted-i-know-what-i-do.)

We may want to wait for paritytech/polkadot-sdk#882 before merging, but it seems at this point pretty certain that platform-specific code is a requirement for securely running the PVF workers.

TODO

  • Add flag to zombienet (so existing commands don't break) (@pepoviola can we minimize breakage?)
  • Anywhere else?
  • Documentation (where? validators' guide?)
  • Require landlock to be enabled for secure-mode.

Related Issues

See paritytech/polkadot-sdk#882

Closes paritytech/polkadot-sdk#881

Due to #4718 becoming high-priority for parathreads, we are now forced to provide a secure validator mode only for Linux x86-64 (to start).

We will still support MacOS with an `--insecure-validator-i-know-what-i-do` flag. (Naming follows [`interpreted-i-know-what-i-do`](https://github.com/paritytech/substrate/blob//client/cli/src/arg_enums.rs#L58).)

See https://github.com/paritytech/polkadot/issues/4718#issuecomment-1484137059

Closes https://github.com/paritytech/polkadot/issues/4720
@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Apr 13, 2023
Copy link
Contributor

@koute koute left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.... I'm not entirely sure we want to immediately do this? Are we going to insta-enable the sandboxing? I was thinking that maybe we could do something like this:

  1. Add sandboxing support, but keep it disabled by default.
  2. Get the people from the thousand validators program to test it for us for one release.
  3. Then if it all goes well switch it on by default and add this flag.

But maybe this is too cautiousness? In case there would be any problems people could always disable sandboxing if they want, but on the other hand it's probably not a good look to tell people to run with this parameter because we didn't test it out enough to ensure it's completely compatible with everything.

@@ -118,6 +118,10 @@ pub struct RunCmd {
#[arg(long)]
pub beefy: bool,

/// Allows the validator to run insecurely if they know what they're doing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is visible to the user in --help, right? In which case a slightly longer explanation would be nice. (Or as I've said in the other comment, maybe a link to somewhere where it's explained in more detail.)

Comment on lines +744 to +748
#[cfg(not(target_os = "linux"))]
let result = Err("Must be on Linux to run a validator securely.".into());

#[cfg(all(target_os = "linux", not(target_arch = "x86_64")))]
let result = Err("Must be on x86_64 to run a validator securely.".into());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a little bit less jargon-y? Normal users might not necessarily know what x86_64 is.

Suggested change
#[cfg(not(target_os = "linux"))]
let result = Err("Must be on Linux to run a validator securely.".into());
#[cfg(all(target_os = "linux", not(target_arch = "x86_64")))]
let result = Err("Must be on x86_64 to run a validator securely.".into());
#[cfg(not(target_os = "linux"))]
let result = Err("Running a validator is only supported on Linux.".into());
#[cfg(all(target_os = "linux", not(target_arch = "x86_64")))]
let result = Err("Running a validator is only supported on CPUs from the x86_64 family (usually Intel or AMD).".into());

@@ -57,4 +57,7 @@ pub enum Error {

#[error("This subcommand is only available when compiled with `{feature}`")]
FeatureNotEnabled { feature: &'static str },

#[error("Insecure validator: {0} Run with --insecure-validator-i-know-what-i-do if you understand and accept the risks of running insecurely.")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Insecure validator" can be kinda vague, maybe something like this?

Suggested change
#[error("Insecure validator: {0} Run with --insecure-validator-i-know-what-i-do if you understand and accept the risks of running insecurely.")]
#[error("Your system cannot securely run a validator: {0}\nYou can ignore this requirement by using the `--insecure-validator-i-know-what-i-do` command line argument if you understand and accept the risks of running insecurely.")]

Also, I think in this case it'd be nice to maybe e.g. put a link to the docs with a longer explanation of why this is and what will be the consequences when running in unsecure mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the idea, just don't know where the doc should go. Maybe on the validators guide?

@mrcnski mrcnski self-assigned this Apr 13, 2023
@mrcnski
Copy link
Contributor Author

mrcnski commented Apr 13, 2023

Good point @koute. I just thought, since the upcoming os/arch requirement is certain, we could start enforcing it early -- or at least start informing validators and getting the docs ready. And people seem confident that essentially all validators are already on Linux amd64 (though I still haven't seen any data myself).

  1. Add sandboxing support, but keep it disabled by default.
  2. Get the people from the thousand validators program to test it for us for one release.
  3. Then if it all goes well switch it on by default and add this flag.

That is a really good idea. I am worried about turning on sandboxing and having legitimate syscalls blocked on some real PVFs. One concern though is that for validators, it might seem too risky to beta test a new feature that could get them slashed?

This PR can certainly wait until sandboxing, no strong preference here as to when it's merged. Can mark as DNM if we want.

@bkchr
Copy link
Member

bkchr commented Apr 13, 2023

I'm with @koute. I would also propose to close the pr, it will just get stale and recreating it will be very easy.

That is a really good idea. I am worried about turning on sandboxing and having legitimate syscalls blocked on some real PVFs. One concern though is that for validators, it might seem too risky to beta test a new feature that could get them slashed?

That is a good point. I think at the beginning we should run with some kind of training wheels, aka: https://github.com/pmem/syscall_intercept We could intercept all the syscalls and let each unknown syscall print some error message.

@mrcnski
Copy link
Contributor Author

mrcnski commented Apr 14, 2023

Thanks! To be clear, this PR doesn't enable sandboxing, it just restricts the os/arch for validators without the insecure flag.

That is a good point. I think at the beginning we should run with some kind of training wheels, aka: pmem/syscall_intercept We could intercept all the syscalls and let each unknown syscall print some error message.

That looks like a C library and not very easy to use? seccomp already provides a log-only mode so we can use that right now for any syscalls that pass through the filter. Validators could then grep e.g. kern.log for the syscalls and share with us. Or we can grep it ourselves in the worker process after execution, and raise an error trace if there's something there.

We could provide e.g. a --sandbox-mode flag with only one possible value for now, like log-syscalls. This would totally safe. Later we can add another option like restrict-syscalls and make that the default. We can add more options as we expand the sandboxing, and make whatever the strongest one is the default.

@koute
Copy link
Contributor

koute commented Apr 14, 2023

We could intercept all the syscalls and let each unknown syscall print some error message.

We could, but the problem is - how do we get the users to report that to us if the only thing it does is print out a message? If a tree falls down in a forest and no one is there, does it make a sound? :P

That looks like a C library and not very easy to use?

I think Basti meant to just adopt a similar concept, not to actually use that library. Especially since from what I can see that library works by monkeypatching the process' memory, which is not what we want.

@mrcnski
Copy link
Contributor Author

mrcnski commented Apr 14, 2023

We could, but the problem is - how do we get the users to report that to us if the only thing it does is print out a message? If a tree falls down in a forest and no one is there, does it make a sound? :P

The message could say: "in the future this error may result in a dispute, please report to the Polkadot development team!"

@bkchr
Copy link
Member

bkchr commented Apr 14, 2023

Thanks! To be clear, this PR doesn't enable sandboxing, it just restricts the os/arch for validators without the insecure flag.

I know :P

I think Basti meant to just adopt a similar concept, not to actually use that library. Especially since from what I can see that library works by monkeypatching the process' memory, which is not what we want.

No, I meant to use this monkeypatching :P My 5 min google search told me that you can not replace syscall that easily as it is already loaded before anything else, which makes sense. But if there exist some kind of logging for seccomp, it sounds nice! If we want to abuse the 1kv program any way, we can send the data about accessed syscalls to the telemetry. Maybe we can let seccomp log somewhere else or we may need to parse the kernel log 🙈

@mrcnski
Copy link
Contributor Author

mrcnski commented Apr 14, 2023

Maybe we can let seccomp log somewhere else or we may need to parse the kernel log 🙈

I don't think we can change where it logs. ☹️ I also don't know if the log location is always the same on each machine. On my GCP instance it's kern.log but the man page says it goes in the audit log, which I thought was audit.log.

@koute Is it possible to use the trace action to log from a tracer?

@mrcnski mrcnski added C7-high PR touches the given topic and has a high impact on builders. B1-note_worthy Changes should be noted in the release notes and removed B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Apr 17, 2023
@koute
Copy link
Contributor

koute commented May 11, 2023

@koute Is it possible to use the trace action to log from a tracer?

Well, with SECCOMP_RET_TRACE the ptrace host will get notified about the call so it can do whatever it wants with it, including logging it wherever it wants.

@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/22

@mrcnski mrcnski marked this pull request as draft June 1, 2023 14:50
@mrcnski mrcnski mentioned this pull request Jun 1, 2023
3 tasks
@mrcnski
Copy link
Contributor Author

mrcnski commented Jun 1, 2023

I realized some things related to this work:

  1. For determinism, we should fully require whatever security features (like landlock) we have, in order to run secure-mode.
  2. On the other hand, too many validators on insecure-mode can also be a source of indeterminism.
  3. So we should be careful not to have some crazy unrealistic standards for running in secure-mode.
  4. We should monitor with telemetry how many validators are secure vs. insecure. This shouldn't use the public telemetry server.

See #7303 (comment) for more informations.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B1-note_worthy Changes should be noted in the release notes C7-high PR touches the given topic and has a high impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
No open projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

PVF host: Specialize on Linux, support macOS
4 participants