
PVF worker: consider fork instead of thread for better isolation #574

Closed
mrcnski opened this issue Aug 23, 2023 · 25 comments · Fixed by #1685

@mrcnski
Contributor

mrcnski commented Aug 23, 2023

Right now each PVF job runs in its own thread, but doing a fork may be better:

  1. Read needed information from socket into a buffer.
  2. Fork, child immediately closes socket - but has access to that buffer.
  3. Child runs and validates, information it needs to report back can be sent over a pipe.
  4. Parent waits for child to terminate, communicates back the result and clears out the buffer.

In the end, the thread spawning would just be replaced by a fork() call. The reason this might be good: it is simply better isolation. The process can now only mess with its own address space and cannot possibly affect the execution of upcoming candidates, for example.

Credit: @eskimor
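The four steps above can be sketched as follows (illustrative C only, not the actual worker code; the hypothetical job here just echoes the buffer back as its "result"):

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Steps 1-4 above: the parent has already read the job data from the
 * socket into `buf`; the child would close the socket fd right after the
 * fork (omitted here) and report its result back over a pipe. */
ssize_t run_job_forked(char *buf, size_t len, char *result, size_t cap) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: close unused fds (and, in the real worker, the socket). */
        close(fds[0]);
        /* "Validate" using the inherited buffer; here we just echo it. */
        if (write(fds[1], buf, len) != (ssize_t)len)
            _exit(1);
        _exit(0);
    }

    /* Parent: read the result, wait for the child, clear the buffer. */
    close(fds[1]);
    ssize_t n = read(fds[0], result, cap);
    close(fds[0]);

    int status;
    waitpid(pid, &status, 0);
    memset(buf, 0, len); /* clear out the job buffer */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? n : -1;
}
```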

@eskimor
Member

eskimor commented Aug 23, 2023

If the memory usage in the parent is low, I believe fork should end up being about as cheap as threads. The only difference should be the COW semantics on memory pages, but if there is very little (mutable) memory used by the parent, then this should not really matter.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@mrcnski
Contributor Author

mrcnski commented Aug 28, 2023

@jpserrat If you are still keen on contributing, this would be a great way to make a big impact. It shouldn't be an overly demanding task, while still being interesting.

@jpserrat
Contributor

@mrcnski I'm going to start working on this now.
Thanks a lot for reaching out.

@mrcnski
Contributor Author

mrcnski commented Aug 29, 2023

Here there is a section called "Why not fork?". But I didn't really understand the concerns and am ending for the day.

@mrcnski
Contributor Author

mrcnski commented Aug 31, 2023

My concern about the thread issue pep mentioned was: what if we go back to multi-threaded execution at some point? A while back we found that multi-threaded execution had significantly better performance (at the cost of determinism):
paritytech/polkadot#7206 (comment)
But I don't know whether what pep said about threads makes sense.

We also have the memory metrics and CPU time metrics running in separate threads. Those would now have to be on the forked process. But I think with a separate process we can just have the parent call getrusage on the child at the end. We should research it; it would be a nice simplification to remove the metrics threads.

@jpserrat
Contributor

jpserrat commented Sep 1, 2023

Great points. Regarding your concern, it has a high chance of happening, so I will keep this in mind during the implementation, ensuring it's easy to switch back if necessary. What do you think?
About the concerns mentioned in the "Why not fork?" section, I'm not sure about them either, but I'm going to do some tests.

Just to make sure I got it all right: this change is going to be in the executor worker entrypoint, right? Where we spawn a new execute thread?

@mrcnski
Contributor Author

mrcnski commented Sep 4, 2023

Here's the problem I see with the threads:

  1. We have a thread that runs simultaneously with the prepare/execute threads, that measures CPU time used.
  2. We have another thread that runs along with the prepare thread, that measures memory usage.
  3. Ideally we can stop using threads because they can be hijacked by a malicious job, and made to report wrong values.
    1. e.g. a process can take a huge amount of memory but be made to report a value that is under the limit, so that it passes e.g. pre-checking.
  4. We can get both of these values with getrusage(2) instead, but there are two issues with it:
    1. We can't rely on the forked process to report its own values with RUSAGE_SELF (point (3) above).
    2. We can't use RUSAGE_CHILDREN from the parent of the forked process, because it gets values for all children that have terminated, not just the one we're interested in.
      1. It would work for CPU time because that's cumulative and we would only have one process at a time.
  5. Conclusion: Due to point (3), ideally we could find a way to get these values some other way, from the parent of the forked process.
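The RUSAGE_CHILDREN caveat in point (4) is easy to demonstrate in isolation (a standalone C sketch, unrelated to the actual worker code): a child's CPU time only shows up after the parent has reaped it.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static double reaped_children_cpu(void) {
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);
    return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
         + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}

/* Returns 1 if the child's CPU time became visible only after waitpid. */
int children_counted_only_after_reap(void) {
    double before = reaped_children_cpu();

    pid_t pid = fork();
    if (pid == 0) {
        /* Burn some CPU so there is something to measure. */
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 200000000UL; i++) x += i;
        _exit(0);
    }

    sleep(1); /* the child has not been reaped yet... */
    double while_unreaped = reaped_children_cpu();

    waitpid(pid, NULL, 0); /* ...now it has */
    double after_reap = reaped_children_cpu();

    return while_unreaped == before && after_reap > before;
}
```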

Just to make sure I got it all right: this change is going to be in the executor worker entrypoint, right? Where we spawn a new execute thread?

Yep, and also in the prepare worker entrypoint. :) But we can focus on execute right now to keep things simple.

@mrcnski
Contributor Author

mrcnski commented Sep 5, 2023

Sorry for things being up in the air while we figure this out. 😅 I am currently wondering if the threads can be removed:

  1. preparation memory - this is not super accurate and we have plans to get the stat some other way.
  2. CPU time - this can be falsified by an attacker. But, I believe we can just replace this thread with getrusage in the parent of the forked process.

(I appreciate your help with threads in the past. I just hadn't realized that they are not isolated and therefore not secure.)

Unfortunately @s0me0ne-unkn0wn is on vacation and I'm waiting for his input. :)

@jpserrat
Contributor

jpserrat commented Sep 7, 2023

No problem, I'm looking at the code to get a better understanding during this time. :)
I just wanted to clarify the CPU time part. From what I understood, currently we measure CPU time per job. With the new approach, we would measure the cumulative CPU time of all child processes. Is my understanding right? Why are we making this change?

@mrcnski
Contributor Author

mrcnski commented Sep 7, 2023

With the new approach, we would measure the cumulative CPU time of all child processes.

Good question! Yeah, getrusage gives a cumulative CPU time. But since we run one child process at a time, we can get the delta between the cumulative CPU time after the current job, and the cumulative CPU time after the last job. That should give us the CPU time just for the current job. Sorry for being unclear.
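This delta idea can be sketched in C roughly like so (assuming, as above, exactly one job child at a time, reaped before measuring; `job` is a stand-in for the real work):

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static double children_cpu_seconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);
    return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
         + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}

/* CPU time of just this job: cumulative children time after the job,
 * minus the cumulative children time before it. */
double job_cpu_seconds(void (*job)(void)) {
    double before = children_cpu_seconds();

    pid_t pid = fork();
    if (pid == 0) {
        job();
        _exit(0);
    }
    /* The child must be reaped before its CPU time is accounted for. */
    waitpid(pid, NULL, 0);

    return children_cpu_seconds() - before;
}
```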

@jpserrat
Contributor

jpserrat commented Sep 9, 2023

Now I get it, thanks!
I have another one. The CPU time that the job took is currently measured at the end of prepare_artifact/validate_using_artifact; for that, getrusage can be used to get the duration the job took. However, for cpu_time_monitor_loop we would still need the thread, as getrusage can't be used there. It is not clear to me how we would be able to remove the separate thread for cpu_time_monitor_loop's timeout check.

@mrcnski
Contributor Author

mrcnski commented Sep 10, 2023

Ah, you mean how do we stop the child process when it times out? Good point, we still need a thread on the parent side which works in much the same way as cpu_time_monitor_loop. Only difference is that it would call getrusage with the RUSAGE_CHILDREN argument. Does that answer your question?

@s0me0ne-unkn0wn
Contributor

Sorry for being late to the party. I've read through the discussion but haven't got a strong opinion yet. Here are some thoughts.

First of all, threads are not evil by themselves. We were concerned about inter-job isolation, and that's why job processes are better than job threads, but threads running inside a job (and cleaned up with a job) are not something bad by definition. We could still use them if need be.

Further, you consider the situation when the malicious code breaks out and falsifies the CPU/memory measurement done by some thread running inside a job. That's a valid point for execution (I still believe it's not so valid for preparation, and anyway, it's solved by the limiting global allocator that doesn't require threads). Yes, that may happen. But is that different from a situation when the code breaks out and falsifies validation results themselves, resulting in the inclusion of an invalid parachain block? IIUC, the latter carries a much higher risk than just reporting invalid CPU/RAM measurements, and it has nothing to do with threads at all.

I mean, we shouldn't get rid of threads just to get rid of threads. Observing the resource consumption from the outside is a good idea (we still need to figure out if those measurements would be deterministic enough). We just don't have to push with eliminating threads without a clear need.

@mrcnski
Contributor Author

mrcnski commented Sep 10, 2023

Thanks @s0me0ne-unkn0wn, good points!

About the measurements from the child, keeping the attack surface as low as possible is a good idea. If attackers falsify the validation result, that might be detectable if the attack fails on other machines and a dispute is raised - but falsified measurements would not be detectable. So I still say the child should do only the job it's spawned to do, just to avoid providing any additional attack surface however small it may be.

About threads, you may be right, I was just concerned about pep's points linked above in "Why not fork". Thinking about it more, it may be preferable that no threads are present while fork is being called. pep's point might have been that managing processes in a multi-threaded context can get hairy as there are lots of subtle race conditions involved. (The pidfd API was developed partially to address this, and I would suggest using it if possible.) But I think if we have the CPU time monitor thread, it should start after fork anyway, so it doesn't measure the child start-up time.

Observing the resource consumption from the outside is a good idea (we still need to figure out if those measurements would be deterministic enough).

For CPU time it shouldn't affect determinism, since we are already relying on the kernel to get this stat - doing it from the parent instead should change nothing. For memory, what we have now is a polling-based method which is already not very deterministic at all. Using strace or your global allocator approach would both be better in this regard.

@s0me0ne-unkn0wn
Contributor

it may be preferable that no threads are present while fork is being called

I don't think it's a big concern:

The child process is created with a single thread -- the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.

@jpserrat
Contributor

It does answer my question, thanks. Just one more point about this.

Only difference is that it would call getrusage with the RUSAGE_CHILDREN argument.

getrusage with RUSAGE_CHILDREN will return the stats of all terminated children; since we need to get the CPU time of the running child, this is not going to work for this purpose. As RUSAGE_SELF can't be used either, should we go back to the ProcessTime approach?

About the concern with threads and fork: as mentioned by @mrcnski, the CPU and memory threads will start after the fork, so do we need to worry about this? If there are other points of concern, we can do as @s0me0ne-unkn0wn suggested with pthread_atfork.

@mrcnski
Contributor Author

mrcnski commented Sep 12, 2023

getrusage with RUSAGE_CHILDREN will return the stats of all terminated children; since we need to get the CPU time of the running child, this is not going to work for this purpose.

Oh, right... I hate Linux. 😂 We could check /proc/[pid]/stat, but I'm not sure we can assume /proc/ is mounted. At least on my GCP instance it wasn't.

As RUSAGE_SELF can not be used as well, should we go back to the ProcessTime approach?

I don't think the ProcessTime API can be used to get the CPU time of a child. But maybe clock_gettime can be used?

But wait... now that we are forking, we can have the forked process set its own CPU time limit right before it runs any untrusted code. I believe it can be done with setrlimit(2).

Great point @jpserrat ⭐️

About the concern with threads and fork: as mentioned by @mrcnski, the CPU and memory threads will start after the fork

With the above solution we won't need the CPU time thread. So execute jobs won't need any threads.

For prepare jobs, the memory thread can be put on the side of the forked child. We decided that an attacker manipulating this value is not a concern in our threat model. Eventually it might be replaced with a global allocator tracker, but that remains an ongoing area of research. So for now, let's have a memory thread on the forked child and a thread that performs the prepare job, but no CPU time thread.
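For reference, here's a minimal C sketch of the setrlimit(2) idea (the limit value and names are made up for illustration): the forked job lowers RLIMIT_CPU on itself right before running untrusted code, and the kernel terminates it with SIGXCPU (or SIGKILL at the hard limit) if it overruns, so no monitor thread is needed.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a job with a CPU-time limit; returns 1 if the job was killed
 * for exceeding the limit, 0 if it finished normally. */
int run_with_cpu_limit(void (*job)(void), rlim_t secs) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Lower both the soft and hard limit right before running any
         * untrusted code; an unprivileged process cannot raise it back. */
        struct rlimit rl = { .rlim_cur = secs, .rlim_max = secs };
        setrlimit(RLIMIT_CPU, &rl);
        job();
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status)
        && (WTERMSIG(status) == SIGXCPU || WTERMSIG(status) == SIGKILL);
}
```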

@s0me0ne-unkn0wn
Contributor

We could check /proc/[pid]/stat, but I'm not sure we can assume /proc/ is mounted.

In the general case, we cannot assume that, but maybe we could just mimic its behavior? The parent process must have all the needed permissions, IIUC. It's somewhere over there: https://elixir.bootlin.com/linux/v4.14/source/fs/proc/array.c#L390

now that we are forking, we can have the forked process set its own CPU time limit right before it runs any untrusted code. I believe it can be done with setrlimit(2).

That's true, but following your worst-case logic, the untrusted code can break out and reset that limit with setrlimit() as well 🙂

@mrcnski
Contributor Author

mrcnski commented Sep 12, 2023

In the general case, we cannot assume that, but maybe we could just mimic its behavior? The parent process must have all the needed permissions, IIUC. It's somewhere over there: elixir.bootlin.com/linux/v4.14/source/fs/proc/array.c#L390

That seems really hacky. 😂

That's true, but following your worst-case logic, the untrusted code can break out and reset that limit with setrlimit() as well 🙂

Not so! Assuming the untrusted code can't escalate its privilege, which is much harder than just running arbitrary code:

The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit.
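A standalone C sketch of that behavior (the value 5 is arbitrary): lowering the hard limit always succeeds and sticks, and raising it back afterwards would fail with EPERM for an unprivileged process, which is what protects the limit from the untrusted code. Done in a forked child so the calling process keeps its own limits:

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lower the hard CPU limit in a forked child and report the hard limit
 * the child then sees (-1 on error). */
long lowered_hard_limit(void) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Irreversibly lower the soft and hard limit to 5 CPU-seconds. */
        struct rlimit rl = { .rlim_cur = 5, .rlim_max = 5 };
        setrlimit(RLIMIT_CPU, &rl);
        getrlimit(RLIMIT_CPU, &rl); /* read back what stuck */
        long lim = (long)rl.rlim_max;
        write(fds[1], &lim, sizeof lim);
        _exit(0);
    }

    close(fds[1]);
    long lim = -1;
    read(fds[0], &lim, sizeof lim);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return lim;
}
```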

@s0me0ne-unkn0wn
Contributor

That seems really hacky. 😂

Okay, I have an idea of how to do that in a not-so-hacky way, but it introduces some overhead. We could build a matryoshka doll.

  1. The main worker process forks a new process (let's call it trusted)
  2. The trusted process forks a new process (let's call it executor)
  3. The executor executes a PVF
  4. The trusted process observes the executor's CPU time through getrusage(RUSAGE_CHILDREN)
  5. The trusted process reports the CPU time to the main process

I believe the only overhead here is the second fork() as the executor process may communicate with the main process directly. I believe that overhead could be significantly reduced if we could use clone() instead of fork() in one of those bifurcation points.

Just an idea and should be evaluated.
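A standalone C sketch of that layout (all names illustrative, error handling omitted). It relies on the fact that a freshly forked process starts with zeroed rusage counters, so the trusted process's RUSAGE_CHILDREN covers exactly the one executor:

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

double execute_and_measure(void (*pvf)(void)) {
    int fds[2];
    pipe(fds);

    /* 1. The main worker process forks the "trusted" process. */
    pid_t trusted = fork();
    if (trusted == 0) {
        /* 2. The trusted process forks the "executor". */
        pid_t executor = fork();
        if (executor == 0) {
            pvf(); /* 3. The executor runs the PVF. */
            _exit(0);
        }
        waitpid(executor, NULL, 0);

        /* 4. The executor has terminated, so RUSAGE_CHILDREN includes it. */
        struct rusage ru;
        getrusage(RUSAGE_CHILDREN, &ru);
        double cpu = (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
                   + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;

        /* 5. Report the CPU time up to the main process. */
        write(fds[1], &cpu, sizeof cpu);
        _exit(0);
    }

    close(fds[1]);
    double cpu = -1.0;
    read(fds[0], &cpu, sizeof cpu);
    close(fds[0]);
    waitpid(trusted, NULL, 0);
    return cpu;
}
```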

@mrcnski
Contributor Author

mrcnski commented Sep 12, 2023

Interesting idea! I think it has the issue noted above, where we can't get the CPU time with getrusage(RUSAGE_CHILDREN) until the executor process finishes. This is not ideal because there could be some bug or attack where we have some over-running jobs. (We do have another timeout on the host for these situations, so the over-running process doesn't take forever. But it's a wall-clock timeout (and not CPU time) and super lenient. I'd rather rely on a real, precise CPU timeout.)

I approve of thinking out-of-the-box and Russian-doll approaches are always a win in my book. But what's wrong with setrlimit? Seems like it does exactly what we want in a single call.

@s0me0ne-unkn0wn
Contributor

But what's wrong with setrlimit? Seems like it does exactly what we want in a single call.

Probably nothing's wrong with it. Somehow, I was sure you need root privileges to set the hard limit, but the manpage says you can lower the limit without privileges, so it should be okay.

I just tend to trust our resource-limiting code more than OS's code 🙂 Because our code is right before our eyes in the first place. But I understand that we shouldn't invent too many wheels as well.

@mrcnski
Contributor Author

mrcnski commented Sep 13, 2023

Phew! @jpserrat I hope everything is clear now. BTW, I'd be more than happy to provide mentorship. Let me know if you would like to get on a call to discuss more, or do some pair programming. My email's on my GitHub profile.

@jpserrat
Contributor

That would be great @mrcnski, thanks!

@mrcnski
Contributor Author

mrcnski commented Nov 3, 2023

Here there is a section called "Why not fork?". But I didn't really understand the concerns and am ending for the day.

Ah, I think the concerns are not valid because we are calling fork from a single-threaded context. See also https://docs.rs/nix/latest/nix/unistd/fn.fork.html#safety for more info.

This means the forked child job process can have threads if it wants. And I think it will be necessary to monitor CPU time again, because setrlimit has a little caveat I didn't know about.

We can still use getrusage(RUSAGE_CHILDREN) to avoid trusting the time metric reported by the child job process. However, we can no longer rely on the process getting killed when it surpasses the time limit. This means malicious code can take up validator resources until the (very long) lenient timeout is reached. It's unfortunate, but I don't think it's a security issue. We could leave a followup to investigate e.g. cgroups as an alternative, but it's unlikely that we'd get to it soon.

serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 10, 2024