Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tasks] Handle TaskPool panicking threads #2307

Closed

Conversation

NathanSWard
Copy link
Contributor

@NathanSWard NathanSWard commented Jun 6, 2021

Objective

Solution

  • Implement a TaskPoolThreadPanicPolicy which implements two different behaviors
    • Restart (default) -> this will join the panicked thread and then spawn a new thread in it's place.
    • Propagate -> this will propagate the panic from the inner thread to the spawning thread (usually the main thread)

TODO

  • Testing
  • Some more docs
  • An example

@NathanSWard NathanSWard added core C-Feature A new feature, making something new possible labels Jun 6, 2021
@NathanSWard NathanSWard changed the title Handle TaskPool panicking threads [tasks] Handle TaskPool panicking threads Jun 6, 2021
crates/bevy_core/src/lib.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added the S-Needs-Triage This issue needs to be labelled label Jun 7, 2021
@NathanSWard NathanSWard removed the S-Needs-Triage This issue needs to be labelled label Jun 7, 2021
@github-actions github-actions bot added the S-Needs-Triage This issue needs to be labelled label Jun 7, 2021
@NathanSWard NathanSWard removed the S-Needs-Triage This issue needs to be labelled label Jun 7, 2021
@github-actions github-actions bot added the S-Needs-Triage This issue needs to be labelled label Jun 7, 2021
@NathanSWard NathanSWard removed the S-Needs-Triage This issue needs to be labelled label Jun 8, 2021
@NathanSWard NathanSWard requested a review from bjorn3 June 8, 2021 23:03
Copy link
Member

@mockersf mockersf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, but threading is not my forte

Copy link
Contributor

@tiagolam tiagolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @NathanSWard! The approach here does seem sensible to me.

But there's one thing that does stand out to me, though. Why is state being kept in ThreadStates (having to then use a RwLock to sync the Vec in TaskPoolInner)? Given we already have the async_channel dependency for the shutdown channel, which already offers a multi producer / multi consumer channel (both sides Sync + Send), I would have thought we could simply use a channel to flag to the TaskPool which threads needed attention. This means we wouldn't have to loop through all ThreadStates even if there's nothing to action on (and drop the RwLock and AtomicBool).

@NathanSWard
Copy link
Contributor Author

Thanks, @NathanSWard! The approach here does seem sensible to me.

Um, to be frank, I didn't even consider the async_channel, and now thinking about it, I think I prefer that over the manual AtomicBool/ThreadState thing I implemented 😄

Thanks for the comment, I'll see what I can come up with!

@NathanSWard
Copy link
Contributor Author

(and drop the RwLock and AtomicBool).

Well, after looking at this again. We have to keep the RwLock regardless. This is because we somehow need mutable access to the Vec<JoinHandle<()>> since on PanicPolicy::Restart we need to replace the thread with a new one.

@tiagolam
Copy link
Contributor

(and drop the RwLock and AtomicBool).

Well, after looking at this again. We have to keep the RwLock regardless. This is because we somehow need mutable access to the Vec<JoinHandle<()>> since on PanicPolicy::Restart we need to replace the thread with a new one.

Hmm, yeah, that's true. My initial thinking was that we could drop the need for the Vec<JoinHandle<()>> as well, by sending the JoinHandle<()> over the channel to be received in handle_panicking_threads, but on a closer it looks like a Thread can't get access to its own handle. This makes it less convenient, so I think the current approach holds well.

@cart cart added the S-Pre-Relicense This PR was made before Bevy added the Apache license. Cannot be merged or used for other work label Jul 23, 2021
@mockersf mockersf removed the S-Pre-Relicense This PR was made before Bevy added the Apache license. Cannot be merged or used for other work label Jul 24, 2021
@alice-i-cecile alice-i-cecile added P-High This is particularly urgent, and deserves immediate attention S-Adopt-Me The original PR author has no intent to complete this work. Pick me up! labels Jun 12, 2022
SarthakSingh31 added a commit to SarthakSingh31/bevy that referenced this pull request Jun 15, 2022
SarthakSingh31 added a commit to SarthakSingh31/bevy that referenced this pull request Jun 15, 2022
bors bot pushed a commit that referenced this pull request Nov 2, 2022
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
@james7132
Copy link
Member

Superceded by #6443.

@james7132 james7132 closed this Nov 3, 2022
ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by bevyengine#2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like bevyengine#4998. Fixes bevyengine#4996. Fixes bevyengine#6081. Fixes bevyengine#5285. Fixes bevyengine#5054. Supersedes bevyengine#2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la bevyengine#4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-Feature A new feature, making something new possible P-High This is particularly urgent, and deserves immediate attention S-Adopt-Me The original PR author has no intent to complete this work. Pick me up!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panics while loading assets kill IoThreads preventing assets to be further loaded
7 participants