Cleaned up panic messages #8219
Conversation
I'm going to try to fix #8152 at the same time. Many of the panics that cause noise are necessary for making sure the application actually panics, and not merely quits (exit code 0 vs. 101). For example, the panic I'm having difficulty with right now is one in bevy_tasks. I would like to make it so that whatever panic was captured is propagated from here, but that means finding some way of storing/retrieving the payload. I was able to fix a similar issue in bevy_ecs by using a

In the case of the bevy_tasks dilemma,

These issues may also be fixed by refactoring the design. In the meantime, I'm going to try creating these alternative types. |
Working on a |
I think instead of closing the channel to propagate the error, we should be trying a more graceful shutdown. Specifically we should be waiting for any currently running systems to finish before stopping the multithreaded executor. This could be done by changing the finish channel to use a result instead of just sending the index and then handling the error in the executor. |
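To illustrate the shape of that suggestion, here is a minimal standalone sketch, not Bevy's actual executor code; the channel type, the `PanicPayload` alias, and the message layout are all assumptions:

```rust
use std::any::Any;
use std::sync::mpsc;

// Hypothetical payload type; Bevy's real executor uses async channels.
type PanicPayload = Box<dyn Any + Send>;

fn main() {
    // The finish channel carries a Result instead of a bare system index,
    // so the executor can tell a successful completion from a panic.
    let (tx, rx) = mpsc::channel::<Result<usize, (usize, PanicPayload)>>();

    tx.send(Ok(1)).unwrap(); // system 1 finished normally
    tx.send(Err((3, Box::new("oooh scary")))).unwrap(); // system 3 panicked
    drop(tx);

    for msg in rx {
        match msg {
            Ok(index) => println!("system {index} finished"),
            Err((index, _payload)) => {
                eprintln!("system {index} panicked; waiting for running systems to finish");
                // After draining, the executor could std::panic::resume_unwind(_payload).
            }
        }
    }
}
```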
@hymm Yeah, we can do that. I suppose the benefit would be so IO tasks don't corrupt files/send corrupted messages? Are there any other benefits to be aware of? There will be a cost of channel messages doubling in size (right now, the added cost is checking a mutex which is guaranteed to not be locked until a panic), so I would like to get an overview of the pros and cons of such a design. |
Never mind, I was able to avoid the cost in the channels by keeping the message type the same. The only difference now is that systems always signal that they are complete, regardless of whether they were successful. Instead of closing the channel, they just set the panic payload.

Edit: However, I'm not sure how this change will affect dependent systems. Essentially, the change I made makes systems signal that they have finished regardless of whether or not they succeeded. This would mean dependent systems are signaled that they can start, which might cause further panics.

Edit 2: Yeah, I tested this with a simple app containing two systems, where one depends on a resource inserted by the first. If the first system panics, the second one panics as well. It seems the cost in the channels is unavoidable, since there needs to be some way of signaling to the executor that a system finished/failed without signaling to dependent systems that they can start. |
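A minimal sketch of the payload-slot idea described above; the `PanicSlot` name and the plain `Mutex` are assumptions, and Bevy's executor wires this up differently:

```rust
use std::any::Any;
use std::panic;
use std::sync::Mutex;

// Systems always signal completion; a panicking system additionally
// stores its payload here. The mutex is uncontended until a panic occurs.
#[derive(Default)]
struct PanicSlot(Mutex<Option<Box<dyn Any + Send>>>);

impl PanicSlot {
    fn set(&self, payload: Box<dyn Any + Send>) {
        *self.0.lock().unwrap() = Some(payload);
    }

    // Called by the executor after the schedule drains: re-raise the
    // captured panic so the process still exits with code 101, not 0.
    fn resume_if_panicked(&self) {
        if let Some(payload) = self.0.lock().unwrap().take() {
            panic::resume_unwind(payload);
        }
    }
}

fn main() {
    let slot = PanicSlot::default();
    let result = panic::catch_unwind(|| panic!("oooh scary"));
    if let Err(payload) = result {
        slot.set(payload); // still signal completion to the executor as usual
    }
    slot.resume_if_panicked(); // re-raises the original panic here
}
```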
Okay, it now allows unfinished systems to complete. That includes systems that haven't even started yet (in the running schedules), as long as they don't depend on the panicked system. The reason for this is that advanced behavior can be split across multiple systems. If we were to panic immediately, or only allow systems that had already started to finish, systems responsible for cleaning up after those systems would never run, which could lead to IO issues like file corruption. There may be other reasons for doing this that I'm not aware of. I still have to fix a few other bugs and add some optimizations. |
Okay, I think this is ready. Errors now properly propagate up (though it looks like some of the backtrace is still missing; I think that's intended due to how bevy_tasks restarts tasks that panic, but this is my first look at the internals of bevy_tasks, so I would need someone to confirm this), and all systems in the schedule run before exiting (if possible). |
Oops, found a bug: I copied the code for skipping systems over to skipping dependents, but I should have ensured that dependents of dependents are skipped as well. A sketch of the fix follows.

Edit: I'll add it as a test for future reference. |
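A minimal sketch of the transitive skip just described; the adjacency-list layout is an assumption, as Bevy tracks dependents differently:

```rust
use std::collections::VecDeque;

// `dependents[i]` lists the systems that depend directly on system `i`.
// Starting from the panicked system, walk the graph so that dependents
// of dependents are skipped too, not just direct dependents.
fn skip_transitive_dependents(panicked: usize, dependents: &[Vec<usize>], skipped: &mut [bool]) {
    let mut queue = VecDeque::from([panicked]);
    while let Some(current) = queue.pop_front() {
        for &dep in &dependents[current] {
            if !skipped[dep] {
                skipped[dep] = true;
                queue.push_back(dep); // also visit the dependents of `dep`
            }
        }
    }
}

fn main() {
    // 0 -> 1 -> 2 is a dependency chain; 3 is independent.
    let dependents = vec![vec![1], vec![2], vec![], vec![]];
    let mut skipped = vec![false; 4];
    skip_transitive_dependents(0, &dependents, &mut skipped);
    assert_eq!(skipped, [false, true, true, false]); // 3 still runs
}
```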
Could you update the description with what a panic would look like? What happens if more than one system panics at the same time in different threads? |
Will do.
Here is a small program:

```rust
use bevy::prelude::*;

fn main() {
    App::new()
        .add_plugins(DefaultPlugins)
        .add_systems(Update, (panicking_system, panicking_system_2))
        .run();
}

fn panicking_system() {
    panic!("oooh scary");
}

fn panicking_system_2() {
    panic!("oooh scary 2!");
}
```

And the output:
That last message is due to the fact that there are two schedules running: the main, single-threaded schedule, and the multi-threaded one on top of it. For that reason, the error is caught twice and two messages are printed. I rephrased the error messages so they don't sound like the main thread panicked when it was actually a task somewhere in the main thread that panicked. This wasn't really intended, but is more of a happy little accident, since it gives a mini backtrace of sorts. |
That doesn't look good 😄 What's the stack trace displayed with |
Yeah, but IDK if there is anything we can do there. I think it's more a problem with |
Yup, definitely not something to solve here. At least the stack traces look good |
Oh yeah, I guess I should be using |
Actually, I changed my mind.

Edit: What it would look like with |
And |
No, the

Edit: Oh, maybe that's what you meant? |
Right, that was the edge case I was half-remembering :) |
```diff
-        let task = self.executor.spawn(f).fallible();
+        let task = self
+            .executor
+            .spawn(AssertUnwindSafe(f).catch_unwind())
```
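For context, here is roughly what that pattern does in isolation, using the `futures` crate's `FutureExt::catch_unwind`; this is a standalone sketch, not bevy_tasks' actual code:

```rust
use std::panic::AssertUnwindSafe;

use futures::FutureExt; // provides `catch_unwind` on futures

fn main() {
    let fut = async {
        panic!("task panicked");
    };
    // AssertUnwindSafe asserts unwind safety on the caller's behalf;
    // catch_unwind turns the panic into an Err(payload) instead of
    // unwinding through the executor.
    let res = futures::executor::block_on(AssertUnwindSafe(fut).catch_unwind());
    if let Err(payload) = res {
        // The payload could be stored and re-raised later with
        // std::panic::resume_unwind(payload).
        let msg = payload.downcast_ref::<&str>().copied().unwrap_or("<non-string payload>");
        eprintln!("caught panic: {msg}");
    }
}
```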
I don't think we can guarantee that the passed-in future is UnwindSafe.
I'm going to have to look more into it, since we do execute other systems after one fails, but it seems other projects ignore unwind safety (I may have misinterpreted it though, I've got to go back to work).
`Send + 'static` pretty much implies unwind safe. In our case the futures aren't `'static` when spawning on the scope.
Right, so I guess the thing to do would be to make it so the spawn functions can only take unwind-safe futures, and then ensure that safety from bevy_ecs.
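The API change being floated might look something like this; a sketch under assumed names, since TaskPool's real signatures differ:

```rust
use std::future::Future;
use std::panic::UnwindSafe;

// Require unwind safety at the API boundary: callers must either pass
// types that are already UnwindSafe or opt in with AssertUnwindSafe.
fn spawn<F>(future: F)
where
    F: Future<Output = ()> + Send + UnwindSafe + 'static,
{
    // A real implementation would hand the future to the executor here.
    let _ = future;
}

fn main() {
    spawn(async {}); // async blocks over unwind-safe data compile as-is
}
```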
Actually, I took a more in-depth look at this, and I think unwind safety doesn't matter here. Sure, we are catching the unwinds, but we aren't restarting tasks or returning the result; the panic is propagated rather than a new panic being created. The real problem is in bevy_ecs, where catching an unwind previously led to a panic but now lets the remaining systems run. In order to guarantee that broken invariants aren't observed, the mutable accesses of the panicked system need to be marked, and any system that depends on any of the marked accesses needs to be skipped.
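A hypothetical sketch of that access-based skipping; the `AccessId` type and the set representation are assumptions, not bevy_ecs's actual access tracking:

```rust
use std::collections::HashSet;

// Hypothetical ids for the components/resources a system reads or writes.
type AccessId = usize;

// Skip any system whose accesses overlap the accesses mutated by the
// panicked system, since it might observe broken invariants.
fn should_skip(system_accesses: &HashSet<AccessId>, poisoned: &HashSet<AccessId>) -> bool {
    !system_accesses.is_disjoint(poisoned)
}

fn main() {
    let poisoned = HashSet::from([2, 5]); // mutated by the panicked system
    assert!(should_skip(&HashSet::from([1, 2]), &poisoned)); // overlaps access 2
    assert!(!should_skip(&HashSet::from([7]), &poisoned)); // unaffected, may run
}
```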
Hmmm, so this has already become pretty complex, and it seems there is even more work to do. The major problem is unwind safety. In order to run systems after the panic, we have to account for any corruption that may have occurred. This can be done by skipping systems that depend on affected accesses. However, this severely limits the benefits of running systems afterwards. For example, if the panicking system is exclusive, then no system could run, and if a future system is exclusive, it can't run either. Anyway, my point is that this is a bad way to handle a graceful shutdown. A better alternative might be providing a panic handler in the

So I'm going to change this to only wait for running systems to stop. I'll leave figuring out graceful shutdowns to someone who has more experience with the ECS internals. |
Force-pushed from fab7927 to 90ab514.
There, I think that fixes everything. Currently running systems are allowed to finish, but the executor won't spawn any more systems. Future work will have to be done to allow for more advanced cleanup. Maybe that might even look like what I was attempting, but with some method of ensuring unwind safety ( |
This is a solid improvement over the current state of things, and this is critical work. Let's merge this IMO.
Looks good. Let's merge and catch edge cases on main if any |
```rust
#[cfg(feature = "trace")]
system_span.exit();
if let Err(payload) = res {
    eprintln!("Encountered a panic in system `{}`!", &*system.name());
```
It's a bit too late now, but in a future PR something like this could make the error stand out a bit more
eprintln!("Encountered a panic in system `{}`!", &*system.name()); | |
eprintln!("\x1b[91mERROR\x1b[0m: Encountered a panic in system `{}`!", &*system.name()); |
Is this supported in all terminals on all platforms?
I know it works on Windows at least, and I would be surprised if it didn't work on Linux, but I don't have a way to confirm that right now. There are crates that can handle this more gracefully, though I'm not sure adding a crate just for that is warranted. Then again, tracing probably already uses one of them, so we could just reuse that one since it's already in the dep tree.
And when logs are redirected to a file? I don't know of a single way to have colors...
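For what it's worth, the standard library can detect this; a minimal sketch, assuming Rust 1.70+ where `std::io::IsTerminal` is stable (this PR predates that):

```rust
use std::io::{stderr, IsTerminal};

// Emit ANSI color codes only when stderr is an interactive terminal,
// so logs redirected to a file don't end up with escape bytes in them.
fn print_panic_message(system_name: &str) {
    if stderr().is_terminal() {
        eprintln!("\x1b[91mERROR\x1b[0m: Encountered a panic in system `{system_name}`!");
    } else {
        eprintln!("ERROR: Encountered a panic in system `{system_name}`!");
    }
}

fn main() {
    print_panic_message("panicking_system");
}
```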
We could use the `error!` macro to have colors in the terminal, and normal output if colors aren't supported. |
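Roughly what that would look like; a sketch assuming the `tracing` and `tracing-subscriber` crates (which Bevy already pulls in behind features):

```rust
use tracing::error;

fn main() {
    // A minimal fmt subscriber; `.with_ansi(false)` could be used
    // instead when the output target doesn't support colors.
    tracing_subscriber::fmt().with_ansi(true).init();
    error!("Encountered a panic in system `{}`!", "panicking_system");
}
```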
> We could use `error!`

`error!` wasn't used because of verbosity concerns. I don't personally agree, but I suggested that approach as an alternative. I think we should instead change the default format of logs if the verbosity is a concern, rather than avoiding it, but that's out of scope here.

> And when logs are redirected to a file

True, I didn't really consider that. At least it would just be a couple of weird characters at the end of the file, but it's indeed not ideal.
> I think we should instead change the default format of logs if the verbosity is a concern instead of avoiding it

Makes sense to me. |
> I don't know of a single way to have colors...

Well, there's `error!()`, which can deal with it correctly 😉. So there is a way, but as mentioned it was purposefully not used. |
> Well, there's error!() that can deal with it correctly

I meant by adding control characters. As far as I know, the tracing library uses environment detection.

> error! wasn't used because of verbosity concerns

Oh, I thought it was because bevy_ecs can be used without logs and tracing enabled
Right, I was basing it on this comment: #8219 (comment), but alice mentioned just after that it won't work with tracing disabled.

Not sure what the best solution is; maybe we could just use a feature flag for tracing, and if it's disabled, fall back to the current log with no color?

It just seems unfortunate to not use tracing where possible. For example, if someone sets up a tracing layer to output to a file, they won't get these logs currently, which really isn't ideal. |
…11455)

# Objective

#8219 changed the target type of a `transmute` without changing the one being transmuted from ([see the relevant diff](55e9ab7#diff-11413fb2eeba97978379d325353d32aa76eefd0af0c8e9b50b7f394ddfda7a26R351-R355)), making them incompatible. This PR fixes this by changing the initial type to match the target one (modulo lifetimes).

## Solution

Change the type being transmuted from to match the one being transmuted into (modulo lifetimes).
Objective
Fixes #8215 and #8152. When a system panics, the main thread panics as well, which clutters the output.
Solution
Resolves the panic in the multi-threaded scheduler. Also adds an extra message that tells the user which system panicked.
Using the example from the issue, here is what the messages now look like:
Before
After