Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark the device as lost when hal produces a device lost error #5132

Closed
nical opened this issue Jan 24, 2024 · 6 comments · Fixed by #6229
Closed

Mark the device as lost when hal produces a device lost error #5132

nical opened this issue Jan 24, 2024 · 6 comments · Fixed by #6229
Assignees
Labels
type: enhancement New feature or request

Comments

@nical
Copy link
Contributor

nical commented Jan 24, 2024

We currently do not detect when hal produces a device lost error. In this situation we should call Global::device_mark_lost to fire the device lost callback and prevent the device from issuing hal commands again (other than potentially destroying resources).

See also #4907 which has a broader scope.

@nical
Copy link
Contributor Author

nical commented Jan 24, 2024

An possible approach is to handle it outside of wgpu-core, near the code that deals with error scopes since it's a central place where errors are processed. That means all users of wgpu-core (wgpu, wgpu-native and gecko) would have to do it separately.
The alternative is to wrap each wgpu-core entry points into functions that inspect the errors.

The next question is whether the device lost callback should be invoked right away or next time the device is polled.

@nical nical self-assigned this Jan 24, 2024
@nical
Copy link
Contributor Author

nical commented Jan 24, 2024

My plan is now to detect device lost errors when converting the hal device error type to the wgpu-core one. It will require a bit of plumbing but has the merit of structurally making sure we can't forget to check.

It will mark the device as lost and the device will fire the device lost callback next time it is polled.

@nical
Copy link
Contributor Author

nical commented Jan 25, 2024

My plan is now to detect device lost errors when converting the hal device error type to the wgpu-core one.

Scratch that, it's way too invasive.

@vorporeal
Copy link
Contributor

vorporeal commented Feb 14, 2024

I'm looking into how to deal with Parent device is lost panics produced in our application by wgpu when using a dedicated laptop GPU and performing a suspend/resume cycle on the machine.

It appears that it's currently not possible to detect these (given the lack of wiring on the wgpu side) - is this accurate? If so, is there anything I could do to help get this functionality working? If not, what's the best way to avoid this impossible-to-recover-from panic?

@nical
Copy link
Contributor Author

nical commented Feb 15, 2024

This is accurate. We need to find a good way to inspect every error, detect when it's a device loss and react accordingly (mark the device as lost so that it skips further calls and invoke the device loss callback. This is not implemented yet unfortunately and a good way to help is to try to implement it.

I'm not sure yet what the most practical way to do this is. I lean on the side of having some kind of wgpu_core::Device::error<E>(error: E) -> E where E: ErrorTrait function that checks whether the error is a device loss and sets a flag on the device if so. ErrorTrait which would need a better name would provide let us extract some basic info about the error like whether it is a device loss, an out-of-memory, etc.

But someone needs to try and see if it works well. It's somewhat high on my priority list, unfortunately there's a bunch of more pressing issues that will keep me busy in the short/medium term.

@nical
Copy link
Contributor Author

nical commented Feb 15, 2024

Another solution that was discussed is to do something similar to what I described in my previous comment, but in the backends instead of in wgpu-core. The backends have simpler error types since they don't contain validation, so it might be easier to check there, even if it would involve duplicating the logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement New feature or request
Projects
Status: Done
3 participants