Mark the device as lost when hal produces a device lost error #5132

nical · 2024-01-24T10:34:25Z

We currently do not detect when hal produces a device lost error. In this situation we should call Global::device_mark_lost to fire the device lost callback and prevent the device from issuing hal commands again (other than potentially destroying resources).

See also #4907 which has a broader scope.

nical · 2024-01-24T10:49:00Z

A possible approach is to handle it outside of wgpu-core, near the code that deals with error scopes, since that is a central place where errors are processed. That means all users of wgpu-core (wgpu, wgpu-native and gecko) would have to do it separately.
The alternative is to wrap each wgpu-core entry point in a function that inspects the errors.

The next question is whether the device lost callback should be invoked right away or next time the device is polled.

nical · 2024-01-24T17:10:03Z

My plan is now to detect device lost errors when converting the hal device error type to the wgpu-core one. It will require a bit of plumbing but has the merit of structurally making sure we can't forget to check.

It will mark the device as lost and the device will fire the device lost callback next time it is polled.
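
The "mark now, fire the callback on the next poll" behaviour can be sketched as follows. This is only an illustration under assumptions: `PollingDevice`, `mark_lost`, `poll` and the callback type are hypothetical names, not the wgpu-core API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Hypothetical callback type; the real wgpu-core callback differs.
type DeviceLostCallback = Box<dyn FnOnce(&'static str) + Send>;

// Hypothetical stand-in for a device that defers the device lost callback.
struct PollingDevice {
    lost: AtomicBool,
    callback: Mutex<Option<DeviceLostCallback>>,
}

impl PollingDevice {
    fn new(callback: DeviceLostCallback) -> Self {
        Self {
            lost: AtomicBool::new(false),
            callback: Mutex::new(Some(callback)),
        }
    }

    // Called where a device lost error is detected: only records the loss.
    fn mark_lost(&self) {
        self.lost.store(true, Ordering::Release);
    }

    // Called when the device is polled. Fires the callback at most once
    // and returns whether it fired during this call.
    fn poll(&self) -> bool {
        if !self.lost.load(Ordering::Acquire) {
            return false;
        }
        // Take the callback out first so the lock is not held while it runs.
        let callback = self.callback.lock().unwrap().take();
        match callback {
            Some(callback) => {
                callback("device lost");
                true
            }
            None => false,
        }
    }
}
```

Deferring the callback keeps user code from running in the middle of whichever call hit the error, at the cost of the loss being reported a little later.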

nical · 2024-01-25T10:35:49Z

My plan is now to detect device lost errors when converting the hal device error type to the wgpu-core one.

Scratch that, it's way too invasive.

vorporeal · 2024-02-14T20:50:23Z

I'm looking into how to deal with "Parent device is lost" panics produced in our application by wgpu when using a dedicated laptop GPU and performing a suspend/resume cycle on the machine.

It appears that it's currently not possible to detect these (given the lack of wiring on the wgpu side) - is this accurate? If so, is there anything I could do to help get this functionality working? If not, what's the best way to avoid this impossible-to-recover-from panic?

nical · 2024-02-15T09:30:17Z

This is accurate. We need to find a good way to inspect every error, detect when it's a device loss, and react accordingly (mark the device as lost so that it skips further calls, and invoke the device loss callback). Unfortunately this is not implemented yet, and a good way to help is to try to implement it.

I'm not sure yet what the most practical way to do this is. I lean toward having some kind of wgpu_core::Device::error<E>(error: E) -> E where E: ErrorTrait function that checks whether the error is a device loss and sets a flag on the device if so. ErrorTrait (which would need a better name) would let us extract some basic info about the error, such as whether it is a device loss, an out-of-memory error, etc.
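
A minimal sketch of that idea, assuming a simplified device and error type. Everything here (`Device`, `DeviceError`, the trait methods, the `lost` flag) is hypothetical and only illustrates the shape of the proposal, not actual wgpu-core code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical trait (it would need a better name): exposes basic
// information about an error without knowing its concrete type.
trait ErrorTrait {
    fn is_device_lost(&self) -> bool;
    fn is_out_of_memory(&self) -> bool;
}

// Hypothetical, simplified error type standing in for a real one.
#[derive(Debug, PartialEq)]
enum DeviceError {
    Lost,
    OutOfMemory,
}

impl ErrorTrait for DeviceError {
    fn is_device_lost(&self) -> bool {
        matches!(self, DeviceError::Lost)
    }
    fn is_out_of_memory(&self) -> bool {
        matches!(self, DeviceError::OutOfMemory)
    }
}

// Hypothetical stand-in for the device, reduced to the "lost" flag.
struct Device {
    lost: AtomicBool,
}

impl Device {
    fn new() -> Self {
        Self { lost: AtomicBool::new(false) }
    }

    // Inspects the error, flags the device if it is a device loss,
    // and returns the error unchanged so it can keep propagating.
    fn error<E: ErrorTrait>(&self, error: E) -> E {
        if error.is_device_lost() {
            self.lost.store(true, Ordering::Release);
        }
        error
    }

    fn is_lost(&self) -> bool {
        self.lost.load(Ordering::Acquire)
    }
}
```

Because the function returns the error it was given, a call site could route errors through it with something like `result.map_err(|e| device.error(e))?` without changing the error type it returns.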

But someone needs to try and see if it works well. It's somewhat high on my priority list, unfortunately there's a bunch of more pressing issues that will keep me busy in the short/medium term.

nical · 2024-02-15T09:32:46Z

Another solution that was discussed is to do something similar to what I described in my previous comment, but in the backends instead of in wgpu-core. The backends have simpler error types since they don't contain validation, so it might be easier to check there, even if it would involve duplicating the logic.

nical self-assigned this Jan 24, 2024

teoxoy added the type: enhancement New feature or request label Jul 3, 2024

teoxoy assigned teoxoy and unassigned nical Jul 15, 2024

teoxoy mentioned this issue Sep 6, 2024

Invalidate the device when we encounter driver-induced device loss or on unexpected errors #6229

Merged

teoxoy closed this as completed in #6229 Sep 9, 2024
