Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved wgpu error handling, no more crashes through wgpu validation errors #4509

Merged
merged 14 commits into from
Dec 13, 2023

Conversation

Wumpf
Copy link
Member

@Wumpf Wumpf commented Dec 13, 2023

What

There's a bunch of things that probably should have been separate PRs but by the time I ventured there it got too hard to entangle, so please bear with me and read this detailed list of changes carefully (if you understand this list you've done half the review ;-)):

  • re_renderer now uses cfg_aliases and cfg_if in various places to make cfg decisions more readable
  • ErrorTracker is now used in all build configurations. Previously, it was only active on debug native builds
  • The first thing RendererContext does now is to install global error handlers that feed into the "top level" ErrorTracker. Previously this was done delayed and only for debug+native
  • ErrorTracker never panics now
  • ErrorTracker no longer uses a frame counting heuristic (!!) to discard all errors, but instead relies on precise device timeline frame counters to discard individual errors
    • this fixes issues where previously an error would disappear and reappear without being logged
    • as commented several times in the code: Keep in mind the three staggered timelines of WebGPU
      • Content (your code)
      • Device (everything happening on wgpu::Device)
      • Queue (everything happening on the GPU, represented by wgpu::Queue)
    • Content and Device are mostly in sync on wgpu-core, but NOT on WebGPU in general!
  • split out the wgpu-core specific parts of ErrorTracker into its own module and fall back to simple string hashing for WebGPU
    • on WebGPU we're dealing with a JS interface that doesn't provide these types. Additionally, in a pure WebGPU build
      (the only webgpu enabled build we have today), we don't even depend on wgpu-core!
    • Manual test of re_renderer samples shows that we now no longer crash on WebGPU either!
  • the wgpu-core error type handling is unchanged except for a change in label hashing
    • previously, in the absence of debug labels, error de-duplication would fail. In practice leading to error spam on release builds
  • Introduce WgpuErrorScope utility for safe & fully covering wgpu error scopes
    • The resulting futures are handled with now_or_never on wgpu-core backend since we know this is safe. On WebGPU we hand off execution to the browser.
  • Each frame now has a top level WgpuErrorScope in order to catch all errors
    • this is not a strict requirement on wgpu-core backends since wgpu-core will always feed the fallback error handler. But Chrome/Dawn WebGPU does not. The spec allows for this behavior
image Screenshot of a viewer in release mode with an intentionally broken mesh shader. Also tried other breakages like render pipeline format mismatches.

Things related, but NOT covered by this PR:

  • crashes on startup / adapter selection
    • this requires further improvements to egui: We need to improve the hooks that go into creating the egui render state. Need to consider allowing to create your own renderstate. This would also allow us to move the actual device/queue/adapter ownership to re_renderer
  • Create re_renderer error scope per space view in order to show error messages for "crashed views" #4507
    • punting on this one. It's a nice-to-have extension of the work presented here
  • close look into Error when quickly resizing viewer with 3D view #4455
    • the crash is almost certainly fixed now and replaced by a recoverable error 🥳, but I want to understand what sets the wrong viewport - either the one in egui we already fixed or the ViewBuilder (which surely could use some safety check)

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
    • Full build: app.rerun.io
    • Partial build: app.rerun.io - Useful for quick testing when changes do not affect examples in any way
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG

@Wumpf Wumpf added enhancement New feature or request 🔺 re_renderer affects re_renderer itself include in changelog labels Dec 13, 2023
@emilk emilk self-requested a review December 13, 2023 12:26
Copy link
Member

@emilk emilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phew, that was a lot!

The frame_index_for_uncaptured_errors is a bit confusing to me - maybe you can make a diagram or something? 😬

crates/re_renderer/build.rs Outdated Show resolved Hide resolved
crates/re_renderer/build.rs Outdated Show resolved Hide resolved
crates/re_renderer/src/context.rs Outdated Show resolved Hide resolved
crates/re_renderer/src/context.rs Outdated Show resolved Hide resolved
crates/re_renderer/src/context.rs Outdated Show resolved Hide resolved
crates/re_renderer/src/context.rs Outdated Show resolved Hide resolved
crates/re_renderer/src/error_handling/mod.rs Outdated Show resolved Hide resolved
@Wumpf
Copy link
Member Author

Wumpf commented Dec 13, 2023

originally called frame_index_for_uncaptured_errors instead frame_index_device_timeline. But opted out of that since that's a bit too much of a claim & confusing as well and it's really only for that uncaptured_error callback 🤔

@Wumpf Wumpf merged commit ac75d50 into main Dec 13, 2023
40 of 41 checks passed
@Wumpf Wumpf deleted the andreas/re_renderer/improved-wgpu-error-handling branch December 13, 2023 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request include in changelog 🔺 re_renderer affects re_renderer itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rendering validation errors should result in black screen, not a crash
2 participants