Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Ignore Timeout errors on Linux AMD & Intel #5957

Closed
wants to merge 6 commits into from

Conversation

nicopap
Copy link
Contributor

@nicopap nicopap commented Sep 12, 2022

Objective

Solution

When running on a Linux machine with some AMD or Intel device, when calling
surface.get_current_texture(), ignore wgpu::SurfaceError::Timeout errors.

Alternative

An alternative solution found in the wgpu examples is:

let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));

See: https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370

Veloren handles the Timeout error the way this PR proposes to handle it.

The reason I went with this PR's solution is that configure_surface seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without configure_surface.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the Timeout occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.

@nicopap nicopap added C-Bug An unexpected or incorrect behavior A-Rendering Drawing game state to the screen P-Crash A sudden unexpected crash O-Linux Specific to the Linux desktop operating system labels Sep 12, 2022
@alice-i-cecile alice-i-cecile added this to the Bevy 0.9 milestone Sep 12, 2022
@nicopap
Copy link
Contributor Author

nicopap commented Sep 12, 2022

@mdickopp @tirithen @alexpyattaev @bobhenkel @chiboreache @paullouisageneau @popojan @etam You seem to have hit this bug, could a benevolent soul try out this patch?

@heavyrain266
Copy link

Bevy seems to no longer crash on Wayland for me, finally can disable XWayland in orichalcum.

@Ixentus
Copy link
Contributor

Ixentus commented Sep 13, 2022

Tried the method from 4579, but didn't get any errors. Is there a reliable minimal way to reproduce? Running X11 on AMD XT6700 with 0.9.0-dev.

@popojan
Copy link

popojan commented Sep 13, 2022

I am sorry, in the meantime I switched to Wayland/sway and I cannot easily reproduce the problem anymore.

When I tried to use nicopap's revision as dependency and recompile I got an error stating cannot provide explicit generic arguments when impl Trait is used in argument position related to ::<SystemStage> usage in mod.rs, but it may well be my fault, I am new to Rust.

@mdickopp
Copy link
Contributor

This fixes both #3380 and #4579 for me. Thanks!

(Since I do not use Wayland, I cannot test #3606).

@mdickopp
Copy link
Contributor

Please note that I can reproduce both #3380 and #4579 on an Intel device, so I do not think they are specific to AMD devices.

2022-09-13T17:14:48.220831Z  INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1.75    
2022-09-13T17:14:48.260219Z  INFO bevy_render::renderer: AdapterInfo { name: "Intel(R) HD Graphics 5500 (BDW GT2)", vendor: 32902, device: 5654, device_type: IntegratedGpu, backend: Vulkan }

@nicopap nicopap marked this pull request as draft September 24, 2022 12:57
@nicopap
Copy link
Contributor Author

nicopap commented Sep 24, 2022

Draft until workaround expanded to intel devices. I also happen to have a intel GPU handy, so I might be able to test as well.

@nicopap
Copy link
Contributor Author

nicopap commented Sep 26, 2022

@mdickopp

Please note that I can reproduce both #3380 and #4579 on an Intel device

Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.

@nicopap nicopap changed the title Ignore Timeout errors on Linux AMD Ignore Timeout errors on Linux AMD & Intel Sep 26, 2022
@nicopap nicopap force-pushed the fix-swapchain-timeout-crash branch from 5896bc2 to a14a344 Compare September 26, 2022 12:00
@nicopap nicopap marked this pull request as ready for review September 26, 2022 12:11
@ksf
Copy link

ksf commented Sep 27, 2022

I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:

That way we can still catch degraded application/driver state because when the driver is returning timeouts for a whole, say, second or two something very much looks amiss, at the same time delaying panic on non-problematic configurations seems benign. We can even make the timeout configurable in case some valiant gamer tries to run things on a potato or something.

Bonus: A message like "Graphics driver returned timeouts for X seconds" points end-users squarely at the issue.

@alice-i-cecile
Copy link
Member

I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a specific time. That is, ignore intermittent ones:

I'm interested in this solution; the less hardcoded special-casing the better.

@nicopap
Copy link
Contributor Author

nicopap commented Sep 28, 2022

@ksf For me personally, the timeout happens every frame, despite the frame clearly drawing in less than the actual timeout, so your proposed solution wouldn't work (frame draws in well below 16ms, timeout is a full second)

It might be possible to not special-case it, which is, from what I understand, how Veloren does it. I guess I was worried that I would break other assumptions. As far as I know, it shouldn't break anything, but "if it works on Linux it works on Windows" is not a sentence I've heard many times… So I kept conservative to exactly what I changed. I'd be happy to remove the #[cfg(target_os = "linux")] if someone else can confirm it works. But I'm worried another contributor will chime in soon and say "if this bug is limited to Linux, why not add #[cfg(target_os = "linux")]?" And then I'll have to write another lengthy reply justifying my decisions.

Anyway, maybe the fact we log on debug! is misleading and is what led you to believe it was intermittent? Should we entirely skip logging or log on trace! level?

Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)

Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)

Yep, you're right there. I'm on board with this fix as a solution, even if it is temporary.

@ksf
Copy link

ksf commented Sep 29, 2022

@nicopap In my case the timeout happens during configuring events, things like resizing, that's why I assumed it was an intermittent issue. But on hindsight, if the user is drag-resizing the window for a second and every request times out and thus the "last successful" time can't get reset my solution would still panic. I'm not sure whether all requests time out in my case, would have to investigate (currently, at the state of development of my code (early) I said "meh" and hard-coded the present mode to AutoNoVSync).

I definitely agree that your solution is better than just crashing, and even when things get more well-behaved upstream we'll have to support older drivers (though at some point I'd say it becomes sensible to tell people to upgrade or disable VSync/eat the performance hit)

@mdickopp
Copy link
Contributor

@nicopap

Hmm, can't reproduce on my Whiskey Lake intel iGPU. Looks like you are using a Broadwell, which is very common. I've a Broadwell CPU somewhere, but, at the moment I can't test on it, as the motherboard is pretty much in a cardboard box without peripherals.

I re-tested your latest commit (a14a344) on my Intel system, and can confirm that it fixes the bugs for me.

@superdump superdump added the X-Controversial There is active debate or serious implications around merging this PR label Sep 29, 2022
@superdump
Copy link
Contributor

This looks controversial. I hear the reasoning. I guess ideally it would be fixed in mesa but even then it will take time for new drivers to roll out. Still, I’d like to understand it before ignoring the error.

@nicopap
Copy link
Contributor Author

nicopap commented Sep 30, 2022

100% Agree. I really dislike the fact I basically don't understand what I'm doing here.

@mdickopp
Copy link
Contributor

After some digging I found out that the Vulkan backend of wgpu calls vkAcquireNextImageKHR with a timeout of 1 second.

A comment in (an older version of) the source code of Chromium hints at a bug in X11 and how they worked around it: https://chromium.googlesource.com/chromium/src/+/8ec9935d64c1fcc72d09c2d44ac1dfc0a29514f3/gpu/vulkan/x/vulkan_surface_x11.cc#62
For one thing, they use a 2 seconds timeout. I'll do some more testing.

@meisme-dev
Copy link

I can reproduce this on an NVIDIA GPU (Linux)

@nicopap
Copy link
Contributor Author

nicopap commented Oct 24, 2022

@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.

See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

@meisme-dev
Copy link

@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model etc? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue.

See the issue template for how to get the specs https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md

Kernel: 5.15.74
Distro: NixOS Unstable (Raccoon)
Card model: RTX 2070 Super
Driver version: 520.56.06
wgpu-info: { name: "NVIDIA GeForce RTX 2070 SUPER", vendor: 4318, device: 7812, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "520.56.06", backend: Vulkan }

@alice-i-cecile alice-i-cecile removed this from the Bevy 0.9 milestone Oct 30, 2022
@cart
Copy link
Member

cart commented Oct 30, 2022

On the one hand: I think its worth hacking around this quirk if we can, in the interest of getting Bevy running on more computers. Panicking at startup (or intermittently) is a high priority bug fix. We have multiple people saying that this works, and Veloren successfully using it is a reasonable indicator that it works. On the other hand, it feels important to understand what is happening here. Theres a chance that doing the wrong thing here will introduce ghosts in the system, hard-to-debug issues, unnecessary screen "flashing" as timeouts occur, etc.

@cart
Copy link
Member

cart commented Oct 30, 2022

I'm going to re-add this to the 0.9 milestone, just so we can make a final call on this if the conversation progresses.

@cart cart added this to the Bevy 0.9 milestone Oct 30, 2022
@mdickopp
Copy link
Contributor

To gain a better understanding of the situation, I could investigate further how other projects deal with the timeout in vkAcquireNextImageKHR. However, I'll be pretty busy in November, so I'll not be able to do this before December.

If you close this issue for 0.9 and open a followup issue, kindly mention me in the new one, so that I don't forget it.

Alternative:

An alternative solution found in the `wgpu` examples is:

```rust
let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));
```

See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370>

The reason I went with this PR's solution is that `configure_surface`
seems to be quite an expensive operation, and it would run every frame
with the wgpu framework solution, despite the fact it works perfectly
fine without `configure_surface`.

I know this looks super hacky with the linux-specific line and the AMD
check, but my understanding is that the `Timeout` occurence is specific
to a quirk of some AMD drivers on linux, and if otherwise met should be
considered a bug.
@nicopap nicopap force-pushed the fix-swapchain-timeout-crash branch from a14a344 to b1698fe Compare November 1, 2022 14:04
@nicopap
Copy link
Contributor Author

nicopap commented Nov 1, 2022

I've resolved the conflicts with main now.

I limit the change strictly to linux and hackishly restrict it to AMD/intel GPUs so that the risk of causing unexpected issues is strictly limited to a smaller subset of users.

Maybe we could limit this to X11 users, since it seems to be limited to X11. This would require adding a x11 feature to bevy_render though which is why I left it out. At this point we are fairly close to fully capture the subset of people who just couldn't run bevy before, so even random graphical bugs would be a net win for them. Linux user are also notorious for pretty good bug reporting and being helpful in diagnosing.

I'm also keeping a close look at the wgpu issue, (gfx-rs/wgpu#1218 and gfx-rs/wgpu#2941) and make sure to revert when it is fixed.

@nicopap nicopap force-pushed the fix-swapchain-timeout-crash branch from b1698fe to 587b33f Compare November 2, 2022 08:38
@nicopap
Copy link
Contributor Author

nicopap commented Nov 5, 2022

ETA on this? Will it get merged before 0.9? A shame to have something ready that makes bevy usable for a few more people and just overlook it.

@coreh
Copy link
Contributor

coreh commented Nov 5, 2022

If we're not entirely sure on the correctness of this, or if other engines take the same approach, could we make it a cargo feature, e.g. ignore-amd-intel-swapchain-timeout, or maybe an environment variable, e.g. BEVY_IGNORE_SWAPCHAIN_TIMEOUT? That way either developers or users can make the call, and we don't postpone the fix until 0.10?

@alice-i-cecile
Copy link
Member

It's in the milestone: just needs a final call from Cart :)

@ghost
Copy link

ghost commented Nov 10, 2022

I've hit this on v0.8.1 Intel Raptor Lake IGP X11+Vulkan while looking into #6417
cargo r --release & sleep 2 && xset dpms force standby makes it easily reproducible for me.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/2849 seems to be the same root issue.

@cart
Copy link
Member

cart commented Nov 12, 2022

bors r+

bors bot pushed a commit that referenced this pull request Nov 12, 2022
# Objective

- Fix #3606
- Fix #4579
- Fix #3380

## Solution

When running on a Linux machine with some AMD or Intel device, when calling
`surface.get_current_texture()`, ignore `wgpu::SurfaceError::Timeout` errors.


## Alternative

An alternative solution found in the `wgpu` examples is:

```rust
let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));
```

See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370>

Veloren [handles the Timeout error the way this PR proposes to handle it](gfx-rs/wgpu#1218 (comment)).

The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.


Co-authored-by: Carter Anderson <mcanders1@gmail.com>
@bors bors bot changed the title Ignore Timeout errors on Linux AMD & Intel [Merged by Bors] - Ignore Timeout errors on Linux AMD & Intel Nov 12, 2022
@bors bors bot closed this Nov 12, 2022
ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
# Objective

- Fix bevyengine#3606
- Fix bevyengine#4579
- Fix bevyengine#3380

## Solution

When running on a Linux machine with some AMD or Intel device, when calling
`surface.get_current_texture()`, ignore `wgpu::SurfaceError::Timeout` errors.


## Alternative

An alternative solution found in the `wgpu` examples is:

```rust
let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));
```

See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370>

Veloren [handles the Timeout error the way this PR proposes to handle it](gfx-rs/wgpu#1218 (comment)).

The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.


Co-authored-by: Carter Anderson <mcanders1@gmail.com>
@nicopap nicopap deleted the fix-swapchain-timeout-crash branch August 30, 2023 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Bug An unexpected or incorrect behavior O-Linux Specific to the Linux desktop operating system P-Crash A sudden unexpected crash X-Controversial There is active debate or serious implications around merging this PR
Projects
None yet