[Merged by Bors] - Ignore `Timeout` errors on Linux AMD & Intel #5957
@mdickopp @tirithen @alexpyattaev @bobhenkel @chiboreache @paullouisageneau @popojan @etam You seem to have hit this bug. Could a benevolent soul try out this patch?
Bevy no longer seems to crash on Wayland for me, so I can finally disable XWayland in orichalcum.
I tried the method from #4579, but didn't get any errors. Is there a reliable minimal way to reproduce? I am running X11 on an AMD XT6700 with 0.9.0-dev.
I am sorry, in the meantime I switched to Wayland/sway and can no longer easily reproduce the problem. When I tried to use nicopap's revision as a dependency and recompile, I got an error stating
Please note that I can reproduce both #3380 and #4579 on an Intel device, so I do not think they are specific to AMD devices.
Keeping this as a draft until the workaround is expanded to Intel devices. I also happen to have an Intel GPU handy, so I might be able to test as well.
Hmm, I can't reproduce on my Whiskey Lake Intel iGPU. It looks like you are using a Broadwell, which is very common. I have a Broadwell CPU somewhere, but I can't test on it at the moment, as the motherboard is sitting in a cardboard box without peripherals.
Title changed from "`Timeout` errors on Linux AMD" to "`Timeout` errors on Linux AMD & Intel".
Branch updated from 5896bc2 to a14a344.
I wonder if it wouldn't be better to ignore timeouts for all cards and drivers, but only for a limited time. That is, ignore intermittent ones. That way we can still catch degraded application/driver state: when the driver has been returning timeouts for a whole second or two, something is clearly amiss. At the same time, delaying the panic on non-problematic configurations seems benign. We can even make the timeout configurable in case some valiant gamer tries to run things on a potato or something. Bonus: a message like "Graphics driver returned timeouts for X seconds" points end users squarely at the issue.
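To make the proposal above concrete, here is a minimal sketch of such a grace period. It is not code from this PR: `Acquire`, `Action`, and `TimeoutGrace` are hypothetical names, and `Acquire` is only a stand-in for the result of acquiring a frame, so the example compiles without `wgpu`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the outcome of one frame acquisition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Acquire {
    Frame,
    Timeout,
}

/// What the render loop should do after one acquisition attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Present,
    SkipFrame,
    GiveUp,
}

/// Tolerates timeouts until no acquisition has succeeded for `grace`.
struct TimeoutGrace {
    grace: Duration,
    last_success: Instant,
}

impl TimeoutGrace {
    fn new(grace: Duration, now: Instant) -> Self {
        Self { grace, last_success: now }
    }

    /// `now` is passed in explicitly so the logic can be tested without sleeping.
    fn observe(&mut self, result: Acquire, now: Instant) -> Action {
        match result {
            Acquire::Frame => {
                // Any successful frame resets the grace period.
                self.last_success = now;
                Action::Present
            }
            // Timeouts have persisted for the whole grace period: report failure.
            Acquire::Timeout if now.duration_since(self.last_success) >= self.grace => {
                Action::GiveUp
            }
            // An intermittent timeout: drop this frame and try again.
            Acquire::Timeout => Action::SkipFrame,
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut grace = TimeoutGrace::new(Duration::from_secs(2), start);
    let ms = Duration::from_millis;
    println!("{:?}", grace.observe(Acquire::Timeout, start + ms(500))); // SkipFrame
    println!("{:?}", grace.observe(Acquire::Frame, start + ms(1000))); // Present
    println!("{:?}", grace.observe(Acquire::Timeout, start + ms(3500))); // GiveUp
}
```

Note that a driver which times out on every single frame never resets `last_success`, so this sketch would still end in `GiveUp` for that configuration.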
I'm interested in this solution; the less hardcoded special-casing the better.
@ksf For me personally, the timeout happens every frame, despite the frame clearly drawing in less than the actual timeout, so your proposed solution wouldn't work (the frame draws in well below 16 ms, while the timeout is a full second).
It might be possible to not special-case it, which is, from what I understand, how Veloren does it. I guess I was worried that I would break other assumptions. As far as I know, it shouldn't break anything, but "if it works on Linux it works on Windows" is not a sentence I've heard many times… So I kept the change conservative. I'd be happy to remove the Anyway, maybe the fact we log on Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority; getting fancy with it can wait IMO (maybe open an issue once this is merged?)
> Remember please that this workaround fixes a bug that prevents people from using bevy at all, so getting it in at all should be a priority, getting fancy with it can wait IMO (maybe open an issue once this is merged?)
Yep, you're right there. I'm on board with this fix as a solution, even if it is temporary.
@nicopap In my case the timeout happens during configure events, things like resizing, which is why I assumed it was an intermittent issue. But in hindsight, if the user is drag-resizing the window for a second and every request times out, the "last successful" time can't get reset and my solution would still panic. I'm not sure whether all requests time out in my case; I would have to investigate (currently, at the state of development of my code (early) I said "meh" and hard-coded the present mode to I definitely agree that your solution is better than just crashing, and even when things get more well-behaved upstream we'll have to support older drivers (though at some point I'd say it becomes sensible to tell people to upgrade or disable VSync/eat the performance hit).
I re-tested your latest commit (a14a344) on my Intel system, and can confirm that it fixes the bugs for me.
This looks controversial. I hear the reasoning. I guess ideally it would be fixed in Mesa, but even then it will take time for new drivers to roll out. Still, I'd like to understand it before ignoring the error.
100% agree. I really dislike the fact that I basically don't understand what I'm doing here.
After some digging, I found out that the Vulkan backend of wgpu calls `vkAcquireNextImageKHR` with a timeout of 1 second. A comment in (an older version of) the Chromium source code hints at a bug in X11 and how they worked around it: https://chromium.googlesource.com/chromium/src/+/8ec9935d64c1fcc72d09c2d44ac1dfc0a29514f3/gpu/vulkan/x/vulkan_surface_x11.cc#62
I can reproduce this on an NVIDIA GPU (Linux).
@meisme-dev Can you give more precise system specs, notably kernel version, distro, card model, etc.? The fact that it doesn't work with both Fifo and immediate mode tells me it might be an unrelated issue. See the issue template for how to get the specs: https://github.com/bevyengine/bevy/blob/main/.github/ISSUE_TEMPLATE/bug_report.md
Kernel: 5.15.74
On the one hand: I think it's worth hacking around this quirk if we can, in the interest of getting Bevy running on more computers. Panicking at startup (or intermittently) is a high-priority bug to fix. We have multiple people saying that this works, and Veloren successfully using it is a reasonable indicator that it works. On the other hand, it feels important to understand what is happening here. There's a chance that doing the wrong thing here will introduce ghosts in the system, hard-to-debug issues, unnecessary screen "flashing" as timeouts occur, etc.
I'm going to re-add this to the 0.9 milestone, just so we can make a final call on this if the conversation progresses.
To gain a better understanding of the situation, I could investigate further how other projects deal with the timeout in If you close this issue for 0.9 and open a followup issue, kindly mention me in the new one, so that I don't forget it.
Alternative: a solution found in the `wgpu` examples is:

```rust
let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));
```

See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370>

The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.
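The cost argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FlakySurface` is a hypothetical stand-in, not a `wgpu` type, and it models a driver whose first acquisition attempt times out on every frame, as reported earlier in this thread.

```rust
/// Hypothetical surface whose first acquisition attempt fails on every frame.
struct FlakySurface {
    attempts: u32,
    configure_calls: u32,
}

impl FlakySurface {
    fn get_current_texture(&mut self) -> Result<u32, &'static str> {
        self.attempts += 1;
        // Odd-numbered attempts fail, modelling one spurious `Timeout` per frame.
        if self.attempts % 2 == 1 {
            Err("Timeout")
        } else {
            Ok(self.attempts)
        }
    }

    /// Stand-in for the expensive reconfiguration step; here it is only counted.
    fn configure(&mut self) {
        self.configure_calls += 1;
    }
}

/// The retry-after-reconfigure pattern from the `wgpu` example quoted above.
fn acquire_with_reconfigure(surface: &mut FlakySurface) -> u32 {
    match surface.get_current_texture() {
        Ok(frame) => frame,
        Err(_) => {
            surface.configure();
            surface
                .get_current_texture()
                .expect("Error reconfiguring surface")
        }
    }
}

fn main() {
    let mut surface = FlakySurface { attempts: 0, configure_calls: 0 };
    for _ in 0..60 {
        acquire_with_reconfigure(&mut surface);
    }
    // prints "configure calls in 60 frames: 60"
    println!("configure calls in 60 frames: {}", surface.configure_calls);
}
```

Under this model the reconfiguration runs once per frame, which is the overhead the PR avoids by ignoring the timeout instead.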
Branch updated from a14a344 to b1698fe.
I've resolved the conflicts with main now. I limit the change strictly to Linux and hackishly restrict it to AMD/Intel GPUs, so that the risk of causing unexpected issues is limited to a smaller subset of users. Maybe we could limit this to X11 users, since the problem seems to be limited to X11. This would require adding a I'm also keeping a close eye on the wgpu issues (gfx-rs/wgpu#1218 and gfx-rs/wgpu#2941) and will make sure to revert this when they are fixed.
Branch updated from b1698fe to 587b33f.
ETA on this? Will it get merged before 0.9? It would be a shame to have something ready that makes bevy usable for a few more people and just overlook it.
If we're not entirely sure about the correctness of this, or whether other engines take the same approach, could we make it a cargo feature, e.g.
It's in the milestone: just needs a final call from Cart :)
I've hit this on v0.8.1 with an Intel Raptor Lake IGP on X11+Vulkan while looking into #6417.
bors r+
# Objective

- Fix #3606
- Fix #4579
- Fix #3380

## Solution

When running on a Linux machine with some AMD or Intel device, when calling `surface.get_current_texture()`, ignore `wgpu::SurfaceError::Timeout` errors.

## Alternative

An alternative solution found in the `wgpu` examples is:

```rust
let frame = surface
    .get_current_texture()
    .or_else(|_| {
        render_device.configure_surface(surface, &swap_chain_descriptor);
        surface.get_current_texture()
    })
    .expect("Error reconfiguring surface");
window.swap_chain_texture = Some(TextureView::from(frame));
```

See: <https://github.com/gfx-rs/wgpu/blob/94ce76391b560a66e36df1300bd684321e57511a/wgpu/examples/framework.rs#L362-L370>

Veloren [handles the Timeout error the way this PR proposes to handle it](gfx-rs/wgpu#1218 (comment)).

The reason I went with this PR's solution is that `configure_surface` seems to be quite an expensive operation, and it would run every frame with the wgpu framework solution, despite the fact it works perfectly fine without `configure_surface`.

I know this looks super hacky with the linux-specific line and the AMD check, but my understanding is that the `Timeout` occurrence is specific to a quirk of some AMD drivers on linux, and if otherwise met should be considered a bug.

Co-authored-by: Carter Anderson <mcanders1@gmail.com>
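The decision described under Solution can be sketched roughly as follows. This is an illustration, not the code merged in this PR: the local `SurfaceError` enum is a stand-in so the example compiles without the `wgpu` crate, and `may_erroneously_timeout`, `on_acquire_error`, the substring check on the adapter name, and the adapter name strings are all assumptions made for the example.

```rust
/// Stand-in for `wgpu::SurfaceError`, so the sketch compiles without `wgpu`.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SurfaceError {
    Timeout,
    Outdated,
    Lost,
    OutOfMemory,
}

/// What to do when acquiring the next frame fails.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FrameAction {
    /// Skip rendering to this window for the current frame.
    SkipFrame,
    /// Treat the error as a bug and abort.
    Panic,
}

/// Hypothetical heuristic: on Linux, adapters whose name mentions AMD or Intel
/// are assumed to be affected by the spurious timeout.
fn may_erroneously_timeout(is_linux: bool, adapter_name: &str) -> bool {
    let name = adapter_name.to_ascii_lowercase();
    is_linux && (name.contains("amd") || name.contains("intel"))
}

/// Ignore `Timeout` only on the affected configurations; panic otherwise.
fn on_acquire_error(err: SurfaceError, is_linux: bool, adapter_name: &str) -> FrameAction {
    match err {
        SurfaceError::Timeout if may_erroneously_timeout(is_linux, adapter_name) => {
            FrameAction::SkipFrame
        }
        _ => FrameAction::Panic,
    }
}

fn main() {
    // `cfg!` evaluates to a compile-time boolean for the current target.
    let is_linux = cfg!(target_os = "linux");
    // Illustrative adapter name, not taken from a real report.
    let action = on_acquire_error(SurfaceError::Timeout, is_linux, "AMD Radeon Graphics");
    println!("{:?}", action);
}
```

Keeping the check narrow means a `Timeout` on any other platform or vendor still surfaces as an error, which matches the intent of limiting the workaround to the configurations known to be affected.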