wgpu may lock up in ioctl on Linux/Vulkan/Intel #1672
Not sure whether this is the same issue, but I've recently tracked down something similar in the same environment (Linux (Wayland) / Vulkan / Intel). My code is based on the Vulkan tutorial, so you can just take e.g. this repo. In my case (and also with the linked repo) the issue is always quickly reproducible by having a YouTube video running in the background (in another Sway workspace) and switching window focus/workspaces a couple of times. The crux, though, is that the hang only happens under a particular setting. Backtrace on my machine (the hang is inside an ioctl call):
|
On the current master branch (11d31d5), if I duplicate the following section of code from the hello-compute example,

```rust
let mut encoder =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
{
    let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor { label: None });
    cpass.set_pipeline(&compute_pipeline);
    cpass.set_bind_group(0, &bind_group, &[]);
    cpass.insert_debug_marker("compute collatz iterations");
    // Number of cells to run, i.e. the (x, y, z) size of the item being processed
    cpass.dispatch(numbers.len() as u32, 1, 1);
}
// Adds a copy operation to the command encoder.
// Will copy data from the storage buffer on the GPU to the staging buffer on the CPU.
encoder.copy_buffer_to_buffer(&storage_buffer, 0, &staging_buffer, 0, size);
// Submits the command encoder for processing.
queue.submit(Some(encoder.finish()));
```

such that it appears twice in sequence in the file, then the example hangs. In my own project, I found that all tests where I tried to run two (as opposed to one) compute passes on the device would hang at the end of the test, when wgpu resources are being dropped. It seemed like there was a deadlock. I don't know whether this is the same issue or a different issue. My environment is Linux (Ubuntu 21.04, with Wayland and kernel 5.11.0-25-generic), with the following device adapter info:
|
I did some further investigation into my comment above. First of all, the problem does not occur in every configuration. Attached are trace-level logs generated on the master branch (commit 7798534), both for the unmodified version of the example and for the modified version that hangs. The interesting parts of the logs seem to be near the end. In the example that hangs:
The command buffers that cannot be freed are those created during the second compute pass, as far as I can tell from using a debugger. There are a few details I do not understand:
Unfortunately I am not sufficiently familiar with the codebase to know the source of the problem. |
I realise my problem is probably this issue: #1689 |
Thank you for the investigation. Your issue is certainly easier to reproduce. |
Ok, I don't know how to approach this. Tried on 3 different machines: Linux/NV, Linux/AMD, and Windows/Intel, all on Vulkan, with no luck reproducing any bad behavior of the example as given. |
Just as a datapoint: I can reproduce @bllanos's issue on Linux/Intel ( |
I am not sure how to update the Vulkan validation layers beyond the version provided by the default Ubuntu repositories. I updated the driver package. The issue still occurs on the latest master branch (f0520f8) with the updated driver. |
@bllanos you can update by getting the SDK here: https://vulkan.lunarg.com/. |
I am seeing a similar hang in ioctl, although I cannot reproduce @bllanos's version with hello-compute, so I'm not sure if I'm seeing the same bug. I'm also on Linux with the Mesa driver. Here's a minimized version:

```rust
use winit::{
    event::{Event, WindowEvent},
    event_loop::ControlFlow,
};

struct Framework {
    device: wgpu::Device,
    queue: wgpu::Queue,
    sc_desc: wgpu::SurfaceConfiguration,
    surface: wgpu::Surface,
}

impl Framework {
    async fn new(window: &winit::window::Window) -> Framework {
        let backend = wgpu::util::backend_bits_from_env().unwrap_or(wgpu::Backends::PRIMARY);
        let instance = wgpu::Instance::new(backend);
        let size = window.inner_size();
        let surface = unsafe { instance.create_surface(window) };
        let adapter = wgpu::util::initialize_adapter_from_env_or_default(&instance, backend)
            .await
            .expect("No suitable GPU adapters found on the system!");
        let features = wgpu::Features::default();
        let trace_dir = std::env::var("WGPU_TRACE");
        let limits = adapter.limits();
        let (device, queue) = adapter
            .request_device(
                &wgpu::DeviceDescriptor {
                    features,
                    limits,
                    label: None,
                },
                trace_dir.ok().as_ref().map(std::path::Path::new),
            )
            .await
            .expect("Unable to find a suitable GPU adapter!");
        let format = surface.get_preferred_format(&adapter).unwrap();
        let sc_desc = wgpu::SurfaceConfiguration {
            usage: wgpu::TextureUsages::RENDER_ATTACHMENT,
            format,
            width: size.width,
            height: size.height,
            present_mode: wgpu::PresentMode::Mailbox,
        };
        // Removing this command encoder fixes the hang
        let init_encoder =
            device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
        queue.submit(Some(init_encoder.finish()));
        Framework {
            device,
            queue,
            sc_desc,
            surface,
        }
    }

    fn handle_event<T>(
        &mut self,
        window: &winit::window::Window,
        event: Event<'_, T>,
        control_flow: &mut ControlFlow,
    ) {
        match event {
            Event::MainEventsCleared => {
                window.request_redraw();
            }
            Event::WindowEvent {
                event: WindowEvent::Resized(size),
                ..
            } => {
                self.sc_desc.width = if size.width == 0 { 1 } else { size.width };
                self.sc_desc.height = if size.height == 0 { 1 } else { size.height };
                self.surface.configure(&self.device, &self.sc_desc)
            }
            Event::WindowEvent {
                event: WindowEvent::CloseRequested,
                ..
            } => *control_flow = ControlFlow::Exit,
            Event::RedrawRequested(_) => {
                let surface = &self.surface;
                let frame = match surface.get_current_frame() {
                    Ok(frame) => frame,
                    Err(_) => {
                        self.surface.configure(&self.device, &self.sc_desc);
                        surface
                            .get_current_frame()
                            .expect("Failed to acquire next surface texture!")
                    }
                };
                let encoder = self
                    .device
                    .create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
                self.queue.submit(std::iter::once(encoder.finish()));
                dbg!("Hangs after here");
            }
            _ => {}
        }
        dbg!("Hangs before here");
    }
}

fn main() {
    let event_loop = winit::event_loop::EventLoop::new();
    let mut builder = winit::window::WindowBuilder::new();
    builder = builder.with_title("test");
    #[cfg(windows_OFF)]
    {
        use winit::platform::windows::WindowBuilderExtWindows;
        builder = builder.with_no_redirection_bitmap(true);
    }
    let window = builder.build(&event_loop).unwrap();
    let mut framework = pollster::block_on(Framework::new(&window));
    framework
        .surface
        .configure(&framework.device, &framework.sc_desc);
    event_loop.run(move |event, _, control_flow| {
        *control_flow = if cfg!(feature = "metal-auto-capture") {
            ControlFlow::Exit
        } else {
            ControlFlow::Poll
        };
        framework.handle_event(&window, event, control_flow);
    });
}
```
|
I can reproduce @tfgast's issue (and its workaround, when I comment out the encoder as mentioned in the code sample). @tfgast "my" issue seems more like #1689, but I started the conversation on this thread before I came to that conclusion. @cwfitzgerald I set up the LunarG SDK as described at https://vulkan.lunarg.com/doc/sdk/1.2.182.0/linux/getting_started.html (skipping the step on copying files to system directories, which I think is not necessary based on the instructions given by ash's author here). I have the following environment variables in my shell startup file:
I re-tested my issue on commit 8f02b73, running with logging enabled to generate the log file in the attachment. The attachment includes the example's logging output, trace files, the Rust code I was running (mentioned above), and additional diagnostic output. |
Looks similar to #1878
|
For anyone who can reproduce this, do you have a dual-GPU configuration with NVidia by any chance? |
I wonder if it's related to #1898 |
I do have a dual-GPU with NVidia. |
I only have Intel integrated graphics. |
This is to work around a problem in wgpu on Intel GPUs which causes hanging while dropping resources. More info:
- gfx-rs/wgpu#1877
- gfx-rs/wgpu#1672
I only have an integrated GPU, and it's possible to reproduce it by changing the |
I think I might have run into this while updating conrod. Here's the backtrace from gdb at the moment of hanging:
Here's the WIP PR of the update: PistonDevelopers/conrod#1436. No major changes other than replacing the old GLSL and pre-compiled SPIR-V shaders with WGSL (translated from the old GLSL shaders using current tooling). I tried running with validation layers enabled, though I received no extra output. I'm not 100% sure I have these installed though! Anyone know offhand if there's an easy way to check on NixOS? I also tried each of the different present modes in case something other than FIFO worked, but no luck there. I'm on NixOS + Gnome + Wayland + Intel Xe Graphics (only integrated). |
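For anyone else wanting to try this: the exact command used above isn't shown, but a common way to force the Khronos validation layer on Linux (assuming the layer is actually installed, e.g. via a distro package or the LunarG SDK) is the VK_INSTANCE_LAYERS environment variable:

```shell
# Hedged example, not the commenter's exact invocation: force-enable the
# Khronos validation layer for any Vulkan app launched from this shell.
# Requires the validation layer to actually be installed on the system.
export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation

# Sanity check that the variable is set before running the app:
echo "$VK_INSTANCE_LAYERS"
```

Whether the layer is present at all can be checked with `vulkaninfo`, which lists the installed instance layers.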
This definitely looks related to #1673. @mitchmindtree could you run wgpu-rs examples from master on your system? |
Yes the examples on master appear to work well, I also tried with the commit that published wgpu 0.10.1 (what I'm updating to in |
Could you be making multiple submissions per frame? Try making a single submission, just for experiment. |
Yep that seems to be it! There's an extra submission before the event loop runs to load the single image that's used in the example - if I remove that submission and let the command be submitted along with the rest of the first frame's command buffer (so that there's only one submission), the example seems to run perfectly.
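The workaround pattern can be sketched as follows (a hedged illustration with plain Rust types standing in for wgpu's `Queue` and `CommandBuffer`, not code from nannou or wgpu): instead of submitting each command buffer as soon as it is recorded, hold the buffers back and hand them to the driver in a single submit call at the end of the frame.

```rust
// Hypothetical sketch of the "one submission per frame" workaround.
// `CommandBuffer` and `Queue` are stand-ins for wgpu's types; the point is
// that work recorded at arbitrary times during a frame reaches the driver
// in exactly one submit call.
struct CommandBuffer(&'static str);

struct Queue {
    submissions: usize, // how many times we hit the driver
}

impl Queue {
    fn submit(&mut self, buffers: Vec<CommandBuffer>) -> usize {
        self.submissions += 1;
        buffers.len()
    }
}

struct FrameBatcher {
    pending: Vec<CommandBuffer>,
}

impl FrameBatcher {
    fn record(&mut self, cb: CommandBuffer) {
        // Instead of submitting immediately (which on the affected Intel
        // driver can mean several submissions per frame), stash the buffer.
        self.pending.push(cb);
    }

    fn end_frame(&mut self, queue: &mut Queue) -> usize {
        // One driver submission per frame, no matter how many command
        // buffers were recorded.
        queue.submit(std::mem::take(&mut self.pending))
    }
}

fn main() {
    let mut queue = Queue { submissions: 0 };
    let mut batcher = FrameBatcher { pending: Vec::new() };
    batcher.record(CommandBuffer("upload image"));
    batcher.record(CommandBuffer("draw frame"));
    let submitted = batcher.end_frame(&mut queue);
    // prints: 2 buffers in 1 submission(s)
    println!("{} buffers in {} submission(s)", submitted, queue.submissions);
}
```

With real wgpu this corresponds to collecting the finished command buffers and passing them all to a single `queue.submit` call, as described in the comment above.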
|
Adds a collection of features to `nannou_wgpu` that match those exposed by `wgpu` itself. The `spirv` feature in particular will be useful for users still using `.spv` shaders (as opposed to the new default `.wgsl` format). Rewrites the `Texture` short-hand constructors to use `Queue::write_texture` rather than using copy commands and submitting command buffers directly. This is to avoid an Intel driver bug where submitting more than one command buffer per frame appears to be causing issues: gfx-rs/wgpu#1672 (comment) While this likely means consuming more RAM, it also likely results in slightly better performance due to reducing the number of command buffers submitted.
I have Intel Xe Graphics, and I see this issue myself. |
Considering this fixed by #2212. Edit: to clarify, #2212 is a workaround for systems that haven't picked up the fix for https://gitlab.freedesktop.org/mesa/mesa/-/issues/5508
Description

Under certain scenarios, we may see a hang in `ioctl`.

Repro steps

Unknown.

Expected vs observed behavior

No hangs.

Extra materials

Looks like this is described in https://www.reddit.com/r/vulkan/comments/b37762/command_queue_grows_indefinitely_on_intel_gpus/

Edit: actually, no, we aren't expecting `vkAcquireNextImageKHR` to block. We are explicitly blocking on the fence, which was passed to it, instead.

Platform

wgpu master
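The distinction drawn in the edit above can be modeled like this (a hedged analogy using std::sync channels, not actual Vulkan calls): the acquire call returns immediately together with a fence-like handle, and it is the explicit wait on that handle that blocks, not the acquire itself.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Analogy for vkAcquireNextImageKHR + an explicit fence wait: `acquire`
// returns immediately with a "fence" (here, a channel receiver) that is
// signaled later; the caller only blocks when it explicitly waits on it.
fn acquire_next_image() -> (u32, mpsc::Receiver<()>) {
    let (tx, rx) = mpsc::channel();
    // The "GPU" signals the fence a little later.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        let _ = tx.send(());
    });
    (0, rx) // image index 0 plus the not-yet-signaled fence
}

fn main() {
    let (image_index, fence) = acquire_next_image(); // returns immediately
    fence.recv().unwrap(); // the explicit, blocking fence wait
    // prints: acquired image 0
    println!("acquired image {}", image_index);
}
```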