Bindless Tracking Issue #3637

Open
5 of 31 tasks
kanerogers opened this issue Apr 2, 2023 · 15 comments
Labels
area: api Issues related to API surface feature: bindless Issues with Bindless Native Feature type: enhancement New feature or request

Comments

@kanerogers

kanerogers commented Apr 2, 2023

Overview

This issue tracks enabling "bindless" functionality across the various native backends.

For a high-level guide to what we believe the bindless API should look like, see https://hackmd.io/@cwfitzgerald/wgpu-bindless
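For context, here is a rough sketch of the shape binding arrays already take in shader code. This is not taken from the proposal; it uses wgpu's native-only WGSL extension (gated on `Features::TEXTURE_BINDING_ARRAY`, plus the non-uniform-indexing feature if `index` varies per fragment):

```wgsl
// Illustrative sketch only. Requires wgpu's TEXTURE_BINDING_ARRAY native
// feature; dynamically non-uniform `index` additionally requires
// SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING.
@group(0) @binding(0) var textures: binding_array<texture_2d<f32>>;
@group(0) @binding(1) var samp: sampler;

@fragment
fn fs_main(
    @location(0) uv: vec2<f32>,
    @location(1) @interpolate(flat) index: u32,
) -> @location(0) vec4<f32> {
    // One big array of textures, indexed per draw/instance/fragment
    // instead of rebinding a bind group per material.
    return textureSampleLevel(textures[index], samp, uv, 0.0);
}
```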

Binding Array Support

Partially bound descriptors

Validation

GPU Validation

  • Validate Binding Array Access is in Bounds on Vulkan
  • Validate Binding Array Access is in Bounds on Metal
  • Validate Binding Array Access is in Bounds on DX12

Sparse Bind Groups

  • Implement Validation Resource Metadata Buffer on Vulkan
  • Implement Validation Resource Metadata Buffer on Metal
  • Implement Validation Resource Metadata Buffer on DX12

Mutable Bind Groups

  • Implement BindGroup::update_bindings Without Holes

Read Only Resources

  • Implement Texture::set_usages
  • Implement Buffer::set_usages

Temporary Removal

  • Implement prototype temporary remove API. Depends on "Sparse Bind Groups"

Driver Bugs(?)

@John-Nagle

Bindless mode means you have a big table of all the texture objects in the GPU. Each of those texture objects is a GPU buffer. Buffers either belong to the GPU, and the CPU can't touch them, or they belong to the CPU, and the GPU can't touch them.

So the big table of texture object descriptors and the GPU buffer states have to be kept in sync. At the Vulkan level, this problem belongs to the program calling Vulkan. Vulkano, which is supposed to be a safe Rust interface to Vulkan, has machinery for maintaining that table. But WGPU doesn't have that machinery, so it can't do bindless yet, at least not safely.

GPU/CPU buffer ownership and the bindless descriptor table need to be managed together.

All the targets WGPU currently supports seem to offer bindless mode. Even OpenGL has offered it as an extension since 2013, though you need OpenGL 4 with extensions. Everything in the current release seems to have full support now, via Vulkan, Metal, or OpenGL.

The future is bindless. Unreal Engine is now bindless-only, I think.

@SK83RJOSH

Unreal Engine is now bindless-only, I think.

It is not, in fact. Bindless resources are largely experimental and only enabled by default under Vulkan. They are optionally enabled under DX12 when using SM6 + ray tracing. There's still ongoing work to add support for bindless resources to shader graph as well.

Aside from that, it seems very much like there are targets they're interested in that do not support bindless, so many materials will likely still need to implement both paths.

@cwfitzgerald
Member

I have filled out the above issue with the current plan for bindless and the work that has been done previously.

@John-Nagle
Copy link

Right. I see more of the problems now. Looking at Bindless Investigation and Proposal, it's clear that driver-controlled residency and big arrays of bindless descriptors do not play well together. This is a non-problem for Vulkan, where all assets must be resident in GPU memory. For Metal, there is a residency control API. Not sure what the plan is for WebGPU, since that's still being defined.

I look at this from the viewpoint of needing game-type performance on large scenes, for target machines comparable to what the average Steam user has. In my own applications, I'm managing residency at the application level, where I switch textures to lower resolutions when memory is tight. Rejection of a buffer allocation request is a normal event which results in LOD reductions. For the Vulkan case, this substitutes for driver initiated eviction.

I've mostly looked at the Vulkan case, where VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT is available and one big array of descriptors is possible. When everything is resident, that leads to simple implementations which don't require barriers during drawing, while allowing concurrent updating. I've written some design materials on that. The basic concept is that the wrapper level (WGPU) takes ownership of a buffer when it is in the descriptor table, which protects it from being deleted while rendering is in progress.

The cram job required to run on smaller targets makes things far more complicated. Driver-initiated eviction really complicates things. Can that be handled without crippling performance on the more powerful targets?

@cwfitzgerald
Member

To clarify, none of these APIs deal with residency at all. The assumption is that all resources bound to a bind group are always resident. Yes, we need to prevent Metal from making resources non-resident, but with residency sets this should be easy.

When everything is resident, that leads to simple implementations which don't require barriers during drawing, while allowing concurrent updating

Barrier generation has basically nothing to do with residency. The purpose of the buffer.set_usages API is to allow wgpu to avoid checking whether every single resource needs a barrier. Resources with only read-only usages can be ignored and just bound indirectly as part of the binding array.

@John-Nagle

OK, if everything is resident, things are simpler.

So, as you see bindless, what has to be done on a per-draw basis? That's where the overhead comes from. Is there checking that has to be done at each draw, or can the checking be hoisted to once per frame, or once per texture change, or to compile time?

@John-Nagle

If an allocator is needed for descriptor slots, feel free to use this one I wrote. It's lock-free.
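For readers unfamiliar with the idea, a descriptor-slot allocator hands out indices into the big descriptor array and recycles them when a slot is vacated. The following is a minimal sketch of that interface (hypothetical names; John's linked crate is lock-free, whereas this illustration uses a mutex-guarded free list for brevity):

```rust
use std::sync::Mutex;

/// Sketch of a descriptor-slot allocator: hands out indices into a
/// fixed-capacity descriptor array and recycles freed ones.
pub struct SlotAllocator {
    capacity: u32,
    state: Mutex<SlotState>,
}

struct SlotState {
    next_fresh: u32,     // high-water mark: slots never handed out yet
    free_list: Vec<u32>, // slots returned by `free`, reused LIFO
}

impl SlotAllocator {
    pub fn new(capacity: u32) -> Self {
        Self {
            capacity,
            state: Mutex::new(SlotState { next_fresh: 0, free_list: Vec::new() }),
        }
    }

    /// Returns a slot index, or None if the descriptor table is full.
    pub fn alloc(&self) -> Option<u32> {
        let mut s = self.state.lock().unwrap();
        if let Some(slot) = s.free_list.pop() {
            return Some(slot);
        }
        if s.next_fresh < self.capacity {
            let slot = s.next_fresh;
            s.next_fresh += 1;
            Some(slot)
        } else {
            None
        }
    }

    /// Returns a slot to the pool. The caller must ensure the GPU is no
    /// longer reading the descriptor that occupied this slot.
    pub fn free(&self, slot: u32) {
        self.state.lock().unwrap().free_list.push(slot);
    }
}
```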

@cwfitzgerald
Member

If an allocator is needed for descriptor slots, feel free to use this one I wrote. It's lock-free.

This would be delegated to the user, but it's good to have in case a user needs it.

@John-Nagle

John-Nagle commented Dec 13, 2024

Ah. You may not want to delegate that to the user. WGPU already has a buffer allocator, and descriptor slots and buffers need to be closely coordinated. I've been looking at designs that work something like this:

  • Application/renderer layer requests a descriptor slot. They get back an opaque handle.
  • Application/renderer layer requests a texture buffer with data. Buffer is created and filled.
  • Application/renderer layer attaches the buffer to the descriptor slot handle. Descriptor manager takes ownership of the buffer, preventing it from being changed or deleted by the application. This queues insertion into the descriptor array.
  • Dropping the descriptor slot handle queues deletion from the descriptor array.
  • All descriptor array updates are processed at a point in the render cycle where the GPU is not rendering.

The idea is to use Rust ownership to manage most of the interlocking. If you let the application mess with the descriptor array, you need more checking in the lower layers. More machinery to implement.

Bindless mode exists to improve performance by spending less time doing binding. It's only useful if it provides a big reduction in binding overhead.
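A minimal sketch of that ownership scheme, using stand-in types (`DescriptorManager`, `SlotHandle`, and a byte vector in place of a real texture; none of this is wgpu API):

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the wrapper-level descriptor manager described above.
struct DescriptorManager {
    // Slots queued for removal, drained at a point in the render cycle
    // where the GPU is not rendering.
    pending_removals: Mutex<Vec<u32>>,
}

/// Opaque handle to a descriptor slot. While it lives, the attached
/// resource is owned via the handle and cannot be mutated or deleted by
/// the application; dropping the handle only *queues* removal.
struct SlotHandle {
    slot: u32,
    resource: Option<Arc<Vec<u8>>>, // stand-in for a texture buffer
    manager: Arc<DescriptorManager>,
}

impl SlotHandle {
    /// Attach a filled buffer to this slot. Taking the Arc is what
    /// "takes ownership" means: the application retains no mutable
    /// access to the data while it sits in the descriptor array.
    fn attach(&mut self, resource: Arc<Vec<u8>>) {
        self.resource = Some(resource);
    }
}

impl Drop for SlotHandle {
    fn drop(&mut self) {
        // Queue deletion from the descriptor array; the real design
        // would keep the resource alive until this queue is processed
        // at a GPU-idle point.
        self.manager.pending_removals.lock().unwrap().push(self.slot);
    }
}
```

The point of the sketch is that the interlocking falls out of Rust ownership: the application cannot reach the buffer while it is bound, and removal is deferred to a safe point rather than happening at an arbitrary time.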

Comments?

@cwfitzgerald cwfitzgerald added the feature: bindless Issues with Bindless Native Feature label Dec 14, 2024
@magcius
Contributor

magcius commented Dec 15, 2024

So, as you see bindless, what has to be done on a per-draw basis? That's where the overhead comes from. Is there checking that has to be done at each draw, or can the checking be hoisted to once per frame, or once per texture change, or to compile time?

Nothing needs to be done on a per-draw basis (unless you count the use of our indirection buffers). We designed it so that if the user uses the API properly (that is, marking resources as read-only when the application will treat them as read-only), there's very little cost to the validation. Assuming we don't find a flaw in our plan, that is :)

For read-write bindless resources, there is some validation that needs to be done at set_bind_group time, and some barriers might need to be emitted at submit time, but we expect the number of read-write resources to be relatively small.

Ah. You may not want to delegate that to the user.

However, we discussed several ideas in the WebGPU WG, and currently believe that leaving bind group indices to the user strikes a nice balance between user flexibility and performance.

The idea is to use Rust ownership to manage most of the interlocking. If you let the application mess with the descriptor array, you need more checking in the lower layers. More machinery to implement.

Your library sounds like a great utility library that we might want to recommend to users of wgpu in Rust!

Note that the bindless proposal we have provided above is perfectly safe and requires no synchronization with the GPU; shadow copies are made under the hood when updating bind groups.

That said, if we think the cost of mutating bind groups through the shadow copy remains too high, we might go the other way and handle slot allocation for the user. Whatever happens, we want to make sure that wgpu remains aligned with the WebGPU WG and specification.

@John-Nagle

if the user uses the API properly

Where is the new bindless API documented? Even if it's not working, I'd like to see how the API is supposed to be used.

@magcius
Contributor

magcius commented Dec 15, 2024

The two proposals (https://hackmd.io/PCwnjLyVSqmLfTRSqH0viA and https://hackmd.io/@cwfitzgerald/wgpu-bindless) should paint a somewhat complete picture of the new API, though you might have to read between the lines a bit. It's meant for implementers.

@John-Nagle

Bind groups are updated on the CPU timeline and update "immediately". ... What this means is that all previous uses of the bind group continue to use the old contents and any new usages use the updated contents.

So the descriptor table is double-buffered? Reasonable. I was thinking in terms of an update queue applied at end of frame, but that's functionally equivalent.

"The CPU timeline" concept needs to be clarified for multi-threaded programs. There are potential locking bottlenecks; as noted, this is weird for multi-threaded code.

Note that this means that every update_bindings call will require us to make a shadow copy of all descriptors in the bind group, and associated tracking data. While we don't expect update_bindings to require a lot of memory compared to buffers and textures, it is still not a cheap operation.

Right. That's why I was thinking in terms of an update queue. Number of changes per frame is probably < 100. Number of descriptors is on the order of 100,000 for a complex scene. But either way will work.

By allowing a resource to be shifted to a read-only state, we let the tracking systems only worry about the resource being alive, not their state. This allows bindless arrays to be bound with very low costs.

Right. Most content is read-only. Read-write content is mostly rendering intermediates.

What prevents freeing a bound buffer? Something has to interlock against that. Will that be an error or an operation that is deferred until it is safe?

Is a shader accessing an unused descriptor slot a problem? It's tempting to initialize all the unused descriptor slots to point to a purple error texture. Something has to check for out-of-range indices in shaders, but if unused slots are harmless, there's no need to check for a valid descriptor.

This can work. Thanks.

@cwfitzgerald
Member

"The CPU timeline" concept needs to be clarified for multi-thread programs. There are potential locking bottlenecks. As noted, this is weird for multi-thread programs.

There's still a single timeline that all threads experience - that is, every thread sees either the state from before the update or the state after it. This is the same for all multi-threaded methods in wgpu.

Right. That's why I was thinking in terms of an update queue. Number of changes per frame is probably < 100. Number of descriptors is on the order of 100,000 for a complex scene.

It's hard to avoid shadow copying something - this could be optimized to as little as 4 bytes per descriptor though depending on implementation strategy. This will need to be driven by profiling.

What prevents freeing a bound buffer? Something has to interlock against that. Will that be an error or an operation that is deferred until it is safe?

The same infrastructure that prevents it today. The bind group owns the texture (as in, it actually just holds an Arc), so as long as we keep the command encoder alive until the GPU is done using it, the texture will stay alive. This is how it happens today.
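The lifetime argument above can be sketched with stand-in types (these are not wgpu's actual structs, just an illustration of the Arc chain):

```rust
use std::sync::Arc;

// Stand-ins: the bind group holds an Arc to the texture, and an
// in-flight command encoder holds an Arc to the bind group. "Freeing"
// the application's handle therefore cannot deallocate the texture
// while the GPU may still be using it.
#[allow(dead_code)]
struct Texture(Vec<u8>);

struct BindGroup {
    texture: Arc<Texture>,
}

struct CommandEncoder {
    bind_group: Arc<BindGroup>,
}
```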

Is a shader accessing an unused descriptor slot a problem?

It will return unspecified (NOT undefined) results, but is valid.

Something has to check for out of range indices in shaders

Yes, we will check against a metadata buffer; see the implementation notes in the spec.

@John-Nagle

It's hard to avoid shadow copying something - this could be optimized to as little as 4 bytes per descriptor though depending on implementation strategy. This will need to be driven by profiling.

Right. All binding updates commit at that copy, which allows for more optimization vs. bind-per-draw. Sounds good.

(Referencing an unused descriptor slot) will return unspecified (NOT undefined) results, but is valid.

I'd suggest mapping unused texture descriptor slots to some built-in error texture, such as the purple often used for areas where nothing was drawn. Then errors become obvious. Mapping to a null handle means nothing is drawn, which is harder to debug.
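Producing such a fallback is cheap. As a sketch, the CPU-side pixel data for a 1x1 magenta error texture is just four bytes; in wgpu it could then be uploaded with `Queue::write_texture` into an `Rgba8UnormSrgb` texture, and every unused slot pointed at its view (the function name here is hypothetical):

```rust
/// Pixel data for a 1x1 RGBA "error texture" in the magenta/purple
/// conventionally used for missing or unmapped content.
fn error_texture_data() -> Vec<u8> {
    const MAGENTA: [u8; 4] = [255, 0, 255, 255]; // R, G, B, A
    MAGENTA.to_vec()
}
```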
