Bindless Tracking Issue #3637

Open
5 of 31 tasks
kanerogers opened this issue Apr 2, 2023 · 15 comments
Labels
area: api Issues related to API surface feature: bindless Issues with Bindless Native Feature type: enhancement New feature or request

Comments

@kanerogers

kanerogers commented Apr 2, 2023

Overview

This issue tracks enabling "bindless" functionality across the various native backends.

For a high-level guide to what we believe the bindless API should look like, see https://hackmd.io/@cwfitzgerald/wgpu-bindless
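For context, here is a rough sketch of the shape binding arrays already take in shader code. This is not taken from the proposal; it uses wgpu's native-only WGSL extension (gated on `Features::TEXTURE_BINDING_ARRAY`, plus the non-uniform-indexing feature if `index` varies per fragment):

```wgsl
// Illustrative sketch only. Requires wgpu's TEXTURE_BINDING_ARRAY native
// feature; dynamically non-uniform `index` additionally requires
// SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING.
@group(0) @binding(0) var textures: binding_array<texture_2d<f32>>;
@group(0) @binding(1) var samp: sampler;

@fragment
fn fs_main(
    @location(0) uv: vec2<f32>,
    @location(1) @interpolate(flat) index: u32,
) -> @location(0) vec4<f32> {
    // One big array of textures, indexed per draw/instance/fragment
    // instead of rebinding a bind group per material.
    return textureSampleLevel(textures[index], samp, uv, 0.0);
}
```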

Binding Array Support

Partially bound descriptors

Validation

GPU Validation

  • Validate Binding Array Access is in Bounds on Vulkan
  • Validate Binding Array Access is in Bounds on Metal
  • Validate Binding Array Access is in Bounds on DX12

Sparse Bind Groups

  • Implement Validation Resource Metadata Buffer on Vulkan
  • Implement Validation Resource Metadata Buffer on Metal
  • Implement Validation Resource Metadata Buffer on DX12

Mutable Bind Groups

  • Implement BindGroup::update_bindings Without Holes

Read Only Resources

  • Implement Texture::set_usages
  • Implement Buffer::set_usages

Temporary Removal

  • Implement prototype temporary remove API. Depends on "Sparse Bind Groups"

Driver Bugs(?)

@John-Nagle

Bindless mode means you have a big table of all the texture objects in the GPU. Each of those texture objects is a GPU buffer. Buffers either belong to the GPU, and the CPU can't touch them, or they belong to the CPU, and the GPU can't touch them.

So the big table of texture object descriptors and the GPU buffer states have to be kept in sync. At the Vulkan level, this problem belongs to the program calling Vulkan. Vulkano, which is supposed to be a safe Rust interface to Vulkan, has machinery for maintaining that table. But WGPU doesn't have that machinery, so it can't do bindless yet, at least not safely.

GPU/CPU buffer ownership and the bindless descriptor table need to be managed together.

All the targets WGPU currently supports seem to offer bindless mode. Even OpenGL has offered it as an extension since 2013, though you need OpenGL 4 with extensions. Everything in the current release seems to have full support now, via Vulkan, Metal, or OpenGL.

The future is bindless. Unreal Engine is now bindless-only, I think.

@SK83RJOSH

Unreal Engine is now bindless-only, I think.

It is not, in fact. Bindless resources are largely experimental and only enabled by default under Vulkan. They are optionally enabled under DX12 when using SM6 + ray tracing. There's still ongoing work to add support for bindless resources to shader graph as well.

Aside from that, it seems very much like there are targets they're interested in that do not support bindless, so many materials will likely still need to implement both paths.

@cwfitzgerald
Member

I have filled out the above issue with the current plan for bindless and the work that has been done previously.

@John-Nagle
Copy link

Right. I see more of the problems now. Looking at Bindless Investigation and Proposal, it's clear that driver-controlled residency and big arrays of bindless descriptors do not play well together. This is a non-problem for Vulkan, where all assets must be resident in GPU memory. For Metal, there is a residency control API. Not sure what the plan is for WebGPU, since that's still being defined.

I look at this from the viewpoint of needing game-type performance on large scenes, for target machines comparable to what the average Steam user has. In my own applications, I'm managing residency at the application level, where I switch textures to lower resolutions when memory is tight. Rejection of a buffer allocation request is a normal event which results in LOD reductions. For the Vulkan case, this substitutes for driver initiated eviction.

I've mostly looked at the Vulkan case, where VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT is available and one big array of descriptors is possible. When everything is resident, that leads to simple implementations which don't require barriers during drawing, while allowing concurrent updating. I've written some design materials on that. The basic concept is that the wrapper level (WGPU) takes ownership of a buffer when it is in the descriptor table, which protects it from being deleted while rendering is in progress.

The cram job required to run on smaller targets makes things far more complicated. Driver-initiated eviction really complicates things. Can that be handled without crippling performance on the more powerful targets?

@cwfitzgerald
Member

To clarify, none of these APIs deal with residency at all. The assumption is that all resources bound to a bind group are always resident. Yes, we need to prevent Metal from making resources non-resident, but with residency sets this should be easy.

When everything is resident, that leads to simple implementations which don't require barriers during drawing, while allowing concurrent updating

Barrier generation has basically nothing to do with residency. The purpose of the buffer.set_usages API is to allow wgpu to avoid checking whether every single resource needs a barrier. Resources with only read-only usages can be ignored and just bound indirectly as part of the binding array.

@John-Nagle

OK, if everything is resident, things are simpler.

So, as you see bindless, what has to be done on a per-draw basis? That's where the overhead comes from. Is there checking that has to be done at each draw, or can the checking be hoisted to once per frame, or once per texture change, or to compile time?

@John-Nagle

If an allocator is needed for descriptor slots, feel free to use this one I wrote. It's lock-free.
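For readers unfamiliar with the idea, a descriptor-slot allocator hands out indices into the big descriptor array and recycles them when a slot is vacated. The following is a minimal sketch of that interface (hypothetical names; John's linked crate is lock-free, whereas this illustration uses a mutex-guarded free list for brevity):

```rust
use std::sync::Mutex;

/// Sketch of a descriptor-slot allocator: hands out indices into a
/// fixed-capacity descriptor array and recycles freed ones.
pub struct SlotAllocator {
    capacity: u32,
    state: Mutex<SlotState>,
}

struct SlotState {
    next_fresh: u32,     // high-water mark: slots never handed out yet
    free_list: Vec<u32>, // slots returned by `free`, reused LIFO
}

impl SlotAllocator {
    pub fn new(capacity: u32) -> Self {
        Self {
            capacity,
            state: Mutex::new(SlotState { next_fresh: 0, free_list: Vec::new() }),
        }
    }

    /// Returns a slot index, or None if the descriptor table is full.
    pub fn alloc(&self) -> Option<u32> {
        let mut s = self.state.lock().unwrap();
        if let Some(slot) = s.free_list.pop() {
            return Some(slot);
        }
        if s.next_fresh < self.capacity {
            let slot = s.next_fresh;
            s.next_fresh += 1;
            Some(slot)
        } else {
            None
        }
    }

    /// Returns a slot to the pool. The caller must ensure the GPU is no
    /// longer reading the descriptor that occupied this slot.
    pub fn free(&self, slot: u32) {
        self.state.lock().unwrap().free_list.push(slot);
    }
}
```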

@cwfitzgerald
Member

If an allocator is needed for descriptor slots, feel free to use this one I wrote. It's lock-free.

This would be delegated to the user, but it's good to have in case a user needs it.

@John-Nagle

John-Nagle commented Dec 13, 2024

Ah. You may not want to delegate that to the user. WGPU already has a buffer allocator, and descriptor slots and buffers need to be closely coordinated. I've been looking at designs that work something like this:

  • Application/renderer layer requests a descriptor slot. They get back an opaque handle.
  • Application/renderer layer requests a texture buffer with data. Buffer is created and filled.
  • Application/renderer layer attaches the buffer to the descriptor slot handle. Descriptor manager takes ownership of the buffer, preventing it from being changed or deleted by the application. This queues insertion into the descriptor array.
  • Dropping the descriptor slot handle queues deletion from the descriptor array.
  • All descriptor array updates are processed at a point in the render cycle where the GPU is not rendering.

The idea is to use Rust ownership to manage most of the interlocking. If you let the application mess with the descriptor array, you need more checking in the lower layers. More machinery to implement.

Bindless mode exists to improve performance by spending less time doing binding. It's only useful if it provides a big reduction in binding overhead.
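A minimal sketch of that ownership scheme, using stand-in types (`DescriptorManager`, `SlotHandle`, and a byte vector in place of a real texture; none of this is wgpu API):

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the wrapper-level descriptor manager described above.
struct DescriptorManager {
    // Slots queued for removal, drained at a point in the render cycle
    // where the GPU is not rendering.
    pending_removals: Mutex<Vec<u32>>,
}

/// Opaque handle to a descriptor slot. While it lives, the attached
/// resource is owned via the handle and cannot be mutated or deleted by
/// the application; dropping the handle only *queues* removal.
struct SlotHandle {
    slot: u32,
    resource: Option<Arc<Vec<u8>>>, // stand-in for a texture buffer
    manager: Arc<DescriptorManager>,
}

impl SlotHandle {
    /// Attach a filled buffer to this slot. Taking the Arc is what
    /// "takes ownership" means: the application retains no mutable
    /// access to the data while it sits in the descriptor array.
    fn attach(&mut self, resource: Arc<Vec<u8>>) {
        self.resource = Some(resource);
    }
}

impl Drop for SlotHandle {
    fn drop(&mut self) {
        // Queue deletion from the descriptor array; the real design
        // would keep the resource alive until this queue is processed
        // at a GPU-idle point.
        self.manager.pending_removals.lock().unwrap().push(self.slot);
    }
}
```

The point of the sketch is that the interlocking falls out of Rust ownership: the application cannot reach the buffer while it is bound, and removal is deferred to a safe point rather than happening at an arbitrary time.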

Comments?

@cwfitzgerald cwfitzgerald added the feature: bindless Issues with Bindless Native Feature label Dec 14, 2024
@magcius
Contributor

magcius commented Dec 15, 2024

So, as you see bindless, what has to be done on a per-draw basis? That's where the overhead comes from. Is there checking that has to be done at each draw, or can the checking be hoisted to once per frame, or once per texture change, or to compile time?

Nothing needs to be done on a per-draw basis (unless you count the use of our indirection buffers). We designed it so that if the user uses the API properly (that is, marking resources as read-only when the application will treat them as read-only), there's very little cost to the validation. Assuming we don't find a flaw in our plan, that is :)

For read-write bindless resources, there is some validation that needs to be done at set_bind_group time, and some barriers might need to be emitted at submit time, but we expect the number of read-write resources to be relatively small.

Ah. You may not want to delegate that to the user.

However, we discussed several ideas in the WebGPU WG, and currently believe that leaving bind group indices to the user strikes a nice balance between user flexibility and performance.

The idea is to use Rust ownership to manage most of the interlocking. If you let the application mess with the descriptor array, you need more checking in the lower layers. More machinery to implement.

Your library sounds like a great utility library that we might want to recommend to users of wgpu in Rust!

Note that the bindless proposal we have provided above is perfectly safe and requires no synchronization with the GPU; shadow copies are made under the hood when updating bind groups.

That said, if we think the cost of mutating bind groups through the shadow copy remains too high, we might go the other way and handle slot allocation for the user. Whatever happens, we want to make sure that wgpu remains aligned with the WebGPU WG and specification.

@John-Nagle

if the user uses the API properly

Where is the new bindless API documented? Even if it's not working, I'd like to see how the API is supposed to be used.

@magcius
Contributor

magcius commented Dec 15, 2024

The two proposals (https://hackmd.io/PCwnjLyVSqmLfTRSqH0viA and https://hackmd.io/@cwfitzgerald/wgpu-bindless) should paint a somewhat complete picture of the new API, though you might have to read between the lines a bit. It's meant for implementers.

@John-Nagle

Bind groups are updated on the CPU timeline and update "immediately". ... What this means is that all previous uses of the bind group continue to use the old contents and any new usages use the updated contents.

So the descriptor table is double-buffered? Reasonable. I was thinking in terms of an update queue applied at end of frame, but that's functionally equivalent.

"The CPU timeline" concept needs to be clarified for multi-threaded programs. There are potential locking bottlenecks; as noted, this is weird for multi-threaded code.

Note that this means that every update_bindings call will require us to make a shadow copy of all descriptors in the bind group, and associated tracking data. While we don't expect update_bindings to require a lot of memory compared to buffers and textures, it is still not a cheap operation.

Right. That's why I was thinking in terms of an update queue. Number of changes per frame is probably < 100. Number of descriptors is on the order of 100,000 for a complex scene. But either way will work.

By allowing a resource to be shifted to a read-only state, we let the tracking systems only worry about the resource being alive, not their state. This allows bindless arrays to be bound with very low costs.

Right. Most content is read-only. Read-write content is mostly rendering intermediates.

What prevents freeing a bound buffer? Something has to interlock against that. Will that be an error or an operation that is deferred until it is safe?

Is a shader accessing an unused descriptor slot a problem? It's tempting to initialize all the unused descriptor slots to point to a purple error texture. Something has to check for out-of-range indices in shaders, but if unused slots are harmless, there's no need to check for a valid descriptor.

This can work. Thanks.

@cwfitzgerald
Member

"The CPU timeline" concept needs to be clarified for multi-thread programs. There are potential locking bottlenecks. As noted, this is weird for multi-thread programs.

There's still a single timeline that all threads experience - that is, every thread sees either the state from before the update or the state after it. This is the same for all multi-threaded methods in wgpu.

Right. That's why I was thinking in terms of an update queue. Number of changes per frame is probably < 100. Number of descriptors is on the order of 100,000 for a complex scene.

It's hard to avoid shadow copying something - this could be optimized to as little as 4 bytes per descriptor though depending on implementation strategy. This will need to be driven by profiling.

What prevents freeing a bound buffer? Something has to interlock against that. Will that be an error or an operation that is deferred until it is safe?

The same infrastructure that prevents it today. The bind group owns the texture (as in, it actually just holds an Arc), so as long as we keep the command encoder alive until the GPU is done using it, the texture will stay alive. This is how it happens today.
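The lifetime argument above can be sketched with stand-in types (these are not wgpu's actual structs, just an illustration of the Arc chain):

```rust
use std::sync::Arc;

// Stand-ins: the bind group holds an Arc to the texture, and an
// in-flight command encoder holds an Arc to the bind group. "Freeing"
// the application's handle therefore cannot deallocate the texture
// while the GPU may still be using it.
#[allow(dead_code)]
struct Texture(Vec<u8>);

struct BindGroup {
    texture: Arc<Texture>,
}

struct CommandEncoder {
    bind_group: Arc<BindGroup>,
}
```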

Is a shader accessing an unused descriptor slot a problem?

It will return unspecified (NOT undefined) results, but is valid.

Something has to check for out of range indices in shaders

Yes, we will check against a metadata buffer; see the implementation notes in the spec.

@John-Nagle

It's hard to avoid shadow copying something - this could be optimized to as little as 4 bytes per descriptor though depending on implementation strategy. This will need to be driven by profiling.

Right. All binding updates commit at that copy, which allows for more optimization vs. bind-per-draw. Sounds good.

(Referencing an unused descriptor slot) will return unspecified (NOT undefined) results, but is valid.

I'd suggest mapping unused texture descriptor slots to some built-in error texture, such as the purple often used for areas where nothing was drawn. Then errors become obvious. Mapping to a null handle means nothing is drawn, which is harder to debug.
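Producing such a fallback is cheap. As a sketch, the CPU-side pixel data for a 1x1 magenta error texture is just four bytes; in wgpu it could then be uploaded with `Queue::write_texture` into an `Rgba8UnormSrgb` texture, and every unused slot pointed at its view (the function name here is hypothetical):

```rust
/// Pixel data for a 1x1 RGBA "error texture" in the magenta/purple
/// conventionally used for missing or unmapped content.
fn error_texture_data() -> Vec<u8> {
    const MAGENTA: [u8; 4] = [255, 0, 255, 255]; // R, G, B, A
    MAGENTA.to_vec()
}
```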
