-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use GpuArrayBuffer for MeshUniform #9254
Use GpuArrayBuffer for MeshUniform #9254
Conversation
574516f
to
382737d
Compare
Example |
1 similar comment
Example |
pub(super) fn model(render_device: &RenderDevice, binding: u32) -> BindGroupLayoutEntry { | ||
GpuArrayBuffer::<MeshUniform>::binding_layout( | ||
binding, | ||
ShaderStages::VERTEX_FRAGMENT, | ||
render_device, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess skinning
and weights
can get the same treatment in a future PR right? Since they could use a storage buffer in lieu of a uniform buffer when available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me reformulate: we already use an index and an Uniform
for skins and morphs, but is there any benefit to use a storage buffer when available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can pack all the skinning data into 1 storage buffer, instead of multiple uniform buffers. Less buffers = less rebinds = more performance :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it applies to all buffers of struct data, basically. One needs to plumb in the indices, so perhaps a per-object index has an index or offset into a skinning array or so. I don't know too much about those skins work to know how best to structure the data. Another aspect of having a single buffer is if the data itself repeats with varying counts of things (so things like vertices in vertex buffers, or maybe the number of transforms in a skin?) then some kind of allocator is usually used to manage loading in/out data without having to reallocate buffers all the time and rewrite everything, and handle things of varying sizes.
GpuArrayBuffer::<MeshUniform>::binding_layout( | ||
binding, | ||
ShaderStages::VERTEX_FRAGMENT, | ||
render_device, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future PR, it might be worthwhile to newtype device.limits().max_storage_buffers_per_shader_stage
, and accept that newtype instead of RenderDevice
as argument to GpuArrayBuffer
so that it's clear what data it is using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm. I thought about passing the values explicitly, but then I also thought that maybe other features or limits may impact logic at some point. If it had been more difficult to plumb it in then I would have changed it.
Looks good. (I'll do some testing on my end before dropping an approval) What is the "TODO" item supposed to mean? Do you intend to iterate in future PRs or within this one? |
I'll probably do it in this one. |
Reviewed the PR in it's current state, minus the existing feedback lgtm :) Also good catch on GpuComponentArrayBufferPlugin not using finish(). |
…mic offset When using a uniform buffer with batches of per-object data, this provides a ~18% frame time reduction on the many_cubes -- sphere example in the Opaque3d phase and should benefit alpha mask, prepass, and shadow phases in a similar way.
Example |
Ideally we'd have a good way of benchmarking the impact on prepass and shadows. I may modify the many_cubes example to also support some modes to test the performance of these. |
The RenderApp sub app does not exist when the renderer is disabled so assuming it does is not a good idea.
Example |
@@ -138,7 +138,7 @@ fn fragment( | |||
pbr_input.V = V; | |||
pbr_input.occlusion = occlusion; | |||
|
|||
pbr_input.flags = mesh.flags; | |||
pbr_input.flags = mesh[in.instance_index].flags; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is using in.instance_index
unconditionally, but it's existence depends on VERTEX_OUTPUT_INSTANCE_INDEX
. i think that's probably fine since this frag shader is (currently) specific to standard materials, but it doesn't feel great.
as far as i can tell this is the only use of the instance index in any fragment stage. would it make more sense to push the mesh flags through the MeshVertexOutput
struct directly (and unconditionally)? then we could get rid of the VERTEX_OUTPUT_INSTANCE_INDEX
entirely.
this would not make sense if there are other potential uses for mesh data in the fragment stage but i think there shoudn't be generally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would add cost to the vertex stage / interpolators but I don't know if it's worth the tradeoff.
After this change, people will need to be aware of per-object data and how to access it. Long-term, various indices will be added into the per-object data for materials, animations, and probably more. At least material stuff I expect to be used in fragment stage. The instance index, or some other way of getting the per-object index into the shader, would be used to index into the per-object data array, and then within that type there would be a material index to index into a material data array.
// NOTE: Values increase towards the camera. Front-to-back ordering for opaque means we need a descending sort. | ||
type SortKey = Reverse<FloatOrd>; | ||
type SortKey = (u32, Reverse<FloatOrd>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not for this pr but just wondering, is there any possibility for / value in some kind of spatial sorting before putting the mesh data in the gpu array? i think currently they are entity iterator sorted, but it seems like (at least for static meshes) there'd be some benefit to having a batch be as proximally local as possible and then to sort batches based on the nearest member, or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will happen when the render set reordering and batching is implemented. The reason for reordering the render sets to extract, prepare assets, queue, sort, prepare+batch, render
is to allow the order of draws to be known when preparing the data so that the order can be taken into account.
392408d
to
4c8fba0
Compare
4c8fba0
to
dae52b7
Compare
This seems to be ignoring object transforms when rendering stuff, as everything seems to be rendering in the same place (tested 3d_scene: cube is lower than it should be, 3d_shapes: everything is mashed together instead of being separate, and scene_viewer (with a large custom scene): everything is mashed together, and the object clump jumps around weirdly when moving the camera around.) edit: Works properly on vulkan, broken on dx12. |
I'm getting a 45% speedup on vulkan/linux with your latest set of changes for I'm unable to test the OpenGL backend as I get a panic with |
@nicopap nice! @Elabajaba - oof, ok, I guess this isn't mergeable if it breaks DX12. Could anyone test DX12 on NVIDIA as I think @Elabajaba uses AMD. I will test on a mobile RTX 3080 as soon as I can. |
looks like always index 0 on nvidia/dx12 as well |
The vulkan gains seem big enough that it might be worth forcing dx12 to always draw instead of draw_indexed for now, and try and get it implemented in wgpu for their next release? edit: Not sure how many people are actually using bevy dx12, or what any potential performance losses might be (or if forcing draw instead of draw_indexed would break stuff). |
I'm satisfied with the performance benefit. The blocking issue is DX12 instance_index being broken. |
…d of vertex.instance_index
Dx12 instance_index bugfix
The DX12 issue was fixed by @Elabajaba - thanks for figuring that one out! |
# Objective - Fix shader_material_glsl example ## Solution - Expose the `PER_OBJECT_BUFFER_BATCH_SIZE` shader def through the default `MeshPipeline` specialization. - Make use of it in the `custom_material.vert` shader to access the mesh binding. --- ## Changelog - Added: Exposed the `PER_OBJECT_BUFFER_BATCH_SIZE` shader def through the default `MeshPipeline` specialization to use in custom shaders not using bevy_pbr::mesh_bindings that still want to use the mesh binding in some way.
Objective
Solution
GpuArrayBuffer
forMeshUniform
data to store allMeshUniform
data in arrays within fewer bindingsChangelog
MeshUniform
data is now managed byGpuArrayBuffer
as arrays in buffers that need to be indexed into.Migration Guide
Accessing the
model
member of an individual mesh object's shaderMesh
struct the old way where eachMeshUniform
was stored at its own dynamic offset:The new way where one needs to index into the array of
Mesh
es for the batch:Note that using the instance_index is the default way to pass the per-object index into the shader, but if you wish to do custom rendering approaches you can pass it in however you like.