-
-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for enhanced barriers in D3D12. #91769
Add support for enhanced barriers in D3D12. #91769
Conversation
2a1b55f
to
ec3ef0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly, this is a somewhat naive review because it's hard to follow everything going on here properly in detail. As such, besides witnessing the code changes look solid and sensible, I can only raise a few nitpicks:
driver_clear_resources_with_views
and the other bool member above could become bitfields.- Also for
driver_clear_resources_with_views
, I'd suggest a rename to the more compactdriver_uses_views_for_clears
. Or, alternatively, inverting its meaning and naming itdriver_clears_via_copy_engine
, or something similar. (The rename would carry over toAPI_TRAIT_CLEAR_RESOURCES_WITH_VIEWS
.) - I've spotted a few pieces of code style that deviate from the latest best or estabilished practices in the codebase. That was already the case in some of these files, so I think we'd rather do a style-fixing pass later. They are things such as uninitialized local variables and the omission of the (superfluous, I know)
public
keyword in the specification of inheritance across structs. - Finally, there's the
thread_local LocalVector
technique. That's kind of a saferalloca()
, isn't it? I guess we should discuss rules about when to prefer that over theALLOCA_ARRAY()
utility macro that drivers use, which may mean we let the drivers do the same, but in any case come up with something consistent.
Admittedly the purpose of these is just to store one flag and be consistent throughout the execution of the graph without having to constantly call the api_trait_get and go through the cost of using the virtual function. I'm not really very much for or against the approach chosen here (the alternative being just querying the API or using the bitfield), but I went with whatever requires the least amount of knowledge from the reader. The graph is only created once per RenderingDevice so there's not much in the way of memory savings to be found.
This I agree with. It'd be best to change the trait to something that represents more clearly that Vulkan is capable of doing this. It might be more common that we find APIs that don't.
Yes, the usage of thread_local is very deliberate across this and my other PRs as it's an effective way to avoid using stack memory and reusing an already allocated vector across calls while also allowing the method to remain safe when using multi-threading. It's important to note This will be a pretty common pattern among some other optimization PRs that will be coming later. Considering there's no upper bound to some of these calls, relying on stack memory could become a problem at some point that'd leave no other possible alternative as a fix. |
I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why missing the opportunity of packing a few similarly-purposed booleans together? |
It's not something I gave that much thought to be honest as it wouldn't make much of a difference and from my POV it seemed easier to understand. I have no reason to argue for or against it as there's no measurable performance impact. I'll go with whatever preference the project wants. |
The technique is cool. I'd only like to raise the need of discussing its usage at some point in the future. Lately, I've been trying to remove or minimize runtime allocations in some PRs, and I don't quite feel good about we adding them to other area of the engine. I believe you are more experienced than me regarding the real impact of them in this context, so I'm all ears (or eyes in this context). My point is that something as thin as the rendering driver shouldn't be the culprit of a spike in frame time because it happened to need to invoke the allocator. I have an idea (again not for now, not for this PR): we can use something similar to your technique, that would hit the sweet spot between UPDATE: It doesn't really need to be per-frame, since it's throwaway data, but the idea is mostly the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me to merge for 4.3. It should have a pretty small impact outside of D3D12 and will be a huge benefit for D3D12 users (it brings performance in line with the Vulkan backend).
Approving from my side, but it would be good for @RandomShaper to confirm that he is happy with the details before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style looks good!
I'm not sure my feedback (the directly actionable part) has been addressed yet. Also, this may be better rebased on top of #91416, which already includes the upgrade of the Agility SDK to 613, but in a more comprehensive manner. |
Would you say this covers the copy engine suggestion and the bitfields? I feel the arena allocator would be out of scope for this PR until we get a common class we can reuse to adopt that strategy elsewhere. |
Upon trying the suggestion, one annoying thing I realized is that bitfield declaration does not allow us to give it an initialization value on the class itself and instead requires to do a union to do so. Do we feel it's worth doing this: union {
struct {
bool driver_honors_barriers : 1;
bool driver_clear_resources_with_views : 1;
}
bool driver_all = false;
}; Instead of just this? bool driver_honors_barriers = false;
bool driver_clear_resources_with_views = false; The other alternative is just doing the initialization withing the class' initializer manually, but that breaks a bit of the style of the rest of the class where you can just read it directly to find out what the defaults are. |
cc5d782
to
97ddfd1
Compare
I've added the rename to use copy engine. We should decide about the booleans vs the bitfields for the reasons outlined above. |
The lack of bitfield initializers is unfortunate indeed, but the style ship has already sailed in that regard. See this: godot/core/object/worker_thread_pool.h Lines 76 to 99 in f4b047a
So, until we upgrade to the version of C++ that allows in-place initializers for that, I'd suggest embracing that style as the convention. To clarify, yes, that's indeed the last piece of feedback I was talking about. Everything else belongs indeed to future iterations, in any case. |
Enables support for enhanced barriers if available. Gets rid of the implementation of [CROSS_FAMILY_FALLBACK] in the D3D12 driver. The logic has been reimplemented at a higher level in RenderingDevice itself. This fallback is only used if the RenderingDeviceDriver reports the API traits and the capability of sharing texture formats correctly. Aliases created in this way can only be used for sampling: never for writing. In most cases, the formats that do not support sharing do not support unordered access/storage writes in the first place.
97ddfd1
to
adabd14
Compare
I've addressed the bitfield suggestion. |
@@ -245,13 +255,34 @@ class RenderingDevice : public RenderingDeviceCommons { | |||
r.layer_count = layers; | |||
return r; | |||
} | |||
|
|||
TextureFormat texture_format() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Godot the naming convention in cases like this involves adding a get_
prefix. However, since there are already other cases of this other style in the code, I guess a future PR (by maybe @akien-mga or @AThousandShips) can do a uniformity pass.
As this is relatively big and I'm planning a 4.3-beta1 release this week, I'm scheduling to merge this after beta1 and for beta2, so we can hopefully find potential regressions before making the beta2 build. Or at least have a "stable" beta1 baseline to compare with if users report D3D12 or RenderGraph regressions in beta2. |
Thanks! |
Enhanced barriers are a relatively modern feature that wasn't part of the initial D3D12 specification. This feature brings barriers in D3D12 much closer in functionality to Vulkan (with a few caveats this PR works around).
The current D3D12 driver was designed around legacy barriers, and this PR maintains that path as it is. Enhanced barriers are only used if the driver reports them as supported. To enable this, you might need to use the Agility SDK and use it as indicated here. If you wanna double check if this is reported, you can check by verifying the return of this line.
By implementing enhanced barriers, that means the D3D12 driver can finally use the information provided by the recently introduced Acyclic Rendering Graph to group barriers together as much as possible and allow more commands to execute in parallel.
So far this has given me a real performance increase in an NVIDIA GeForce RTX 3090 Ti. A 3D project with about every post processing effect enabled went from ~2.57ms in
master
to ~2.16ms ind3d12_enhanced_barriers
, an improvement very much in line with the performance benefits we saw when the render graph was introduced.The second part of this PR was reworking the fallback that the D3D12 driver had to do into something more generic. When relaxed format casting is not available, the driver has to opt for making copies of textures, transitioning to the proper layouts and replacing them in descriptor sets without the render graph knowing any of that is going on. This fallback was instead re-implemented entirely at the RenderingDevice level, allowing the ARG to reorganize the operations and also get a performance boost. This also allows enhanced barriers to work with this fallback without issue, as it would've been impossible to solve it properly otherwise.
Some small changes had to be made to RenderingDeviceDriver to give more detail about the operations and states the resources are used to comply with D3D12's restrictions. No regressions are expected but it should be tested thoroughly regardless.
Contributed by W4 Games. 🍀