Automatically resolve initial and final action for draw lists. #98670
I have a small complaint with the naming convention (and possibly with its implementation in the graph, which I didn't look at in detail): right now you're using Transient and Discardable.
Transient
It is a texture that can have no memory backing. It lives on cache alone. After the GPU is done using this "texture" in a single pass, the contents are lost. This means load actions can only be CLEAR or DONT_CARE.
Discardable
It is a texture that has memory backing. After the GPU is done using this "texture", its contents may be preserved with a STORE action. A discardable texture may still temporarily need its contents within the same frame; however, that content doesn't need to be preserved for the next frame.
Consider the following scenario for MSAA RenderTargetA and RenderTargetB. We mark RenderTargetA as discardable. For the sake of example, we don't care about the details of RenderTargetB.
RenderTargetA is discardable because:
However, RenderTargetA cannot be transient, because its contents (drawn in step 2) are read in step 3. Only the contents from step 4 can be discarded.
The distinction between discardable and transient becomes important once I submit the PR for transient GPU memory backing; without it we could run into bugs where the contents are needed within the frame. We also limit ourselves for the future, if we want to apply further optimizations, because is_transient is a bool instead of an enum. Quite possibly it should be an enum instead:
enum TextureTrascience
{
    NORMAL,
    DISCARDABLE,
    TRANSIENT
}; |
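As a rough sketch of the distinction drawn in the comment above, the three states could constrain load and store actions as follows. All names here (`TextureTransience`, `LoadAction`, `StoreAction` and the two helper functions) are hypothetical stand-ins, not the engine's API. The store rule for transient textures is an inference from "the contents are lost" rather than something the comment states outright.

```cpp
// Hypothetical names for illustration only; not the engine's actual API.
enum class TextureTransience { NORMAL, DISCARDABLE, TRANSIENT };
enum class LoadAction { LOAD, CLEAR, DONT_CARE };
enum class StoreAction { STORE, DONT_CARE };

// A transient texture may have no memory backing, so there is nothing to load
// from: only CLEAR or DONT_CARE make sense for it.
constexpr bool is_load_action_valid(TextureTransience t, LoadAction a) {
    return t != TextureTransience::TRANSIENT || a != LoadAction::LOAD;
}

// Assumption: a transient texture can never be stored either, while a
// discardable one may still use STORE when a later pass in the frame reads it.
constexpr bool is_store_action_valid(TextureTransience t, StoreAction a) {
    return t != TextureTransience::TRANSIENT || a != StoreAction::STORE;
}

// Discardable textures keep every action available within a frame.
static_assert(is_load_action_valid(TextureTransience::DISCARDABLE, LoadAction::LOAD), "");
static_assert(is_store_action_valid(TextureTransience::DISCARDABLE, StoreAction::STORE), "");
// Transient textures reject LOAD and STORE.
static_assert(!is_load_action_valid(TextureTransience::TRANSIENT, LoadAction::LOAD), "");
static_assert(!is_store_action_valid(TextureTransience::TRANSIENT, StoreAction::STORE), "");
```

Under these assumptions a plain bool cannot tell the second and third states apart, which is the limitation the comment points at.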
@darksylinc I'm sort of in agreement with your idea, as this was something I discussed with Clay: we might need to split the new state to have more detail. Let's see if we can work out a design that fits the bill better. Notice that in the current state of the PR, actions are not fixed to be don't care or store based on the transient flag. If another step makes use of the texture in the frame, the transient flag will not stop it from using store in a previous render pass to ensure the frame doesn't break. The only promise the transient flag is supposed to break is that the contents will be preserved between frames. This is all automatically detected by the graph at the moment. Ideally the system should give control over the following details, as well as support the future plans from The Forge's PR to support actual transient textures. Whether all of this is necessary, or whether it exposes too much complexity, is open for discussion; it can be simplified if necessary.
On top of all I said, an actually transient texture (no memory backing) would automatically cover all of these properties, so this level of detail would be ignored for it. Therefore we can probably think of everything I said to be a mutually exclusive state with the no memory backing flag. An enum is sounding like a good candidate for this sort of thing. One technical question: how is an image specified to have no memory backing? Is that property assigned during creation? If so, that may limit our plans to make the transient property accessible through a setter and getter, although it's still up for debate whether we want those methods or not if it makes things harder. |
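The automatic detection described in this comment can be pictured with a minimal sketch like the one below. It assumes a frame is just an ordered list of passes touching one texture. `PassUsage`, `FinalAction` and `resolve_final_action` are hypothetical names, and the real render graph tracks far more state than this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; not the render graph's real types.
enum class FinalAction { STORE, DISCARD };

struct PassUsage {
    bool reads_texture; // True if the pass needs the texture's previous contents.
};

// Picks the final action for the pass at `pass_index` within a single frame.
// The transient flag only waives preservation between frames: if any later
// pass in the same frame reads the texture, the contents are still stored.
// This is conservative, since it ignores passes that fully overwrite the texture.
FinalAction resolve_final_action(const std::vector<PassUsage> &frame, std::size_t pass_index, bool is_transient) {
    for (std::size_t i = pass_index + 1; i < frame.size(); i++) {
        if (frame[i].reads_texture) {
            return FinalAction::STORE;
        }
    }
    return is_transient ? FinalAction::DISCARD : FinalAction::STORE;
}

// Example: pass 0 draws, pass 1 reads the result, pass 2 draws again and
// nothing in the frame reads it afterwards.
inline void example_frame() {
    const std::vector<PassUsage> frame = { { false }, { true }, { false } };
    assert(resolve_final_action(frame, 0, true) == FinalAction::STORE);
    assert(resolve_final_action(frame, 2, true) == FinalAction::DISCARD);
}
```

In this sketch the flag never forces a discard on its own, which matches the statement that it only drops the promise of preserving contents between frames.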
Yes, during creation. The VkImage creation procedure follows two changes:
That's it. So if you want to change this feature via setter, you'd have to recreate the VkImage (and all its associated VkImageViews and all descriptors where it is in use). |
That seems like it'd firmly settle us in the camp that we need this attribute to be set at creation time only. I think I'd push for the set/get to not exist then. It'd be easier for us to handle in the long term, and project settings that affect it can be handled by just re-creating the texture IMO. |
Here's my proposal for a solution:
 enum TextureUsageBits {
     TEXTURE_USAGE_SAMPLING_BIT = (1 << 0),
     TEXTURE_USAGE_COLOR_ATTACHMENT_BIT = (1 << 1),
     TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT = (1 << 2),
     TEXTURE_USAGE_STORAGE_BIT = (1 << 3),
     TEXTURE_USAGE_STORAGE_ATOMIC_BIT = (1 << 4),
     TEXTURE_USAGE_CPU_READ_BIT = (1 << 5),
     TEXTURE_USAGE_CAN_UPDATE_BIT = (1 << 6),
     TEXTURE_USAGE_CAN_COPY_FROM_BIT = (1 << 7),
     TEXTURE_USAGE_CAN_COPY_TO_BIT = (1 << 8),
     TEXTURE_USAGE_INPUT_ATTACHMENT_BIT = (1 << 9),
     TEXTURE_USAGE_VRS_ATTACHMENT_BIT = (1 << 10),
+    TEXTURE_USAGE_TRANSIENT_ATTACHMENT_BIT = (1 << 11),
 };
+enum TextureBehavior {
+    TEXTURE_BEHAVIOR_NORMAL,
+    TEXTURE_BEHAVIOR_DISCARD_BETWEEN_FRAMES,
+    TEXTURE_BEHAVIOR_DISCARD_ALWAYS
+};
 struct TextureFormat {
     DataFormat format = DATA_FORMAT_R8_UNORM;
     uint32_t width = 1;
     uint32_t height = 1;
     uint32_t depth = 1;
     uint32_t array_layers = 1;
     uint32_t mipmaps = 1;
     TextureType texture_type = TEXTURE_TYPE_2D;
     TextureSamples samples = TEXTURE_SAMPLES_1;
     uint32_t usage_bits = 0;
     Vector<DataFormat> shareable_formats;
+    TextureBehavior load_behavior = TEXTURE_BEHAVIOR_NORMAL;
+    TextureBehavior store_behavior = TEXTURE_BEHAVIOR_NORMAL;
     bool is_resolve_buffer = false;
 };
My reasoning behind this is:
If we're good with this solution I'll go ahead and implement it. |
I've discussed the above with Dario a bit and I agree with the comments made by Matias. The naming That being said VK transient textures are kind of an orthogonal concept to this. As both of you point out above:
Importantly, Vulkan transient textures need a special flag when they are created and cannot be updated at run time while this PR's transient texture is just a hint to the render graph. I suggest the following:
I prefer this approach as it leaves all the control in the hands of the render graph. |
I believe this will keep happening. I suggest we just do what other apps do:
All of them exposed to GDScript. And the newer versions try to patch the older and emit a deprecation warning. |
What if we were to switch to a structure as the argument instead? |
If that's the expectation, we really ought to flag
Yeah that's an option. We've never done that in the scripting API though, so this is the kind of thing that would invite significant bikeshedding before introducing a new deprecation policy to the codebase. Also for context, with this approach we'd be at We do use this approach in the GDExtension interface, which is pure C and where we really try not to break the ABI. But at that level we also have better options to preserve compatibility (including for the change with this PR) with the
That would be ideal, but here again we're limited by GDScript, which doesn't yet support structs. There's work in progress to implement structs in GDScript though, but I'm not sure it will make it in 4.4. And we'd break API compatibility anyway if we change parameters to a struct (but possibly for the last time). Overall I'd prefer not to open a can of worms regarding starting to use numbered deprecated methods in the scripting API, and just accept another compat breakage for this method that should really be flagged as experimental, so users know more breakage can happen. This will need to be documented in the 4.3 to 4.4 migration guide. |
I thought it was flagged as experimental already! I guess it should be. We will likely need more changes to accommodate WebGPU, and likely there are a few more changes that might happen as we move more stuff to the ARG |
I've been thinking a little bit: We're exposing This may be a mistake, I'll give an example:
enum LoadMode
{
    Clear,
    IgnoreDontCare,
    Keep
};
struct RenderTargetClearSetting
{
    LoadMode loadMode;
    Color clearColor;
};
draw_list_begin( Vector<RenderTargetClearSetting> color_settings );
Such an API may not be the most performant for internal use, but it is a better design for user-facing APIs. We may have to split it into what we expose to GDScript and the internal function (the exposed version just demangles everything and calls the internal version). |
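The split suggested above (a friendly exposed version that unpacks everything and calls a leaner internal one) might look roughly like this. The internal representation is an assumption: `InternalDrawArgs`, its bitmasks, the 32-attachment cap and `unpack_color_settings` are all hypothetical, and `Color` is a local stand-in for the engine type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in for the engine's Color type.
struct Color {
    float r, g, b, a;
};

enum class LoadMode { Clear, IgnoreDontCare, Keep };

struct RenderTargetClearSetting {
    LoadMode loadMode;
    Color clearColor;
};

// Hypothetical internal form: one bit per color attachment plus a color list.
struct InternalDrawArgs {
    std::uint32_t clear_mask = 0;    // Bit i set: attachment i is cleared.
    std::uint32_t ignore_mask = 0;   // Bit i set: previous contents of attachment i are ignored.
    std::vector<Color> clear_colors; // One entry per attachment; only read where the clear bit is set.
};

// The user-facing settings are flattened into the compact internal arguments.
InternalDrawArgs unpack_color_settings(const std::vector<RenderTargetClearSetting> &settings) {
    InternalDrawArgs args;
    // Assumption: at most 32 attachments fit in the masks; extra entries are dropped.
    const std::size_t count = std::min<std::size_t>(settings.size(), 32);
    args.clear_colors.resize(count);
    for (std::size_t i = 0; i < count; i++) {
        switch (settings[i].loadMode) {
            case LoadMode::Clear:
                args.clear_mask |= std::uint32_t(1) << i;
                args.clear_colors[i] = settings[i].clearColor;
                break;
            case LoadMode::IgnoreDontCare:
                args.ignore_mask |= std::uint32_t(1) << i;
                break;
            case LoadMode::Keep:
                break; // Neither bit set: existing contents are loaded.
        }
    }
    return args;
}

// Example: attachment 0 is cleared to magenta, attachment 1 keeps its contents.
inline void example_unpack() {
    const std::vector<RenderTargetClearSetting> settings = {
        { LoadMode::Clear, { 1.0f, 0.0f, 1.0f, 1.0f } },
        { LoadMode::Keep, { 0.0f, 0.0f, 0.0f, 0.0f } },
    };
    const InternalDrawArgs args = unpack_color_settings(settings);
    assert(args.clear_mask == 0x1u);
    assert(args.ignore_mask == 0x0u);
}
```

The conversion cost is paid once per draw list in the exposed wrapper, so internal callers could keep passing the compact form directly.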
@darksylinc That is actually very close to the first API that Dario and I designed. However there were a few problems:
|
But does it have maps? Like
I know gdscript isn't python, but Python does have C-like structs.
Huh? I think you have the wrong API in mind. The proper API would be:
colours[0].mode = CLEAR;
colours[0].clear_color = [1, 0, 1, 1];
depth.mode = CLEAR;
depth.clear_value = 1.0;
stencil.mode = CLEAR;
stencil.clear_value = 0xFF;
draw_list_begin( colours, depth, stencil );
I believe this is pretty hard to screw up. If you put colour and depth together in one single array, there is a lot of room for mistakes, and some of it doesn't make sense (a clear colour needs 4 values but depth only 1, and for stencil it is an int). |
I see, so we would have two different classes to avoid that issue. I think I need to clarify: we went with the bitmask not because it was efficient but because it matched other user-facing APIs (like ArrayMesh and SurfaceTool). We wanted to be consistent with what users already know. The bitmask approach is really nice because the default behaviour is typically exactly what users would expect, so with relatively little effort you can get the function working correctly. Exposing classes that need to be instantiated (or, alternatively, maps) adds a lot of friction. The error cases can easily be validated in either case too, so I don't see much value in forcing users to be explicit (which is a fine design goal, it's just not a design goal for us). |
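For comparison, a bitmask in the spirit of this comment could be sketched as below. The flag names, bit positions and the eight-attachment limit are assumptions made for illustration and are not taken from the PR. The zero default is meant to mirror the point that a caller who passes nothing special still gets sensible behaviour.

```cpp
#include <cstdint>

// Hypothetical flag layout; names and bit positions are assumptions.
enum DrawFlagsSketch : std::uint32_t {
    DRAW_KEEP_ALL = 0,            // Default: every attachment keeps its contents.
    DRAW_CLEAR_COLOR_0 = 1u << 0, // Clear color attachment 0; attachment N uses bit N.
    DRAW_CLEAR_COLOR_ALL = 0xFFu, // Clear all eight color attachments.
    DRAW_CLEAR_DEPTH = 1u << 8,
    DRAW_CLEAR_STENCIL = 1u << 9,
};

constexpr std::uint32_t MAX_COLOR_ATTACHMENTS = 8;

// Builds the clear flag for one color attachment (index must be below 8).
constexpr std::uint32_t clear_color_flag(std::uint32_t index) {
    return static_cast<std::uint32_t>(DRAW_CLEAR_COLOR_0) << index;
}

// Reports whether the given flags clear the color attachment at `index`.
constexpr bool clears_color(std::uint32_t flags, std::uint32_t index) {
    return index < MAX_COLOR_ATTACHMENTS && (flags & clear_color_flag(index)) != 0;
}

// A caller that only wants a fresh first target and depth buffer combines two flags.
constexpr std::uint32_t EXAMPLE_FLAGS = clear_color_flag(0) | DRAW_CLEAR_DEPTH;

static_assert(clears_color(EXAMPLE_FLAGS, 0), "attachment 0 is cleared");
static_assert(!clears_color(EXAMPLE_FLAGS, 1), "attachment 1 is kept");
static_assert(!clears_color(DRAW_KEEP_ALL, 0), "the default clears nothing");
```

The trade-off discussed in the thread shows up here: the call site stays short, but per-attachment clear values have to travel in separate arguments.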
Will need to be rebased before this can be merged |
@clayjohn It's not a MacOS-specific problem, it's broken everywhere all the same. I'll take this out of the merge queue in the meantime. |
Should be fixed now. |
Thanks! |
Background
RenderingDevice currently requires the user to specify on each draw list whether it should clear, ignore, write or discard the contents of the color and depth attachments. However, the current API does not offer us enough granularity to specify the behavior for each particular attachment, which means we're potentially losing out on performance improvements from discarding information we don't need, especially in TBDR architectures like mobile devices that can keep the contents of each tile on-chip without having to write the intermediate steps back to memory.
With the addition of the Acyclic Render Graph, it's now possible to automatically determine the required behavior for these targets in an optimal way, meaning we can get both performance improvements and simplify the API in the process.
Improvements
RenderingDevice::draw_list_begin() has been simplified and no longer requires specifying the initial and final action for attachments. Instead, extra granularity has been provided to specify whether each individual color attachment is cleared and whether the depth/stencil component needs to be cleared as well.
RenderingDevice will automatically create render passes based on the configuration determined by the render graph after command reordering is finished. In other words, the creation of render passes has been deferred to a later step instead of being done when the draw list is created.
RenderingDevice::TextureFormat includes a new attribute called is_transient. It is possible to change it with RenderingDevice::texture_set_transient and check its current status with RenderingDevice::texture_is_transient.
Compatibility breakage
Compatibility wrappers have been provided to use the old versions of draw_list_begin. Nothing is expected to break. However, it might be necessary for users to specify that textures are transient if they happen to lose performance due to the FinalAction arguments being ignored.
Performance improvements
We've found large performance improvements in scenarios where the renderer is bottle-necked by the memory bandwidth available. The configuration of the render passes is crucial to unlocking performance when faced with this problem. Most of these improvements can be found with the mobile renderer. While it is possible for Forward+ to show improvements on desktop, you should not expect as much of an improvement as the mobile renderer with TBDR GPUs can achieve.
For example, a completely blank scene that uses 200% resolution scale with MSAA 4X on a Mali-G715 has improved from 52 FPS to 120 FPS (Vsync limit) just from being able to discard any writes to MSAA buffers, which largely go unused as the resolve is performed as a subpass in the main render pass. This is an interesting benchmark for VR in particular, which relies on very big render targets and MSAA to achieve good image quality on a headset.
The improvement can be largely attributed to a major reduction of the bandwidth required per frame:
[Screenshots: per-frame bandwidth profile on master vs. rd-transient-targets]
Project: mobile-msaa.zip
Profiles: MSAA-Profiles.zip
Does that mean you'll automatically get large improvements like these on every project?
The answer is no. If the project is limited by some other component, this PR won't necessarily give you an FPS improvement outright. The TPS Demo as found on this PR hasn't shown any FPS improvements whatsoever, but a very similar bandwidth reduction shows up. While it may not result in an improvement right now, it is likely this PR will unlock a bigger performance boost on the demo once the other bottlenecks have been resolved.
[Screenshots: TPS Demo bandwidth profile on master vs. rd-transient-targets]
Profiles: TPSDemo-Profiles.zip
TODO
Contributed by W4 Games. 🍀