Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically resolve initial and final action for draw lists. #98670

Merged
merged 1 commit into from
Nov 27, 2024

Conversation

DarioSamo
Copy link
Contributor

@DarioSamo DarioSamo commented Oct 30, 2024

Background

RenderingDevice currently requires the user to specify on each draw list if it should clear, ignore, write or discard the contents of the color and depth attachments. However, the current API does not offer us enough granularity to specify the behavior for each particular attachment, which means we're potentially losing out on performance improvements from discarding information we don't need, especially in TBDR architectures like mobile devices that can keep contents of each tile on-chip without having to write the intermediate steps back to memory.

With the addition of the Acyclic Render Graph, it's now possible to automatically determine the required behavior for these targets in an optimal way, meaning we can get both performance improvements and simplify the API in the process.

Improvements

  • RenderingDevice::draw_list_begin() has been simplified and no longer requires specifying the initial and final action for attachments. Instead, extra granularity has been provided to be able to specify whether each individual color attachment is cleared instead and whether the depth/stencil component also needs to be cleared as well.
  • RenderingDevice will automatically create render passes based on the configuration determined by the render graph after command reordering is finished. In other words, the creation of render passes has been deferred to a later step instead of being done when the draw list is created.
  • RenderingDevice::TextureFormat includes a new attribute called is_transient.
  • It is also possible to specify and change at runtime whether a texture is transient by using RenderingDevice::texture_set_transient and check its current status with RenderingDevice::texture_is_transient.

Compatibility breakage

Compatibility wrappers have been provided to use the old versions of draw_list_begin. Nothing is expected to break. However, it might be necessary for users to specify that textures are transient if they happen to lose performance due to the FinalAction arguments being ignored.

Performance improvements

We've found large performance improvements in scenarios where the renderer is bottle-necked by the memory bandwidth available. The configuration of the render passes is crucial to unlocking performance when faced with this problem. Most of these improvements can be found with the mobile renderer. While it is possible for Forward+ to show improvements on desktop, you should not expect as much of an improvement as the mobile renderer with TBDR GPUs can achieve.

For example, a completely blank scene that uses 200% resolution scale with MSAA 4X on a Mali-G715 has improved from 52 FPS to 120 FPS (Vsync limit) just from being able to discard any writes to MSAA buffers, which largely go unused as the resolve is performed as a subpass in the main render pass. This is an interesting benchmark for VR in particular, which relies on very big render targets and MSAA to achieve good image quality on a headset.

The improvement can be largely attributed to a major reduction of the bandwidth required per frame:

master

image

rd-transient-targets

image

Project: mobile-msaa.zip
Profiles: MSAA-Profiles.zip

Does that mean you'll automatically get large improvements like these on every project?

The answer is no. If the project is limited by some other component, this PR won't necessarily give you an FPS improvement outright. The TPS Demo as found on this PR hasn't shown any FPS improvements whatsoever, but a very similar bandwidth reduction shows up. While it may not result in an improvement right now, it is likely this PR will unlock a bigger performance boost on the demo once the other bottlenecks have been resolved.

master

image

rd-transient-targets

image

Profiles: TPSDemo-Profiles.zip

TODO

  • Sort out CI.
  • Sort out logic of shadowmap atlas being marked as load when it should be don't care as it draws to the entire region.
  • Identify potential improvements in the existing codebase where we can gain even more performance by marking textures as transient.

Contributed by W4 Games. 🍀

@DarioSamo DarioSamo changed the title Automatically resolve Initial and Final action for draw lists. Automatically resolve initial and final action for draw lists. Oct 30, 2024
@DarioSamo DarioSamo force-pushed the rd-transient-targets branch 10 times, most recently from f8bcde9 to 549441e Compare October 30, 2024 17:56
@DarioSamo DarioSamo marked this pull request as ready for review October 30, 2024 17:57
@DarioSamo DarioSamo requested review from a team as code owners October 30, 2024 17:57
@clayjohn clayjohn added this to the 4.4 milestone Oct 31, 2024
@darksylinc
Copy link
Contributor

darksylinc commented Oct 31, 2024

I have a small complaint with the naming convention (& possibly with its implementation in the graph, which I didn't look in detail): Right now you're using bool is_transient however there are two concepts that you are merging:

Transient and Discardable.

Transient

It is a texture that can have no memory backing. It lives on cache alone. After the GPU is done using this "texture" in a single pass, the contents are lost.

This means load actions can only be CLEAR or DONT_CARE.
Store action must be RESOLVE or DONT_CARE.

Discardable

It is a texture that has memory backing. After the GPU is done using this "texture", its contents may be preserved with a STORE action.

However a discardable texture may still temporarily need the contents within the same frame, however that concent doesn't need to be preserved for the next frame.

Consider the following scenario for MSAA RenderTargetA and RenderTargetB. We mark RenderTargetA as discardable. For the sake of example, we don't care about the details of RenderTargetB.

  1. Transition: RenderTargetA from UNKNOWN to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    • We know for certain we can initialLayout = UNKNOWN because we don't care about preserving its contents (that's why it's discardable).
  2. Pass - RenderTargetA
    • LoadAction: DONT_CARE (or CLEAR)
    • StoreAction: STORE
    • Renders scene
  3. Pass - RenderTargetB
    • LoadAction: CLEAR
    • StoreAction: STORE
    • Samples from RenderTargetA
  4. Pass - RenderTargetA
    • LoadAction: DONT_CARE (or CLEAR)
    • StoreAction: MSAA RESOLVE (DONT_CARE for the MSAA surface)

RenderTargetA is discardable because:

  1. We can assume the pass starts as either DONT_CARE or CLEAR (no need for LOAD).
  2. We can assume the transition at the beginning of the frame is always UNKNOWN -> VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL which can simplify tracking.
  3. We can assume the last pass when it's no longer used becomes DONT_CARE.

However RenderTargetA cannot be transient, because its contents (drawn on step 2) are read in step 3. Only the contents from step 4 can be discarded.

The distinction between discardable and transient becomes important once I submit the PR for transient GPU memory backing, we could run into bugs where the contents are needed within the frame.

We also limit ourselves for the future if we want to apply further optimizations because is_transient is a bool instead of an enum.

Quite possible instead of bool is_transient we would need an enum:

enum TextureTrascience
{
   NORMAL,
   DISCARDABLE,
   TRANSIENT
};

@DarioSamo
Copy link
Contributor Author

DarioSamo commented Nov 1, 2024

@darksylinc I'm sort in agreement with your idea as this was something I discussed with Clay that we might need to split the new state to have more detail. Let's see if we can work out a design that fits the bill better.

Notice that in the state of the current PR, actions are not fixed to be don't care or store based on the transient flag. If another step makes use of the texture in the frame, the transient flag will not stop it from using store in a previous render pass to ensure the frame doesn't break. The only promise the transient flag is supposed to break is that the contents will be preserved between frames. This is all automatically detected by the graph at the moment.

Ideally the system should give control over the following details, as well as support the future plans from The Forge's PR to support actual transient textures as well. Whether we think all of this is necessary or not or it exposes too much complexity can be simplified if necessary.

  • Load behavior: One small detail that this redesign doesn't cover is that it's impossible to specify don't care for a non-transient texture, which is an scenario that can be useful if the render pass doesn't clear the texture's contents through the clear color but rather completely overwrites the region using the fragment shader. This is impossible for the graph to detect, so we need some other hints to know we can use don't care.

    • Discardable between frames: This flag would indicate don't care can be used on the first render pass that uses the texture if it doesn't use a clear operation.
    • Always discardable: This flag would indicate don't care should always be used on any render pass if a clear operation is not specified. Even if operation dependency is detected (two draw lists writing to the same attachment in the same frame), don't care would be used as the load operation. A good example of this is the render pass that copies over the omni-light cubemap to the atlas texture: no clear operation is required but the shader always completely overwrites the destination pixels. However, we want to preserve the result across several frames until the omni light is updated again.
  • Store behavior: At the moment this is where the PR actually solves this behavior properly based on detecting operation dependency. We only need to specify here if it's discardable between frames or not, as the render graph can detect on its own whether the result is discardable inside the frame itself.

On top of all I said, an actually transient texture (no memory backing) would automatically cover all of these properties, so this level of detail would be ignored for it. Therefore we can probably think of everything I said to be a mutually exclusive state with the no memory backing flag. An enum is sounding like a good candidate for this sort of thing.

One technical question: how is an image specified to have no memory backing? Is that property assigned during creation? If so, that may limit our plans to make the transient property accessible through a setter and getter, although it's still up for debate whether we want those methods or not if it makes things harder.

@DarioSamo DarioSamo force-pushed the rd-transient-targets branch from 549441e to ca14491 Compare November 1, 2024 12:42
@darksylinc
Copy link
Contributor

One technical question: how is an image specified to have no memory backing? Is that property assigned during creation? If so, that may limit our plans to make the transient property accessible through a setter and getter, although it's still up for debate whether we want those methods or not if it makes things harder.

Yes, during creation.

The VkImage creation procedure follows two changes:

  1. VkImageCreateInfo::usage sets the VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.
  2. VMA allocation is slightly different because you need to use VMA_MEMORY_USAGE_GPU_LAZILY_ALLOCATED.

That's it.

So if you want to change this feature via setter, you'd have to recreate the VkImage (and all its associated VkImageViews and all descriptors where it is in use).

@DarioSamo
Copy link
Contributor Author

Yes, during creation.

That seems like it'd firmly settle us on the camp that we need this attribute to be at creation time only. I think I'd push for the set/get to not exist then. It'd be easier to handle for us in the long term and project settings that affect it can be solved by just re-creating the texture IMO.

@DarioSamo DarioSamo force-pushed the rd-transient-targets branch from ca14491 to d3d884e Compare November 5, 2024 18:15
@DarioSamo
Copy link
Contributor Author

DarioSamo commented Nov 5, 2024

Here's my proposal for a solution:

enum TextureUsageBits {
	TEXTURE_USAGE_SAMPLING_BIT = (1 << 0),
	TEXTURE_USAGE_COLOR_ATTACHMENT_BIT = (1 << 1),
	TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT = (1 << 2),
	TEXTURE_USAGE_STORAGE_BIT = (1 << 3),
	TEXTURE_USAGE_STORAGE_ATOMIC_BIT = (1 << 4),
	TEXTURE_USAGE_CPU_READ_BIT = (1 << 5),
	TEXTURE_USAGE_CAN_UPDATE_BIT = (1 << 6),
	TEXTURE_USAGE_CAN_COPY_FROM_BIT = (1 << 7),
	TEXTURE_USAGE_CAN_COPY_TO_BIT = (1 << 8),
	TEXTURE_USAGE_INPUT_ATTACHMENT_BIT = (1 << 9),
	TEXTURE_USAGE_VRS_ATTACHMENT_BIT = (1 << 10),
+	TEXTURE_USAGE_TRANSIENT_ATTACHMENT_BIT = (1 << 11),
};

+enum TextureBehavior {
+	TEXTURE_BEHAVIOR_NORMAL,
+	TEXTURE_BEHAVIOR_DISCARD_BETWEEN_FRAMES,
+	TEXTURE_BEHAVIOR_DISCARD_ALWAYS
+};

struct TextureFormat {
	DataFormat format = DATA_FORMAT_R8_UNORM;
	uint32_t width = 1;
	uint32_t height = 1;
	uint32_t depth = 1;
	uint32_t array_layers = 1;
	uint32_t mipmaps = 1;
	TextureType texture_type = TEXTURE_TYPE_2D;
	TextureSamples samples = TEXTURE_SAMPLES_1;
	uint32_t usage_bits = 0;
	Vector<DataFormat> shareable_formats;
+	TextureBehavior load_behavior = TEXTURE_BEHAVIOR_NORMAL;
+	TextureBehavior store_behavior = TEXTURE_BEHAVIOR_NORMAL;
	bool is_resolve_buffer = false;
};

My reasoning behind this is:

  • We already have similar bits to VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT mapped in the texture usage bits. That lets us specify it pretty transparently during creation that we want to create a texture with no memory backing in a very similar way to Vulkan already. This has less of a learning curve than other methods.
    • Note: I will not implement transient attachments outright, but the bit will be added regardless to leave the door open for when the Forge PR comes in and implements it.
  • Automatically, we can force load and store behavior internally to DISCARD_ALWAYS if VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT is found in the usage bits.
  • Specifying the behavior separately lets us control some potential optimizations to a few particular cases where we don't have enough information about whether the draw lists completely overwrites the contents of the target area or if a write dependency between targets is not detected correctly. By default, most users won't need to care about this, but it opens the door for us to do granular optimizations if necessary without having to go down the route of specifying initial and final action on every draw list.

If we're good with this solution I'll go ahead and implement it.

@clayjohn
Copy link
Member

clayjohn commented Nov 5, 2024

I've discussed the above with Dario a bit and I agree with the comments made by Matias. The naming is_transient is unfortunate as it conflicts with the concept of transient textures which are textures that never have a memory backing in main memory. We intend to support proper transient textures very soon, so its good to get things right in this PR.

That being said VK transient textures are kind of an orthogonal concept to this. As both of you point out above:

  1. Transient in Vulkan terms means there is no backing memory, so you don't load or store the texture. This needs to be texture flag that is set at texture creation time and cannot be changed.
  2. Transient in this PR simply means you won't need the texture to be saved across frames. This is simply a hint to the render graph

Importantly, Vulkan transient textures need a special flag when they are created and cannot be updated at run time while this PR's transient texture is just a hint to the render graph.

I suggest the following:

  1. Rename "transient" in this PR to "discardable" (using the naming suggested by darksylinc)
  2. When we add support for proper transient textures, they will use the TEXTURE_USAGE_TRANSIENT_ATTACHMENT_BIT suggested by Dario
  3. When using a proper transient texture, the render graph can ignore the discardable hint as it knows the texture will never be loaded or stored.

I prefer this approach as it leaves all the control in the hands of the render graph.

@darksylinc
Copy link
Contributor

I believe this will keep happening. draw_list_begin() is a hot function in terms of changes and compatibility.

I suggest we just do what other apps do:

  • draw_list_begin2()
  • draw_list_begin3()
  • etc...

All of them exposed to GDScript. And the newer versions try to patch the older and emit a deprecation warning.

@DarioSamo
Copy link
Contributor Author

Yes... we can't keep changing draw_list_begin twice per release :( We're now going from 9 args to 6 so it's clearly breaking compatibility, not just for extensions but also for GDScript and C#.

What if we were to switch to a structure as the argument instead?

@akien-mga
Copy link
Member

akien-mga commented Nov 19, 2024

I believe this will keep happening. draw_list_begin() is a hot function in terms of changes and compatibility.

If that's the expectation, we really ought to flag draw_list_begin() as experimental - or possibly the whole RenderingDevice class, and backport that metadata change to 4.3.1 so users know that we're going to change the API further. WDYT @clayjohn?

I suggest we just do what other apps do:

  • draw_list_begin2()
  • draw_list_begin3()
  • etc...

All of them exposed to GDScript. And the newer versions try to patch the older and emit a deprecation warning.

Yeah that's an option. We've never done that in the scripting API though, so this is the kind of thing that would invite to significant bikeshedding before introducing a new deprecation policy to the codebase.

Also for context, with this approach we'd be at draw_list_begin4 with this PR, and apparently with further expectation of changes. At some point if we reach double digit numbers and our API reference is full of deprecated APIs, it starts becoming a bit messy for a scripting API that's meant to be high level and end user-friendly.

We do use this approach in the GDExtension interface, which is pure C and where we really try not to break the ABI. But at that level we also have better options to preserve compatibility (including for the change with this PR) with the bind_compatibility_method option, which sadly doesn't apply to GDScript (yet? GDScript would need to support polymorphism for method parameters).

What if we were to switch to a structure as the argument instead?

That would be ideal, but here again we're limited by GDScript, which doesn't yet support structs. There's work in progress to implement structs in GDScript though, but I'm not sure it will make it in 4.4. And we'd break API compatibility anyway if we change parameters to a struct (but possibly for the last time).


Overall I'd prefer not to open a can of worms regarding starting to use numbered deprecated methods in the scripting API, and just accept another compat breakage for this method that should really be flagged as experimental, so users know more breakage can happen.

This will need to be documented in the 4.3 to 4.4 migration guide.

@clayjohn
Copy link
Member

If that's the expectation, we really ought to flag draw_list_begin() as experimental - or possibly the whole RenderingDevice class, and backport that metadata change to 4.3.1 so users know that we're going to change the API further. WDYT @clayjohn?

I thought it was flagged as experimental already! I guess it should be. We will likely need more changes to accommodate WebGPU, and likely there are a few more changes that might happen as we move more stuff to the ARG

@darksylinc
Copy link
Contributor

darksylinc commented Nov 19, 2024

I've been thinking a little bit: We're exposing RenderingDevice::draw_list_begin() raw. i.e. the same function we use internally is exposed "as is" to GDScript.

This may be a mistake, I'll give an example:

  1. draw_list_begin uses a BitMask to set which render textures should be cleared or ignored (or none). We do this because it's efficient.
  2. But improper use means that one can, for example, set the CLEAR bit but forget to specify the clear colour (which is invalid).
  3. A GDScript version would make the API impossible to make such mistake, because it would be safer to do something like the following (pseudocode):
enum LoadMode
{
  Clear,
  IgnoreDontCare,
  Keep
}

struct RenderTargetClearSetting
{
    LoadMode loadMode;
    Color clearColor;
};

draw_list_begin( Vector<RenderTargetClearSetting> color_settings );

Such API is not be the most performant for internal use, but it is a better design for user-facing APIs.

We may have to split it into what we expose to GDScript and the internal function (the exposed version just demangles everything and calls the internal version).

@clayjohn
Copy link
Member

@darksylinc That is actually very close to the first API that Dario and I designed. However there were a few problems:

  1. GDScript doesn't have structs
  2. We can't just have loadMode and ClearColor, we also need ClearValue (for depth and stencil). If we expose both to GDScript we create a situation where users can easily screw up and pass the wrong values (i.e. index 1 contains depth, but I only set the color, so default clear value is used and the engine can't validate anything)

@darksylinc
Copy link
Contributor

GDScript doesn't have structs

But does it have maps?

Like

data["clear_value"] = [1.0, 1.0, 1.0, 1.0];

I know gdscript isn't python, but Python does have C-like structs.

We can't just have loadMode and ClearColor, we also need ClearValue (for depth and stencil). If we expose both to GDScript we create a situation where users can easily screw up and pass the wrong values (i.e. index 1 contains depth, but I only set the color, so default clear value is used and the engine can't validate anything)

Huh? I think you have the wrong API in mind.

Proper API would be:

colours[0].mode = CLEAR;
colours[0].clear_color = [1, 0, 1, 1];

depth.mode = CLEAR;
depth.clear_value = 1.0;

stencil.mode = CLEAR;
stencil.clear_value = 0xFF;

draw_list_begin( colours, depth, stencil );

I believe this is pretty hard to screw up.

If you mix color w/ depth and colour as one single array then there is a lot of room for mistakes, and some stuff doesn't make sense (like clear colour needs 4 values but depth only 1, and for stencil it is an int).

servers/rendering/rendering_device_graph.cpp Outdated Show resolved Hide resolved
doc/classes/RDTextureFormat.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
servers/rendering/rendering_device_graph.h Outdated Show resolved Hide resolved
misc/extension_api_validation/4.3-stable.expected Outdated Show resolved Hide resolved
@DarioSamo DarioSamo force-pushed the rd-transient-targets branch 2 times, most recently from d4ea927 to 5fcfc4b Compare November 20, 2024 13:29
@clayjohn
Copy link
Member

GDScript doesn't have structs

But does it have maps?

Like

data["clear_value"] = [1.0, 1.0, 1.0, 1.0];

I know gdscript isn't python, but Python does have C-like structs.

We can't just have loadMode and ClearColor, we also need ClearValue (for depth and stencil). If we expose both to GDScript we create a situation where users can easily screw up and pass the wrong values (i.e. index 1 contains depth, but I only set the color, so default clear value is used and the engine can't validate anything)

Huh? I think you have the wrong API in mind.

Proper API would be:

colours[0].mode = CLEAR;
colours[0].clear_color = [1, 0, 1, 1];

depth.mode = CLEAR;
depth.clear_value = 1.0;

stencil.mode = CLEAR;
stencil.clear_value = 0xFF;

draw_list_begin( colours, depth, stencil );

I believe this is pretty hard to screw up.

If you mix color w/ depth and colour as one single array then there is a lot of room for mistakes, and some stuff doesn't make sense (like clear colour needs 4 values but depth only 1, and for stencil it is an int).

I see, so we would have two different classes to avoid that issue.

I think I need to clarify, we went with the bitmask not because it was efficient but because it matched other user facing APIs (like ArrayMesh and SurfaceTool). We wanted to be consistent with what users already know. The bitmask approach is really nice because it the default behaviour is typically exactly what users would expect. So with relatively little effort you can get the function working correctly.

Exposing classes that need to be instantiated (or maps alternatively) adds a lot of friction. The error cases can easily be validated in either case too, so I don't see it adding much value in forcing users to be explicit (which is a fine design goal, its just not a design goal for us).

@DarioSamo DarioSamo force-pushed the rd-transient-targets branch 2 times, most recently from 43e63e9 to 42fe294 Compare November 21, 2024 15:37
@DarioSamo DarioSamo force-pushed the rd-transient-targets branch from 42fe294 to 0f65e7d Compare November 22, 2024 18:36
@Repiteo
Copy link
Contributor

Repiteo commented Nov 22, 2024

Will need to be rebased before this can be merged

@clayjohn
Copy link
Member

Looks like this causes problems on MacOS. I tested it out briefly with the TPS demo and it seems that with this PR, only transparent objects are included in the final frame.

Screenshot 2024-11-23 at 9 56 11 PM

@darksylinc darksylinc mentioned this pull request Nov 24, 2024
3 tasks
@DarioSamo
Copy link
Contributor Author

Looks like this causes problems on MacOS. I tested it out briefly with the TPS demo and it seems that with this PR, only transparent objects are included in the final frame.
Screenshot 2024-11-23 at 9 56 11 PM

That's odd as I've not made changes to the backends to add this. I'll see if the code might have an implementation error on Metal that is exposed by the change of behavior in this PR.

@DarioSamo DarioSamo force-pushed the rd-transient-targets branch from 0f65e7d to 516bab1 Compare November 25, 2024 13:39
@DarioSamo
Copy link
Contributor Author

@clayjohn It's not a MacOS-specific problem, it's broken everywhere all the same. I'll take this out of the merge queue in the meantime.

@DarioSamo DarioSamo force-pushed the rd-transient-targets branch from 516bab1 to 6d5ac8f Compare November 25, 2024 14:27
@DarioSamo
Copy link
Contributor Author

Should be fixed now.

@DarioSamo DarioSamo requested a review from clayjohn November 25, 2024 14:29
@Repiteo Repiteo merged commit ed01f5f into godotengine:master Nov 27, 2024
20 checks passed
@Repiteo
Copy link
Contributor

Repiteo commented Nov 27, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants