-
-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Acyclic Command Graph for Rendering Device #84976
Conversation
3e9b18a
to
7a3b4e6
Compare
3bd08dd
to
7973cd0
Compare
Opening this for review. We won't take any steps towards merging these until the elements marked out in the TODO are done and the PR this is based on is merged, but I expect it to take a while to effectively review all these changes. |
7973cd0
to
4b504db
Compare
a1e01bc
to
2c1954d
Compare
Some initial testing, after fixing an issue in @RandomShaper PR, this is working on both desktop VR (tested with an Valve Index headset) and on Quest (tested on Quest 3). I have more testing to do. This was done with the mobile renderer, there weren't any obvious performance improvements but I wasn't expecting any as the mobile renderer already has a minimum of passes, there isn't much for the graph to optimise. I did notice when testing on the Quest 3 that MSAA broke, as far as I can tell it looks like it's resolving MSAA before it has finished rendering, so there is an issue in barriers. I did not test this with just @RandomShaper PR, so not 100% sure if this is introduced by the acyclic graph or if we're missing something in the new barrier code. |
The conclusions after talking with @akien-mga seem to be so far:
We should probably discuss how to approach this, as it could be a hint of something being wrong in the driver/system combination itself. Reviewing the RenderDoc capture has not revealed anything apparent nor does the validation or synchronization layer show any errors about it. It might be possible to build a standalone sample using Vulkan that replicates the issue if we want to dedicate time to that. Alternatively, we can enable buffer barriers by default at the cost of some performance and trusting that IHVs implement them correctly (or at least basically translate them to global barriers internally). |
e1a29bd
to
6065cc1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dario and I discussed this on chat. The issue that Remi is facing appears to go away when using buffer barriers, when using full barriers, or when not reordering commands. We suspect that it is caused by a driver bug as it only appears to reproduce on a specific combination of hardware.
For most drivers, buffer barriers are ignored/promoted to memory barriers. In theory their should be some overhead from adding them, but testing has shown that the overhead is minimal.
Accordingly, our plan is as follows:
- Enable buffer barriers by default
- If @akien confirms that this new update works fine on his hardware, merge this before Dev2
- Create an MRP for AMD/MESA and submit a bug report.
- Remove the buffer barriers once the problematic driver is fixed, or once we find a better workaround
Well this is a bad discovery to make at the last minute, but it turns out that at some point, my Vulkan Validation misconfigured itself and actually turned off my synchronization checking, and now I get some synchronization errors that @akien-mga was reporting (not the error that was reported on the scene however, the visuals themselves are still fine). I realized this when I went to test another project and was wondering why I was not getting synchronization errors in a more obvious scenario. I'd suggest avoiding to merge this until these synchronization errors are addressed, as there's quite a lot more than I thought there were due to the Validation layer turning itself off at some point during development. EDIT: Upon further testing I can confirm when forcing the full barrier access bits most of the errors are gone at least, so the rendering graph logic itself seems fine, it just needs some further tweaking for correctness and analyzing what's missing from these cases. |
6065cc1
to
d7ea8b7
Compare
I was able to solve most of the synchronization errors, although one of the solutions will probably remain a bit temporary until a more proper solution is figured out, but it's not exactly a pressing case as it involves an edge case with slices transitions (mostly due to how reflection probes behave). There's another synchronization error in the TPS project, but it seems actually unrelated to the graph and it has more to do with the texture upload in particular of that project. It's worth checking if that error shows up as well in |
Oh thank god. I am building a PR on top of the RenderGraph and couldn't figure out why the layers were screaming at me when I hadn't even launched my new compute shader yet. |
Were you using only validation or synchronization? I never saw errors with regular validation so far, but don't hesitate to report anything that might've been missed. |
I retested the latest version of this PR (d7ea8b7). I confirm that:
|
d7ea8b7
to
53313d0
Compare
@akien-mga tested a standalone Vulkan sample that I created but we were unable to reproduce the glitch he's getting when using only memory barriers instead of buffer barriers. It seems it'll be much harder to trace what exactly is failing here and what part of the operations are corrupting it. |
Adds a new system to automatically reorder commands, perform layout transitions and insert synchronization barriers based on the commands issued to RenderingDevice.
53313d0
to
cc4d39b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most recent version looks good. I tested locally with the synchronization layer enabled and can confirm that the errors present in the last version are now gone.
At this point I am comfortable saying that this is ready for merging before Dev2.
Thanks and congrats, this is an amazing change! 🎉 🥇 |
Background
@reduz proposed the idea a while ago of refactoring RenderingDevice to automatically build a graph out of the commands submitted to the class (outlined here). The first stepping stone towards implementing this was @RandomShaper's PR (#83452) that splits the commands into easily serializable parameters. Therefore, merging this PR will require merging that PR first (and if you wish to review this, only look at my individual commit after Pedro's changes).
Improvements
This PR makes the following improvements towards implementing @reduz's idea.
vkCmdPipelineBarrier
calls has been reduced immensely. On average I've noticed a reduction of about 60-80% of the total amount of barrier calls in a frame when compared tomaster
.Implementation
Since commands need to be deferred until later for reordering, everything submitted to RD is serialized into a large uint8 vector that grows as much as necessary. The capacity of all these vectors is reused across recordings to avoid reallocation as much as possible. The recorded commands are the bare minimum the RenderingDeviceDriver needs to execute the command after reordering is performed.
As expected, this PR will add CPU overhead that was not there before due to the additional time spent recording commands and reordering. While there's a very obvious optimization that is indicated in the TODO list that is pending, the rest of the extra work won't be as easy to overcome. This extra cost can also be offset tremendously by using secondary command buffers, which is pending to be enabled in the PR as soon as the issue described in the TODO is figured out.
Compatibility breakage
Performance improvements
GPU performance improvements are expected across the board as long as the CPU overhead isn't slowing the game down (which should go down with the future immutable change). The performance improvements will also vary depending on how much the particular IHV was suffering from inefficient barrier usage. One area that will particularly benefit is projects using GPU particles, as their processing will be much more parallelized than it was before.
At least on an NVIDIA 3090ti I've noticed around an overall ~10% frame time improvement in several projects, with potential bigger wins in platforms like AMD that can parallelize effectively on single queue or mobile hardware that does not handle barriers as gracefully as NVIDIA does.
Future improvements
These will be researched after this PR is merged as a second iteration.
TODO
Production edit: closes godotengine/godot-roadmap#29