GLES2 2d Batch rendering (across items) #37349
Just started taking a look now. I ran it through my previous test project and I am still getting ~5x speedup (heavy text with interleaved sprites and lights). I will review the code this afternoon and we can discuss what to do next tomorrow.

edit: I wasn't able to spend as much time reviewing as I had hoped, but so far I am liking this implementation. It feels closer to the original ItemCommands batching PR than your other iterations, i.e. a little more elegant. Testing it out on my device I also see a substantial speed increase, even compared to the previous PR for batching ItemCommands.
The lights issue is interesting and something I briefly discussed with reduz. You'll notice in the try_join_item logic there is a large commented section with light logic; at the moment it simply breaks the between-item batching when lights are used. It shouldn't need to in theory, but in practice, if there are overlaps between geometry, I found the result of joined items is different when lit, because of the multiple passes. Tilemaps etc. do still benefit even with lights though.

In fact there's another small optimization I haven't put in yet which should help on the lights front. Provided the vertex buffer isn't full, it can be reused for each light. At the moment it is refilled for the main pass and each light. The current method should still be miles faster than the old method though.

There's quite a bit of low-hanging fruit in general still available for optimization, and indeed adding other primitives, but I'd ideally prefer to get this more straightforward approach in use, to work out any regressions and get some feedback, before adding more, because the more code, the more difficult to debug. It's a bit of a balancing act really between runtime speed and code complexity.

In terms of review, I forgot to mention that I've moved a lot of the old code into RasterizerCanvasBase, and just left the higher level code and new stuff in RasterizerCanvas. As such you should be pretty much able to skim read over RasterizerCanvasBase, as it is unchanged.
OMG, that was traumatic trying to test on Android. An entire day spent reinstalling Android Studio multiple times to get it to update so it would compile the latest version of the Android templates. Then the old S3TC import bug on GLES2 (#28308, must get a fix for that merged), compounded by a new bug where Godot seems to keep trying to re-import textures because they aren't there, if you remove S3TC from the import list in ProjectSettings. 😧

Anyway, the upshot is that the PR seems to work fine for me both on the Android emulator and on my ancient Google Nexus 7 2013. 😁 I would appreciate more testing on other devices / platforms though. I suspect it will be ok everywhere because it doesn't do anything too funky API-wise, and early ARM Android devices are the most likely to show up any problems because they will crash without properly aligned structures.
Looks great! I left a few nitpicky comments (and I will do another round of naming nitpicks before this is merged).
I left a few questions where I couldn't figure something out. But overall I didn't have much trouble understanding the logic behind this and you seem to have done a great job of maintaining the current logic of the renderer.
I can see how you left it open to extend this to other Command types fairly easily. I'm wondering if our best approach may be to make a "test release" and link it in a blog post to get wider testing, even before merging the PR and before adding some of the other command types to this PR. @akien-mga may have some thoughts on that. But I think that testing is necessary to make sure this works, and you've put in so much work so far that we really ought to get some wider testing before asking you to make the final leap and finish it up.
// between the non-batched renderer and the batched renderer,
// in order to find regressions.
// This should not be defined except during development.
//#define KESSEL_FLASH
Will have to get rid of this for the final mergeable PR
// state.canvas_shader.bind();

// ris.last_blend_mode = -1;
// }
I think we should just delete the commented-out sections. From what I can tell, you have already pulled out the necessary pieces.
Yeah, I was expecting we'd decide to do that; we can always look at the old version for reference.
In terms of maintenance, this is one of the most error prone areas actually. The logic in try_join must exactly match the logic in the renderer, otherwise you will get subtle rendering bugs as a result of joining items that shouldn't have been joined. In one of the other versions of the PR, I tried putting this all in the same uber function. However it does end up quite messy.
Although I've avoided the uber function approach here, I'm equally happy to go with either approach, there are pros and cons with both. The split version is easier to read and understand (especially for people unfamiliar with how it works). However the combined uber function may possibly be easier to maintain in the long run.
I think we can either change to uber function now, or leave as is and change with another PR if it proves to be a better approach. One other factor in favour of the split approach is that it might be easier initially in terms of dealing with ironing out major problems.
For the uber function to work it would probably either have to be littered with a lot of #ifdefs, or use templates, to ensure getting 2 different runtime versions of the function, versus lots of branching at runtime which would slow everything down unnecessarily.
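As a sketch of the template option (the names here are illustrative, not from the PR): a bool template parameter makes the compiler emit two separate versions of the shared function, with the mode branch resolved at compile time instead of per command.

```cpp
#include <cassert>

// Hypothetical sketch (not the PR's code): one shared body, two compiled
// instantiations. The JOINING branch is resolved at compile time, so there
// is no per-command runtime cost for the mode check.
template <bool JOINING>
int process_commands(int num_commands) {
    int cost = 0;
    for (int i = 0; i < num_commands; i++) {
        if (JOINING) {
            cost += 1; // join pass: only inspect state, cheap per command
        } else {
            cost += 2; // render pass: pretend each command does real work
        }
    }
    return cost;
}
```

Instantiating `process_commands<true>` and `process_commands<false>` gives two distinct compiled functions, which is the "2 different runtime versions" property mentioned above without any #ifdef littering.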
I think this is fine for now. I have a feeling that if we add proper state tracking we can simplify this and the main render function enough that the duplication won't be so bad.
// if there are more than this number of commands in the item, we
// don't allow joining (separate state changes, and hardware transform)
// This is set to quite a conservative (low) number until investigated properly.
const int MAX_JOIN_ITEM_COMMANDS = 16;
I'm not sure I understand this properly. Is this to ensure that an object with many commands (like a tilemap) is assigned its own batch and avoids doing all the software transforms associated with batching?
Yes, pretty much. Essentially the problem is this: Any item that contains commands that are default (i.e. not handled by software transform and the batching renderer) should not be joined.
In order to work this out, it does a lookahead through the commands, which could potentially be very expensive. As such it makes sense to put a limit on this to some small number, which will catch nearly all cases which need joining, but not be overly expensive in the case of items with large numbers of commands.
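A minimal sketch of that capped lookahead, with hypothetical command types and function names (not the PR's actual code):

```cpp
#include <cassert>
#include <vector>

// Hypothetical simplification: commands the batcher handles vs. "default"
// commands that must go through the legacy path.
enum CommandType { CMD_RECT, CMD_DEFAULT };

const int MAX_JOIN_ITEM_COMMANDS = 16;

// Lookahead: an item is only joinable if every command is batchable, and
// items with huge command counts (e.g. tilemaps) are rejected outright so
// the scan itself stays cheap.
bool item_is_joinable(const std::vector<CommandType> &commands) {
    if ((int)commands.size() > MAX_JOIN_ITEM_COMMANDS)
        return false;
    for (CommandType c : commands)
        if (c == CMD_DEFAULT)
            return false;
    return true;
}
```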
There may be a far more efficient way of handling this. Ideally it would be nice in general to mark some extra data like this on the items themselves, and some dirty flags for when things change in the item. There's a lot of inefficiencies throughout to deal with the historical design.
In practice I don't think this will be a problem in 3.2 to do like this, but we could look at some better approaches in 4.0 where it might be possible to move some of this processing further up into the visual server.
Right, that makes sense. Thank you.
We should consider making this a project setting as well.
Can do, especially for testing, if you think this is a good idea. 👍

I do suspect that once we have figured out a good value, it won't need to be user editable. I'm conscious there's a problem with 'project settings creep' at the moment, so I'd hesitate to go overboard on this. Plus it won't be at all obvious to end users what this does. Maybe the project settings overchoice problem will be addressed in 4.0 with 'advanced settings'.

Another candidate for the project settings is the max vertex buffer size, which at the moment is hard coded in initialize().
Ah, yes, please make the max vertex buffer size a project setting as well. I meant to comment on that but forgot.
return false;
}

void RasterizerCanvasGLES2::_canvas_render_item(Item *ci, RIState &ris) {
It's concerning that this logic is now copied three times (when attempting to join, when rendering batched and here). I wonder if there is any way we could combine.
On the other hand, I do appreciate how simple it makes the implementation.
This section here is only repeated as a temporary measure for backward compatibility (i.e. turning batching off and finding regressions with the old renderer).
Once it has been fully tested and we are sure it works for everyone, there should be no need to retain the legacy codepath, and it could potentially be better to just remove it. As you say it is problematic trying to maintain 2 duplicate codepaths.
edit - Actually thinking about it, I'll double check it isn't possible to make a legacy path through the new renderer like I did with the old PR. I'm a bit wary of mixing up the new stuff with too much legacy code and making it harder to read etc.
I'm okay with commenting this saying that it needs to be removed before being merged into 3.2. It's fine for regression testing.
Ok I'll comment that. 👍
I think we should merge this in a dedicated branch. If you agree, I can create that branch and we can just change the base branch for this PR on GitHub.
I think this sounds like a good plan of action. There are a lot of 2D guys waiting for this, so we should get quite a few testers and can fix any bugs.
Done, I changed the base branch. I suggest we focus first on getting the current state merged, then I'll make a testing release, and we can work further on fixing regressions if there are any, and add support for more primitives in subsequent PRs. WDYT?
Great! 👍 Ok, the changes are now made, barring the availability of the #define for regression testing, and my suggestion regarding the merging of the legacy codepath with the new codepath to save on duplication, as I'm not sure whether it is a good idea yet (will discuss with clayjohn). Regardless, it should now be okay for making builds for testing, barring more suggestions for changes.
I have now added some more project settings, as requested. I figured out that I should be creating them in RasterizerCanvasGLES2 initialize, rather than in the visual server, because they only apply to GLES2. There are now 3 settings, and I've given these all reasonable ranges. Let me know if there are any preferences for renaming these, or putting them in different places; this can easily be done. I've done some basic testing and they seem to work ok and are quite useful for diagnostics etc. We may decide to remove some in future. I haven't yet written the class reference entries, will do that now, but committed first what I've added.

I've also added a couple of lines to support zero buffer length in the rasterizer array, as it will use that if batching is turned off, to save memory, and fixed an uninitialized value for total_quads that was causing problems with one of these parameter settings (I've done fuller constructors for BatchData and RenderItemState).
I think you still have to define them in a common code branch, otherwise they will not be visible if using the editor with GLES3 (even if configured to use GLES2 on some platforms), and they won't appear in the docs unless we specifically generate docs in that configuration.
Ah okay, I'll move them into visual_server.cpp. I only got that idea from searching in the files. If they are globally visible, should I make it clear they are for GLES2 only somehow? For instance by naming the section accordingly.
Yeah, that one seems to be defined in both places, and there are others. It's all a bit brittle :)
Ok, I've moved the project settings back to visual_server.cpp, added gles2_ to the corresponding names, and added entries in the class reference. Seems to be working fine, let me know any other changes. 👍 The max_join_commands setting is quite handy because if you set it to zero it will prevent joining items, which will be useful for finding regressions.

allow_in_editor

I've also added a new temporary setting for this. If we are satisfied it works fine in the editor and is running okay, we can remove the option and always have it run.

flash_batching

As this is a testing branch, I've added the facility to flash between the old renderer and the batched renderer as a project setting. This is helpful for finding regressions. It should not be needed after the initial testing phase.
@akien-mga Just a note that the CI has been repeatedly failing today (the last 3 commits for this PR) on AppVeyor from timeouts (taking longer than 60 mins). I don't know if it is to do with the extra Python checks that you were adding, or if it is just AppVeyor being slow. Maybe there isn't normal caching because we are on a new branch. I haven't changed any includes that might slow things down; the only major change is adding to the ProjectSettings.xml class reference. Anyway, although there is a cross, it doesn't indicate there's a problem with the code (well, unless there was a problem that would have happened after the timeout lol). 😁
Hi! I'm trying to test this feature because I'm excited about it; I think this is exactly the bottleneck in my game. But on my first try, I couldn't notice any performance increase. I assumed it was because I had a lot of sprites, so I then tried an empty scene with only a TileMap (figure 1), but that didn't help either. I was testing the performance by reading the FPS with batch rendering and without it.
@lawnjelly Yeah I guess it no longer uses the
Just tested with the recent changes to this PR in a project of mine. At first I thought there were no significant gains, but I couldn't believe that. Then I recalled that I've actually rewritten the way bezier textures are drawn in relation to z-index (when you have to draw an arm either in front of or behind a character, or even in front of other layers like weapons). Unfortunately, the drawing API doesn't provide a way to simulate different z-indices within the same CanvasItem, something like:

    func _draw():
        draw_set_z_index(zi)
        # ... draw textures on different z_index within the same CanvasItem

Instead, I had to basically split up the same class and use different draw offsets/ranges, which I suppose does "harmful" things performance-wise when this is split between canvas items. This is where I had a drop in performance once again, but I have to say that in either case, GLES2 with batching performs better than or on par with GLES3.

The good news is that, without the splitting hack (only drawing an entire arm within a single canvas item), I'm able to draw around 300 characters on screen before the FPS goes below 60 (I was only able to render around 80 characters on GLES3). Having said that, I'm totally satisfied with the performance gains, and some optimizations are now possible at the per-project level (both rendering and GDScript bottlenecks), so thanks for the work you've put into this @lawnjelly, hopefully my testing efforts weren't in vain. 😁
Yeah, I can't say what your new bottlenecks are without examining your new project; it could be a number of things.
There are limitations as to what can be batched in order to preserve the behaviour of the previous renderer, and not have visual regressions. The updating of the visual server did look very slow in your bezier approach and it is worth an investigation, but it is outside the scope of this PR. I think the aim here should be to at least cover the most popular use-cases, and as many generalised use cases as possible. If you are using quite bespoke techniques you may have to experiment a bit to find out what works best. Profiling is really useful for this kind of thing, especially if you can compile from source. That reminds me I'm intending to add some diagnostics within the IDE to help with this kind of thing. I believe the 'monitor' tab of the Godot IDE didn't work at all for GLES2 2D before, but I'll try and get it to correctly monitor the number of drawcalls and whatever else is possible. I may be able to get it to debug print out some diagnostics for a frame, I'm not sure yet though, it would depend on some method to call from the IDE or script.
Yes, the z-index is set for the entire item (usually a node), and not the commands within it. However, afaik if you order the drawing commands yourself, this order will be respected in the renderer. And it will probably be more efficient to boot (providing you do your sorting efficiently on your side).
This adds 2 new values (items and draw calls) to the performance monitor in a '2d' section, rather than reusing the 3d values in the 'raster' section. This makes it far easier to optimize games to minimize drawcalls.
On a suggestion from Akien I've now changed this. I've also added (in the same commit) some very basic features to the IDE Debugger monitor.

These are particularly useful for getting batching working as well as possible in your game, because you can see the drawcalls reduce dramatically the more batching you are achieving. It also operates with batching switched off. The alternative is to reuse 'drawcalls' in the raster section (if adding new values might e.g. affect the efficiency of the debugger), but a new section is preferable if possible because it enables you to examine 2D and 3D in isolation in 3D games which have 2D user interfaces. It may also be possible to add more performance stats, but this should do to start with.

I haven't implemented these for GLES3, but it should be fairly simple to use the same mechanism there. I'd also appreciate anyone familiar with the Debugger monitor looking over how I added the values in this commit. I literally did a find_in_files for existing values until it worked, so I can't guarantee I've done everything correctly, particularly whether it might affect monitoring over the network.
@akien-mga We have been discussing the current state on IRC, I think this is ready to be added to the beta.
In most of our test situations batching is at least as fast, if not faster than the current code. However, batching can be up to 10% slower when the game is heavily CPU limited. In this situation, batching still reduces the GPU load, but the overall FPS suffers a slight drop. lawnjelly and I feel this shouldn't hold back beta testing, but it does mean we will leave in the batching toggle in ProjectSettings for now, as some users will benefit from turning batching off.
Added project setting to enable / disable print frame diagnostics every 10 seconds. This prints out a list of batches and info, which is useful to optimize games and identify performance problems.
Just as a final tweak, based on the conversation last night with @clayjohn, I've added a simple frame diagnostics mode. You can turn it on and off in the project settings. When set to on, it prints every 10 seconds a log of the batches being used in the frame, in this example form:
This shows how successful the item joining is, and whether the batch is default (D) or rect (R), then the number of items in the batch, the batch texture ID in square brackets, and a MULTI symbol if there are multiple rects in the batch. The aim should be to get as many MULTIs as possible. Projects that are not batching well will have a lot of single-rect batches. Note that complex games will have quite long frame logs, in which case they might get truncated by the editor. You can either run Godot from a console to see the full log, or you can increase the IDE buffer size.
Looks good to me. Let's get this merged and tested in production 🤘 Thanks a ton for the tremendous work @lawnjelly, @clayjohn and all testers!
WARNING: render_batches: NinePatch without texture not supported yet in GLES2 backend, skipping.
Panel -> Custom Styles -> StyleBoxTexture -> Texture -> New AtlasTexture. I tried to do the same thing in one button and it gave the same result... There is a problem with the Custom Style: it is not accepting an AtlasTexture. It seems to be an engine problem. But I solved it in a strange way: instead of creating an AtlasTexture, I loaded the texture and just set the region. It worked, but it was confusing.
Despite the function name, this area is just copy-pasted from the old renderer (to reuse the same code between the legacy and batched renderer). I suspect the old renderer would do the same (if you switch off use_batching), or compare with e.g. Godot 3.2.1.
Is 0.5 half the screen or a quarter? If you're dividing the X and Y by 2 or multiplying by 0.5, it's a quarter. You'd have to divide by 1.4142~ or multiply by 0.7071~ to actually get half the screen resolution. It's an important distinction for documentation.
It's an area: width * height in all cases. Effectively a pixel count.

In this way we can selectively apply glScissor in cases which are likely to save a lot of fill rate. Or we can apply it in all cases (set to 0), or turn it off completely (1). On my PC there is not a massive cost to scissoring, and setting it to 0 is quite viable. However, it is possible that on some hardware there may be some cost to scissoring (at the very least, there are a couple of GL calls), such that the best value lies somewhere in between. I don't have the facilities to rigorously test this, so the decision is left to the user. We may later be able to automate some of these decisions to an extent, but it is hard to get feedback on the effects of changing this kind of thing without running e.g. some diagnostic tests at startup. This is similar to the problems faced by a car ECU.
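As a hedged sketch of how such a threshold might be applied (the function name and exact comparison here are assumptions for illustration, not the PR's actual code): the setting is a 0..1 fraction of screen area, where 0 scissors in every case and 1 disables scissoring.

```cpp
#include <cassert>

// Hedged sketch; the exact comparison in the PR may differ. The threshold
// is a 0..1 fraction of screen area: 0 applies glScissor in every case,
// 1 turns scissoring off completely.
bool should_apply_scissor(float item_area_pixels, float screen_area_pixels,
                          float threshold) {
    float ratio = item_area_pixels / screen_area_pixels;
    // Scissor only when the lit region is small relative to the screen,
    // i.e. when a lot of fill rate stands to be saved.
    return ratio < (1.0f - threshold);
}
```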
Looking at this, shouldn't it read like the orphaning done in the initialization?
I believe that the zero size results in a no-op. When we introduced orphaning in GLES3, which gave the speed boost on mobile, the rationale was to orphan the whole buffer.
This is a little bit of a gray area, as far as I am concerned. Generally I go with manually using a circular set of vertex buffers to make it explicit, but in this case we are reusing the same VB multiple times per frame (as much as 1000+ times), so orphaning seemed easier. https://www.khronos.org/opengl/wiki/Buffer_Object_Streaming

It seems the driver is able to do whatever it wants, so we are having to predict what it might do to a certain extent. Using the same size as before makes some sense if you are going to be reusing a similar sized buffer (especially if once per frame). However, in this case the buffer size will be varying each time, so we don't gain much by doing that. In fact, it might be detrimental, because it might decide to reserve some memory just in case you glBufferSubData to it. I think I did try both, and decided this made the most sense, but am willing to change it if we discover it causes problems (it has seemed fine so far). 👍

If you think about it, if we were reusing the same vertex buffer 1000 times in a frame, and this was a 1 meg VB, if it actually successfully did re-use these allocations, it might be churning 1 gig of memory per frame, and using > 3 gigs in storage. Which would be extremely bad news. In practice I'm hoping the NULL will just say 'I don't care about the last version, use it, then delete'. Which is what we want in this case. It may work just as well to remove the NULL call (I'll try and test this out). 😁
It will work, but that's not the point. The point of orphaning is to let the driver know that we're not interested in the previously allocated and written-to buffer and that we're done with it. The first bind/unbind cycle says "go draw this data". The second time we do that, we may overlap with what is currently in progress, which is a nasty constraint for the driver and usually results in a slowdown (which is what I saw last time I dealt with that stuff), unless we orphan the whole buffer. After the first bind/unbind cycle, the driver assumes that we may rebind and write to the buffer again. But if we say bind(size, null), there is no other interpretation than "we're done with the whole previous buffer, we need a new one". If we say bind(0, null), this means nothing significant, as the buffer may be written to again and rebound. The latter is basically like a malloc(0): no harm to the heap, no effect.
Afaik, glBufferData should always release the reference to the previous buffer and create a new fresh allocation (it presumably leaves a ref count on the previous object, which will get to zero once it has been rendered and is no longer owned). That's the difference between glBufferData and glBufferSubData: glBufferSubData is to update some part of the buffer; glBufferData is to say 'throw away what you've got and give us a new one, and upload this new data'.

Orphaning is probably more relevant when you want to use glBufferSubData in combination. If you don't do it in that scenario, presumably it has to do a load of housekeeping to keep the rest of the buffer the same (and may cause pipeline stalls). We are not doing that here; we are simply throwing out the whole of the old buffer. https://www.khronos.org/registry/OpenGL-Refpages/es3.0/html/glBufferData.xhtml
Yes, you are right, and I stand corrected. And that glBufferData(0, 0) seems spurious at best. The driver can't do anything meaningful with it, as it's not even an early indication of whatever follows, which is an overwrite.
You are very probably right; I put it in during development as a failsafe (hence the 'for now' comment). It can hopefully be removed without ill effect. 👍 In fact this may have caused the confusion; it is probably a leftover from testing. I was regularly removing sections of code, and I think the failsafe was there to ensure the VB was orphaned at that point if the second glBufferData call was commented out (I forget the exact reason). In the same way there are still a few state changes that may be redundant; these can be cleaned up at a later stage.
@lawnjelly What's your take on batching support for CPUParticles2D? Is that potentially within scope or totally out of scope, assuming you can weigh or "feel" the complexity of it?
Looking in GLES2 at least, the particles (I think, from memory) are rendered with a multimesh, and it looks very inefficient, with setup and drawcalls per particle. It could probably be done as a single dynamic mesh much more easily (and unrelated to the rest of the batching). However, there may be lots of gotchas relating to the existing particle design; it could either be easy or a can of worms. I haven't looked at the particles in GLES3; they may be better already.

I don't really have time to look at it myself (GLES3 batching and helping write new renderers for 4.x have to take priority). If someone else wants to have a go, go for it, but bear in mind there may be some changes to the rasterizer layout for GLES3 batching. And all this will change for 4.x anyway. I still have to figure out some stuff about the shader binding versions in GLES3 too, which has cropped up with a bug with custom shaders and the GLES3 nvidia workaround, and may be relevant to GLES3 batching (the binding and states have to be done slightly differently to GLES2).
Thanks. I was asking about CPU particles in GLES2 because GLES3 has GPU particles, while GLES2 is "limited" to CPU particles, which are, as you say, probably suboptimal for the time being. Since the game we're working on is a 2D game using GLES2, it has a bunch of CPU particles, and most primitives are left out of batching. Enabling batching produces worse performance results than without it, so vanilla GLES2 is best for us. GLES3 also performs worse than GLES2 despite the GPU particles: much more stutter and performance variance. Profiling shows the usual GPU bottlenecks, most probably at the CPU/GPU interface level, since the GPU load itself is relatively low.

That said, it looks to me like there are 2 ways of addressing this: a) lower the load on the CPU/GPU sync points (batching), b) move most of those CPU particles to shaders (back to assembly language). There may be other solutions as well, but none off the top of my head. In short, great fun ahead.

PS: I have looked at the CPUParticles code. It is indeed handled with a multimesh, which looks already quite optimized to me; I don't see any obvious improvement. It is the same code for GLES2 and GLES3, with the sole difference being that GLES3 uses instanced rendering, which isn't supported in GLES2. The inner particle loop contains only the per-particle transforms, and the rest of the setup is externalized. So I am not really sure what "single dynamic mesh" being probably much easier means cf. @lawnjelly, since the code looks state of the art with what GLES2 allows for rendering multiple objects.
2D rendering is currently bottlenecked by drawing primitives one at a time, limiting OpenGL efficiency. This PR batches primitives and renders them in fewer drawcalls, resulting in significant performance improvements. It also speeds up text rendering.
This PR batches across canvas items as well as within items.
The code dynamically chooses between a vertex format with and without color, depending on the input data for a frame, in order to optimize throughput and maximize batch size.
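As an illustration of the trade-off behind that choice (these struct layouts are hypothetical, not the PR's actual formats), a per-vertex color roughly doubles the data uploaded per vertex, which is why it pays to detect frames that don't need it:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layouts, not the PR's actual formats.
struct BatchVertex {        // position + UV only
    float pos[2];
    float uv[2];
};

struct BatchVertexColored { // position + UV + per-vertex color
    float pos[2];
    float uv[2];
    float color[4];
};

// Choose the format for a frame based on whether the input colors vary.
std::size_t bytes_for_frame(int num_verts, bool colors_vary) {
    return num_verts * (colors_vary ? sizeof(BatchVertexColored)
                                    : sizeof(BatchVertex));
}
```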
Test Builds
Many thanks to Akien for making test builds (more info further down thread), please try them out and let us know how you get on with your game / test projects. Let us know especially any visual bugs so we can fix them! 👍
https://downloads.tuxfamily.org/godotengine/testing/3.2.2-gles2_batching_build2/
Notes
The settings are in `projectsettings/rendering/gles2/`, and the new rendering can be toggled with the old behaviour using the `use_batching` boolean.
There is also a `use_batching_in_editor` boolean. Only use this when you are sure it is working fine for some of your projects. This is more for finding regressions at this point than for editor performance.
History and Explanation
I rewrote the 2d renderer several times (this is probably the 4th or so version, and about the 7th iteration). An earlier version was in fact considerably more advanced, however a massive problem in practice has been regressions between the new renderer behaviour and the old renderer. To that end I threw out a lot of earlier work and considerably simplified the problem in order to retain as much of the existing logic as possible and maintain compatibility.
In the original PR it proved simple enough to batch commands within a render item. This worked for many cases such as tilemaps and text. However there was keenness to be able to batch different items together where possible, such as in bunnymark, with large numbers of independently moving sprites.
An important consideration was that the functions had to be callable multiple times rather than a single time, because they have to deal with the situation where the vertex buffer fills up: flush, then refill.
In addition, in order to batch between items, joined items must use software transform. Where items cannot be joined, or where they contain a large number of commands, the items are not joined and hardware transform is used. This retains maximum speed for tilemaps and large areas of text.
Future
We can add batching for other primitives using the same method. Note that this PR is made against Godot 3.2 but is also designed to eventually be used in 4.0.
Performance increases - what to expect
Most importantly, the frame rate increases you get will depend heavily on what the bottlenecks in your game are. It MAY NOT speed up all games, especially those that are already running fast - some will see great differences, others not so much.
The performance increases should be on a par with the old PR, with the addition of increases due to batching across items.
However, I will say that the additional benefit of batching across items, versus just batching within items, may not be that large in real-world use cases (as opposed to test cases), as it is less likely to be a bottleneck.
How bottlenecks work
If the game is spending 1% of its time making drawcalls, then even if we speed up the batching 100%, there is only 1% of gain available. If on the other hand your game is spending 80% of its time in drawcalls, you can see very large performance increases. You can either use profiling to examine this (a large amount of time spent in the graphics driver normally indicates API bottlenecks), or simply try out the build.
A Quick Guide
The settings are in ProjectSettings->Rendering->Gles2:
- `use_batching` - turns batching on and off (you can easily compare frame rates with and without)
- `flash_batching` - if batching is on, this alternates between the legacy renderer and batching on alternate frames - this can help find visual differences, which you should report here so I can fix
- `use_batching_in_editor` - this will restart the editor and use batching to render the Godot IDE. Providing running projects look ok with batching, I'm 99.9% sure running in the editor should be fine too. Please report any visual differences in behaviour in the editor so I can fix.
- `max_join_item_commands` - the default should work fine, but it is possible that different values may give you slightly better results in your game. A value of 0 turns joining items off, which can be used to help diagnose regressions (it will usually slightly decrease batching gains though).
- `colored_vertex_format_threshold` - try varying this to see how it affects the performance in your game; the best value may be game / hardware dependent, we are not sure yet.
- `light_scissor_area_threshold` - if you are using lights in your game, be sure to test varying this, even if you do not benefit from batching. This can significantly reduce GPU fillrate requirements, which can be a bottleneck in some games.