
GLES2 2d Batch rendering (across items) #37349

Merged Apr 17, 2020 (7 commits)

Conversation

lawnjelly
Member

@lawnjelly lawnjelly commented Mar 27, 2020

2d rendering is currently bottlenecked by drawing primitives one at a time, limiting OpenGL efficiency. This PR batches primitives and renders in fewer drawcalls, resulting in significant performance improvements. This also speeds up text rendering.

This PR batches across canvas items as well as within items.

The code dynamically chooses between a vertex format with and without color, depending on the input data for a frame, in order to optimize throughput and maximize batch size.
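As a rough sketch of that decision (the struct and function names here are illustrative, not the engine's actual types), the idea is that a frame with few per-vertex color changes can use a smaller vertex, so more primitives fit in each batch:

```cpp
#include <cassert>

// Illustrative vertex layouts (not the engine's actual structs).
struct BatchVertex {          // position + UV only; color comes from a per-batch constant
    float pos[2];
    float uv[2];
};
struct BatchVertexColored {   // adds per-vertex color, doubling the vertex size
    float pos[2];
    float uv[2];
    float col[4];
};

// Pick the larger format only when the fraction of color changes in the
// frame's input data exceeds a threshold (cf. colored_vertex_format_threshold).
inline bool use_colored_format(int color_changes, int total_prims, float threshold) {
    if (total_prims == 0)
        return false;
    return (float)color_changes / (float)total_prims > threshold;
}
```

With the smaller format, twice as many vertices fit in the same buffer, which is exactly why maximizing batch size favors dropping per-vertex color when possible.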

Test Builds

Many thanks to Akien for making test builds (more info further down thread), please try them out and let us know how you get on with your game / test projects. Let us know especially any visual bugs so we can fix them! 👍
https://downloads.tuxfamily.org/godotengine/testing/3.2.2-gles2_batching_build2/

Notes

  • This is an enhanced version of my original PR GLES2 2d Batch rendering #35957, which could not batch across multiple items
  • This currently batches only rects
  • The batching settings are in ProjectSettings under rendering/gles2/, and you can toggle between the new and old behaviour using the use_batching boolean
  • There is an experimental option use_batching_in_editor. Only use this once you are sure batching is working fine in your projects. At this point it is more for finding regressions than for editor performance.

History and Explanation

I rewrote the 2d renderer several times (this is probably the 4th or so version, and about the 7th iteration). An earlier version was in fact considerably more advanced, however a massive problem in practice has been regressions between the new renderer behaviour and the old renderer. To that end I threw out a lot of earlier work and considerably simplified the problem in order to retain as much of the existing logic as possible and maintain compatibility.

In the original PR it proved simple enough to batch commands within a render item. This worked for many cases such as tilemaps and text. However there was keenness to be able to batch different items together where possible, such as in bunnymark, with large numbers of independently moving sprites.

  • The simple approach taken here was to add an extra pass to the items, to identify any that could be joined with the previous item, to form joined_items. This entails duplicating a simplified version of the state logic.
  • The next stage was to modify the existing renderer to deal with joined_items rather than single items.
  • Finally the existing code in the old PR was used to batch the commands in these joined items.

An important consideration was that the functions had to be callable multiple times rather than just once, because they had to deal with the situation where the vertex buffer fills, is flushed, and then refills.
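That requirement can be modelled with a small standalone sketch (a simplified stand-in, not the engine code): the fill routine is safe to re-enter, flushing whenever the buffer fills and carrying on from where it left off.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of a fixed-size vertex buffer that must flush mid-item.
struct Batcher {
    size_t capacity;    // max vertices per flush
    size_t used = 0;
    int flushes = 0;    // stands in for issuing a draw call

    void flush() {
        if (used == 0)
            return;
        ++flushes;      // real code: upload the buffer and issue the draw call
        used = 0;
    }
    // Safe to call repeatedly; a single large item may span several flushes,
    // mirroring the fill / flush / refill design described above.
    void add_vertices(size_t verts) {
        while (verts > 0) {
            if (used == capacity)
                flush();
            size_t room = capacity - used;
            size_t take = verts < room ? verts : room;
            used += take;
            verts -= take;
        }
    }
};
```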

In addition, in order to batch between items, where items are joined it is necessary to use software transform. However, where items cannot be joined, or where they contain a large number of commands, the items are not joined and hardware transform is used. This retains maximum speed for tilemaps and large areas of text.
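The software transform amounts to baking each item's 2D transform into its vertices on the CPU, so many joined items can share a single (identity) hardware transform; a minimal sketch (Xform2 is an illustrative stand-in for Godot's Transform2D):

```cpp
#include <cassert>

// Minimal 2D affine transform, standing in for Godot's Transform2D.
// Maps (x, y) to (a*x + c*y + tx, b*x + d*y + ty).
struct Xform2 {
    float a, b, c, d, tx, ty;
    void xform(float &x, float &y) const {
        float nx = a * x + c * y + tx;
        float ny = b * x + d * y + ty;
        x = nx;
        y = ny;
    }
};

// Software transform: bake the per-item transform into the 4 corners of a
// rect (8 floats, x/y interleaved) before they enter the batch buffer.
inline void bake_rect(const Xform2 &t, float verts[8]) {
    for (int i = 0; i < 4; ++i)
        t.xform(verts[i * 2], verts[i * 2 + 1]);
}
```

The trade-off is CPU cost per vertex, which is why items with many commands (tilemaps, large text) are better left unjoined on hardware transform.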

Future

We can add batching for other primitives using the same method. Note that this PR is made against Godot 3.2 but is also designed to eventually be used in 4.0.

Performance increases - what to expect

Most importantly, the frame rate increases you will get will depend heavily on what the bottlenecks in your game are. It MAY NOT speed up all games, especially those that are running fast already - some will see great differences, others not so much.

  • Generally, the more rects you have on screen, the higher the density of your tilemaps, the more text, the more you are likely to benefit.
  • If you are only drawing a small number of rects you are unlikely to be bottlenecked by drawcalls, and will see less benefit.
  • If you are bottlenecked by fillrate (e.g. a smaller screen results in higher frame rate) you will see less benefit.
  • You will typically see a relatively larger performance increase in release builds than in debug builds, because the batching is CPU limited.

The performance increases should be on a par with the old PR, with the addition of increases due to batching across items.

However, the additional benefit of batching across items, versus just batching within items, may not be that large in real world use cases (as opposed to test cases), as it is less likely to be the bottleneck.

How bottlenecks work

If the game is spending 1% of its time making drawcalls, even if we speed up the batching by 100%, there is only 1% of gain available. If on the other hand your game is spending 80% of its time in drawcalls, you can see very large performance increases. You can either use profiling to examine this (a large amount of time spent in the graphics driver normally indicates API bottlenecks), or simply try out the build.
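This is just Amdahl's law: if a fraction p of frame time goes to draw calls and batching accelerates that fraction by a factor s, the overall frame speedup is 1 / ((1 - p) + p / s). A tiny helper makes the two cases above concrete:

```cpp
#include <cassert>

// Overall frame speedup when a fraction p of frame time is spent in draw
// calls and that fraction is accelerated by factor s (Amdahl's law).
inline double frame_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.01 the ceiling is about 1%, no matter how fast the batching is; with p = 0.8 even a 10x drawcall speedup yields roughly a 3.5x overall gain.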

A Quick Guide

The settings are in ProjectSettings->Rendering->Gles2

  • use_batching - turns on and off (you can easily compare frame rates with and without)
  • flash_batching - if batching is on, this alternates between the legacy renderer and batching on alternate frames - this can help find visual differences, which you should report here so I can fix them
  • use_batching_in_editor - this will restart the editor and use batching to render the Godot IDE. Providing running projects look ok with batching, I'm 99.9% sure running in the editor should be fine too. Please report any visual differences in behaviour in the editor so I can fix them.
  • max_join_item_commands - the default should work fine, but it is possible that different values may give you slightly better results in your game. A value of 0 turns joining items off, which can be used to help diagnose regressions (it will usually slightly decrease batching gains though).
  • colored_vertex_format_threshold - try varying this to see how it affects the performance in your game, the best value may be game / hardware dependent, we are not sure yet.
  • light_scissor_area_threshold - if you are using lights in your game, be sure to test varying this, even if you do not benefit from batching. This can significantly reduce GPU fillrate requirements, which can be a bottleneck in some games.
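As a rough model of the scissor idea (hypothetical names, not the engine's code): when a light's intersection with an item covers only a small fraction of the item, the light pass can be scissored to that intersection, cutting fillrate.

```cpp
#include <algorithm>
#include <cassert>

struct Rect { float x, y, w, h; };   // illustrative stand-in for Rect2

inline Rect intersect(const Rect &a, const Rect &b) {
    float x0 = std::max(a.x, b.x), y0 = std::max(a.y, b.y);
    float x1 = std::min(a.x + a.w, b.x + b.w);
    float y1 = std::min(a.y + a.h, b.y + b.h);
    return { x0, y0, std::max(0.f, x1 - x0), std::max(0.f, y1 - y0) };
}

// Scissor the light pass when the light covers less than `threshold` of the
// item's area (cf. light_scissor_area_threshold).
inline bool should_scissor(const Rect &item, const Rect &light, float threshold) {
    float item_area = item.w * item.h;
    if (item_area <= 0.f)
        return false;
    Rect i = intersect(item, light);
    return (i.w * i.h) / item_area < threshold;
}
```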

@clayjohn
Member

clayjohn commented Mar 27, 2020

Just started taking a look now. Ran it through my previous test project and I am still getting ~5x speedup (heavy text with interleaved sprites and lights). I will review the code this afternoon and we can discuss what to do next tomorrow.

edit: I wasn't able to spend as much time reviewing as I had hoped, but so far I am liking this implementation. It feels closer to the original ItemCommands batching PR than your other iterations, i.e. a little more elegant. Testing it out on my device I also see a substantial speed increase, even compared to the previous PR for batching ItemCommands.

@clayjohn clayjohn added this to the 3.2 milestone Mar 27, 2020
@lawnjelly
Member Author

lawnjelly commented Mar 28, 2020

heavy text with interleaved sprites and lights

The lights result is interesting and something I briefly discussed with reduz. You'll notice in the try_join_item logic there is a large commented section with light logic; at the moment it simply breaks the between-item batching when lights are used.

It shouldn't need to in theory, but in practice if there are overlaps between geometry, I found the result of joined items is different when lit, because of multiple passes:

i.e. (a * light) + (b * light) is NOT the same as (a + b) * light
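A toy numeric illustration of this (not the engine's actual blend pipeline, just the shape of the problem): if overlapping geometry saturates in the framebuffer before the light modulates it, the joined result differs from accumulating each item's lit contribution separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Framebuffer channel values saturate at 1.0.
inline float sat(float v) { return std::min(v, 1.0f); }

// Separate items: each contribution is modulated by the light, then accumulated.
inline float lit_separately(float a, float b, float light) {
    return sat(sat(a * light) + sat(b * light));
}
// Joined items: overlapping geometry saturates in the base pass first,
// then the light modulates the already-clamped result.
inline float lit_joined(float a, float b, float light) {
    return sat(sat(a + b) * light);
}
```

With a = b = 0.6 and light = 0.5, the separate result is 0.6 but the joined result is only 0.5 because the base pass clamped 1.2 down to 1.0 first; when nothing saturates, the two agree, which is why non-overlapping geometry is safe to join.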

Tilemaps etc. do still benefit even with lights though. In fact there's another small optimization I haven't put in yet which should help on the lights front: provided the vertex buffer isn't full, it can be reused for each light. At the moment it is refilled for the main pass and for each light. The current method should still be miles faster than the old method though.

There's quite a bit of low hanging fruit in general still available for optimization, and indeed adding other primitives, but I'd ideally prefer to get this more straightforward approach in use, to work out any regressions and get some feedback, before adding more, because the more code, the more difficult to debug. It's a bit of a balancing act really between run time speed and code complexity.

I forgot to mention in terms of review, I've moved a lot of the old code into RasterizerCanvasBase, and just left the higher level code and new stuff in RasterizerCanvas. As such you should be pretty much able to skim read over RasterizerCanvasBase as it is unchanged.

@lawnjelly
Member Author

OMG, that was traumatic trying to test on Android. An entire day spent reinstalling android studio multiple times to get it to update so it would compile the latest version of the android templates. Then the old S3TC import bug on GLES2 #28308 (must get a fix for that merged) compounded by a new bug where Godot seems to keep trying to re-import textures because they aren't there, if you remove S3TC from the import list in ProjectSettings. 😧

Anyway upshot is that the PR seems to work fine for me both on the Android emulator and on my ancient Google Nexus 7 2013. 😁 Would appreciate more testing on devices / platforms though.

I suspect it will be ok everywhere because it doesn't do anything too funky API wise, and early ARM Android devices are the most likely to show up any problems because they will crash without properly aligned structures.

@clayjohn clayjohn left a comment
Member

Looks great! I left a few nitpicky comments (and I will do another round of naming nitpicks before this is merged).

I left a few questions where I couldn't figure something out. But overall I didn't have much trouble understanding the logic behind this and you seem to have done a great job of maintaining the current logic of the renderer.

I can see how you left it open to extend this to other Command types fairly easily. I'm wondering if our best approach may be to make a "test-release" and link it in a blog post to get wider testing, even before merging the PR and then adding in some of the other command types in this PR. @akien-mga may have some thoughts on that. But I think that testing is necessary to make sure this works, and you've put in so much work so far, we really ought to get some wider testing before asking you to make the final leap and finish it up.

drivers/gles2/rasterizer_canvas_gles2.h (3 resolved review threads)
// between the non-batched renderer and the batched renderer,
// in order to find regressions.
// This should not be defined except during development.
//#define KESSEL_FLASH
Member

Will have to get rid of this for the final mergeable PR

drivers/gles2/rasterizer_canvas_gles2.cpp (3 resolved review threads)
// state.canvas_shader.bind();

// ris.last_blend_mode = -1;
// }
Member

I think just delete the commented-out sections. From what I can tell, you have already pulled out the necessary pieces.

Member Author
@lawnjelly lawnjelly Mar 29, 2020

Yeah I was expecting we'd decide to do that, we can always look at the old version for reference.

In terms of maintenance, this is one of the most error prone areas actually. The logic in try_join must exactly match the logic in the renderer, otherwise you will get subtle rendering bugs as a result of joining items that shouldn't have been joined. In one of the other versions of the PR, I tried putting this all in the same uber function. However it does end up quite messy.

Although I've avoided the uber function approach here, I'm equally happy to go with either approach, there are pros and cons with both. The split version is easier to read and understand (especially for people unfamiliar with how it works). However the combined uber function may possibly be easier to maintain in the long run.

I think we can either change to uber function now, or leave as is and change with another PR if it proves to be a better approach. One other factor in favour of the split approach is that it might be easier initially in terms of dealing with ironing out major problems.

For the uber function to work it would probably either have to be littered with a lot of #ifdefs, or use templates, to ensure getting 2 different runtime versions of the function, versus lots of branching at runtime which would slow everything down unnecessarily.

Member

I think this is fine for now. I have a feeling that if we add proper state tracking we can simplify this and the main render function enough that the duplication won't be so bad.

// if there are more than this number of commands in the item, we
// don't allow joining (separate state changes, and hardware transform)
// This is set to quite a conservative (low) number until investigated properly.
const int MAX_JOIN_ITEM_COMMANDS = 16;
Member

I'm not sure I understand this properly. Is this to ensure that an object with many commands (like a tilemap) is assigned its own batch and avoids doing all the software transforms associated with batching?

Member Author

Yes, pretty much. Essentially the problem is this: Any item that contains commands that are default (i.e. not handled by software transform and the batching renderer) should not be joined.

In order to work this out, it does a lookahead through the commands, which could potentially be very expensive. As such it makes sense to put a limit on this to some small number, which will catch nearly all cases which need joining, but not be overly expensive in the case of items with large numbers of commands.
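A sketch of that bounded lookahead (command types and names here are illustrative, not the engine's): items with more commands than the limit are rejected without scanning, so the extra pass stays cheap for command-heavy items like tilemaps.

```cpp
#include <cassert>
#include <vector>

enum CommandType { CMD_RECT, CMD_POLYGON, CMD_OTHER };  // illustrative subset

// Join only items whose commands are all batchable rects, giving up past
// `max_commands` (cf. max_join_item_commands / MAX_JOIN_ITEM_COMMANDS).
// Returning false leaves the item unjoined, on hardware transform.
inline bool can_join_item(const std::vector<CommandType> &cmds, int max_commands) {
    if ((int)cmds.size() > max_commands)
        return false;                 // too many commands: don't even scan
    for (CommandType c : cmds)
        if (c != CMD_RECT)
            return false;             // a non-batchable (default) command breaks the join
    return true;
}
```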

There may be a far more efficient way of handling this. Ideally it would be nice in general to mark some extra data like this on the items themselves, and some dirty flags for when things change in the item. There's a lot of inefficiencies throughout to deal with the historical design.

In practice I don't think this will be a problem in 3.2 to do like this, but we could look at some better approaches in 4.0 where it might be possible to move some of this processing further up into the visual server.

Member

Right, that makes sense. Thank you.

We should consider making this a project setting as well.

Member Author

Can do, especially for testing, if you think this is a good idea. 👍

I do suspect that once we have figured out a good value, it won't need to be user editable... I'm conscious there's a problem with 'project settings creep' at the moment, so I'd hesitate to go overboard on this. Plus it won't be at all obvious to end users what this does. Maybe the project settings overchoice problem will be addressed in 4.0 with 'advanced settings'.

Another candidate for the project settings is the max vertex buffer size, which at the moment is hard coded in initialize().

Member

Ah, yes, please make the max vertex buffer size a project setting as well. I meant to comment on that but forgot.

return false;
}

void RasterizerCanvasGLES2::_canvas_render_item(Item *ci, RIState &ris) {
Member

It's concerning that this logic is now copied three times (when attempting to join, when rendering batched and here). I wonder if there is any way we could combine.

On the other hand, I do appreciate how simple it makes the implementation.

Member Author
@lawnjelly lawnjelly Mar 29, 2020

This section here is only repeated as a temporary measure for backward compatibility (i.e. turning batching off and finding regressions with the old renderer).

Once it has been fully tested and we are sure it works for everyone, there should be no need to retain the legacy codepath, and it could potentially be better to just remove it. As you say it is problematic trying to maintain 2 duplicate codepaths.

edit - Actually thinking about it, I'll double check it isn't possible to make a legacy path through the new renderer like I did with the old PR. I'm a bit wary of mixing up the new stuff with too much legacy code and making it harder to read etc.

Member

I'm okay with commenting this to say that it needs to be removed before being merged into 3.2. It's fine for regression testing.

Member Author

Ok I'll comment that. 👍

@akien-mga
Member

I can see how you left it open to extend this to other Command types fairly easily. I'm wondering if our best approach may be to make a "test-release" and link it in a blog post to get wider testing, even before merging the PR and then adding in some of the other command types in this PR. @akien-mga may have some thoughts on that. But I think that testing is necessary to make sure this works, and you've put in so much work so far, we really ought to get some wider testing before asking you to make the final leap and finish it up.

I think we should merge this in a dedicated branch (e.g. 3.2-gles2-batching), and I can use it to make official test builds to get wide testing before landing it in the stable branch.

If you agree, I can create that branch and we can just change the base branch for this PR on GitHub.

@lawnjelly
Member Author

I think we should merge this in a dedicated branch (e.g. 3.2-gles2-batching), and I can use it to make official test builds to get wide testing before landing it in the stable branch.

I think this sounds like a good plan of action. There are a lot of 2d guys waiting for this so we should get quite a few testers and can fix any bugs.

@akien-mga akien-mga changed the base branch from 3.2 to 3.2-gles2-batching March 29, 2020 09:41
@akien-mga
Member

Done, I changed the base branch to 3.2-gles2-batching.

I suggest we focus first on getting the current state merged, then I'll make a testing release, and we can work further on fixing regressions if there are any, and adding support for more primitives in subsequent PRs. WDYT?

@lawnjelly
Member Author

lawnjelly commented Mar 29, 2020

Great! 👍
Sounds good, I'll make the changes clayjohn has suggested, then it should be okay for some testing!

Ok changes are now made, barring the availability of the #define for regression testing, and my suggestion regarding the merging of the legacy codepath with the new codepath to save on duplication, as I'm not sure whether it is a good idea yet (will discuss with clayjohn). Regardless, it should now be okay for making builds for testing, barring more suggestions for changes.

@lawnjelly lawnjelly force-pushed the kessel32_1 branch 2 times, most recently from 6a3b4db to 5375abf Compare March 30, 2020 08:56
@lawnjelly
Member Author

I have now added some more project settings, as requested. I figured out that I should be creating them in RasterizerCanvasGLES2 initialize, rather than in visual server, because they only apply to GLES2.

As there are now several settings, instead of putting use_batching in the rendering/quality/2d section I added a new section, rendering/quality/2d_batching.

In that there are 3 settings:
use_batching
max_join_item_commands
colored_vertex_format_threshold

and to rendering/limits/buffers I have added 2d_batch_buffer_size.

I've given these all reasonable ranges. Let me know if there are any preferences for renaming these, or putting them in different places, this can easily be done. I've done some basic testing and they seem to work ok and are quite useful for diagnostics etc. We may decide to remove some in future.

I haven't yet written the class reference entries, will do that now, but committed first what I've added.

I've also added a couple of lines to support zero buffer length in the rasterizer array, as it will use that if batching is turned off, to save memory, and fixed an uninitialized value for total_quads that was causing problems with one of these parameter settings (I've written fuller constructors for BatchData and RenderItemState).

@akien-mga
Member

I have now added some more project settings, as requested. I figured out that I should be creating them in RasterizerCanvasGLES2 initialize, rather than in visual server, because they only apply to GLES2.

I think you still have to define them in a common code branch, otherwise they will not be visible if using the editor with GLES3 (even if configured to use GLES2 on some platforms), and they won't appear in the docs unless we specifically generate docs with godot --video-driver GLES2 --doctool ..

@lawnjelly
Member Author

lawnjelly commented Mar 30, 2020

I think you still have to define them in a common code branch, otherwise they will not be visible if using the editor with GLES3 (even if configured to use GLES2 on some platforms), and they won't appear in the docs unless we specifically generate docs with godot --video-driver GLES2 --doctool ..

Ah okay, I'll move them into visual_server.cpp. I only got that idea from searching in files for rendering/limits/buffers/canvas_polygon_buffer_size_kb and finding it was created in rasterizer_canvas_gles3.cpp, but maybe that is handled differently.

If they are globally visible, should I make it clear they are for GLES2 only somehow? For instance make the section gles2_2d_batching or something like that?

@akien-mga
Member

Ah okay, I'll move them into visual_server.cpp. I only got that idea from searching in files for rendering/limits/buffers/canvas_polygon_buffer_size_kb and finding it was created in rasterizer_canvas_gles3.cpp, but maybe that is handled differently.

Yeah that one seems to be defined both in rasterizer_canvas_gles2.cpp and rasterizer_canvas_gles3.cpp, so it works, though it's a bit inconsistent (it's also defined once with GLOBAL_DEF_RST and once with GLOBAL_DEF, so the behavior differs).

There's also rendering/quality/2d/gles2_use_nvidia_rect_flicker_workaround defined in main.cpp to be always defined even when using GLES3 (so that you don't miss it/lose the configured value when switching renderers).

It's all a bit brittle :)

@lawnjelly
Member Author

lawnjelly commented Mar 30, 2020

Ok, I've moved the project settings back to visual_server.cpp, added gles2_ to the corresponding names, and added entries in the class reference. Seems to be working fine, let me know any other changes. 👍

The max_join_commands setting is quite handy because if you set it to zero it will prevent joining items which will be useful for finding regressions.

allow_in_editor

I've also added a new temporary setting allow_in_editor, which requires a restart, and marked it as experimental in the class reference. It has occurred to me that we will need to test it in the editor at some point anyway, and having it in the experimental branch might be as good a time as any.

The allow_in_editor requires a restart, and isn't completely user friendly, as it only makes the decision whether to use batching at startup (i.e. if you change batching project settings, they have no effect in the editor until you restart), but it could still be very useful. Note that it stores the setting in the project, rather than as an editor setting.

If we are satisfied it works fine in the editor and is running okay, we can remove the option and always have it run.

flash_batching

As this is a testing branch I've added the facility to flash between the old renderer and the batched renderer as a project setting. This is helpful for finding regressions. It should not be needed after the initial testing phase.

@lawnjelly lawnjelly changed the title [WIP] GLES2 2d Batch rendering (across items) GLES2 2d Batch rendering (across items) Mar 30, 2020
@lawnjelly lawnjelly force-pushed the kessel32_1 branch 2 times, most recently from ada6032 to e9f2282 Compare March 30, 2020 14:38
@lawnjelly
Member Author

@akien-mga Just a note that the CI has been repeatedly failing today (last 3 commits for this PR) on appveyor from timeouts (taking longer than 60 mins). I don't know if it is to do with the extra python checks that you were adding, or if it is just appveyor being slow. Maybe there isn't normal caching because we are on a new branch. I haven't changed any includes that might slow down, the only major change is adding to the ProjectSettings.xml class reference...

Anyway although there is a cross it doesn't indicate there's a problem with the code (well unless there was a problem that would have happened after timeout lol). 😁

@leanmendoza

Hi! I'm trying to test this feature because I'm excited about it; I think my game is bottlenecked by exactly this. But on my first try I couldn't see any performance increase. I assumed it was because I had a lot of sprites, so I then tried an empty scene with only a TileMap (figure 1), but still nothing. I was testing the performance by reading the FPS with and without batch rendering.
Any suggestions for finding where the bottleneck is?

[screenshot: empty test scene with a single TileMap]

@akien-mga
Member

@lawnjelly Yeah I guess it no longer uses the 3.2 cache due to the base branch change. And a build from scratch on AppVeyor takes... 57 to 65 min I guess? So it often times out. I'll change the base branch again for now and only change it back when merging.

@Xrayez
Contributor

Xrayez commented Apr 15, 2020

Just tested with recent changes to this PR in a project of mine.

At first I thought there were no significant gains, but I couldn't believe that. Then I recalled that I've actually rewritten the way bezier textures are drawn in relation to z-index (when you have to draw an arm either in front of or behind a character, or even in front of other layers like weapons).

Unfortunately, the drawing API doesn't provide a way to simulate different z-index within the _draw method itself, which I'd imagine would look like this:

func _draw():
	draw_set_z_index(zi)
	# ... draw textures on different z_index within the same CanvasItem

Instead, I had to basically split up the same class and use different draw offsets/ranges:

[screenshots: the split-up classes using different draw offsets/ranges]

Which I suppose does "harmful" things performance-wise when this is split between canvas items.

This is where I had a drop in performance once again, but I have to say that in either case, GLES2 with batching performs better/on par with GLES3.

But good news is that, without the splitting hack (only drawing an entire arm within a single canvas item), I'm able to draw around 300 characters on screen before FPS goes below 60 (was only able to render around 80 characters on GLES3).

Having said that, I'm totally satisfied with the performance gains, and some optimizations are possible to achieve on the per-project level now (both rendering and GDScript bottlenecks), so thanks for the work you've put into this @lawnjelly, hopefully my testing efforts weren't in vain. 😁

@lawnjelly
Member Author

Yeah, I can't say without examining your new project what your new bottlenecks are. It could be

  • the way you are updating the visual server, which was the bottleneck in the previous version (taking > 10x longer than the actual rendering)
  • it could be a situation that isn't covered
  • or it could be that you are now producing something which is not possible to batch.

There are limitations as to what can be batched in order to preserve the behaviour of the previous renderer, and not have visual regressions.

The updating of the visual server did look very slow in your bezier approach and it is worth an investigation, but it is outside the scope of this PR.

I think the aim here should be to at least cover the most popular use-cases, and as many generalised use cases as possible. If you are using quite bespoke techniques you may have to experiment a bit to find out what works best.

Profiling is really useful for this kind of thing, especially if you can compile from source.

That reminds me I'm intending to add some diagnostics within the IDE to help with this kind of thing. I believe the 'monitor' tab of the Godot IDE didn't work at all for GLES2 2D before, but I'll try and get it to correctly monitor the number of drawcalls and whatever else is possible. I may be able to get it to debug print out some diagnostics for a frame, I'm not sure yet though, it would depend on some method to call from the IDE or script.

Unfortunately, the drawing API doesn't provide a way to simulate different z-index within the _draw method itself

Yes, the z index is set for the entire item (usually a node), and not the commands within it. However afaik if you order the drawing commands yourself, this order will be respected in the renderer. And it will probably be more efficient to boot (providing you do your sorting efficiently your side).

This adds 2 new values (items and draw calls) to the performance monitor in a '2d' section, rather than reusing the 3d values in the 'raster' section.

This makes it far easier to optimize games to minimize drawcalls.
@lawnjelly
Member Author

On suggestion from Akien I've now changed use_batching_in_editor to default to true.

I've also added (in the same commit) some very basic features to the IDE Debugger monitor:

  • 2d/items - number of canvas items, or joined canvas items using batching
  • 2d/draw_calls - number of draw calls

These are particularly useful for getting batching working as well as possible in your game, because you can see the drawcalls reduce dramatically the more batching you are achieving. It also operates with batching switched off.

The alternative is to reuse 'drawcalls' in the raster section (if adding new values might e.g. affect the efficiency of the debugger), but a new section is preferable if possible because it enables you to examine 2d and 3d in isolation in 3d games which have 2d user interfaces. It also may be possible to add more performance stats but this should do to start with.

I haven't implemented these for GLES3, but that should be fairly simple using the same mechanism.

I'd also appreciate anyone familiar with the Debugger monitor looking over how I added the values in this commit. I literally did a find_in_files for existing values until it worked, so I can't guarantee I've done everything correctly, particularly whether it might affect monitoring over the network.

Copy link
Member

@clayjohn clayjohn left a comment


@akien-mga We have been discussing the current state on IRC, I think this is ready to be added to the beta.

In most of our test situations batching is at least as fast as, if not faster than, the current code. However, batching can be up to 10% slower when the game is heavily CPU limited. In this situation batching still reduces the GPU load, but the overall FPS suffers a slight drop. lawnjelly and I feel this shouldn't hold back beta testing, but it does mean we will leave the batching toggle in ProjectSettings for now, as some users will benefit from turning batching off.

Added project setting to enable / disable print frame diagnostics every 10 seconds. This prints out a list of batches and info, which is useful to optimize games and identify performance problems.
@lawnjelly
Copy link
Member Author

Just as a final tweak, based on the conversation last night with @clayjohn, I've added a simple frame diagnostics mode. You can turn it on and off in the project settings rendering/gles2/debug/frame_diagnostics.

When set to on it prints every 10 seconds a log of the batches being used in the frame, in this example form:

canvas_begin FRAME 4716
items
	joined_item 2561 refs
		batch D 0
		batch R 2560 [0] MULTI
		batch R 12 [1] MULTI
	joined_item 1 refs
		batch D 1
	joined_item 1 refs
		batch D 0
		batch R 6 [0] MULTI

This shows how successful the item joining is, and whether each batch is default (D) or rect (R), then the number of commands in the batch, the batch texture ID in square brackets, and a MULTI symbol if there are multiple rects in the batch. The aim should be to get as many MULTIs as possible.

Projects that are not batching well will have a lot of single rect batches. Note that complex games will have quite long frame logs, in which case they might get truncated by the editor. You can either run godot from a console to see the full log, or you can increase the IDE buffer size by increasing network/limits/debugger_stdout/max_chars_per_second.

@akien-mga
Copy link
Member

Looks good to me.

Let's get this merged and tested in production 🤘

Thanks a ton for the tremendous work @lawnjelly, @clayjohn and all testers!

@akien-mga akien-mga merged commit 008e074 into godotengine:3.2 Apr 17, 2020
@akien-mga akien-mga removed the cherrypick:3.x Considered for cherry-picking into a future 3.x release label Apr 20, 2020
@Firty
Copy link

Firty commented Apr 20, 2020

WARNING: render_batches: NinePatch without texture not supported yet in GLES2 backend, skipping.
At: drivers/gles2/rasterizer_canvas_gles2.cpp:883
I am having this error when I try to create an AtlasTexture in a Custom Style of a Panel.

@Firty
Copy link

Firty commented Apr 20, 2020


Panel -> Custom Styles -> StyleBoxTexture -> Texture -> New AtlasTexture.

I tried to do the same thing on a Button and it gave the same result... There is a problem with the Custom Style: it is not accepting the AtlasTexture.

It seems to be an engine problem. But I worked around it in a strange way: instead of creating an AtlasTexture, I loaded the texture and just set the region. It worked, but it was confusing.

@lawnjelly
Copy link
Member Author

WARNING: render_batches: NinePatch without texture not supported yet in GLES2 backend, skipping.
At: drivers/gles2/rasterizer_canvas_gles2.cpp:883
I am having this error when I try to create an AtlasTexture in a Custom Style of a Panel.

Despite the function name, this area is just copy-pasted from the old renderer (to reuse the same code between the legacy and batched renderers). I suspect the old renderer would do the same (if you switch off use_batching); or compare with e.g. Godot 3.2.1.

@Two-Tone
Copy link

[light_scissor_area_threshold]
A value of 0.0 means any area saved will activate scissoring, 0.5 is half the screen etc. The actual pixel area is calculated regularly in case the user resizes the screen.

Is 0.5 half the screen or a quarter? If you're dividing the X and Y by 2 (or multiplying by 0.5), it's a quarter; you'd have to divide by ~1.4142 (or multiply by ~0.7071) to actually get half the screen resolution. It's an important distinction for the documentation.

@lawnjelly
Copy link
Member Author

lawnjelly commented May 26, 2020

[light_scissor_area_threshold]
A value of 0.0 means any area saved will activate scissoring, 0.5 is half the screen etc. The actual pixel area is calculated regularly in case the user resizes the screen.

Is 0.5 half the screen or a quarter? If you're dividing the X and Y by 2 (or multiplying by 0.5), it's a quarter; you'd have to divide by ~1.4142 (or multiply by ~0.7071) to actually get half the screen resolution. It's an important distinction for the documentation.

It's an area: width * height in all cases. Effectively a pixel count.

  • We have an AABB A for the light, and an AABB B for each item.
  • We can calculate C, the area of the intersection between the two.
  • D, the area potentially saved by scissoring, is B - C.
  • If the saved area D (as a proportion of total screen pixels) >= light_scissor_area_threshold, then we apply the scissor.

In this way we can selectively apply glScissor in cases which are likely to save a lot of fill rate. Or we can apply it in all cases (set the threshold to 0), or turn it off completely (1). On my PC there is not a massive cost to scissoring, and setting it to 0 is quite viable. However, it is possible that on some hardware there may be some cost to scissoring (at the very least, there are a couple of GL calls), such that the best value lies somewhere in between. I don't have the facilities to rigorously test this, so the decision is left to the user.
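To make the threshold test concrete, here is a minimal sketch of the decision in isolation. The names (`Rect`, `intersection_area`, `should_scissor`) are illustrative only, not the engine's actual identifiers:

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned rect in pixels (illustrative; not the engine's Rect2).
struct Rect {
    float x, y, w, h;
};

// C: area of the intersection between two AABBs (0 if disjoint).
static float intersection_area(const Rect &a, const Rect &b) {
    float x0 = std::max(a.x, b.x);
    float y0 = std::max(a.y, b.y);
    float x1 = std::min(a.x + a.w, b.x + b.w);
    float y1 = std::min(a.y + a.h, b.y + b.h);
    return std::max(0.0f, x1 - x0) * std::max(0.0f, y1 - y0);
}

// Apply glScissor only if the pixels saved (item area B minus the
// light/item intersection C), as a proportion of total screen pixels,
// meets the threshold. 0.0 = always scissor, 1.0 = never.
static bool should_scissor(const Rect &light, const Rect &item,
                           float screen_w, float screen_h,
                           float threshold) {
    float item_area = item.w * item.h;                          // B
    float saved = item_area - intersection_area(light, item);   // D = B - C
    return saved / (screen_w * screen_h) >= threshold;
}
```

On a 1000x1000 screen, a full-screen item only one quarter covered by a light saves 75% of the screen's pixels, so it passes a 0.5 threshold but fails 0.8.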

We may later be able to automate some of these decisions to an extent, but it is hard to get feedback on the effects of changing this kind of thing without running e.g. some diagnostic tests at startup. This is similar to the problems faced by a car ECU.

@oeleo1
Copy link

oeleo1 commented May 26, 2020

Looking at rasterizer_canvas_gles2.cpp, shouldn't the orphaning code in _batch_upload_buffers()

// orphan the old (for now)
glBufferData(GL_ARRAY_BUFFER, 0, 0, GL_DYNAMIC_DRAW);

read like the orphaning done in the initialization?

glBufferData(GL_ARRAY_BUFFER, bdata.vertex_buffer_size_bytes, NULL, GL_DYNAMIC_DRAW);

I believe that the zero size results in a no-op. When we introduced orphaning in GLES3, which gave the speed boost on mobile, the rationale was to orphan the whole buffer.

@lawnjelly
Copy link
Member Author

lawnjelly commented May 26, 2020

This is a little bit of a gray area, as far as I am concerned. Generally I go with manually using a circular set of vertex buffers to make it explicit, but in this case we are reusing the same VB multiple times per frame (as much as 1000+ times) so orphaning seemed easier.

https://www.khronos.org/opengl/wiki/Buffer_Object_Streaming

It seems the driver is able to do whatever it wants, so we are having to predict what it might do to a certain extent. Using the same size as before makes some sense if you are going to be reusing a similar-sized buffer (especially if once per frame). However, in this case the buffer size will be varying each time, so we don't gain much by doing that. In fact, it might be detrimental, because the driver might decide to reserve some memory just in case you glBufferSubData to it.

I think I did try both and decided this made the most sense, but am willing to change it if we discover it causes problems (it has seemed fine so far). 👍

If you think about it: if we were reusing the same vertex buffer 1000 times in a frame, and this was a 1 MB VB, and the driver actually kept every one of these allocations alive, it might be churning 1 GB of memory per frame and using more than 3 GB in storage. Which would be extremely bad news.

In practice I'm hoping the NULL will just say 'I don't care about the last version: use it, then delete'. Which is what we want in this case. It may work just as well to remove the NULL call (I'll try and test this out). 😁

@oeleo1
Copy link

oeleo1 commented May 26, 2020

It will work but that's not the point. The point of orphaning is to let the driver know that we're not interested in the previously allocated and written-to buffer, and that we're done with it. The first bind/unbind cycle says "go draw this data". The second time we do that, we may overlap with what is currently in progress, which is a nasty constraint for the driver and usually results in a slowdown (which is what I saw last time I dealt with that stuff). Unless we orphan the whole buffer, after the first bind/unbind cycle the driver assumes that we may rebind and write to the buffer again. But if we say bind(size, null), there is no other interpretation than "we're done with the whole previous buffer, we need a new one". If we say bind(0, null), this means nothing significant, as the buffer may be written to again and rebound. The latter is basically like a malloc(0): no harm to the heap, no effect.

@lawnjelly
Copy link
Member Author

It will work but that's not the point. The point of orphaning is to let the driver know that we're not interested in the previously allocated and written-to buffer, and that we're done with it. The first bind/unbind cycle says "go draw this data". The second time we do that, we may overlap with what is currently in progress, which is a nasty constraint for the driver and usually results in a slowdown (which is what I saw last time I dealt with that stuff). Unless we orphan the whole buffer, after the first bind/unbind cycle the driver assumes that we may rebind and write to the buffer again. But if we say bind(size, null), there is no other interpretation than "we're done with the whole previous buffer, we need a new one". If we say bind(0, null), this means nothing significant, as the buffer may be written to again and rebound. The latter is basically like a malloc(0): no harm to the heap, no effect.

Afaik, glBufferData should always release the reference to the previous buffer and create a fresh allocation (presumably it leaves a reference count on the previous store, which reaches zero once it has been rendered and is no longer owned). That's the difference between glBufferData and glBufferSubData: glBufferSubData is for updating some part of the buffer, while glBufferData says 'throw away what you've got, give us a new one, and upload this new data'.

Orphaning is probably more relevant when you want to use glBufferSubData in combination. If you don't do it in that scenario, presumably it has to do a load of housekeeping to keep the rest of the buffer the same (and may cause pipeline stalls). We are not doing that here, we are simply throwing out the whole of the old buffer.

https://www.khronos.org/registry/OpenGL-Refpages/es3.0/html/glBufferData.xhtml

glBufferData creates a new data store for the buffer object currently bound to target. Any pre-existing data store is deleted. The new data store is created with the specified size in bytes and usage. If data is not NULL, the data store is initialized with data from this pointer.
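Side by side, the upload patterns being discussed look roughly like this (a non-runnable fragment; `frame_bytes`, `buffer_size_bytes` and `vertices` are assumed to come from surrounding code, and the VBO is assumed already bound):

```cpp
// (a) Orphan-and-replace: glBufferData deletes any pre-existing data
// store for the bound buffer and creates a fresh one, so the driver can
// finish rendering from the old store without a pipeline stall.
glBufferData(GL_ARRAY_BUFFER, frame_bytes, vertices, GL_DYNAMIC_DRAW);

// (b) Explicit full-size orphan followed by a partial update: allocate
// a full-sized store with NULL data, then fill a sub-range. This is the
// pattern that matters when combining orphaning with glBufferSubData.
glBufferData(GL_ARRAY_BUFFER, buffer_size_bytes, NULL, GL_DYNAMIC_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, frame_bytes, vertices);

// (c) Zero-size call: creates an empty data store, which is effectively
// a no-op as far as the next upload is concerned.
glBufferData(GL_ARRAY_BUFFER, 0, NULL, GL_DYNAMIC_DRAW);
```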

@oeleo1
Copy link

oeleo1 commented May 26, 2020

Yes, you are right and I stand corrected. And that glBufferData(0, 0) seems spurious at best. The driver can't do anything meaningful with it as it's not even an early indication of whatever follows, which is an overwrite.

@lawnjelly
Copy link
Member Author

lawnjelly commented May 27, 2020

And that glBufferData(0, 0) seems spurious at best. The driver can't do anything meaningful with it as it's not even an early indication of whatever follows, which is an overwrite.

You are very probably right, I put it in during development as a failsafe (hence the 'for now' comment). It can hopefully be removed without ill effect. 👍

In fact this may have caused the confusion; it is probably a leftover from testing. I was regularly removing sections of code, and I think the failsafe was there to ensure the VB was orphaned at that point if the second glBufferData call was commented out (I forget the exact reason). In the same way there are still a few state changes that may be redundant; these can be cleaned up at a later stage.

@oeleo1
Copy link

oeleo1 commented Sep 7, 2020

@lawnjelly What's your take on batching support for CPUParticles2D? Is that potentially within scope, or totally out of sight, assuming you can weigh or "feel" the complexity of it.

@lawnjelly
Copy link
Member Author

lawnjelly commented Sep 7, 2020

@lawnjelly What's your take on batching support for CPUParticles2D? Is that potentially within scope, or totally out of sight, assuming you can weigh or "feel" the complexity of it.

Looking at GLES2 at least, the particles are (from memory) rendered with a multimesh, and it looks very inefficient, with setup and draw calls per particle. It could probably be done as a single dynamic mesh much more easily, independently of the rest of the batching. However, there may be lots of gotchas relating to the existing particle design; it could either be easy or a can of worms. I haven't looked at the particles in GLES3, they may be better already.

I don't really have time to look at it myself (GLES3 batching and helping write new renderers for 4.x have to take priority). If someone else wants to have a go, go for it, but bear in mind there may be some changes to the rasterizer layout for GLES3 batching. And all this will change for 4.x anyway.

I still have to figure out some stuff about the shader binding versions in GLES3 too that has cropped up with a bug with custom shaders and the GLES3 nvidia workaround that may be relevant to GLES3 batching (the binding and states have to be done slightly differently to GLES2).

@oeleo1
Copy link

oeleo1 commented Sep 7, 2020

Thanks. I was asking about CPU particles in GLES2 because GLES3 has GPU particles, while GLES2 is "limited" to CPU particles which are, as you say, probably suboptimal for the time being.

Since the game we're working on is a 2D game using GLES2, it has a bunch of CPU particles and most primitives are left out of batching. Enabling batching produces worse performance than without it, so vanilla GLES2 is best for us. GLES3 also performs worse than GLES2 despite the GPU particles: much more stutter and performance variance. Profiling shows the usual GPU bottlenecks, most probably at the CPU/GPU interface level, since the GPU load itself is relatively low. That said, it looks to me like there are two ways of addressing this: a) lower the load on the CPU/GPU sync points, i.e. batching; b) move most of those CPU particles to shaders, i.e. back to assembly language. There may be other solutions as well, but none off the top of my head.

In short, great fun ahead.

PS: I have looked at the CPU Particles code. It is indeed handled with a multimesh, which already looks quite optimized to me; I don't see any obvious improvement. It is the same code for GLES2 and GLES3, with the sole difference being that GLES3 uses instanced rendering, which isn't supported in GLES2. The inner particle loop contains only the per-particle transforms, and the rest of the setup is externalized. So I am not really sure what "single dynamic mesh" being probably much easier means (cf. @lawnjelly), since the code looks state of the art for what GLES2 allows when rendering multiple objects.
