GLES2 2d Batch rendering #35957

lawnjelly · 2020-02-06T16:46:06Z

2d rendering is currently bottlenecked by drawing rects one at a time, limiting OpenGL efficiency. This PR batches rects and renders in fewer drawcalls, resulting in significant performance improvements. This also speeds up text rendering.

The code dynamically chooses between a vertex format with and without color, depending on the input data for a frame, in order to optimize throughput and maximize batch size.
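As a rough illustration of the format choice described above, here is a minimal sketch. The struct names, the layouts, and the "switch to colored only if the colors differ" rule are all assumptions made for the example; they are not the heuristic used in the PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-rect input: position/size plus a modulate color.
struct RectCommand {
    float x, y, w, h;
    float r, g, b, a;
};

// Hypothetical vertex layouts. The plain one is half the size, so twice as
// many quads fit into the same number of bytes.
struct VertexPlain   { float x, y, u, v; };
struct VertexColored { float x, y, u, v; float r, g, b, a; };

enum class VertexFormat { Plain, Colored };

inline bool same_color(const RectCommand &p, const RectCommand &q) {
    return p.r == q.r && p.g == q.g && p.b == q.b && p.a == q.a;
}

// Assumed heuristic: if every rect in the frame shares one color, that color
// can be passed once (e.g. as a uniform) and the smaller format is enough.
inline VertexFormat choose_format(const std::vector<RectCommand> &rects) {
    for (std::size_t i = 1; i < rects.size(); ++i) {
        if (!same_color(rects[0], rects[i])) {
            return VertexFormat::Colored;
        }
    }
    return VertexFormat::Plain;
}

inline std::size_t vertex_size(VertexFormat f) {
    return f == VertexFormat::Colored ? sizeof(VertexColored) : sizeof(VertexPlain);
}
```

The point of the smaller format is simply that fewer bytes per vertex have to be written and uploaded when per-vertex color carries no information.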

Notes

Although it only batches rects to start with, the same framework can be used to batch the other primitives.
Can only batch within _canvas_item_render_commands. Notably this covers tilemaps, text, the _draw API but NOT separate sprites (we evaluated this and it would require far more complex changes).
There is a high speed simple growable pod dynamic array template included. For now it is included in the drivers/gles2 directory.

I hesitate to put it in core as it is really only intended for these rasterizers and is not general purpose. It will be very easy to replace when we have a high speed, non-COW vector in the future.

I'm using indexed primitives so that only 4 verts are required per quad. Indices in GLES2 however can only be 16 bit, so they are limited to addressing 65535 verts (from the origin). For this reason there is a maximum size for a vertex buffer, 1 quad less than would fit into 65535. When this occurs, the routine ends the batch, draws the current batches, resets and starts filling batches again where it left off.
Swapping between colored / non-colored vertex format is currently done based on a simple heuristic at runtime. It is also possible to use hysteresis, or to allow the user to switch manually. Neither has been implemented yet, and they may not be required.
Now tested on Linux, Windows 10, Android, WebGL. The more testing the better, and should anyone be able to try it on their devices / more platforms that would be welcome.
Because there is still the possibility for bugs, I have added (temporarily) a Project setting : rendering/quality/2d/use_batching which defaults to true, but can be turned off if there are problems. The legacy non-batched method is still included.
Just in case it affects people being able to even start up the editor in 2d, I have disabled the batching in the IDE until we have some feedback. If all goes well it could be enabled in the IDE by default with another PR.
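To make the indexed-quad note above more concrete, here is a minimal sketch of filling a batch with 4 vertices and 6 16-bit indices per quad, and flushing when the buffer is full. `QuadBatcher`, `kMaxQuads` and the counting `flush()` are hypothetical names for illustration, and the limit is one reading of the description in the notes. The real code issues GL calls where this sketch only counts draw calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, u, v; };
struct Rect { float x, y, w, h; };

// Assumed limit: the largest whole number of quads whose vertices stay
// addressable with 16-bit indices (65535 / 4 = 16383 quads).
constexpr std::size_t kMaxVerts = 65535;
constexpr std::size_t kMaxQuads = kMaxVerts / 4;

class QuadBatcher {
public:
    // Appends one quad. If the buffer is full, the pending quads are drawn
    // first and filling resumes from an empty buffer.
    void add_rect(const Rect &r) {
        if (quad_count() == kMaxQuads) {
            flush();
        }
        const std::uint16_t base = static_cast<std::uint16_t>(verts_.size());
        verts_.push_back({r.x, r.y, 0.0f, 0.0f});
        verts_.push_back({r.x + r.w, r.y, 1.0f, 0.0f});
        verts_.push_back({r.x + r.w, r.y + r.h, 1.0f, 1.0f});
        verts_.push_back({r.x, r.y + r.h, 0.0f, 1.0f});
        // Two triangles sharing the diagonal: only 4 vertices per quad.
        const std::uint16_t quad[6] = {0, 1, 2, 0, 2, 3};
        for (std::uint16_t i : quad) {
            indices_.push_back(static_cast<std::uint16_t>(base + i));
        }
    }

    // Stand-in for uploading the buffers and issuing one indexed draw call.
    void flush() {
        if (verts_.empty()) {
            return;
        }
        ++draw_calls_;
        verts_.clear();
        indices_.clear();
    }

    std::size_t quad_count() const { return verts_.size() / 4; }
    std::size_t index_count() const { return indices_.size(); }
    std::size_t draw_calls() const { return draw_calls_; }

private:
    std::vector<Vertex> verts_;
    std::vector<std::uint16_t> indices_;
    std::size_t draw_calls_ = 0;
};
```

With this layout a non-indexed triangle list would need 6 vertices per quad, so indexing saves a third of the vertex data at the cost of the 16-bit addressing limit.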

What kind of speed increases can I expect?

It depends on a number of factors, including what you are rendering and what the hardware is. Benefit roughly scales with the number of quads. There will be less benefit in games that are fill rate limited already. In my tests with 2d games / tests, frame rate increases of 2-10x were typical. I was getting larger speed increases on desktop than on my tablet.

Note there are some cases (notably bunnymark v2) which can't be batched currently, see below. Any more figures doing the comparison on different hardware would be welcome.

Increases will be higher in a release build. With some more tweaks I have now clocked a 58x increase in frame rate in a vertex throughput limited scene ( #19917 ).

Geequlim · 2020-02-06T17:07:52Z

Does it increase the bunnymark score?

lawnjelly · 2020-02-06T17:42:24Z

> Does it increase the bunnymark score?

Bunnymark V2 GDscript (uses Sprite)

Old 1870, Batched 1870

No it isn't faster, you may have found a rare case it can't accelerate. 😢

At a guess I'd think each bunny is causing a separate call to _canvas_item_render_commands. I can only batch within that function, if it is called with e.g. 1 bunny each time, there is no batching. Also in the single case, it is currently using the same codepath as nvidia workaround, which is slower. I can easily put in a special case for this kind of thing, so it is never slower than the old version (edit DONE).

I did have a look initially at batching over multiple _canvas_item_render_commands calls, however because of the historical design of the engine, it looked quite a bit more complex. I think the best I can do in that case is to just fall back to the legacy method, it will probably be quicker for a single quad (because the batching uses index buffers etc).

Bunnymark V1 DrawTex GDScript (uses draw_texture)

Old 2060, Batched 3238

I suspect the low increase is because something else is bottlenecking it, I haven't looked in more detail. Each bunny is a separate node running gdscript so that's going to be pretty slow compared to things like tilemaps.

Bunnymark V1Sprites
Wouldn't converge to a solution at all in old mode? I think there's a problem with the benchmark, it was hovering around 56/57 fps and not giving any figures.

Overall, bunnymark might be useful for comparing languages but it's not a very useful real world test for rendering. Usually it can be a good idea to test each aspect in isolation. And within rendering there are different aspects, you can be limited by vertex throughput, or fill rate etc.

clayjohn · 2020-02-06T17:53:18Z

On a heavy tilemap (cell size 8) on my low-end hardware (Celeron N2840), I get about a 10x speed increase. Frame times of 5-33 ms are reduced to 0.5-2.5 ms.

The speed increase is obvious. I will be reviewing the code this evening in more detail. But I think we should also try to get lots of testers on different hardware. The other batching PR ended up having problems with specific hardware. So testing on a wider base is likely needed here as well.

lawnjelly · 2020-02-06T18:04:32Z

Great! 😄 It seems I have a bunch of warnings as errors to sort out for some platforms, that may have to wait until after the weekend. Still it can be tested on desktop in the meantime. 👍

Also note to any reviewers, don't get carried away yet on the details as I'll still be doing some more work on it after the weekend (and there are still various variable names to change to godot standard).

clayjohn · 2020-02-06T18:15:29Z

Regarding defining _use_2d_batching: I think it should be moved to VisualServer.cpp. There is no need to have it defined in Engine. Here is an example of a typical rendering quality setting:

Defined in VisualServer.cpp

godot/servers/visual_server.cpp

Line 2398 in 0812f99

GLOBAL_DEF("rendering/quality/reflections/high_quality_ggx", true);

Used in rasterizer:

godot/drivers/gles3/rasterizer_storage_gles3.cpp

Line 8350 in 0812f99

bool ggx_hq = GLOBAL_GET("rendering/quality/reflections/high_quality_ggx");

fire · 2020-02-06T18:52:43Z

There are some cicd errors. https://travis-ci.org/godotengine/godot/builds/646965929?utm_source=github_status&utm_medium=notification

lawnjelly · 2020-02-06T19:09:46Z

> There are some cicd errors. https://travis-ci.org/godotengine/godot/builds/646965929?utm_source=github_status&utm_medium=notification

Yup, I will have to deal with them after the weekend unless I get some time tomorrow (am too tired tonight, it is bedtime here! 😃). I think they are warnings as errors about using godot things like Color in pod structures, nothing serious. It's actually a little more complex than you might think, because the vertex format has to be 32 bit, and Vector2 etc. might be using real, which could change to double in future, so I'd prefer to make everything future proof.
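The 32-bit concern can be sketched roughly as follows. `real_t` is set to `double` here to simulate the possible future change, and `EngineVector2`, `BatchVector2` and `BatchVertex` are hypothetical stand-ins for this example, not the types used in the PR.

```cpp
#include <type_traits>

// Stand-in for the engine scalar; set to double to simulate a future change.
using real_t = double;

// Stand-in for an engine-side vector built on the configurable scalar.
struct EngineVector2 { real_t x, y; };

// Plain struct that is always 32-bit float, whatever real_t is.
struct BatchVector2 {
    float x, y;
    void set(const EngineVector2 &v) {
        x = static_cast<float>(v.x);
        y = static_cast<float>(v.y);
    }
};

struct BatchVertex {
    BatchVector2 pos;
    BatchVector2 uv;
};

// Compile-time checks that the layout handed to the GPU cannot drift.
static_assert(sizeof(float) == 4, "vertex attributes are assumed to be 32-bit floats");
static_assert(sizeof(BatchVertex) == 4 * sizeof(float), "unexpected padding in BatchVertex");
static_assert(std::is_trivially_copyable<BatchVertex>::value &&
              std::is_standard_layout<BatchVertex>::value,
              "BatchVertex must stay a plain-old-data type");
```

Keeping a dedicated float-only struct means the conversion happens once when the batch is filled, and the pod array never holds engine types with constructors.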

novhack · 2020-02-08T01:12:05Z

This one is a game changer. My project jumped from less than 200 to 850 FPS. It's a powerful desktop so I guess a mobile experience would not be nice without this.

clayjohn

This is a start. Mostly just naming convention nit picks. I will do a more in depth review when I have time tonight.

Looking great so far. :)

drivers/gles2/rasterizer_array.h

drivers/gles2/rasterizer_canvas_gles2.h