
GLES2 2d Batch rendering (across items) #37349

Merged Apr 17, 2020 (7 commits)

Conversation

lawnjelly
Member

@lawnjelly lawnjelly commented Mar 27, 2020

2d rendering is currently bottlenecked by drawing primitives one at a time, limiting OpenGL efficiency. This PR batches primitives and renders in fewer drawcalls, resulting in significant performance improvements. This also speeds up text rendering.

This PR batches across canvas items as well as within items.

The code dynamically chooses between a vertex format with and without color, depending on the input data for a frame, in order to optimize throughput and maximize batch size.
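As a rough sketch of that decision (the struct and function names here are illustrative, not the engine's actual types), the idea is that a frame with few per-vertex color changes can use a smaller vertex, so more primitives fit in each batch:

```cpp
#include <cassert>

// Illustrative vertex layouts (not the engine's actual structs).
struct BatchVertex {          // position + UV only; color comes from a per-batch constant
    float pos[2];
    float uv[2];
};
struct BatchVertexColored {   // adds per-vertex color, doubling the vertex size
    float pos[2];
    float uv[2];
    float col[4];
};

// Pick the larger format only when the fraction of color changes in the
// frame's input data exceeds a threshold (cf. colored_vertex_format_threshold).
inline bool use_colored_format(int color_changes, int total_prims, float threshold) {
    if (total_prims == 0)
        return false;
    return (float)color_changes / (float)total_prims > threshold;
}
```

With the smaller format, twice as many vertices fit in the same buffer, which is exactly why maximizing batch size favors dropping per-vertex color when possible.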

Test Builds

Many thanks to Akien for making test builds (more info further down thread), please try them out and let us know how you get on with your game / test projects. Let us know especially any visual bugs so we can fix them! 👍
https://downloads.tuxfamily.org/godotengine/testing/3.2.2-gles2_batching_build2/

Notes

  • This is an enhanced version of my original PR GLES2 2d Batch rendering #35957, which could not batch across multiple items
  • This currently batches only rects
  • The batching settings are in ProjectSettings under rendering/gles2/, and you can toggle between the new and old behaviour using the use_batching boolean
  • There is an experimental option use_batching_in_editor. Only use this once you are sure batching is working fine in your projects. At this point it is more for finding regressions than for editor performance.

History and Explanation

I rewrote the 2d renderer several times (this is probably the 4th or so version, and about the 7th iteration). An earlier version was in fact considerably more advanced, however a massive problem in practice has been regressions between the new renderer behaviour and the old renderer. To that end I threw out a lot of earlier work and considerably simplified the problem in order to retain as much of the existing logic as possible and maintain compatibility.

In the original PR it proved simple enough to batch commands within a render item. This worked for many cases such as tilemaps and text. However there was keenness to be able to batch different items together where possible, such as in bunnymark, with large numbers of independently moving sprites.

  • The simple approach taken here was to add an extra pass to the items, to identify any that could be joined with the previous item, to form joined_items. This entails duplicating a simplified version of the state logic.
  • The next stage was to modify the existing renderer to deal with joined_items rather than single items.
  • Finally the existing code in the old PR was used to batch the commands in these joined items.

An important consideration was that the functions had to be callable multiple times rather than just once, because they had to deal with the situation where the vertex buffer fills, is flushed, and then refills.
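That requirement can be modelled with a small standalone sketch (a simplified stand-in, not the engine code): the fill routine is safe to re-enter, flushing whenever the buffer fills and carrying on from where it left off.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of a fixed-size vertex buffer that must flush mid-item.
struct Batcher {
    size_t capacity;    // max vertices per flush
    size_t used = 0;
    int flushes = 0;    // stands in for issuing a draw call

    void flush() {
        if (used == 0)
            return;
        ++flushes;      // real code: upload the buffer and issue the draw call
        used = 0;
    }
    // Safe to call repeatedly; a single large item may span several flushes,
    // mirroring the fill / flush / refill design described above.
    void add_vertices(size_t verts) {
        while (verts > 0) {
            if (used == capacity)
                flush();
            size_t room = capacity - used;
            size_t take = verts < room ? verts : room;
            used += take;
            verts -= take;
        }
    }
};
```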

In addition, in order to batch between items, where items are joined it is necessary to use software transform. However, where items cannot be joined, or where they contain a large number of commands, the items are not joined and hardware transform is used. This retains maximum speed for tilemaps and large areas of text.
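The software transform amounts to baking each item's 2D transform into its vertices on the CPU, so many joined items can share a single (identity) hardware transform; a minimal sketch (Xform2 is an illustrative stand-in for Godot's Transform2D):

```cpp
#include <cassert>

// Minimal 2D affine transform, standing in for Godot's Transform2D.
// Maps (x, y) to (a*x + c*y + tx, b*x + d*y + ty).
struct Xform2 {
    float a, b, c, d, tx, ty;
    void xform(float &x, float &y) const {
        float nx = a * x + c * y + tx;
        float ny = b * x + d * y + ty;
        x = nx;
        y = ny;
    }
};

// Software transform: bake the per-item transform into the 4 corners of a
// rect (8 floats, x/y interleaved) before they enter the batch buffer.
inline void bake_rect(const Xform2 &t, float verts[8]) {
    for (int i = 0; i < 4; ++i)
        t.xform(verts[i * 2], verts[i * 2 + 1]);
}
```

The trade-off is CPU cost per vertex, which is why items with many commands (tilemaps, large text) are better left unjoined on hardware transform.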

Future

We can add batching for other primitives using the same method. Note that this PR is made against Godot 3.2 but is also designed to eventually be used in 4.0.

Performance increases - what to expect

Most importantly, the frame rate increases you will get will depend heavily on what the bottlenecks in your game are. It MAY NOT speed up all games, especially those that are running fast already - some will see great differences, others not so much.

  • Generally, the more rects you have on screen, the higher the density of your tilemaps, the more text, the more you are likely to benefit.
  • If you are only drawing a small number of rects you are unlikely to be bottlenecked by drawcalls, and will see less benefit.
  • If you are bottlenecked by fillrate (e.g. a smaller screen results in higher frame rate) you will see less benefit.
  • You will typically see a relatively larger performance increase in release builds than in debug builds, because the batching is CPU limited.

The performance increases should be on a par with the old PR, with the addition of increases due to batching across items.

However, the additional benefit of batching across items, versus just batching within items, may not be that large in real world use cases (as opposed to test cases), as it is less likely to be the bottleneck.

How bottlenecks work

If the game is spending 1% of its time making drawcalls, even if we speed up the batching by 100%, there is only 1% of gain available. If on the other hand your game is spending 80% of its time in drawcalls, you can see very large performance increases. You can either use profiling to examine this (a large amount of time spent in the graphics driver normally indicates API bottlenecks), or simply try out the build.
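This is just Amdahl's law: if a fraction p of frame time goes to draw calls and batching accelerates that fraction by a factor s, the overall frame speedup is 1 / ((1 - p) + p / s). A tiny helper makes the two cases above concrete:

```cpp
#include <cassert>

// Overall frame speedup when a fraction p of frame time is spent in draw
// calls and that fraction is accelerated by factor s (Amdahl's law).
inline double frame_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.01 the ceiling is about 1%, no matter how fast the batching is; with p = 0.8 even a 10x drawcall speedup yields roughly a 3.5x overall gain.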

A Quick Guide

The settings are in ProjectSettings->Rendering->Gles2

  • use_batching - turns on and off (you can easily compare frame rates with and without)
  • flash_batching - if batching is on, this alternates between the legacy renderer and batching on alternate frames - this can help find visual differences, which you should report here so I can fix them
  • use_batching_in_editor - this will restart the editor and use batching to render the Godot IDE. Providing running projects look ok with batching, I'm 99.9% sure running in the editor should be fine too. Please report any visual differences in behaviour in the editor so I can fix them.
  • max_join_item_commands - the default should work fine, but it is possible that different values may give you slightly better results in your game. A value of 0 turns joining items off, which can be used to help diagnose regressions (it will usually slightly decrease batching gains though).
  • colored_vertex_format_threshold - try varying this to see how it affects the performance in your game, the best value may be game / hardware dependent, we are not sure yet.
  • light_scissor_area_threshold - if you are using lights in your game, be sure to test varying this, even if you do not benefit from batching. This can significantly reduce GPU fillrate requirements, which can be a bottleneck in some games.
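As a rough model of the scissor idea (hypothetical names, not the engine's code): when a light's intersection with an item covers only a small fraction of the item, the light pass can be scissored to that intersection, cutting fillrate.

```cpp
#include <algorithm>
#include <cassert>

struct Rect { float x, y, w, h; };   // illustrative stand-in for Rect2

inline Rect intersect(const Rect &a, const Rect &b) {
    float x0 = std::max(a.x, b.x), y0 = std::max(a.y, b.y);
    float x1 = std::min(a.x + a.w, b.x + b.w);
    float y1 = std::min(a.y + a.h, b.y + b.h);
    return { x0, y0, std::max(0.f, x1 - x0), std::max(0.f, y1 - y0) };
}

// Scissor the light pass when the light covers less than `threshold` of the
// item's area (cf. light_scissor_area_threshold).
inline bool should_scissor(const Rect &item, const Rect &light, float threshold) {
    float item_area = item.w * item.h;
    if (item_area <= 0.f)
        return false;
    Rect i = intersect(item, light);
    return (i.w * i.h) / item_area < threshold;
}
```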

@clayjohn
Member

clayjohn commented Mar 27, 2020

Just started taking a look now. Ran it through my previous test project and I am still getting ~5x speedup (heavy text with interleaved sprites and lights). I will review the code this afternoon and we can discuss what to do next tomorrow.

edit: I wasn't able to spend as much time reviewing as I had hoped, but so far I am liking this implementation. It feels closer to the original ItemCommands batching PR than your other iterations, i.e. a little more elegant. Testing it out on my device I also see a substantial speed increase, even compared to the previous PR for batching ItemCommands.

@clayjohn clayjohn added this to the 3.2 milestone Mar 27, 2020
@lawnjelly
Member Author

lawnjelly commented Mar 28, 2020

heavy text with interleaved sprites and lights

The lights result is interesting and something I briefly discussed with reduz. You'll notice in the try_join_item logic there is a large commented section with light logic; at the moment it simply breaks the between-item batching when lights are used.

It shouldn't need to in theory, but in practice if there are overlaps between geometry, I found the result of joined items is different when lit, because of multiple passes:

i.e. (a * light) + (b * light) is NOT the same as (a + b) * light
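A toy numeric illustration of this (not the engine's actual blend pipeline, just the shape of the problem): if overlapping geometry saturates in the framebuffer before the light modulates it, the joined result differs from accumulating each item's lit contribution separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Framebuffer channel values saturate at 1.0.
inline float sat(float v) { return std::min(v, 1.0f); }

// Separate items: each contribution is modulated by the light, then accumulated.
inline float lit_separately(float a, float b, float light) {
    return sat(sat(a * light) + sat(b * light));
}
// Joined items: overlapping geometry saturates in the base pass first,
// then the light modulates the already-clamped result.
inline float lit_joined(float a, float b, float light) {
    return sat(sat(a + b) * light);
}
```

With a = b = 0.6 and light = 0.5, the separate result is 0.6 but the joined result is only 0.5 because the base pass clamped 1.2 down to 1.0 first; when nothing saturates, the two agree, which is why non-overlapping geometry is safe to join.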

Tilemaps etc. do still benefit even with lights though. In fact there's another small optimization I haven't put in yet which should help on the lights front: provided the vertex buffer isn't full, it can be reused for each light. At the moment it is refilled for the main pass and for each light. The current method should still be miles faster than the old method though.

There's quite a bit of low hanging fruit in general still available for optimization, and indeed adding other primitives, but I'd ideally prefer to get this more straightforward approach in use, to work out any regressions and get some feedback, before adding more, because the more code, the more difficult to debug. It's a bit of a balancing act really between run time speed and code complexity.

I forgot to mention in terms of review, I've moved a lot of the old code into RasterizerCanvasBase, and just left the higher level code and new stuff in RasterizerCanvas. As such you should be pretty much able to skim read over RasterizerCanvasBase as it is unchanged.

@lawnjelly
Member Author

OMG, that was traumatic trying to test on Android. An entire day spent reinstalling android studio multiple times to get it to update so it would compile the latest version of the android templates. Then the old S3TC import bug on GLES2 #28308 (must get a fix for that merged) compounded by a new bug where Godot seems to keep trying to re-import textures because they aren't there, if you remove S3TC from the import list in ProjectSettings. 😧

Anyway upshot is that the PR seems to work fine for me both on the Android emulator and on my ancient Google Nexus 7 2013. 😁 Would appreciate more testing on devices / platforms though.

I suspect it will be ok everywhere because it doesn't do anything too funky API wise, and early ARM Android devices are the most likely to show up any problems because they will crash without properly aligned structures.

@clayjohn clayjohn left a comment
Member

Looks great! I left a few nitpicky comments (and I will do another round of naming nitpicks before this is merged).

I left a few questions where I couldn't figure something out. But overall I didn't have much trouble understanding the logic behind this and you seem to have done a great job of maintaining the current logic of the renderer.

I can see how you left it open to extend this to other Command types fairly easily. I'm wondering if our best approach may be to make a "test-release" and link it in a blog post to get wider testing, even before merging the PR and then adding in some of the other command types in this PR. @akien-mga may have some thoughts on that. But I think that testing is necessary to make sure this works, and you've put in so much work so far, we really ought to get some wider testing before asking you to make the final leap and finish it up.

drivers/gles2/rasterizer_canvas_gles2.h (3 resolved review threads)
// between the non-batched renderer and the batched renderer,
// in order to find regressions.
// This should not be defined except during development.
//#define KESSEL_FLASH
Member

Will have to get rid of this for the final mergeable PR

drivers/gles2/rasterizer_canvas_gles2.cpp (3 resolved review threads)
// state.canvas_shader.bind();

// ris.last_blend_mode = -1;
// }
Member

I think just delete the commented-out sections. From what I can tell, you have already pulled out the necessary pieces.

Member Author
@lawnjelly lawnjelly Mar 29, 2020

Yeah I was expecting we'd decide to do that, we can always look at the old version for reference.

In terms of maintenance, this is one of the most error prone areas actually. The logic in try_join must exactly match the logic in the renderer, otherwise you will get subtle rendering bugs as a result of joining items that shouldn't have been joined. In one of the other versions of the PR, I tried putting this all in the same uber function. However it does end up quite messy.

Although I've avoided the uber function approach here, I'm equally happy to go with either approach, there are pros and cons with both. The split version is easier to read and understand (especially for people unfamiliar with how it works). However the combined uber function may possibly be easier to maintain in the long run.

I think we can either change to uber function now, or leave as is and change with another PR if it proves to be a better approach. One other factor in favour of the split approach is that it might be easier initially in terms of dealing with ironing out major problems.

For the uber function to work it would probably either have to be littered with a lot of #ifdefs, or use templates, to ensure getting 2 different runtime versions of the function, versus lots of branching at runtime which would slow everything down unnecessarily.

Member

I think this is fine for now. I have a feeling that if we add proper state tracking we can simplify this and the main render function enough that the duplication won't be so bad.

// if there are more than this number of commands in the item, we
// don't allow joining (separate state changes, and hardware transform)
// This is set to quite a conservative (low) number until investigated properly.
const int MAX_JOIN_ITEM_COMMANDS = 16;
Member

I'm not sure I understand this properly. Is this to ensure that an object with many commands (like a tilemap) is assigned its own batch and avoids doing all the software transforms associated with batching?

Member Author

Yes, pretty much. Essentially the problem is this: Any item that contains commands that are default (i.e. not handled by software transform and the batching renderer) should not be joined.

In order to work this out, it does a lookahead through the commands, which could potentially be very expensive. As such it makes sense to put a limit on this to some small number, which will catch nearly all cases which need joining, but not be overly expensive in the case of items with large numbers of commands.
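A sketch of that bounded lookahead (command types and names here are illustrative, not the engine's): items with more commands than the limit are rejected without scanning, so the extra pass stays cheap for command-heavy items like tilemaps.

```cpp
#include <cassert>
#include <vector>

enum CommandType { CMD_RECT, CMD_POLYGON, CMD_OTHER };  // illustrative subset

// Join only items whose commands are all batchable rects, giving up past
// `max_commands` (cf. max_join_item_commands / MAX_JOIN_ITEM_COMMANDS).
// Returning false leaves the item unjoined, on hardware transform.
inline bool can_join_item(const std::vector<CommandType> &cmds, int max_commands) {
    if ((int)cmds.size() > max_commands)
        return false;                 // too many commands: don't even scan
    for (CommandType c : cmds)
        if (c != CMD_RECT)
            return false;             // a non-batchable (default) command breaks the join
    return true;
}
```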

There may be a far more efficient way of handling this. Ideally it would be nice in general to mark some extra data like this on the items themselves, and some dirty flags for when things change in the item. There's a lot of inefficiencies throughout to deal with the historical design.

In practice I don't think this will be a problem in 3.2 to do like this, but we could look at some better approaches in 4.0 where it might be possible to move some of this processing further up into the visual server.

Member

Right, that makes sense. Thank you.

We should consider making this a project setting as well.

Member Author

Can do, especially for testing, if you think this is a good idea. 👍

I do suspect that once we have figured out a good value, it won't need to be user editable... I'm conscious there's a problem with 'project settings creep' at the moment, so I'd hesitate to go overboard on this. Plus it won't be at all obvious to end users what this does. Maybe the project settings overchoice problem will be addressed in 4.0 with 'advanced settings'.

Another candidate for the project settings is the max vertex buffer size, which at the moment is hard coded in initialize().

Member

Ah, yes, please make the max vertex buffer size a project setting as well. I meant to comment on that but forgot.

return false;
}

void RasterizerCanvasGLES2::_canvas_render_item(Item *ci, RIState &ris) {
Member

It's concerning that this logic is now copied three times (when attempting to join, when rendering batched and here). I wonder if there is any way we could combine.

On the other hand, I do appreciate how simple it makes the implementation.

Member Author
@lawnjelly lawnjelly Mar 29, 2020

This section here is only repeated as a temporary measure for backward compatibility (i.e. turning batching off and finding regressions with the old renderer).

Once it has been fully tested and we are sure it works for everyone, there should be no need to retain the legacy codepath, and it could potentially be better to just remove it. As you say it is problematic trying to maintain 2 duplicate codepaths.

edit - Actually thinking about it, I'll double check it isn't possible to make a legacy path through the new renderer like I did with the old PR. I'm a bit wary of mixing up the new stuff with too much legacy code and making it harder to read etc.

Member

I'm okay with commenting this to say that it needs to be removed before being merged into 3.2. It's fine for regression testing.

Member Author

Ok I'll comment that. 👍

@akien-mga
Member

I can see how you left it open to extend this to other Command types fairly easily. I'm wondering if our best approach may be to make a "test-release" and link it in a blog post to get wider testing, even before merging the PR and then adding in some of the other command types in this PR. @akien-mga may have some thoughts on that. But I think that testing is necessary to make sure this works, and you've put in so much work so far, we really ought to get some wider testing before asking you to make the final leap and finish it up.

I think we should merge this in a dedicated branch (e.g. 3.2-gles2-batching), and I can use it to make official test builds to get wide testing before landing it in the stable branch.

If you agree, I can create that branch and we can just change the base branch for this PR on GitHub.

@lawnjelly
Member Author

I think we should merge this in a dedicated branch (e.g. 3.2-gles2-batching), and I can use it to make official test builds to get wide testing before landing it in the stable branch.

I think this sounds like a good plan of action. There are a lot of 2d guys waiting for this so we should get quite a few testers and can fix any bugs.

@akien-mga akien-mga changed the base branch from 3.2 to 3.2-gles2-batching March 29, 2020 09:41
@akien-mga
Member

Done, I changed the base branch to 3.2-gles2-batching.

I suggest we focus first on getting the current state merged, then I'll make a testing release, and we can work further on fixing regressions if there are any, and adding support for more primitives in subsequent PRs. WDYT?

@lawnjelly
Member Author

lawnjelly commented Mar 29, 2020

Great! 👍
Sounds good, I'll make the changes clayjohn has suggested, then it should be okay for some testing!

Ok changes are now made, barring the availability of the #define for regression testing, and my suggestion regarding the merging of the legacy codepath with the new codepath to save on duplication, as I'm not sure whether it is a good idea yet (will discuss with clayjohn). Regardless, it should now be okay for making builds for testing, barring more suggestions for changes.

@lawnjelly lawnjelly force-pushed the kessel32_1 branch 2 times, most recently from 6a3b4db to 5375abf Compare March 30, 2020 08:56
@lawnjelly
Member Author

I have now added some more project settings, as requested. I figured out that I should be creating them in RasterizerCanvasGLES2 initialize, rather than in visual server, because they only apply to GLES2.

As there are now several settings, instead of putting use_batching in the rendering/quality/2d section I added a new section, rendering/quality/2d_batching.

In that there are 3 settings:
use_batching
max_join_item_commands
colored_vertex_format_threshold

and to rendering/limits/buffers I have added 2d_batch_buffer_size.

I've given these all reasonable ranges. Let me know if there are any preferences for renaming these, or putting them in different places, this can easily be done. I've done some basic testing and they seem to work ok and are quite useful for diagnostics etc. We may decide to remove some in future.

I haven't yet written the class reference entries, will do that now, but committed first what I've added.

I've also added a couple of lines to support zero buffer length in the rasterizer array, as it will use that if batching is turned off, to save memory, and fixed an uninitialized value for total_quads that was causing problems with one of these parameter settings (I've written fuller constructors for BatchData and RenderItemState).

@akien-mga
Member

I have now added some more project settings, as requested. I figured out that I should be creating them in RasterizerCanvasGLES2 initialize, rather than in visual server, because they only apply to GLES2.

I think you still have to define them in a common code branch, otherwise they will not be visible if using the editor with GLES3 (even if configured to use GLES2 on some platforms), and they won't appear in the docs unless we specifically generate docs with godot --video-driver GLES2 --doctool ..

@lawnjelly
Member Author

lawnjelly commented Mar 30, 2020

I think you still have to define them in a common code branch, otherwise they will not be visible if using the editor with GLES3 (even if configured to use GLES2 on some platforms), and they won't appear in the docs unless we specifically generate docs with godot --video-driver GLES2 --doctool ..

Ah okay, I'll move them into visual_server.cpp. I only got that idea from searching in files for rendering/limits/buffers/canvas_polygon_buffer_size_kb and finding it was created in rasterizer_canvas_gles3.cpp, but maybe that is handled differently.

If they are globally visible, should I make it clear they are for GLES2 only somehow? For instance make the section gles2_2d_batching or something like that?

@akien-mga
Member

Ah okay, I'll move them into visual_server.cpp. I only got that idea from searching in files for rendering/limits/buffers/canvas_polygon_buffer_size_kb and finding it was created in rasterizer_canvas_gles3.cpp, but maybe that is handled differently.

Yeah that one seems to be defined both in rasterizer_canvas_gles2.cpp and rasterizer_canvas_gles3.cpp, so it works, though it's a bit inconsistent (it's also defined once with GLOBAL_DEF_RST and once with GLOBAL_DEF, so the behavior differs).

There's also rendering/quality/2d/gles2_use_nvidia_rect_flicker_workaround defined in main.cpp to be always defined even when using GLES3 (so that you don't miss it/lose the configured value when switching renderers).

It's all a bit brittle :)

@lawnjelly
Member Author

lawnjelly commented Mar 30, 2020

Ok, I've moved the project settings back to visual_server.cpp, added gles2_ to the corresponding names, and added entries in the class reference. Seems to be working fine, let me know any other changes. 👍

The max_join_commands setting is quite handy because if you set it to zero it will prevent joining items which will be useful for finding regressions.

allow_in_editor

I've also added a new temporary setting allow_in_editor, which requires a restart, and marked it as experimental in the class reference. It has occurred to me that we will need to test it in the editor at some point anyway, and having it in the experimental branch might be as good a time as any.

The allow_in_editor requires a restart, and isn't completely user friendly, as it only makes the decision whether to use batching at startup (i.e. if you change batching project settings, they have no effect in the editor until you restart), but it could still be very useful. Note that it stores the setting in the project, rather than as an editor setting.

If we are satisfied it works fine in the editor and is running okay, we can remove the option and always have it run.

flash_batching

As this is a testing branch I've added the facility to flash between the old renderer and the batched renderer as a project setting. This is helpful for finding regressions. It should not be needed after the initial testing phase.

@lawnjelly lawnjelly changed the title [WIP] GLES2 2d Batch rendering (across items) GLES2 2d Batch rendering (across items) Mar 30, 2020
@lawnjelly lawnjelly force-pushed the kessel32_1 branch 2 times, most recently from ada6032 to e9f2282 Compare March 30, 2020 14:38
@lawnjelly
Member Author

@akien-mga Just a note that the CI has been repeatedly failing today (last 3 commits for this PR) on appveyor from timeouts (taking longer than 60 mins). I don't know if it is to do with the extra python checks that you were adding, or if it is just appveyor being slow. Maybe there isn't normal caching because we are on a new branch. I haven't changed any includes that might slow down, the only major change is adding to the ProjectSettings.xml class reference...

Anyway although there is a cross it doesn't indicate there's a problem with the code (well unless there was a problem that would have happened after timeout lol). 😁

@leanmendoza

Hi! I'm trying to test this feature because I'm excited about it; I think my game is bottlenecked by exactly this. But on my first try I couldn't see any performance increase. I assumed it was because I had a lot of sprites, so I then tried an empty scene with only a TileMap (figure 1), but still nothing. I was testing the performance by reading the FPS with and without batch rendering.
Any suggestions for finding where the bottleneck is?

[screenshot: empty test scene with a single TileMap]

@akien-mga
Member

@lawnjelly Yeah I guess it no longer uses the 3.2 cache due to the base branch change. And a build from scratch on AppVeyor takes... 57 to 65 min I guess? So it often times out. I'll change the base branch again for now and only change it back when merging.

@Xrayez
Contributor

Xrayez commented Apr 15, 2020

Just tested with recent changes to this PR in a project of mine.

At first I thought there were no significant gains, but I couldn't believe that. Then I recalled that I've actually rewritten the way bezier textures are drawn in relation to z-index (when you have to draw an arm either in front of or behind a character, or even in front of other layers like weapons).

Unfortunately, the drawing API doesn't provide a way to simulate different z-index within the _draw method itself, which I'd imagine would look like this:

func _draw():
	draw_set_z_index(zi)
	# ... draw textures on different z_index within the same CanvasItem

Instead, I had to basically split up the same class and use different draw offsets/ranges:

[screenshots: the split-up classes using different draw offsets/ranges]

Which I suppose does "harmful" things performance-wise when this is split between canvas items.

This is where I had a drop in performance once again, but I have to say that in either case, GLES2 with batching performs better/on par with GLES3.

But good news is that, without the splitting hack (only drawing an entire arm within a single canvas item), I'm able to draw around 300 characters on screen before FPS goes below 60 (was only able to render around 80 characters on GLES3).

Having said that, I'm totally satisfied with the performance gains, and some optimizations are possible to achieve on the per-project level now (both rendering and GDScript bottlenecks), so thanks for the work you've put into this @lawnjelly, hopefully my testing efforts weren't in vain. 😁

@lawnjelly
Member Author

Yeah, I can't say without examining your new project what your new bottlenecks are. It could be

  • the way you are updating the visual server, which was the bottleneck in the previous version (taking > 10x longer than the actual rendering)
  • it could be a situation that isn't covered
  • or it could be that you are now producing something which is not possible to batch.

There are limitations as to what can be batched in order to preserve the behaviour of the previous renderer, and not have visual regressions.

The updating of the visual server did look very slow in your bezier approach and it is worth an investigation, but it is outside the scope of this PR.

I think the aim here should be to at least cover the most popular use-cases, and as many generalised use cases as possible. If you are using quite bespoke techniques you may have to experiment a bit to find out what works best.

Profiling is really useful for this kind of thing, especially if you can compile from source.

That reminds me I'm intending to add some diagnostics within the IDE to help with this kind of thing. I believe the 'monitor' tab of the Godot IDE didn't work at all for GLES2 2D before, but I'll try and get it to correctly monitor the number of drawcalls and whatever else is possible. I may be able to get it to debug print out some diagnostics for a frame, I'm not sure yet though, it would depend on some method to call from the IDE or script.

Unfortunately, the drawing API doesn't provide a way to simulate different z-index within the _draw method itself

Yes, the z index is set for the entire item (usually a node), and not the commands within it. However afaik if you order the drawing commands yourself, this order will be respected in the renderer. And it will probably be more efficient to boot (providing you do your sorting efficiently your side).

This adds 2 new values (items and draw calls) to the performance monitor in a '2d' section, rather than reusing the 3d values in the 'raster' section.

This makes it far easier to optimize games to minimize drawcalls.
@lawnjelly
Member Author

On suggestion from Akien I've now changed use_batching_in_editor to default to true.

I've also added (in the same commit) some very basic features to the IDE Debugger monitor:

  • 2d/items - number of canvas items, or joined canvas items using batching
  • 2d/draw_calls - number of draw calls

These are particularly useful for getting batching working as well as possible in your game, because you can see the drawcalls reduce dramatically the more batching you are achieving. It also operates with batching switched off.

The alternative is to reuse 'drawcalls' in the raster section (if adding new values might e.g. affect the efficiency of the debugger), but a new section is preferable if possible because it enables you to examine 2d and 3d in isolation in 3d games which have 2d user interfaces. It also may be possible to add more performance stats but this should do to start with.

I haven't implemented these for GLES3, but that should be fairly simple using the same mechanism.

I'd also appreciate anyone familiar with the Debugger monitor looking over how I added the values in this commit. I literally did a find_in_files for existing values until it worked, so I can't guarantee I've done everything correctly, particularly whether it might affect monitoring over the network.

Copy link
Member

@clayjohn clayjohn left a comment


@akien-mga We have been discussing the current state on IRC, I think this is ready to be added to the beta.

In most of our test situations batching is at least as fast as, if not faster than, the current code. However, batching can be up to 10% slower when the game is heavily CPU limited. In this situation batching still reduces the GPU load, but the overall FPS suffers a slight drop. lawnjelly and I feel this shouldn't hold back beta testing, but it does mean we will leave the batching toggle in ProjectSettings for now, as some users will benefit from turning batching off.

Added project setting to enable / disable print frame diagnostics every 10 seconds. This prints out a list of batches and info, which is useful to optimize games and identify performance problems.
@lawnjelly
Copy link
Member Author

Just as a final tweak, based on the conversation last night with @clayjohn, I've added a simple frame diagnostics mode. You can turn it on and off in the project settings rendering/gles2/debug/frame_diagnostics.

When set to on it prints every 10 seconds a log of the batches being used in the frame, in this example form:

canvas_begin FRAME 4716
items
	joined_item 2561 refs
		batch D 0
		batch R 2560 [0] MULTI
		batch R 12 [1] MULTI
	joined_item 1 refs
		batch D 1
	joined_item 1 refs
		batch D 0
		batch R 6 [0] MULTI

This shows how successful the item joining is, and whether each batch is default (D) or rect (R), then the number of commands in the batch, the batch texture ID in square brackets, and a MULTI symbol if there are multiple rects in the batch. The aim should be to get as many MULTIs as possible.

Projects that are not batching well will have a lot of single rect batches. Note that complex games will have quite long frame logs, in which case they might get truncated by the editor. You can either run godot from a console to see the full log, or you can increase the IDE buffer size by increasing network/limits/debugger_stdout/max_chars_per_second.

@akien-mga
Copy link
Member

Looks good to me.

Let's get this merged and tested in production 🤘

Thanks a ton for the tremendous work @lawnjelly, @clayjohn and all testers!

@akien-mga akien-mga merged commit 008e074 into godotengine:3.2 Apr 17, 2020
@akien-mga akien-mga removed the cherrypick:3.x Considered for cherry-picking into a future 3.x release label Apr 20, 2020
@Firty
Copy link

Firty commented Apr 20, 2020

WARNING: render_batches: NinePatch without texture not supported yet in GLES2 backend, skipping.
At: drivers/gles2/rasterizer_canvas_gles2.cpp:883
I am having this error when I try to create an AtlasTexture in a Custom Style of a Panel.

@Firty
Copy link

Firty commented Apr 20, 2020


Panel -> Custom Styles -> StyleBoxTexture -> Texture -> New AtlasTexture.

I tried to do the same thing on a Button and it gave the same result... There is a problem with the Custom Style: it is not accepting the AtlasTexture.

It seems to be an engine problem. But I worked around it in a strange way: instead of creating an AtlasTexture, I loaded the texture and just set the region. It worked, but it was confusing.

@lawnjelly
Copy link
Member Author

WARNING: render_batches: NinePatch without texture not supported yet in GLES2 backend, skipping.
At: drivers/gles2/rasterizer_canvas_gles2.cpp:883
I am having this error when I try to create an AtlasTexture in a Custom Style of a Panel.

Despite the function name, this area is just copy-pasted from the old renderer (to reuse the same code between the legacy and batched renderers). I suspect the old renderer would do the same (if you switch off use_batching); or compare with e.g. Godot 3.2.1.

@Two-Tone
Copy link

[light_scissor_area_threshold]
A value of 0.0 means any area saved will activate scissoring, 0.5 is half the screen etc. The actual pixel area is calculated regularly in case the user resizes the screen.

Is 0.5 half the screen or a quarter? If you're dividing the X and Y by 2 (or multiplying by 0.5), it's a quarter; you'd have to divide by ~1.4142 (or multiply by ~0.7071) to actually get half the screen resolution. It's an important distinction for the documentation.

@lawnjelly
Copy link
Member Author

lawnjelly commented May 26, 2020

[light_scissor_area_threshold]
A value of 0.0 means any area saved will activate scissoring, 0.5 is half the screen etc. The actual pixel area is calculated regularly in case the user resizes the screen.

Is 0.5 half the screen or a quarter? If you're dividing the X and Y by 2 (or multiplying by 0.5), it's a quarter; you'd have to divide by ~1.4142 (or multiply by ~0.7071) to actually get half the screen resolution. It's an important distinction for the documentation.

It's an area: width * height in all cases. Effectively a pixel count.

  • We have an AABB A for the light, and an AABB B for each item.
  • We can calculate C, the area of the intersection between the two.
  • D, the area potentially saved by scissoring, is B - C.
  • If the saved area D (as a proportion of total screen pixels) >= light_scissor_area_threshold, then we apply the scissor.

In this way we can selectively apply glScissor in cases which are likely to save a lot of fill rate. Or we can apply it in all cases (set the threshold to 0), or turn it off completely (1). On my PC there is not a massive cost to scissoring, and setting it to 0 is quite viable. However, it is possible that on some hardware there may be some cost to scissoring (at the very least, there are a couple of GL calls), such that the best value lies somewhere in between. I don't have the facilities to rigorously test this, so the decision is left to the user.
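To make the threshold test concrete, here is a minimal sketch of the decision in isolation. The names (`Rect`, `intersection_area`, `should_scissor`) are illustrative only, not the engine's actual identifiers:

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned rect in pixels (illustrative; not the engine's Rect2).
struct Rect {
    float x, y, w, h;
};

// C: area of the intersection between two AABBs (0 if disjoint).
static float intersection_area(const Rect &a, const Rect &b) {
    float x0 = std::max(a.x, b.x);
    float y0 = std::max(a.y, b.y);
    float x1 = std::min(a.x + a.w, b.x + b.w);
    float y1 = std::min(a.y + a.h, b.y + b.h);
    return std::max(0.0f, x1 - x0) * std::max(0.0f, y1 - y0);
}

// Apply glScissor only if the pixels saved (item area B minus the
// light/item intersection C), as a proportion of total screen pixels,
// meets the threshold. 0.0 = always scissor, 1.0 = never.
static bool should_scissor(const Rect &light, const Rect &item,
                           float screen_w, float screen_h,
                           float threshold) {
    float item_area = item.w * item.h;                          // B
    float saved = item_area - intersection_area(light, item);   // D = B - C
    return saved / (screen_w * screen_h) >= threshold;
}
```

On a 1000x1000 screen, a full-screen item only one quarter covered by a light saves 75% of the screen's pixels, so it passes a 0.5 threshold but fails 0.8.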

We may later be able to automate some of these decisions to an extent, but it is hard to get feedback on the effects of changing this kind of thing without running e.g. some diagnostic tests at startup. This is similar to the problems faced by a car ECU.

@oeleo1
Copy link

oeleo1 commented May 26, 2020

Looking at rasterizer_canvas_gles2.cpp, shouldn't the orphaning code in _batch_upload_buffers()

// orphan the old (for now)
glBufferData(GL_ARRAY_BUFFER, 0, 0, GL_DYNAMIC_DRAW);

read like the orphaning done in the initialization?

glBufferData(GL_ARRAY_BUFFER, bdata.vertex_buffer_size_bytes, NULL, GL_DYNAMIC_DRAW);

I believe that the zero size results in a no-op. When we introduced orphaning in GLES3, which gave the speed boost on mobile, the rationale was to orphan the whole buffer.

@lawnjelly
Copy link
Member Author

lawnjelly commented May 26, 2020

This is a little bit of a gray area, as far as I am concerned. Generally I go with manually using a circular set of vertex buffers to make it explicit, but in this case we are reusing the same VB multiple times per frame (as much as 1000+ times) so orphaning seemed easier.

https://www.khronos.org/opengl/wiki/Buffer_Object_Streaming

It seems the driver is able to do whatever it wants, so we are having to predict what it might do to a certain extent. Using the same size as before makes some sense if you are going to be reusing a similar-sized buffer (especially if once per frame). However, in this case the buffer size will be varying each time, so we don't gain much by doing that. In fact, it might be detrimental, because the driver might decide to reserve some memory just in case you glBufferSubData to it.

I think I did try both and decided this made the most sense, but am willing to change it if we discover it causes problems (it has seemed fine so far). 👍

If you think about it: if we were reusing the same vertex buffer 1000 times in a frame, and this was a 1 MB VB, and the driver actually kept every one of these allocations alive, it might be churning 1 GB of memory per frame and using more than 3 GB in storage. Which would be extremely bad news.

In practice I'm hoping the NULL will just say 'I don't care about the last version: use it, then delete'. Which is what we want in this case. It may work just as well to remove the NULL call (I'll try and test this out). 😁

@oeleo1
Copy link

oeleo1 commented May 26, 2020

It will work but that's not the point. The point of orphaning is to let the driver know that we're not interested in the previously allocated and written-to buffer, and that we're done with it. The first bind/unbind cycle says "go draw this data". The second time we do that, we may overlap with what is currently in progress, which is a nasty constraint for the driver and usually results in a slowdown (which is what I saw last time I dealt with that stuff). Unless we orphan the whole buffer, after the first bind/unbind cycle the driver assumes that we may rebind and write to the buffer again. But if we say bind(size, null), there is no other interpretation than "we're done with the whole previous buffer, we need a new one". If we say bind(0, null), this means nothing significant, as the buffer may be written to again and rebound. The latter is basically like a malloc(0): no harm to the heap, no effect.

@lawnjelly
Copy link
Member Author

It will work but that's not the point. The point of orphaning is to let the driver know that we're not interested in the previously allocated and written-to buffer, and that we're done with it. The first bind/unbind cycle says "go draw this data". The second time we do that, we may overlap with what is currently in progress, which is a nasty constraint for the driver and usually results in a slowdown (which is what I saw last time I dealt with that stuff). Unless we orphan the whole buffer, after the first bind/unbind cycle the driver assumes that we may rebind and write to the buffer again. But if we say bind(size, null), there is no other interpretation than "we're done with the whole previous buffer, we need a new one". If we say bind(0, null), this means nothing significant, as the buffer may be written to again and rebound. The latter is basically like a malloc(0): no harm to the heap, no effect.

Afaik, glBufferData should always release the reference to the previous buffer and create a fresh allocation (presumably it leaves a reference count on the previous store, which reaches zero once it has been rendered and is no longer owned). That's the difference between glBufferData and glBufferSubData: glBufferSubData is for updating some part of the buffer, while glBufferData says 'throw away what you've got, give us a new one, and upload this new data'.

Orphaning is probably more relevant when you want to use glBufferSubData in combination. If you don't do it in that scenario, presumably it has to do a load of housekeeping to keep the rest of the buffer the same (and may cause pipeline stalls). We are not doing that here, we are simply throwing out the whole of the old buffer.

https://www.khronos.org/registry/OpenGL-Refpages/es3.0/html/glBufferData.xhtml

glBufferData creates a new data store for the buffer object currently bound to target. Any pre-existing data store is deleted. The new data store is created with the specified size in bytes and usage. If data is not NULL, the data store is initialized with data from this pointer.
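Side by side, the upload patterns being discussed look roughly like this (a non-runnable fragment; `frame_bytes`, `buffer_size_bytes` and `vertices` are assumed to come from surrounding code, and the VBO is assumed already bound):

```cpp
// (a) Orphan-and-replace: glBufferData deletes any pre-existing data
// store for the bound buffer and creates a fresh one, so the driver can
// finish rendering from the old store without a pipeline stall.
glBufferData(GL_ARRAY_BUFFER, frame_bytes, vertices, GL_DYNAMIC_DRAW);

// (b) Explicit full-size orphan followed by a partial update: allocate
// a full-sized store with NULL data, then fill a sub-range. This is the
// pattern that matters when combining orphaning with glBufferSubData.
glBufferData(GL_ARRAY_BUFFER, buffer_size_bytes, NULL, GL_DYNAMIC_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, frame_bytes, vertices);

// (c) Zero-size call: creates an empty data store, which is effectively
// a no-op as far as the next upload is concerned.
glBufferData(GL_ARRAY_BUFFER, 0, NULL, GL_DYNAMIC_DRAW);
```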

@oeleo1
Copy link

oeleo1 commented May 26, 2020

Yes, you are right and I stand corrected. And that glBufferData(0, 0) seems spurious at best. The driver can't do anything meaningful with it as it's not even an early indication of whatever follows, which is an overwrite.

@lawnjelly
Copy link
Member Author

lawnjelly commented May 27, 2020

And that glBufferData(0, 0) seems spurious at best. The driver can't do anything meaningful with it as it's not even an early indication of whatever follows, which is an overwrite.

You are very probably right, I put it in during development as a failsafe (hence the 'for now' comment). It can hopefully be removed without ill effect. 👍

In fact this may have caused the confusion; it is probably a leftover from testing. I was regularly removing sections of code, and I think the failsafe was there to ensure the VB was orphaned at that point if the second glBufferData call was commented out (I forget the exact reason). In the same way there are still a few state changes that may be redundant; these can be cleaned up at a later stage.

@oeleo1
Copy link

oeleo1 commented Sep 7, 2020

@lawnjelly What's your take on batching support for CPUParticles2D? Is that potentially within scope, or totally out of sight, assuming you can weigh or "feel" the complexity of it.

@lawnjelly
Copy link
Member Author

lawnjelly commented Sep 7, 2020

@lawnjelly What's your take on batching support for CPUParticles2D? Is that potentially within scope, or totally out of sight, assuming you can weigh or "feel" the complexity of it.

Looking at GLES2 at least, the particles are (from memory) rendered with a multimesh, and it looks very inefficient, with setup and draw calls per particle. It could probably be done as a single dynamic mesh much more easily, independently of the rest of the batching. However, there may be lots of gotchas relating to the existing particle design; it could either be easy or a can of worms. I haven't looked at the particles in GLES3, they may be better already.

I don't really have time to look at it myself (GLES3 batching and helping write new renderers for 4.x have to take priority). If someone else wants to have a go, go for it, but bear in mind there may be some changes to the rasterizer layout for GLES3 batching. And all this will change for 4.x anyway.

I still have to figure out some stuff about the shader binding versions in GLES3 too that has cropped up with a bug with custom shaders and the GLES3 nvidia workaround that may be relevant to GLES3 batching (the binding and states have to be done slightly differently to GLES2).

@oeleo1
Copy link

oeleo1 commented Sep 7, 2020

Thanks. I was asking about CPU particles in GLES2 because GLES3 has GPU particles, while GLES2 is "limited" to CPU particles which are, as you say, probably suboptimal for the time being.

Since the game we're working on is a 2D game using GLES2, it has a bunch of CPU particles and most primitives are left out of batching. Enabling batching produces worse performance than without it, so vanilla GLES2 is best for us. GLES3 also performs worse than GLES2 despite the GPU particles: much more stutter and performance variance. Profiling shows the usual GPU bottlenecks, most probably at the CPU/GPU interface level, since the GPU load itself is relatively low. That said, it looks to me like there are two ways of addressing this: a) lower the load on the CPU/GPU sync points, i.e. batching; b) move most of those CPU particles to shaders, i.e. back to assembly language. There may be other solutions as well, but none off the top of my head.

In short, great fun ahead.

PS: I have looked at the CPU Particles code. It is indeed handled with a multimesh, which already looks quite optimized to me; I don't see any obvious improvement. It is the same code for GLES2 and GLES3, with the sole difference being that GLES3 uses instanced rendering, which isn't supported in GLES2. The inner particle loop contains only the per-particle transforms, and the rest of the setup is externalized. So I am not really sure what "single dynamic mesh" being probably much easier means (cf. @lawnjelly), since the code looks state of the art for what GLES2 allows when rendering multiple objects.
