-
-
Notifications
You must be signed in to change notification settings - Fork 21.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GLES2 2d Batch rendering #35957
GLES2 2d Batch rendering #35957
Conversation
Does it increase the bunnymark score? |
Bunnymark V2 GDscript (uses Sprite) Old 1870, Batched 1870 No it isn't faster, you may have found a rare case it can't accelerate. 😢 At a guess I'd think each bunny is causing a separate call to _canvas_item_render_commands. I can only batch within that function, if it is called with e.g. 1 bunny each time, there is no batching. Also in the single case, it is currently using the same codepath as nvidia workaround, which is slower. I can easily put in a special case for this kind of thing, so it is never slower than the old version (edit DONE). I did have a look initially at batching over multiple _canvas_item_render_commands calls, however because of the historical design of the engine, it looked quite a bit more complex. I think the best I can do in that case is to just fall back to the legacy method, it will probably be quicker for a single quad (because the batching uses index buffers etc). Bunnymark V1 DrawTex GDScript (uses draw_texture) Old 2060, Batched 3238 I suspect the low increase is because something else is bottlenecking it, I haven't looked in more detail. Each bunny is a separate node running gdscript so that's going to be pretty slow compared to things like tilemaps. Bunnymark V1Sprites Overall, bunnymark might be useful for comparing languages but it's not a very useful real world test for rendering. Usually it can be a good idea to test each aspect in isolation. And within rendering there are different aspects, you can be limited by vertex throughput, or fill rate etc. |
On a heavy tilemap (cell size 8) on my low-end hardware (Celeron N2840). I get about a 10X speed increase. Frame times from 5-33 ms are reduced to 0.5-2.5. The speed increase is obvious. I will be reviewing the code this evening in more detail. But I think we should also try to get lots of testers on different hardware. The other batching PR ended up having problems with specific hardware. So testing on a wider base is likely needed here as well. |
Great! 😄 It seems I have a bunch of warnings as errors to sort out for some platforms, that may have to wait until after the weekend. Still it can be tested on desktop in the meantime. 👍 Also note to any reviewers, don't get carried away yet on the details as I'll still be doing some more work on it after the weekend (and there are still various variable names to change to godot standard). |
Regarding defining Defined in VisualServer.cpp godot/servers/visual_server.cpp Line 2398 in 0812f99
Used in rasterizer: godot/drivers/gles3/rasterizer_storage_gles3.cpp Line 8350 in 0812f99
|
There are some cicd errors. https://travis-ci.org/godotengine/godot/builds/646965929?utm_source=github_status&utm_medium=notification |
Yup I will have to deal with them after the weekend unless I get some time tomorrow (am too tired tonight, is bedtime here! 😃 . I think they are warnings as errors about using godot things like Color in pod structures, nothing serious. It's actually a little more complex than you might think, because the vertex format has to be 32 bit and Vector2 etc might be using real, so could change to double in future, so I'd prefer to make anything future proof. |
This one is a game changer. My project jumped from less than 200 to 850 FPS. It's a powerful desktop so I guess a mobile experience would not be nice without this. |
1d1ce2d
to
2d66e3f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a start. Mostly just naming convention nit picks. I will do a more in depth review when I have time tonight.
Looking great so far. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. This finishes the first pass. Mostly naming convention stuff. Some change to the flow in trivial areas and some questions about using `CRASH_COND``. Overall looks very good.
We should discuss later how this will fit in with other primitives. Right now it is set up really well for just RECTS. But I think the biggest benefit would come if we could batch all types of primitives together.
|
||
// this should always succeed after growing | ||
batch = bdata.batches.request(); | ||
CRASH_COND(!batch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference here would be to try to fail gracefully. Although, given that this should never occur. I think it is probably best to at least use CRASH_COND_MSG
Line 462 in de932a5
#define CRASH_COND_MSG(m_cond, m_msg) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually now you mention it, I will see if I can remove some of these, I really put some of them in as asserts to check the code path during development, and they will slow things down at runtime. Or at least wrap them to only occur in debug build. And as you say some might be able to be replaced by a graceful fail macro.
bdata.verts_colored.reset(); | ||
bdata.batches_temp.reset(); | ||
|
||
CRASH_COND(bdata.verts_colored.max_size() != bdata.verts.max_size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be CRASH_COND
s Is there any way we could gracefully recover from this?
We try to avoid any and all crashes. Our preference is to ERR_PRINT
when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They actually should be debug only, I've wrapped them in #ifdef DEBUG_ENABLED. It is a check on program flow, if this assertion is violated the rest of the function won't work. There is no sensible way to recover I don't think. Good news is users can't cause this, only a core programmer changing the allocation sizes so they are unequal, and it would be best to catch this immediately.
The CRASH_COND checks within this function could actually be removed entirely, since they can't occur at the moment, but I am just hoping that having them available will make it easier to maintain. I've also added a comment to that purpose.
indices.set(i_pos + 5, q_pos + 3); | ||
|
||
// we can only use 16 bit indices in GLES2! | ||
CRASH_COND((q_pos + 3) > 65535); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to ERR_FAIL_COND_MSG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make it debug only, but crashing rather than just an error message might be a good idea to quickly catch logic errors if this allocation code is changed in future (i.e. if someone later changes max_quads without realising they can overrun the max index size, this is a really common gotcha with indices, i.e. I've done it myself too many times to count lol 😄 ).
If you overrun the max index size, it will 'appear' to work, however the whole vertex buffer won't get used in drawing and it will wrap around to the front causing hard to diagnose visual bugs. Best case. 😬 . Let me know either way, I can change.
ddb8aac
to
a780346
Compare
Tested on: Windows 10, android, and WebGL All work fine. All see substantial performance benefits with heavy text/tilemap rendering. For posterity I will summarize our discussion from IRC earlier today: This approach to batching is very simple to implement and easy to understand. To implement batching on an Item-level would be significantly more complex and may require redesigning the GLES canvas rasterizer. Further, it would create a significant overhead cost as sorting the Items and identifying when batching is appropriate can and will be expensive. We decided the best approach going forward was to focus on batching ItemCommands and then communicate this implementation to users. (e.g. if you need extra speed, take advantage of batching by using the custom drawing API). However, we can also consider adding a second batching implementation that batches Items by combining them into ItemCommands. This would be toggleable separate from regular batching as it would incur the above-mentioned overhead cost. |
Sorting the items is kind of out of the question, that would just break 2D entirely.
Between canvas items, if there is light, then you flush on the light pass. If the mateial changes, then you flush. Finally when you are done rendering, you flush again. This will make it possible to batch as much as realistically possible, but it is to be expected that some things like lights or material changes will avoid you from going further, but it's still fine. |
For @clayjohn and reference here we discussed extending it to handle batching between the calls to get it working with sprites etc too:
|
@clayjohn I just want to make sure I understand this correctly - the only way to get the benefits of this PR is to use |
I've tried a similar approach (i.e. ItemCommands batching) in #20077. This has been rejected at that time and resulted with another batching try #20965 however that caused performance issues on mobile patforms. I keep my fingers crossed for this pull request so we've eventually got the proper GLES2 batching in Godot. :) |
Yup, I saw your PR (or one of them)! It is absolutely not easy to retrofit fully so I totally understand the frustration you had (on top of actually getting anything working at all, it's really easy to get stalls from the dynamic VBs). I took what I would consider the easy approach at first, but batching across canvas_render_items becomes considerably more involved. Hopefully I am just starting to get somewhere with the extended version today (it is rendering mostly ok now, just still not accelerating the bunnymark v2, but I probably need to batch over multiple calls to canvas_render_items, as each bunny is a node). Getting only 4 hours sleep didn't help either! Part of the problem is that it has not been designed with batching in mind (it is really written as an immediate mode renderer), so there is a lot of CPU sapping house keeping to do, which would be nice to push upstream in the call stack. Still, getting anything working at this higher level will be better than the current situation. |
@lawnjelly as far as I remember I've been able to batch buffer update calls (but not draw calls) across canvas_render_items in #20077 by introducing two commands, i.e. begin list and end list that were used to split canvas_render_items into the prepare and flush phases. |
@lawnjelly I really (really really) hope your PR is going to hit the master because I can't sleep due to the missing batching. 😊 |
I think it would be best to just add a second commit on here. Then you still have the snapshot of how things are now and we can track everything in one PR. |
Hi, I've been anxious to try this feature out since I heard about it a couple weeks ago and today I decided to build it. I can confirm that the performance increase is impressive, with a tilemap based game on mobile jumping from 16 FPS to 40 FPS on Firefox and a full 60 FPS in Chrome. I would like to say that I'm completely satisfied with it, but it seems like either the master branch of Godot (most likely) or this PR causes an issue with touch screen button functionality. Assuming I did everything right, which I'm not even 100% about. I built the lawnjelly:batch-2d branch from source on a xenial (Ubuntu 16.04) 64-bit chroot on my system (which is the latest version of Ubuntu) as well as the JavaScript export template. Everything works for the most part (for example, when testing and playing on desktop the buttons work fine) but on mobile I can no longer press them. Keep in mind that this was a project I was developing using 3.2-stable just yesterday and they worked fine there. So I can only assume there's some weird issue. The touch screen buttons neither make the pressed animation or send out any kind of signal, which I'm guessing means click/touch events might be messed up in general for the HTML version on mobile. I'm not sure if this is the proper place to bring this up or if it is directly related to the PR, but since it happened while building the branch associated with it I figured I should mention it somewhere. I am really excited for this feature and I'm really hoping it makes it in to 3.2.1. Edit: I went ahead and took the liberty of uploading both versions of my current game project in its current state as it was built with both 3.2-stable and this PR repo. The virtual controller will only display on mobile, but I did test for debugging purposes and the virtual controller does in fact work on both desktop native and desktop HTML versions of the game built with the batch2d version. Only on the mobile version, where they are actually needed, does this problem occur. You can also see on the top left hand corner an FPS counter though which shows the stark difference between the rendering techniques. https://animerrill.com/test/3.2-stable-test/ |
@AniMerrill : The touchscreen issue sounds unrelated. I can't guarantee it 100%, but I struggle to think of a mechanism where the rendering could interfere with input. It is more likely it is due to whatever version of master I happened to have at the time I was developing the PR (it was not 3.2 stable, most likely a bit afterwards or even slightly before, where some bugs may have been introduced by other commits). I'm actually not 100% sure what happens when you grab my PR branch (not being a git expert), whether it mixes it with 3.2 stable or whatever version I happened to have during development, I'm assuming the latter. If the above is the case, and you really want a build with 3.2 stable, you could do it by grabbing 3.2 stable source. Then grab only the files or bits I changed (there's probably only 2 or 3, it is mainly in If you search in issues, you may find something has been worked on recently regarding touchscreen input that might be responsible for the bug you saw. Good to hear it is working on web though. 👍 |
@lawnjelly : Alright, so I did a bunch of tests and attempts today. Final verdict is that I wasn't able to get a 3.2 stable version working with these changes. I'm sure this is probably down to a mistake on my part during the build process, or at least I have to assume so. It does look like whenever you created the branch for https://animerrill.com/test/batch2d-w-stable2/ Unfortunately it does still look like the touch screen buttons are still unfunctional, and even worse it says:
Which means that this build also wasn't able to even use the batching, so the framerate problems are exactly the same + there's no way to use the virtual controls. The other message that popped up on both this build and the one I posted about last time was the following:
Which I have no way of proving this is related, and I don't have any way of checking the browser console on my phone, but it is the only other difference between this build and the vanilla 3.2-stable build where the controls work. And this also happens on the version where the only change should have been to the Again I am excited for this feature and I hope to see it in an upcoming RC if at all possible, or at least in 3.2.1. The performance difference for a simple tilemap that this batching routine provides over the current stable branch is phenomenal and extremely necessary for the limited hardware of mobile play. I just wish I could get it working without these bugs myself, but I'm hoping I don't have to wait too long regardless. |
@AniMerrill you can't just copy over the GLES2 folder. You have to copy over every file that has been changed in this PR. So that includes servers/visual_server.cpp |
If I have time I'll try and make a version against 3.2 stable, however, I'm pretty much a git newbie myself, and this kind of version mismatching can be quite difficult to deal with. Normally all this is handled for us contributors during merging. Bear in mind that if you grab 3.2 branch now and build from source, it is not the same as the released 3.2 stable. Commits have been cherry picked and added since then, which may have fixed or introduced new bugs: https://github.com/godotengine/godot/commits/3.2 So if you are going to compare with and without the PR, you have to compare both versions built from source. The chances are that something to do with input is unrelated, you may have to do some detective work yourself to work out what is causing that bug. I can't really spend too long on this at the moment because I'm busy working on the next version. The sooner we finish that the sooner it will be available officially. |
@clayjohn : Alright, that did seem to be my issue for that last build then. Once I updated that file and recompiled, everything seems to be in working order now. https://animerrill.com/test/batch2d-w-stable5/ Unfortunate finding perhaps though is that the performance gain on this version seems pretty negligible, seems to be <5 FPS gain versus the 10-20 FPS gain that the first attempt I made. For full clarification here are the other two versions I am comparing against. https://animerrill.com/test/3.2-stable-test/ (Made with vanilla 3.2-stable as can be downloaded from the main page) These are some general FPS measurements I got by just sort of eyeballing it, so take it with a grain of salt I guess:
I can only assume that there must have already been a lot of other massive improvements on the backend by the time this branch was created, because besides the loss of touch screen button functionality the performance boost created through the PR build is pretty substantial. @lawnjelly : Also yes, I did mind to grab the actual release version of 3.2-stable so these comparisons are between the proper download provided at godotengine.org and the preserved code base that should have been used for that specific build. So the only difference between my stable build and the stable with batch2d build should be the files altered in this PR. So that likely has some amount to do with the performance differences etc. In any case, at least I got it working and modest performance boost is better than no performance boost at all. Looking forward to the final version of this, hope it gets merged soon and made official. |
@AniMerrill the stable release from the website has optimizations available. Lawnjelly was saying you need to build from the 3.2 branch yourself to test. Alternatively, for a fair comparison, you need to build the 3.2 merged branch using release instead of debug target. |
@clayjohn : Okay for clarification, I am using the version of 3.2-stable that I downloaded like the day of it's release. Again, unless I am mistaken, this should be precisely the build made from the tagged release code above which wouldn't have the updates that differentiate it from the vanilla version I am using. I have also used release version builds for all my tests with this PR. Unless you're suggesting that the post release commits would be necessary to get the full advantage of this PR, in which case I could probably try that. |
@AniMerrill this has nothing to do with the post-release commits. The official editor binaries are built using a series of compiler optimizations that you likely haven't enabled when you built it yourself. The best way to compare is to build both yourself. That means you need to checkout the 3.2 release branch and build it the same way you built the 3.2-batch2d branch and then compare those two versions. |
@clayjohn : Are you talking about enabling things like link-time optimization? Because I did enable that. I'm not sure what other compiler optimizations are available though besides the things suggested here to reduce the executable size. Also I would assume and hope that if there are optimizations made to the officially provided releases that those specifications are public so that I could try to replicate them myself, either way if I can't get an improvement over the official release then it really doesn't matter. |
@AniMerrill. Ah, good. So you are aware. Between LTO and target=release you should get pretty similar performance to the stable binary. I believe they also use a specific compiler that has further benefits. Also, the speed improvements depend on your test scene. This POC PR only benefits tilemaps and text. So if you are testing on a scene that doesn't make heavy use of either you are unlikely to see much benefit. Lawnjelly is working on a much larger PR that will provide speed benefits across the board. But it differs substantially in implementation from this one. |
Just to point out you can actually directly compare both for a single build with this PR, because there is a project setting which can enable / disable it: As clayjohn says, the benefits you get will depend on the bottlenecks in your particular game (and the bottlenecks in the webversion, which may be different to regular builds). |
@clayjohn and @lawnjelly : Just to make absolute sure, I rebuilt again making sure that both the editor and export template were built using LTO and had their most optimal target selected (release_debug and release respectively). I could have sworn I did this before, but these builds all took me about an hour each so it's hard to say if I remembered to set the flags correctly every time. Here are the commands used:
I also went ahead and tried out what @lawnjelly suggested and made two of the same build just with one that has the batching disabled. https://animerrill.com/test/batch2d-w-stable6-disabled/ Disabled
Enabled
I think the biggest difference I noticed was that the enabled version tended to be more stable towards the higher end of it's range and would have a higher low end to the range... but that's basically it. For full disclosure, I can launch both versions from the editor natively on my desktop machine and it pulls 59/60 FPS basically flawlessly so I can only assume this basically comes down to the limitations of WebASM and mobile hardware more than anything. Doing all these tests also elucidated to me how poor Firefox is for mobile game performance, at least for Godot. Also as far as the specific bottlenecks this PR attempts to solve, I am aware that it basically only works for tilemaps, text, or any object where _draw() functions are explicitly used (if I understand correctly). I'd rather not share the code for this publicly any time soon and I doubt it's urgent enough to have me send the code privately somehow, but I can basically assure you that the performance issues are 100% being caused by the tilemaps in the scene. I have two tilemap layers, one for foreground and one for background detail, and just removing the foreground map increases performance significantly. That's sort of what's frustrating about the current stable implementation, because what is going on in my game scene is about on the level of what the NES could do over 30 years ago. The common wisdom is that tilemaps and tilesets reduce both memory usage and render processing because it reduces a complicated scene to reusable chunks, but the current implementation in Godot would work better if I had 320x240 pngs uniquely drawn for each bit of the level. It's also my understanding that Godot doesn't seem to cull things outside the viewport which is also somewhat silly. I might be wrong about that. I am definitely excited for both the future of this PR and of this upcoming PR which will work for all scene objects generally. Hopefully between these and a bunch of bug fixes from the main 3.2 branch we will see a lot of general performance boosts. In any case, thank you for your time. I feel if nothing else this was a learning experience for me and hopefully my one unique use case has maybe given you some further ideas as well. I appreciate the work you're doing and I wish you good luck going forward. |
Awesome that you work on this! I noticed two bugs in 1fa753852edfe6f5393096d5d75c6ad8f6613d35 from your kessel_32_1 branch:
Sample project (ignore the name): |
This PR has been superseded by this one: #37349 Edit - Yes it is a regression, I'll see if I can fix it. Many thanks! 👍 |
Thanks for your kind reply :) Yes that's the one I tested. Maybe you can close this PR then? It's hard to keep track on what is the newest PR. |
Will do good idea. I think these won't be too tricky to fix, I've possibly fixed them already in another version I was working on (hard to keep track of lol). 👍 Yes, it's in the item joining, I'll add an extra check for this. For now if you set max join commands to 0 it will fix it. |
Putting a comment for those who might wonder why this was closed: Superseded by #37349. |
2d rendering is currently bottlenecked by drawing rects one at a time, limiting OpenGL efficiency. This PR batches rects and renders in fewer drawcalls, resulting in significant performance improvements. This also speeds up text rendering.
The code dynamically chooses between a vertex format with and without color, depending on the input data for a frame, in order to optimize throughput and maximize batch size.
Notes
Although it only batches rects to start with, the same framework can be used to batch the other primitives.
Can only batch within _canvas_item_render_commands. Notably this covers tilemaps, text, the _draw API but NOT separate sprites (we evaluated this and it would require far more complex changes).
There is a high speed simple growable pod dynamic array template included. For now it is included in the drivers/gles2 directory.
I hesitate to put it in core as it is really only intended for these rasterizers and is not general purpose. It will be very easy to replace when we have a high speed, non-COW vector in the future.
I'm using indexed primitives so that only 4 verts are required per quad. Indices in GLES2 however can only be 16 bit, so they are limited to addressing 65535 verts (from the origin). For this reason there is a maximum size for a vertex buffer, 1 quad less than would fit into 65535. When this occurs, the routine ends the batch, draws the current batches, resets and starts filling batches again where it left off.
Swapping between colored / non-colored vertex format is currently done based on a simple heuristic at runtime. It is also possible to use hysteresis, or to allow the user to switch manually, however this has not been implemented yet, it may not be required.
Now tested on Linux, Windows 10, Android, WebGL. The more testing the better, and should anyone be able to try it on their devices / more platforms that would be welcome.
Because there is still the possibility for bugs, I have added (temporarily) a Project setting :
rendering/quality/2d/use_batching
which defaults to true, but can be turned off if there are problems. The legacy non-batched method is still included.Just in case it affects people being able to even start up the editor in 2d, I have disabled the batching in the IDE until we have some feedback. If all goes well it could be enabled in the IDE by default with another PR.
What kind of speed increases can I expect?
It depends on a number of factors, including what you are rendering and what the hardware is. Benefit roughly scales with the number of quads. There will be less benefit in games that are fill rate limited already. In my tests with 2d games / tests, frame rate increases of 2-10x were typical. I was getting larger speed increases on desktop than on my tablet.
Note there are some cases (notably bunnymark v2) which can't be batched currently, see below. Any more figures doing the comparison on different hardware would be welcome.
Increases will be higher in release build. With some more tweaks I have now clocked 58x increase in frame rate in vertex throughput limited scene ( #19917 ).