-
-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using the slowest data structure almost every time. #23998
Comments
This comment has been minimized.
This comment has been minimized.
Curious to see what you measure. |
So this can explain why a simple Light2D node in Godot can burn your computer? |
To clarify, this is about CPU side inefficiencies all over the code. This has nothing to do with godot features or godot GPU rendering itself. |
@vblanco20-1 on a bit of a side-tangent, you and I had talked about the nodes being modeled as ECS entities instead. I wonder if the trick is to do a feature branch of godot with a new ent module that would gradually work side-by-side with the tree. like a get_tree() and get_registry(). the ent module would probably miss like 80% of the functionality of the tree/scene, but it could be useful for as a testbed, especially for stuff like composing large static levels with lots of objects (culling, streaming, batch rendering). Reduced functionality and flexibility but greater perfromance. |
Before going full ECS (wich i might do) i want to work on some low hanging fruits as a experiment. I might try to go full data-oriented later down the line. |
So, first updates: update_dirty_instances : from 0.2-0.25 miliseconds to 0.1 milisecond The fun part? the replacement for octree cull is not using any acceleration structure, it just iterates over a dumb array with all the AABBs. The even funnier part? The new culling is 10 lines of code. If i wanted to have it parallel, it would be a single change to the line. Ill continue with implementing my new culling with the lights, to see how much the speedup accumulates. |
we should probably get a google benchmark directory going too on the master branch. That may help with validating so people don't have to argue about it. |
There is the godotengine/godot-tests repository, but not many people are using it yet. |
New update: Ive implemented a fairly dumb spatial acceleration structure, with 2 levels. Im just generating it every frame Further improvement could make it better and update it dynamically instead of remaking it. In the image, the different rows are different scopes, and each of the colored block is a task/chunk of code. In that capture, im doing both the octree cull and my cull side by side to have a direct comparaison It demonstrates that recreating the entire acceleration structure is faster than a single old octree cull. I havent found a single case where my dumb structure is slower than the current octree, other that the cost of the initial creation. Another bonus is that its very multithread friendly, and can be scaled to whatever core count you want just with a parallel for. It also checks the memory completely contigously, so it should be possible to make it do SIMD without much hassle. Another important detail is that its vastly simpler than the current octree, with way less lines of code. |
You are causing more hype that the end of game of thrones.... |
Modified the algos a bit so now they can use C++17 parallel algorythms, wich allows the programmer to just tell it to do a parallel for, or a parallel sort. My culler is now 10 times faster than godot octree for the main view. |
If you want to look at the changes, the main important thing is here: |
This should compile with VS2015? Console throws a bunch of errors about the files inside entt\ |
@Ranoller it's c++17 so make sure you're using newer compiler |
Uh, I think Godot did not make the move to c++17 yet? At least I recall some discussions on the topic? |
Godot is C++ 03. This move will be controversial. I recomend @vblanco20-1 to talk with Juan when he finish hollydays... this optimization will be great and we don't want an instantly "This Will Never Happend TM |
how many objects? |
Around 2200. The godot octree performs better if the check is very small, because it early outs. The bigger the query, the slower godot octree is compared to my solution. If the godot octree cant early-out 90% of the scene, then its dreadfully slow, because its essentially iterating linked lists for every single object, while my system is structure of arrays where each cache line holds a good amount of AABBs, greatly reducing cache misses. My system works by having 2 levels of AABBs, its an array of blocks, where every block can hold 128 instances. The top level is mostly an array of AABBs + an array of Blocks. If the AABB check passes, then i iterate the block. The block is array of structure of AABBs, Mask, and a pointer to the instance. This way everything is faaaaast. Right now, the generation of the top level data structure is done in a very dumb way. If it was better generated, its performance could be a lot bigger. Im performing some experiments with different block sizes other than 128. |
Ive checked different numbers, and 128 per block seems to still be the sweetspot somehow. By changing the algorythm to separate "big" objects from small objects, ive managed to get another 30% speed upgrade, mostly in the case where you are not looking into the center of the map. It works becouse that way the big objects dont bloat the block AABBs that also include small objects. I believe that 3 sizes would probably work the best. Block generation is still not optimal at all. The big blocks only cull about 10 to 20% of the instances when looking at the center, and up to 50% when looking outside the map, so its performing a LOT of extra AABB checks than what it should need. I think an improvement would probably be to reuse the current octree but "flatten" it. There is now not a single case where my algorithm is equal or worse than the current octree, even without parallel execution. |
@vblanco20-1 I assume you are doing your comparison with godot compiled in |
@Faless Its release_debug, wich does add some optimizations. I havent been able to test full "release" as i cant get it to open the game (game needs to be built too?) Ive had an idea on how to improve the culling a lot more, removing the need to regenerate every frame, and creating better spatial optimization. Ill see what i can try. In theory such a thing could remove the regenerate, improve the perf of the main cull a bit, and have a specific mode for pointlights which would increase their cull performance by a huge amount, so much that they would have nearly 0 cost, because it will turn into a cheap O(1) to get the "objects in a radius". Its inspired by how one of my experiments worked, where im doing 400.000 physics objects bouncing from each other. |
|
I'm closing this thread, because I don't think it's the right way to contribute or help. @vblanco20-1 Again, I really appreciate that you have good intentions, but you don't understand any of the inner workings of the engine, or the philosophy behind it. Most of your comments are towards things you don't really understand how they are used, how critical they are to performance, or what their priorities are in general. Your tone is also unnecessarily aggressive, too. Instead of asking what you don't understand, or why something is done in a certain way, you just go all out arrogant. This is not the right attitude for this community. Most of the things you are mentioning regarding optimization are only a very small surface of the amount of items that I planned for optimization in Godot 4.0 (there are plenty more things in my list). I already told you this would get rewritten several times since several months ago. If you don't believe me, it's fine, but I feel you are wasting your own time with this and confusing a lot of people for no obvious reason. I obviously welcome very much your feedback once I'm done with my work, but all you are doing now is beating a dead horse, so just chill for a while. |
Again, when an alpha of Godot 4.0 is rolled out (hopefully before the end of this year), you are welcome to profile all the new code and give feedback on how further optimize it. I'm sure that we will all benefit. For now, there is not much of a point in discussing anything else here since existing code will go away in 4.0, and nothing will get merged in the 3.x branch, where at this point stability is more important than optimizations. Since many may be curious about the technical details:
Again, you can ask about plans and design anytime you want instead of going rogue and complain, this is not the best way to help the project. I'm also sure that many reading this thread are wondering why the spatial indexer was slow to begin with. The reason is that, maybe you are new to Godot, but until very recently the 3D engine was more than 10 years old and extremely outdated. Work was done to modernize it in OpenGL ES 3.0, but we had to desist due to problems we found in OpenGL and the fact it was deprecated for Vulkan (and Apple abandoned it). Added to this, Godot used to run on devices like the PSP not too long ago (which had only 24mb of memory available for both engine and game, so a lot of the core code is very conservative regarding memory allocation). As hardware is very different now, this is being changed for code that is more optimal and uses more memory, but when it's not needed doing this work is pointless and it's fine that you'll see lists used in a lot of places where performance does not matter. Also, many optimizations we wanted to do (moving many of the mutex code to atomics for better performance) had to be put on hold until we could move Godot to C++11 (which has much better support for inlined atomics and it does not requiere you to include windows headers, which pollute the whole namespace), which was not something we could do in a stable branch. The move to C++11 will take place after Godot 3.2 is branched and feature freezed, otherwise keeping Vulkan an Master branches in sync would be a huge pain. There is not much of a hurry, as currently focus is now on Vulkan itself. Sorry, things take time, but I prefer they are done properly rather than rushing them. It pays off better in the long term. Right now all the performance optimizations are being worked on and should be ready soon (if you tested the Vulkan branch, the 2D engine is way faster than it used to be already). |
Hi reduz,
I highly doubt that a linked list will have better overall performance, let it be speed or memory efficiency, than a dynamic array with exponential growth. The latter is guaranteed to occupy at most twice as much space as it actually uses, while a When properly wrapped (and as much I can tell from what I've seen already of the godot code, they are), they look pretty much the same to the programmer, so I don't understand what you mean with "they [Lists] make the code intent more clear". IMHO, Lists are valid exactly in two conditions:
libc-allocators are usually quite smart, and since an OAHashMap (or a std::unordered_map) reallocate their storages from time to time (amortized constant-time), the allocator usually manages to keep its memory blocks compact. I firmly believe that an OAHashMap does not effectively consume more heap than a plain binary tree map like Map. Instead, I'm quite sure that the huge pointer overhead in each element of Map actually consumes more memory than any heap fragmentation of OAHashmap (or std::unordered_map). After all, I think the best way to sort out these kinds of arguments is to benchmark it. Surely, this is of much greater use for Godot 4.0, since -- as you said -- lots of performance optimizations will happen there and there is not much use of improving code paths that may well be completely rewritten in 4.0 anyway. But @reduz, what do you think of benchmarking all these changes that @vblanco20-1 suggested (maybe even now, in 3.1). If @vblanco20-1 (or anyone else) is willing to invest the time of writing such a benchmarking suite and to evaluate Godot3.1's performance (both in terms of speed and "heap consumption considering fragmentation") against vblanco's changes? It might yield valuable hints for the actual 4.0 changes. I think that such a methodology fits your "[things] are done properly rather than rushing them" well. @vblanco20-1: In fact, I appreciate your work. Would you be motivated to create such benchmarks, so we can actually measure whether your changes are actual performance improvements? I would be very interested. |
@Windfisch I suggest you re-read my post above, as you have misread or misunderstood a few things. I'll clarify them for you.
In any case, I promised Victor that indexing code can be pluggable, so eventually different algorithms for different types of games can be implemented, too. Godot is open source and so is his fork. We are all open and sharing here, nothing should stop you from using his work if you need it. |
Since that optimizations seems not to affect gdscript, render or the "stuttering" problems, and there are the things that people complains (me includes), maybe with acomplish that optimizations people will be happy (me includes)... doesn´t need lua jit velocity... |
Thanks @reduz for your clarification. It indeed made your points more clear than the previous postings. It's good that the spatial indexing code will be pluggable, because indeed that one fell on my feet before when handling lots of objects at vastly different scales. Looking forward to 4.0. |
I was also thinking about it and I think it may be a good idea to put some shared document with ideas on optimizing spatial indexing, so more contributors can learn about this and also throw ideas or do implementations. I have very good idea about what needs to be done, but I'm sure there is a lot of room here for doing further optimizations and coming up with interesting algorithms. We can put very clear requirements that we know algorithms have to meet (as in, not requiring user tweaking if possible for average elements in the world size, not brute-forcing stuff with threads if possible -they are not free, other parts of the engine may need them too, like physics or animation and they consume more battery on mobile-, compatibility with re-projection based occlusion culling -so, some form of hierarchy may be desired, but should be tested against brute force too-, be smart about shadow buffer updates -don't update if nothing changed-, explore optimizations such as re-projection based occlusion culling for directional shadows, etc). We can also discuss creation of some benchmark tests (I don't think the TPS demo is a good benchmark because it doesn't have as many objects or occlusion). @vblanco20-1 if you are willing to follow our coding/language style and philosophy you are of course more than welcome to lend a hand. The rest of the rendering code (actual rendering using RenderingDevice) is more or less straightforward and there's not many ways to do it, but indexing seems like a more interesting problem to solve for optimization. |
@reduz for reference on the spatial indexing. The original tile based culling was removed and replaced by an Octree about halfway through this thread. The octree i got there is WIP (missing some re-fit functionality), but the results are quite good for the prototype. Its code is not that good from its prototype nature, so its only useful to check how this kind of octree would perform in a complex scene such as tps-demo. The main idea is that only the leafs on the octree hold objects, and those objects are held on an array of size 64 (compile time size, can be different). An octree leaf will only split once it "overflows" into 65 elements. When you remove objects, if each of the leafs of the parent node fit into the 64 size array, then we merge the leafs back into their parent node, which becomes a leaf. By doing that, we can minimize the time testing on the nodes, as the octree wont end up being too deep. Another good thing i was doing is that the leaf nodes are also stored in a flat array, which allows a parallel-for in the culling. This way the hierarchical cull can be used when doing cull for point shadows, or other "small" operations, and the flat parallel-for culling can be used for the main view. Of course could just use hierarchical for everything, but could be slower and cant be parallelized. The block structure will waste a bit of memory, but even in worst case scenarios i dont think it will waste a lot of memory as the nodes will merge if they drop below an amount. It also allows using a pool allocator given that both the nodes and the leafs will have a constant size. I also have multiple octrees in my fork, which is used to optimize some things. For example i have an octree only for shadowcaster objects, which allows skipping all the logic related to "can cast shadows" when culling for shadowmaps. On my concerns for Vector and others on the render API, this issue explains what i was worrying about. #24731 On the library and C++17 stuff of the fork.. thats unnecesary. The fork overuses a lot of libraries because i needed some parts of them. The only thing thats really needed, and that i think godot needs, is a industry-strenght parallel queue, i used moodycamel queue for it. On the lambdas, their usage is mainly for culling, and its used to save a significant amount of memory, as you only need to save the objects that pass your checks into the output array. The alternative is to do iterator objects (thats what unreal engine does with their octree), but it ends up being worse code and a lot harder to write. My intention was not to "go rogue", but to answer frequent comments on "you are free to make a fork to demonstrate", which is exactly what ive done. The first message on the thread is a bit alarmist and not very correct. Sorry if i appeared rude to you, as my only goal is to help the most popular open source game engine. |
@vblanco20-1 That sounds great, and the way you mention the octree works makes a lot of sense. I honestly don't mind implementing a parallel queue (seems simple enough for not needing an external dependency, and need C++11 to do it properly, so guess that will happen only on the Vulkan branch). If you have this commited, I will definitely want to take a look at it, so I can use it as a basis for the indexer rewrite in the Vulkan branch (I'm just rewriting most stuff there, so it needs to be rewritten anyway). Of course if you want to aid with the implementation (probably when there are more things in place with the new API), it's super welcome. The use of a parallel version with flattened arrays is interesting, but my point of view on it is that it won't be that useful for occlusion culling, and it would be interesting to measure how much of an improvement over regular octree culling is, considering the extra amount of cores used. Keep in mind there are many other areas that may be using multiple threads that may make more efficient (less brute force) use of it (like physics), so if even if with this approach we get a relatively marginal improvement, it is not always in the best of interests, as it may starve other areas from using CPU. Maybe it could be optional. I also find Bullet 3's implementation of dbvt pretty interesting, which does incremental self balancing and linear allocation, it's one of the things I wanted to research more (I've seen this approach implemented in proprietary engines I can't mention :P ), as algorithm wise, a balanced binary tree is by design just a lot less redundant than an octree, and may work better in both pairing and occlusion tests, so a pluggable broadphase/culling aproach may be actually a good idea, given we make proper useful benchmark scenes. In any case, you are exceptionally smart and it would be great if we can work together on this, but I will just ask you to please understand and respect many of the existing design philosophies we have, even if they may not always suit your taste. The project is huge, with a lot of inertial force, and changing things for the sake of tastes is just not a good idea. User friendliness and "just works" is always first on the list regarding design, and simplicity and readability of code usually takes priority over efficiency (that is, when efficiency is not a goal because an area is not critical). Last time we discussed, you wanted to change pretty much the whole thing and did not want to listen to any reasoning or focus on actual problems, and that's not how we usually work with the community and contributors, which is why my advice to chill was well intended. |
This comment has been minimized.
This comment has been minimized.
Edit: Sorry, maybe I am some offtopic but I wanted to clarify the benefits of open source.
I don't think so. The nature of open source is just that you can use and reuse the code freely as you want. So that will not be an end of some project but more options for the users. Anyway the original developers can just merge improvements from the fork. They are free to do it always. That is another nature of the open source world. And in the worst of the cases the fork will be better than the original and a lot of contributer will go there. Nobody lost, all win. Oh, sorry, if a company is behind the original maybe they lost (or they can merge the improvements from the fork too). |
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
Could we get this thread locked and moved to a dedicated and well formed conversation about specific performance improvements to RFC(Godot Proposals). I don't think this thread has anything meaningful add to the conversation around performance issues in Godot any longer(though I do believe some performance improvements would be nice but now this is just noise), I think something more directly actionable should be preferred. cc: @Calinou @akien-mga ? |
If locking of the thread is imminent, it might be nice to have some links at the end referencing the more actionable versions of the proposals raised in this thread. So if anyone is aware of such proposals/links, please share them now. 🔗 |
@swarnimarun We don't lock threads unless there is a spam issue or other form of bad actor activity. One random reply after 4 years is not a valid reason for it. We prefer issues are open for discussion at all times, because some may find even the old threads helpful. |
Just to note on my side. This is a very old issue, almost five years old. But somehow it keeps resurfacing in social media, so I think its a good time to add some clarifications. Everything mentioned originally has been 100% corrected (and I want to thank @vblanco20-1 who provided many of the original suggestions):
Even though what is mentioned on this issue has already been taken care of, there still is work pending to do optimization wise in Godot 4:
But the situation now is day and night compared to 2018. There has been literally hundreds of pull requests geared to optimize Godot in the meantime, to the point that Godot 4 is pretty much all new code rewritten from scratch. There is also a lot of continued work being done to optimize different areas and this has not stopped. |
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
I'm locking temporarily to stop the noise from people who keep asking to lock this. Please stop commenting about this. I'll reopen it then so people who actually have something to say on the topics of this issue can do it. Edit: Unlocked. |
Ill comment as i see people still pinging me about this issue. Modern godot has fixed most of the issues detailed here, and the architecture has changed enough that the 2 versions are not comparable. Even the tps demo used as the benchmark has changed significantly. |
MODERATOR NOTE: This issue is over five years old and discusses a reality that no longer matches the status quo of the codebase. For more information, scroll to this comment.
--
I was reviewing the code of godot, and i`ve found complete disregard for any kind of cpu side performance, down to the core data structures. Its using List everywhere, when a linked list is the slowest data structure you can use on modern PCs. Almost any other data structure would be faster, specially on small sized structures. Its specially egregious on things like the light list or reflection captures in renderables, where you are storing 3 pointers every time, instead of just giving it a stack array of 8 max lights (for example) + an extra for the cases where it would be more than that.
Same thing with RID_Owner being a tree where it could be a hash map, or a slotmap. Also the Octree implementation for culling has the exact same issue.
I want to ask about the design intention behind that total and absolute overuse of linked lists and dreadful performance pointer heavy data structures everywhere in the code. Is that for a specific reason? In most of the cases a "chunked" linked list, where you do a linked list of arrays would automatically bring performance gains on the same code.
The use of these data structures also prevents any kind of "easy" parallelism in a good chunk of the code and completely trashes the cache.
I am working on making a proof of concept implementation of refactoring some of the internals to use better data structures. At this moment i am working on a rewrite of the culling code which bypasses the current octree and just uses a flat array with optional parallelism. Ill use the tps-demo as benchmark and come back with results, which in fact i have benchmarked to go 25 levels deep on that octree...
On another more happy note, i am very impressed with the code style quality, everything is easy to follow and understand, and quite well commented.
The text was updated successfully, but these errors were encountered: