Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using the slowest data structure almost every time. #23998

Closed
vblanco20-1 opened this issue Nov 26, 2018 · 73 comments
Closed

Using the slowest data structure almost every time. #23998

vblanco20-1 opened this issue Nov 26, 2018 · 73 comments

Comments

@vblanco20-1
Copy link

vblanco20-1 commented Nov 26, 2018

MODERATOR NOTE: This issue is over five years old and discusses a reality that no longer matches the status quo of the codebase. For more information, scroll to this comment.

--

I was reviewing the code of godot, and i`ve found complete disregard for any kind of cpu side performance, down to the core data structures. Its using List everywhere, when a linked list is the slowest data structure you can use on modern PCs. Almost any other data structure would be faster, specially on small sized structures. Its specially egregious on things like the light list or reflection captures in renderables, where you are storing 3 pointers every time, instead of just giving it a stack array of 8 max lights (for example) + an extra for the cases where it would be more than that.

Same thing with RID_Owner being a tree where it could be a hash map, or a slotmap. Also the Octree implementation for culling has the exact same issue.

I want to ask about the design intention behind that total and absolute overuse of linked lists and dreadful performance pointer heavy data structures everywhere in the code. Is that for a specific reason? In most of the cases a "chunked" linked list, where you do a linked list of arrays would automatically bring performance gains on the same code.

The use of these data structures also prevents any kind of "easy" parallelism in a good chunk of the code and completely trashes the cache.

I am working on making a proof of concept implementation of refactoring some of the internals to use better data structures. At this moment i am working on a rewrite of the culling code which bypasses the current octree and just uses a flat array with optional parallelism. Ill use the tps-demo as benchmark and come back with results, which in fact i have benchmarked to go 25 levels deep on that octree...

On another more happy note, i am very impressed with the code style quality, everything is easy to follow and understand, and quite well commented.

@mafiesto4

This comment has been minimized.

@ghost
Copy link

ghost commented Nov 27, 2018

Curious to see what you measure.

@Ranoller
Copy link
Contributor

So this can explain why a simple Light2D node in Godot can burn your computer?

@CptPotato
Copy link
Contributor

CptPotato commented Nov 27, 2018

@Ranoller I don't think so. The bad lighting performance is most likely related to how Godot performs 2D lighting at the moment: Rendering every sprite n-times (with n being the number of lights that are affecting it).

Edit: see #23593

@vblanco20-1
Copy link
Author

To clarify, this is about CPU side inefficiencies all over the code. This has nothing to do with godot features or godot GPU rendering itself.

@pgruenbacher
Copy link

pgruenbacher commented Nov 27, 2018

@vblanco20-1 on a bit of a side-tangent, you and I had talked about the nodes being modeled as ECS entities instead. I wonder if the trick is to do a feature branch of godot with a new ent module that would gradually work side-by-side with the tree. like a get_tree() and get_registry(). the ent module would probably miss like 80% of the functionality of the tree/scene, but it could be useful for as a testbed, especially for stuff like composing large static levels with lots of objects (culling, streaming, batch rendering). Reduced functionality and flexibility but greater perfromance.

@vblanco20-1
Copy link
Author

Before going full ECS (wich i might do) i want to work on some low hanging fruits as a experiment. I might try to go full data-oriented later down the line.

@vblanco20-1
Copy link
Author

vblanco20-1 commented Dec 1, 2018

So, first updates:

update_dirty_instances : from 0.2-0.25 miliseconds to 0.1 milisecond
octree_cull (the main view): from 0.35 miliseconds to 0.1 milisecond

The fun part? the replacement for octree cull is not using any acceleration structure, it just iterates over a dumb array with all the AABBs.

The even funnier part? The new culling is 10 lines of code. If i wanted to have it parallel, it would be a single change to the line.

Ill continue with implementing my new culling with the lights, to see how much the speedup accumulates.

@pgruenbacher
Copy link

we should probably get a google benchmark directory going too on the master branch. That may help with validating so people don't have to argue about it.

@bojidar-bg
Copy link
Contributor

bojidar-bg commented Dec 2, 2018

There is the godotengine/godot-tests repository, but not many people are using it yet.

@vblanco20-1
Copy link
Author

vblanco20-1 commented Dec 2, 2018

New update:

image

Ive implemented a fairly dumb spatial acceleration structure, with 2 levels. Im just generating it every frame Further improvement could make it better and update it dynamically instead of remaking it.

In the image, the different rows are different scopes, and each of the colored block is a task/chunk of code. In that capture, im doing both the octree cull and my cull side by side to have a direct comparaison

It demonstrates that recreating the entire acceleration structure is faster than a single old octree cull.
Also that further culls tend to be about half to one third of the current octree.

I havent found a single case where my dumb structure is slower than the current octree, other that the cost of the initial creation.

Another bonus is that its very multithread friendly, and can be scaled to whatever core count you want just with a parallel for.

It also checks the memory completely contigously, so it should be possible to make it do SIMD without much hassle.

Another important detail is that its vastly simpler than the current octree, with way less lines of code.

@Ranoller
Copy link
Contributor

Ranoller commented Dec 2, 2018

You are causing more hype that the end of game of thrones....

@vblanco20-1
Copy link
Author

image

Modified the algos a bit so now they can use C++17 parallel algorythms, wich allows the programmer to just tell it to do a parallel for, or a parallel sort.
Speedup of about x2 in the main frustrum cull, but for the lights is about same speed as before.

My culler is now 10 times faster than godot octree for the main view.

@vblanco20-1
Copy link
Author

If you want to look at the changes, the main important thing is here:
https://github.com/vblanco20-1/godot/blob/4ab733200faa20e0dadc9306e7cc93230ebc120a/servers/visual/visual_server_scene.cpp#L387
This is the new culling function. The acceleration structure is on the function right above it.

@Ranoller
Copy link
Contributor

Ranoller commented Dec 2, 2018

This should compile with VS2015? Console throws a bunch of errors about the files inside entt\

@pgruenbacher
Copy link

@Ranoller it's c++17 so make sure you're using newer compiler

@Zireael07
Copy link
Contributor

Uh, I think Godot did not make the move to c++17 yet? At least I recall some discussions on the topic?

@Ranoller
Copy link
Contributor

Ranoller commented Dec 3, 2018

Godot is C++ 03. This move will be controversial. I recomend @vblanco20-1 to talk with Juan when he finish hollydays... this optimization will be great and we don't want an instantly "This Will Never Happend TM

@nem0
Copy link

nem0 commented Dec 3, 2018

@vblanco20-1

octree_cull (the main view): from 0.35 miliseconds to 0.1 milisecond

how many objects?

@vblanco20-1
Copy link
Author

vblanco20-1 commented Dec 3, 2018

@nem0

@vblanco20-1

octree_cull (the main view): from 0.35 miliseconds to 0.1 milisecond

how many objects?

Around 2200. The godot octree performs better if the check is very small, because it early outs. The bigger the query, the slower godot octree is compared to my solution. If the godot octree cant early-out 90% of the scene, then its dreadfully slow, because its essentially iterating linked lists for every single object, while my system is structure of arrays where each cache line holds a good amount of AABBs, greatly reducing cache misses.

My system works by having 2 levels of AABBs, its an array of blocks, where every block can hold 128 instances.

The top level is mostly an array of AABBs + an array of Blocks. If the AABB check passes, then i iterate the block. The block is array of structure of AABBs, Mask, and a pointer to the instance. This way everything is faaaaast.

Right now, the generation of the top level data structure is done in a very dumb way. If it was better generated, its performance could be a lot bigger. Im performing some experiments with different block sizes other than 128.

@vblanco20-1
Copy link
Author

vblanco20-1 commented Dec 3, 2018

Ive checked different numbers, and 128 per block seems to still be the sweetspot somehow.

By changing the algorythm to separate "big" objects from small objects, ive managed to get another 30% speed upgrade, mostly in the case where you are not looking into the center of the map. It works becouse that way the big objects dont bloat the block AABBs that also include small objects. I believe that 3 sizes would probably work the best.

Block generation is still not optimal at all. The big blocks only cull about 10 to 20% of the instances when looking at the center, and up to 50% when looking outside the map, so its performing a LOT of extra AABB checks than what it should need.

I think an improvement would probably be to reuse the current octree but "flatten" it.

There is now not a single case where my algorithm is equal or worse than the current octree, even without parallel execution.

@Faless
Copy link
Collaborator

Faless commented Dec 11, 2018

@vblanco20-1 I assume you are doing your comparison with godot compiled in release mode (which uses -O3) and not the regular editor build in debug (which has no optimizations and is default) right?
Sorry if it's a silly question but I see no mention of that in the thread.
Nice work anyway :)

@vblanco20-1
Copy link
Author

vblanco20-1 commented Dec 11, 2018

@Faless Its release_debug, wich does add some optimizations. I havent been able to test full "release" as i cant get it to open the game (game needs to be built too?)

Ive had an idea on how to improve the culling a lot more, removing the need to regenerate every frame, and creating better spatial optimization. Ill see what i can try. In theory such a thing could remove the regenerate, improve the perf of the main cull a bit, and have a specific mode for pointlights which would increase their cull performance by a huge amount, so much that they would have nearly 0 cost, because it will turn into a cheap O(1) to get the "objects in a radius".

Its inspired by how one of my experiments worked, where im doing 400.000 physics objects bouncing from each other.

@vblanco20-1
Copy link
Author

image

New culling is implemented now. The block generation is now far smarter, resuling in a faster cull. The special case for lights is not implemented yet.
Ive commited it to my fork

@LikeLakers2
Copy link
Contributor

LikeLakers2 commented Dec 13, 2018

I am curious if you can run that benchmark tool (the one from your screenshots) on a build from the current master branch, to give us a reference point for how much performance your implementation is giving over the current implementation. I'm a dummy, woops.

@reduz
Copy link
Member

reduz commented Jul 22, 2019

I'm closing this thread, because I don't think it's the right way to contribute or help.

@vblanco20-1 Again, I really appreciate that you have good intentions, but you don't understand any of the inner workings of the engine, or the philosophy behind it. Most of your comments are towards things you don't really understand how they are used, how critical they are to performance, or what their priorities are in general.

Your tone is also unnecessarily aggressive, too. Instead of asking what you don't understand, or why something is done in a certain way, you just go all out arrogant. This is not the right attitude for this community.

Most of the things you are mentioning regarding optimization are only a very small surface of the amount of items that I planned for optimization in Godot 4.0 (there are plenty more things in my list). I already told you this would get rewritten several times since several months ago. If you don't believe me, it's fine, but I feel you are wasting your own time with this and confusing a lot of people for no obvious reason.

I obviously welcome very much your feedback once I'm done with my work, but all you are doing now is beating a dead horse, so just chill for a while.

@reduz reduz closed this as completed Jul 22, 2019
@reduz
Copy link
Member

reduz commented Jul 22, 2019

Again, when an alpha of Godot 4.0 is rolled out (hopefully before the end of this year), you are welcome to profile all the new code and give feedback on how further optimize it. I'm sure that we will all benefit. For now, there is not much of a point in discussing anything else here since existing code will go away in 4.0, and nothing will get merged in the 3.x branch, where at this point stability is more important than optimizations.

Since many may be curious about the technical details:

  • All the spatial indexing code (what vblanco rewrote) will be replaced by a linear algorithm for culling with multiple threads, combined with an octree for hierarchial occlusion culling and a SAP for overlapping checks, which are most likely the best all rounder algorithms that warrant good performance on any kind of game. Allocation for these structures will be in the same vein as the new RID_Allocator, which is O(1) ). I discussed this with @vblanco20-1 before and explained that his approach does not scale well for all types of games, as it requires user to have a certain degree of expertise to tweak which commonly the typical Godot user is not expected to have. It was also not a good approach for adding occlusion culling.
  • I won't use arrays when lists can be used because lists do small temporal allocations with zero risk of fragentation. In some cases, I prefer to allocate a sectioned array (aligned to pages, so they cause 0 fragmentation) that always grow and never shrink (like in RID_Allocator or the new CanvasItem in the 2D engine Vulkan branch, which now allows you to redraw items with a lot of commands very efficiently), but there has to be a performance reason for this. When lists are used in Godot it's because small allocations are preferred over performance (and actually they make the code intent more clear for others to read).
  • PoolVector is intended for very large allocations with consecutive memory pages. Until Godot 2.1 it used a preallocated memory pool, but this was removed in 3.x and right now the current behavior is wrong. For 4.0 it will be replaced with virtual memory, it's on the list of things to do.
  • Comparing Godot's Vector<> to std::vector is pointless because they have different use cases. We use it mostly to pass data around and we lock it for fast access (via ptr() or ptrw() methods). If we used std::vector, Godot would be much slower due to unnecessary copying. We also take a lot of advantage of the copy on write mechanic for many different uses.
  • Map<> is just simpler and friendlier for large amount of elements and no need to worry about hastable growing/shinking creating fragmentation. When performance is required, HashMap is used instead (though it's true, should probably use OAHashMap more, but it's too new and never had time to do that). As a general philosophy, when performance is not priority, small allocations are always preferred over large ones because it's easier for the memory allocator from your operating system to find tiny holes to put them in (which is pretty much what it is designed to do), effectively consuming less heap.

Again, you can ask about plans and design anytime you want instead of going rogue and complain, this is not the best way to help the project.

I'm also sure that many reading this thread are wondering why the spatial indexer was slow to begin with. The reason is that, maybe you are new to Godot, but until very recently the 3D engine was more than 10 years old and extremely outdated. Work was done to modernize it in OpenGL ES 3.0, but we had to desist due to problems we found in OpenGL and the fact it was deprecated for Vulkan (and Apple abandoned it).

Added to this, Godot used to run on devices like the PSP not too long ago (which had only 24mb of memory available for both engine and game, so a lot of the core code is very conservative regarding memory allocation). As hardware is very different now, this is being changed for code that is more optimal and uses more memory, but when it's not needed doing this work is pointless and it's fine that you'll see lists used in a lot of places where performance does not matter.

Also, many optimizations we wanted to do (moving many of the mutex code to atomics for better performance) had to be put on hold until we could move Godot to C++11 (which has much better support for inlined atomics and it does not requiere you to include windows headers, which pollute the whole namespace), which was not something we could do in a stable branch. The move to C++11 will take place after Godot 3.2 is branched and feature freezed, otherwise keeping Vulkan an Master branches in sync would be a huge pain. There is not much of a hurry, as currently focus is now on Vulkan itself.

Sorry, things take time, but I prefer they are done properly rather than rushing them. It pays off better in the long term. Right now all the performance optimizations are being worked on and should be ready soon (if you tested the Vulkan branch, the 2D engine is way faster than it used to be already).

@Windfisch
Copy link
Contributor

Hi reduz,
while I mostly see that your points are valid, I'd like to comment on two on which I disagree:

  • I won't use arrays when lists can be used because lists do small temporal allocations with zero risk of fragmentation. In some cases, I prefer to allocate a sectioned array (aligned to pages, so they cause 0 fragmentation) that always grow and never shrink (like in RID_Allocator or the new CanvasItem in the 2D engine Vulkan branch, which now allows you to redraw items with a lot of commands very efficiently), but there has to be a performance reason for this. When lists are used in Godot it's because small allocations are preferred over performance (and actually they make the code intent more clear for others to read).

I highly doubt that a linked list will have better overall performance, let it be speed or memory efficiency, than a dynamic array with exponential growth. The latter is guaranteed to occupy at most twice as much space as it actually uses, while a List<some pointer> uses exactly three times as much storage (the actual content, the next and the prev pointer). For a sectioned array, things look even better.

When properly wrapped (and as much I can tell from what I've seen already of the godot code, they are), they look pretty much the same to the programmer, so I don't understand what you mean with "they [Lists] make the code intent more clear".

IMHO, Lists are valid exactly in two conditions:

  • You need to frequently erase/insert elements at the middle of the container
  • or you need constant-time (and not just amortized constant-time) insertion/deletion of elements. I.e., in real-time contexts where a time-consuming allocation of memory isn't possible, they're fine.
  • Map<> is just simpler and friendlier for large amount of elements and no need to worry about hastable growing/shinking creating fragmentation. When performance is required, HashMap is used instead (though it's true, should probably use OAHashMap more, but it's too new and never had time to do that). As a general philosophy, when performance is not priority, small allocations are always preferred over large ones because it's easier for the memory allocator from your operating system to find tiny holes to put them in (which is pretty much what it is designed to do), effectively consuming less heap.

libc-allocators are usually quite smart, and since an OAHashMap (or a std::unordered_map) reallocate their storages from time to time (amortized constant-time), the allocator usually manages to keep its memory blocks compact. I firmly believe that an OAHashMap does not effectively consume more heap than a plain binary tree map like Map. Instead, I'm quite sure that the huge pointer overhead in each element of Map actually consumes more memory than any heap fragmentation of OAHashmap (or std::unordered_map).

After all, I think the best way to sort out these kinds of arguments is to benchmark it. Surely, this is of much greater use for Godot 4.0, since -- as you said -- lots of performance optimizations will happen there and there is not much use of improving code paths that may well be completely rewritten in 4.0 anyway.

But @reduz, what do you think of benchmarking all these changes that @vblanco20-1 suggested (maybe even now, in 3.1). If @vblanco20-1 (or anyone else) is willing to invest the time of writing such a benchmarking suite and to evaluate Godot3.1's performance (both in terms of speed and "heap consumption considering fragmentation") against vblanco's changes? It might yield valuable hints for the actual 4.0 changes.

I think that such a methodology fits your "[things] are done properly rather than rushing them" well.

@vblanco20-1: In fact, I appreciate your work. Would you be motivated to create such benchmarks, so we can actually measure whether your changes are actual performance improvements? I would be very interested.

@reduz
Copy link
Member

reduz commented Jul 22, 2019

@Windfisch I suggest you re-read my post above, as you have misread or misunderstood a few things. I'll clarify them for you.

  • Lists are used exactly for the use case you describe, and I never claimed that they have better performance. They are more efficient than arrays for reusing heap simply because they are made-up of small allocations. On a scale (when you use them a lot for their intended use case), this really makes a difference. When performance is required, other containers which are faster, more packed or have better cache coherency are used already. For good or bad, Victor mostly focused in one of the older engine areas (if not the oldest actually) which was never optimized since the times of it being an in-house engine used to publish games for PSP. This had a rewrite pending since a long time, but there were other priorities. His main optimization was the CPU based voxel cone tracing I added recently, which to be honest, I did a bad job with that because it was too rushed near the 3.0 release, but the proper fix for this is a different algorithm entirely, and not adding parallel processing like he did.
  • I never argued about the performance of @vblanco20-1's work and I frankly don't care about it (so you don't need to make him waste time doing benchmarks). The reasons for not merging his work is because 1) The algorithms he uses needs manual tweaking depending on average size of objects in the game, something most Godot users will need to do. I tend to favor algorithms that may be a bit slower but scale better, without the need for tweaks. 2) The algorithm he uses is not good for occlusion culling (simple octree is better due to the hierarchical nature). 3) The algorithm he uses is not good for pairing (SAP is often better). 4) He uses C++17 and libraries I'm not interested in supporting, or lambdas which I believe are unnecessary 5) I'm already working on optimizing this for 4.0, and the 3.x branch has stability as priority and we intend to release 3.2 as soon as possible, so this won't be modified or worked on there. Existing code may be slower, but it's very stable and tested. If this gets merged ant there are bug reports, regressions, etc. No one will have time to work on it or help Victor because we are already mostly busy with 4.0 branch. This is all explained above, so I suggest you re-read the post.

In any case, I promised Victor that indexing code can be pluggable, so eventually different algorithms for different types of games can be implemented, too.

Godot is open source and so is his fork. We are all open and sharing here, nothing should stop you from using his work if you need it.

@Ranoller
Copy link
Contributor

Since that optimizations seems not to affect gdscript, render or the "stuttering" problems, and there are the things that people complains (me includes), maybe with acomplish that optimizations people will be happy (me includes)... doesn´t need lua jit velocity...
Works in "copy on write" was a very big performance optimization in a plugin of mine (from 25 seconds in a script parse to only 1 second in a script of 7000 lines)... i feel that this kind of optimizations are that we need, in gdscript, in render and accomplish the stuttering problem... that´s all.

@Windfisch
Copy link
Contributor

Thanks @reduz for your clarification. It indeed made your points more clear than the previous postings.

It's good that the spatial indexing code will be pluggable, because indeed that one fell on my feet before when handling lots of objects at vastly different scales. Looking forward to 4.0.

@reduz
Copy link
Member

reduz commented Jul 23, 2019

I was also thinking about it and I think it may be a good idea to put some shared document with ideas on optimizing spatial indexing, so more contributors can learn about this and also throw ideas or do implementations. I have very good idea about what needs to be done, but I'm sure there is a lot of room here for doing further optimizations and coming up with interesting algorithms.

We can put very clear requirements that we know algorithms have to meet (as in, not requiring user tweaking if possible for average elements in the world size, not brute-forcing stuff with threads if possible -they are not free, other parts of the engine may need them too, like physics or animation and they consume more battery on mobile-, compatibility with re-projection based occlusion culling -so, some form of hierarchy may be desired, but should be tested against brute force too-, be smart about shadow buffer updates -don't update if nothing changed-, explore optimizations such as re-projection based occlusion culling for directional shadows, etc). We can also discuss creation of some benchmark tests (I don't think the TPS demo is a good benchmark because it doesn't have as many objects or occlusion). @vblanco20-1 if you are willing to follow our coding/language style and philosophy you are of course more than welcome to lend a hand.

The rest of the rendering code (actual rendering using RenderingDevice) is more or less straightforward and there's not many ways to do it, but indexing seems like a more interesting problem to solve for optimization.

@vblanco20-1
Copy link
Author

vblanco20-1 commented Jul 23, 2019

@reduz for reference on the spatial indexing. The original tile based culling was removed and replaced by an Octree about halfway through this thread. The octree i got there is WIP (missing some re-fit functionality), but the results are quite good for the prototype. Its code is not that good from its prototype nature, so its only useful to check how this kind of octree would perform in a complex scene such as tps-demo.
It is inspired by how the unreal engine octree works, but with some modifications like the possibility of a flat iterate.

The main idea is that only the leafs on the octree hold objects, and those objects are held on an array of size 64 (compile time size, can be different). An octree leaf will only split once it "overflows" into 65 elements. When you remove objects, if each of the leafs of the parent node fit into the 64 size array, then we merge the leafs back into their parent node, which becomes a leaf.

By doing that, we can minimize the time testing on the nodes, as the octree wont end up being too deep.

Another good thing i was doing is that the leaf nodes are also stored in a flat array, which allows a parallel-for in the culling. This way the hierarchical cull can be used when doing cull for point shadows, or other "small" operations, and the flat parallel-for culling can be used for the main view. Of course could just use hierarchical for everything, but could be slower and cant be parallelized.

The block structure will waste a bit of memory, but even in worst case scenarios i dont think it will waste a lot of memory as the nodes will merge if they drop below an amount. It also allows using a pool allocator given that both the nodes and the leafs will have a constant size.

I also have multiple octrees in my fork, which is used to optimize some things. For example i have an octree only for shadowcaster objects, which allows skipping all the logic related to "can cast shadows" when culling for shadowmaps.


On my concerns for Vector and others on the render API, this issue explains what i was worrying about. #24731


On the library and C++17 stuff of the fork.. thats unnecesary. The fork overuses a lot of libraries because i needed some parts of them. The only thing thats really needed, and that i think godot needs, is a industry-strenght parallel queue, i used moodycamel queue for it.


On the lambdas, their usage is mainly for culling, and its used to save a significant amount of memory, as you only need to save the objects that pass your checks into the output array. The alternative is to do iterator objects (thats what unreal engine does with their octree), but it ends up being worse code and a lot harder to write.


My intention was not to "go rogue", but to answer frequent comments on "you are free to make a fork to demonstrate", which is exactly what ive done. The first message on the thread is a bit alarmist and not very correct. Sorry if i appeared rude to you, as my only goal is to help the most popular open source game engine.

@reduz
Copy link
Member

reduz commented Jul 23, 2019

@vblanco20-1 That sounds great, and the way you mention the octree works makes a lot of sense. I honestly don't mind implementing a parallel queue (seems simple enough for not needing an external dependency, and need C++11 to do it properly, so guess that will happen only on the Vulkan branch). If you have this commited, I will definitely want to take a look at it, so I can use it as a basis for the indexer rewrite in the Vulkan branch (I'm just rewriting most stuff there, so it needs to be rewritten anyway). Of course if you want to aid with the implementation (probably when there are more things in place with the new API), it's super welcome.

The use of a parallel version with flattened arrays is interesting, but my point of view on it is that it won't be that useful for occlusion culling, and it would be interesting to measure how much of an improvement over regular octree culling is, considering the extra amount of cores used. Keep in mind there are many other areas that may be using multiple threads that may make more efficient (less brute force) use of it (like physics), so if even if with this approach we get a relatively marginal improvement, it is not always in the best of interests, as it may starve other areas from using CPU. Maybe it could be optional.

I also find Bullet 3's implementation of dbvt pretty interesting, which does incremental self balancing and linear allocation, it's one of the things I wanted to research more (I've seen this approach implemented in proprietary engines I can't mention :P ), as algorithm wise, a balanced binary tree is by design just a lot less redundant than an octree, and may work better in both pairing and occlusion tests, so a pluggable broadphase/culling aproach may be actually a good idea, given we make proper useful benchmark scenes.

In any case, you are exceptionally smart and it would be great if we can work together on this, but I will just ask you to please understand and respect many of the existing design philosophies we have, even if they may not always suit your taste. The project is huge, with a lot of inertial force, and changing things for the sake of tastes is just not a good idea. User friendliness and "just works" is always first on the list regarding design, and simplicity and readability of code usually takes priority over efficiency (that is, when efficiency is not a goal because an area is not critical). Last time we discussed, you wanted to change pretty much the whole thing and did not want to listen to any reasoning or focus on actual problems, and that's not how we usually work with the community and contributors, which is why my advice to chill was well intended.

@BlueCannonBall

This comment has been minimized.

@Anyeos
Copy link

Anyeos commented Mar 18, 2020

Edit: Sorry, maybe I am some offtopic but I wanted to clarify the benefits of open source.

@mixedCase I would be extremely hesitant to support any kind of fork of Godot. Forking and fragmentation is often the death of open source projects.

I don't think so. The nature of open source is just that you can use and reuse the code freely as you want. So that will not be an end of some project but more options for the users.
What you say is a monopoly, and that is not what defend the open source. Don't get cheated by companies that pretend to show you that there is better to have full control of something. That is a lie, at least in the open source world, because if someone other have the code you too have the code. What they need to do is to take care of how to handle a community or how to be OK with that.

Anyway the original developers can just merge improvements from the fork. They are free to do it always. That is another nature of the open source world.

And in the worst of the cases the fork will be better than the original and a lot of contributer will go there. Nobody lost, all win. Oh, sorry, if a company is behind the original maybe they lost (or they can merge the improvements from the fork too).

@Koalamana9

This comment was marked as off-topic.

@Saul2022

This comment was marked as off-topic.

@swarnimarun
Copy link
Contributor

Could we get this thread locked and moved to a dedicated and well formed conversation about specific performance improvements to RFC(Godot Proposals). I don't think this thread has anything meaningful add to the conversation around performance issues in Godot any longer(though I do believe some performance improvements would be nice but now this is just noise), I think something more directly actionable should be preferred.

cc: @Calinou @akien-mga ?

@Venryx
Copy link

Venryx commented Sep 20, 2023

If locking of the thread is imminent, it might be nice to have some links at the end referencing the more actionable versions of the proposals raised in this thread. So if anyone is aware of such proposals/links, please share them now. 🔗

@YuriSizov
Copy link
Contributor

@swarnimarun We don't lock threads unless there is a spam issue or other form of bad actor activity. One random reply after 4 years is not a valid reason for it.

We prefer issues are open for discussion at all times, because some may find even the old threads helpful.

@reduz
Copy link
Member

reduz commented Sep 20, 2023

Just to note on my side. This is a very old issue, almost five years old. But somehow it keeps resurfacing in social media, so I think its a good time to add some clarifications.

Everything mentioned originally has been 100% corrected (and I want to thank @vblanco20-1 who provided many of the original suggestions):

  • Linked Lists no longer used in hot paths
  • RID Owner uses O(1) allocation (array)
  • We have a superb implementation of HashMap and HashSet in Godot 4, we even took care to ensure the hashing functions for each Godot datatype generate the best entropy, and hence have the best distribution possible.
  • We use worst case memory allocators (despite the name this is a good thing, means memory grows to accommodate the game needs on demand and stays there) and pools almost everywhere performance matters.
  • Data oriented structures are used now extensively where it matters.
  • Original issues raised regarding culling have both been resolved in both Godot 3 and Godot 4.
  • Threading now is used extensively.
  • The list of improvements since this issue was opened is far too long. Five years is a long time.

Even though what is mentioned on this issue has already been taken care of, there still is work pending to do optimization wise in Godot 4:

  • Multithreading in many areas still needs work (mainly on the scene side, this is very new).
  • Some locking in a few areas could be better.
  • Godot still does not support texture/mesh streaming, which should help a lot with I/O performance.
  • And really many other things.

But the situation now is day and night compared to 2018. There has been literally hundreds of pull requests geared to optimize Godot in the meantime, to the point that Godot 4 is pretty much all new code rewritten from scratch.

There is also a lot of continued work being done to optimize different areas and this has not stopped.

@Zireael07

This comment was marked as off-topic.

@YuriSizov

This comment was marked as off-topic.

@Zireael07

This comment was marked as off-topic.

@rptfrg

This comment was marked as off-topic.

@akien-mga
Copy link
Member

akien-mga commented Sep 20, 2023

I'm locking temporarily to stop the noise from people who keep asking to lock this. Please stop commenting about this.

I'll reopen it then so people who actually have something to say on the topics of this issue can do it.

Edit: Unlocked.

@godotengine godotengine locked and limited conversation to collaborators Sep 20, 2023
@godotengine godotengine unlocked this conversation Sep 20, 2023
@vblanco20-1
Copy link
Author

Ill comment as i see people still pinging me about this issue. Modern godot has fixed most of the issues detailed here, and the architecture has changed enough that the 2 versions are not comparable. Even the tps demo used as the benchmark has changed significantly.
This fork performance upgrades as detailed in posts above were done over the base of godot 3.3, and things have really changed since then. Also the fork is a experimental thing i created for fun so it goes into absurd levels just because i felt like it, like running shadow casting render logic in parallel threads or heavily parallelizing most of the renderer beyond what makes practical sense. It only really works for the TPS demo and its near guaranteed to be bugged for anything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests