Improve core and GDScript VM multithreading performance #98469
Could you try with the |
@RandomShaper Just tested. It is maybe a little better than the master SpinLock, but there is still a big difference between running with and without locks, because locks use shared memory.
Master SpinLock: out4.mp4
Improved SpinLock: out3.mp4
Nice results. I got a few questions:
@CedNaru Yes, 4 bits are actually lost with the new implementation, so I reduced the number of bits for the validator and increased it for the position. It is theoretically possible to devise an algorithm that compresses the data, but that would add overhead.
The idea is that the ObjectDB expands exponentially so as not to use too much memory.
Yes, this is a cache line trick: each mutex must be cache-line aligned. But #85167 needs to be merged first so that there is no warning and the exact cache line size is known.
With this new method, you end up with 2 arrays anyway, one for the blocks and one for the slots. You can technically split those 31 bits differently than 5 and 26 with no overhead. For example, the block bits could be increased to 10 and the slot bits decreased to 14 (to go back to the same validator size as before); you would end up with 1024 blocks of 16384 objects each. It does remove the exponential growth, but a single block doesn't take much memory. It basically becomes a paged allocator.
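To make the arithmetic of that split concrete, here is a minimal sketch of a fixed 10/14 split. The names (`encode_position`, `decode_position`, `SlotPosition`) are made up for illustration, and the assumption that the position sits in the low 24 bits is mine, not something taken from the PR's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (not the PR's actual one): the low 24 bits of an ID
// hold the position, split into a block index and a slot index in that block.
constexpr uint32_t BLOCK_BITS = 10; // 2^10 = 1024 blocks
constexpr uint32_t SLOT_BITS = 14; // 2^14 = 16384 slots per block
constexpr uint32_t BLOCK_MASK = (1u << BLOCK_BITS) - 1;
constexpr uint32_t SLOT_MASK = (1u << SLOT_BITS) - 1;

struct SlotPosition {
	uint32_t block;
	uint32_t slot;
};

// Pack a block index and a slot index into one position value.
constexpr uint32_t encode_position(uint32_t block, uint32_t slot) {
	return ((block & BLOCK_MASK) << SLOT_BITS) | (slot & SLOT_MASK);
}

// Recover the block and slot indices with a shift and a mask each.
constexpr SlotPosition decode_position(uint32_t position) {
	return { (position >> SLOT_BITS) & BLOCK_MASK, position & SLOT_MASK };
}

static_assert((1u << BLOCK_BITS) == 1024, "block count");
static_assert((1u << SLOT_BITS) == 16384, "slots per block");
```

Decoding is a shift and a mask, which is why a fixed split like this adds no overhead compared to another fixed split.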
Got it. So it's basically to have the guarantee that each mutex lives in a different cache line so they don't fight each other.
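For readers following along, a rough sketch of what "one mutex per cache line" can look like in a reader/writer lock. This is not the PR's implementation: `ShardedRWLock`, `PaddedMutex`, the shard count of 8 and the 64-byte cache line are all assumptions made for the example (C++17 also offers `std::hardware_destructive_interference_size` in `<new>` as a portable hint for the line size).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>

// Assumed cache line size; the real value is platform dependent.
constexpr std::size_t CACHE_LINE = 64;
constexpr std::size_t SHARDS = 8; // Hypothetical shard count.

// alignas rounds the struct size up to a multiple of the cache line,
// so two neighbouring mutexes never share a line.
struct alignas(CACHE_LINE) PaddedMutex {
	std::mutex mutex;
};

class ShardedRWLock {
	std::array<PaddedMutex, SHARDS> shards;

	std::size_t shard_for_this_thread() const {
		return std::hash<std::thread::id>()(std::this_thread::get_id()) % SHARDS;
	}

public:
	// A reader locks only the shard picked for its thread and gets the
	// shard index back so it can unlock the same one.
	std::size_t read_lock() {
		std::size_t i = shard_for_this_thread();
		shards[i].mutex.lock();
		return i;
	}
	void read_unlock(std::size_t i) { shards[i].mutex.unlock(); }

	// A writer locks every shard, always in the same order, which keeps
	// two concurrent writers from deadlocking each other.
	void write_lock() {
		for (PaddedMutex &s : shards) {
			s.mutex.lock();
		}
	}
	void write_unlock() {
		for (PaddedMutex &s : shards) {
			s.mutex.unlock();
		}
	}

	// Exposed only so the layout can be inspected.
	std::uintptr_t shard_address(std::size_t i) const {
		return reinterpret_cast<std::uintptr_t>(&shards[i]);
	}
};
```

In this sketch two readers that hash to the same shard still serialize against each other; giving every thread a truly private mutex would need per-thread registration, which is left out here.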
Tested locally, it works as expected.
Benchmark
PC specifications
- CPU: Intel Core i9-13900K
- GPU: NVIDIA GeForce RTX 4090
- RAM: 64 GB (2×32 GB DDR5-5800 C30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 40)
Note: Project is capped at 60 FPS while running. Uncapped FPS allows for slightly higher results in both cases, but it's less stable over time.
Using a release export template with optimize=speed lto=full.
Before
After
My result isn't quite as high as the one shown in OP, but it's still a noticeable improvement compared to before.
@Calinou Are you sure you used |
When is multithreading slower than single-threading?
Multithreading is slow if many threads are working on shared data. The problem arises when one thread changes data used by other threads: the cached copies held by those threads become invalid, and the processor has to work with the third-level (L3) cache instead.
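A small sketch of the effect, not taken from the PR or the MRP: both functions below compute the same total, but the first writes to one shared atomic on every iteration while the second keeps its work in a thread-local variable and touches shared memory once per thread. The function names are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Every iteration writes to the same atomic, so the cache line holding it
// keeps moving between cores.
uint64_t count_shared(unsigned thread_count, uint64_t iterations) {
	std::atomic<uint64_t> total{ 0 };
	std::vector<std::thread> threads;
	for (unsigned t = 0; t < thread_count; t++) {
		threads.emplace_back([&total, iterations]() {
			for (uint64_t i = 0; i < iterations; i++) {
				total.fetch_add(1, std::memory_order_relaxed);
			}
		});
	}
	for (std::thread &th : threads) {
		th.join();
	}
	return total.load();
}

// Each thread counts in a local variable and publishes its result once.
uint64_t count_local(unsigned thread_count, uint64_t iterations) {
	std::atomic<uint64_t> total{ 0 };
	std::vector<std::thread> threads;
	for (unsigned t = 0; t < thread_count; t++) {
		threads.emplace_back([&total, iterations]() {
			uint64_t local = 0;
			for (uint64_t i = 0; i < iterations; i++) {
				local++;
			}
			total.fetch_add(local, std::memory_order_relaxed);
		});
	}
	for (std::thread &th : threads) {
		th.join();
	}
	return total.load();
}
```

Timing the two with a large iteration count on a multi-core machine should show the gap, although an optimizing compiler may collapse the local loop entirely, so treat any numbers as illustrative.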
An even bigger problem arises when the shared data are atomic types that only one thread can work with at a time (`RefCounted`, `SpinLock`, `Mutex`).
What was improved, and how?
ObjectDB lock-free
The `get_instance` method no longer uses `SpinLock`. Similar to this PR: #97465.
The class now allocates blocks. Each block is an `ObjectSlot` array. The size of the blocks increases exponentially.
This gets rid of the `realloc` that was used before and that required a `SpinLock` to get an `Object`.
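As a rough sketch of why exponentially sized blocks make a lock-free read possible: once a block is allocated it never moves, so a reader only needs to turn an index into a block number and an offset. The code below is an illustration under assumed constants (`FIRST_BLOCK_SIZE`, `MAX_BLOCKS`) and made-up names (`locate`, `BlockTable`); it is not the PR's `ObjectDB` code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t FIRST_BLOCK_SIZE = 1024; // Hypothetical size of block 0.
constexpr uint32_t MAX_BLOCKS = 16; // Hypothetical upper bound.

struct BlockPosition {
	uint32_t block;
	uint32_t offset;
};

// floor(log2(v)) for v >= 1.
inline uint32_t floor_log2(uint32_t v) {
	uint32_t r = 0;
	while (v >>= 1) {
		r++;
	}
	return r;
}

// Block b holds FIRST_BLOCK_SIZE << b slots and starts at
// FIRST_BLOCK_SIZE * (2^b - 1).
inline BlockPosition locate(uint32_t index) {
	uint32_t block = floor_log2(index / FIRST_BLOCK_SIZE + 1);
	uint32_t block_start = FIRST_BLOCK_SIZE * ((1u << block) - 1);
	return { block, index - block_start };
}

template <typename T>
class BlockTable {
	std::atomic<T *> blocks[MAX_BLOCKS];

public:
	BlockTable() {
		for (std::atomic<T *> &b : blocks) {
			b.store(nullptr, std::memory_order_relaxed);
		}
	}
	~BlockTable() {
		for (std::atomic<T *> &b : blocks) {
			delete[] b.load();
		}
	}

	// Writer side; assumed to be serialized externally (for example by a mutex).
	void ensure_block(uint32_t block) {
		if (blocks[block].load(std::memory_order_acquire) == nullptr) {
			blocks[block].store(new T[FIRST_BLOCK_SIZE << block](), std::memory_order_release);
		}
	}

	// Reader side: no lock, because existing blocks are never reallocated.
	T *get(uint32_t index) {
		BlockPosition pos = locate(index);
		if (pos.block >= MAX_BLOCKS) {
			return nullptr;
		}
		T *block = blocks[pos.block].load(std::memory_order_acquire);
		return block ? &block[pos.offset] : nullptr;
	}
};
```

With a single growing array, a `realloc` can move every element, so readers must be kept out while it runs; here adding a block only publishes one new pointer.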
RWLock
`RWLock` was improved. Now each thread has its own mutex for reading data. All mutexes are locked during a write. Before that, a `shared_timed_mutex` was used, which was shared by all threads.
StringName assignment
Assigning one `StringName` to another `StringName` uses the atomic operations `ref` and `unref`.
In the methods `Object::_call_bind` and `_call_deferred_bind`, `VariantInternal` was used to obtain the `StringName` without a ref.
In `gdscript_vm`, C++ references were used instead of assignments.
Weak Variant for RefCounted
godot/modules/gdscript/gdscript_vm.cpp, line 621 in 533c616
Each time any GDScript method is called, `reference` is called on one common `script` variable. Also, when the method ends, the Variant is destroyed, which causes an unreference.
Therefore, I implemented a constructor that creates a weak `Variant` from a `RefCounted`, so that this Variant is not counted as a reference to the object.
To use it, pass the optional `true` parameter to the constructor:
`Variant(script, true)`
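To illustrate the idea outside of Godot, here is a minimal sketch of a handle with an optional weak mode. `Counted` and `Handle` are stand-ins invented for this example; they are not Godot's `RefCounted` or `Variant`, and the real constructor may behave differently in detail.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for a reference-counted object (not Godot's RefCounted).
struct Counted {
	std::atomic<int> refcount{ 1 };
	void reference() { refcount.fetch_add(1, std::memory_order_relaxed); }
	// Returns true when the last reference was dropped.
	bool unreference() { return refcount.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};

// Hypothetical handle: in weak mode it skips both atomic operations.
class Handle {
	Counted *object;
	bool weak;

public:
	explicit Handle(Counted *p_object, bool p_weak = false) :
			object(p_object), weak(p_weak) {
		if (!weak) {
			object->reference();
		}
	}
	~Handle() {
		if (!weak && object->unreference()) {
			delete object;
		}
	}
	Handle(const Handle &) = delete;
	Handle &operator=(const Handle &) = delete;

	Counted *get() const { return object; }
};
```

A weak handle like this is only safe while something else keeps the object alive for the handle's whole lifetime, which is the trade-off for avoiding the two atomic operations per call.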
What wasn't improved?
One bottleneck remains, `_ObjectDebugLock`, but I did not change it because it is not used in release builds.
Performance testing
MRP: Multithreading_Test.zip. This is the same as in #58279, but for 4.3+ versions.
For testing, compile with `scons scu_build=yes target=template_release lto=full`.
While loop where calculations are performed on many threads:
Master:
out1.mp4
Current PR:
out2.mp4
Blue shows the number of cycle iterations per u_sec.
Red shows the total number of iterations of all threads per u_sec.
Also, in https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel, which uses multithreading, the average `Mesh` generation time is improved by 58%.
Fixes #58279
Fixes #65404