
Improve core and GDScript VM multithreading performance #98469

Draft · wants to merge 7 commits into master
Conversation

@Nazarwadim (Contributor) commented Oct 23, 2024

When is multithreading slower than single-threading?

Multithreading is slow when many threads work on shared data. The problem arises when one thread modifies data used by other threads: the corresponding cache lines in those threads become invalid, and the processor has to fall back to the shared L3 cache.
The problem is even worse when the shared data is an atomic type that only one thread can work with at a time (RefCounted, SpinLock, Mutex).
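
A minimal illustration of the problem (my sketch, not Godot code): when many threads increment a single atomic counter, the cache line holding it bounces between cores, and the "parallel" work is effectively serialized.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> shared_counter{ 0 };

void worker() {
	for (int i = 0; i < 1000000; i++) {
		// Each increment invalidates the cache line in every other core,
		// so throughput collapses as more threads are added.
		shared_counter.fetch_add(1, std::memory_order_relaxed);
	}
}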

What was improved, and how?

ObjectDB lock-free
The get_instance method no longer uses a SpinLock. Similar to this PR: #97465.

The class now allocates blocks, where each block is an array of ObjectSlot, and the block sizes grow exponentially.
This gets rid of the realloc that was used before, which required a SpinLock just to get an Object.
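
A minimal sketch of the idea (hypothetical names and constants, assuming the 5-bit block / 26-bit slot split discussed below, and taking p_id as the position bits only): the top-level block array has a fixed capacity, so readers can decode an instance ID without ever racing against a realloc.

#include <cstdint>

static constexpr uint64_t SLOT_BITS = 26;
static constexpr uint64_t SLOT_MASK = (uint64_t(1) << SLOT_BITS) - 1;

struct ObjectSlot {
	void *object = nullptr;
	uint32_t validator = 0;
};

// Fixed-size top-level array: blocks are only ever appended, never moved.
static ObjectSlot *blocks[32];

ObjectSlot *get_slot(uint64_t p_id) {
	uint32_t block = uint32_t(p_id >> SLOT_BITS); // Which exponentially sized block.
	uint32_t slot = uint32_t(p_id & SLOT_MASK); // Index inside that block.
	return &blocks[block][slot];
}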

RWLock
RWLock was improved. Each thread now has its own mutex for reading data, and a writer locks all of them. Previously, a single shared_timed_mutex was shared by all threads.
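
A minimal sketch of the scheme, with hypothetical names and a fixed thread count: readers touch only their own cache-line-aligned mutex, while a writer acquires all of them.

#include <mutex>

static constexpr int MAX_THREADS = 64; // Hypothetical upper bound.

struct alignas(64) PaddedMutex {
	std::mutex mutex; // One mutex per cache line, so readers never share a line.
};

class DistributedRWLock {
	PaddedMutex locks[MAX_THREADS];

public:
	void read_lock(int p_thread) { locks[p_thread].mutex.lock(); }
	void read_unlock(int p_thread) { locks[p_thread].mutex.unlock(); }

	void write_lock() { // A writer takes every per-thread mutex.
		for (int i = 0; i < MAX_THREADS; i++) {
			locks[i].mutex.lock();
		}
	}
	void write_unlock() {
		for (int i = MAX_THREADS - 1; i >= 0; i--) {
			locks[i].mutex.unlock();
		}
	}
};

Reads scale because each thread only ever touches its own cache line; writes become more expensive, which fits RWLock's read-mostly usage.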

StringName assignment
Assigning one StringName to another performs the atomic operations ref and unref.

In Object::_call_bind and _call_deferred_bind, VariantInternal is now used, which obtains the StringName without a ref.
In gdscript_vm, C++ references are used instead of assignments.
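
To illustrate why this matters (a stand-in type, not Godot's actual StringName): a copy touches an atomic refcount shared by every thread, while a const reference generates no atomic traffic at all.

#include <atomic>

struct Data {
	std::atomic<int> refcount{ 1 };
};

struct Name { // Stand-in for a refcounted handle like StringName.
	Data *data = nullptr;
	Name(const Name &p_other) :
			data(p_other.data) {
		if (data) {
			data->refcount.fetch_add(1); // Atomic ref on every copy.
		}
	}
	~Name() {
		if (data) {
			data->refcount.fetch_sub(1); // Atomic unref on destruction.
		}
	}
};

void slow_path(Name p_name) {} // By value: ref + unref on the shared counter.
void fast_path(const Name &p_name) {} // By reference: no atomic operations.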

Weak Variant for RefCounted

memnew_placement(&stack[ADDR_STACK_CLASS], Variant(script));

Each time any GDScript method is called, reference() is called on one shared script variable; when the method ends, the Variant is destroyed, which causes an unreference().
Therefore, I implemented a constructor for a weak Variant from RefCounted, so that the Variant does not contribute to the reference count.

To use this, pass the optional true parameter to the constructor, like this: Variant(script, true).
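
With that, the placement-new shown above becomes:

memnew_placement(&stack[ADDR_STACK_CLASS], Variant(script, true));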

What wasn't improved?

One bottleneck remains: _ObjectDebugLock. I did not change it because it is not used in release builds.

Performance testing

MRP: Multithreading_Test.zip. This is the same as in #58279, but for versions 4.3+.

For testing, compile with scons scu_build=yes target=template_release lto=full.
The while loop in which calculations are performed on many threads:

while count > 0:
	count -= 1

	# Comment/uncomment these lines to test different functions.
	some_val += 1
	some_val = sin(some_val)
	some_val = _increment(some_val)
	some_val += noise.get_noise_2d(1, 1)
	# Unique function per thread.
	call(name, some_val)

Master:

out1.mp4

Current PR:

out2.mp4

Blue shows the number of loop iterations per microsecond for one thread.
Red shows the total number of iterations across all threads per microsecond.

Also, in https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel, which uses multithreading, the average Mesh generation time is improved by 58%.

Fixes #58279
Fixes #65404

@RandomShaper (Member):

Could you try with the SpinLock from #85167 in ObjectDB and report what the outcome is?

@Nazarwadim (Contributor, Author) commented Oct 24, 2024

@RandomShaper Just tested. It is maybe a little better compared to the master SpinLock, but I did not notice a dramatic improvement; the videos below show a difference of about 30%.

But there is still a big difference between running without locks and with locks, because locks use shared memory.

Master SpinLock:

out4.mp4

Improved SpinLock:

out3.mp4

@Nazarwadim force-pushed the improve_core_multithreading_performance branch from 7bf516f to 9794e39 on October 24, 2024, 18:49.
@CedNaru (Contributor) commented Oct 25, 2024

Nice results.

I have a few questions:

  • The number of bits used to find the location of the Object has been increased from 24 to 31, increasing both the maximum number of objects and the risk of ID collision. Has it been established that this trade-off is necessary and safe?
  • Unless I misunderstood the code, aren't we wasting index space with this new method? Validator and reference bits aside, 5 bits are used for the address of the block and 26 for the slot address inside the block.
    Because each block is bigger than the previous one, only the last one is big enough to make use of the 26-bit address. For example, the first block has a size of 128, which uses at most 7 bits before switching to the next block of size 256, meaning that 19 bits remain unused for it. Shouldn't all blocks have the same size instead?
  • What is that huge padding of 64 bytes for in the new RWLock implementation? Is that some cache line trick? If yes, shouldn't it be adjusted so the struct is exactly 64 bytes?

@Nazarwadim (Contributor, Author):

@CedNaru Yes, actually 4 bits are lost with the new implementation, so I reduced the number of bits for the validator and increased the number for the position. It is theoretically possible to devise some algorithm that compresses the data, but that would impose overhead.

Shouldn't all blocks have the same size instead?

The idea is that the ObjectDB expands exponentially so as not to use too much memory.

What is that huge padding of 64 bytes for in the new RWLock implementation? Is that some cache line trick? If yes, shouldn't it be adjusted so the struct is exactly 64 bytes?

Yes, this is a cache line trick: each mutex must be cache-line aligned. But #85167 needs to be merged first, so that there is no warning and the exact cache line size is known.

@CedNaru (Contributor) commented Oct 25, 2024

The idea is that the ObjectDB expands exponentially so as not to use too much memory.

With this new method, you end up with 2 arrays anyway, one for the blocks and one for the slots. You can technically split those 31 bits differently than 5 and 26 with no overhead. For example, the block bits could be increased to 10 and the slot bits decreased to 14 (to go back to the same validator size as before); you would end up with 1024 blocks of 16384 objects each.

It sure removes the exponential growth, but a single block doesn't take much memory. It basically becomes a paged allocator.
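
A quick sketch of the arithmetic behind this suggestion (values from the comment above; the names are mine, and p_pos is taken as the position bits only):

#include <cstdint>

constexpr uint64_t BLOCK_BITS = 10; // 2^10 = 1024 blocks.
constexpr uint64_t SLOT_BITS = 14; // 2^14 = 16384 slots per block.
constexpr uint64_t SLOT_MASK = (uint64_t(1) << SLOT_BITS) - 1;
// Total capacity: 2^(10 + 14) = 16,777,216 objects, the same 24-bit
// position size as before, with no unused index space in any block.

constexpr uint64_t block_of(uint64_t p_pos) { return p_pos >> SLOT_BITS; }
constexpr uint64_t slot_of(uint64_t p_pos) { return p_pos & SLOT_MASK; }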

Yes, this is a cache line trick: each mutex must be cache-line aligned. But #85167 needs to be merged first, so that there is no warning and the exact cache line size is known.

Got it. So it's basically to have the guarantee that each mutex lives in a different cache line so they don't fight each other.

@Calinou (Member) left a comment

Tested locally, it works as expected.

Benchmark

PC specifications
  • CPU: Intel Core i9-13900K
  • GPU: NVIDIA GeForce RTX 4090
  • RAM: 64 GB (2×32 GB DDR5-5800 C30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 40)

Note: The project is capped at 60 FPS while running. Uncapping the FPS allows for slightly higher results in both cases, but it's less stable over time.

Using a release export template with optimize=speed lto=full.

Before

Screenshot_20241026_141852.png.webp

After

My result isn't quite as high as the one shown in the OP, but it's still a noticeable improvement compared to before.

Screenshot_20241026_141810.png.webp

@Nazarwadim (Contributor, Author):

@Calinou Are you sure you used target=template_release? On template debug/editor builds I get similar results due to the _ObjectDebugLock bottleneck. Anyway, thanks for testing.
