
Improve core and GDScript VM multithreading performance #98469

Draft · wants to merge 7 commits into master
Conversation

@Nazarwadim (Contributor) commented Oct 23, 2024

When is multithreading slower than single-threading?

Multithreading is slow when many threads work on shared data. The problem arises when one thread modifies data used by other threads: the corresponding cache lines in those threads become invalid, and the processor has to fall back to the shared L3 cache.
The problem is even worse when the shared data is an atomic type that only one thread can work with at a time (RefCounted, SpinLock, Mutex).
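
A minimal illustration of the problem (my sketch, not Godot code): when many threads increment a single atomic counter, the cache line holding it bounces between cores, and the "parallel" work is effectively serialized.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> shared_counter{ 0 };

void worker() {
	for (int i = 0; i < 1000000; i++) {
		// Each increment invalidates the cache line in every other core,
		// so throughput collapses as more threads are added.
		shared_counter.fetch_add(1, std::memory_order_relaxed);
	}
}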

What was improved, and how?

ObjectDB lock-free
The get_instance method no longer uses a SpinLock. Similar to this PR: #97465.

The class now allocates blocks, where each block is an array of ObjectSlot, and the block sizes grow exponentially.
This gets rid of the realloc that was used before, which required a SpinLock just to get an Object.
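
A minimal sketch of the idea (hypothetical names and constants, assuming the 5-bit block / 26-bit slot split discussed below, and taking p_id as the position bits only): the top-level block array has a fixed capacity, so readers can decode an instance ID without ever racing against a realloc.

#include <cstdint>

static constexpr uint64_t SLOT_BITS = 26;
static constexpr uint64_t SLOT_MASK = (uint64_t(1) << SLOT_BITS) - 1;

struct ObjectSlot {
	void *object = nullptr;
	uint32_t validator = 0;
};

// Fixed-size top-level array: blocks are only ever appended, never moved.
static ObjectSlot *blocks[32];

ObjectSlot *get_slot(uint64_t p_id) {
	uint32_t block = uint32_t(p_id >> SLOT_BITS); // Which exponentially sized block.
	uint32_t slot = uint32_t(p_id & SLOT_MASK); // Index inside that block.
	return &blocks[block][slot];
}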

RWLock
RWLock was improved. Each thread now has its own mutex for reading data, and a writer locks all of them. Previously, a single shared_timed_mutex was shared by all threads.
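
A minimal sketch of the scheme, with hypothetical names and a fixed thread count: readers touch only their own cache-line-aligned mutex, while a writer acquires all of them.

#include <mutex>

static constexpr int MAX_THREADS = 64; // Hypothetical upper bound.

struct alignas(64) PaddedMutex {
	std::mutex mutex; // One mutex per cache line, so readers never share a line.
};

class DistributedRWLock {
	PaddedMutex locks[MAX_THREADS];

public:
	void read_lock(int p_thread) { locks[p_thread].mutex.lock(); }
	void read_unlock(int p_thread) { locks[p_thread].mutex.unlock(); }

	void write_lock() { // A writer takes every per-thread mutex.
		for (int i = 0; i < MAX_THREADS; i++) {
			locks[i].mutex.lock();
		}
	}
	void write_unlock() {
		for (int i = MAX_THREADS - 1; i >= 0; i--) {
			locks[i].mutex.unlock();
		}
	}
};

Reads scale because each thread only ever touches its own cache line; writes become more expensive, which fits RWLock's read-mostly usage.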

StringName assignment
Assigning one StringName to another performs the atomic operations ref and unref.

In Object::_call_bind and _call_deferred_bind, VariantInternal is now used, which obtains the StringName without a ref.
In gdscript_vm, C++ references are used instead of assignments.
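
To illustrate why this matters (a stand-in type, not Godot's actual StringName): a copy touches an atomic refcount shared by every thread, while a const reference generates no atomic traffic at all.

#include <atomic>

struct Data {
	std::atomic<int> refcount{ 1 };
};

struct Name { // Stand-in for a refcounted handle like StringName.
	Data *data = nullptr;
	Name(const Name &p_other) :
			data(p_other.data) {
		if (data) {
			data->refcount.fetch_add(1); // Atomic ref on every copy.
		}
	}
	~Name() {
		if (data) {
			data->refcount.fetch_sub(1); // Atomic unref on destruction.
		}
	}
};

void slow_path(Name p_name) {} // By value: ref + unref on the shared counter.
void fast_path(const Name &p_name) {} // By reference: no atomic operations.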

Weak Variant for RefCounted

memnew_placement(&stack[ADDR_STACK_CLASS], Variant(script));

Each time any GDScript method is called, reference() is called on one shared script variable; when the method ends, the Variant is destroyed, which causes an unreference().
Therefore, I implemented a constructor for a weak Variant from RefCounted, so that the Variant does not contribute to the reference count.

To use this, pass the optional true parameter to the constructor, like this: Variant(script, true).
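
With that, the placement-new shown above becomes:

memnew_placement(&stack[ADDR_STACK_CLASS], Variant(script, true));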

What wasn't improved?

One bottleneck remains: _ObjectDebugLock. I did not change it because it is not used in release builds.

Performance testing

MRP: Multithreading_Test.zip. This is the same as in #58279, but for versions 4.3+.

For testing, compile with scons scu_build=yes target=template_release lto=full.
The while loop in which calculations are performed on many threads:

while count > 0:
	count -= 1

	# Comment/uncomment these lines to test different functions.
	some_val += 1
	some_val = sin(some_val)
	some_val = _increment(some_val)
	some_val += noise.get_noise_2d(1, 1)
	# Unique function per thread.
	call(name, some_val)

Master:

out1.mp4

Current PR:

out2.mp4

Blue shows the number of loop iterations per microsecond for one thread.
Red shows the total number of iterations across all threads per microsecond.

Also, in https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel, which uses multithreading, the average Mesh generation time is improved by 58%.

Fixes #58279
Fixes #65404

@RandomShaper (Member):

Could you try with the SpinLock from #85167 in ObjectDB and report what the outcome is?

@Nazarwadim (Contributor, Author) commented Oct 24, 2024

@RandomShaper Just tested. It is maybe a little better compared to the master SpinLock, but I did not notice a dramatic improvement; the videos below show a difference of about 30%.

But there is still a big difference between running without locks and with locks, because locks use shared memory.

Master SpinLock:

out4.mp4

Improved SpinLock:

out3.mp4

@Nazarwadim force-pushed the improve_core_multithreading_performance branch from 7bf516f to 9794e39 on October 24, 2024, 18:49.
@CedNaru (Contributor) commented Oct 25, 2024

Nice results.

I have a few questions:

  • The number of bits used to find the location of the Object has been increased from 24 to 31, increasing both the maximum number of objects and the risk of ID collision. Has it been established that this trade-off is necessary and safe?
  • Unless I misunderstood the code, aren't we wasting index space with this new method? Validator and reference bits aside, 5 bits are used for the address of the block and 26 for the slot address inside the block.
    Because each block is bigger than the previous one, only the last one is big enough to make use of the 26-bit address. For example, the first block has a size of 128, which uses at most 7 bits before switching to the next block of size 256, meaning that 19 bits remain unused for it. Shouldn't all blocks have the same size instead?
  • What is that huge padding of 64 bytes for in the new RWLock implementation? Is that some cache line trick? If yes, shouldn't it be adjusted so the struct is exactly 64 bytes?

@Nazarwadim (Contributor, Author):

@CedNaru Yes, actually 4 bits are lost with the new implementation, so I reduced the number of bits for the validator and increased the number for the position. It is theoretically possible to devise some algorithm that compresses the data, but that would impose overhead.

Shouldn't all blocks have the same size instead?

The idea is that the ObjectDB expands exponentially so as not to use too much memory.

What is that huge padding of 64 bytes for in the new RWLock implementation? Is that some cache line trick? If yes, shouldn't it be adjusted so the struct is exactly 64 bytes?

Yes, this is a cache line trick: each mutex must be cache-line aligned. But #85167 needs to be merged first, so that there is no warning and the exact cache line size is known.

@CedNaru (Contributor) commented Oct 25, 2024

The idea is that the ObjectDB expands exponentially so as not to use too much memory.

With this new method, you end up with 2 arrays anyway, one for the blocks and one for the slots. You can technically split those 31 bits differently than 5 and 26 with no overhead. For example, the block bits could be increased to 10 and the slot bits decreased to 14 (to go back to the same validator size as before); you would end up with 1024 blocks of 16384 objects each.

It sure removes the exponential growth, but a single block doesn't take much memory. It basically becomes a paged allocator.
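
A quick sketch of the arithmetic behind this suggestion (values from the comment above; the names are mine, and p_pos is taken as the position bits only):

#include <cstdint>

constexpr uint64_t BLOCK_BITS = 10; // 2^10 = 1024 blocks.
constexpr uint64_t SLOT_BITS = 14; // 2^14 = 16384 slots per block.
constexpr uint64_t SLOT_MASK = (uint64_t(1) << SLOT_BITS) - 1;
// Total capacity: 2^(10 + 14) = 16,777,216 objects, the same 24-bit
// position size as before, with no unused index space in any block.

constexpr uint64_t block_of(uint64_t p_pos) { return p_pos >> SLOT_BITS; }
constexpr uint64_t slot_of(uint64_t p_pos) { return p_pos & SLOT_MASK; }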

Yes, this is a cache line trick: each mutex must be cache-line aligned. But #85167 needs to be merged first, so that there is no warning and the exact cache line size is known.

Got it. So it's basically to have the guarantee that each mutex lives in a different cache line so they don't fight each other.

@Calinou (Member) left a comment

Tested locally, it works as expected.

Benchmark

PC specifications
  • CPU: Intel Core i9-13900K
  • GPU: NVIDIA GeForce RTX 4090
  • RAM: 64 GB (2×32 GB DDR5-5800 C30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 40)

Note: The project is capped at 60 FPS while running. Uncapping the FPS allows for slightly higher results in both cases, but it's less stable over time.

Using a release export template with optimize=speed lto=full.

Before

Screenshot_20241026_141852.png.webp

After

My result isn't quite as high as the one shown in the OP, but it's still a noticeable improvement compared to before.

Screenshot_20241026_141810.png.webp

@Nazarwadim (Contributor, Author):

@Calinou Are you sure you used target=template_release? On template debug/editor builds I get similar results due to the _ObjectDebugLock bottleneck. Anyway, thanks for testing.
