Optimizations for GDScript VM #70838

Merged: 1 commit, Jan 5, 2023

reduz
Member

@reduz reduz commented Jan 2, 2023

  • Removed instruction argument count and instruction prefetching. This is now done on the fly. Reduces jumps.
  • Removed the address mode switch, addressing is now simple array indexing.
  • OPCODE_DISPATCH now goes directly to the next instruction, like in Godot 3.x.
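
The second bullet can be pictured with a toy model. The sketch below is not Godot's VM code: the names (`ADDR_BITS`, `resolve_with_switch`, `resolve_by_indexing`), the 24-bit split and the three address modes are assumptions made up for illustration. It only contrasts resolving an operand by branching on its address mode with resolving it by indexing a table of storage areas.

```python
# Toy illustration only; the layout and names are hypothetical, not Godot's.
ADDR_BITS = 24                      # assumed: the low bits hold the index
ADDR_MASK = (1 << ADDR_BITS) - 1
STACK, CONSTANT, MEMBER = 0, 1, 2   # assumed address modes


def encode(mode, index):
    """Pack an address mode and an index into one integer operand."""
    return (mode << ADDR_BITS) | index


def resolve_with_switch(addr, stack, constants, members):
    """Branch on the address mode, then index the chosen area."""
    mode, index = addr >> ADDR_BITS, addr & ADDR_MASK
    if mode == STACK:
        return stack[index]
    elif mode == CONSTANT:
        return constants[index]
    elif mode == MEMBER:
        return members[index]
    raise ValueError("bad address mode")


def resolve_by_indexing(addr, areas):
    """No branch: the mode is itself an index into a table of areas."""
    return areas[addr >> ADDR_BITS][addr & ADDR_MASK]


stack, constants, members = [10, 11], [3.14], ["hp"]
areas = [stack, constants, members]
addr = encode(CONSTANT, 0)
print(resolve_with_switch(addr, stack, constants, members))
print(resolve_by_indexing(addr, areas))
```

Both functions return the same value for any valid address; the second one simply trades the per-operand branch for two array lookups.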

Performance seems to improve significantly (more than 40% faster on release on typed code vs master). Here is a quick benchmark I made:

Master:
Debug:
Time taken (untyped): 33.708msec.
Time taken (typed): 20.464msec.
Release:
Time taken (untyped): 8.257msec.
Time taken (typed): 5.761msec.

This PR:
Debug:
Time taken (untyped): 29.234msec.
Time taken (typed): 16.027msec.
Release:
Time taken (untyped): 6.41msec.
Time taken (typed): 3.311msec.
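
The "more than 40% faster" figure can be checked against the numbers above. This is a small sketch; the dictionary layout and the `time_reduction` helper are my own, and it reads "faster" as the reduction in run time.

```python
# Timings in msec, copied from the benchmark output above.
master = {"debug": {"untyped": 33.708, "typed": 20.464},
          "release": {"untyped": 8.257, "typed": 5.761}}
this_pr = {"debug": {"untyped": 29.234, "typed": 16.027},
           "release": {"untyped": 6.41, "typed": 3.311}}


def time_reduction(before, after):
    """Fraction of the original run time that was saved."""
    return (before - after) / before


for build in master:
    for kind in master[build]:
        before, after = master[build][kind], this_pr[build][kind]
        print(f"{build:8s}{kind:8s} "
              f"{time_reduction(before, after):6.1%} less time, "
              f"{before / after:.2f}x speedup")
```

For typed code on release this gives roughly a 42.5% reduction in run time, or about a 1.74x speedup.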

Test source code:

const elems = 65536

func benchmark():

	var arr = []
	for x in range(elems):
		arr.append(randi() % elems)


	var t = Time.get_ticks_usec()

	var acc = 0.0

	for e in arr:
		var e2 = arr[e]
		var e3 = arr[e2]
		acc += e * e2
		acc *= e3 + e
		acc = sqrt(acc)

	
	print("Time taken (untyped): ",(Time.get_ticks_usec() - t)/1000.0,"msec.")		

	var arr2 := PackedInt32Array()
	for x in range(elems):
		arr2.append(randi() % elems)


	t = Time.get_ticks_usec()

	var acc2 : float = 0.0

	for e in arr2:
		var e2 :int = arr2[e]
		var e3 :int = arr2[e2]
		acc2 += e * e2
		acc2 *= e3 + e
		acc2 = sqrt(acc2)

	
	print("Time taken (typed): ",(Time.get_ticks_usec() - t)/1000.0,"msec.")		

I don't have much else I can use to test performance, so if anyone wants to lend a hand and compare with master (both on debug and release) or test in your production project, it would be very welcome.

@reduz reduz requested a review from a team as a code owner January 2, 2023 14:32
@reduz reduz force-pushed the gdscript-vm-optimization branch 2 times, most recently from 96c403f to 1cf2f75 Compare January 2, 2023 14:42
@Chaosus Chaosus added this to the 4.0 milestone Jan 2, 2023
@reduz reduz force-pushed the gdscript-vm-optimization branch from 1cf2f75 to 3258dac Compare January 2, 2023 15:21
@Calinou
Member

Calinou commented Jan 2, 2023

If anyone's curious, I ported the benchmark code to PHP, Python and Ruby. Benchmark times were measured on an i7-6700K @ 4.4 GHz on Fedora 36.

PHP

  • Time taken (PHP 8.1.13 + XDebug): 16.055 msec
  • Time taken (PHP 8.1.13): 3.538 msec
  • Time taken (PHP 8.1.13 + JIT): 1.605 msec [1]
Code
<?php

const elems = 65536;

function benchmark() {
    $arr = [];
    for ($i = 0; $i < elems; $i++) {
        $arr[] = mt_rand(0, 2147483647) % elems;
    }

    $t = microtime(true);

    $acc = 0.0;

    foreach ($arr as $e) {
        $e2 = $arr[$e];
        $e3 = $arr[$e2];
        $acc += $e * $e2;
        $acc *= $e3 + $e;
        $acc = sqrt($acc);
    }

    echo "Time taken (PHP): " . (microtime(true) - $t) * 1000 . " msec.";
}

benchmark();

Python

  • Time taken (Python 3.9.0): 12.492 msec
  • Time taken (Pyston 2.3.4): 8.459 msec [1]
  • Time taken (Pyston 2.3.5): 6.686 msec [1]
Code
import math
import random
from time import time

elems = 65536

def benchmark():
    arr = []
    for x in range(elems):
        arr.append(random.randint(0, 2147483647) % elems)

    t = time()

    acc = 0.0

    for e in arr:
        e2 = arr[e]
        e3 = arr[e2]
        acc += e * e2
        acc *= e3 + e
        acc = math.sqrt(acc)

    print("Time taken (Python):", (time() - t) * 1000, "msec.")

benchmark()

Ruby

  • Time taken (Ruby 3.2.0): 7.357 msec
  • Time taken (Ruby 3.2.0 + --jit): 4.711 msec [1]
Code
ELEMS = 65536

def benchmark()
    arr = []
    for i in 1..ELEMS
        arr.append(rand(2147483647) % ELEMS)
    end

    t = Time.now

    acc = 0.0

    for e in arr
        e2 = arr[e]
        e3 = arr[e2]
        acc += e * e2
        acc *= e3 + e
        acc = Math.sqrt(acc)
    end


    puts("Time taken (Ruby): " + ((Time.now - t) * 1000).to_s + " msec.")
end

benchmark()

Footnotes

  1. A JIT compiler is not comparable to interpreters in terms of performance. This figure is included for reference only.

@Byteron
Contributor

Byteron commented Jan 2, 2023

Given that you are on different hardware, it probably makes sense to also include GDScript numbers on your 6700K.

Member

@vnen vnen left a comment

Looks good to me, apart from the bit I commented about (which can be addressed later).

@bruvzg
Member

bruvzg commented Jan 2, 2023

Some tests on M1 mac, using release editor binary (average of 10 runs):

Configuration               Time (msec)
Godot (master, untyped)     13.9504
Godot (master, typed)       8.9020
Godot (this pr, untyped)    11.2082
Godot (this pr, typed)      6.0292
Python 3.10.9               9.1007
Ruby 2.6.10p210             8.6488
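
For anyone who wants to reproduce an "average of 10 runs" measurement, here is a minimal sketch built on the Python port of the benchmark from earlier in this thread. The helper names (`build_input`, `run_once`, `average_msec`), the fixed seed and the use of `time.perf_counter` are my own choices, not necessarily how the numbers above were collected.

```python
import math
import random
from time import perf_counter

ELEMS = 65536


def build_input(n=ELEMS, seed=0):
    """Build a list of n random indices in [0, n); seeded for repeatability."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


def run_once(arr):
    """Run the benchmark loop once and return the elapsed time in msec."""
    t = perf_counter()
    acc = 0.0
    for e in arr:
        e2 = arr[e]
        e3 = arr[e2]
        acc += e * e2
        acc *= e3 + e
        acc = math.sqrt(acc)
    return (perf_counter() - t) * 1000.0


def average_msec(arr, runs=10):
    """Average the elapsed time over several runs on the same input."""
    return sum(run_once(arr) for _ in range(runs)) / runs


arr = build_input()
print(f"Average over 10 runs: {average_msec(arr):.4f} msec")
```

Reusing the same input for every run keeps the runs comparable; the absolute numbers will of course depend on the machine and interpreter version.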

@barathbheeman

I'm new to godot, so can someone pls explain why the comparisons are made against python, php and ruby instead of c# and c++ since these are the languages someone might actually use in a godot project?

@Calinou
Member

Calinou commented Jan 2, 2023

> I'm new to godot, so can someone pls explain why the comparisons are made against python, php and ruby instead of c# and c++ since these are the languages someone might actually use in a godot project?

C# and C++ are languages compiled ahead-of-time, so these aren't fair comparisons to make. It's not possible for an interpreted language to even match an AOT-compiled language in terms of raw performance. Comparing interpreted languages with JIT-compiled languages is a stretch already, as JIT suffers from restrictions that interpreted languages don't have (such as iOS and consoles heavily restricting JIT usage).

@jordo
Contributor

jordo commented Jan 2, 2023

> I'm new to godot, so can someone pls explain why the comparisons are made against python, php and ruby instead of c# and c++ since these are the languages someone might actually use in a godot project?

We've been doing a lot of transpiling gdscript to c++ for our game logic in some performance heavy areas... so I have a lot of experience here. We run a lot of godot gameservers on cloud infrastructure so we pay for every CPU cycle used.

For GDScript vs C++ we can typically see close to two orders of magnitude of speedup, (usually 40-100X faster), which is why it's not a fair comparison but also illustrates how there can be so much to gain with an AOT language + optimization pass that compiles to machine instructions.

The typed gdscript example code runs in 16.033msec on my 2019 x86 macbook pro.

The native C++ version below executes in 0.021000msec. So in this example, it is actually close to 1000X faster.

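#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <mach/mach_time.h> // macOS-only timer

int main() {
	int elems = 65536;

	// Fill the array with random indices in [0, elems).
	std::vector<int> arr = {};
	for (int i = 0; i < elems; i++) {
		arr.push_back(rand() % elems);
	}

	uint64_t t = mach_absolute_time();

	float acc = 0.0;

	for (int e = 0; e < elems; e++) {
		int e2 = arr[e];
		int e3 = arr[e2];
		// Multiply as float: e * e2 can exceed the range of int.
		acc += static_cast<float>(e) * e2;
		acc *= e3 + e;
		acc = sqrt(acc);
	}

	printf("Time taken (typed): %.4f%s",(mach_absolute_time() - t)/1000.0,"msec.");
	return 0;
}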

Now, we typically don't see a 1000X improvement. But in this trivial example the native optimization passes that a C++ compiler can do (vectorization, inlining, loop unrolling, etc.) seem to really play a factor. In production code with typical game logic we usually see about a 40X-100X performance improvement.

@reduz reduz force-pushed the gdscript-vm-optimization branch from 3258dac to 7211e04 Compare January 2, 2023 22:44
@reduz
Member Author

reduz commented Jan 2, 2023

@jordo We have been discussing this here - godotengine/godot-proposals#6031 and the general consensus seems to be that users would be happy with an optional transpiler to C, but this is obviously Godot 4.1+ material.

@fire

This comment was marked as outdated.

@TokisanGames
Contributor

TokisanGames commented Jan 3, 2023

I ran our game, with nearly 30k lines of statically typed code and a thousand assets, and compared it against beta 10. I didn't expect much. Though we have a lot of code, we're not bound by GDScript or the CPU; we're bound by the renderer and the GPU. I rarely see any scripts with significant performance issues on the monitors, and I fix them if I do.

Core i9-12900H
RTX 3070 mobile
Win 11/64

Testing the PR build (7211e04), I do have a lot more renderer errors and visual artifacts (prob due to other PRs). I did not get any gdscript errors.

What                                                     B10           This PR
Loading the project and a big scene from the manager     24.7s         38s (1st), 17.6s (2nd)
Loading the game to title screen                         14.1s         27.4s (1st), 13.6s (2nd)
Loading the first level                                  8.4s          16.6s (1st), 8.5s (2nd)
Playing, running around one area                         100-145fps    100-145fps

[Image: Beta 10]

[Image: PR]

@Zireael07
Contributor

Why the big difference between 1st and 2nd run?

@Norrox
Contributor

Norrox commented Jan 3, 2023

> I'm new to godot, so can someone pls explain why the comparisons are made against python, php and ruby instead of c# and c++ since these are the languages someone might actually use in a godot project?

What Calinou said, and we can already use C# with Godot, so why compare the two? :D

@akien-mga
Member

Note that the way binaries are compiled has a big impact on performance comparisons. Official beta builds are a lot more optimized than CI builds, so to do an accurate comparison using the PR's CI build, I would suggest downloading a CI build from the master branch around the time that PR was made (e.g. https://github.com/godotengine/godot/actions/runs/3818020117).

LinuxUserGD added a commit to LinuxUserGD/gdscript-transpiler-bin that referenced this pull request Jan 3, 2023
@cridenour
Contributor

Ran my project with around 8K lines of statically typed GDScript in both debug and release. I don't have any benchmark numbers at the moment, but the procedural world generation felt faster, and felt much faster on macOS.

No errors anywhere in the project.

@LinuxUserGD
Contributor

LinuxUserGD commented Jan 4, 2023

With Godot 4 beta 10 I noticed that string concatenation seems to be 700 times slower compared to Python.
GDScript (16.63s):

func string() -> int:
	var x : String = ""
	for i in range(0, 300000):
		x += " "
	return x.length()

Python 3.10 (0.023s):

def string():
    x = ""
    for i in range(0, 300000):
        x += " "
    return len(x)

@reduz
Member Author

reduz commented Jan 4, 2023

@LinuxUserGD most likely Python is optimized for string processing (because that's a common use case for it) and their string class keeps track of all the additions to later merge them when you read from it. Godot is not designed with string processing in mind, hence this is slow. We have a StringBuilder class we use on the C++ side that could be exposed to script if there is really demand, but there probably isn't.
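
The builder idea mentioned above, collecting the pieces and merging them once at the end, can be sketched in Python with a list and `str.join`. This is only an analogy: it does not reproduce the API of Godot's C++ StringBuilder, and the function names are made up for the example.

```python
def concat_repeated(n):
    """Grow the string one piece at a time, as in the snippets above."""
    x = ""
    for _ in range(n):
        x += " "
    return len(x)


def concat_with_builder(n):
    """Collect the pieces first, then merge them in a single pass."""
    pieces = []
    for _ in range(n):
        pieces.append(" ")
    return len("".join(pieces))


print(concat_repeated(300000), concat_with_builder(300000))
```

Both functions build the same string; the second avoids depending on how cheaply the runtime can extend an existing string on every `+=`.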

@holzmb

holzmb commented Jan 4, 2023

Will this get backported to Godot 3?

@AThousandShips
Member

@holzmb GDScript has been completely reworked for 4.x

@reduz
Member Author

reduz commented Jan 4, 2023

@holzmb
This optimization is already in Godot 3.x:

  • OPCODE_DISPATCH now goes directly to the next instruction, like in Godot 3.x.

What is not there is the addressing mode optimization, but that won't happen because it's too much work.

@akien-mga akien-mga merged commit fc4a734 into godotengine:master Jan 5, 2023
@akien-mga
Member

Thanks!
