optimise some of the bounds checks #34

alindima · 2023-12-18T13:23:14Z

This brings a performance improvement of roughly 18% for encoding and 30-32% for decoding (see the numbers below).

Where possible, the compiler is helped to optimise away the bounds checks. No unsafe code was used.
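For readers unfamiliar with the technique, here is a minimal sketch of how bounds checks can usually be removed in safe Rust. The function names and the XOR workload are hypothetical and are not taken from this PR; whether the checks are actually elided depends on the compiler version and optimisation level, so the generated code should be inspected or benchmarked.

```rust
/// Naive version: `dst[i]` and `src[i]` may each carry a bounds check
/// on every iteration, because `len` is unrelated to the slice lengths.
pub fn xor_indexed(dst: &mut [u16], src: &[u16], len: usize) {
    for i in 0..len {
        dst[i] ^= src[i];
    }
}

/// Re-slice once up front. The two slicing operations are checked
/// (and panic if `len` is too large), after which the optimiser can
/// usually prove that every `i` in `0..len` is in range.
pub fn xor_resliced(dst: &mut [u16], src: &[u16], len: usize) {
    let dst = &mut dst[..len];
    let src = &src[..len];
    for i in 0..len {
        dst[i] ^= src[i];
    }
}

/// Iterator version: no indexing inside the loop at all.
pub fn xor_zipped(dst: &mut [u16], src: &[u16], len: usize) {
    for (d, s) in dst[..len].iter_mut().zip(&src[..len]) {
        *d ^= *s;
    }
}
```

All three functions compute the same result and panic on an out-of-range `len`; only the placement of the check differs.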

This PR does not touch AVX code, because when testing, I did not see a noticeable improvement for that case.

Numbers before:

~~~ [ Benchmark case: 1000000 bytes ] ~~~
Encode RUST (10 cycles): 397.182 ms
Decode RUST (10 cycles): 961.006 ms
Encode C++ (10 cycles): 221.003 ms
Decode C++ (10 cycles): 572.489 ms

~~~ [ Benchmark case: 2500000 bytes ] ~~~
Encode RUST (10 cycles): 1018.27 ms
Decode RUST (10 cycles): 2424.6 ms
Encode C++ (10 cycles): 602.386 ms
Decode C++ (10 cycles): 1459.45 ms

~~~ [ Benchmark case: 5000000 bytes ] ~~~
Encode RUST (10 cycles): 2043.65 ms
Decode RUST (10 cycles): 4813.27 ms
Encode C++ (10 cycles): 1208.36 ms
Decode C++ (10 cycles): 2892.14 ms

~~~ [ Benchmark case: 10000000 bytes ] ~~~
Encode RUST (10 cycles): 4095.26 ms
Decode RUST (10 cycles): 9.61965 s
Encode C++ (10 cycles): 2416.67 ms
Decode C++ (10 cycles): 5.76792 s

Numbers now:

~~~ [ Benchmark case: 1000000 bytes ] ~~~
Encode RUST (10 cycles): 335.291 ms -> 18.5% better than master
Decode RUST (10 cycles): 739.528 ms -> 30% better than master
Encode C++ (10 cycles): 211.608 ms
Decode C++ (10 cycles): 562.939 ms

~~~ [ Benchmark case: 2500000 bytes ] ~~~
Encode RUST (10 cycles): 855.59 ms -> 19% better than master
Decode RUST (10 cycles): 1830.64 ms -> 32% better than master
Encode C++ (10 cycles): 559.997 ms
Decode C++ (10 cycles): 1434.1 ms

~~~ [ Benchmark case: 5000000 bytes ] ~~~
Encode RUST (10 cycles): 1730.34 ms -> 18% better than master
Decode RUST (10 cycles): 3633.53 ms -> 32% better than master
Encode C++ (10 cycles): 1177.38 ms
Decode C++ (10 cycles): 2863.14 ms

~~~ [ Benchmark case: 10000000 bytes ] ~~~
Encode RUST (10 cycles): 3475.36 ms -> 17.8% better than master
Decode RUST (10 cycles): 7.25712 s -> 32.5% better than master
Encode C++ (10 cycles): 2372.91 ms
Decode C++ (10 cycles): 5.7262 s
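The "better than master" figures above appear to be speed-ups, computed as old time divided by new time, minus one. This is inferred from the numbers rather than stated anywhere, and the helper names below are hypothetical. The sketch also shows the alternative "time saved" metric, which gives smaller percentages for the same measurements.

```rust
/// Speed-up in percent: how much more work per unit of time the new
/// version does, (before / after - 1) * 100.
pub fn speedup_percent(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms / after_ms - 1.0) * 100.0
}

/// Reduction in wall-clock time in percent, (1 - after / before) * 100.
pub fn time_reduction_percent(before_ms: f64, after_ms: f64) -> f64 {
    (1.0 - after_ms / before_ms) * 100.0
}
```

For the 1000000-byte encode case (397.182 ms before, 335.291 ms after), the speed-up is about 18.5%, while the time reduction is about 15.6%.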

The only thing preventing this implementation from being as fast as kagome's C++ implementation is the few lines annotated with:

// TODO: Optimising bounds checks on this line will yield a great performance improvement.

I haven't yet managed to get the compiler to elide the bounds checks in those cases. Another way of achieving this would be to add a bit of unsafe code.
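As an illustration of the unsafe alternative, here is a minimal sketch using `get_unchecked` on a table lookup with data-dependent indices, a case where the optimiser typically cannot prove the index is in range. The functions and the lookup workload are hypothetical and are not the annotated lines from this PR.

```rust
/// Safe version: every `table[i]` lookup is bounds-checked, because
/// the indices come from data the compiler knows nothing about.
pub fn gather_sum_checked(table: &[u32], indices: &[usize]) -> u32 {
    let mut sum = 0u32;
    for &i in indices {
        sum = sum.wrapping_add(table[i]);
    }
    sum
}

/// Unsafe version: validate all indices once, then skip the
/// per-lookup check in the hot loop.
pub fn gather_sum_unchecked(table: &[u32], indices: &[usize]) -> u32 {
    // Single up-front check; panics if any index is out of range.
    let max = indices.iter().copied().max();
    assert!(max.map_or(true, |m| m < table.len()), "index out of range");

    let mut sum = 0u32;
    for &i in indices {
        // SAFETY: the assertion above guarantees `i < table.len()`.
        sum = sum.wrapping_add(unsafe { *table.get_unchecked(i) });
    }
    sum
}
```

The trade-off is that correctness now rests on the assertion and the `SAFETY` comment staying in sync with the loop, which is something the compiler no longer verifies.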

This brings a performance improvement of 40-100%, making this implementation as fast as the C++ alternative in kagome. Where possible, the compiler is aided to optimise away the bounds checks without any unsafe code. However, a fair amount of unsafe code was still needed. It doesn't lower the security posture, as the needed assertions were already being made.

Signed-off-by: alindima <alin@parity.io>

alindima added 2 commits December 18, 2023 15:17

fix clippy

a3b5159

alindima requested a review from ordian December 18, 2023 13:49

alindima mentioned this pull request Dec 18, 2023

Compare with kagome's impl #14

Closed

alindima added 4 commits December 19, 2023 14:11

Merge remote-tracking branch 'origin/master' into alindima/bounds-checks-optimisation

748bdb9

switch to using safe optimisations

8d0ade9

Merge remote-tracking branch 'origin/master' into alindima/bounds-checks-optimisation

41372f2

revert some changes

6d67787

alindima changed the title from "optimise bounds checks" to "optimise some of the bounds checks" Jan 11, 2024

ordian merged commit 886be0e into master Jan 11, 2024
9 checks passed

ordian deleted the alindima/bounds-checks-optimisation branch January 11, 2024 10:36
