
Optimizations #96

Merged
merged 5 commits into from
Aug 29, 2024

Conversation

turol
Contributor

@turol turol commented Aug 6, 2024

A few cleanups, a benchmark script and some optimizations.

Disassembling a ~60MB binary on my Intel i7-12700H laptop:

    N           Min           Max        Median           Avg        Stddev
x  20         1.111         1.136         1.118        1.1192  0.0059701009
+  20         1.093         1.115           1.1       1.10075  0.0058478516
Difference at 95.0% confidence
	-0.01845 +/- 0.00378221
	-1.6485% +/- 0.337939%
	(Student's t, pooled s = 0.00590929)

@turol
Contributor Author

turol commented Aug 9, 2024

Triggers UBSAN failures. You should probably have that in your CI...

@turol turol marked this pull request as draft August 9, 2024 11:59
@ianichitei
Contributor

ianichitei commented Aug 9, 2024

Yes, the GitHub CI does not run any tests. It is on my TODO list, but I haven't had the time to add that yet. Is UBSAN/ASAN triggered only by these changes?

I experimented with this and I'm not sure I'm running it right, or that I'm interpreting the results the right way.

For example, comparing a disasmtool compiled from 8b4cc48 with the one from this PR, I get:

$ ./benchmark.sh ./old ./new in.bin 10
x before.txt
+ after.txt
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                       x              +                                                            +                                                                                           |
|x       x              x      x       +   x          x                      +                  x   *   +  x       +                             +               +                             +|
|         |________________________________M_____A________|_____________________________|_______________M__A________________________________________________|                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         1.406         1.434         1.417        1.4186   0.010297788
+  10         1.416         1.456         1.433        1.4339   0.012931014
Difference at 95.0% confidence
        0.0153 +/- 0.0109827
        1.07853% +/- 0.774195%
        (Student's t, pooled s = 0.0116888)

Doesn't this mean that the new version is actually slower?

Compiled for release with gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, running on a Ubuntu 22.04.4 LTS (via WSL2 on Windows 11 23H2), with a 13th Gen Intel(R) Core(TM) i7-1370P CPU.

@turol
Contributor Author

turol commented Aug 9, 2024

UBSAN was triggered with these changes, the previous code is clean. The branchless sign extension is not OK when the input is already 64-bit.

That output means that it is indeed slower. I was using GCC 13 on Debian testing running on real hardware. It could also still be a measurement artifact. I was using 20 samples.

@turol
Contributor Author

turol commented Aug 9, 2024

What is the size of your in.bin?

Do you get the expected null result if old and new are identical binaries?

On an AMD 5950x with Gentoo, GCC 13 and the same ~60 MB binary with 100 samples:
Two identical copies of the old binary

    N           Min           Max        Median           Avg        Stddev
x 100         1.421         1.547         1.465       1.46747   0.020762464
+ 100         1.422         1.506         1.462       1.46093   0.020405117
Difference at 95.0% confidence
        -0.00654 +/- 0.00570575
        -0.445665% +/- 0.388815%
        (Student's t, pooled s = 0.0205846)

Old vs new code

    N           Min           Max        Median           Avg        Stddev
x 100         1.427         1.574         1.493       1.48831   0.023063678
+ 100         1.386         1.524         1.424       1.42498   0.021953109
Difference at 95.0% confidence
        -0.06333 +/- 0.00624091
        -4.25516% +/- 0.419328%
        (Student's t, pooled s = 0.0225152)

The performance characteristics of modern systems are just completely f***** up. Maybe I should configure it for 99% confidence and always do more samples, but that takes forever to run.

@ianichitei
Contributor

I used a 58M file I obtained by concatenating a bunch of shellcodes together.

If I use the same binary I get timings that are much closer:

$ ./benchmark.sh ./old ./old in.bin 10
    N           Min           Max        Median           Avg        Stddev
x  10         1.456         1.518         1.502        1.4941   0.021855332
+  10         1.452         1.527         1.502        1.4937    0.02150478
No difference proven at 95.0% confidence

But this is just a small sample. These kinds of changes are small enough that if we want to accurately measure them we probably need to build some micro benchmarks.

@turol
Contributor Author

turol commented Aug 10, 2024

I managed to fix UBSAN. It turns out that sometimes ND_SIGN_EX is called with sz == 0 ...

@turol turol marked this pull request as ready for review August 10, 2024 15:19
@turol
Contributor Author

turol commented Aug 19, 2024

What do I need to do to get this merged?

@vlutas
Collaborator

vlutas commented Aug 22, 2024

Hello,

We'll go through the PR soon, and we will let you know if changes are needed.

Thanks for the contribution!

@vlutas
Collaborator

vlutas commented Aug 22, 2024

First of all, looking at your benchmark, I see that you used the nv option to skip the textual disassembly, which is good. disasmtool also includes an option to run an internal benchmark, which measures how long it takes to decode the instructions - check out the iv option.
Secondly, the way you run the benchmark, the measurement will be skewed by a lot of different factors, such as file I/O, interrupts that might be generated, etc. - this is because you measure the runtime of the entire executable. It is much better to micro-benchmark the exact event you want to measure - in this case, only the instruction decoding step (this is what the iv option does).
Finally, I measured the old code base (the current one) and the new code base (your modifications), and it shows that your changes make the decoding slightly slower (4.7% slower for regular code, 5.2% slower for random data). Test methodology involved decoding 2 files, each 1G in size: one containing regular x64 code, and one containing random data:

Regular code, 1G file:
    Old disasm: Disassembled 304551208 instructions in 31194ms, 9763134.1925 instructions/second, 247.169054 clocks/instruction, average ilen 3.525704 bytes
    New disasm: Disassembled 304551208 instructions in 32545ms, 9357849.3778 instructions/second, 258.988099 clocks/instruction, average ilen 3.525704 bytes
Random bytes, 1G file:
    Old disasm: Disassembled 415555063 instructions in 45722ms, 9088733.2794 instructions/second, 267.193588 clocks/instruction, average ilen 2.822585 bytes
    New disasm: Disassembled 415555063 instructions in 47967ms, 8663353.2012 instructions/second, 281.134721 clocks/instruction, average ilen 2.822585 bytes

Of course, being ~5% slower is not necessarily bad by itself, but the point is that I fail to see any consistent decoding performance improvement with the current changes.

@turol
Contributor Author

turol commented Aug 22, 2024

I/O effects should have been eliminated by the pre-runs before the actual benchmark loop. Interrupt effects can't be eliminated without radically changing the system configuration. If the OS wants to punt our program from the CPU to do something else, it will. We can only minimize the effect by running more samples.

I replaced the use of time with instructions/second from the -iv output. Now a higher value is better. However, I still get the same result.

You appear to be using clock() to measure time without dividing by CLOCKS_PER_SEC:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/clock?view=msvc-170
So some of my numbers differ from yours by a factor of 1000.

Numbers below are from my Intel i7-12700H laptop, but I get broadly similar results on an AMD 5950X.

For a real executable (.text section only, 32MB):

Disassembled 19791808 instructions in 1156758ms, 17109.7222 instructions/second, 141.391124 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1094370ms, 18085.1156 instructions/second, 132.392333 clocks/instruction, average ilen 3.393740 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     16955.914     17147.389     17066.273     17059.275     50.281244
+  30     17805.978     18094.127      18026.59     18007.061     63.651212
Difference at 95.0% confidence
	947.786 +/- 29.6487
	5.55584% +/- 0.173798%
	(Student's t, pooled s = 57.3571)

And 32MB of randomness:

Disassembled 12987011 instructions in 856135ms, 15169.3495 instructions/second, 161.957771 clocks/instruction, average ilen 2.822512 bytes
Disassembled 12987011 instructions in 809213ms, 16048.9401 instructions/second, 152.054722 clocks/instruction, average ilen 2.822512 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     14934.328      15169.35     15072.905      15062.49     58.431712
+  30      15013.99      16048.94     15972.651      15932.42     181.52074
Difference at 95.0% confidence
	869.93 +/- 69.7011
	5.77547% +/- 0.462746%
	(Student's t, pooled s = 134.841)

There is something very, very strange going on in your setup.
Are you also using WSL2? Can you try on a real, non-virtualized setup?

Are you only running each benchmark once? The reason I use multiple runs is to:

  1. Get some proper statistics for ministat
  2. Alternate between the two programs to reduce the effect of clock speed rampup or thermal throttling
  3. Reduce the effect of virtual memory layout

@vlutas
Collaborator

vlutas commented Aug 22, 2024

There is nothing strange about my setup - it's bare metal Windows.
What is different though is that you're running the tests on Linux (WSL), while I'm running them on Windows. Ergo, it probably boils down to how the compiler decides to generate & arrange the code (GCC vs. MSVC), as you're reporting an improvement, whereas I see a performance downgrade. We may both be right... :)
However, like I said in my previous comment, I currently fail to see a consistent performance improvement with these changes. I will investigate how it behaves in WSL as well, and I will go through the code changes to see what should be kept and what should be modified in order to get a consistent performance improvement.
Thank you again for your contribution!

@@ -3378,22 +3378,22 @@ NdFindInstruction(
break;
}

while ((!stop) && (ND_NULL != pTable))
Collaborator

Adding so many gotos inside the while loop makes the code quite hard to follow. I wouldn't make such changes to the decoder loop, unless they offer a significant benefit.

Contributor Author

This commit alone is about 2% faster. I'd call that significant.

Collaborator

It may do so on Linux, but it brings no benefit on Windows (in fact, I see a small performance downgrade with this modification). Combined with the harder-to-follow code, I see no reason to merge the changes to the NdFindInstruction function at this moment.
The other modifications seem OK though.

Contributor Author

I've left this bit out for now.

inc/bdx86_core.h: 3 outdated review comments (resolved)
@turol
Contributor Author

turol commented Aug 24, 2024

Updated. New numbers:

Disassembled 19791808 instructions in 1112486ms, 17790.6131 instructions/second, 135.348606 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1068668ms, 18520.0717 instructions/second, 128.996045 clocks/instruction, average ilen 3.393740 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     17568.559     17824.131     17771.013     17764.312     49.819621
+  30     18105.614     18600.397      18529.66     18490.219      103.5881
Difference at 95.0% confidence
	725.907 +/- 42.0142
	4.08632% +/- 0.236509%
	(Student's t, pooled s = 81.2788)

@turol
Contributor Author

turol commented Aug 24, 2024

And I'm running Linux on real hardware, no virtualization. @ianichitei was using WSL2 which I suspected might be screwing up the timings.

@turol
Contributor Author

turol commented Aug 24, 2024

ND_MSB needs a cast to ND_UINT64 or ubsan fails at

Operand->Info.Memory.Disp = ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement);

because a ND_UINT32 is shifted by 63 bits.

But with it there's a warning at

sign = ND_MSB(dst.Size, res.Value.Qwords[0]);

What would be the best way to fix this?

@vlutas
Collaborator

vlutas commented Aug 26, 2024

> ND_MSB needs a cast to ND_UINT64 or ubsan fails at
>
> Operand->Info.Memory.Disp = ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement);
>
> because a ND_UINT32 is shifted by 63 bits.
>
> But with it there's a warning at
>
> sign = ND_MSB(dst.Size, res.Value.Qwords[0]);
>
> What would be the best way to fix this?

Hello! The problem there is that the displacement sign extension takes place even if there is no displacement (the DispLength is 0). This is the case here and here. With the new macro, you should make sure to skip sign-extending the displacement if there is none, as it will result in undefined behavior due to the 0-length displacement: Operand->Info.Memory.Disp = Instrux->HasDisp ? ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement) : 0;.

The ND_MSB can therefore remain simple: #define ND_MSB(sz, x) ((((x)) >> (((sz) * 8) - 1)) & 1), with no need for type-casts. This also ensures we won't break anyone's code.

@turol
Contributor Author

turol commented Aug 26, 2024

Thanks, that did it.

benchmark.sh Outdated
*) ;;
esac

truncate -s 0 before.txt
Collaborator

Instead of using hard-coded output files, maybe use some names based on the disasmtool used - for example, $FIRST.result and $SECOND.result.

Contributor Author

Done. Also improved error messages and cleaned up white space.

@vlutas vlutas merged commit c877b50 into bitdefender:master Aug 29, 2024
8 checks passed
@turol turol deleted the optimize branch August 31, 2024 13:58