
Optimizations #96

Merged
merged 5 commits into from
Aug 29, 2024

Conversation

turol
Contributor

@turol turol commented Aug 6, 2024

A few cleanups, a benchmark script and some optimizations.

Disassembling a ~60MB binary on my Intel i7-12700H laptop:

    N           Min           Max        Median           Avg        Stddev
x  20         1.111         1.136         1.118        1.1192  0.0059701009
+  20         1.093         1.115           1.1       1.10075  0.0058478516
Difference at 95.0% confidence
	-0.01845 +/- 0.00378221
	-1.6485% +/- 0.337939%
	(Student's t, pooled s = 0.00590929)

@turol
Contributor Author

turol commented Aug 9, 2024

Triggers UBSAN failures. You should probably have that in your CI...

@turol turol marked this pull request as draft August 9, 2024 11:59
@ianichitei
Contributor

ianichitei commented Aug 9, 2024

Yes, the GitHub CI does not run any tests. It is on my TODO list, but I haven't had the time to add that yet. Is UBSAN/ASAN triggered only by these changes?

I experimented with this and I'm not sure I'm running it right, or that I'm interpreting the results the right way.

For example, comparing a disasmtool compiled from 8b4cc48 with the one from this PR, I get:

$ ./benchmark.sh ./old ./new in.bin 10
x before.txt
+ after.txt
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                       x              +                                                            +                                                                                           |
|x       x              x      x       +   x          x                      +                  x   *   +  x       +                             +               +                             +|
|         |________________________________M_____A________|_____________________________|_______________M__A________________________________________________|                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         1.406         1.434         1.417        1.4186   0.010297788
+  10         1.416         1.456         1.433        1.4339   0.012931014
Difference at 95.0% confidence
        0.0153 +/- 0.0109827
        1.07853% +/- 0.774195%
        (Student's t, pooled s = 0.0116888)

Doesn't this mean that the new version is actually slower?

Compiled for release with gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, running on a Ubuntu 22.04.4 LTS (via WSL2 on Windows 11 23H2), with a 13th Gen Intel(R) Core(TM) i7-1370P CPU.

@turol
Contributor Author

turol commented Aug 9, 2024

UBSAN was triggered with these changes, the previous code is clean. The branchless sign extension is not OK when the input is already 64-bit.

That output means that it is indeed slower. I was using GCC 13 on Debian testing running on real hardware. It could also still be a measurement artifact. I was using 20 samples.

@turol
Contributor Author

turol commented Aug 9, 2024

What is the size of your in.bin?

Do you get the expected null result if old and new are identical binaries?

On an AMD 5950x with Gentoo, GCC 13 and the same ~60 MB binary with 100 samples:
Two identical copies of the old binary

    N           Min           Max        Median           Avg        Stddev
x 100         1.421         1.547         1.465       1.46747   0.020762464
+ 100         1.422         1.506         1.462       1.46093   0.020405117
Difference at 95.0% confidence
        -0.00654 +/- 0.00570575
        -0.445665% +/- 0.388815%
        (Student's t, pooled s = 0.0205846)

Old vs new code

    N           Min           Max        Median           Avg        Stddev
x 100         1.427         1.574         1.493       1.48831   0.023063678
+ 100         1.386         1.524         1.424       1.42498   0.021953109
Difference at 95.0% confidence
        -0.06333 +/- 0.00624091
        -4.25516% +/- 0.419328%
        (Student's t, pooled s = 0.0225152)

The performance characteristics of modern systems are just completely f***** up. Maybe I should configure it for 99% confidence and always do more samples, but that takes forever to run.

@ianichitei
Contributor

I used a 58M file I obtained by concatenating a bunch of shellcodes together.

If I use the same binary I get timings that are much closer:

$ ./benchmark.sh ./old ./old in.bin 10
    N           Min           Max        Median           Avg        Stddev
x  10         1.456         1.518         1.502        1.4941   0.021855332
+  10         1.452         1.527         1.502        1.4937    0.02150478
No difference proven at 95.0% confidence

But this is just a small sample. These kinds of changes are small enough that if we want to accurately measure them we probably need to build some micro benchmarks.

@turol
Contributor Author

turol commented Aug 10, 2024

I managed to fix UBSAN. It turns out that sometimes ND_SIGN_EX is called with sz == 0 ...

@turol turol marked this pull request as ready for review August 10, 2024 15:19
@turol
Contributor Author

turol commented Aug 19, 2024

What do I need to do to get this merged?

@vlutas
Collaborator

vlutas commented Aug 22, 2024

Hello,

We'll go through the PR soon, and we will let you know if changes are needed.

Thanks for the contribution!

@vlutas
Collaborator

vlutas commented Aug 22, 2024

First of all, looking at your benchmark, I see that you used the nv option to skip the textual disassembly, which is good. disasmtool also includes an option to run an internal benchmark, which measures how long it takes to decode the instructions - check out the iv option.
Secondly, the way you run the benchmark, the measurement will be skewed by a lot of different factors, such as file I/O, interrupts that might be generated, etc. - this is because you measure the runtime of the entire executable. It is much better to micro-benchmark the exact event you want to measure - in this case, only the instruction decoding step (this is what the iv option does).
Finally, I measured the old code base (the current one) and the new code base (your modifications), and it shows that your changes make the decoding slightly slower (4.7% slower for regular code, 5.2% slower for random data). Test methodology involved decoding 2 files, each 1G in size: one containing regular x64 code, and one containing random data:

Regular code, 1G file:
    Old disasm: Disassembled 304551208 instructions in 31194ms, 9763134.1925 instructions/second, 247.169054 clocks/instruction, average ilen 3.525704 bytes
    New disasm: Disassembled 304551208 instructions in 32545ms, 9357849.3778 instructions/second, 258.988099 clocks/instruction, average ilen 3.525704 bytes
Random bytes, 1G file:
    Old disasm: Disassembled 415555063 instructions in 45722ms, 9088733.2794 instructions/second, 267.193588 clocks/instruction, average ilen 2.822585 bytes
    New disasm: Disassembled 415555063 instructions in 47967ms, 8663353.2012 instructions/second, 281.134721 clocks/instruction, average ilen 2.822585 bytes

Of course, being ~5% slower is not necessarily bad by itself, but the point is that I fail to see any consistent decoding performance improvement with the current changes.

@turol
Contributor Author

turol commented Aug 22, 2024

I/O effects should have been eliminated by the pre-runs before the actual benchmark loop. Interrupt effects can't be eliminated without radically changing the system configuration. If the OS wants to punt our program from the CPU to do something else, it will. We can only minimize the effect by running more samples.

I replaced the use of time with instructions/second from the -iv output. Now a higher value is better. However, I still get the same result.

You appear to be using clock() to measure time without dividing by CLOCKS_PER_SEC:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/clock?view=msvc-170
So some of my numbers differ from yours by a factor of 1000.

Numbers below are from my Intel i7-12700H laptop, but I get broadly similar results on an AMD 5950X.

For a real executable (.text section only, 32MB):

Disassembled 19791808 instructions in 1156758ms, 17109.7222 instructions/second, 141.391124 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1094370ms, 18085.1156 instructions/second, 132.392333 clocks/instruction, average ilen 3.393740 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     16955.914     17147.389     17066.273     17059.275     50.281244
+  30     17805.978     18094.127      18026.59     18007.061     63.651212
Difference at 95.0% confidence
	947.786 +/- 29.6487
	5.55584% +/- 0.173798%
	(Student's t, pooled s = 57.3571)

And 32MB of randomness:

Disassembled 12987011 instructions in 856135ms, 15169.3495 instructions/second, 161.957771 clocks/instruction, average ilen 2.822512 bytes
Disassembled 12987011 instructions in 809213ms, 16048.9401 instructions/second, 152.054722 clocks/instruction, average ilen 2.822512 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     14934.328      15169.35     15072.905      15062.49     58.431712
+  30      15013.99      16048.94     15972.651      15932.42     181.52074
Difference at 95.0% confidence
	869.93 +/- 69.7011
	5.77547% +/- 0.462746%
	(Student's t, pooled s = 134.841)

There is something very, very strange going on in your setup.
Are you also using WSL2? Can you try on a real, non-virtualized setup?

Are you only running each benchmark once? The reason I use multiple runs is to:

  1. Get some proper statistics for ministat
  2. Alternate between the two programs to reduce the effect of clock speed rampup or thermal throttling
  3. Reduce the effect of virtual memory layout

@vlutas
Collaborator

vlutas commented Aug 22, 2024

There is nothing strange about my setup - it's bare metal Windows.
What is different though is that you're running the tests on Linux (WSL), while I'm running them on Windows. Ergo, it probably boils down to how the compiler decides to generate & arrange the code (GCC vs. MSVC), as you're reporting an improvement, whereas I see a performance downgrade. We may both be right... :)
However, like I said in my previous comment, I currently fail to see a consistent performance improvement with these changes. I will investigate how it behaves in WSL as well, and I will go through the code changes to see what should be kept and what should be modified in order to get a consistent performance improvement.
Thank you again for your contribution!

@@ -3378,22 +3378,22 @@ NdFindInstruction(
break;
}

while ((!stop) && (ND_NULL != pTable))
Collaborator

Adding so many gotos inside the while loop makes the code quite hard to follow. I wouldn't make such changes to the decoder loop, unless they offer a significant benefit.

Contributor Author

This commit alone is about 2% faster. I'd call that significant.

Collaborator

It may do so on Linux, but it brings no benefit on Windows (in fact, I see a small performance downgrade with this modification). Combined with the harder-to-follow code, I see no reason to merge the changes to the NdFindInstruction function at this moment.
The other modifications seem OK though.

Contributor Author

I've left this bit out for now.

inc/bdx86_core.h: 3 outdated review comments (resolved)
@turol
Contributor Author

turol commented Aug 24, 2024

Updated. New numbers:

Disassembled 19791808 instructions in 1112486ms, 17790.6131 instructions/second, 135.348606 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1068668ms, 18520.0717 instructions/second, 128.996045 clocks/instruction, average ilen 3.393740 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     17568.559     17824.131     17771.013     17764.312     49.819621
+  30     18105.614     18600.397      18529.66     18490.219      103.5881
Difference at 95.0% confidence
	725.907 +/- 42.0142
	4.08632% +/- 0.236509%
	(Student's t, pooled s = 81.2788)

@turol
Contributor Author

turol commented Aug 24, 2024

And I'm running Linux on real hardware, no virtualization. @ianichitei was using WSL2 which I suspected might be screwing up the timings.

@turol
Contributor Author

turol commented Aug 24, 2024

ND_MSB needs a cast to ND_UINT64 or ubsan fails at

Operand->Info.Memory.Disp = ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement);

because a ND_UINT32 is shifted by 63 bits.

But with it there's a warning at

sign = ND_MSB(dst.Size, res.Value.Qwords[0]);

What would be the best way to fix this?

@vlutas
Collaborator

vlutas commented Aug 26, 2024

> ND_MSB needs a cast to ND_UINT64 or ubsan fails at
>
> Operand->Info.Memory.Disp = ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement);
>
> because a ND_UINT32 is shifted by 63 bits.
>
> But with it there's a warning at
>
> sign = ND_MSB(dst.Size, res.Value.Qwords[0]);
>
> What would be the best way to fix this?

Hello! The problem there is that the displacement sign extension takes place even if there is no displacement (the DispLength is 0). This is the case here and here. With the new macro, you should make sure to skip sign-extending the displacement if there is none, as it will result in undefined behavior due to the 0-length displacement: Operand->Info.Memory.Disp = Instrux->HasDisp ? ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement) : 0;.

The ND_MSB can therefore remain simple: #define ND_MSB(sz, x) ((((x)) >> (((sz) * 8) - 1)) & 1), with no need for type-casts. This also ensures we won't break anyone's code.

@turol
Contributor Author

turol commented Aug 26, 2024

Thanks, that did it.

benchmark.sh Outdated
*) ;;
esac

truncate -s 0 before.txt
Collaborator

Instead of using hard-coded output files, maybe use some names based on the disasmtool used - for example, $FIRST.result and $SECOND.result.

Contributor Author

Done. Also improved error messages and cleaned up white space.

@vlutas vlutas merged commit c877b50 into bitdefender:master Aug 29, 2024
8 checks passed
@turol turol deleted the optimize branch August 31, 2024 13:58