
Rewrite math jit helpers to managed code #98858

Closed
wants to merge 36 commits

Conversation

@MichalPetryka (Contributor) commented Feb 23, 2024

Rewrites math helpers that have multiple implementations (C++ for CoreCLR, Windows x86 assembly for CoreCLR, and C#/C++ for NativeAOT) into a single, shared C# implementation.
I've put the new helpers in Math and MathF, but some other place like RuntimeHelpers might be a better fit.

On 64-bit platforms this should behave mostly the same; some direct helper calls might be replaced with indirect ones when the helpers haven't been promoted to Tier 1 yet.

This might result in slight perf regressions on x86 and Arm32 due to the removal of the assembly paths and the managed helpers needing to make an extra call in some cases.

Those regressions could be recovered by converting casts to calls earlier and making them inlineable; that is not handled here to keep the changes simple.

This removes most of the implementation of 4 unused helpers. I think I forwarded them to temporary managed variants for legacy R2R images correctly, but verification would be appreciated.

@huoyaoyuan (Member)

> I've put the new helpers in Math and MathF, but some other place like RuntimeHelpers might be a better fit.

They could get a separate helper type, like CastHelpers.

@MichalPetryka (Contributor, Author)

@MihuBot

@jkotas (Member) commented Feb 23, 2024

> This removes most of the implementation of 4 unused helpers. I think I forwarded them to temporary managed variants for legacy R2R images correctly, but verification would be appreciated.

It would be better to leave these no-longer-used helpers alone and delete them the next time we rev the min R2R version.

@teo-tsirpanis added the community-contribution label on Feb 26, 2024
@khushal1996 (Contributor)

@MichalPetryka
Just wanted to ask whether there is a plan to merge this PR soon, since #97529 will depend on it. Any idea when it will be merged?

@MichalPetryka (Contributor, Author)

> @MichalPetryka Just wanted to ask whether there is a plan to merge this PR soon, since #97529 will depend on it. Any idea when it will be merged?

I've cleaned up the changes now and added some comments. I've also verified that .NET 8 x86 R2R binaries still seem to work, so I think this is only waiting for review now?
One question would be whether the helpers should be moved to some other file.
cc @tannergooding @jkotas

@MichalPetryka (Contributor, Author)

The tests are failing:

Assert failure(PID 9608 [0x00002588], Thread: 9612 [0x258c]): !"Precode::GetPrecodeFromEntryPoint: Unexpected code in precode"

CORECLR! Precode::GetPrecodeFromEntryPoint + 0x4A (0x718b1743)
CORECLR! CEEJitInfo::getHelperFtn + 0x1EB (0x719160db)

Ah, I might know what this is: it seems not to like the Dbl2UInt helper remaining unmanaged in CoreCLR after I ported it back to managed in NativeAOT.

@filipnavara (Member) commented Mar 15, 2024

We are missing this in win-x86 NativeAOT:

--- a/src/coreclr/nativeaot/Runtime/MathHelpers.cpp
+++ b/src/coreclr/nativeaot/Runtime/MathHelpers.cpp
@@ -271,6 +271,14 @@ FCIMPL1_F(float, tanhf, float x)
     return std::tanhf(x);
 FCIMPLEND

+FCIMPL2_DD(double, fmod, double x, double y)
+    return std::fmod(x, y);
+FCIMPLEND
+
+FCIMPL2_FF(float, fmodf, float x, float y)
+    return std::fmodf(x, y);
+FCIMPLEND
+
 FCIMPL3_DDD(double, fma, double x, double y, double z)
     return std::fma(x, y, z);
 FCIMPLEND

Also, the throw helpers are not properly rooted in the scanning phase:

    *** Methods compiled but not scanned:
    int64.MultiplyChecked(int64,int64)
    [S.P.CoreLib]System.ThrowHelper.ThrowDivideByZeroException()
    uint64.MultiplyChecked(uint64,uint64)
    [S.P.CoreLib]System.Text.StringBuilder+AppendInterpolatedStringHandler.AppendFormatted<uint64>(uint64)
    [S.P.CoreLib]System.Text.StringBuilder+AppendInterpolatedStringHandler.AppendFormattedWithTempSpace<uint64>(uint64,int32,string)
    [S.P.CoreLib]System.Text.StringBuilder+AppendInterpolatedStringHandler.AppendCustomFormatter<uint64>(uint64,string)
    Unhandled exception: System.Exception: Scanning failure
       at ILCompiler.Program.Run() in D:\runtime\src\coreclr\tools\aot\ILCompiler\Program.cs:line 651
       at ILCompiler.ILCompilerRootCommand.<>c__DisplayClass236_0.<.ctor>b__0(ParseResult result) in D:\runtime\src\coreclr\tools\aot\ILCompiler\ILCompilerRootCommand.cs:line 292
       at System.CommandLine.Invocation.InvocationPipeline.Invoke(ParseResult parseResult)

JITHELPER(CORINFO_HELP_DBL2INT_OVF, JIT_Dbl2IntOvf, CORINFO_HELP_SIG_8_STACK)
DYNAMICJITHELPER(CORINFO_HELP_LNG2DBL, NULL, CORINFO_HELP_SIG_8_STACK)
DYNAMICJITHELPER(CORINFO_HELP_ULNG2DBL, NULL, CORINFO_HELP_SIG_8_STACK)
JITHELPER(CORINFO_HELP_DBL2INT, JIT_Dbl2Int, CORINFO_HELP_SIG_8_STACK)
Member

What platform is this needed for?

Arm, Arm64, x86, and x64 all provide a direct instruction for double to int32 as part of the baseline ISAs we require.

Contributor Author

It's unused but can't be removed without an R2R bump. Also, Intel PRs might need it again.

//

[StackTraceHidden]
private static ulong MultiplyChecked(ulong left, ulong right)
Member

This seems like a very expensive way to implement MultiplyChecked.

For UInt128 we simply do BigMul and check if upper is 0: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/UInt128.cs,1379

But more practically speaking, we can determine overflow just by checking if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64), since an x-bit * y-bit multiplication produces at most an (x + y)-bit product; anything else must fit into the given 64 bits.
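
For illustration, a minimal sketch of both checks, assuming unsigned 64-bit operands; the method name is hypothetical and this is not the PR's implementation, but Math.BigMul and BitOperations.LeadingZeroCount are the real APIs involved:

using System;
using System.Numerics;

static class MulOverflowSketch
{
    // Hypothetical helper: lzcnt-based fast path, BigMul-based precise check as the fallback.
    public static ulong MultiplyChecked(ulong left, ulong right)
    {
        // A combined significant-bit count of at most 64 means the product cannot overflow.
        if (128 - BitOperations.LeadingZeroCount(left) - BitOperations.LeadingZeroCount(right) <= 64)
            return left * right;

        // The approximate check failed: do the wide multiply and test the upper 64 bits.
        ulong upper = Math.BigMul(left, right, out ulong lower);
        if (upper != 0)
            throw new OverflowException();
        return lower;
    }
}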

Contributor Author

I'd prefer to change the logic here in a separate PR. Keep in mind that this is for 32-bit platforms, so things like BigMul are slow there.

Member

> I'd prefer to change the logic here in a separate PR

That's fine.

> Keep in mind that this is for 32-bit platforms, so things like BigMul are slow there.

That's why I suggested the extremely cheap check of if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64). That, on a 32-bit platform, ends up being an almost free check to determine overflow without doing any expensive multiplication (it simply uses the lzcnt equivalent that every platform we target, including WASM, natively supports).

Then, you only have to do a regular 64x64-bit multiplication since you know it can't overflow. You don't have to do the extra work (and many branches) to check overflow at each step in the algorithm.

Member

> That, on a 32-bit platform, ends up being an almost free check to determine overflow without doing any expensive multiplication

It looks like an approximate check to determine overflow. If it fails, you still need to take the more expensive multiplication to determine whether it is actually going to overflow, correct?

The current implementation has a variant of this idea where it checks whether the high 32 bits of both numbers are zero. If they are, it knows that there is not going to be an overflow, so it multiplies the low 32-bit portions and returns. If they are not, it does the more expensive multiplication.

I think the difference between what's there currently and your suggestion is in the speed of the fast path and the cutoff for the fast path. The current implementation has a lower cutoff and a faster fast path. Your suggestion has a higher cutoff and a slower fast path.
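
Roughly, the current fast path described above has this shape (a sketch assuming unsigned 64-bit operands; the slow path uses Math.BigMul as a stand-in for the more expensive checked multiplication, it is not the actual helper code):

using System;

static class CurrentShapeSketch
{
    public static ulong MultiplyChecked(ulong left, ulong right)
    {
        // Fast path: both high 32-bit halves are zero, so a 32x32 -> 64-bit multiply cannot overflow.
        if (((left | right) >> 32) == 0)
            return (ulong)(uint)left * (uint)right;

        // Slow-path stand-in: full 64x64 -> 128-bit multiply, then check the upper half.
        ulong upper = Math.BigMul(left, right, out ulong lower);
        if (upper != 0)
            throw new OverflowException();
        return lower;
    }
}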

@tannergooding (Member) commented Mar 16, 2024

> It looks like an approximate check to determine overflow.

Yes.

> you still need to take the more expensive multiplication to determine whether it is actually going to overflow

Not quite.


In general, for an x-bit * y-bit value (excepting where one input is 0), you need a minimum of max(x, y) bits to represent the product and a maximum of x + y bits (for example, 7 * 7 = 49 takes 6 bits, i.e. 3 + 3, while 1 * 7 = 7 still takes only 3).

The minimum bit count comes about because of the special case of multiplying by 1, which returns the original value. Multiplication then adds one additional bit to the output for each power of 2 (multiplying by 2 is equivalent to shifting left by 1, multiplying by 4 is equivalent to shifting left by 2, thus multiplying by 3 must fall between those two and takes less than 2 additional bits).

Thus the number of bits needed is actually ceil(log2(left) + log2(right)) bits.

Finding the input with the smaller length and computing its effective log2 are both incredibly cheap operations that all our targets have direct acceleration to support. So it should always be significantly cheaper than what we're doing here already.

The approach I detailed, checking if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64), is an even cheaper approximation; there is some subset of valid inputs that would trigger it and for which we'd then need to compute the actual bit count.

So it basically just comes down to handling the 99% of inputs that can't possibly overflow as cheaply as possible. This is always fine because lzcnt is a native instruction for Arm, Arm64, and WASM. It is part of our default R2R target for x86/x64 and if you happen to be on x86/x64 that is pre-lzcnt (more than 11 years old at this point), you end up getting (sizeof(T) * 8 - 1) ^ BSR instead. For Arm32 and x86, you at worst get an extra branch and addition in here to account for the upper vs lower halves of the 64-bit input.

For the subset of cases that hit the "it could actually be under 64 bits" path, you then need to do log2, which just uses the already computed lzcnt as input. So we remain with an incredibly efficient fast path that minimizes the total amount of work and doesn't need to do the extra multiplications required to check whether bits 64-127 are non-zero.

@MichalPetryka (Contributor, Author) commented Mar 20, 2024

> The existing code (both the C/C++ version in regular CoreCLR and the managed native AOT version) does not use the BigMul helper, so it does not have this perf bug.

Putting AggressiveInlining on BigMul, like I did in #100027, brings the regression to about 2.9% on CoreCLR on my machine; I'd argue that's good enough in exchange for sharing the implementation between runtimes. It will also likely improve with future optimizations to the JIT, so I'd assume it's going to get better than native in the future (today it emits things like or edx, 0 and add eax, 0).
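
For reference, the inlining change is just an attribute on the method; below is a sketch with a portable 32x32-bit decomposition standing in for the body, since the actual CoreLib implementation is not reproduced here:

using System.Runtime.CompilerServices;

static class BigMulSketch
{
    // Illustrative only: a portable 64x64 -> 128-bit multiply built from 32-bit halves,
    // annotated the way #100027 annotates Math.BigMul. Not the CoreLib code.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong BigMul(ulong a, ulong b, out ulong low)
    {
        ulong aLo = (uint)a, aHi = a >> 32;
        ulong bLo = (uint)b, bHi = b >> 32;

        ulong ll = aLo * bLo;   // low  * low
        ulong lh = aLo * bHi;   // low  * high
        ulong hl = aHi * bLo;   // high * low
        ulong hh = aHi * bHi;   // high * high

        ulong mid = (ll >> 32) + (uint)lh + (uint)hl;
        low = (mid << 32) | (uint)ll;
        return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    }
}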

> It may be a good idea to split this PR into multiple smaller ones so that each of the changes gets the right level of scrutiny.

While I would have agreed with that earlier, I feel like splitting the PR at this point, after most of the review has been done, would only complicate things here.

Member

> I feel like splitting the PR at this point, after most of the review has been done, would only complicate things here.

It would allow me or other maintainers to get these changes merged with higher confidence and less work. In general, multiple smaller PRs are less total work for maintainers to get through than one big PR.

Member

I have peeled off the first chunk into #100222.

Member

#100222 is merged with a number of adjustments. I think the next good-sized chunk to peel off into a separate PR is everything to do with the long multiplication. @MichalPetryka Are you interested in doing that?

Contributor Author

I'll look into splitting this up tomorrow.

@jkotas (Member) commented Mar 16, 2024

/azp run runtime-extra-platforms, runtime-nativeaot outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@jkotas (Member) commented Mar 16, 2024

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@MichalPetryka (Contributor, Author)

The Build linux-arm64 Release NativeAOT_Checked_Libs failure seems like it could be related; I'm not sure why the checked cast helper would no longer throw for float -> int, though, or why it would fail only there.

@jkotas (Member) commented Mar 16, 2024

I expected that it would fail in some other combinations too. We have too many possible combinations in the test matrix and we are not able to test all of them.

@jkotas (Member) commented May 12, 2024

I have extracted the remaining good changes from this PR to #102129

@jkotas closed this on May 12, 2024
@github-actions bot locked and limited conversation to collaborators on Jun 11, 2024
Labels: area-VM-coreclr, community-contribution

10 participants