Implement 128-bit Multiply intrinsic for x86/x64 #58263
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
Issue Details
Background and motivation
Intel x86/x64 provides MUL/IMUL instructions that compute the low and high bits of a multiplication in a single instruction.
API Proposal
namespace System.Runtime.Intrinsics.X86
{
    partial class X86Base
    {
        // the 32-bit/nint variants might not be needed actually as they don't provide much value
        public (uint Lower, uint Upper) Multiply(uint left, uint right);
        public (uint Lower, int Upper) Multiply(int left, int right);
        public (nuint Lower, nuint Upper) Multiply(nuint left, nuint right);
        public (nuint Lower, nint Upper) Multiply(nint left, nint right);
        partial class X64
        {
            public (ulong Lower, ulong Upper) Multiply(ulong left, ulong right);
            public (ulong Lower, long Upper) Multiply(long left, long right);
        }
    }
}
Related DivMod API: #27292
API Usage
In decimal.DecCalc.VarDecFromR8 this would allow:
// Add -power factors of 10, -power <= (29 - 15) = 14.
power = -power;
if (X86.X86Base.X64.IsSupported || Arm.ArmBase.Arm64.IsSupported)
{
    ulong low64;
    ulong hi64 = Math.BigMul(mant, s_ulongPowers10[power], out low64);
    if (hi64 > uint.MaxValue)
        Number.ThrowOverflowException(TypeCode.Decimal);
    result.High = (uint)hi64;
    result.Low64 = low64;
}
else if (power < 10)
{
    uint pow10 = s_powers10[power];
    ulong low64 = UInt32x32To64((uint)mant, pow10);
    ulong hi64 = UInt32x32To64((uint)(mant >> 32), pow10);
    result.Low = (uint)low64;
    hi64 += low64 >> 32;
    result.Mid = (uint)hi64;
    hi64 >>= 32;
    result.High = (uint)hi64;
}
else
{
    UInt64x64To128(mant, s_ulongPowers10[power], ref result);
}
Another example in decimal.DecCalc.VarDecMul that would only be faster on x64 (because on ARM64 BigMul is actually two expensive instructions):
// Highest 32 bits is non-zero. Calculate 5 more partial products.
if (X86.X86Base.X64.IsSupported)
{
    ulong mid64 = tmp;
    (tmp2, tmp) = X86.X86Base.X64.Multiply(d1.High, d2.Low64);
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    (tmp2, ulong tmphigh) = X86.X86Base.X64.Multiply(d2.High, d1.Low64);
    tmp += tmphigh;
    if (mid64 > (mid64 += tmp2)) // add with carry detection
        tmp++;
    bufProd.Mid64 = mid64;
}
else
{
    tmp2 = UInt32x32To64(d1.Low, d2.High);
    tmp += tmp2; // this could generate carry
    uint tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;
    tmp2 = UInt32x32To64(d1.High, d2.Low);
    tmp += tmp2; // this could generate carry
    bufProd.U2 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp2 = ((ulong)tmp3 << 32) | (tmp >> 32);
    tmp = UInt32x32To64(d1.Mid, d2.High);
    tmp += tmp2; // this could generate carry
    tmp3 = 0;
    if (tmp < tmp2) // detect carry
        tmp3 = 1;
    tmp2 = UInt32x32To64(d1.High, d2.Mid);
    tmp += tmp2; // this could generate carry
    bufProd.U3 = (uint)tmp;
    if (tmp < tmp2) // detect carry
        tmp3++;
    tmp = ((ulong)tmp3 << 32) | (tmp >> 32);
}
bufProd.High64 = UInt32x32To64(d1.High, d2.High) + tmp;
hiProd = 5;
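To make the semantics discussed here concrete, the following is a minimal sketch and not the proposed intrinsic itself. `MultiplySketch`, `Multiply`, `MultiplyFrom32` and `AddWithCarry` are hypothetical names chosen for illustration. The first helper assumes the existing `Math.BigMul(ulong, ulong, out ulong)` overload as a stand-in for the 64x64->128 multiply, the second composes the same result from four 32x32->64 partial products (the general style of the fallback paths in the usage examples), and the third isolates the carry-detection idiom used there.

```csharp
using System;

static class MultiplySketch
{
    // Hypothetical software stand-in for the proposed X64.Multiply(ulong, ulong).
    // The (Lower, Upper) tuple order follows the proposal.
    public static (ulong Lower, ulong Upper) Multiply(ulong left, ulong right)
    {
        ulong upper = Math.BigMul(left, right, out ulong lower);
        return (lower, upper);
    }

    // The same 128-bit product composed from four 32x32->64 partial products.
    public static (ulong Lower, ulong Upper) MultiplyFrom32(ulong left, ulong right)
    {
        ulong a0 = (uint)left, a1 = left >> 32;
        ulong b0 = (uint)right, b1 = right >> 32;

        ulong p00 = a0 * b0; // contributes to bits 0..63
        ulong p01 = a0 * b1; // contributes to bits 32..95
        ulong p10 = a1 * b0; // contributes to bits 32..95
        ulong p11 = a1 * b1; // contributes to bits 64..127

        // Sum of three values below 2^32 each, so it cannot overflow 64 bits.
        ulong mid = (p00 >> 32) + (uint)p01 + (uint)p10;

        ulong lower = (mid << 32) | (uint)p00;
        ulong upper = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
        return (lower, upper);
    }

    // Adds value to acc and reports whether the 64-bit addition wrapped around.
    public static bool AddWithCarry(ref ulong acc, ulong value)
    {
        unchecked { acc += value; }
        return acc < value;
    }

    static void Main()
    {
        var viaBigMul = Multiply(ulong.MaxValue, ulong.MaxValue);
        var via32 = MultiplyFrom32(ulong.MaxValue, ulong.MaxValue);
        Console.WriteLine(viaBigMul == via32);

        ulong acc = ulong.MaxValue;
        Console.WriteLine(AddWithCarry(ref acc, 2)); // wraps around, acc becomes 1
    }
}
```

The 32-bit composition needs four multiplies plus several shifts and adds, which is the work a single widening multiply instruction would replace. Whether that is a net win in practice depends on the generated code, as the discussion in this thread points out.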
Why would it be slower?
ARM64 doesn't have a single instruction for BigMul, so it uses 64-bit MUL + UMULH, which are twice as expensive as 32x32->64 UMULL (which
I suppose we could start off with the API being internal only and use it through
There is an existing PR for getting MULX codegen to be actually useful: #37928
Maybe a somewhat unusual question, but wouldn't it make sense to create a new primitive instead of working with an
@tannergooding How about this, together with
Background and motivation
Intel x86/x64 provides MUL/IMUL instructions that compute the low and high bits of a multiplication in a single instruction.
This would be very useful for Math.BigMul implementations (currently using intrinsics on ARM64 and MULX on x64, but that has bad CQ at the moment).
This would also speed up System.Decimal calculations significantly (currently can't use Math.BigMul there because it's slower than the hand-tuned existing code that composes big multiplications from smaller 32x32 multiplications).
API Proposal
Related DivMod API: #27292
API Usage
In decimal.DecCalc.VarDecFromR8 this would allow:
Another example in decimal.DecCalc.VarDecMul that would only be faster on x64 (because on ARM64 BigMul is actually two expensive instructions):