Improve CopySign performance for integer types #90970
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime
@dotnet-policy-service agree
It would be better to include benchmarks for different cases, including positive, negative, throwing, and different sizes. I would expect the results to vary among these.
Codegen diff for Int32, before:

G_M000_IG01: ;; offset=0000H
sub rsp, 40
G_M000_IG02: ;; offset=0004H
mov eax, ecx
neg eax
test ecx, ecx
cmovl ecx, eax
test edx, edx
jl SHORT G_M000_IG05
G_M000_IG03: ;; offset=0011H
test ecx, ecx
jl SHORT G_M000_IG07
mov eax, ecx
G_M000_IG04: ;; offset=0017H
add rsp, 40
ret
G_M000_IG05: ;; offset=001CH
mov eax, ecx
neg eax
G_M000_IG06: ;; offset=0020H
add rsp, 40
ret
G_M000_IG07: ;; offset=0025H
call [CSPlayground.Program:ThrowNegateTwosCompOverflow()]
int3

after:

G_M000_IG01: ;; offset=0000H
sub rsp, 40
G_M000_IG02: ;; offset=0004H
cmp ecx, 0xFFFFFFFF80000000
jne SHORT G_M000_IG04
G_M000_IG03: ;; offset=000CH
test edx, edx
jge SHORT G_M000_IG06
G_M000_IG04: ;; offset=0010H
xor edx, ecx
mov eax, edx
not eax
sar eax, 31
sar edx, 31
sub edx, eax
mov eax, edx
imul eax, ecx
G_M000_IG05: ;; offset=0023H
add rsp, 40
ret
G_M000_IG06: ;; offset=0028H
xor eax, eax
sub eax, 0xFFFFFFFF80000000
jo SHORT G_M000_IG08
G_M000_IG07: ;; offset=0031H
add rsp, 40
ret
G_M000_IG08: ;; offset=0036H
call CORINFO_HELP_OVERFLOW
int3
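The "after" codegen corresponds to a branchless formulation: derive a ±1 factor from the sign of `value ^ sign` and multiply. A minimal C sketch of that idea (function names are hypothetical; the guard mirrors the overflow path in the dump above, and arithmetic right shift of negative values is assumed, as on mainstream compilers):

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 when v >= 0 and -1 when v < 0, without branching:
   (v >> 31) is 0 or -1, (~v >> 31) is -1 or 0, and their
   difference is therefore +1 or -1. */
static int32_t sign_zero_to_one(int32_t v)
{
    return (v >> 31) - (~v >> 31);
}

/* Branchless CopySign for int32: value ^ sign is negative exactly
   when the signs differ, so multiplying by its +-1 sign flips value
   only in that case. The single overflowing input is rejected first. */
static int32_t copy_sign_i32(int32_t value, int32_t sign)
{
    if (value == INT32_MIN && sign >= 0) {
        abort(); /* stands in for the overflow throw in the C# code */
    }
    return value * sign_zero_to_one(value ^ sign);
}
```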
I don't know the exact reason, but there are cases in my project code and in these benchmarks where CopySign is called inside a loop, and using my implementation does improve performance there. I observed that some data show a bimodal distribution, with both peaks faster than the original implementation; I am still unclear about the cause of the bimodal pattern.
I just want to note that throwing an exception is regarded as an exceptional case, so it is not (and should not be) included in the benchmarks' input data and statistics.

Test_0: xor (as a nop)
BenchmarkDotNet v0.13.7, Windows 11 (10.0.22621.2134/22H2/2022Update/SunValley2)

CopySignSByteBenchmark // * Summary *
CopySignInt16Benchmark // * Summary * // * Hints *
CopySignInt32Benchmark // * Summary * // * Hints *
CopySignInt64Benchmark // * Summary * // * Hints *
I'm now concerned about more scenarios.
Is an imul really faster than a cmov? On 32-bit platforms, a 64-bit multiply would undoubtedly be much slower.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static int SignZeroToOne(long value)
{
    return (int)(value >> (64 - 2)) | 1;
}
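As a sanity check on the shift-by-62 variant quoted above: for a signed 64-bit value, the arithmetic shift `value >> 62` can only be -2, -1, 0, or 1, and OR-ing in 1 collapses those to -1 for negative inputs and 1 otherwise. A small C translation of the helper (assuming arithmetic right shift, as the C# code does):

```c
#include <stdint.h>

/* C translation of the quoted helper: (value >> (64 - 2)) | 1.
   v >> 62 is -2 or -1 for negative v and 0 or 1 otherwise, so the
   final OR yields exactly -1 or 1. */
static int32_t sign_zero_to_one_62(int64_t v)
{
    return (int32_t)(v >> 62) | 1;
}
```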
Casting to int32 will introduce an unnecessary truncate and extend on 64-bit platforms, but saves a move on 32-bit platforms.
Also note that we typically put these methods in System.Numerics.BitOperations.
I think returning an int (Int32) for Sign is reasonable, because it is the common return type for sign functions. Also, using a shorter bit length than long (Int64) may provide optimization opportunities for subsequent multiplications. Regarding where to put these methods, I think Sign and Abs are closely related, so it makes sense to group them together. Besides, the original code already referenced some throw helper methods from Math, so moving them to System.Numerics.BitOperations would make the code more scattered.
In such a primitive function, every instruction counts. The helper isn't exposed publicly, so it should take the most performant form. Where needed, you can use #if TARGET_64BIT to provide a different implementation.
We need to understand the performance better.
I think current compilers are not always good at using cmov; this can be seen in the disassembly results. The benchmark data shows that the PR's code is still better than the original code, even on 32-bit platforms, except for the Int128 case. I will continue to modify the code to address this issue.
public static Int128 CopySign_A1(Int128 value, Int128 sign) {
return Int128.CopySign(value, sign);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int SignZeroToOne(Int64 value) {
return unchecked((int)(value >> (64 - 1)) - (int)(~value >> (64 - 1)));
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int SignZeroToOne2(Int64 value) {
return unchecked((int)(value >> (64 - 2)) | 1);
}
public static Int128 CopySign_A2(Int128 value, Int128 sign) {
if (Int64.MinValue != unchecked((Int64)(value >> 64)) || UInt64.MinValue != unchecked((UInt64)value) || Int128.IsNegative(sign)) {
return value * SignZeroToOne(unchecked((Int64)(value >> 64)) ^ unchecked((Int64)(sign >> 64)));
}
return checked(-unchecked((Int64)(value >> 64)));
}
[DoesNotReturn]
public static void ThrowOverflowException() {
throw new OverflowException();
}
public static Int128 CopySign_A3(Int128 value, Int128 sign) {
if (Int64.MinValue == unchecked((Int64)(value >> 64)) && UInt64.MinValue == unchecked((UInt64)value) && Int128.IsNegative(sign)) {
ThrowOverflowException();
}
return value * SignZeroToOne(unchecked((Int64)(value >> 64)) ^ unchecked((Int64)(sign >> 64)));
}
public static Int128 CopySign_A4(Int128 value, Int128 sign) {
if (Int64.MinValue == unchecked((Int64)(value >> 64)) && UInt64.MinValue == unchecked((UInt64)value) && Int128.IsNegative(sign)) {
ThrowOverflowException();
}
return value * SignZeroToOne2(unchecked((Int64)(value >> 64)) ^ unchecked((Int64)(sign >> 64)));
}
[MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
public static Int128 CopySign_A5(Int128 value, Int128 sign) {
Int128 absValue = value;
if (Int128.IsNegative(absValue)) {
absValue = -absValue;
}
if (Int128.IsPositive(sign)) {
if (Int128.IsNegative(absValue)) {
ThrowOverflowException();
}
return absValue;
}
return -absValue;
}
[MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
public static Int128 CopySign_A6(Int128 value, Int128 sign) {
Int128 t = value;
if (Int128.IsPositive(sign)) {
if (Int128.IsNegative(t)) {
t = -t;
if (Int128.IsNegative(t)) {
ThrowOverflowException();
}
}
return t;
}
if (Int128.IsPositive(t)) {
t = -t;
}
return t;
}

CopySignInt128Benchmark // * Summary *

BenchmarkDotNet v0.13.7, Windows 11 (10.0.22621.2134/22H2/2022Update/SunValley2) Jit=RyuJit PowerPlanMode=8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c Toolchain=.NET 8.0
hello?
ping @tannergooding as the area owner.
@LEI-Hongfaan thank you for your contribution and apologies for the delay in the review.
The changes overall look good to me, but there is still room for improvement (please see my comments).
For the benchmarks I've provided in dotnet/performance#3462 I got the following results:
dotnet run -c Release -f net8.0 --filter System.Tests.Perf_*CopySign* --join --corerun D:\projects\runtime\artifacts\bin\testhost\net9.0-windows-Release-x64\shared\Microsoft.NETCore.App\main\corerun.exe D:\projects\forks\copySign\artifacts\bin\testhost\net9.0-windows-Release-x64\shared\Microsoft.NETCore.App\pr\corerun.exe D:\projects\forks\copySign\artifacts\bin\testhost\net9.0-windows-Release-x64\shared\Microsoft.NETCore.App\9.0.0\corerun.exe
BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.2428/22H2/2022Update/SunValley2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-alpha.1.23531.2
[Host] : .NET 8.0.0 (8.0.23.47906), X64 RyuJIT AVX2
suggestions : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PR : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
main : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Type | Method | Job | value | sign | Mean | Ratio |
---|---|---|---|---|---|---|
Perf_Int16 | CopySign | suggestions | 1 | -1 | 0.2648 ns | 0.69 |
Perf_Int16 | CopySign | PR | 1 | -1 | 0.2602 ns | 0.68 |
Perf_Int16 | CopySign | main | 1 | -1 | 0.3834 ns | 1.00 |
Perf_SByte | CopySign | suggestions | 1 | -1 | 0.2660 ns | 0.63 |
Perf_SByte | CopySign | PR | 1 | -1 | 0.2364 ns | 0.56 |
Perf_SByte | CopySign | main | 1 | -1 | 0.4206 ns | 1.00 |
Perf_Int128 | CopySign | suggestions | 1 | -1 | 1.1218 ns | 0.17 |
Perf_Int128 | CopySign | PR | 1 | -1 | 1.4312 ns | 0.22 |
Perf_Int128 | CopySign | main | 1 | -1 | 6.5770 ns | 1.00 |
Perf_Int64 | CopySign | suggestions | 1 | -1 | 0.1840 ns | 0.37 |
Perf_Int64 | CopySign | PR | 1 | -1 | 0.4993 ns | 1.00 |
Perf_Int64 | CopySign | main | 1 | -1 | 0.4980 ns | 1.00 |
Perf_Int32 | CopySign | suggestions | 1 | -1 | 0.2086 ns | 1.02 |
Perf_Int32 | CopySign | PR | 1 | -1 | 0.2003 ns | 0.98 |
Perf_Int32 | CopySign | main | 1 | -1 | 0.2048 ns | 1.00 |
}

return (short)(-absValue);
return (short)(value * Math.SignZeroToOne(value ^ sign));
Multiplication is much more expensive on older CPUs and may significantly regress performance.
It's not necessarily "cheap" on modern CPUs either (still a minimum of 3-4 cycles for 32-bit, and 3-6 cycles for 64-bit), and while reducing branches can be a win, it might hurt things overall and hinder JIT optimizations for some constants.
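One multiply-free alternative along these lines is conditional negation via a sign mask: with `m = (value ^ sign) >> 31`, the identity `(x ^ m) - m` returns `x` when `m == 0` and `-x` when `m == -1`. A hedged C sketch of the idea (names hypothetical; the `INT32_MIN` overflow check is omitted for brevity, and arithmetic right shift is assumed):

```c
#include <stdint.h>

/* m is 0 when value and sign agree in sign, -1 (all ones) when they
   differ; (value ^ m) - m is then value or its two's complement,
   replacing the imul with an xor and a subtract. */
static int32_t copy_sign_no_mul(int32_t value, int32_t sign)
{
    int32_t m = (value ^ sign) >> 31;
    return (value ^ m) - m;
}
```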
// >= 0: returns 1
// < 0: returns -1
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static int SignZeroToOne(int value) |
This name isn't "clear" IMO and needs something that makes its meaning clear.
SignOrOneIfZero?
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
if (IsPositive(sign))
#if TARGET_32BIT
Int128 t = value;
if (Int128.IsPositive(sign)) |
The Int128. prefix is unnecessary, as we are in the Int128 type.

I'm notably not really a fan of this additional complexity to maintain two copies of the code. 32-bit is already extremely pessimized for Int128, and I don't think it's worth the additional micro-optimizations here. Just have the singular code path that is consistent with the other types instead.
return t;
#else
// Int128.MinValue == value && 0 <= sign
if ((long)value._upper == long.MinValue && value._lower == 0 && (long)sign._upper >= 0) |
I'd just do this the simple/readable way as:

- if ((long)value._upper == long.MinValue && value._lower == 0 && (long)sign._upper >= 0)
+ if ((value == Int128.MinValue) && IsPositive(sign))
The JIT should take care of inlining the comparison and IsPositive check to give good codegen.
{
    Math.ThrowNegateTwosCompOverflow();
}
return value * Math.SignOrOneIfZero((long)value._upper ^ (long)sign._upper); |
I'm still not a fan of this name nor of the cost for multiplication (which increases significantly as the size of the type does).
I think the entire logic here can be simplified down to:
if (SameSign(value, sign))
{
return value;
}
else if (value == Int128.MinValue)
{
ThrowNegateTwosCompOverflow();
}
return -value;
where you have:
bool SameSign(Int128 left, Int128 right)
{
// All positive values have the most significant bit clear
// All negative values have the most significant bit set
//
// (x ^ y) produces 0 if the bits are the same and 1 if they
// differ. Therefore, (x ^ y) produces a positive value if the
// signs are the same and a negative value if they differ.
return (long)(left._upper ^ right._upper) >= 0;
}
This gives you up to the same 2 comparisons you currently have, but reduces the bit toggle to simply a two's complement operation (negation), which is far cheaper.
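The suggested shape, scaled down to int64 in C for a self-contained illustration (the actual Int128 version would test the xor of the `_upper` halves instead; names are hypothetical):

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Signs match exactly when the xor of the two values is non-negative:
   the xor of two equal sign bits is 0, of two differing sign bits is 1. */
static int same_sign_i64(int64_t left, int64_t right)
{
    return (left ^ right) >= 0;
}

/* The suggested structure: return early when the signs already match;
   otherwise the only input that cannot be negated is INT64_MIN. */
static int64_t copy_sign_i64(int64_t value, int64_t sign)
{
    if (same_sign_i64(value, sign)) {
        return value;
    }
    if (value == INT64_MIN) {
        abort(); /* stands in for ThrowNegateTwosCompOverflow */
    }
    return -value;
}
```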
The SameSign operation can even be optimized for 32-bit platforms by only comparing the upper half of _upper, which is a much more reasonable platform-specific optimization and keeps it heavily isolated away to only where needed.
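That reduction works because the sign bit of a 64-bit xor lives entirely in the upper 32-bit word, so only the top halves need to be xor-ed on a 32-bit target. A C sketch of the idea (helper names are hypothetical):

```c
#include <stdint.h>

/* Extract the upper 32-bit word of a 64-bit value. */
static int32_t upper32(int64_t v)
{
    return (int32_t)((uint64_t)v >> 32);
}

/* The sign of (a ^ b) for 64-bit a, b depends only on the top bit,
   so xor-ing just the upper words gives the same sign test using
   32-bit operations only. */
static int same_sign_via_upper32(int64_t a, int64_t b)
{
    return (upper32(a) ^ upper32(b)) >= 0;
}
```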
Left some comments above on how this can be simplified while still keeping the perf improvements that you're looking for. Thanks for continuing to work on this.
This pull request has been automatically marked
This pull request will now be closed since it had been marked
Fix #77579