Optimize bswap+mov to movbe on xarch #66965

Merged on May 8, 2022 (11 commits)

aromaa
Contributor

@aromaa aromaa commented Mar 21, 2022

Adds lowering for the pattern BSWAP|BSWAP16(IND) and STOREIND(addr, BSWAP|BSWAP16(x)) on xarch and emits the movbe instruction.

Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization, as their code has been laid out in a way that is not easily recognizable. This is not fixed in this PR.

Fixes #953
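
For readers unfamiliar with the pattern named in the description, here is a minimal C++ sketch of a "load then byte swap" and a "byte swap then store". These are the two shapes (BSWAP(IND) and STOREIND(addr, BSWAP(x))) that a single movbe can replace. The helper names `bswap32`, `host_is_little_endian`, `load_be32` and `store_be32` are illustrative assumptions and do not come from the PR; whether a given compiler actually emits movbe for this code depends on its target flags.

```cpp
#include <cstdint>
#include <cstring>

// Reverse the byte order of a 32-bit value (what BSWAP does in a register).
// Hypothetical helper, written with shifts so it does not rely on compiler builtins.
constexpr std::uint32_t bswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// True when the host stores the least significant byte first (as x86/x64 does).
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Load shape: a plain load (IND) whose result feeds a swap (BSWAP).
inline std::uint32_t load_be32(const unsigned char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return host_is_little_endian() ? bswap32(v) : v;
}

// Store shape: a swap (BSWAP) whose result feeds a plain store (STOREIND).
inline void store_be32(unsigned char* p, std::uint32_t v) {
    if (host_is_little_endian()) {
        v = bswap32(v);
    }
    std::memcpy(p, &v, sizeof v);
}
```

In both functions the swap and the memory access are adjacent, with nothing in between, which is the property the lowering in this PR looks for.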

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 21, 2022
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 21, 2022
@ghost

ghost commented Mar 21, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Adds lowering for the pattern BSWAP(IND) and STOREIND(addr, BSWAP(x)) on xarch and emits the movbe instruction instead. This does not match the 16-bit node BSWAP16, as the importer wraps it inside a short <- int <- short cast, which made it more complicated to deal with.

Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization as they use MemoryMarshal under the hood which breaks the pattern. This should be switched to use Unsafe to take advantage of this, which is not included in this PR.

Fixes #953

Author: aromaa
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@Wraith2
Contributor

Wraith2 commented Mar 22, 2022

Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization as they use MemoryMarshal under the hood which breaks the pattern. This should be switched to use Unsafe to take advantage of this, which is not included in this PR.

Can you expand on this a bit? What is it about them that breaks the pattern?
When I was investigating this I discussed it with some people in discord and chose to rewrite the read and write primitives to avoid a multi use variable and that cleared up the tree considerably for me.
instead of:

public static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
{
    int result = MemoryMarshal.Read<int>(source);
    if (BitConverter.IsLittleEndian)
    {
        result = ReverseEndianness(result);
    }
    return result;
}

I used:

public static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
{
    if (BitConverter.IsLittleEndian)
    {
        return BinaryPrimitives.ReverseEndianness(MemoryMarshal.Read<int>(source));
    }
    else
    {
        return MemoryMarshal.Read<int>(source);
    }
}

@aromaa
Contributor Author

aromaa commented Mar 22, 2022

Can you expand on this a bit? What is it about them that breaks the pattern?

Yes, the read one is trivial to solve as you mentioned above and gets optimized by this PR. The problematic one is MemoryMarshal.Write, which ends up creating bounds checks between the bswap and the mov, so the pattern can't be recognized easily. To fix this, the bounds check needs to be written manually before ReverseEndianness.

The IR for writing is the following:

N003 (???,???) [000115] ------------                 IL_OFFSET void   INLRT @ 0x000[E-] REG NA
N005 (  1,  1) [000091] -------N----        t91 =    LCL_VAR   byref  V00 arg0         u:1 rcx Zero Fseq[_pointer] REG rcx $80
                                                  /--*  t91    byref
N007 (  3,  2) [000092] n-----------        t92 = *  IND       byref  REG rax <l:$140, c:$81>
                                                  /--*  t92    byref
N009 (  7,  5) [000093] DA----------              *  STORE_LCL_VAR byref  V12 tmp10        d:1 rax REG rax
N011 (  1,  1) [000000] -------N----         t0 =    LCL_VAR   byref  V00 arg0         u:1 rcx (last use) REG rcx $80
                                                  /--*  t0     byref
N013 (  2,  2) [000096] -c----------        t96 = *  LEA(b+8)  byref  REG NA
                                                  /--*  t96    byref
N015 (  4,  4) [000097] n-----------        t97 = *  IND       int    REG rcx <l:$200, c:$c1>
                                                  /--*  t97    int
N017 (  4,  4) [000098] DA----------              *  STORE_LCL_VAR int    V13 tmp11        d:1 rcx REG rcx
N019 (???,???) [000116] ------------                 IL_OFFSET void   INL01 @ 0x000[E-] <- INLRT @ 0x000[E-] REG NA
N021 (  1,  1) [000001] ------------         t1 =    LCL_VAR   int    V01 arg1         u:1 rdx (last use) REG rdx $c0
                                                  /--*  t1     int
N023 (  2,  2) [000007] ------------         t7 = *  BSWAP     int    REG rdx $c2
                                                  /--*  t7     int
N025 (  2,  3) [000009] DA----------              *  STORE_LCL_VAR int    V04 tmp2         d:1 rdx REG rdx
N027 (???,???) [000117] ------------                 IL_OFFSET void   INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N029 (  3,  2) [000101] ------------       t101 =    LCL_VAR   byref  V12 tmp10        u:1 rax (last use) REG rax <l:$140, c:$81>
                                                  /--*  t101   byref
N031 (  3,  3) [000102] DA----------              *  STORE_LCL_VAR byref  V14 tmp12        d:1 rax REG rax
N033 (???,???) [000118] ------------                 IL_OFFSET void   INL03 @ ??? <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N035 (  1,  1) [000069] ------------        t69 =    LCL_VAR   int    V13 tmp11        u:1 rcx (last use) REG rcx <l:$200, c:$c1>
N037 (  1,  1) [000065] -c----------        t65 =    CNS_INT   int    4 REG NA $45
                                                  /--*  t69    int
                                                  +--*  t65    int
N039 (  3,  3) [000041] N------N-U--              *  LT        void   REG NA <l:$281, c:$280>
N041 (  5,  5) [000042] ------------              *  JTRUE     void   REG NA $VN.Void

------------ BB02 [000..001) (return), preds={BB01} succs={}
N045 (???,???) [000119] ------------                 IL_OFFSET void   INL03 @ 0x02B[E-] <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N047 (???,???) [000120] ------------                 IL_OFFSET void   INL08 @ 0x000[E-] <- INL03 @ 0x02B[E-] <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N049 (  1,  1) [000080] ------------        t80 =    LCL_VAR   byref  V14 tmp12        u:1 rax (last use) REG rax <l:$140, c:$81>
N051 (  1,  1) [000081] ------------        t81 =    LCL_VAR   int    V04 tmp2         u:1 rdx (last use) REG rdx $c2
                                                  /--*  t80    byref
                                                  +--*  t81    int
N053 (???,???) [000121] -A-XG-------              *  STOREIND  int    REG NA
N055 (???,???) [000122] ------------                 IL_OFFSET void   INLRT @ 0x007[E-] REG NA
N057 (  0,  0) [000005] ------------                 RETURN    void   REG NA $VN.Void

------------ BB03 [000..001) (throw), preds={BB01} succs={}
N061 (???,???) [000123] ------------                 IL_OFFSET void   INL03 @ 0x024[E-] <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N063 (  1,  1) [000052] ------------        t52 =    CNS_INT   int    41 REG rcx $46
                                                  /--*  t52    int
N065 (???,???) [000124] ------------       t124 = *  PUTARG_REG int    REG rcx
N067 (  2, 10) [000125] Hc----------       t125 =    CNS_INT(h) long   0x7ffa2e025c78 ftn REG NA
                                                  /--*  t125   long
N069 (  4, 12) [000126] -c----------       t126 = *  IND       long   REG NA
                                                  /--*  t124   int    arg0 in rcx
                                                  +--*  t126   long   control expr
N071 ( 15,  7) [000053] --CXG-------              *  CALL      void   System.ThrowHelper.ThrowArgumentOutOfRangeException REG NA $VN.Void
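
The ordering fix described in this comment (do the bounds check first, so that the byte swap is immediately followed by the store) can be sketched in C++ as follows. This is only an analogue of the idea, not the BinaryPrimitives code: `reverse_endianness_u32` and `write_uint32_big_endian` are hypothetical names, and the sketch assumes a little-endian host such as x64.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical helper: reverse the byte order of a 32-bit value.
constexpr std::uint32_t reverse_endianness_u32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Hypothetical analogue of a big-endian write into a span (pointer + length).
// Assumes a little-endian host, so the swap is always needed.
inline void write_uint32_big_endian(unsigned char* dest, std::size_t length,
                                    std::uint32_t value) {
    // 1. Bounds check first, so that no branch sits between the swap and the store.
    if (length < sizeof(std::uint32_t)) {
        throw std::out_of_range("destination too small");
    }
    // 2. Swap, immediately followed by the store: the STOREIND(addr, BSWAP(x)) shape.
    const std::uint32_t swapped = reverse_endianness_u32(value);
    std::memcpy(dest, &swapped, sizeof swapped);
}
```

In the IR above the order is reversed: BSWAP is computed in BB01, the length comparison and JTRUE follow, and STOREIND only appears in BB02, so the swap and the store end up in different blocks.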

@MichalStrehovsky
Member

MichalStrehovsky commented Mar 22, 2022

Once this is ready, could you please also add this for NativeAOT configs? The blueprint for NativeAOT-specific changes is in #63563 - it should be mostly mechanical.

Since there's no public API, the extent of needed changes will be smaller than the above pull request.

@jakobbotsch
Member

@aromaa Can you let me know when this is ready to be reviewed again?

@jakobbotsch
Member

@aromaa can you merge from main? I think superpmi-diffs is failing because #68292 was merged in the meantime and it does not have a baseline release JIT that matches for this branch.

@aromaa
Contributor Author

aromaa commented Apr 25, 2022

@aromaa can you merge from main? I think superpmi-diffs is failing because #68292 was merged in the meantime and it does not have a baseline release JIT that matches for this branch.

The diffs are actually failing because jiteeversionguid.h was changed in the PR, since the ISA was modified. You would get bogus diffs if it tried to run them. I tried to do a local collection but had some trouble with it a few days ago; I'm planning to get the diffs before merging.

@jakobbotsch
Member

The diffs are actually failing because jiteeversionguid.h was changed in the PR, since the ISA was modified. You would get bogus diffs if it tried to run them. I tried to do a local collection but had some trouble with it a few days ago; I'm planning to get the diffs before merging.

Ah, of course. Usually in this case we would use jit-diff instead. But your change is also not really incompatible with previous collections, so it might be easiest to just make a temporary hack of the JIT-EE GUID/ISA check to collect SPMI diffs.

@aromaa
Contributor Author

aromaa commented Apr 25, 2022

Ah, of course. Usually in this case we would use jit-diff instead. But your change is also not really incompatible with previous collections, so it might be easiest to just make a temporary hack of the JIT-EE GUID/ISA check to collect SPMI diffs.

I tried changing that, but it gave me bogus diffs where POPCNT and MOVBE were missing, so I gave up on that. Removing the ISA before running the diffs works fine, but I didn't bother to do that too many times and relied on the test cases.

@aromaa
Contributor Author

aromaa commented Apr 26, 2022

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 2.4243817735148437E+31
Total PerfScoreUnits of diff: 2.4243817735148437E+31
Total PerfScoreUnits of delta: -16,86 (-0.00 % of base)
Total relative delta: NaN
    diff is an improvement.
    relative diff is a regression.
Detail diffs
Top file regressions (PerfScoreUnits):
        0,25 : System.Net.Sockets.dasm (0,00 % of base)
        0,20 : System.Diagnostics.DiagnosticSource.dasm (0,00 % of base)

Top file improvements (PerfScoreUnits):
       -9,70 : System.Private.CoreLib.dasm (-0,00 % of base)
       -5,20 : System.Formats.Cbor.dasm (-0,03 % of base)
       -2,34 : System.Memory.dasm (-0,00 % of base)
       -0,07 : System.Net.Primitives.dasm (-0,00 % of base)

6 total files with Perf Score differences (4 improved, 2 regressed), 265 unchanged.

Top method regressions (PerfScoreUnits):
        0,90 (5,59 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,90 (5,59 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,50 (3,25 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,50 (3,25 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,40 (0,86 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivitySpanId:.ctor(System.ReadOnlySpan`1[Byte]):this
        0,25 (0,49 % of base) : System.Net.Sockets.dasm - System.Net.Sockets.SocketPal:SetMulticastOption(System.Net.Sockets.SafeSocketHandle,int,System.Net.Sockets.MulticastOption):int

Top method improvements (PerfScoreUnits):
       -2,00 (-2,11 % of base) : System.Private.CoreLib.dasm - System.Guid:<TryParseExactD>g__TryCompatParsing|30_0(System.ReadOnlySpan`1[Char],byref):bool
       -1,80 (-1,28 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadHalf():System.Half:this
       -1,80 (-9,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryWriteHalfBigEndian(System.Span`1[Byte],System.Half):bool
       -1,80 (-10,37 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:WriteHalfBigEndian(System.Span`1[Byte],System.Half)
       -1,80 (-11,73 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-11,73 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-2,52 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborWriter:WriteHalf(System.Half):this
       -1,25 (-8,29 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,25 (-8,94 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,22 (-0,76 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadReverseEndianness(byref,byref):bool (3 methods)
       -1,12 (-1,01 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadBigEndian(byref,byref):bool (3 methods)
       -0,80 (-0,46 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadSingle():float:this
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,40 (-0,20 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadDouble():double:this
       -0,20 (-0,23 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivityTraceId:.ctor(System.ReadOnlySpan`1[Byte]):this
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,07 (-0,26 % of base) : System.Net.Primitives.dasm - System.Net.IPAddress:MapToIPv6():System.Net.IPAddress:this

Top method regressions (percentages):
        0,90 (5,59 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,90 (5,59 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,50 (3,25 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,50 (3,25 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,40 (0,86 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivitySpanId:.ctor(System.ReadOnlySpan`1[Byte]):this
        0,25 (0,49 % of base) : System.Net.Sockets.dasm - System.Net.Sockets.SocketPal:SetMulticastOption(System.Net.Sockets.SafeSocketHandle,int,System.Net.Sockets.MulticastOption):int

Top method improvements (percentages):
       -1,80 (-11,73 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-11,73 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-10,37 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:WriteHalfBigEndian(System.Span`1[Byte],System.Half)
       -1,80 (-9,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryWriteHalfBigEndian(System.Span`1[Byte],System.Half):bool
       -1,25 (-8,94 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,25 (-8,29 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,80 (-2,52 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborWriter:WriteHalf(System.Half):this
       -2,00 (-2,11 % of base) : System.Private.CoreLib.dasm - System.Guid:<TryParseExactD>g__TryCompatParsing|30_0(System.ReadOnlySpan`1[Char],byref):bool
       -1,80 (-1,28 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadHalf():System.Half:this
       -1,12 (-1,01 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadBigEndian(byref,byref):bool (3 methods)
       -1,22 (-0,76 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadReverseEndianness(byref,byref):bool (3 methods)
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,80 (-0,46 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadSingle():float:this
       -0,07 (-0,26 % of base) : System.Net.Primitives.dasm - System.Net.IPAddress:MapToIPv6():System.Net.IPAddress:this
       -0,20 (-0,23 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivityTraceId:.ctor(System.ReadOnlySpan`1[Byte]):this
       -0,40 (-0,20 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadDouble():double:this

25 total methods with Perf Score differences (19 improved, 6 regressed), 378877 unchanged.


Regression example
@@ -2,15 +2,14 @@ G_M10490_IG01:
        sub      rsp, 40
        vzeroupper
                                                ;; size=7 bbWeight=0    PerfScore 0.00
 G_M10490_IG02:
        mov      rax, bword ptr [rcx]
        mov      ecx, dword ptr [rcx+8]
        cmp      ecx, 8
        jl       SHORT G_M10490_IG04
-       mov      rcx, qword ptr [rax]
-       bswap    rcx
+       movbe    rcx, qword ptr [rax]
        vmovd    xmm0, rcx
-                                               ;; size=22 bbWeight=1    PerfScore 10.25
+                                               ;; size=21 bbWeight=1    PerfScore 11.25
 G_M10490_IG03:
        add      rsp, 40
        ret
@@ -15,10 +14,10 @@ G_M10490_IG03:
        add      rsp, 40
        ret
                                                ;; size=5 bbWeight=1    PerfScore 1.25
 G_M10490_IG04:
        mov      ecx, 41
        call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int)]
        int3
                                                ;; size=12 bbWeight=0    PerfScore 0.00

-; Total bytes of code 46, prolog size 7, PerfScore 16.10, instruction count 14, allocated bytes for code 46 (MethodHash=cb5ed705) for method System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
+; Total bytes of code 45, prolog size 7, PerfScore 17.00, instruction count 13, allocated bytes for code 45 (MethodHash=cb5ed705) for method System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double

Member

@jakobbotsch jakobbotsch left a comment

Looks good now, just a couple of small nits.

@jakobbotsch
Member

Regression

I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies.

@jakobbotsch
Member

cc @tannergooding, can you take a look just to confirm that the various ISA table entries look right?

@aromaa
Contributor Author

aromaa commented Apr 26, 2022

I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies.

I investigated it a bit and it looks like this is due to the logic in emitter::insEvaluateExecutionCost. It decreases the latency by one per instruction and then takes max(throughput, latency), which makes the value come out as 0.5 for the 16- and 32-bit bswap and 1 for the 64-bit bswap. So with identical perf scores and one fewer instruction, we ironically get a higher perf score estimate precisely because there is one fewer instruction.

Thank you for the reviews! Learned a lot and hopefully further optimization attempts go more smoothly :)
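
The effect described above can be reproduced with a small C++ sketch of the rule as stated in this comment: take one cycle off each instruction's latency, then take the maximum with its throughput. The struct `InsCost`, the function names and every latency and throughput number below are hypothetical placeholders chosen for illustration; they are not the values in the JIT's tables, and the real emitter::insEvaluateExecutionCost may differ in detail.

```cpp
#include <algorithm>

// Hypothetical per-instruction characteristics (NOT the JIT's actual tables).
struct InsCost {
    double throughput; // reciprocal throughput, in cycles
    double latency;    // latency, in cycles
};

// The rule as described in the comment: latency minus one, then max with throughput.
inline double estimate(InsCost c) {
    return std::max(c.throughput, c.latency - 1.0);
}

// Two instructions: a load (mov) followed by a 64-bit bswap.
inline double separate_cost() {
    const InsCost mov_load{0.5, 3.0}; // assumed values
    const InsCost bswap64{1.0, 2.0};  // assumed values
    return estimate(mov_load) + estimate(bswap64);
}

// One fused instruction whose raw latency equals the sum of the two above.
inline double fused_cost() {
    const InsCost movbe64{1.0, 5.0};  // assumed values: 3.0 + 2.0
    return estimate(movbe64);
}
```

With these placeholder numbers the separate sequence scores 2 + 1 = 3, while the fused instruction scores 4: the one-cycle reduction is applied twice in the first case and only once in the second, so the estimate goes up by exactly one even though the raw latencies add up to the same total.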

@jakobbotsch
Member

I investigated it a bit and it looks like this is due to the logic in emitter::insEvaluateExecutionCost. It decreases the latency by one per instruction and then takes max(throughput, latency), which makes the value come out as 0.5 for the 16- and 32-bit bswap and 1 for the 64-bit bswap. So with identical perf scores and one fewer instruction, we ironically get a higher perf score estimate precisely because there is one fewer instruction.

Ah, that's quite unfortunate, but good to know. Probably something we ought to look into.

Thank you for the reviews! Learned a lot and hopefully further optimization attempts go more smoothly :)

Don't worry about it, so did I. And FWIW, I would not consider the process of this PR unsmooth. The JIT is complex and the optimization you made in this PR has to deal with a lot of the details, so it is understandable that there were a few corner cases that required some extra treatment.

@jakobbotsch
Member

ping @tannergooding for a review of the ISA related changes

@tannergooding
Member

@aromaa, could you please resolve the merge conflict? You should be able to just keep the guid already generated for this PR.

We should be able to merge this once that's in (provided CI is passing).

@aromaa
Contributor Author

aromaa commented May 6, 2022

Failure is #68690. Not sure why the Mono leg is failing, there doesn't seem to be much to go with?

@jakobbotsch jakobbotsch merged commit 24714ef into dotnet:main May 8, 2022
@jakobbotsch
Member

Failures looked unrelated. Thanks for the contribution!

@aromaa aromaa deleted the opts/movbe branch May 8, 2022 12:42
@pentp
Contributor

pentp commented May 9, 2022

I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies.

I investigated it a bit and it looks like this is due to the logic in emitter::insEvaluateExecutionCost. It decreases the latency by one per instruction and then takes max(throughput, latency), which makes the value come out as 0.5 for the 16- and 32-bit bswap and 1 for the 64-bit bswap. So with identical perf scores and one fewer instruction, we ironically get a higher perf score estimate precisely because there is one fewer instruction.

This is probably not a mistake - according to uops.info:
mov min. latency is 2 + 64-bit bswap latency is 2 on Skylake-X.
64-bit movbe min. latency is 4 on Skylake-X.
On AMD movbe has a latency of 6 while mov is 5 + bswap is 1.
So this optimization might in some cases actually be slower.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 8, 2022
Development

Successfully merging this pull request may close these issues.

JIT should emit "movbe" instead of "mov / bswap" on compatible hardware
9 participants