Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain loops do not get recorded in optLoopTable #43713

Closed
kunalspathak opened this issue Oct 22, 2020 · 14 comments
Closed

Certain loops do not get recorded in optLoopTable #43713

kunalspathak opened this issue Oct 22, 2020 · 14 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Oct 22, 2020

Today, we fail to capture certain loops inside optLoopTable. optLoopTable is a data structure that should ideally contain all the loops so we can make further loop optimizations like loop unrolling, loop cloning or hoist loop code.

Below is the code taken from QuickSortSpan benchmark.

for (i = lo, j = hi, pivot = data[hi]; i < j;)
{
    while (i < j && data[i] <= pivot)
    {
        ++i;
    }
    while (j > i && data[j] >= pivot)
    {
        --j;
    }
    if (i < j)
    {
        temp = data[i];
        data[i] = data[j];
        data[j] = temp;
    }
}

In the assembly code, the 2 inner while loops have this kind of representation and hence do not make it to optLoopTable.

Assembly code
; Assembly listing for method Span.Sorting:TestQuickSortSpan(System.Span`1[Int32])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments

; Lcl frame size = 56

G_M37982_IG01:              ;; offset=0000H
 00007ff8`a20ebd20        57                   push     rdi
 00007ff8`a20ebd21        56                   push     rsi
 00007ff8`a20ebd22        55                   push     rbp
 00007ff8`a20ebd23        53                   push     rbx
 00007ff8`a20ebd24        4883EC38             sub      rsp, 56
 00007ff8`a20ebd28        33C0                 xor      rax, rax
 00007ff8`a20ebd2a        4889442428           mov      qword ptr [rsp+28H], rax
 00007ff8`a20ebd2f        4889442430           mov      qword ptr [rsp+30H], rax
						;; bbWeight=1    PerfScore 6.50
G_M37982_IG02:              ;; offset=0014H
 00007ff8`a20ebd34        488B31               mov      rsi, bword ptr [rcx]
 00007ff8`a20ebd37        8B7908               mov      edi, dword ptr [rcx+8]
						;; bbWeight=1    PerfScore 4.00
G_M37982_IG03:              ;; offset=001AH
 00007ff8`a20ebd3a        83FF01               cmp      edi, 1
 00007ff8`a20ebd3d        7F09                 jg       SHORT G_M37982_IG05
						;; bbWeight=1    PerfScore 1.25
G_M37982_IG04:              ;; offset=001FH
 00007ff8`a20ebd3f        4883C438             add      rsp, 56
; =========================== 32B boundary ===========================
 00007ff8`a20ebd43        5B                   pop      rbx
 00007ff8`a20ebd44        5D                   pop      rbp
 00007ff8`a20ebd45        5E                   pop      rsi
 00007ff8`a20ebd46        5F                   pop      rdi
 00007ff8`a20ebd47        C3                   ret      
						;; bbWeight=0.50 PerfScore 1.63
G_M37982_IG05:              ;; offset=0028H
 00007ff8`a20ebd48        8D4FFF               lea      ecx, [rdi-1]
 00007ff8`a20ebd4b        33DB                 xor      ebx, ebx
 00007ff8`a20ebd4d        8BD1                 mov      edx, ecx
 00007ff8`a20ebd4f        3BD7                 cmp      edx, edi
 00007ff8`a20ebd51        0F8315010000         jae      G_M37982_IG23
 00007ff8`a20ebd57        4863C2               movsxd   rax, edx
 00007ff8`a20ebd5a        8B0486               mov      eax, dword ptr [rsi+4*rax]
 00007ff8`a20ebd5d        EB60                 jmp      SHORT G_M37982_IG12
						;; bbWeight=0.50 PerfScore 3.25
G_M37982_IG06:              ;; offset=003FH
 00007ff8`a20ebd5f        FFC3                 inc      ebx
; =========================== 32B boundary ===========================
						;; bbWeight=2    PerfScore 0.50
G_M37982_IG07:              ;; offset=0041H
 00007ff8`a20ebd61        3BDA                 cmp      ebx, edx
 00007ff8`a20ebd63        7D15                 jge      SHORT G_M37982_IG10
 00007ff8`a20ebd65        3BDF                 cmp      ebx, edi
 00007ff8`a20ebd67        0F83FF000000         jae      G_M37982_IG23
 00007ff8`a20ebd6d        4C63C3               movsxd   r8, ebx
 00007ff8`a20ebd70        42390486             cmp      dword ptr [rsi+4*r8], eax
 00007ff8`a20ebd74        7EE9                 jle      SHORT G_M37982_IG06
						;; bbWeight=16    PerfScore 92.00
G_M37982_IG08:              ;; offset=0056H
 00007ff8`a20ebd76        EB02                 jmp      SHORT G_M37982_IG10
						;; bbWeight=2    PerfScore 4.00
G_M37982_IG09:              ;; offset=0058H
 00007ff8`a20ebd78        FFCA                 dec      edx
						;; bbWeight=8    PerfScore 2.00
G_M37982_IG10:              ;; offset=005AH
 00007ff8`a20ebd7a        3BD3                 cmp      edx, ebx
 00007ff8`a20ebd7c        7E11                 jle      SHORT G_M37982_IG11
 00007ff8`a20ebd7e        3BD7                 cmp      edx, edi
; =========================== 32B boundary ===========================
 00007ff8`a20ebd80        0F83E6000000         jae      G_M37982_IG23
 00007ff8`a20ebd86        4C63C2               movsxd   r8, edx
 00007ff8`a20ebd89        42390486             cmp      dword ptr [rsi+4*r8], eax
 00007ff8`a20ebd8d        7DE9                 jge      SHORT G_M37982_IG09
						;; bbWeight=16    PerfScore 92.00
G_M37982_IG11:              ;; offset=006FH
 00007ff8`a20ebd8f        3BDA                 cmp      ebx, edx
 00007ff8`a20ebd91        7D2C                 jge      SHORT G_M37982_IG12
 00007ff8`a20ebd93        3BDF                 cmp      ebx, edi
 00007ff8`a20ebd95        0F83D1000000         jae      G_M37982_IG23
 00007ff8`a20ebd9b        4C63C3               movsxd   r8, ebx
 00007ff8`a20ebd9e        468B0486             mov      r8d, dword ptr [rsi+4*r8]
; =========================== 32B boundary ===========================
 00007ff8`a20ebda2        4C63CB               movsxd   r9, ebx
 00007ff8`a20ebda5        3BD7                 cmp      edx, edi
 00007ff8`a20ebda7        0F83BF000000         jae      G_M37982_IG23
 00007ff8`a20ebdad        4C63D2               movsxd   r10, edx
 00007ff8`a20ebdb0        468B1496             mov      r10d, dword ptr [rsi+4*r10]
 00007ff8`a20ebdb4        4689148E             mov      dword ptr [rsi+4*r9], r10d
 00007ff8`a20ebdb8        4C63CA               movsxd   r9, edx
 00007ff8`a20ebdbb        4689048E             mov      dword ptr [rsi+4*r9], r8d
						;; bbWeight=2    PerfScore 21.50
G_M37982_IG12:              ;; offset=009FH
 00007ff8`a20ebdbf        3BDA                 cmp      ebx, edx
; =========================== 32B boundary ===========================
 00007ff8`a20ebdc1        7C9E                 jl       SHORT G_M37982_IG07
						;; bbWeight=4    PerfScore 5.00
G_M37982_IG13:              ;; offset=00A3H
 00007ff8`a20ebdc3        3BD9                 cmp      ebx, ecx
 00007ff8`a20ebdc5        741C                 je       SHORT G_M37982_IG14
 00007ff8`a20ebdc7        3BDF                 cmp      ebx, edi
 00007ff8`a20ebdc9        0F839D000000         jae      G_M37982_IG23
 00007ff8`a20ebdcf        4C63C3               movsxd   r8, ebx
 00007ff8`a20ebdd2        468B0486             mov      r8d, dword ptr [rsi+4*r8]
 00007ff8`a20ebdd6        4863D3               movsxd   rdx, ebx
 00007ff8`a20ebdd9        890496               mov      dword ptr [rsi+4*rdx], eax
 00007ff8`a20ebddc        4863C9               movsxd   rcx, ecx
 00007ff8`a20ebddf        4489048E             mov      dword ptr [rsi+4*rcx], r8d
; =========================== 32B boundary ===========================
						;; bbWeight=0.50 PerfScore 3.63
G_M37982_IG14:              ;; offset=00C3H
 00007ff8`a20ebde3        8BCB                 mov      ecx, ebx
 00007ff8`a20ebde5        8BD7                 mov      edx, edi
 00007ff8`a20ebde7        483BCA               cmp      rcx, rdx
 00007ff8`a20ebdea        777A                 ja       SHORT G_M37982_IG22
						;; bbWeight=0.50 PerfScore 0.88
G_M37982_IG15:              ;; offset=00CCH
 00007ff8`a20ebdec        48B960300EB402020000 mov      rcx, 0x202B40E3060
 00007ff8`a20ebdf6        488B29               mov      rbp, gword ptr [rcx]
 00007ff8`a20ebdf9        488BD5               mov      rdx, rbp
 00007ff8`a20ebdfc        4889542420           mov      gword ptr [rsp+20H], rdx
; =========================== 32B boundary ===========================
 00007ff8`a20ebe01        488BCA               mov      rcx, rdx
 00007ff8`a20ebe04        85DB                 test     ebx, ebx
 00007ff8`a20ebe06        7D08                 jge      SHORT G_M37982_IG17
						;; bbWeight=0.50 PerfScore 2.50
G_M37982_IG16:              ;; offset=00E8H
 00007ff8`a20ebe08        488BD1               mov      rdx, rcx
 00007ff8`a20ebe0b        E8B8FE8DFF           call     System.Diagnostics.Debug:Fail(System.String,System.String)
						;; bbWeight=0.25 PerfScore 0.31
G_M37982_IG17:              ;; offset=00F0H
 00007ff8`a20ebe10        8BCB                 mov      ecx, ebx
 00007ff8`a20ebe12        488D442428           lea      rax, bword ptr [rsp+28H]
 00007ff8`a20ebe17        488930               mov      bword ptr [rax], rsi
 00007ff8`a20ebe1a        894808               mov      dword ptr [rax+8], ecx
 00007ff8`a20ebe1d        488D4C2428           lea      rcx, bword ptr [rsp+28H]
; =========================== 32B boundary ===========================
 00007ff8`a20ebe22        E889CA8FFF           call     Span.Sorting:TestQuickSortSpan(System.Span`1[Int32])
 00007ff8`a20ebe27        FFC3                 inc      ebx
 00007ff8`a20ebe29        3BDF                 cmp      ebx, edi
 00007ff8`a20ebe2b        7739                 ja       SHORT G_M37982_IG22
						;; bbWeight=0.50 PerfScore 2.88
G_M37982_IG18:              ;; offset=010DH
 00007ff8`a20ebe2d        2BFB                 sub      edi, ebx
 00007ff8`a20ebe2f        488B542420           mov      rdx, gword ptr [rsp+20H]
 00007ff8`a20ebe34        85FF                 test     edi, edi
 00007ff8`a20ebe36        7D08                 jge      SHORT G_M37982_IG20
						;; bbWeight=0.50 PerfScore 1.25
G_M37982_IG19:              ;; offset=0118H
 00007ff8`a20ebe38        488BCA               mov      rcx, rdx
 00007ff8`a20ebe3b        E888FE8DFF           call     System.Diagnostics.Debug:Fail(System.String,System.String)
; =========================== 32B boundary ===========================
						;; bbWeight=0.25 PerfScore 0.31
G_M37982_IG20:              ;; offset=0120H
 00007ff8`a20ebe40        4863CB               movsxd   rcx, ebx
 00007ff8`a20ebe43        488D0C8E             lea      rcx, bword ptr [rsi+4*rcx]
 00007ff8`a20ebe47        488D442428           lea      rax, bword ptr [rsp+28H]
 00007ff8`a20ebe4c        488908               mov      bword ptr [rax], rcx
 00007ff8`a20ebe4f        897808               mov      dword ptr [rax+8], edi
 00007ff8`a20ebe52        488D4C2428           lea      rcx, bword ptr [rsp+28H]
 00007ff8`a20ebe57        E854CA8FFF           call     Span.Sorting:TestQuickSortSpan(System.Span`1[Int32])
 00007ff8`a20ebe5c        90                   nop      
						;; bbWeight=0.50 PerfScore 2.50
G_M37982_IG21:              ;; offset=013DH
 00007ff8`a20ebe5d        4883C438             add      rsp, 56
; =========================== 32B boundary ===========================
 00007ff8`a20ebe61        5B                   pop      rbx
 00007ff8`a20ebe62        5D                   pop      rbp
 00007ff8`a20ebe63        5E                   pop      rsi
 00007ff8`a20ebe64        5F                   pop      rdi
 00007ff8`a20ebe65        C3                   ret      
						;; bbWeight=0.50 PerfScore 1.63
G_M37982_IG22:              ;; offset=0146H
 00007ff8`a20ebe66        E83DEB8EFF           call     System.ThrowHelper:ThrowArgumentOutOfRangeException()
 00007ff8`a20ebe6b        CC                   int3     
						;; bbWeight=0    PerfScore 0.00
G_M37982_IG23:              ;; offset=014CH
 00007ff8`a20ebe6c        E8AF87065F           call     CORINFO_HELP_RNGCHKFAIL
 00007ff8`a20ebe71        CC                   int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 338, prolog size 20, PerfScore 283.30, instruction count 115 (MethodHash=bb696ba1) for method Span.Sorting:TestQuickSortSpan(System.Span`1[Int32])
; ============================================================

Here, none of the 3 loops (outer for and 2 inner while) loops are present in optLoopTable. On investigation, @BruceForstall and I found out that it happens because the entry to the loop is below the bottom of the loop and from the entry it jumps up to the top block, however the bottom's backedge jumps to the first block of the loop.

Here is the pictorial representation. Ideally, we favor below loops. The diagram is taken from optimizer.cpp.

        |
        v
      head
        |
        |  top/first <--+
        |       |       |
        |      ...      |
        |       |       |
        |       v       |
        +---> entry     |
                |       |
               ...      |
                |       |
                v       |
         +-- exit/tail  |
         |      |       |
         |     ...      |
         |      |       |
         |      v       |
         |    bottom ---+
         |
         +------+
                |
                v

However, in this case, the loop is little different and hence we reject it in FindNaturalLoops() because the code FindEntry returns nullptr and hence we don't recognize it as a loop.

        |
        v
      head
        |
        |   +--> first
        |   |      |
        |   |      |
        |   |      v
        |   |     top <---+
        |   |     ...     |
        |   |      |      |
        |   |      v      |
        |   +--- bottom   |
        |          |      |
        |         ...     |
        |          |      |
        |          v      |
        +------> entry ---+
        |
        +------+
                |
                v

We should fix it to identify such loops or at least make an entry to optLoopTable to take advantage of other loop optimizations.

This should be one of the part of #43549.

category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:medium

@kunalspathak kunalspathak added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 22, 2020
@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib

@BruceForstall
Copy link
Member

There are two sibling inner loops.

The first is not recognized because the immediately preceding HEAD block doesn't fall or branch into the loop.

The second is not recognized for a different reason. The HEAD block, which points to ENTRY, must be the ONLY outside-loop predecessor of ENTRY (there is some exception for HEAD dominating another predecessor). The second loop ENTRY has two outside-loop predecessors, corresponding to the two failure cases of the preceding loop condition. The loop conditions are interesting because both the inner loops have multiple-clause loop conditions that necessitate multiple blocks, each branching directly to the entry of the next loop.

@BruceForstall BruceForstall added this to the 6.0.0 milestone Oct 22, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Oct 22, 2020
@briansull
Copy link
Contributor

As in Andy's discussion we want loops to be transformed into a do { } while (expr) form so that we can have a loop pre-header.

@BruceForstall BruceForstall added JitUntriaged CLR JIT issues needing additional triage and removed JitUntriaged CLR JIT issues needing additional triage labels Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 7, 2020
@kunalspathak
Copy link
Member Author

I have also noticed that we do not make an entry of any cloned loops inside optLoopTable. There may be some known reason for that, but just wanted to highlight.

@kunalspathak
Copy link
Member Author

I have also noticed that we do not make an entry of any cloned loops inside optLoopTable. There may be some known reason for that, but just wanted to highlight.

I hit this problem again when investigating a regression in one of the benchmark because of loop alignment. The problem was in the generated code for IndexOf() method shown below:

internal override int IndexOf(T[] array, T value, int startIndex, int count)
{
int endIndex = startIndex + count;
if (value == null)
{
for (int i = startIndex; i < endIndex; i++)
{
if (array[i] == null) return i;
}
}
else
{
for (int i = startIndex; i < endIndex; i++)
{
if (array[i] != null && array[i].Equals(value)) return i;
}
}
return -1;
}

Here, we clone both the loops shown above. While cloning, we also copy the fact that these loops need alignment. However none of the cloned loop is recorded in optLoopTable. Further, when we try to check if the loop has a call or not, we just go over the loops in optLoopTable. We see that the 2nd loop (the one inside else condition) has a call and we remove the BBF_LOOP_ALIGN flag from that loop, which means, that we will no longer align that loop. However, we fail to update the same fact for cloned loop because we never check that loop for calls. Because of that, we end up aligning the cloned loop present in else condition. As a side-effect of the way loop alignment feature work, we add NOP compensation for over-estimated instruction inside the original loop code because we are about to add padding to align the cloned loop.

In below assembly code, G_M30624_IG05 and G_M30624_IG08 (cloned) represents the loop inside if condition, while G_M30624_IG14 and G_M30624_IG21 (cloned) represents the loop inside else condition. Since we add padding to align G_M30624_IG21, we add NOP compensation inside loop G_M30624_IG14. In many benchmarks, we execute the loop G_M30624_IG14 which could degrade the performance because of presence of extra NOP compensation which could have been eliminated had we knew that G_M30624_IG21 shouldn't be aligned because of presence of calls in the loop.

Assembly code of IndexOf()
; Assembly listing for method System.Collections.Generic.GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf(System.__Canon[],System.__Canon,int,int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T08] (  4,  6   )     ref  ->  rbx         this class-hnd
;  V01 arg1         [V01,T02] ( 15, 34   )     ref  ->  rsi         class-hnd
;  V02 arg2         [V02,T07] (  5,  7   )     ref  ->  rdi         class-hnd
;  V03 arg3         [V03,T09] (  7,  5   )     int  ->   r9        
;  V04 arg4         [V04,T12] (  1,  1   )     int  ->  [rsp+0x80]  
;  V05 loc0         [V05,T04] ( 11, 20   )     int  ->  rbp        
;  V06 loc1         [V06,T01] ( 12, 37.50)     int  ->  rax        
;  V07 loc2         [V07,T00] ( 15, 43.50)     int  ->  r14        
;  V08 OutArgs      [V08    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V09 tmp1         [V09,T05] (  6, 20   )    long  ->  rcx         "impRuntimeLookup slot"
;  V10 tmp2         [V10,T06] (  4, 16   )   byref  ->  r15         "impAppendStmt"
;  V11 tmp3         [V11,T03] (  8, 24   )    long  ->  r11         "spilling Runtime Lookup tree"
;* V12 tmp4         [V12    ] (  0,  0   )    long  ->  zero-ref    "VirtualCall with runtime lookup"
;  V13 cse0         [V13,T10] (  3,  5   )    long  ->  r11         "CSE - aggressive"
;  V14 cse1         [V14,T11] (  3,  5   )    long  ->  r11         "CSE - aggressive"
;
; Lcl frame size = 40

G_M30624_IG01:              ;; offset=0000H
       4157                 push     r15
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       48894C2420           mov      qword ptr [rsp+20H], rcx
       488BD9               mov      rbx, rcx
       488BF2               mov      rsi, rdx
       498BF8               mov      rdi, r8
						;; bbWeight=1    PerfScore 8.00
G_M30624_IG02:              ;; offset=001AH
       418BE9               mov      ebp, r9d
       03AC2480000000       add      ebp, dword ptr [rsp+80H]
       4885FF               test     rdi, rdi
       0F857F000000         jne      G_M30624_IG12
						;; bbWeight=1    PerfScore 2.50
G_M30624_IG03:              ;; offset=002DH
       418BC1               mov      eax, r9d
       443BCD               cmp      r9d, ebp
       0F8D5D010000         jge      G_M30624_IG27
       4885F6               test     rsi, rsi
       7442                 je       SHORT G_M30624_IG08
						;; bbWeight=0.50 PerfScore 1.38
G_M30624_IG04:              ;; offset=003EH
       448BC8               mov      r9d, eax
       41F7D1               not      r9d
       41C1E91F             shr      r9d, 31
       8BCD                 mov      ecx, ebp
       F7D1                 not      ecx
       C1E91F               shr      ecx, 31
       4123C9               and      ecx, r9d
       396E08               cmp      dword ptr [rsi+8], ebp
       0F9DC2               setge    dl
       0FB6D2               movzx    rdx, dl
       85CA                 test     ecx, edx
       7421                 je       SHORT G_M30624_IG08
       90                   align    
						;; bbWeight=0.50 PerfScore 5.00
G_M30624_IG05:              ;; offset=0060H
       4863C8               movsxd   rcx, eax
       48837CCE1000         cmp      gword ptr [rsi+8*rcx+16], 0
       7434                 je       SHORT G_M30624_IG11
						;; bbWeight=4    PerfScore 13.00
G_M30624_IG06:              ;; offset=006BH
       FFC0                 inc      eax
       3BC5                 cmp      eax, ebp
       7CEF                 jl       SHORT G_M30624_IG05
						;; bbWeight=4    PerfScore 6.00
G_M30624_IG07:              ;; offset=0071H
       E920010000           jmp      G_M30624_IG27
       66660F1F840000000000 align    
						;; bbWeight=0.50 PerfScore 1.12
G_M30624_IG08:              ;; offset=0080H
       3B4608               cmp      eax, dword ptr [rsi+8]
       0F832F010000         jae      G_M30624_IG31
       4863C8               movsxd   rcx, eax
       48837CCE1000         cmp      gword ptr [rsi+8*rcx+16], 0
       740B                 je       SHORT G_M30624_IG11
						;; bbWeight=4    PerfScore 25.00
G_M30624_IG09:              ;; offset=0094H
       FFC0                 inc      eax
       3BC5                 cmp      eax, ebp
       7CE6                 jl       SHORT G_M30624_IG08
						;; bbWeight=4    PerfScore 6.00
G_M30624_IG10:              ;; offset=009AH
       E9F7000000           jmp      G_M30624_IG27
						;; bbWeight=0.50 PerfScore 1.00
G_M30624_IG11:              ;; offset=009FH
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      
						;; bbWeight=0.50 PerfScore 2.12
G_M30624_IG12:              ;; offset=00ACH
       458BF1               mov      r14d, r9d
       443BCD               cmp      r9d, ebp
       0F8DDE000000         jge      G_M30624_IG27
       4885F6               test     rsi, rsi
       0F847F000000         je       G_M30624_IG21
						;; bbWeight=0.50 PerfScore 1.38
G_M30624_IG13:              ;; offset=00C1H
       418BCE               mov      ecx, r14d
       F7D1                 not      ecx
       C1E91F               shr      ecx, 31
       8BD5                 mov      edx, ebp
       F7D2                 not      edx
       C1EA1F               shr      edx, 31
       23CA                 and      ecx, edx
       396E08               cmp      dword ptr [rsi+8], ebp
       0F9DC2               setge    dl
       0FB6D2               movzx    rdx, dl
       85CA                 test     ecx, edx
       7461                 je       SHORT G_M30624_IG21
						;; bbWeight=0.50 PerfScore 4.88
G_M30624_IG14:              ;; offset=00DFH
       4963CE               movsxd   rcx, r14d
       48837CCE1000         cmp      gword ptr [rsi+8*rcx+16], 0
       744B                 je       SHORT G_M30624_IG19
						;; bbWeight=4    PerfScore 13.00
G_M30624_IG15:              ;; offset=00EAH
       443B7608             cmp      r14d, dword ptr [rsi+8]
       0F83C4000000         jae      G_M30624_IG31
       4963CE               movsxd   rcx, r14d
       4C8D7CCE10           lea      r15, bword ptr [rsi+8*rcx+16]
       488B0B               mov      rcx, qword ptr [rbx]
       488B5138             mov      rdx, qword ptr [rcx+56]
       488B5208             mov      rdx, qword ptr [rdx+8]
       4C8B5A18             mov      r11, qword ptr [rdx+24]
       4D85DB               test     r11, r11
       7402                 je       SHORT G_M30624_IG17
						;; bbWeight=2    PerfScore 27.00
G_M30624_IG16:              ;; offset=0110H
       EB12                 jmp      SHORT G_M30624_IG18
						;; bbWeight=1    PerfScore 2.00
G_M30624_IG17:              ;; offset=0112H
       48BAE8D2AD76FA7F0000 mov      rdx, 0x7FFA76ADD2E8
       E8BF6E065F           call     CORINFO_HELP_RUNTIMEHANDLE_CLASS
       4C8BD8               mov      r11, rax
						;; bbWeight=1    PerfScore 1.50
G_M30624_IG18:              ;; offset=0124H
       498B0F               mov      rcx, gword ptr [r15]
       488BD7               mov      rdx, rdi
       41FF13               call     qword ptr [r11]
       85C0                 test     eax, eax
       7577                 jne      SHORT G_M30624_IG29
		  ;; NOP compensation instructions of 4 bytes.
						;; bbWeight=2    PerfScore 13.00
G_M30624_IG19:              ;; offset=0135H
       41FFC6               inc      r14d
       443BF5               cmp      r14d, ebp
       7CA2                 jl       SHORT G_M30624_IG14
						;; bbWeight=4    PerfScore 6.00
G_M30624_IG20:              ;; offset=013DH
       EB57                 jmp      SHORT G_M30624_IG27
       90                   align    
						;; bbWeight=0.50 PerfScore 1.12
G_M30624_IG21:              ;; offset=0140H
       443B7608             cmp      r14d, dword ptr [rsi+8]
       7372                 jae      SHORT G_M30624_IG31
       4963CE               movsxd   rcx, r14d
       48837CCE1000         cmp      gword ptr [rsi+8*rcx+16], 0
       743D                 je       SHORT G_M30624_IG26
						;; bbWeight=4    PerfScore 25.00
G_M30624_IG22:              ;; offset=0151H
       4963CE               movsxd   rcx, r14d
       4C8D7CCE10           lea      r15, bword ptr [rsi+8*rcx+16]
       488B0B               mov      rcx, qword ptr [rbx]
       488B5138             mov      rdx, qword ptr [rcx+56]
       488B5208             mov      rdx, qword ptr [rdx+8]
       4C8B5A18             mov      r11, qword ptr [rdx+24]
       4D85DB               test     r11, r11
       7402                 je       SHORT G_M30624_IG24
						;; bbWeight=2    PerfScore 21.00
G_M30624_IG23:              ;; offset=016DH
       EB12                 jmp      SHORT G_M30624_IG25
						;; bbWeight=1    PerfScore 2.00
G_M30624_IG24:              ;; offset=016FH
       48BAE8D2AD76FA7F0000 mov      rdx, 0x7FFA76ADD2E8
       E8626E065F           call     CORINFO_HELP_RUNTIMEHANDLE_CLASS
       4C8BD8               mov      r11, rax
						;; bbWeight=1    PerfScore 1.50
G_M30624_IG25:              ;; offset=0181H
       498B0F               mov      rcx, gword ptr [r15]
       488BD7               mov      rdx, rdi
       41FF13               call     qword ptr [r11]
       85C0                 test     eax, eax
       751A                 jne      SHORT G_M30624_IG29
						;; bbWeight=2    PerfScore 13.00
G_M30624_IG26:              ;; offset=018EH
       41FFC6               inc      r14d
       443BF5               cmp      r14d, ebp
       7CAA                 jl       SHORT G_M30624_IG21
						;; bbWeight=4    PerfScore 6.00
G_M30624_IG27:              ;; offset=0196H
       B8FFFFFFFF           mov      eax, -1
						;; bbWeight=0.50 PerfScore 0.12
G_M30624_IG28:              ;; offset=019BH
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      
						;; bbWeight=0.50 PerfScore 2.12
G_M30624_IG29:              ;; offset=01A8H
       418BC6               mov      eax, r14d
						;; bbWeight=0.50 PerfScore 0.12
G_M30624_IG30:              ;; offset=01ABH
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      
						;; bbWeight=0.50 PerfScore 2.12
G_M30624_IG31:              ;; offset=01B8H
       E843D1065F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 446, prolog size 26, PerfScore 258.60, instruction count 147, allocated bytes for code 446 (MethodHash=c3fb885f) for method System.Collections.Generic.GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf(System.__Canon[],System.__Canon,int,int):int:this
; ============================================================

Impacted benchmarks:

  • System.Collections.ContainsTrue.List(Size: 512)
  • System.Collections.ContainsTrue.ICollection(Size: 512)
  • System.Collections.ContainsTrue.Queue(Size: 512)
  • System.Collections.ContainsTrue.ImmutableArray(Size: 512)
  • System.Collections.ContainsFalse.List(Size: 512)
  • System.Collections.ContainsFalse.ICollection(Size: 512)
  • System.Collections.ContainsFalse.Queue(Size: 512)
  • System.Collections.ContainsFalse.ImmutableArray(Size: 512)

image

image

image

Here are the list of methods that rely directly or indirectly on optLoopCount to perform tweaking to the loops recorded in optLoopTable. Loops that are not present in the table are left out from these updates:

  • optUnrollLoops
  • optHoistLoopCode
  • optComputeLoopSideEffects (that also marks which loop has a call)
  • fgInsertGCPolls
  • fgUpdateFlowGraph
  • fgOptimizeEmptyBlock
  • fgRemoveUnreachableBlocks
  • fgUpdateLoopsAfterCompacting

@kunalspathak
Copy link
Member Author

I just noticed this comment where we have a TODO to insert cloned loops in optLoopTable:

// TODO-Cleanup: The above clones the bbNatLoopNum, which is incorrect. Eventually, we should probably insert
// the cloned loop in the loop table. For now, however, we'll just make these blocks be part of the surrounding
// loop, if one exists -- the parent of the loop we're cloning.

@AndyAyersMS
Copy link
Member

The second loop ENTRY has two outside-loop predecessors, corresponding to the two failure cases of the preceding loop condition.

As part of loop normalization we should refactor flow so all loop headers have exactly one non-in-loop predecessor (for reducible loops, anyways). I expect that C# always gives us IL with reducible loops (except in methods that use goto).

@AndyAyersMS
Copy link
Member

Just for completeness' sake, the "standard" flow normalization around reducible loops are:

  • preheader: ensure there is a single non-critical non-in-loop edge reaching the loop header (implies: pred block of that edge has a unique successor, since we know the loop header has multiple predecessors. This allows hoisting.
  • landing pad: ensure the block targeted by loop exit edges has no non-loop predecessors. This allows sinking.
  • backedge (not as common): ensure there is just one loop backedge.
  • rotation: in multi-block loops, ensure that entry and exit blocks are distinct blocks (so that the exit block can appear lexically last, allows "bottom testing" and (with backedge normalization) a single backwards branch within the loop.

Today the jit selectively does rotation and preheader formation; as noted here, it would be interesting to try and broaden the set of loops we treat this way.

@BruceForstall
Copy link
Member

Here's another case of unrecognized loops: #58941. It appears that F# presents IL that doesn't match the form coming from C# that the JIT is optimized for.

@anthonycanino
Copy link
Contributor

From looking at the code a bit here, it looks like that in order to better generalize the loop normalization done in optInvertWhileLoop, it would help to know the general loops first. Right now the sequence looks like this (to my understanding)

optInvertWhileLoop ---> optFindLoops ---> optFindNaturalLoops ---> general loop logic

but if some of the logic were flipped, i.e.,

optFindLoops ---> general loop logic ---> optInvertWhileLoop ---> optFindNaturalLoops

we could more easily transform the loops in optInvertWhileLoop, which from what I understand right now really looks at a specific lexical match of the flow, which misses the above cases.

Am I following the high-level logic correctly?

@BruceForstall
Copy link
Member

The "general loop logic" only marks blocks, it doesn't rearrange block order, so I don't think hoisting that in the phase order will have any effect.

fyi, this code is a little easier to follow with #62560

@anthonycanino
Copy link
Contributor

Thanks @BruceForstall . When I wrote "general loop logic", I meant, "marking blocks that are loops", which I thought it might be easier to then apply more transforms once we know which blocks are loops. It seemed to me the loop identification logic in optInvertWhileLoop is limited compared with that in optFindLoops. But let me look at the code changes in 62560 and see if I better understand the code.

@jakobbotsch
Copy link
Member

I verified that is fixed with the new loop representation for the QuickSortSpan benchmark. This is the flowgraph.

@jakobbotsch
Copy link
Member

Fixed by PRs listed in #93144 (comment)

@github-actions github-actions bot locked and limited conversation to collaborators Feb 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Status: Done
Development

No branches or pull requests

7 participants