Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Straighten out flow during early jump threading #104603

Merged
merged 3 commits into from
Jul 15, 2024

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Jul 9, 2024

After early jump threading has kicked in we can frequently prove that a branch will go in one direction. This was previously left up to RBO; instead, try to fold it immediately and straighten out the flow before we build SSA, so that we can refine the phis.

Fix #101176

Diffs here are very mixed... Would have expected it to be generally a good thing to straighten out flow like this, but apparently not unequivocally so.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 9, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

After early jump threading had kicked in we can frequently prove that a
branch will go in one direction. This was previously left up to RBO;
instead, try to fold it immediately and straighten out the flow before
we build SSA, so that we can refine the phis.

Fix dotnet#101176
@jakobbotsch
Copy link
Member Author

jakobbotsch commented Jul 9, 2024

Perfscore regressions seem to just be from us recognizing more loops and applying the questionable "loop-scaling" logic to them. For example, here's a case with no diffs yet massive perfscore regressions:

Example
@@ -8,58 +8,58 @@
 ; 0 inlinees with PGO data; 22 single block inlinees; 8 inlinees without PGO data
 ; Final local variable assignments
 ;
-;* V00 arg0         [V00    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op single-def <Microsoft.CodeAnalysis.SyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
-;  V01 loc0         [V01,T13] (  4, 18   )   ubyte  ->  rbx        
-;* V02 loc1         [V02    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op <Microsoft.CodeAnalysis.SyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
-;  V03 loc2         [V03    ] (  7, 60   )  struct (32) [rsp+0x78]  do-not-enreg[XSF] must-init addr-exposed ld-addr-op <Microsoft.CodeAnalysis.SeparatedSyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]>
-;  V04 loc3         [V04    ] (  2,  4   )  struct (24) [rsp+0x60]  do-not-enreg[XS] must-init addr-exposed ld-addr-op <Microsoft.CodeAnalysis.SeparatedSyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]>
-;  V05 loc4         [V05,T02] (  6, 48   )     ref  ->  rbp         class-hnd exact <Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax>
-;  V06 loc5         [V06    ] (  2, 16   )  struct (24) [rsp+0x48]  do-not-enreg[XS] must-init addr-exposed ld-addr-op <Microsoft.CodeAnalysis.SyntaxToken>
-;  V07 OutArgs      [V07    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;* V08 tmp1         [V08    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp" <Microsoft.CodeAnalysis.SyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
-;* V09 tmp2         [V09    ] (  0,  0   )  struct ( 8) zero-ref    "Inlining Arg" <Microsoft.CodeAnalysis.SyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
-;  V10 tmp3         [V10,T25] (  4,  6   )     ref  ->  rdx         class-hnd "Inline return value spill temp" <Microsoft.CodeAnalysis.SyntaxNode>
-;* V11 tmp4         [V11    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "dup spill" <Microsoft.CodeAnalysis.SyntaxNode>
-;* V12 tmp5         [V12    ] (  0,  0   )     ref  ->  zero-ref   
-;* V13 tmp6         [V13,T31] (  0,  0   )     int  ->  zero-ref   
-;* V14 tmp7         [V14    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
-;  V15 tmp8         [V15,T21] (  2,  8   )     ref  ->  rcx         class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.GreenNode>
-;  V16 tmp9         [V16,T24] (  2,  8   )  struct (32) [rsp+0x28]  do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp" <Microsoft.CodeAnalysis.SeparatedSyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]>
-;* V17 tmp10        [V17    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inline return value spill temp" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
-;  V18 tmp11        [V18,T00] (  4, 64   )   byref  ->  r14         "Inlining Arg"
-;  V19 tmp12        [V19,T05] (  5, 40   )     ref  ->  r15         class-hnd "Inline stloc first use temp" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
-;  V20 tmp13        [V20,T07] (  4, 32   )     ref  ->  rax         class-hnd "Inline stloc first use temp" <Microsoft.CodeAnalysis.GreenNode>
-;* V21 tmp14        [V21    ] (  0,  0   )     ref  ->  zero-ref    ld-addr-op class-hnd "Inline ldloca(s) first use temp" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
-;  V22 tmp15        [V22,T09] (  2, 32   )     ref  ->  rdx         class-hnd "impAppendStmt" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
-;* V23 tmp16        [V23    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
-;  V24 tmp17        [V24,T01] (  4, 64   )     ref  ->  rcx         class-hnd "dup spill" <Microsoft.CodeAnalysis.GreenNode>
-;* V25 tmp18        [V25    ] (  0,  0   )     ref  ->  zero-ref   
-;  V26 tmp19        [V26,T11] (  3, 24   )     ref  ->  rcx        
-;  V27 tmp20        [V27,T08] (  4, 32   )     ref  ->  rcx        
-;* V28 tmp21        [V28,T19] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;  V29 tmp22        [V29,T06] (  3, 40   )     int  ->  rcx         "Inline stloc first use temp"
-;* V30 tmp23        [V30,T29] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
-;  V31 tmp24        [V31,T14] (  3, 18   )     int  ->  rdi         "Inline stloc first use temp"
-;  V32 tmp25        [V32,T17] (  4, 14   )     int  ->  rax         "Inline return value spill temp"
-;* V33 tmp26        [V33    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.SyntaxNode>
-;  V34 tmp27        [V34,T22] (  2,  8   )     ref  ->  rax         class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.GreenNode>
-;* V35 tmp28        [V35    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.SyntaxNode>
-;  V36 tmp29        [V36,T15] (  4, 16   )     ref  ->  rcx         class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.GreenNode>
-;  V37 tmp30        [V37,T20] (  4,  8   )     int  ->  rax         "Inline stloc first use temp"
-;  V38 tmp31        [V38,T28] (  2,  2   )     ref  ->  rcx         single-def "field V00._node (fldOffset=0x0)" P-INDEP
-;  V39 tmp32        [V39,T18] (  4, 13   )     int  ->  rdi         "field V02._index (fldOffset=0x0)" P-INDEP
-;  V40 tmp33        [V40,T12] (  8, 21   )     ref  ->  rsi         single-def "field V02._list (fldOffset=0x8)" P-INDEP
-;* V41 tmp34        [V41,T32] (  0,  0   )     int  ->  zero-ref    single-def "field V08._index (fldOffset=0x0)" P-INDEP
-;  V42 tmp35        [V42,T30] (  2,  2   )     ref  ->  rsi         single-def "field V08._list (fldOffset=0x8)" P-INDEP
-;* V43 tmp36        [V43    ] (  0,  0   )     ref  ->  zero-ref    "field V09._node (fldOffset=0x0)" P-INDEP
-;* V44 tmp37        [V44    ] (  0,  0   )     int  ->  zero-ref    "V16.[000..004)"
-;  V45 tmp38        [V45,T23] (  2,  8   )     ref  ->  rcx         "argument with side effect"
-;  V46 tmp39        [V46,T03] (  3, 48   )     ref  ->  rcx         "argument with side effect"
-;  V47 tmp40        [V47,T10] (  2, 32   )     ref  ->  rdx         "argument with side effect"
-;  V48 tmp41        [V48,T04] (  3, 48   )     ref  ->  rax         "argument with side effect"
-;  V49 cse0         [V49,T26] (  3,  6   )     ref  ->  rcx         "CSE #01: moderate"
-;  V50 rat0         [V50,T16] (  4, 14   )     ref  ->  rcx         "replacement local"
-;  V51 rat1         [V51,T27] (  3,  4   )    long  ->  rax         "CSE for expectedClsNode"
+;* V00 arg0         [V00    ] (  0,   0   )  struct ( 8) zero-ref    ld-addr-op single-def <Microsoft.CodeAnalysis.SyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
+;  V01 loc0         [V01,T08] (  4, 258   )   ubyte  ->  rbx        
+;* V02 loc1         [V02    ] (  0,   0   )  struct (16) zero-ref    ld-addr-op <Microsoft.CodeAnalysis.SyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
+;  V03 loc2         [V03    ] (  7,2576   )  struct (32) [rsp+0x78]  do-not-enreg[XSF] must-init addr-exposed ld-addr-op <Microsoft.CodeAnalysis.SeparatedSyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]>
+;  V04 loc3         [V04    ] (  2,  16   )  struct (24) [rsp+0x60]  do-not-enreg[XS] must-init addr-exposed ld-addr-op <Microsoft.CodeAnalysis.SeparatedSyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]>
+;  V05 loc4         [V05,T01] (  6,1632   )     ref  ->  rbp         class-hnd exact <Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax>
+;  V06 loc5         [V06    ] (  2, 256   )  struct (24) [rsp+0x48]  do-not-enreg[XS] must-init addr-exposed ld-addr-op <Microsoft.CodeAnalysis.SyntaxToken>
+;  V07 OutArgs      [V07    ] (  1,   1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
+;* V08 tmp1         [V08    ] (  0,   0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp" <Microsoft.CodeAnalysis.SyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
+;* V09 tmp2         [V09    ] (  0,   0   )  struct ( 8) zero-ref    "Inlining Arg" <Microsoft.CodeAnalysis.SyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]>
+;  V10 tmp3         [V10,T24] (  4,  24   )     ref  ->  rdx         class-hnd "Inline return value spill temp" <Microsoft.CodeAnalysis.SyntaxNode>
+;* V11 tmp4         [V11    ] (  0,   0   )     ref  ->  zero-ref    class-hnd "dup spill" <Microsoft.CodeAnalysis.SyntaxNode>
+;* V12 tmp5         [V12    ] (  0,   0   )     ref  ->  zero-ref   
+;* V13 tmp6         [V13    ] (  0,   0   )     int  ->  zero-ref   
+;* V14 tmp7         [V14    ] (  0,   0   )     int  ->  zero-ref    "Inlining Arg"
+;  V15 tmp8         [V15,T20] (  2,  32   )     ref  ->  rcx         class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.GreenNode>
+;  V16 tmp9         [V16,T23] (  2,  32   )  struct (32) [rsp+0x28]  do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp" <Microsoft.CodeAnalysis.SeparatedSyntaxList`1+Enumerator[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]>
+;* V17 tmp10        [V17    ] (  0,   0   )     ref  ->  zero-ref    class-hnd "Inline return value spill temp" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
+;  V18 tmp11        [V18,T00] (  4,2176   )   byref  ->  r14         "Inlining Arg"
+;  V19 tmp12        [V19,T03] (  5,1312   )     ref  ->  r15         class-hnd "Inline stloc first use temp" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
+;  V20 tmp13        [V20,T12] (  4, 128   )     ref  ->  rax         class-hnd "Inline stloc first use temp" <Microsoft.CodeAnalysis.GreenNode>
+;* V21 tmp14        [V21    ] (  0,   0   )     ref  ->  zero-ref    ld-addr-op class-hnd "Inline ldloca(s) first use temp" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
+;  V22 tmp15        [V22,T13] (  2, 128   )     ref  ->  rdx         class-hnd "impAppendStmt" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
+;* V23 tmp16        [V23    ] (  0,   0   )     ref  ->  zero-ref    class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.CSharp.Syntax.NameSyntax>
+;  V24 tmp17        [V24,T04] (  4,1024   )     ref  ->  rcx         class-hnd "dup spill" <Microsoft.CodeAnalysis.GreenNode>
+;* V25 tmp18        [V25    ] (  0,   0   )     ref  ->  zero-ref   
+;  V26 tmp19        [V26,T07] (  3, 384   )     ref  ->  rcx        
+;  V27 tmp20        [V27,T06] (  4, 512   )     ref  ->  rcx        
+;* V28 tmp21        [V28    ] (  0,   0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;  V29 tmp22        [V29,T02] (  3,1536   )     int  ->  rcx         "Inline stloc first use temp"
+;* V30 tmp23        [V30    ] (  0,   0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
+;  V31 tmp24        [V31,T10] (  3, 160   )     int  ->  rdi         "Inline stloc first use temp"
+;  V32 tmp25        [V32,T16] (  4,  88   )     int  ->  rax         "Inline return value spill temp"
+;* V33 tmp26        [V33    ] (  0,   0   )     ref  ->  zero-ref    class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.SyntaxNode>
+;  V34 tmp27        [V34,T21] (  2,  32   )     ref  ->  rax         class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.GreenNode>
+;* V35 tmp28        [V35    ] (  0,   0   )     ref  ->  zero-ref    class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.SyntaxNode>
+;  V36 tmp29        [V36,T17] (  4,  64   )     ref  ->  rcx         class-hnd "Inlining Arg" <Microsoft.CodeAnalysis.GreenNode>
+;  V37 tmp30        [V37,T19] (  4,  32   )     int  ->  rax         "Inline stloc first use temp"
+;  V38 tmp31        [V38,T27] (  2,   2   )     ref  ->  rcx         single-def "field V00._node (fldOffset=0x0)" P-INDEP
+;  V39 tmp32        [V39,T15] (  4, 105   )     int  ->  rdi         "field V02._index (fldOffset=0x0)" P-INDEP
+;  V40 tmp33        [V40,T11] (  8, 137   )     ref  ->  rsi         single-def "field V02._list (fldOffset=0x8)" P-INDEP
+;* V41 tmp34        [V41,T29] (  0,   0   )     int  ->  zero-ref    single-def "field V08._index (fldOffset=0x0)" P-INDEP
+;  V42 tmp35        [V42,T28] (  2,   2   )     ref  ->  rsi         single-def "field V08._list (fldOffset=0x8)" P-INDEP
+;* V43 tmp36        [V43    ] (  0,   0   )     ref  ->  zero-ref    "field V09._node (fldOffset=0x0)" P-INDEP
+;* V44 tmp37        [V44    ] (  0,   0   )     int  ->  zero-ref    "V16.[000..004)"
+;  V45 tmp38        [V45,T22] (  2,  32   )     ref  ->  rcx         "argument with side effect"
+;  V46 tmp39        [V46,T09] (  3, 192   )     ref  ->  rcx         "argument with side effect"
+;  V47 tmp40        [V47,T14] (  2, 128   )     ref  ->  rdx         "argument with side effect"
+;  V48 tmp41        [V48,T05] (  3, 768   )     ref  ->  rax         "argument with side effect"
+;  V49 cse0         [V49,T25] (  3,  24   )     ref  ->  rcx         "CSE #01: moderate"
+;  V50 rat0         [V50,T18] (  4,  56   )     ref  ->  rcx         "replacement local"
+;  V51 rat1         [V51,T26] (  3,  16   )    long  ->  rax         "CSE for expectedClsNode"
 ;
 ; Lcl frame size = 152
 
@@ -86,13 +86,13 @@ G_M20888_IG02:        ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byr
        ; gcrRegs +[rsi]
        mov      edi, -1
 						;; size=10 bbWeight=1 PerfScore 0.75
-G_M20888_IG03:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
+G_M20888_IG03:        ; bbWeight=64, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        ; gcrRegs -[rcx]
        inc      edi
        test     rsi, rsi
        je       SHORT G_M20888_IG05
-						;; size=7 bbWeight=8 PerfScore 12.00
-G_M20888_IG04:        ; bbWeight=2, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=7 bbWeight=64 PerfScore 96.00
+G_M20888_IG04:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        mov      rcx, gword ptr [rsi+0x18]
        ; gcrRegs +[rcx]
        mov      rax, rcx
@@ -102,13 +102,13 @@ G_M20888_IG04:        ; bbWeight=2, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byr
        mov      eax, 1
        ; gcrRegs -[rax]
        jmp      SHORT G_M20888_IG07
-						;; size=21 bbWeight=2 PerfScore 17.00
-G_M20888_IG05:        ; bbWeight=2, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=21 bbWeight=8 PerfScore 68.00
+G_M20888_IG05:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        ; gcrRegs -[rcx]
        xor      eax, eax
        jmp      SHORT G_M20888_IG07
-						;; size=4 bbWeight=2 PerfScore 4.50
-G_M20888_IG06:        ; bbWeight=2, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=4 bbWeight=8 PerfScore 18.00
+G_M20888_IG06:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref, isz
        ; gcrRegs +[rcx]
        movzx    rax, byte  ptr [rcx+0x0F]
        cmp      eax, 255
@@ -118,20 +118,20 @@ G_M20888_IG06:        ; bbWeight=2, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {},
        call     [rax+0x28]<unknown method>
        ; gcrRegs -[rcx]
        ; gcr arg pop 0
-						;; size=21 bbWeight=2 PerfScore 20.50
-G_M20888_IG07:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=21 bbWeight=8 PerfScore 82.00
+G_M20888_IG07:        ; bbWeight=64, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        cmp      edi, eax
        jge      SHORT G_M20888_IG10
-						;; size=4 bbWeight=8 PerfScore 10.00
-G_M20888_IG08:        ; bbWeight=2, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=4 bbWeight=64 PerfScore 80.00
+G_M20888_IG08:        ; bbWeight=32, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        test     rsi, rsi
        jne      SHORT G_M20888_IG12
-						;; size=5 bbWeight=2 PerfScore 2.50
-G_M20888_IG09:        ; bbWeight=2, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=5 bbWeight=32 PerfScore 40.00
+G_M20888_IG09:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        mov      rdx, rsi
        ; gcrRegs +[rdx]
        jmp      SHORT G_M20888_IG13
-						;; size=5 bbWeight=2 PerfScore 4.50
+						;; size=5 bbWeight=8 PerfScore 18.00
 G_M20888_IG10:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rdx rsi]
        mov      eax, ebx
@@ -147,7 +147,7 @@ G_M20888_IG11:        ; bbWeight=1, epilog, nogc, extend
        pop      r15
        ret      
 						;; size=19 bbWeight=1 PerfScore 5.25
-G_M20888_IG12:        ; bbWeight=2, gcVars=0000000000000000 {}, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, gcvars, byref, isz
+G_M20888_IG12:        ; bbWeight=8, gcVars=0000000000000000 {}, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, gcvars, byref, isz
        ; gcrRegs +[rsi]
        mov      rcx, gword ptr [rsi+0x18]
        ; gcrRegs +[rcx]
@@ -162,42 +162,42 @@ G_M20888_IG12:        ; bbWeight=2, gcVars=0000000000000000 {}, gcrefRegs=0040 {
        ; gcr arg pop 0
        mov      rdx, rax
        ; gcrRegs +[rdx]
-						;; size=29 bbWeight=2 PerfScore 27.50
-G_M20888_IG13:        ; bbWeight=2, gcrefRegs=0044 {rdx rsi}, byrefRegs=0000 {}, byref, isz
+						;; size=29 bbWeight=8 PerfScore 110.00
+G_M20888_IG13:        ; bbWeight=8, gcrefRegs=0044 {rdx rsi}, byrefRegs=0000 {}, byref, isz
        ; gcrRegs -[rax]
        mov      rcx, rdx
        ; gcrRegs +[rcx]
        test     rcx, rcx
        je       SHORT G_M20888_IG15
-						;; size=8 bbWeight=2 PerfScore 3.00
-G_M20888_IG14:        ; bbWeight=1, gcrefRegs=0046 {rcx rdx rsi}, byrefRegs=0000 {}, byref
+						;; size=8 bbWeight=8 PerfScore 12.00
+G_M20888_IG14:        ; bbWeight=4, gcrefRegs=0046 {rcx rdx rsi}, byrefRegs=0000 {}, byref
        mov      rax, 0xD1FFAB1E      ; Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax
        cmp      qword ptr [rcx], rax
        jne      G_M20888_IG25
-						;; size=19 bbWeight=1 PerfScore 4.25
-G_M20888_IG15:        ; bbWeight=2, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
+						;; size=19 bbWeight=4 PerfScore 17.00
+G_M20888_IG15:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rdx]
        lea      rdx, [rsp+0x60]
        cmp      dword ptr [rcx], ecx
        call     [Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax:get_Attributes():Microsoft.CodeAnalysis.SeparatedSyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeSyntax]:this]
        ; gcrRegs -[rcx]
        ; gcr arg pop 0
-						;; size=13 bbWeight=2 PerfScore 13.00
-G_M20888_IG16:        ; bbWeight=2, nogc, extend
+						;; size=13 bbWeight=8 PerfScore 52.00
+G_M20888_IG16:        ; bbWeight=8, nogc, extend
        vmovdqu  xmm0, xmmword ptr [rsp+0x60]
        vmovdqu  xmmword ptr [rsp+0x30], xmm0
        mov      rcx, qword ptr [rsp+0x70]
        mov      qword ptr [rsp+0x40], rcx
-						;; size=22 bbWeight=2 PerfScore 12.00
-G_M20888_IG17:        ; bbWeight=2, nogc, extend
+						;; size=22 bbWeight=8 PerfScore 48.00
+G_M20888_IG17:        ; bbWeight=8, nogc, extend
        vmovdqu  ymm0, ymmword ptr [rsp+0x28]
        vmovdqu  ymmword ptr [rsp+0x78], ymm0
-						;; size=12 bbWeight=2 PerfScore 10.00
-G_M20888_IG18:        ; bbWeight=2, isz, extend
+						;; size=12 bbWeight=8 PerfScore 40.00
+G_M20888_IG18:        ; bbWeight=8, isz, extend
        mov      dword ptr [rsp+0x78], -1
        jmp      SHORT G_M20888_IG20
-						;; size=10 bbWeight=2 PerfScore 6.00
-G_M20888_IG19:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
+						;; size=10 bbWeight=8 PerfScore 24.00
+G_M20888_IG19:        ; bbWeight=128, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
        ; gcrRegs +[rcx]
        mov      edx, 1
        call     [Microsoft.CodeAnalysis.CSharp.Symbols.QuickAttributeHelpers:GetQuickAttributes(System.String,ubyte):ubyte]
@@ -205,14 +205,12 @@ G_M20888_IG19:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {},
        ; gcr arg pop 0
        or       eax, ebx
        movzx    rbx, al
-						;; size=16 bbWeight=8 PerfScore 30.00
-G_M20888_IG20:        ; bbWeight=16, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref
+						;; size=16 bbWeight=128 PerfScore 480.00
+G_M20888_IG20:        ; bbWeight=512, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        mov      ecx, dword ptr [rsp+0x78]
        inc      ecx
        cmp      ecx, dword ptr [rsp+0x80]
        jge      G_M20888_IG03
-						;; size=19 bbWeight=16 PerfScore 68.00
-G_M20888_IG21:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byref, isz
        mov      dword ptr [rsp+0x78], ecx
        lea      rcx, [rsp+0x80]
        mov      r8d, dword ptr [rsp+0x78]
@@ -229,11 +227,13 @@ G_M20888_IG21:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byr
        ; gcrRegs +[r15]
        test     r15, r15
        jne      SHORT G_M20888_IG22
+						;; size=71 bbWeight=512 PerfScore 8704.00
+G_M20888_IG21:        ; bbWeight=32, gcrefRegs=8060 {rbp rsi r15}, byrefRegs=4000 {r14}, byref, isz
+       ; gcrRegs -[rax]
        mov      rcx, gword ptr [rbp+0x18]
        ; gcrRegs +[rcx]
        xor      edx, edx
        mov      rax, qword ptr [rcx]
-       ; gcrRegs -[rax]
        mov      rax, qword ptr [rax+0x50]
        call     [rax+0x20]<unknown method>
        ; gcrRegs -[rcx] +[rax]
@@ -269,8 +269,8 @@ G_M20888_IG21:        ; bbWeight=8, gcrefRegs=0040 {rsi}, byrefRegs=0000 {}, byr
        ; gcr arg pop 0
        mov      r15, gword ptr [r14]
        ; gcrRegs +[r15]
-						;; size=128 bbWeight=8 PerfScore 304.00
-G_M20888_IG22:        ; bbWeight=8, gcrefRegs=8040 {rsi r15}, byrefRegs=0000 {}, byref, isz
+						;; size=76 bbWeight=32 PerfScore 808.00
+G_M20888_IG22:        ; bbWeight=128, gcrefRegs=8040 {rsi r15}, byrefRegs=0000 {}, byref, isz
        ; byrRegs -[r14]
        mov      rcx, r15
        ; gcrRegs +[rcx]
@@ -294,8 +294,8 @@ G_M20888_IG22:        ; bbWeight=8, gcrefRegs=8040 {rsi r15}, byrefRegs=0000 {},
        jne      SHORT G_M20888_IG23
        xor      rcx, rcx
        jmp      SHORT G_M20888_IG24
-						;; size=44 bbWeight=8 PerfScore 156.00
-G_M20888_IG23:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
+						;; size=44 bbWeight=128 PerfScore 2496.00
+G_M20888_IG23:        ; bbWeight=128, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
        mov      rax, qword ptr [rcx]
        mov      rax, qword ptr [rax+0x60]
        call     [rax+0x30]<unknown method>
@@ -303,14 +303,14 @@ G_M20888_IG23:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {},
        ; gcr arg pop 0
        mov      rcx, rax
        ; gcrRegs +[rcx]
-						;; size=13 bbWeight=8 PerfScore 58.00
-G_M20888_IG24:        ; bbWeight=8, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
+						;; size=13 bbWeight=128 PerfScore 928.00
+G_M20888_IG24:        ; bbWeight=128, gcrefRegs=0042 {rcx rsi}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rax]
        test     rcx, rcx
        jne      G_M20888_IG19
        mov      rcx, 0xD1FFAB1E
        jmp      G_M20888_IG19
-						;; size=24 bbWeight=8 PerfScore 28.00
+						;; size=24 bbWeight=128 PerfScore 448.00
 G_M20888_IG25:        ; bbWeight=0, gcrefRegs=0004 {rdx}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rcx rsi] +[rdx]
        mov      rcx, rax
@@ -320,7 +320,7 @@ G_M20888_IG25:        ; bbWeight=0, gcrefRegs=0004 {rdx}, byrefRegs=0000 {}, byr
        int3     
 						;; size=9 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 516, prolog size 52, PerfScore 811.83, instruction count 141, allocated bytes for code 516 (MethodHash=453aae67) for method Microsoft.CodeAnalysis.CSharp.DeclarationTreeBuilder:GetQuickAttributes(Microsoft.CodeAnalysis.SyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]):ubyte (FullOpts)
+; Total bytes of code 516, prolog size 52, PerfScore 14590.08, instruction count 141, allocated bytes for code 516 (MethodHash=453aae67) for method Microsoft.CodeAnalysis.CSharp.DeclarationTreeBuilder:GetQuickAttributes(Microsoft.CodeAnalysis.SyntaxList`1[Microsoft.CodeAnalysis.CSharp.Syntax.AttributeListSyntax]):ubyte (FullOpts)

Diff in the jitdump looks like

@@ -1,8 +1,8 @@
 *************** In optMarkLoopHeads()
-2 loop heads marked
+4 loop heads marked
 *************** In optFindAndScaleGeneralLoopBlocks()
 
-Marking a loop from BB02 to BB30
+Marking a loop from BB02 to BB26
     BB02(wt=400)
     BB03(wt=400)
     BB04(wt=400)
@@ -19,20 +19,21 @@ Marking a loop from BB02 to BB30
     BB15(wt=400)
     BB16(wt=400)
     BB17(wt=400)
-    BB18(wt=400)
+    BB18(wt=800)
     BB19(wt=400)
     BB20(wt=400)
-    BB21(wt=800)
+    BB21(wt=400)
     BB22(wt=400)
     BB23(wt=400)
     BB24(wt=400)
-    BB25(wt=400)
-    BB26(wt=400)
-    BB27(wt=400)
-    BB28(wt=800)
-    BB29(wt=800)
-    BB30(wt=800)
-Marking a loop from BB08 to BB19
+    BB25(wt=800)
+    BB26(wt=800)
+Marking a loop from BB03 to BB26
+    BB03(wt=1600)
+    BB04(wt=1600)
+    BB05(wt=1600)
+    BB06(wt=1600)
+    BB07(wt=1600)
     BB08(wt=1600)
     BB09(wt=1600)
     BB10(wt=1600)
@@ -42,7 +43,35 @@ Marking a loop from BB08 to BB19
     BB14(wt=1600)
     BB15(wt=1600)
     BB16(wt=1600)
-    BB17(wt=3200)
-    BB18(wt=3200)
-    BB19(wt=3200)
-Found a total of 2 general loops.
+    BB17(wt=1600)
+    BB18(wt=6400)
+    BB19(wt=1600)
+    BB20(wt=1600)
+    BB21(wt=1600)
+    BB22(wt=1600)
+    BB23(wt=1600)
+    BB24(wt=1600)
+    BB25(wt=6400)
+    BB26(wt=6400)
+Marking a loop from BB07 to BB16
+    BB07(wt=6400)
+    BB08(wt=6400)
+    BB09(wt=6400)
+    BB10(wt=6400)
+    BB11(wt=6400)
+    BB12(wt=6400)
+    BB13(wt=6400)
+    BB14(wt=6400)
+    BB15(wt=12800)
+    BB16(wt=12800)
+Marking a loop from BB09 to BB16
+    BB09(wt=25600)
+    BB10(wt=25600)
+    BB11(wt=25600)
+    BB12(wt=25600)
+    BB13(wt=25600)
+    BB14(wt=25600)
+    BB15(wt=102400)
+    BB16(wt=102400)
+Found a total of 4 general loops.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Jul 9, 2024

Looking at some regressions..

libraries_tests.run.windows.x64.Release.mch System.Xml.Schema.Preprocessor:GetParentSchema(System.Xml.Schema.XmlSchemaObject):System.Xml.Schema.XmlSchema (Tier1)
@@ -9,58 +9,65 @@
 ; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T00] (  7,  6   )     ref  ->  rbx         class-hnd <System.Xml.Schema.XmlSchemaObject>
-;  V01 loc0         [V01,T02] (  4,  3.50)     ref  ->  rax         class-hnd <System.Xml.Schema.XmlSchema>
+;  V00 arg0         [V00,T00] (  8,  7.37)     ref  ->  rbx         class-hnd <System.Xml.Schema.XmlSchemaObject>
+;  V01 loc0         [V01,T02] (  4,  4   )     ref  ->  rax         class-hnd <System.Xml.Schema.XmlSchema>
 ;  V02 OutArgs      [V02    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V03 rat0         [V03,T01] (  5,  7   )     ref  ->  rax         "replacement local"
 ;  V04 rat1         [V04,T03] (  3,  2   )    long  ->  rcx         "CSE for expectedClsNode"
 ;
 ; Lcl frame size = 32
 
-G_M65415_IG01:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+G_M65415_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rbx
        sub      rsp, 32
        mov      rbx, rcx
        ; gcrRegs +[rbx]
-						;; size=8 bbWeight=0.50 PerfScore 0.75
-G_M65415_IG02:        ; bbWeight=0.50, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
+						;; size=8 bbWeight=1 PerfScore 1.50
+G_M65415_IG02:        ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
        xor      rax, rax
        ; gcrRegs +[rax]
-						;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M65415_IG03:        ; bbWeight=1, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
        test     rbx, rbx
-       je       SHORT G_M65415_IG06
-       mov      rbx, gword ptr [rbx+0x18]
-       mov      rax, rbx
-       test     rax, rax
-       je       SHORT G_M65415_IG05
-						;; size=17 bbWeight=1 PerfScore 4.75
-G_M65415_IG04:        ; bbWeight=0.50, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
-       mov      rcx, 0xD1FFAB1E      ; System.Xml.Schema.XmlSchema
-       cmp      qword ptr [rax], rcx
-       jne      SHORT G_M65415_IG07
-						;; size=15 bbWeight=0.50 PerfScore 2.12
-G_M65415_IG05:        ; bbWeight=1, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
-       test     rax, rax
-       je       SHORT G_M65415_IG03
-						;; size=5 bbWeight=1 PerfScore 1.25
-G_M65415_IG06:        ; bbWeight=1, gcrefRegs=0001 {rax}, byrefRegs=0000 {}, byref, epilog, nogc
+       jne      SHORT G_M65415_IG04
+						;; size=7 bbWeight=1 PerfScore 1.50
+G_M65415_IG03:        ; bbWeight=1, gcrefRegs=0001 {rax}, byrefRegs=0000 {}, byref, epilog, nogc
        ; gcrRegs -[rbx]
        add      rsp, 32
        pop      rbx
        ret      
 						;; size=6 bbWeight=1 PerfScore 1.75
-G_M65415_IG07:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref, isz
+G_M65415_IG04:        ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref, isz
        ; gcrRegs -[rax] +[rbx]
+       mov      rbx, gword ptr [rbx+0x18]
+       mov      rax, rbx
+       ; gcrRegs +[rax]
+       test     rax, rax
+       je       SHORT G_M65415_IG06
+						;; size=12 bbWeight=1 PerfScore 3.50
+G_M65415_IG05:        ; bbWeight=0.50, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
+       mov      rcx, 0xD1FFAB1E      ; System.Xml.Schema.XmlSchema
+       cmp      qword ptr [rax], rcx
+       jne      SHORT G_M65415_IG08
+						;; size=15 bbWeight=0.50 PerfScore 2.12
+G_M65415_IG06:        ; bbWeight=1, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
+       test     rax, rax
+       jne      SHORT G_M65415_IG03
+						;; size=5 bbWeight=1 PerfScore 1.25
+G_M65415_IG07:        ; bbWeight=1.37, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
+       test     rbx, rbx
+       jne      SHORT G_M65415_IG04
+       jmp      SHORT G_M65415_IG03
+						;; size=7 bbWeight=1.37 PerfScore 4.45
+G_M65415_IG08:        ; bbWeight=0, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+       ; gcrRegs -[rax]
        mov      rdx, rbx
        ; gcrRegs +[rdx]
        call     CORINFO_HELP_ISINSTANCEOFCLASS
        ; gcrRegs -[rdx] +[rax]
        ; gcr arg pop 0
-       jmp      SHORT G_M65415_IG05
+       jmp      SHORT G_M65415_IG06
 						;; size=10 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 63, prolog size 8, PerfScore 10.75, instruction count 21, allocated bytes for code 63 (MethodHash=360f0078) for method System.Xml.Schema.Preprocessor:GetParentSchema(System.Xml.Schema.XmlSchemaObject):System.Xml.Schema.XmlSchema (Tier1)
+; Total bytes of code 70, prolog size 8, PerfScore 16.08, instruction count 24, allocated bytes for code 70 (MethodHash=360f0078) for method System.Xml.Schema.Preprocessor:GetParentSchema(System.Xml.Schema.XmlSchemaObject):System.Xml.Schema.XmlSchema (Tier1)

Looks like we straighten out the flow, and then immediately some other "duplicate tail condition" logic kicks in:

@@ -1,119 +1,178 @@
 *************** In fgUpdateFlowGraph()
 Before updating the flow graph:
 
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 BB01 [0000]  1                             1     73 [000..004)-> BB03(1)                 (always)                     i IBC
 BB02 [0001]  1       BB04                  1     73 [004..013)-> BB03(1)                 (always)                     i IBC hascall bwd bwd-target
 BB03 [0002]  2       BB01,BB02             2    146 [013..016)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC bwd
 BB04 [0003]  1       BB03                  1     73 [016..019)-> BB02(1),BB05(0)         ( cond )                     i IBC bwd bwd-src
 BB05 [0004]  2       BB03,BB04             1     73 [019..01B)                           (return)                     i IBC
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
 Considering uncond to cond BB01 -> BB03
 setting likelihood of BB01 -> BB05 from 1 to 0.5
 setting likelihood of BB01 -> BB04 to 0.5
 fgOptimizeUncondBranchToSimpleCond(from BB01 to cond BB03), modified BB01
    expecting opts to key off V01 in BB01
 Decreased BB03 profile weight from 146 to 73
-Considering uncond to cond BB02 -> BB03
+Forward substituting local after jump threading. Before:
+STMT00007 ( ??? ... ??? )
+               [000025] -----------                         ▌  JTRUE     void  
+               [000026] J------N---                         └──▌  NE        int   
+               [000027] -----------                            ├──▌  LCL_VAR   ref    V01 loc0         
+               [000028] -----------                            └──▌  CNS_INT   ref    null
+
+After:
+STMT00007 ( ??? ... ??? )
+               [000025] -----------                         ▌  JTRUE     void  
+               [000026] J------N---                         └──▌  NE        int   
+               [000029] -----------                            ├──▌  CNS_INT   ref    null
+               [000028] -----------                            └──▌  CNS_INT   ref    null
+
+Now trying to fold...
+
+Folding operator with constant nodes into a constant:
+               [000026] J------N---                         ▌  NE        int   
+               [000029] -----------                         ├──▌  CNS_INT   ref    null
+               [000028] -----------                         └──▌  CNS_INT   ref    null
+Bashed to int constant:
+               [000026] -----------                         ▌  CNS_INT   int    0
+STMT00007 ( ??? ... ??? )
+               [000025] -----------                         ▌  JTRUE     void  
+               [000026] -----------                         └──▌  CNS_INT   int    0
+
+removing useless STMT00007 ( ??? ... ??? )
+               [000025] -----------                         ▌  JTRUE     void  
+               [000026] -----------                         └──▌  CNS_INT   int    0
+ from BB01
+setting likelihood of BB01 -> BB04 from 0.5 to 1
+
+Conditional folded at BB01
+BB01 becomes a BBJ_ALWAYS to BB04
+Trying to compact last pred BB02 of BB03 that we now bypass
 
 Compacting BB03 into BB02:
 *************** In fgDebugCheckBBlist
+Considering uncond to cond BB01 -> BB04
 
 After updating the flow graph:
 
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
-BB01 [0000]  1                             1     73 [000..004)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC
+BB01 [0000]  1                             1     73 [000..004)-> BB04(1)                 (always)                     i IBC
 BB02 [0001]  1       BB04                  1     73 [004..016)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC hascall bwd bwd-target
 BB04 [0003]  2       BB01,BB02             1     73 [016..019)-> BB02(1),BB05(0)         ( cond )                     i IBC bwd bwd-src
-BB05 [0004]  3       BB01,BB02,BB04        1     73 [019..01B)                           (return)                     i IBC
+BB05 [0004]  2       BB02,BB04             1     73 [019..01B)                           (return)                     i IBC
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
 ***************  Exception Handling table is empty
 *************** In fgDebugCheckBBlist
 
 *************** In fgExpandRarelyRunBlocks()
 *************** In fgReorderBlocks()
 
 Initial BasicBlocks
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
-BB01 [0000]  1                             1     73 [000..004)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC
+BB01 [0000]  1                             1     73 [000..004)-> BB04(1)                 (always)                     i IBC
 BB02 [0001]  1       BB04                  1     73 [004..016)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC hascall bwd bwd-target
 BB04 [0003]  2       BB01,BB02             1     73 [016..019)-> BB02(1),BB05(0)         ( cond )                     i IBC bwd bwd-src
+BB05 [0004]  2       BB02,BB04             1     73 [019..01B)                           (return)                     i IBC
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+
+Duplication of the conditional block BB04 (always branch from BB01) performed, because the cost of duplication (6) is less or equal than 6, validProfileWeights = true
+setting likelihood of BB01 -> BB02 to 0
+setting likelihood of BB01 -> BB05 from 1 to 1
+
+fgOptimizeBranch added these statements(s) at the end of BB01:
+STMT00008 ( 0x016[E-] ... ??? )
+     (  7,  6) [000030] -----------                         ▌  JTRUE     void  
+     (  5,  4) [000031] J------N---                         └──▌  EQ        int   
+     (  3,  2) [000032] -----------                            ├──▌  LCL_VAR   ref    V00 arg0         
+     (  1,  1) [000033] -----------                            └──▌  CNS_INT   ref    null
+
+fgOptimizeBranch changed block BB01 from BBJ_ALWAYS to BBJ_COND.
+
+After this change in fgOptimizeBranch the BB graph is:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
+BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
+BB01 [0000]  1                             1     73 [000..004)-> BB05(1),BB02(0)         ( cond )                     i IBC
+BB02 [0001]  2       BB01,BB04             1     73 [004..016)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC hascall bwd bwd-target
+BB04 [0003]  1       BB02                  1.37 100 [016..019)-> BB02(1),BB05(0)         ( cond )                     i IBC bwd bwd-src
 BB05 [0004]  3       BB01,BB02,BB04        1     73 [019..01B)                           (return)                     i IBC
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
 
 *************** Finishing PHASE Optimize control flow
 Trees after Optimize control flow
 
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
-BB01 [0000]  1                             1     73 [000..004)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC
-BB02 [0001]  1       BB04                  1     73 [004..016)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC hascall bwd bwd-target
-BB04 [0003]  2       BB01,BB02             1     73 [016..019)-> BB02(1),BB05(0)         ( cond )                     i IBC bwd bwd-src
+BB01 [0000]  1                             1     73 [000..004)-> BB05(1),BB02(0)         ( cond )                     i IBC
+BB02 [0001]  2       BB01,BB04             1     73 [004..016)-> BB05(0.5),BB04(0.5)     ( cond )                     i IBC hascall bwd bwd-target
+BB04 [0003]  1       BB02                  1.37 100 [016..019)-> BB02(1),BB05(0)         ( cond )                     i IBC bwd bwd-src
 BB05 [0004]  3       BB01,BB02,BB04        1     73 [019..01B)                           (return)                     i IBC
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
------------- BB01 [0000] [000..004) -> BB05(0.5),BB04(0.5) (cond), preds={} succs={BB04,BB05}
+------------ BB01 [0000] [000..004) -> BB05(1),BB02(0) (cond), preds={} succs={BB02,BB05}
 
 ***** BB01 [0000]
 STMT00000 ( 0x000[E-] ... 0x001 )
                [000001] DA---+-----                         ▌  STORE_LCL_VAR ref    V01 loc0         
                [000000] -----+-----                         └──▌  CNS_INT   ref    null
 
 ***** BB01 [0000]
-STMT00007 ( ??? ... ??? )
-               [000025] -----------                         ▌  JTRUE     void  
-               [000026] J------N---                         └──▌  NE        int   
-               [000027] -----------                            ├──▌  LCL_VAR   ref    V01 loc0         
-               [000028] -----------                            └──▌  CNS_INT   ref    null
+STMT00008 ( 0x016[E-] ... ??? )
+     (  7,  6) [000030] -----------                         ▌  JTRUE     void  
+     (  5,  4) [000031] J------N---                         └──▌  EQ        int   
+     (  3,  2) [000032] -----------                            ├──▌  LCL_VAR   ref    V00 arg0         
+     (  1,  1) [000033] -----------                            └──▌  CNS_INT   ref    null
 
------------- BB02 [0001] [004..016) -> BB05(0.5),BB04(0.5) (cond), preds={BB04} succs={BB04,BB05}
+------------ BB02 [0001] [004..016) -> BB05(0.5),BB04(0.5) (cond), preds={BB01,BB04} succs={BB04,BB05}
 
 ***** BB02 [0001]
 STMT00005 ( 0x004[E-] ... ??? )
                [000015] DA-XG+-----                         ▌  STORE_LCL_VAR ref    V00 arg0         
                [000021] ---XG+-----                         └──▌  IND       ref   
                [000024] -----+-----                            └──▌  ADD       byref 
                [000012] -----+-----                               ├──▌  LCL_VAR   ref    V00 arg0         
                [000023] -----+-----                               └──▌  CNS_INT   long   24 Fseq[<unknown field>]
 
 ***** BB02 [0001]
 STMT00006 ( 0x00C[E-] ... 0x012 )
                [000019] DAC-G+-----                         ▌  STORE_LCL_VAR ref    V01 loc0         
                [000018] --C-G+-----                         └──▌  CALL help ref    CORINFO_HELP_ISINSTANCEOFCLASS
                [000016] -----+----- arg1 in rdx                ├──▌  LCL_VAR   ref    V00 arg0         
                [000017] H----+-N--- arg0 in rcx                └──▌  CNS_INT(h) long   0x7fff78421508 class System.Xml.Schema.XmlSchema
 
 ***** BB02 [0001]
 STMT00001 ( 0x013[E-] ... 0x014 )
                [000005] -----+-----                         ▌  JTRUE     void  
                [000004] J----+-N---                         └──▌  NE        int   
                [000002] -----+-----                            ├──▌  LCL_VAR   ref    V01 loc0         
                [000003] -----+-----                            └──▌  CNS_INT   ref    null
 
------------- BB04 [0003] [016..019) -> BB02(1),BB05(0) (cond), preds={BB01,BB02} succs={BB05,BB02}
+------------ BB04 [0003] [016..019) -> BB02(1),BB05(0) (cond), preds={BB02} succs={BB05,BB02}
 
 ***** BB04 [0003]
 STMT00003 ( 0x016[E-] ... 0x017 )
-               [000011] -----+-----                         ▌  JTRUE     void  
-               [000010] J----+-N---                         └──▌  NE        int   
-               [000008] -----+-----                            ├──▌  LCL_VAR   ref    V00 arg0         
-               [000009] -----+-----                            └──▌  CNS_INT   ref    null
+     (  7,  6) [000011] -----------                         ▌  JTRUE     void  
+     (  5,  4) [000010] J------N---                         └──▌  NE        int   
+     (  3,  2) [000008] -----------                            ├──▌  LCL_VAR   ref    V00 arg0         
+     (  1,  1) [000009] -----------                            └──▌  CNS_INT   ref    null
 
 ------------ BB05 [0004] [019..01B) (return), preds={BB01,BB02,BB04} succs={}
 
 ***** BB05 [0004]
 STMT00002 ( 0x019[E-] ... 0x01A )
                [000007] -----+-----                         ▌  RETURN    ref   
                [000006] -----+-----                         └──▌  LCL_VAR   ref    V01 loc0         
 
 -------------------------------------------------------------------------------------------------------------------

The "duplicate tail" transformation here results in the loop no longer being in loop-inverted shape, so the final block layout looks worse.

@jakobbotsch
Copy link
Member Author

System.Double:Equals(System.Object):ubyte:this (Instrumented Tier1)
@@ -12,71 +12,70 @@
 ;
 ;  V00 this         [V00,T01] (  4,  2.65)   byref  ->  rsi         this single-def
 ;  V01 arg1         [V01,T00] (  5,  4.02)     ref  ->  rbx         class-hnd single-def <System.Object>
-;  V02 loc0         [V02,T04] (  4,  1.56)  double  ->  mm0        
+;  V02 loc0         [V02,T03] (  4,  1.56)  double  ->  mm6        
 ;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;* V04 tmp1         [V04,T02] (  0,  0   )     int  ->  zero-ref    "spilling qmarkNull"
-;  V05 tmp2         [V05,T03] (  4,  1.04)   ubyte  ->  rax         "Inline return value spill temp"
-;  V06 tmp3         [V06,T05] (  3,  0.78)  double  ->  mm0         "Inlining Arg"
+;* V04 tmp1         [V04    ] (  0,  0   )     int  ->  zero-ref    "spilling qmarkNull"
+;  V05 tmp2         [V05,T02] (  4,  1.04)   ubyte  ->  rax         "Inline return value spill temp"
+;  V06 tmp3         [V06,T04] (  3,  0.78)  double  ->  mm0         "Inlining Arg"
 ;
-; Lcl frame size = 40
+; Lcl frame size = 56
 
 G_M46727_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rsi
        push     rbx
-       sub      rsp, 40
+       sub      rsp, 56
+       vmovaps  xmmword ptr [rsp+0x20], xmm6
        mov      rsi, rcx
        ; byrRegs +[rsi]
        mov      rbx, rdx
        ; gcrRegs +[rbx]
-						;; size=12 bbWeight=1 PerfScore 2.75
+						;; size=18 bbWeight=1 PerfScore 4.75
 G_M46727_IG02:        ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0040 {rsi}, byref, isz
        test     rbx, rbx
-       jne      SHORT G_M46727_IG05
+       je       SHORT G_M46727_IG07
 						;; size=5 bbWeight=1 PerfScore 1.25
-G_M46727_IG03:        ; bbWeight=0.48, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+G_M46727_IG03:        ; bbWeight=0.50, gcrefRegs=0008 {rbx}, byrefRegs=0040 {rsi}, byref, isz
+       mov      rcx, 0xD1FFAB1E      ; System.Double
+       cmp      qword ptr [rbx], rcx
+       jne      SHORT G_M46727_IG07
+						;; size=15 bbWeight=0.50 PerfScore 2.12
+G_M46727_IG04:        ; bbWeight=0.52, gcrefRegs=0008 {rbx}, byrefRegs=0040 {rsi}, byref, isz
+       mov      rcx, 0xD1FFAB1E
+       call     CORINFO_HELP_COUNTPROFILE32
+       ; gcr arg pop 0
+       vmovsd   xmm6, qword ptr [rbx+0x08]
+       vucomisd xmm6, qword ptr [rsi]
+       jp       SHORT G_M46727_IG09
+       jne      SHORT G_M46727_IG09
+						;; size=28 bbWeight=0.52 PerfScore 6.37
+G_M46727_IG05:        ; bbWeight=0.26, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rbx]
        ; byrRegs -[rsi]
+       mov      eax, 1
+						;; size=5 bbWeight=0.26 PerfScore 0.07
+G_M46727_IG06:        ; bbWeight=0.52, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc
+       vmovaps  xmm6, xmmword ptr [rsp+0x20]
+       add      rsp, 56
+       pop      rbx
+       pop      rsi
+       ret      
+						;; size=13 bbWeight=0.52 PerfScore 3.25
+G_M46727_IG07:        ; bbWeight=0.48, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        mov      rcx, 0xD1FFAB1E
        call     CORINFO_HELP_COUNTPROFILE32
        ; gcr arg pop 0
        xor      eax, eax
 						;; size=17 bbWeight=0.48 PerfScore 0.72
-G_M46727_IG04:        ; bbWeight=0.48, epilog, nogc, extend
-       add      rsp, 40
+G_M46727_IG08:        ; bbWeight=0.48, epilog, nogc, extend
+       vmovaps  xmm6, xmmword ptr [rsp+0x20]
+       add      rsp, 56
        pop      rbx
        pop      rsi
        ret      
-						;; size=7 bbWeight=0.48 PerfScore 1.08
-G_M46727_IG05:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0040 {rsi}, gcvars, byref, isz
-       ; gcrRegs +[rbx]
-       ; byrRegs +[rsi]
-       mov      rcx, 0xD1FFAB1E      ; System.Double
-       cmp      qword ptr [rbx], rcx
-       jne      SHORT G_M46727_IG03
-						;; size=15 bbWeight=0.50 PerfScore 2.12
-G_M46727_IG06:        ; bbWeight=0.52, gcrefRegs=0008 {rbx}, byrefRegs=0040 {rsi}, byref, isz
-       mov      rcx, 0xD1FFAB1E
-       call     CORINFO_HELP_COUNTPROFILE32
-       ; gcr arg pop 0
-       vmovsd   xmm0, qword ptr [rbx+0x08]
-       vucomisd xmm0, qword ptr [rsi]
-       jp       SHORT G_M46727_IG09
-       jne      SHORT G_M46727_IG09
-						;; size=28 bbWeight=0.52 PerfScore 6.37
-G_M46727_IG07:        ; bbWeight=0.26, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       ; gcrRegs -[rbx]
-       ; byrRegs -[rsi]
-       mov      eax, 1
-						;; size=5 bbWeight=0.26 PerfScore 0.07
-G_M46727_IG08:        ; bbWeight=0.52, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc
-       add      rsp, 40
-       pop      rbx
-       pop      rsi
-       ret      
-						;; size=7 bbWeight=0.52 PerfScore 1.17
+						;; size=13 bbWeight=0.48 PerfScore 3.00
 G_M46727_IG09:        ; bbWeight=0.26, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0040 {rsi}, gcvars, byref, isz
        ; byrRegs +[rsi]
-       vucomisd xmm0, xmm0
+       vucomisd xmm6, xmm6
        jp       SHORT G_M46727_IG10
        je       SHORT G_M46727_IG11
 						;; size=8 bbWeight=0.26 PerfScore 1.04
@@ -85,15 +84,15 @@ G_M46727_IG10:        ; bbWeight=0.13, gcrefRegs=0000 {}, byrefRegs=0040 {rsi},
        vucomisd xmm0, xmm0
        setp     al
        movzx    rax, al
-       jmp      SHORT G_M46727_IG08
+       jmp      SHORT G_M46727_IG06
 						;; size=16 bbWeight=0.13 PerfScore 1.20
 G_M46727_IG11:        ; bbWeight=0.13, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
        ; byrRegs -[rsi]
        xor      eax, eax
-       jmp      SHORT G_M46727_IG08
+       jmp      SHORT G_M46727_IG06
 						;; size=4 bbWeight=0.13 PerfScore 0.29
 
-; Total bytes of code 124, prolog size 6, PerfScore 18.07, instruction count 38, allocated bytes for code 124 (MethodHash=203a4978) for method System.Double:Equals(System.Object):ubyte:this (Instrumented Tier1)
+; Total bytes of code 142, prolog size 12, PerfScore 24.07, instruction count 41, allocated bytes for code 142 (MethodHash=203a4978) for method System.Double:Equals(System.Object):ubyte:this (Instrumented Tier1)

Looks like after we folded some control flow we managed to compact some more blocks, and this results in worse block layout now.

@jakobbotsch
Copy link
Member Author

System.Data.Common.DataStorage:GetStorageType(System.Type):int (Tier1)
@@ -9,22 +9,22 @@
 ; 3 inlinees with PGO data; 0 single block inlinees; 0 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T03] ( 12, 46.60)     ref  ->  rbx         class-hnd single-def <System.Type>
-;  V01 loc0         [V01,T09] (  3,  0   )     int  ->  rax        
-;  V02 loc1         [V02,T01] (  6, 65.17)     int  ->  rsi        
+;  V00 arg0         [V00,T03] ( 15, 20.18)     ref  ->  rbx         class-hnd single-def <System.Type>
+;  V01 loc0         [V01,T07] (  3,  0   )     int  ->  rax        
+;  V02 loc1         [V02,T00] ( 10, 65.17)     int  ->  rsi        
 ;  V03 OutArgs      [V03    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V04 tmp1         [V04,T05] (  2,  0   )   ubyte  ->  rax         "Inline return value spill temp"
-;  V05 tmp2         [V05,T00] (  5, 92.71)     ref  ->  rbp         class-hnd "Inlining Arg" <System.Type>
-;* V06 tmp3         [V06,T06] (  0,  0   )     int  ->  zero-ref    "spilling qmarkNull"
-;  V07 tmp4         [V07,T08] (  3,  0.04)     int  ->  rax         "Inline return value spill temp"
-;  V08 tmp5         [V08,T07] (  3,  0.08)     int  ->  rax         "guarded devirt return temp"
+;  V04 tmp1         [V04,T08] (  2,  0   )   ubyte  ->  rax         "Inline return value spill temp"
+;  V05 tmp2         [V05,T01] (  7, 63.39)     ref  ->  rbp         class-hnd "Inlining Arg" <System.Type>
+;* V06 tmp3         [V06    ] (  0,  0   )     int  ->  zero-ref    "spilling qmarkNull"
+;  V07 tmp4         [V07,T06] (  3,  0.04)     int  ->  rax         "Inline return value spill temp"
+;  V08 tmp5         [V08,T05] (  3,  0.08)     int  ->  rax         "guarded devirt return temp"
 ;* V09 tmp6         [V09    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "guarded devirt this exact temp" <System.RuntimeType>
 ;* V10 tmp7         [V10    ] (  0,  0   )     int  ->  zero-ref    "Inline return value spill temp"
 ;* V11 tmp8         [V11    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "Inlining Arg" <System.RuntimeType>
 ;* V12 tmp9         [V12    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inline stloc first use temp" <System.RuntimeType>
-;  V13 tmp10        [V13,T02] (  2, 63.09)     ref  ->  rcx         "arr expr"
+;  V13 tmp10        [V13,T02] (  4, 63.09)     ref  ->  rax         "arr expr"
 ;* V14 tmp11        [V14    ] (  0,  0   )     ref  ->  zero-ref    "argument with side effect"
-;  V15 cse0         [V15,T04] (  2, 16.77)     ref  ->  rdi         hoist "CSE #01: aggressive"
+;  V15 cse0         [V15,T04] (  4, 16.78)     ref  ->  rdi         hoist multi-def "CSE #01: aggressive"
 ;
 ; Lcl frame size = 40
 
@@ -37,65 +37,93 @@ G_M6739_IG01:        ; bbWeight=1.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        mov      rbx, rcx
        ; gcrRegs +[rbx]
 						;; size=11 bbWeight=1.00 PerfScore 4.49
-G_M6739_IG02:        ; bbWeight=1.00, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
+G_M6739_IG02:        ; bbWeight=1.00, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
        xor      esi, esi
+       test     rbx, rbx
+       je       SHORT G_M6739_IG09
+       mov      rcx, 0xD1FFAB1E      ; System.RuntimeType
+       cmp      qword ptr [rbx], rcx
+       jne      SHORT G_M6739_IG09
        mov      rcx, 0xD1FFAB1E      ; const ptr
        mov      rdi, gword ptr [rcx]
        ; gcrRegs +[rdi]
-						;; size=15 bbWeight=1.00 PerfScore 2.49
-G_M6739_IG03:        ; bbWeight=15.77, gcrefRegs=0088 {rbx rdi}, byrefRegs=0000 {}, byref, isz
-       mov      rcx, rdi
-       ; gcrRegs +[rcx]
-       mov      eax, esi
-       mov      rbp, gword ptr [rcx+8*rax+0x10]
+						;; size=35 bbWeight=1.00 PerfScore 7.98
+G_M6739_IG03:        ; bbWeight=15.62, gcrefRegs=0088 {rbx rdi}, byrefRegs=0000 {}, byref, isz
+       mov      rax, rdi
+       ; gcrRegs +[rax]
+       mov      ecx, esi
+       mov      rbp, gword ptr [rax+8*rcx+0x10]
        ; gcrRegs +[rbp]
        cmp      rbx, rbp
-       je       SHORT G_M6739_IG12
-						;; size=15 bbWeight=15.77 PerfScore 59.15
-G_M6739_IG04:        ; bbWeight=14.83, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, byref, isz
-       ; gcrRegs -[rcx]
-       test     rbx, rbx
-       je       SHORT G_M6739_IG07
-						;; size=5 bbWeight=14.83 PerfScore 18.54
-G_M6739_IG05:        ; bbWeight=14.81, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, byref, isz
-       test     rbp, rbp
-       je       SHORT G_M6739_IG07
-						;; size=5 bbWeight=14.81 PerfScore 18.51
-G_M6739_IG06:        ; bbWeight=13.88, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, byref, isz
-       mov      rcx, 0xD1FFAB1E      ; System.RuntimeType
-       cmp      qword ptr [rbx], rcx
-       jne      SHORT G_M6739_IG14
-						;; size=15 bbWeight=13.88 PerfScore 58.98
-G_M6739_IG07:        ; bbWeight=15.81, gcrefRegs=0088 {rbx rdi}, byrefRegs=0000 {}, byref, isz
-       ; gcrRegs -[rbp]
+       je       SHORT G_M6739_IG15
+						;; size=15 bbWeight=15.62 PerfScore 58.56
+G_M6739_IG04:        ; bbWeight=15.66, gcrefRegs=0088 {rbx rdi}, byrefRegs=0000 {}, byref, isz
+       ; gcrRegs -[rax rbp]
        inc      esi
        cmp      esi, 41
        jl       SHORT G_M6739_IG03
-						;; size=7 bbWeight=15.81 PerfScore 23.72
-G_M6739_IG08:        ; bbWeight=0.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+						;; size=7 bbWeight=15.66 PerfScore 23.48
+G_M6739_IG05:        ; bbWeight=0.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rdi]
        test     rbx, rbx
-       je       SHORT G_M6739_IG16
-						;; size=5 bbWeight=0.04 PerfScore 0.05
-G_M6739_IG09:        ; bbWeight=0.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+       je       G_M6739_IG19
+						;; size=9 bbWeight=0.04 PerfScore 0.05
+G_M6739_IG06:        ; bbWeight=0.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        mov      rcx, 0xD1FFAB1E      ; System.RuntimeType
        cmp      qword ptr [rbx], rcx
-       jne      SHORT G_M6739_IG15
-						;; size=15 bbWeight=0.04 PerfScore 0.18
-G_M6739_IG10:        ; bbWeight=0.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
+       jne      G_M6739_IG18
+						;; size=19 bbWeight=0.04 PerfScore 0.18
+G_M6739_IG07:        ; bbWeight=0.04, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        mov      rcx, rbx
        ; gcrRegs +[rcx]
        call     [System.RuntimeType:GetTypeCodeImpl():int:this]
        ; gcrRegs -[rcx rbx]
        ; gcr arg pop 0
 						;; size=9 bbWeight=0.04 PerfScore 0.13
-G_M6739_IG11:        ; bbWeight=0.04, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-       jmp      SHORT G_M6739_IG17
-						;; size=2 bbWeight=0.04 PerfScore 0.08
-G_M6739_IG12:        ; bbWeight=0.96, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+G_M6739_IG08:        ; bbWeight=0.04, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+       jmp      G_M6739_IG20
+						;; size=5 bbWeight=0.04 PerfScore 0.08
+G_M6739_IG09:        ; bbWeight=0.01, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
+       ; gcrRegs +[rbx]
+       mov      rax, 0xD1FFAB1E      ; const ptr
+       mov      rdi, gword ptr [rax]
+       ; gcrRegs +[rdi]
+						;; size=13 bbWeight=0.01 PerfScore 0.02
+G_M6739_IG10:        ; bbWeight=0.16, gcrefRegs=0088 {rbx rdi}, byrefRegs=0000 {}, byref, isz
+       mov      rax, rdi
+       ; gcrRegs +[rax]
+       mov      edx, esi
+       mov      rbp, gword ptr [rax+8*rdx+0x10]
+       ; gcrRegs +[rbp]
+       cmp      rbx, rbp
+       je       SHORT G_M6739_IG15
+						;; size=15 bbWeight=0.16 PerfScore 0.59
+G_M6739_IG11:        ; bbWeight=0.15, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, byref, isz
+       ; gcrRegs -[rax]
+       test     rbx, rbx
+       je       SHORT G_M6739_IG14
+						;; size=5 bbWeight=0.15 PerfScore 0.19
+G_M6739_IG12:        ; bbWeight=0.15, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, byref, isz
+       test     rbp, rbp
+       je       SHORT G_M6739_IG14
+						;; size=5 bbWeight=0.15 PerfScore 0.19
+G_M6739_IG13:        ; bbWeight=0.14, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, byref, isz
+       mov      rax, 0xD1FFAB1E      ; System.RuntimeType
+       cmp      qword ptr [rbx], rax
+       jne      SHORT G_M6739_IG17
+						;; size=15 bbWeight=0.14 PerfScore 0.59
+G_M6739_IG14:        ; bbWeight=0.16, gcrefRegs=0088 {rbx rdi}, byrefRegs=0000 {}, byref, isz
+       ; gcrRegs -[rbp]
+       inc      esi
+       cmp      esi, 41
+       jl       SHORT G_M6739_IG10
+       jmp      SHORT G_M6739_IG05
+						;; size=9 bbWeight=0.16 PerfScore 0.55
+G_M6739_IG15:        ; bbWeight=0.96, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+       ; gcrRegs -[rbx rdi]
        mov      eax, esi
 						;; size=2 bbWeight=0.96 PerfScore 0.24
-G_M6739_IG13:        ; bbWeight=0.96, epilog, nogc, extend
+G_M6739_IG16:        ; bbWeight=0.96, epilog, nogc, extend
        add      rsp, 40
        pop      rbx
        pop      rbp
@@ -103,7 +131,7 @@ G_M6739_IG13:        ; bbWeight=0.96, epilog, nogc, extend
        pop      rdi
        ret      
 						;; size=9 bbWeight=0.96 PerfScore 3.12
-G_M6739_IG14:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, gcvars, byref, isz
+G_M6739_IG17:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=00A8 {rbx rbp rdi}, byrefRegs=0000 {}, gcvars, byref, isz
        ; gcrRegs +[rbx rbp rdi]
        mov      rdx, rbp
        ; gcrRegs +[rdx]
@@ -112,7 +140,7 @@ G_M6739_IG14:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=00A8 {r
        ; gcrRegs -[rdx] +[rax]
        ; gcr arg pop 0
        test     rax, rax
-       jne      SHORT G_M6739_IG07
+       jne      SHORT G_M6739_IG14
        mov      rcx, rbx
        ; gcrRegs +[rcx]
        mov      rdx, rbp
@@ -124,10 +152,10 @@ G_M6739_IG14:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=00A8 {r
        ; gcrRegs -[rcx rdx rbp]
        ; gcr arg pop 0
        test     eax, eax
-       je       SHORT G_M6739_IG07
-       jmp      SHORT G_M6739_IG12
+       je       SHORT G_M6739_IG14
+       jmp      SHORT G_M6739_IG15
 						;; size=48 bbWeight=0 PerfScore 0.00
-G_M6739_IG15:        ; bbWeight=0, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+G_M6739_IG18:        ; bbWeight=0, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        ; gcrRegs -[rdi]
        mov      rcx, rbx
        ; gcrRegs +[rcx]
@@ -136,17 +164,17 @@ G_M6739_IG15:        ; bbWeight=0, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byre
        call     [rax+0x10]<unknown method>
        ; gcrRegs -[rcx rbx]
        ; gcr arg pop 0
-       jmp      SHORT G_M6739_IG11
-						;; size=18 bbWeight=0 PerfScore 0.00
-G_M6739_IG16:        ; bbWeight=0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+       jmp      G_M6739_IG08
+						;; size=21 bbWeight=0 PerfScore 0.00
+G_M6739_IG19:        ; bbWeight=0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        xor      eax, eax
 						;; size=2 bbWeight=0.00 PerfScore 0.00
-G_M6739_IG17:        ; bbWeight=0, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+G_M6739_IG20:        ; bbWeight=0, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        xor      ecx, ecx
        cmp      eax, 1
        cmove    eax, ecx
 						;; size=8 bbWeight=0 PerfScore 0.00
-G_M6739_IG18:        ; bbWeight=0, epilog, nogc, extend
+G_M6739_IG21:        ; bbWeight=0, epilog, nogc, extend
        add      rsp, 40
        pop      rbx
        pop      rbp
@@ -155,7 +183,7 @@ G_M6739_IG18:        ; bbWeight=0, epilog, nogc, extend
        ret      
 						;; size=9 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 200, prolog size 11, PerfScore 189.68, instruction count 67, allocated bytes for code 200 (MethodHash=2fe8e5ac) for method System.Data.Common.DataStorage:GetStorageType(System.Type):int (Tier1)
+; Total bytes of code 271, prolog size 11, PerfScore 100.44, instruction count 83, allocated bytes for code 271 (MethodHash=2fe8e5ac) for method System.Data.Common.DataStorage:GetStorageType(System.Type):int (Tier1)

This one just looks like loop cloning now managed to kick in.

@jakobbotsch
Copy link
Member Author

/azp run FUzzlyn

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Recognizing a series of handle compares as a switch is not legal since
those handles may change values.
@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib PTAL @AndyAyersMS

Diffs. They are quite mixed, but overall a decent improvement. TP also improves quite a bit. I analyzed some of the regressions above -- a lot of them seems to be that this change results in various different block layout decisions.
Any thoughts on whether we should take this now or wait to the beginning of .NET 10?

@jakobbotsch jakobbotsch requested a review from AndyAyersMS July 10, 2024 12:31
@@ -86,7 +86,7 @@ bool IsConstantTestCondBlock(const BasicBlock* block,
}

// We're looking for "X EQ/NE CNS" or "CNS EQ/NE X" pattern
if (op1->IsCnsIntOrI() ^ op2->IsCnsIntOrI())
if ((op1->IsCnsIntOrI() && !op1->IsIconHandle()) ^ (op2->IsCnsIntOrI() && !op2->IsIconHandle()))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already merged as a separate change in #104634.

Comment on lines 1969 to 1973
// No point duplicating this block if it would not remove (part of) the joint.
if ((target->GetTrueTarget() == target) || (target->GetFalseTarget() == target))
{
return false;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hitting some divergence in fgUpdateFlowGraph without this check. It only has a handful of diffs on its own.

@@ -1966,6 +1966,12 @@ bool Compiler::fgBlockIsGoodTailDuplicationCandidate(BasicBlock* target, unsigne
return false;
}

// No point duplicating this block if it would not remove (part of) the joint.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// No point duplicating this block if it would not remove (part of) the joint.
// No point duplicating this block if it would not remove (part of) the join.

@@ -1966,6 +1966,12 @@ bool Compiler::fgBlockIsGoodTailDuplicationCandidate(BasicBlock* target, unsigne
return false;
}

// No point duplicating this block if it would not remove (part of) the joint.
if ((target->GetTrueTarget() == target) || (target->GetFalseTarget() == target))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ((target->GetTrueTarget() == target) || (target->GetFalseTarget() == target))
if (target->TrueTargetIs(target) || target->FalseTargetIs(target))

@amanasifkhalid
Copy link
Member

Thanks for looking into the regressions!

Looks like after we folded some control flow we managed to compact some more blocks, and this results in worse block layout now.

Interesting -- I would hope more compaction wouldn't regress block layout (I'm guessing it's instead pessimizing some other flow opts). I haven't looked closely at the regressions from #103785 though...

@AndyAyersMS
Copy link
Member

I think we can certainly consider taking this. Can you look at this diff in benchmarks_pgo?

     292 (4.31 % of base) : 107474.dasm - System.Collections.Generic.ArraySortHelper`1[int]:IntroSort(System.Span`1[int],int,System.Comparison`1[int]) (Tier1)

Curious if it's more cloning or whatnot...

@jakobbotsch
Copy link
Member Author

I think we can certainly consider taking this. Can you look at this diff in benchmarks_pgo?

     292 (4.31 % of base) : 107474.dasm - System.Collections.Generic.ArraySortHelper`1[int]:IntroSort(System.Span`1[int],int,System.Comparison`1[int]) (Tier1)

Curious if it's more cloning or whatnot...

It doesn't look to be more loop cloning. From the codegen diff it looks like different block layout that leads to different register allocation: https://www.diffchecker.com/O3MjiTIP/
It's hard to pinpoint exactly what causes the differences. There's a lot of different decisions being made (e.g. new loop canonicalization) after we straighten out the flow early.
Loop cloning also seems to come into play, the codegen is half the size with loop cloning disabled and in that case this change becomes both a size and perfscore improvement (it looks like even with loop cloning it is a perfscore improvement. Not sure how comparable the perfscores are here, though it does look like we did not recognize any more general loops, which is what caused the perfscore differences in previous cases I looked at.)

@hez2010
Copy link
Contributor

hez2010 commented Jul 11, 2024

This diff is interesting (coming from benchmarks.run_pgo.windows.x64.checked.mch), where it stores eax into [rcx+0x18] and then loads it back to eax immediately for and, and then it stores eax back into [rcx+0x18] again:

-G_M58686_IG01:        ; bbWeight=0, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-						;; size=0 bbWeight=0 PerfScore 0.00
-G_M58686_IG02:        ; bbWeight=0, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref
+G_M58686_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+						;; size=0 bbWeight=1 PerfScore 0.00
+G_M58686_IG02:        ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref
        ; gcrRegs +[rcx]
        xor      eax, eax
        mov      dword ptr [rcx+0x18], eax
-       and      dword ptr [rcx+0x18], 0xD1FFAB1E
-						;; size=12 bbWeight=0 PerfScore 0.00
-G_M58686_IG03:        ; bbWeight=0, epilog, nogc, extend
+       mov      eax, dword ptr [rcx+0x18]
+       and      eax, 0xD1FFAB1E
+       mov      dword ptr [rcx+0x18], eax
+						;; size=16 bbWeight=1 PerfScore 4.50
+G_M58686_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
-						;; size=1 bbWeight=0 PerfScore 0.00
+						;; size=1 bbWeight=1 PerfScore 1.00
 
-; Total bytes of code 13, prolog size 0, PerfScore 0.00, instruction count 4, allocated bytes for code 13 (MethodHash=f3a21ac1) for method System.Threading.Tasks.Task+SetOnInvokeMres:.ctor():this (Tier1)
+; Total bytes of code 17, prolog size 0, PerfScore 5.50, instruction count 6, allocated bytes for code 17 (MethodHash=f3a21ac1) for method System.Threading.Tasks.Task+SetOnInvokeMres:.ctor():this (Tier1)

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM overall.

@jakobbotsch
Copy link
Member Author

This diff is interesting (coming from benchmarks.run_pgo.windows.x64.checked.mch), where it stores eax into [rcx+0x18] and then loads it back to eax immediately for and, and then it stores eax back into [rcx+0x18] again:

-G_M58686_IG01:        ; bbWeight=0, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-						;; size=0 bbWeight=0 PerfScore 0.00
-G_M58686_IG02:        ; bbWeight=0, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref
+G_M58686_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
+						;; size=0 bbWeight=1 PerfScore 0.00
+G_M58686_IG02:        ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref
        ; gcrRegs +[rcx]
        xor      eax, eax
        mov      dword ptr [rcx+0x18], eax
-       and      dword ptr [rcx+0x18], 0xD1FFAB1E
-						;; size=12 bbWeight=0 PerfScore 0.00
-G_M58686_IG03:        ; bbWeight=0, epilog, nogc, extend
+       mov      eax, dword ptr [rcx+0x18]
+       and      eax, 0xD1FFAB1E
+       mov      dword ptr [rcx+0x18], eax
+						;; size=16 bbWeight=1 PerfScore 4.50
+G_M58686_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
-						;; size=1 bbWeight=0 PerfScore 0.00
+						;; size=1 bbWeight=1 PerfScore 1.00
 
-; Total bytes of code 13, prolog size 0, PerfScore 0.00, instruction count 4, allocated bytes for code 13 (MethodHash=f3a21ac1) for method System.Threading.Tasks.Task+SetOnInvokeMres:.ctor():this (Tier1)
+; Total bytes of code 17, prolog size 0, PerfScore 5.50, instruction count 6, allocated bytes for code 17 (MethodHash=f3a21ac1) for method System.Threading.Tasks.Task+SetOnInvokeMres:.ctor():this (Tier1)

Looks like in the diff we did a CSE that stops us from doing the containment. The current collections do the same CSE even with the baseline it seems, so I can't repro it.

@jakobbotsch
Copy link
Member Author

Ok, managed to find the right context for the above (turns out my collection was the outdated one).

Before flowgraph opts we have these blocks:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB02(1)                 (always)                     i IBC
BB02 [0013]  1       BB01                  0      0 [000..001)-> BB06(1)                 (always)                     i IBC rare
BB03 [0014]  0                             1    100 [000..001)-> BB05(1),BB04(0.000315)  ( cond )                     i IBC
BB04 [0015]  1       BB03                  0.00   0 [000..001)-> BB06(1)                 (always)                     i IBC
BB05 [0016]  1       BB03                  1.00 100 [000..001)-> BB06(1)                 (always)                     i IBC
BB06 [0017]  3       BB02,BB04,BB05        1    100 [000..001)-> BB08(1),BB07(0)         ( cond )                     i IBC
BB07 [0009]  1       BB06                  0      0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB08 [0010]  1       BB06                  1    100 [000..008)-> BB10(1)                 (always)                     i IBC
BB09 [0041]  0                             0      0 [000..001)-> BB10(1)                 (always)                     i IBC rare hascall gcsafe
BB10 [0042]  2       BB08,BB09             1    100 [000..001)-> BB12(1),BB11(0)         ( cond )                     i IBC
BB11 [0046]  1       BB10                  0      0 [000..001)-> BB12(1)                 (always)                     i IBC rare hascall gcsafe
BB12 [0047]  2       BB10,BB11             1    100 [000..009)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Looks like in the base the flowgraph optimizations we do mean we end up with all 0 weight basic blocks going into CSE:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0      0 [000..001)-> BB03(1),BB02(0)         ( cond )                     i IBC rare
BB02 [0009]  1       BB01                  0      0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB03 [0010]  1       BB01                  0      0 [000..008)-> BB05(1),BB04(0)         ( cond )                     i IBC rare
BB04 [0046]  1       BB03                  0      0 [000..001)-> BB05(1)                 (always)                     i IBC rare hascall gcsafe
BB05 [0047]  2       BB03,BB04             0      0 [000..009)                           (return)                     i IBC rare
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

CSE decides to do no CSE because of the 0 weights. OTOH, with the early folding done by this PR there is just one block going into CSE, which appeared after a bunch of compaction. It seems the result is a different weight that means CSE does end up happening:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..009)                           (return)                     i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The "inconsistent weights" problem is probably (similar to) #96614. In the end we also ought to recognize the address mode regardless of the intervening local created by CSE (it ends up single-def single-used after a bunch of transformations). That goes under the #104538 umbrella.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: IV widening does not kick in for simple foreach over List<int>
4 participants