-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: poor CQ for list enumeration constructs #9304
Comments
For ; Assembly listing for method P:SumFor(ref):int
G_M43955_IG01:
4883EC28 sub rsp, 40
G_M43955_IG02:
33C0 xor eax, eax // result
33D2 xor edx, edx // i
448B4118 mov r8d, dword ptr [rcx+24] // _size
4585C0 test r8d, r8d
7E1E jle SHORT G_M43955_IG05
G_M43955_IG03:
413BD0 cmp edx, r8d
731E jae SHORT G_M43955_IG06
G_M43955_IG04:
4C8B4908 mov r9, gword ptr [rcx+8] // _items
413B5108 cmp edx, dword ptr [r9+8] // array len
731A jae SHORT G_M43955_IG07
4C63D2 movsxd r10, edx
4303449110 add eax, dword ptr [r9+4*r10+16]
FFC2 inc edx
413BD0 cmp edx, r8d
7CE2 jl SHORT G_M43955_IG03
G_M43955_IG05:
4883C428 add rsp, 40
C3 ret
G_M43955_IG06:
E85758A35E call System.ThrowHelper:ThrowArgumentOutOfRange_IndexException()
CC int3
G_M43955_IG07:
E831D8365F call CORINFO_HELP_RNGCHKFAIL
CC IG03 is a critical block (fork/join). The join limits the jit's ability to reason about what happens in this block. When reaching IG03 from IG02: r8d > 0, edx = 0. So edx < r8d. So the branch to IG06 is not taken. When reaching IG03 from IG04: edx < r8d. So the branch to IG06 is not taken. It might be simpler for the jit to reason about IG03 if it simply cloned it to push the join down to IG04. IG03's branch makes IG04 "conditionally executed" in the loop. So the jit won't consider hoisting invariants like "mov r9, gword ptr [rcx+8]". |
There would be more value in inlining However I think the range check elimination currently only works evaluating to |
For ; method P:SumFor(ref):int
; MoveNext* with AggressiveInline
G_M41030_IG01:
4883EC28 sub rsp, 40
G_M41030_IG02:
33C0 xor eax, eax // i
8B11 mov edx, dword ptr [rcx]
488BD1 mov rdx, rcx
8B4A1C mov ecx, dword ptr [rdx+28] // _version
448BC1 mov r8d, ecx
448BC8 mov r9d, eax
G_M41030_IG03:
EB03 jmp SHORT G_M41030_IG05
G_M41030_IG04:
4103C2 add eax, r10d
G_M41030_IG05:
443BC1 cmp r8d, ecx
7523 jne SHORT G_M41030_IG06
443B4A18 cmp r9d, dword ptr [rdx+24] // _index
731D jae SHORT G_M41030_IG06
4C8B5208 mov r10, gword ptr [rdx+8] // _list
453B4A08 cmp r9d, dword ptr [r10+8]
7335 jae SHORT G_M41030_IG11
4D63D9 movsxd r11, r9d
478B549A10 mov r10d, dword ptr [r10+4*r11+16]
41FFC1 inc r9d
41BB01000000 mov r11d, 1
EB12 jmp SHORT G_M41030_IG08
G_M41030_IG06:
443BC1 cmp r8d, ecx
7517 jne SHORT G_M41030_IG10
G_M41030_IG07:
448B4A18 mov r9d, dword ptr [rdx+24]
41FFC1 inc r9d
4533D2 xor r10d, r10d
4533DB xor r11d, r11d
G_M41030_IG08:
4585DB test r11d, r11d
75BE jne SHORT G_M41030_IG04
G_M41030_IG09:
4883C428 add rsp, 40
C3 ret
G_M41030_IG10:
E87E5FA35E call System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
CC int3
G_M41030_IG11:
E8A8D7365F call CORINFO_HELP_RNGCHKFAIL
CC int3 IG08 is a also a critical block (testing the result of MoveNext): When reached from IG07 By cloning this block we can streamline both branches. The jit already realizes the compare and branch at the top of IG05 and IG06 has known outcome but doesn't act on it for some reason (perhaps because we're in SSA and don't want to mess with control flow?):
note the two operands have the same value numbers. Fixing all that would leave us with an inner loop something like: G_M41030_IG04:
4103C2 add eax, r10d
G_M41030_IG05:
443B4A18 cmp r9d, dword ptr [rdx+24] // _index
731D jae SHORT G_M41030_IG10
4C8B5208 mov r10, gword ptr [rdx+8] // _list
453B4A08 cmp r9d, dword ptr [r10+8]
7335 jae SHORT G_M41030_IG11
4D63D9 movsxd r11, r9d
478B549A10 mov r10d, dword ptr [r10+4*r11+16]
41FFC1 inc r9d
EB12 jmp SHORT G_M41030_IG04 where we then might be able to host some of the loads off of |
Codegen today: ; Assembly listing for method Program:SumFor(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T04] ( 4, 7 ) ref -> rcx class-hnd single-def <System.Collections.Generic.List`1[int]>
; V01 loc0 [V01,T02] ( 4, 10 ) int -> rax
;* V02 loc1 [V02,T05] ( 0, 0 ) int -> zero-ref
; V03 OutArgs [V03 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 3, 24 ) ref -> r10 "arr expr"
; V05 cse0 [V05,T03] ( 4, 10 ) int -> r8 "CSE #01: aggressive"
; V06 rat0 [V06,T00] ( 7, 25 ) long -> rdx "Widened IV V02"
;
; Lcl frame size = 40
G_M46421_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M46421_IG02: ;; offset=0x0004
xor eax, eax
xor edx, edx
mov r8d, dword ptr [rcx+0x10]
test r8d, r8d
jle SHORT G_M46421_IG04
align [15 bytes for IG03]
;; size=28 bbWeight=1 PerfScore 4.00
G_M46421_IG03: ;; offset=0x0020
cmp edx, r8d
jae SHORT G_M46421_IG05
mov r10, gword ptr [rcx+0x08]
cmp edx, dword ptr [r10+0x08]
jae SHORT G_M46421_IG06
add eax, dword ptr [r10+4*rdx+0x10]
inc edx
cmp edx, r8d
jl SHORT G_M46421_IG03
;; size=27 bbWeight=4 PerfScore 47.00
G_M46421_IG04: ;; offset=0x003B
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
G_M46421_IG05: ;; offset=0x0040
call [System.ThrowHelper:ThrowArgumentOutOfRange_IndexMustBeLessException()]
int3
;; size=7 bbWeight=0 PerfScore 0.00
G_M46421_IG06: ;; offset=0x0047
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 77, prolog size 4, PerfScore 52.50, instruction count 22, allocated bytes for code 77 (MethodHash=44f14aaa) for method Program:SumFor(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================ ; Assembly listing for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T07] ( 3, 3 ) ref -> rcx class-hnd single-def <System.Collections.Generic.List`1[int]>
; V01 loc0 [V01,T04] ( 4, 12.27) int -> rax
;* V02 loc1 [V02 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op <System.Collections.Generic.List`1+Enumerator[int]>
;* V03 loc2 [V03 ] ( 0, 0 ) int -> zero-ref
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "NewObj constructor temp" <System.Collections.Generic.List`1+Enumerator[int]>
;* V06 tmp2 [V06,T06] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V07 tmp3 [V07 ] ( 0, 0 ) ref -> zero-ref class-hnd "Inline stloc first use temp" <System.Collections.Generic.List`1[int]>
; V08 tmp4 [V08,T03] ( 3, 12.63) ref -> rcx single-def "field V02._list (fldOffset=0x0)" P-INDEP
; V09 tmp5 [V09,T00] ( 6, 23.54) int -> rdx "field V02._index (fldOffset=0x8)" P-INDEP
;* V10 tmp6 [V10,T11] ( 0, 0 ) int -> zero-ref single-def "field V02._version (fldOffset=0xc)" P-INDEP
; V11 tmp7 [V11,T05] ( 2, 7.27) int -> r8 "field V02._current (fldOffset=0x10)" P-INDEP
; V12 tmp8 [V12,T08] ( 3, 3 ) ref -> rcx single-def "field V05._list (fldOffset=0x0)" P-INDEP
;* V13 tmp9 [V13,T12] ( 0, 0 ) int -> zero-ref single-def "field V05._index (fldOffset=0x8)" P-INDEP
;* V14 tmp10 [V14,T09] ( 0, 0 ) int -> zero-ref single-def "field V05._version (fldOffset=0xc)" P-INDEP
;* V15 tmp11 [V15 ] ( 0, 0 ) int -> zero-ref single-def "field V05._current (fldOffset=0x10)" P-INDEP
; V16 tmp12 [V16,T01] ( 3, 21.81) ref -> r8 "arr expr"
;* V17 cse0 [V17,T10] ( 0, 0 ) int -> zero-ref "CSE #01: aggressive"
; V18 cse1 [V18,T02] ( 2, 16 ) int -> r8 "CSE #02: aggressive"
;
; Lcl frame size = 40
G_M40154_IG01: ;; offset=0x0000
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M40154_IG02: ;; offset=0x0004
xor eax, eax
mov edx, dword ptr [rcx+0x14]
xor edx, edx
jmp SHORT G_M40154_IG04
align [0 bytes for IG03]
;; size=9 bbWeight=1 PerfScore 4.50
G_M40154_IG03: ;; offset=0x000D
add eax, r8d
;; size=3 bbWeight=3.63 PerfScore 0.91
G_M40154_IG04: ;; offset=0x0010
mov r8d, dword ptr [rcx+0x10]
cmp edx, r8d
jae SHORT G_M40154_IG06
;; size=9 bbWeight=8 PerfScore 26.00
G_M40154_IG05: ;; offset=0x0019
mov r8, gword ptr [rcx+0x08]
cmp edx, dword ptr [r8+0x08]
jae SHORT G_M40154_IG07
mov r10d, edx
mov r8d, dword ptr [r8+4*r10+0x10]
inc edx
jmp SHORT G_M40154_IG03
;; size=22 bbWeight=3.63 PerfScore 38.16
G_M40154_IG06: ;; offset=0x002F
add rsp, 40
ret
;; size=5 bbWeight=4 PerfScore 5.00
G_M40154_IG07: ;; offset=0x0034
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 58, prolog size 4, PerfScore 74.82, instruction count 21, allocated bytes for code 58 (MethodHash=ae626325) for method Program:SumForeach(System.Collections.Generic.List`1[int]):int (FullOpts)
; ============================================================ Looks much better today, but there still seems to be a redundant compare in We also seemingly don't manage to widen the IV for the foreach version. I opened #101176 for this. |
With
Looking at the JIT dump, this looks like another case of the loop shape issue mentioned in #102343:
|
The loop rotation issue I noted above is fixed by #102343. Here's the block layout of
With the new implementation, here's the initial RPO layout from above:
Then after we fix up the backward jumps, we end up with this layout:
So we're back to square one. We can't use
And final codegen:
I think this warrants a follow-up to #102461. While unrelated to this issue, we don't handle the |
Follow-up to dotnet#102461, and part of dotnet#9304. Compacting blocks after establishing an RPO-based layout, but before moving backward jumps to fall into their successors, can enable more opportunities for branch removal; see dotnet#9304 for one such example.
Follow-up to dotnet#102461, and part of dotnet#9304. Compacting blocks after establishing an RPO-based layout, but before moving backward jumps to fall into their successors, can enable more opportunities for branch removal; see dotnet#9304 for one such example.
From some discussion over in dotnet/csharplang#1085. Both the
for
andforeach
cases have suboptimal CQ.In the
for
case there is a redundant branch in the inner loop.In the
foreach
case the enumerator'sMoveNext
is not inlined. If we force this andMoveNextRare
inline then the jit is not able to optimize away the version check overhead in the inner loop.Without the inlines the
foreach
is rougly 2.9x slower than thefor
; with the inlines this drops to 1.3x or so.category:cq
theme:basic-cq
skill-level:expert
cost:medium
impact:large
The text was updated successfully, but these errors were encountered: