Allow contained indirections in tailcalls on x64 #58686
Conversation
This adds support for contained indirections in tailcalls on x64. The most significant part of the diff is refactoring so that the code for generating call instructions can be reused when generating tailcalls as well. Beyond that, the main changes are to allow contained indirs in lowering and to ensure LSRA uses volatile registers for the addressing mode, so that the registers are not overwritten by the epilog sequence.

To make sure we insert the rex. prefix correctly, I also refactored the emitter to base the decision on a new instruction (INS_i_jmp) and emit the prefix when necessary. The rex. prefix is necessary because the unwinder uses it to determine that a tail jmp is non-local and thus part of the epilog.

Finally, unlike the OS unwinder, our unwinder needs to compute the size of the epilog. This computation was wrong for jmp [foo] with addressing modes/rip-relative addressing, so fix this as well. Presumably that has not been a problem before, since we did not generate these instructions in managed code (although native code may have had them -- not sure if we use that unwinder for native code).

This PR helps support dotnet#56669.
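The epilog-size fix hinges on correctly decoding the length of an indirect jmp. As a rough illustration (my own sketch, not the runtime's actual unwinder code), the length of a `jmp r/m64` (opcode FF /4) can be derived from its ModRM and SIB bytes like this:

```c
#include <stddef.h>

/* Hypothetical helper: compute the size in bytes of an indirect
 * "jmp r/m64" (opcode 0xFF, reg field /4), given a pointer to the opcode
 * byte. Covers the forms relevant here: register (jmp rax), [reg],
 * [reg+disp8], [reg+disp32], SIB forms, and RIP-relative [rip+disp32].
 * Any REX prefix is assumed to precede ip and is not counted. */
static size_t indirect_jmp_size(const unsigned char *ip)
{
    size_t size = 2;                 /* opcode + ModRM */
    unsigned char modrm = ip[1];
    unsigned mod = modrm >> 6;
    unsigned rm  = modrm & 7;

    if (mod == 3)
        return 2;                    /* jmp rax -- no memory operand */
    if (rm == 4)
        size += 1;                   /* SIB byte present */
    if (mod == 1)
        size += 1;                   /* disp8 */
    else if (mod == 2)
        size += 4;                   /* disp32 */
    else if (mod == 0)
    {
        if (rm == 5)
            size += 4;               /* [rip+disp32] */
        else if (rm == 4 && (ip[2] & 7) == 5)
            size += 4;               /* SIB with base==101 -> disp32 */
    }
    return size;
}
```

For example, jmp rax is 2 bytes, jmp [rax+32] is 3, and the RIP-relative jmp [rip+disp32] form is 6 -- the kind of form whose size was previously miscomputed.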
INST1(rex_jmp, "rex.jmp", IUM_RD, 0x0020FE, INS_FLAGS_None)
#endif

INST1(i_jmp,   "jmp",     IUM_RD, 0x0020FE, INS_FLAGS_None)
I noticed that we have the wrong opcode for this and compensate for it in the emitter for some reason, so I have fixed this as well.
SPMI diffs: aspnet.run.windows.x64.checked.mch:
Detail diffs
benchmarks.run.windows.x64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.x64.checked.mch:
Detail diffs
libraries.pmi.windows.x64.checked.mch:
Detail diffs
libraries_tests.pmi.windows.x64.checked.mch:
Detail diffs
Typical improvements:

diff --git "a/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\base\\10456.dasm" "b/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\diff\\10456.dasm"
index 0afccab..7541ecd 100644
--- "a/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\base\\10456.dasm"
+++ "b/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\diff\\10456.dasm"
@@ -25,20 +25,20 @@ G_M1504_IG02: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref, i
mov rax, 0xD1FFAB1E ; hackishClassName
cmp qword ptr [rcx], rax
jne SHORT G_M1504_IG04
- mov rax, qword ptr [(reloc)]
- ;; bbWeight=1 PerfScore 7.25
+ ;; bbWeight=1 PerfScore 5.25
G_M1504_IG03: ; , epilog, nogc, extend
- rex.jmp rax
+ tail.jmp [Logger:IsEnabled(int):bool:this]
+ ; gcr arg pop 0
;; bbWeight=1 PerfScore 2.00
G_M1504_IG04: ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, gcvars, byref
mov r11, 0xD1FFAB1E
- mov rax, qword ptr [(reloc)]
;; bbWeight=0 PerfScore 0.00
G_M1504_IG05: ; , epilog, nogc, extend
- rex.jmp rax
+ tail.jmp [hackishModuleName:hackishMethodName()]
+ ; gcr arg pop 0
;; bbWeight=0 PerfScore 0.00
-; Total bytes of code 49, prolog size 0, PerfScore 14.15, instruction count 9, allocated bytes for code 49 (MethodHash=e611fa1f) for method Logger`1:Microsoft.Extensions.Logging.ILogger.IsEnabled(int):bool:this
+; Total bytes of code 41, prolog size 0, PerfScore 11.35, instruction count 7, allocated bytes for code 41 (MethodHash=e611fa1f) for method Logger`1:Microsoft.Extensions.Logging.ILogger.IsEnabled(int):bool:this
; ============================================================
Unwind Info:

diff --git "a/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\base\\11392.dasm" "b/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\diff\\11392.dasm"
index 41c6f74..8de9722 100644
--- "a/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\base\\11392.dasm"
+++ "b/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\diff\\11392.dasm"
@@ -33,13 +33,12 @@ G_M28340_IG04: ; gcVars=0000000000000000 {}, gcrefRegs=00000002 {rcx}, by
; gcrRegs -[rax]
mov rax, qword ptr [rcx]
mov rax, qword ptr [rax+80]
- mov rax, qword ptr [rax+32]
;; bbWeight=0 PerfScore 0.00
G_M28340_IG05: ; , epilog, nogc, extend
- rex.jmp rax
+ tail.jmp qword ptr [rax+32]hackishModuleName:hackishMethodName()
;; bbWeight=0 PerfScore 0.00
-; Total bytes of code 37, prolog size 0, PerfScore 9.95, instruction count 9, allocated bytes for code 37 (MethodHash=f0b5914b) for method TransportConnection:Microsoft.AspNetCore.Connections.Features.IMemoryPoolFeature.get_MemoryPool():MemoryPool`1:this
+; Total bytes of code 34, prolog size 0, PerfScore 9.65, instruction count 8, allocated bytes for code 34 (MethodHash=f0b5914b) for method TransportConnection:Microsoft.AspNetCore.Connections.Features.IMemoryPoolFeature.get_MemoryPool():MemoryPool`1:this
; ============================================================
Unwind Info:

The regressions are due to LSRA making suboptimal choices with the extra freedom it is given for registers. Typical regression:

diff --git "a/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\base\\19767.dasm" "b/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\diff\\19767.dasm"
index ba49752..480d44b 100644
--- "a/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\base\\19767.dasm"
+++ "b/D:\\dev\\dotnet\\spmi\\asm.aspnet.run.windows.x64.checked\\diff\\19767.dasm"
@@ -8,39 +8,40 @@
;
; V00 this [V00,T00] ( 5, 5 ) ref -> rcx this class-hnd single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
-; V02 tmp1 [V02,T02] ( 2, 4 ) ref -> rsi class-hnd single-def "non-inline candidate call"
+; V02 tmp1 [V02,T02] ( 2, 4 ) ref -> [rsp+28H] class-hnd single-def "non-inline candidate call"
; V03 tmp2 [V03,T03] ( 2, 4 ) ref -> rdx single-def "argument with side effect"
-; V04 rat0 [V04,T01] ( 3, 6 ) ref -> rsi "delegate invoke call"
+; V04 rat0 [V04,T01] ( 3, 6 ) ref -> rax "delegate invoke call"
;
-; Lcl frame size = 32
+; Lcl frame size = 48
G_M501_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rsi
- sub rsp, 32
+ sub rsp, 48
;; bbWeight=1 PerfScore 1.25
G_M501_IG02: ; gcrefRegs=00000002 {rcx}, byrefRegs=00000000 {}, byref
; gcrRegs +[rcx]
mov rsi, gword ptr [rcx+8]
; gcrRegs +[rsi]
- mov rax, qword ptr [rcx]
- mov rax, qword ptr [rax+64]
- call gword ptr [rax+32]hackishModuleName:hackishMethodName()
- ; gcrRegs -[rcx] +[rax]
+ mov gword ptr [rsp+28H], rsi
+ ; GC ptr vars +{V02}
+ mov rdx, qword ptr [rcx]
+ mov rdx, qword ptr [rdx+64]
+ call gword ptr [rdx+32]hackishModuleName:hackishMethodName()
+ ; gcrRegs -[rcx rsi] +[rax]
; gcr arg pop 0
mov rdx, rax
; gcrRegs +[rdx]
- mov rcx, gword ptr [rsi+8]
+ mov rax, gword ptr [rsp+28H]
+ mov rcx, gword ptr [rax+8]
; gcrRegs +[rcx]
- mov rax, qword ptr [rsi+24]
- ; gcrRegs -[rax]
;; bbWeight=1 PerfScore 13.25
G_M501_IG03: ; , epilog, nogc, extend
- add rsp, 32
+ add rsp, 48
pop rsi
- rex.jmp rax
+ tail.jmp qword ptr [rax+24]hackishModuleName:hackishMethodName()
;; bbWeight=1 PerfScore 2.75
-; Total bytes of code 38, prolog size 5, PerfScore 21.05, instruction count 12, allocated bytes for code 38 (MethodHash=d3a3fe0a) for method RelationalExecutionStrategyFactory:Create():IExecutionStrategy:this
+; Total bytes of code 45, prolog size 5, PerfScore 21.75, instruction count 13, allocated bytes for code 45 (MethodHash=d3a3fe0a) for method RelationalExecutionStrategyFactory:Create():IExecutionStrategy:this
; ============================================================
Unwind Info:
@@ -53,5 +54,5 @@ Unwind Info:
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
- CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 3 * 8 + 8 = 32 = 0x20
+ CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 5 * 8 + 8 = 48 = 0x30
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
Also add some assertions that we are using volatile registers when generating the tailcall (the epilog restores the non-volatile ones, so the addressing mode must not use them). This fixes the arm64 tests as a side effect. It seems there is a bug in the allocator when specifying only a single register as a candidate, where it may use a local variable's home instead of the candidate.
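For context (my own sketch, not the JIT's actual register masks or assertion code): on Windows x64 the volatile integer registers are rax, rcx, rdx, and r8-r11; the epilog restores the non-volatile ones from the stack before the tail jmp executes, so any register feeding the jmp's addressing mode must come from the volatile set. The invariant can be illustrated as:

```c
#include <stdint.h>

/* Register numbers follow the usual x64 encoding (rax=0, rcx=1, ...). */
enum { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSP = 4, RBP = 5, RSI = 6,
       RDI = 7, R8 = 8, R9 = 9, R10 = 10, R11 = 11 };

/* Windows x64 ABI: caller-saved (volatile) integer registers. */
static const uint32_t VOLATILE_MASK =
    (1u << RAX) | (1u << RCX) | (1u << RDX) |
    (1u << R8)  | (1u << R9)  | (1u << R10) | (1u << R11);

/* A tail-jmp addressing mode may only use registers in this set; a
 * non-volatile register would be clobbered by the epilog's restores. */
static int is_safe_for_tailjmp_addr(int reg)
{
    return (VOLATILE_MASK >> reg) & 1;
}
```

This mirrors the shape of the invariant being asserted, not the actual JIT code.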
cc @dotnet/jit-contrib, @janvorli (for unwinder parts)
ping @dotnet/jit-contrib, @janvorli
ping @dotnet/jit-contrib, would appreciate a review
LGTM, I've played with it locally, nice improvements!
    }
    else
#endif // TARGET_ARM
        if (varTypeUsesFloatArgReg(returnType))
Just out of curiosity - is this strange indentation intentional or is it a result of the clang format?
@janvorli yeah, it's clang-format (I remember seeing exactly the same pattern after it)
The changes in excepamd64.cpp look good. I have not reviewed the JIT ones.
@@ -367,7 +367,7 @@ RtlVirtualUnwind_Worker (
             // This is a jmp outside of the function, probably a tail call
             // to an import function.
             InEpilogue = TRUE;
-            NextByte += 2;
+            NextByte += 6;
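To make the byte count concrete (my own illustration, not code from the unwinder): a RIP-relative indirect jmp, the form used for tail calls through an import thunk, encodes as opcode FF, ModRM 25, then a 4-byte displacement -- 6 bytes in total, rather than the 2 previously skipped.

```c
/* Hypothetical check: do the bytes at p start a RIP-relative indirect
 * jmp (FF 25)? If so, the full instruction is 6 bytes:
 *   0xFF   opcode (indirect jmp is FF /4)
 *   0x25   ModRM: mod=00, reg=100 (/4), rm=101 -> RIP-relative
 *   4 bytes of signed 32-bit displacement                         */
static int is_rip_relative_jmp(const unsigned char *p)
{
    return p[0] == 0xFF && p[1] == 0x25;
}
```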
I am curious - how did you discover this issue?
I only noticed that the code was wrong when I was trying to make sure that the new jmp instructions would be properly handled during unwinding. I haven't actually seen any issue (we have also never generated this particular form of jmp before this PR).