Support batching GC transitions for P/Invokes #54107
Comments
Likewise, it would be beneficial if the JIT could recognize methods that only call other unmanaged code. This would allow a user to write something like the following and expect that, since it's only calling other unmanaged code (P/Invokes), no transitions occur inside the body, making it usable as a way to efficiently batch native calls together:

[UnmanagedCallersOnly]
private static void TransitionMethod()
{
PInvoke();
PInvoke();
PInvoke();
}
#54106 is related and requests a way to explicitly mark transition regions.
@tannergooding It would be helpful to see the general pattern for this in practice and how we might compute the fall-over point where we insert the poll. Batching has been discussed before. I'd like to see more than "helps a niche scenario" (i.e., Vulkan) and a concrete proposal as to how the JIT or the user determines how many calls to batch (too many, too few, or just right). It isn't obvious to me how this feature can be used safely and correctly while adhering to our GC SLA goals.
When would we make the transition then? How would it be determined? Does the JIT simply make a transition when it encounters a call to something not marked with the attribute?
I've given this some more thought. I'm struggling to see any use case here other than "writing a native macro function in pure C#". The goal here is for Vulkan-like scenarios where one wants to pass a low-overhead callback to native code and not pay the cost of owning native assets, which is very interesting. It seems to me that if this is the use case, then what is desired is being able to define such a method directly. I believe this would enable the specific use case without adding any potentially complicated batching logic, which I find dubious in value. If there are other scenarios here, I'd like to understand them.
Note that we already batch the most expensive part of the GC transition (saving of callee-saved registers, etc.). The part that is not batched is just a few instructions that flip the GC mode, which is very cheap, especially relative to the call that it wraps. When it comes to micro-optimizations of P/Invoke transitions like this, there are more generally applicable ones we can do: make InlinedPInvokeFrame cheaper, or reduce the number of indirections required to get to the preemptive flag.
I do not think we would want to complicate the public surface for this sort of optimization. This would make sense to do as a JIT optimization: make the JIT recognize certain patterns where the transitions can be batched or omitted.
@AaronRobinsonMSFT, I think I didn't explain well. The general premise is that there are many scenarios, such as UI, games, and GPU compute, where interop-heavy code pays these transition costs over and over.
Because of this, I think it would be good if the JIT could be smarter here (either implicitly or via some explicit opt-in) such that the general cost here can be reduced.
The simplest comparison is likely a method that just does:

ulong start;
QueryPerformanceCounter(&start);
ulong end;
QueryPerformanceCounter(&end);
return end - start;

Such a method (assuming it's static) doesn't ever interact with the GC in a meaningful way (that is, it never touches a GC-tracked type, it doesn't even use byrefs, and it never allocates anything). However, we get 245 bytes of codegen, spilling many registers and performing multiple memory reads/writes, all of which add up:

L0000: push rbp
L0001: push r15
L0003: push r14
L0005: push r13
L0007: push r12
L0009: push rdi
L000a: push rsi
L000b: push rbx
L000c: sub rsp, 0x78
L0010: lea rbp, [rsp+0xb0]
L0018: xor ecx, ecx
L001a: mov [rbp-0x40], rcx
L001e: mov [rbp-0x48], rcx
L0022: lea rcx, [rbp-0x88]
L0029: mov rdx, r10
L002c: call 0x00007ffaa5ca1660
L0031: mov rsi, rax
L0034: mov rcx, rsp
L0037: mov [rbp-0x68], rcx
L003b: mov rcx, rbp
L003e: mov [rbp-0x58], rcx
L0042: lea rcx, [rbp-0x40]
L0046: mov rax, 0x7ffa4ce4ccb0
L0050: mov [rbp-0x78], rax
L0054: lea rax, [L0074]
L005b: mov [rbp-0x60], rax
L005f: lea rax, [rbp-0x88]
L0066: mov [rsi+0x10], rax
L006a: mov byte ptr [rsi+0xc], 0
L006e: call qword ptr [0x7ffa4ce4cd08]
L0074: mov byte ptr [rsi+0xc], 1
L0078: cmp dword ptr [0x7ffaa606c07c], 0
L007f: je short L0087
L0081: call qword ptr [0x7ffaa605e368]
L0087: mov rcx, [rbp-0x80]
L008b: mov [rsi+0x10], rcx
L008f: lea rcx, [rbp-0x48]
L0093: mov rax, 0x7ffa4ce4ccb0
L009d: mov [rbp-0x78], rax
L00a1: lea rax, [L00c1]
L00a8: mov [rbp-0x60], rax
L00ac: lea rax, [rbp-0x88]
L00b3: mov [rsi+0x10], rax
L00b7: mov byte ptr [rsi+0xc], 0
L00bb: call qword ptr [0x7ffa4ce4cd08]
L00c1: mov byte ptr [rsi+0xc], 1
L00c5: cmp dword ptr [0x7ffaa606c07c], 0
L00cc: je short L00d4
L00ce: call qword ptr [0x7ffaa605e368]
L00d4: mov rax, [rbp-0x80]
L00d8: mov [rsi+0x10], rax
L00dc: mov rax, [rbp-0x48]
L00e0: sub rax, [rbp-0x40]
L00e4: lea rsp, [rbp-0x38]
L00e8: pop rbx
L00e9: pop rsi
L00ea: pop rdi
L00eb: pop r12
L00ed: pop r13
L00ef: pop r14
L00f1: pop r15
L00f3: pop rbp
L00f4: ret

Compare this with the MSVC codegen for the same method, which is roughly 1/5th the size at ~42 bytes:

sub rsp, 40
lea rcx, QWORD PTR start$[rsp]
call QWORD PTR __imp_QueryPerformanceCounter
lea rcx, QWORD PTR end$[rsp]
call QWORD PTR __imp_QueryPerformanceCounter
mov rax, QWORD PTR end$[rsp]
sub rax, QWORD PTR start$[rsp]
add rsp, 40
ret 0

I definitely understand this can't ever be a 1-to-1 comparison, and there are limitations on what the JIT can do here given we have a GC, need to transition, and that being in pre-emptive mode for too long can have negative side effects as well. However, I also think this is something we could be better at, and that we could improve perf and throughput for these types of scenarios in a meaningful and measurable way. Likewise, it's worth calling out that while SuppressGCTransition exists, it isn't applicable to every scenario.
Don't think we'll get to this for 7.0 (if ever) so moving to future.
Was talking with @AaronRobinsonMSFT and @jkoritzinsky about this a bit more in relation to some questions I had on P/Invokes for #82132. Ended up reiterating here that the exact mechanic through which this "batching" is supported isn't of particular interest to me, but rather that the general scenario be supported in some fashion. To that end, allowing the user to write something like the following would also satisfy my interest here:

GC.TransitionToNative();
{
SgctPinvoke1();
SgctPinvoke2();
SgctPinvoke3();
}
GC.TransitionToManaged();

This would of course come with special rules about what you could/could not do within the confines of the transitioned area, but would achieve the same general effect with less JIT/VM work required.
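To make those "special rules" concrete, here is a usage sketch of the proposed API (the restrictions shown and the `NativeWork` P/Invoke are illustrative assumptions, not a spec):

```csharp
// Sketch only: TransitionToNative/TransitionToManaged are the APIs proposed
// in this thread and do not exist today; NativeWork is a hypothetical P/Invoke.
GC.TransitionToNative();            // thread switches to preemptive mode

NativeWork(nativeBuffer, length);   // OK: unmanaged calls over blittable data

// object o = _someField;           // NOT OK (presumably): reading a GC-tracked
//                                  // reference while preemptive races with the GC

GC.TransitionToManaged();           // thread switches back to cooperative mode
```

The value is that the preemptive/cooperative flip happens once per region instead of once per call.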
@jkotas, any thoughts on an API such as the following to achieve this:

namespace System;
public static partial class GC
{
public static void TransitionToNative(); // Switches the thread to preemptive mode
public static void TransitionToManaged(); // Switches the thread to cooperative mode
}

There are of course a few variations on the general shape that could be done. It would ultimately be similar to what we have internally in the JIT for the transitions it already inserts around P/Invokes. This then allows the original premise of avoiding repeated transitions in interop-heavy code, such as might appear for certain types of UI, games, GPU compute, or other scenarios. I imagine the biggest concern is around whether or not power users can use such APIs safely/correctly, given the special rules/considerations required and that using these is similar to writing a QCALL/FCALL helper internally.
I am not convinced about this. The JIT IR would need to know about these transitions so that optimizations do not accidentally move operations on GC tracked references inside the block.
Right. FCalls and manually managed code are a bug farm and these bugs are very hard to diagnose. I would strongly prefer that the JIT does the analysis on whether it is safe to perform this optimization.
Thanks for the insight! I think for the JIT to do it implicitly, it would effectively require us to model the two transitions in IR (likely as "side-effecting intrinsics") and then look for neighboring transitions that can be merged. This could potentially be made more "pay for play" by only modeling the transitions in IR for methods which have multiple such calls. This is also notably similar to how the JIT could handle such public APIs (mark them as side-effecting to prevent operation movement and, when present, do a pass to validate the user didn't use any gcrefs), but I understand the concerns around the potential bug farm it could end up being.
I'm not hugely familiar with RyuJIT IR, but my first thought upon reading Tanner's idea is that, conceptually, it just sounds like a variation on memory barriers. Would it be much more complicated to model than those?
Memory barriers block reordering of reads/writes but don't have any special rules beyond that. This is more like modeling state transitions, where what you are allowed to do differs per state. Doing the "wrong thing" after transitioning into "preemptive" mode can be very bad for the JIT/GC and can lead to all kinds of unexpected crashes/failures (much like misusing unsafe code).
Also, memory barriers block reordering of reads/writes in globally visible data only. They do not necessarily block reordering of reads/writes of non-escaping data. (https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md#thread-local-memory-accesses)
@jkotas That is a fair criticism that I'd not considered. It would be an interesting exercise to see the cost of having the JIT do all the work and make the decision, relative to the user providing an intrinsic that acts as an indicator the JIT can use to limit optimizations. I'm not convinced the JIT can make these decisions without some user gesture; that is my primary reason for advocating a model similar to what @tannergooding is proposing.
I think that @tannergooding's description of this optimization in #54107 (comment) is fairly accurate, and it looks pretty doable to me based on that description. It does not look like a very expensive optimization to me. To reiterate the estimated benefit: the motivating code example from #54107 (comment) would go from 63 instructions to 58 instructions (the wall-clock time improvement would be less). It would still be a far cry from the 9 instructions that the equivalent C++ code compiles to.
Wouldn't it be a bit better than that? That is, assuming we don't need to create a secondary P/Invoke frame, since we wouldn't be transitioning to cooperative and then back into pre-emptive, we currently get (and should be able to remove) something like this:

; Final local variable assignments
;
; V00 loc0 [V00 ] ( 2, 2 ) long -> [rbp-40H] do-not-enreg[X] addr-exposed ld-addr-op
; V01 loc1 [V01 ] ( 1, 1 ) long -> [rbp-48H] do-not-enreg[X] addr-exposed ld-addr-op
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T01] ( 2, 4 ) long -> rax "Single return block return value"
; V04 FramesRoot [V04,T00] ( 10, 10 ) long -> rsi "Pinvoke FrameListRoot"
; V05 PInvokeFrame [V05 ] ( 12, 12 ) blk (72) [rbp-90H] do-not-enreg[X] addr-exposed "Pinvoke FrameVar"
;
; Lcl frame size = 120
; Method Program:Delta():ulong
G_M000_IG01: ;; offset=0000H
55 push rbp
4157 push r15
4156 push r14
4155 push r13
4154 push r12
57 push rdi
56 push rsi
53 push rbx
4883EC78 sub rsp, 120
488DAC24B0000000 lea rbp, [rsp+B0H]
G_M000_IG02: ;; offset=0018H
488D8D78FFFFFF lea rcx, [rbp-88H]
498BD2 mov rdx, r10
E8A939A65F call CORINFO_HELP_INIT_PINVOKE_FRAME
488BF0 mov rsi, rax
488BCC mov rcx, rsp
48894D98 mov qword ptr [rbp-68H], rcx
488BCD mov rcx, rbp
48894DA8 mov qword ptr [rbp-58H], rcx
488D4DC0 lea rcx, [rbp-40H]
48B8009DE65FFA7F0000 mov rax, 0x7FFA5FE69D00
48894588 mov qword ptr [rbp-78H], rax
488D0519000000 lea rax, G_M000_IG04
488945A0 mov qword ptr [rbp-60H], rax
488D8578FFFFFF lea rax, bword ptr [rbp-88H]
48894610 mov qword ptr [rsi+10H], rax
C6460C00 mov byte ptr [rsi+0CH], 0
G_M000_IG03: ;; offset=0064H
FF1536632300 call [Program:QueryPerformanceCounter(ulong):int]
-G_M000_IG04: ;; offset=006AH
- C6460C01 mov byte ptr [rsi+0CH], 1
- 833D8F9EE25F00 cmp dword ptr [(reloc 0x7ffabfa5d8a4)], 0
- 7406 je SHORT G_M000_IG05
- FF151BA9E15F call [CORINFO_HELP_STOP_FOR_GC]
G_M000_IG05: ;; offset=007DH
- 488B4D80 mov rcx, bword ptr [rbp-80H]
- 48894E10 mov qword ptr [rsi+10H], rcx
488D4DB8 lea rcx, [rbp-48H]
- 48B8009DE65FFA7F0000 mov rax, 0x7FFA5FE69D00
- 48894588 mov qword ptr [rbp-78H], rax
- 488D0519000000 lea rax, G_M000_IG07
- 488945A0 mov qword ptr [rbp-60H], rax
- 488D8578FFFFFF lea rax, bword ptr [rbp-88H]
- 48894610 mov qword ptr [rsi+10H], rax
- C6460C00 mov byte ptr [rsi+0CH], 0
G_M000_IG06: ;; offset=00B1H
FF15E9622300 call [Program:QueryPerformanceCounter(ulong):int]
G_M000_IG07: ;; offset=00B7H
C6460C01 mov byte ptr [rsi+0CH], 1
833D429EE25F00 cmp dword ptr [(reloc 0x7ffabfa5d8a4)], 0
7406 je SHORT G_M000_IG08
FF15CEA8E15F call [CORINFO_HELP_STOP_FOR_GC]
G_M000_IG08: ;; offset=00CAH
488B4580 mov rax, bword ptr [rbp-80H]
48894610 mov qword ptr [rsi+10H], rax
488B45B8 mov rax, qword ptr [rbp-48H]
482B45C0 sub rax, qword ptr [rbp-40H]
G_M000_IG09: ;; offset=00DAH
4883C478 add rsp, 120
5B pop rbx
5E pop rsi
5F pop rdi
415C pop r12
415D pop r13
415E pop r14
415F pop r15
5D pop rbp
C3 ret
; Total bytes of code: 235

That would end up removing some branches and multiple dependent memory accesses, which take at least 4-12 cycles each, basically allowing us to remove the additional epilogue after the first P/Invoke.
You are right that it would be more with the current scheme. I have been thinking about the ideal P/Invoke transition sequence that would be smaller (maybe not as small as 5 instructions, but close).
This keeps track of the P/Invoke that we are in. It is necessary to produce accurate stack traces; we would end up producing inaccurate stack traces if we removed it.
Gotcha. Would definitely be nice to get it smaller in general. The overhead isn't terrible or anything, but it does add up and can impact the general frame to frame latency for things like games from what I've seen.
Is that an issue for release code? We notably don't seem to emit them for SuppressGCTransition methods. In general, removing them for release code only could be an option.
SuppressGCTransition methods have issues with diagnostics in general. They cannot be debugged, stack traces are less reliable, etc.
It is a take-back. It would have to be announced as a diagnostic breaking change. It is fairly common for code to spend a non-trivial amount of time in unmanaged code. The diagnostic tools that work with stack traces would stop seeing the name of the method that is P/Invoked in these cases, and diagnosing them would be harder.
As someone who might benefit from this in my code, I think it could be useful to have a language construct for this, e.g. a block statement that marks a transitioned region.

Inside the block, the compiler could enforce the restrictions on what code is allowed. This would help me get the most out of this, as otherwise an area of code that is highly optimized for this (assuming the JIT infers it automagically) could end up becoming de-optimized by some later maintenance work, and there would be no indication that performance had regressed or why.
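Purely as an illustration of the idea (this syntax is hypothetical and not proposed anywhere in this thread), such a construct might look like:

```csharp
// Hypothetical syntax, for illustration only -- not a real C# feature.
// The SgctPinvoke* methods are the P/Invokes from the example earlier in the thread.
unmanaged
{
    // Compiler-enforced: only unmanaged calls and blittable locals allowed here;
    // touching a GC-tracked reference would be a compile-time error.
    SgctPinvoke1();
    SgctPinvoke2();
    SgctPinvoke3();
}
```

A scoped construct like this would make the batched region visible in source, so a later edit that breaks the invariant fails to compile rather than silently de-optimizing.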
Summary
Today, when performing a normal P/Invoke the JIT inserts a GC-transition before and after the call.
This means that if you have logic such as:
You will end up actually getting something like:
This can be undesirable for interop-heavy code as it means many transitions are occurring. Ideally we could instead recognize this pattern and simplify it to something more like:
This would increase the performance of interop-heavy code, but would leave the code in question in pre-emptive mode for longer, which may also have negative side effects and so may need additional consideration.
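A sketch of the before/after shape being described (the `GCTransitionOut`/`GCTransitionIn` markers are illustrative pseudocode for the JIT-inserted transitions, not real APIs):

```csharp
// Today: each P/Invoke gets its own pair of GC transitions.
GCTransitionOut(); PInvoke1(); GCTransitionIn();
GCTransitionOut(); PInvoke2(); GCTransitionIn();
GCTransitionOut(); PInvoke3(); GCTransitionIn();

// Batched: one transition pair around the whole sequence.
GCTransitionOut();
PInvoke1();
PInvoke2();
PInvoke3();
GCTransitionIn();
```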
category:cq
theme:pinvoke
skill-level:expert
cost:medium
impact:medium