vzeroupper fails to be emitted in some P/Invoke scenarios #82132

tannergooding · 2023-02-14T21:58:13Z

Consider the following code:

public Vector256<int> M(Vector256<int> x, long* lpFrequency)
{
    x += x;
    QueryPerformanceFrequency(lpFrequency);
    return x;
}

public Vector256<int> M2(Vector256<int> x, long* lpFrequency)
{
    x += x;
    N(lpFrequency);
    return x;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int N(long* lpFrequency)
{
    return QueryPerformanceFrequency(lpFrequency);
}

[DllImport("kernel32", ExactSpelling = true)]
[SuppressGCTransition]
public static extern int QueryPerformanceFrequency(long* lpFrequency);

For M, because the QPC is direct and because Vector256<int> is directly used in the function we get a vzeroupper directly before the P/Invoke is called:

; Method C:M(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2430           lea      rbp, [rsp+30H]
       48895518             mov      bword ptr [rbp+18H], rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=0016H
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       48897520             mov      bword ptr [rbp+20H], rsi
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG03:                ;; offset=0033H
       C5F877               vzeroupper 
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       488B7520             mov      rsi, bword ptr [rbp+20H]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       488B7D18             mov      rdi, bword ptr [rbp+18H]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       833DB59EE15F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       750E                 jne      SHORT G_M000_IG06

G_M000_IG04:                ;; offset=0051H
       488BC7               mov      rax, rdi

G_M000_IG05:                ;; offset=0054H
       C5F877               vzeroupper 
       4883C420             add      rsp, 32
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=005FH
       E85C79AA5F           call     CORINFO_HELP_POLL_GC
       EBEB                 jmp      SHORT G_M000_IG04
; Total bytes of code: 102

However, for M2 because the call to the P/Invoke is hidden behind the non-inlined method N and because N does not itself utilize any instructions requiring YMM, we miss out on this and do not emit an additional vzeroupper:

; Method C:M2(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       C5F877               vzeroupper 
       488BFA               mov      rdi, rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=000FH
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       FF15FCF92300         call     [C:N(ulong):int]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       488BC7               mov      rax, rdi

G_M000_IG03:                ;; offset=002FH
       C5F877               vzeroupper 
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
; Total bytes of code: 57

; Method C:N(ulong):int
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488D6C2430           lea      rbp, [rsp+30H]
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000EH
       833DEF9EE45F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       7517                 jne      SHORT G_M000_IG06

G_M000_IG03:                ;; offset=0017H
       488BCE               mov      rcx, rsi
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG04:                ;; offset=0024H
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       90                   nop      

G_M000_IG05:                ;; offset=0027H
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=002EH
       E88D79AD5F           call     CORINFO_HELP_POLL_GC
       EBE2                 jmp      SHORT G_M000_IG03
; Total bytes of code: 53

If the P/Invoke being called uses any legacy encoded SIMD instructions, then there is a significant (up to 10x perf regression per SIMD instruction) penalty that is incurred. Native code that utilizes SIMD instructions typically ends up with the "legacy encoding" since AVX is not part of the baseline instruction set.

The text was updated successfully, but these errors were encountered:

ghost · 2023-02-14T21:58:21Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Consider the following code:

public Vector256<int> M(Vector256<int> x, long* lpFrequency)
{
    x += x;
    QueryPerformanceFrequency(lpFrequency);
    return x;
}

public Vector256<int> M2(Vector256<int> x, long* lpFrequency)
{
    x += x;
    N(lpFrequency);
    return x;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int N(long* lpFrequency)
{
    return QueryPerformanceFrequency(lpFrequency);
}

[DllImport("kernel32", ExactSpelling = true)]
[SuppressGCTransition]
public static extern int QueryPerformanceFrequency(long* lpFrequency);

For M, because the QPC is direct and because Vector256<int> is directly used in the function we get a vzeroupper directly before the P/Invoke is called:

; Method C:M(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2430           lea      rbp, [rsp+30H]
       48895518             mov      bword ptr [rbp+18H], rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=0016H
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       48897520             mov      bword ptr [rbp+20H], rsi
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG03:                ;; offset=0033H
       C5F877               vzeroupper 
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       488B7520             mov      rsi, bword ptr [rbp+20H]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       488B7D18             mov      rdi, bword ptr [rbp+18H]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       833DB59EE15F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       750E                 jne      SHORT G_M000_IG06

G_M000_IG04:                ;; offset=0051H
       488BC7               mov      rax, rdi

G_M000_IG05:                ;; offset=0054H
       C5F877               vzeroupper 
       4883C420             add      rsp, 32
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=005FH
       E85C79AA5F           call     CORINFO_HELP_POLL_GC
       EBEB                 jmp      SHORT G_M000_IG04
; Total bytes of code: 102

However, for M2 because the call to the P/Invoke is hidden behind the non-inlined method N and because N does not itself utilize any instructions requiring YMM, we miss out on this and do not emit an additional vzeroupper:

; Method C:M2(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       C5F877               vzeroupper 
       488BFA               mov      rdi, rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=000FH
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       FF15FCF92300         call     [C:N(ulong):int]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       488BC7               mov      rax, rdi

G_M000_IG03:                ;; offset=002FH
       C5F877               vzeroupper 
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
; Total bytes of code: 57

; Method C:N(ulong):int
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488D6C2430           lea      rbp, [rsp+30H]
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000EH
       833DEF9EE45F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       7517                 jne      SHORT G_M000_IG06

G_M000_IG03:                ;; offset=0017H
       488BCE               mov      rcx, rsi
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG04:                ;; offset=0024H
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       90                   nop      

G_M000_IG05:                ;; offset=0027H
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=002EH
       E88D79AD5F           call     CORINFO_HELP_POLL_GC
       EBE2                 jmp      SHORT G_M000_IG03
; Total bytes of code: 53

If the P/Invoke being called uses any legacy encoded SIMD instructions, then there is a significant (up to 10x perf regression per SIMD instruction) penalty that is incurred. Native code that utilizes SIMD instructions typically ends up with the "legacy encoding" since AVX is not part of the baseline instruction set.

Author:	tannergooding
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

tannergooding · 2023-02-14T22:03:04Z

CC. @AaronRobinsonMSFT, @jkoritzinsky, @anthonycanino

The simplest fix here is likely to emit vzeroupper before any P/Invoke as that is in general a "best practice".

There exists an inverse issue in that we emit vzeroupper too frequently in some other scenarios today: #11496

In general we don't need vzeroupper for managed to managed calls. We only need it before calling into some method which might be AVX unaware which includes P/Invokes and some crossgen/ready-to-run (pre-jit) methods.

For the case where SuppressGCTransition is not involved, then the call won't be directly to the P/Invoke but will instead be to a pre-code stub that resolves the native function address and dispatches to it. We should also ensure that the vzeroupper is included as part of that and not unnecessarily duplicated outside of it.

JulieLeeMSFT · 2023-03-06T18:55:26Z

@tannergooding, is this correctness issue or code quality issue? If it is code quality issue, I will set it to Future.

JulieLeeMSFT · 2023-03-20T16:57:47Z

@tannergooding Ping.

@tannergooding, is this correctness issue or code quality issue? If it is code quality issue, I will set it to Future.

tannergooding · 2023-03-20T17:35:31Z

@JulieLeeMSFT This is a code quality issue but one that has a significant impact on perf. The impact on some hardware can be that the code in question runs 10x slower or more.

JulieLeeMSFT · 2023-03-20T19:40:29Z

@tannergooding, thanks for clarifying it. Setting it to .net 8 for now and assigning it to you. You can push it out if you don’t have time to get to work on it

tannergooding · 2023-08-08T16:37:50Z

This is a longstanding perf issue, but not a regression nor a correctness issue. Moving to .NET 9

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 14, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Feb 14, 2023

tannergooding mentioned this issue Feb 14, 2023

Support batching GC transitions for P/Invokes #54107

Open

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 20, 2023

JulieLeeMSFT added this to the 8.0.0 milestone Mar 20, 2023

JulieLeeMSFT assigned tannergooding Mar 20, 2023

tannergooding modified the milestones: 8.0.0, 9.0.0 Aug 8, 2023

kunalspathak mentioned this issue Dec 15, 2023

Performance of Trigonometric math function have unbelievable loss at .NET8 #95954

Closed

This was referenced Feb 8, 2024

SqlServer.Types is 20% slower on .NET 8 #96211

Closed

Update where and when vzeroupper is emitted #98261

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2024

tannergooding closed this as completed in #98261 Feb 13, 2024

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 13, 2024

github-actions bot locked and limited conversation to collaborators Mar 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vzeroupper fails to be emitted in some P/Invoke scenarios #82132

vzeroupper fails to be emitted in some P/Invoke scenarios #82132

tannergooding commented Feb 14, 2023

ghost commented Feb 14, 2023

tannergooding commented Feb 14, 2023

JulieLeeMSFT commented Mar 6, 2023

JulieLeeMSFT commented Mar 20, 2023

tannergooding commented Mar 20, 2023

JulieLeeMSFT commented Mar 20, 2023

tannergooding commented Aug 8, 2023

vzeroupper fails to be emitted in some P/Invoke scenarios #82132

vzeroupper fails to be emitted in some P/Invoke scenarios #82132

Comments

tannergooding commented Feb 14, 2023

ghost commented Feb 14, 2023

tannergooding commented Feb 14, 2023

JulieLeeMSFT commented Mar 6, 2023

JulieLeeMSFT commented Mar 20, 2023

tannergooding commented Mar 20, 2023

JulieLeeMSFT commented Mar 20, 2023

tannergooding commented Aug 8, 2023