Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Faster Math.Max/Min for x64 #65625

Closed
EgorBo opened this issue Feb 19, 2022 · 11 comments · Fixed by #69434
Closed

JIT: Faster Math.Max/Min for x64 #65625

EgorBo opened this issue Feb 19, 2022 · 11 comments · Fixed by #69434
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Feb 19, 2022

Optimize Math.Max/Min to a single instruction on x64 when one of the arguments is a constant (not NaN and whoever will be implementing it has to be careful around -/+0.0)

float Test(float a) => Math.Max(a, 10);

Currently emits:

; Method Tests4:Test(float):float:this
G_M55200_IG01:
       vzeroupper 
G_M55200_IG02:
       vmovss   xmm0, dword ptr [reloc @RWD00]
       vucomiss xmm1, xmm0
       jp       SHORT G_M55200_IG03
       je       SHORT G_M55200_IG06
G_M55200_IG03:
       vucomiss xmm1, xmm1
       jp       SHORT G_M55200_IG05
       vucomiss xmm1, xmm0
       ja       SHORT G_M55200_IG04
       jmp      SHORT G_M55200_IG08
G_M55200_IG04:
       vmovaps  xmm0, xmm1
       jmp      SHORT G_M55200_IG08
G_M55200_IG05:
       vmovaps  xmm0, xmm1
       jmp      SHORT G_M55200_IG08
G_M55200_IG06:
       vmovaps  xmm1, xmm0
       vmovd    eax, xmm1
       test     eax, eax
       jl       SHORT G_M55200_IG07
       jmp      SHORT G_M55200_IG08
G_M55200_IG07:
       vmovss   xmm0, dword ptr [reloc @RWD00]
G_M55200_IG08:
       ret      
RWD00  	dd	41200000h		;        10
; Total bytes of code: 68

Expected codegen:

; Method Tests4:Test(float):float:this
       vzeroupper 
       vmaxss   xmm0, xmm1, dword ptr [reloc @RWD00]
       ret      
RWD00  	dd	41200000h ; 10.0
; Total bytes of code: 12

#65584 did it for ARM where we could do it even for both non-constants

@EgorBo EgorBo added the tenet-performance Performance related issue label Feb 19, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 19, 2022
@ghost
Copy link

ghost commented Feb 19, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Optimize Math.Max/Min to a single instruction on x64 when one of the arguments is a constant (not NaN and has to be careful around -/+0.0)

float Test(float a) => Math.Max(a, 10);

Currently emits:

; Method Tests4:Test(float):float:this
G_M55200_IG01:
       vzeroupper 
G_M55200_IG02:
       vmovss   xmm0, dword ptr [reloc @RWD00]
       vucomiss xmm1, xmm0
       jp       SHORT G_M55200_IG03
       je       SHORT G_M55200_IG06
G_M55200_IG03:
       vucomiss xmm1, xmm1
       jp       SHORT G_M55200_IG05
       vucomiss xmm1, xmm0
       ja       SHORT G_M55200_IG04
       jmp      SHORT G_M55200_IG08
G_M55200_IG04:
       vmovaps  xmm0, xmm1
       jmp      SHORT G_M55200_IG08
G_M55200_IG05:
       vmovaps  xmm0, xmm1
       jmp      SHORT G_M55200_IG08
G_M55200_IG06:
       vmovaps  xmm1, xmm0
       vmovd    eax, xmm1
       test     eax, eax
       jl       SHORT G_M55200_IG07
       jmp      SHORT G_M55200_IG08
G_M55200_IG07:
       vmovss   xmm0, dword ptr [reloc @RWD00]
G_M55200_IG08:
       ret      
RWD00  	dd	41200000h		;        10
; Total bytes of code: 68

Expected codegen:

; Method Tests4:Test(float):float:this
       vzeroupper 
       vmaxss   xmm0, xmm1, dword ptr [reloc @RWD00]
       ret      
RWD00  	dd	41200000h ; 10.0
; Total bytes of code: 12

#65584 did it for ARM where we could do it even for both non-constants

Author: EgorBo
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@EgorBo EgorBo added good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors labels Feb 19, 2022
@EgorBo EgorBo added this to the Future milestone Feb 19, 2022
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Feb 19, 2022
@tannergooding
Copy link
Member

tannergooding commented Feb 19, 2022

In particular, Math.Max and Math.Min for floating-point have a requirement, from the IEEE 754 spec, that -0.0 be the minimum of it and +0.0. Likewise, they have a requirement that NaN is propagated (if both inputs are NaN, then the exact NaN returned doesn't matter).

x86/x64 provide maxss and minss. These instructions will return the second operand if both operands are 0 (same or differing signs). Likewise, if either input is NaN, the second operand is returned. Otherwise, the correct max/min is returned.

What this functionally means is that if both inputs are unknown, we can't really "optimize" and instead have to use maxss/minss and then some follow up computations. However, if either input is constant then we actually can optimize it down to a single instruction in all cases:

  • In the case both are constant, we simply constant fold and it doesn't matter.
  • If either input is known to be NaN, it is used as the second argument (ensuring NaN is propagated since either input being NaN means the second argument is returned)
  • For maxss, if either input is +0.0 it takes the second argument (ensuring +0.0 is returned if both inputs are zero since both inputs being zero, of either sign, means the second argument is returned)
  • For minss, if either input is -0.0 it takes the second argument (ensuring -0.0 is returned if both inputs are zero since both inputs being zero, of either sign, means the second argument is returned)
  • Otherwise, for any other known constant input it is the first argument
    • If the unknown is NaN, this ensures it is returned since either input being NaN means the second argument is returned
    • If the unknown is +0.0 for maxss and the known was -0.0, it would ensure +0.0 is returned since both inputs being zero, of either sign, means the second argument is returned
    • If the unknown is -0.0 for minss and the known was +0.0, it would ensure -0.0 is returned since both inputs being zero, of either sign, means the second argument is returned
    • For any other inputs known or unknown; the default handling is already correct

This ensures that Max returns NaN and +0.0 and likewise that Min returns NaN and -0.0 for their respective special cases.

@SkiFoD
Copy link
Contributor

SkiFoD commented Feb 21, 2022

Hey, everyone. I would like to take a look at the issue. @EgorBo Could you please tell me how you made "Math.Max" call to get inlined?

@EgorBo
Copy link
Member Author

EgorBo commented Feb 21, 2022

Hey, everyone. I would like to take a look at the issue. @EgorBo Could you please tell me how you made "Math.Max" call to get inlined?

Hey, sure! What do you mean inlined? Intrinsic?

@SkiFoD
Copy link
Contributor

SkiFoD commented Feb 21, 2022

I'm trying to get the same emit, but I always get something like this:
image
So I can't look inside the Math.Max call to invistigate it.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 21, 2022

@SkiFoD see #65584, it ifdefed special import for Max for x86/64, you need to enable it, most likely you will also have to modify lsraxarch and the xarch emitter

@SkiFoD
Copy link
Contributor

SkiFoD commented Feb 21, 2022

@EgorBo Thank you, so basically you are changing the call to instrinsic use.

@SkiFoD
Copy link
Contributor

SkiFoD commented Feb 21, 2022

@EgorBo Seems like I figured out how to make the changes, but I got a few questions.
Is it all right that the call transforms to the intrinsic use only under a pressure (like if the Math.Max/Min is called in a huge cycle) or if there are envs "COMPlus_TieredCompilation=0" and "COMPlus_JITMinOpts=0" set up?
p.s. For some reason if I used the envs along with "COMPlus_JitDump/JitDisasm" then there are not output in the console :( Have you ever come across with such a behaviour?

@EgorBo
Copy link
Member Author

EgorBo commented Feb 21, 2022

@EgorBo Seems like I figured out how to make the changes, but I got a few questions. Is it all right that the call transforms to the intrinsic use only under a pressure (like if the Math.Max/Min is called in a huge cycle) or if there are envs "COMPlus_TieredCompilation=0" and "COMPlus_JITMinOpts=0" set up? p.s. For some reason if I used the envs along with "COMPlus_JitDump/JitDisasm" then there are not output in the console :( Have you ever come across with such a behaviour?

Yes, we don't expand intrinsics in tier0 (unoptimized code) - only a few "must expand" ones. So I'd suggest to disable tiered compilation during development

SkiFoD added a commit to SkiFoD/runtime that referenced this issue Feb 22, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Mar 1, 2022
@SkiFoD
Copy link
Contributor

SkiFoD commented Mar 15, 2022

@EgorBo please take a look at my PR, I would like to know what you think of this.

@danmoseley
Copy link
Member

linking #65700

@danmoseley danmoseley removed good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors labels Apr 10, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue May 19, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue May 19, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 2, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 3, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 3, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 3, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 20, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 20, 2022
SkiFoD added a commit to SkiFoD/runtime that referenced this issue Jun 22, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 22, 2022
tannergooding pushed a commit that referenced this issue Jun 23, 2022
…#69434)

* Add xarch optimization for min/max (#65625)

* Changes according to the requirements (#65625)

* Draft for Math.Max/Math.Min optimization (#65625)

* Draft for optimizing Math.Max/Math.Min with a const (#65625)

* Fix tests (#65625)

* Refactoring of the conditions (#65625)

* Fix of the summary (#65625)

* Refactoring due to the PR comments (#65625)

* Add spilling side effect + Fix of formats (#65625)

* Update src/coreclr/jit/importer.cpp

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>

* Update src/coreclr/jit/importer.cpp

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 23, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants