Unsafe.BitCast/As does not remove stack spilling for custom SIMD type #86448

Closed
xoofx opened this issue May 18, 2023 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

Comments

@xoofx
Member

xoofx commented May 18, 2023

Hello JIT compiler friends, 😊

Description

Some legacy vector math APIs provide their own Vector3/Vector4 types and would like to transition to a SIMD-optimized implementation via System.Numerics.Vector3/4. I have tried using Unsafe.As, but it generates stack spills, so I was hoping the new .NET 8 Unsafe.BitCast (#82917) would help here, but it seems it does not. It might require deeper tweaking in the JIT to recognize that a type is essentially an alias of a SIMD type.

With the latest .NET 8 preview (8.0.100-preview.4.23260.5), the following code:

using System.Numerics;
using System.Runtime.CompilerServices;

namespace BenchCSharpApp;

public class TestCustomVector4
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static float Calculator(MyVector4 v1, MyVector4 v2)
    {
        var a = Add(v1, v2);
        var b = Multiply(v1, v2);
        var c = Subtract(v1, v2);
        var d = Subtract(Add(a, b), c);

        return Length(d);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static float Length(MyVector4 v1)
    {
        return Unsafe.BitCast<MyVector4, Vector4>(v1).Length();
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Multiply(MyVector4 v1, MyVector4 v2)
    {
        return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) * Unsafe.BitCast<MyVector4, Vector4>(v2));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Add(MyVector4 v1, MyVector4 v2)
    {
        return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) + Unsafe.BitCast<MyVector4, Vector4>(v2));
    }


    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Subtract(MyVector4 v1, MyVector4 v2)
    {
        return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) - Unsafe.BitCast<MyVector4, Vector4>(v2));
    }

    public struct MyVector4
    {
        public MyVector4(float x, float y, float z, float w)
        {
            X = x;
            Y = y;
            Z = z;
            W = w;
        }

        public float X;
        public float Y;
        public float Z;
        public float W;
    }
}

generates the following ASM:

; Assembly listing for method BenchCSharpApp.TestCustomVector4Bis:Calculator(BenchCSharpApp.TestCustomVector4Bis+MyVector4,BenchCSharpApp.TestCustomVector4Bis+MyVector4):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       sub      rsp, 104
       vzeroupper 
 
G_M000_IG02:                ;; offset=0007H
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+58H], xmm0
       vmovups  xmm0, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+48H], xmm0
       vmovups  xmm0, xmmword ptr [rsp+58H]
       vaddps   xmm0, xmm0, xmmword ptr [rsp+48H]
       vmovups  xmm1, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+38H], xmm1
       vmovups  xmm1, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+28H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+38H]
       vmulps   xmm1, xmm1, xmmword ptr [rsp+28H]
       vmovups  xmm2, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+18H], xmm2
       vmovups  xmm2, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+08H], xmm2
       vmovups  xmm2, xmmword ptr [rsp+18H]
       vsubps   xmm2, xmm2, xmmword ptr [rsp+08H]
       vaddps   xmm0, xmm0, xmm1
       vsubps   xmm0, xmm0, xmm2
       vdpps    xmm0, xmm0, xmm0, -1
       vsqrtss  xmm0, xmm0
 
G_M000_IG03:                ;; offset=0079H
       add      rsp, 104
       ret      
 
; Total bytes of code 126

while a version that does not perform any stack spilling could achieve the following:


G_M000_IG01:                ;; offset=0000H
       vzeroupper 
 
G_M000_IG02:                ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmm1, xmmword ptr [rdx]
       vaddps   xmm2, xmm0, xmm1
       vmulps   xmm3, xmm0, xmm1
       vaddps   xmm2, xmm2, xmm3
       vsubps   xmm0, xmm0, xmm1
       vsubps   xmm0, xmm2, xmm0
       vdpps    xmm0, xmm0, xmm0, -1
       vsqrtss  xmm0, xmm0
 
G_M000_IG03:                ;; offset=0029H
       ret      

I would not mind putting an attribute on such custom types, e.g. [AliasVector(typeof(Vector4))], if it could help the SIMD-only usage detection.
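
For illustration, such an opt-in marker could look roughly like the sketch below; AliasVectorAttribute is hypothetical and does not exist in the BCL, it is only meant to show the shape of the suggestion.

using System;
using System.Numerics;

// Hypothetical attribute, shown only to illustrate the suggestion above;
// nothing in the runtime recognizes it today.
[AttributeUsage(AttributeTargets.Struct)]
public sealed class AliasVectorAttribute : Attribute
{
    public AliasVectorAttribute(Type vectorType) => VectorType = vectorType;
    public Type VectorType { get; }
}

[AliasVector(typeof(Vector4))]
public struct MyVector4
{
    public float X, Y, Z, W;
}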

@xoofx xoofx added the tenet-performance Performance related issue label May 18, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 18, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 18, 2023
@ghost

ghost commented May 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@MihaZupan
Member

cc @MichalPetryka: would #85562 help here?

@hypeartist

Using Unsafe.As() together with the in modifier on method parameters seems to fix the issue:

using System.Numerics;
using System.Runtime.CompilerServices;

namespace BenchCSharpApp;

public class TestCustomVector4
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static float Calculator(MyVector4 v1, MyVector4 v2)
    {
        var a = Add(v1, v2);
        var b = Multiply(v1, v2);
        var c = Subtract(v1, v2);
        var d = Subtract(Add(a, b), c);

        return Length(d);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static float Length(in MyVector4 v1)
    {
        return Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)).Length();
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Multiply(in MyVector4 v1, in MyVector4 v2)
    {
        var v = Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)) * Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v2));
        return Unsafe.As<Vector4, MyVector4>(ref v);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Add(in MyVector4 v1, in MyVector4 v2)
    {
        var v = Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)) + Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v2));
        return Unsafe.As<Vector4, MyVector4>(ref v);
    }


    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static MyVector4 Subtract(in MyVector4 v1, in MyVector4 v2)
    {
        var v = Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)) - Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v2));
        return Unsafe.As<Vector4, MyVector4>(ref v);
    }

    public struct MyVector4
    {
        public MyVector4(float x, float y, float z, float w)
        {
            X = x;
            Y = y;
            Z = z;
            W = w;
        }

        public float X;
        public float Y;
        public float Z;
        public float W;
    }
}
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rcx]
    L0007: vmovupd xmm1, [rdx]
    L000b: vaddps xmm2, xmm0, xmm1
    L000f: vmulps xmm3, xmm0, xmm1
    L0013: vsubps xmm0, xmm0, xmm1
    L0017: vaddps xmm1, xmm2, xmm3
    L001b: vsubps xmm0, xmm1, xmm0
    L001f: vdpps xmm0, xmm0, xmm0, 0xf1
    L0025: vsqrtss xmm0, xmm0, xmm0
    L0029: ret

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgDkBXfGKASzFwG4BYAKDMq0ASgwB2GVsxoBhCPgAOrADYsAyiwBu7GFz59R2ZrnnYwMcgCEYosAAtpq29ijyAgvPk9efYgGZKAEzkACo6GNIMuBhyAGowYNFQKHwA3nzkGeQA2gCyMBi2EAAmAJIKSgAUeQXFZfJKAPLyEhCiuPQQJaJKrKK9AOYAlAC66Zl+gkjkAGZKENgY5NLYSmAMSgvQVQCecQnQKOQaVGjkObvxiYcaAYNjGWm8mc9HzuTY5AC85K5FRRXHU43QZeF6ZDRvYBfM7rCT1bYAk5HW6gsGvKDkMDQ1QMYAYKCmDCIoEo+4vCEYorY3H4wkVX7/bCnYCDU5gEF6J5o4gAdnIABlrP0ChUihyuRkAL6c565fKFUrlKry2rlJotNo0Vz9fqwXC4VgaGBdHp9URDUYS8jyNgQjDmahTWbzRaC80i3pnC77JJHKh3K2PNGUPkAVTa2GmMC1uAAPOc9ldTomDgA+CqwabkcO4SPR1y4IQwaaIwaDGhu4W2Cri57Sq1kuU1RX1ZXNuqNZqsVrtbW6nQGo0m3oDEZkm2GhYOqhTBOXA4wpRwpQIz1zn3XJFr71XZEB55BtEUo7QnN5mPxncHZPzpLpzPZiNRmNFkvHMvkABUj9zz4Ll5TJIbw3e9ix/c8C1fAFblrbkwyffM40A9AvWQ0Csw0WCpRlTImwVDs23wtUux7LUdT1QdjW6EdzTHK0JztadZyvX0GQqbdkL9U4ONva5bjJQ8wWPDRTwQi91yTcg0IzMCzz/Qti1LQZyAAanA+SAN44CrnQ9TEKg4EsJeXk9IvZDTgktMZIwoz6xwjI8NVVtqiI+p1W7TU+wow0qNNUdLWeBip0mVDePIHE8QJBJ2NEUKNy48gePi4EBLJZ5hNE39EM0jdtKsh85P0xT32UuBTP/SygKk3jdMKl9ipg1EwRMur/3MuKdOso5bPsyh/CiKAGASDqDlSq1ngmSqUAqZ0FnIBBTlmxZtkWuY5oAL1Wl1yAAd33NFBODcgAA1oQQJqjoATWhbYLuDAAtaF1rutEAHVoR2l7sO8cbxn8JaTq+vqZjWxZLqBiYAfuiH/tB8hXou+tJSAA=

@xoofx
Member Author

xoofx commented May 19, 2023

Yeah, I noticed that, but it comes with the constraint that we can't necessarily modify all user code and existing APIs. 😉

@jakobbotsch
Member

jakobbotsch commented May 19, 2023

Regular promotion does not promote any of these because no fields are accessed:

Not promoting promotable struct local V00: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V01: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V02: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V03: #fields = 4, fieldAccessed = 0.
  struct promotion of V04 is disabled because it has already been marked address exposed
Not promoting promotable struct local V05: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V06: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V07: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V10: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V11: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V15: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V16: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V20: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V21: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V25: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V26: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V30: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V31: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V34: #fields = 4, fieldAccessed = 0.

(edit: this makes sense since these are SIMDs. But we end up DNER'ing them due to LCL_FLD accesses, causing all the stack spilling.)

However, with physical promotion after latest heuristics adjustments:

; Assembly listing for method BenchCSharpApp.TestCustomVector4:Calculator(BenchCSharpApp.TestCustomVector4+MyVector4,BenchCSharpApp.TestCustomVector4+MyVector4):float
; Emitting BLENDED_CODE for X64 with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 16 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx         single-def
;  V01 arg1         [V01,T01] (  3,  6   )   byref  ->  rdx         single-def
;* V02 loc0         [V02,T05] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF]
;* V03 loc1         [V03,T06] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF]
;# V04 OutArgs      [V04    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1         [V05,T02] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "impAppendStmt"
;* V06 tmp2         [V06,T03] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "spilled call-like call argument"
;* V07 tmp3         [V07,T04] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "spilled call-like call argument"
;* V08 tmp4         [V08    ] (  0,  0   )  simd16  ->  zero-ref    "impAppendStmt"
;* V09 tmp5         [V09    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;* V10 tmp6         [V10    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V11 tmp7         [V11    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V12 tmp8         [V12,T09] (  0,  0   )  simd16  ->  zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V13 tmp9         [V13    ] (  0,  0   )  simd16  ->  zero-ref    "impAppendStmt"
;* V14 tmp10        [V14    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;* V15 tmp11        [V15    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V16 tmp12        [V16    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V17 tmp13        [V17,T10] (  0,  0   )  simd16  ->  zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V18 tmp14        [V18    ] (  0,  0   )  simd16  ->  zero-ref    "impAppendStmt"
;* V19 tmp15        [V19    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;* V20 tmp16        [V20    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V21 tmp17        [V21    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V22 tmp18        [V22,T11] (  0,  0   )  simd16  ->  zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V23 tmp19        [V23    ] (  0,  0   )  simd16  ->  zero-ref    "impAppendStmt"
;* V24 tmp20        [V24    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;* V25 tmp21        [V25    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V26 tmp22        [V26    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V27 tmp23        [V27,T12] (  0,  0   )  simd16  ->  zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V28 tmp24        [V28    ] (  0,  0   )  simd16  ->  zero-ref    "impAppendStmt"
;* V29 tmp25        [V29    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;* V30 tmp26        [V30    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V31 tmp27        [V31    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V32 tmp28        [V32,T13] (  0,  0   )  simd16  ->  zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;  V33 tmp29        [V33,T14] (  3,  3   )  simd16  ->  mm0         ld-addr-op single-def "Inline stloc first use temp"
;* V34 tmp30        [V34    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V35 tmp31        [V35    ] (  0,  0   )  simd16  ->  zero-ref    "V10.[000..016)"
;* V36 tmp32        [V36    ] (  0,  0   )  simd16  ->  zero-ref    "V11.[000..016)"
;* V37 tmp33        [V37    ] (  0,  0   )  simd16  ->  zero-ref    "V15.[000..016)"
;* V38 tmp34        [V38    ] (  0,  0   )  simd16  ->  zero-ref    "V16.[000..016)"
;* V39 tmp35        [V39    ] (  0,  0   )  simd16  ->  zero-ref    "V20.[000..016)"
;* V40 tmp36        [V40    ] (  0,  0   )  simd16  ->  zero-ref    "V21.[000..016)"
;* V41 tmp37        [V41    ] (  0,  0   )  simd16  ->  zero-ref    "V25.[000..016)"
;* V42 tmp38        [V42    ] (  0,  0   )  simd16  ->  zero-ref    "V26.[000..016)"
;* V43 tmp39        [V43    ] (  0,  0   )  simd16  ->  zero-ref    "V30.[000..016)"
;* V44 tmp40        [V44    ] (  0,  0   )  simd16  ->  zero-ref    "V31.[000..016)"
;* V45 tmp41        [V45    ] (  0,  0   )  simd16  ->  zero-ref    "V34.[000..016)"
;  V46 cse0         [V46,T15] (  2,  2   )  simd16  ->  mm2         "CSE - aggressive"
;  V47 cse1         [V47,T16] (  2,  2   )  simd16  ->  mm3         "CSE - aggressive"
;  V48 cse2         [V48,T17] (  2,  2   )  simd16  ->  mm1         "CSE - aggressive"
;  V49 cse3         [V49,T18] (  2,  2   )  simd16  ->  mm0         "CSE - aggressive"
;  V50 cse4         [V50,T19] (  2,  2   )  simd16  ->  mm0         "CSE - aggressive"
;  V51 cse5         [V51,T07] (  4,  4   )  simd16  ->  mm0         "CSE - aggressive"
;  V52 cse6         [V52,T08] (  4,  4   )  simd16  ->  mm1         "CSE - aggressive"
;
; Lcl frame size = 0

G_M42757_IG01:  ;; offset=0000H
       vzeroupper
                                                ;; size=3 bbWeight=1 PerfScore 1.00
G_M42757_IG02:  ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmm1, xmmword ptr [rdx]
       vaddps   xmm2, xmm0, xmm1
       vmulps   xmm3, xmm0, xmm1
       vsubps   xmm0, xmm0, xmm1
       vaddps   xmm1, xmm2, xmm3
       vsubps   xmm0, xmm1, xmm0
       vdpps    xmm0, xmm0, xmm0, -1
       vsqrtss  xmm0, xmm0, xmm0
                                                ;; size=38 bbWeight=1 PerfScore 46.00
G_M42757_IG03:  ;; offset=0029H
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 42, prolog size 3, PerfScore 52.20, instruction count 11, allocated bytes for code 42 (MethodHash=8ac158fa) for method BenchCSharpApp.TestCustomVector4:Calculator(BenchCSharpApp.TestCustomVector4+MyVector4,BenchCSharpApp.TestCustomVector4+MyVector4):float

Physical promotion is not enabled by default, and even when it is enabled via DOTNET_JitEnablePhysicalPromotion=1 it does not decide to promote in preview 4, but it should in a future preview.

(edit: edited the assembly code with the result after #86043 -- which gets the ideal codegen)

@jakobbotsch
Member

It would be possible to get this in regular promotion by considering the custom struct type to have a simd16 field. You can sort of work around that manually by defining your struct in the following way:

public struct MyVector4
{
    private Vector4 V;
    public MyVector4(float x, float y, float z, float w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }

    [UnscopedRef]
    public ref float X => ref V.X;
    [UnscopedRef]
    public ref float Y => ref V.Y;
    [UnscopedRef]
    public ref float Z => ref V.Z;
    [UnscopedRef]
    public ref float W => ref V.W;
}

It gets the ideal code gen on current main and should work in preview 4 and probably earlier versions of .NET too:

G_M33107_IG01:  ;; offset=0000H
       vzeroupper
                                                ;; size=3 bbWeight=1 PerfScore 1.00
G_M33107_IG02:  ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rdx]
       vmovups  xmm1, xmmword ptr [rcx]
       vaddps   xmm2, xmm1, xmm0
       vmulps   xmm3, xmm1, xmm0
       vsubps   xmm0, xmm1, xmm0
       vaddps   xmm1, xmm2, xmm3
       vsubps   xmm0, xmm1, xmm0
       vdpps    xmm0, xmm0, xmm0, -1
       vsqrtss  xmm0, xmm0, xmm0
                                                ;; size=38 bbWeight=1 PerfScore 46.00
G_M33107_IG03:  ;; offset=0029H
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 42, prolog size 3, PerfScore 52.20, instruction count 11, allocated bytes for code 42 (MethodHash=f2247eac) for method TestCustomVector4:Calculator(TestCustomVector4+MyVector4,TestCustomVector4+MyVector4):float

@xoofx
Member Author

xoofx commented May 19, 2023

It would be possible to get this in regular promotion by considering the custom struct type to have a simd16 field. You can sort of work around that manually by defining your struct in the following way:

Indeed, that could be an acceptable workaround, that might be enough for my case, thanks!

@xoofx xoofx closed this as completed May 19, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label May 19, 2023
@tannergooding
Member

tannergooding commented May 19, 2023

It's worth noting that you'll get better codegen for argument passing by using a Vector128<float> field instead.

Vector4 is a "user-defined struct" that happens to get special JIT handling, it is at best passed in 4 registers and is treated as an HFA (homogenous floating-point aggregate). Vector128<T>, however, is an ABI primitive equivalent to __m128 in native, so it gets passed by register and treated as an HVA (homogenous vector aggregate) on some platforms (such as Unix or __vectorcall on Windows -- we don't support the latter today, unfortunately)

Likewise, you really don't want to expose references to the underlying fields if you can help it. That is, you probably don't want:

    [UnscopedRef]
    public ref float X => ref V.X;

The public fields are an anti-pattern in the System.Numerics.Vector2/3/4, Quaternion, Plane, Matrix3x2, and Matrix4x4 types, and we have to do a lot to work around them being exposed.

They also encourage devs not to treat these as "single values", which often leads to worse codegen. Vector types live in a single register and should be treated as a single unit, much like an int32 is. Thus, you want to "replace" the entire struct at once (e.g. vector = vector.WithElement(1, 5) as Vector128<T> requires) and not treat it like a standard struct where you only mutate the bits that are changing (e.g. you don't typically want vector.Y = 5, or for the user to be able to do &vector.Y, etc.).
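
A small illustration of the two styles (a sketch with made-up method names, not code from this thread):

using System.Numerics;
using System.Runtime.Intrinsics;

static class VectorMutationExample
{
    // "Single value" style: replace the whole vector with a new one (lane 1 = y).
    static Vector128<float> SetY(Vector128<float> v, float y) => v.WithElement(1, y);

    // Field-mutation style on Vector4, which the comment above advises against,
    // since it encourages treating the vector as four independent floats.
    static Vector4 SetYByField(Vector4 v, float y)
    {
        v.Y = y;
        return v;
    }
}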

@xoofx
Member Author

xoofx commented May 19, 2023

Thanks, agreed and aware about the inefficiency of member access. It's more about providing a transition with a relatively efficient implementation for an existing API that cannot be changed. 🙂

@jakobbotsch
Member

FWIW, I wouldn't expect false positive address exposure in most cases with these properties. If you immediately load/store them the JIT should be able to understand that no address exposure is occurring after inlining.

@tannergooding
Member

The issue here isn't necessarily false address exposure; it's that this kind of mutation goes against how the vectors actually operate at the hardware level, and it leads people to accidentally write code that performs poorly as SIMD.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 23, 2023