Unsafe.BitCast/As does not remove stack spilling for custom SIMD type #86448
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Hello JIT compiler friends, 😊

Description

Sometimes there are legacy Vector Math APIs that provide their own Vector3/Vector4 and would like to transition to a SIMD-optimized version through System.Numerics.Vector3/4. I have tried to use Unsafe.As, but it generates stack spills, so I was hoping that the new .NET 8 Unsafe.BitCast #82917 would help here, but it seems not. It might require deeper tweaking in the JIT to realize that a type is almost an alias to a SIMD type.

With the latest .NET 8 preview (8.0.100-preview.4.23260.5), the following code:

using System.Numerics;
using System.Runtime.CompilerServices;
namespace BenchCSharpApp;
public class TestCustomVector4
{
[MethodImpl(MethodImplOptions.NoInlining)]
public static float Calculator(MyVector4 v1, MyVector4 v2)
{
var a = Add(v1, v2);
var b = Multiply(v1, v2);
var c = Subtract(v1, v2);
var d = Subtract(Add(a, b), c);
return Length(d);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static float Length(MyVector4 v1)
{
return Unsafe.BitCast<MyVector4, Vector4>(v1).Length();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static MyVector4 Multiply(MyVector4 v1, MyVector4 v2)
{
return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) * Unsafe.BitCast<MyVector4, Vector4>(v2));
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static MyVector4 Add(MyVector4 v1, MyVector4 v2)
{
return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) + Unsafe.BitCast<MyVector4, Vector4>(v2));
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static MyVector4 Subtract(MyVector4 v1, MyVector4 v2)
{
return Unsafe.BitCast<Vector4, MyVector4>(Unsafe.BitCast<MyVector4, Vector4>(v1) - Unsafe.BitCast<MyVector4, Vector4>(v2));
}
public struct MyVector4
{
public MyVector4(float x, float y, float z, float w)
{
X = x;
Y = y;
Z = z;
W = w;
}
public float X;
public float Y;
public float Z;
public float W;
}
}

generates the following ASM:
while a version that would not perform any stack spill could achieve the following:
I would not mind putting an attribute on such custom types, e.g. [AliasVector(typeof(Vector4))], if it could help the SIMD-only usage detection.
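For illustration, a minimal sketch of what such an opt-in attribute might look like (this attribute does not exist in .NET; the name and shape are purely hypothetical, mirroring the [AliasVector(typeof(Vector4))] idea above):

using System;
using System.Numerics;

// Hypothetical marker the JIT could use to treat a struct as an alias of a SIMD type.
[AttributeUsage(AttributeTargets.Struct)]
public sealed class AliasVectorAttribute : Attribute
{
    public AliasVectorAttribute(Type vectorType) => VectorType = vectorType;
    public Type VectorType { get; }
}

[AliasVector(typeof(Vector4))]
public struct MyVector4
{
    public float X, Y, Z, W;
}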
cc: @MichalPetryka, would #85562 help here?
Using Unsafe.As with in parameters:

using System.Numerics;
using System.Runtime.CompilerServices;
namespace BenchCSharpApp;
public class TestCustomVector4
{
[MethodImpl(MethodImplOptions.NoInlining)]
public static float Calculator(MyVector4 v1, MyVector4 v2)
{
var a = Add(v1, v2);
var b = Multiply(v1, v2);
var c = Subtract(v1, v2);
var d = Subtract(Add(a, b), c);
return Length(d);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static float Length(in MyVector4 v1)
{
return Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)).Length();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static MyVector4 Multiply(in MyVector4 v1, in MyVector4 v2)
{
var v = Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)) * Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v2));
return Unsafe.As<Vector4, MyVector4>(ref v);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static MyVector4 Add(in MyVector4 v1, in MyVector4 v2)
{
var v = Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)) + Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v2));
return Unsafe.As<Vector4, MyVector4>(ref v);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static MyVector4 Subtract(in MyVector4 v1, in MyVector4 v2)
{
var v = Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v1)) - Unsafe.As<MyVector4, Vector4>(ref Unsafe.AsRef(v2));
return Unsafe.As<Vector4, MyVector4>(ref v);
}
public struct MyVector4
{
public MyVector4(float x, float y, float z, float w)
{
X = x;
Y = y;
Z = z;
W = w;
}
public float X;
public float Y;
public float Z;
public float W;
}
}

which produces:

L0000: vzeroupper
L0003: vmovupd xmm0, [rcx]
L0007: vmovupd xmm1, [rdx]
L000b: vaddps xmm2, xmm0, xmm1
L000f: vmulps xmm3, xmm0, xmm1
L0013: vsubps xmm0, xmm0, xmm1
L0017: vaddps xmm1, xmm2, xmm3
L001b: vsubps xmm0, xmm1, xmm0
L001f: vdpps xmm0, xmm0, xmm0, 0xf1
L0025: vsqrtss xmm0, xmm0, xmm0
L0029: ret
Yeah, I noticed that, but it comes with the constraint that we can't necessarily modify all user code and existing APIs. 😉
Regular promotion does not promote any of these because no fields are accessed:

Not promoting promotable struct local V00: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V01: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V02: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V03: #fields = 4, fieldAccessed = 0.
struct promotion of V04 is disabled because it has already been marked address exposed
Not promoting promotable struct local V05: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V06: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V07: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V10: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V11: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V15: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V16: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V20: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V21: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V25: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V26: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V30: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V31: #fields = 4, fieldAccessed = 0.
Not promoting promotable struct local V34: #fields = 4, fieldAccessed = 0.

(edit: this makes sense since these are SIMDs. But we end up DNER'ing them due to LCL_FLD accesses, causing all the stack spilling.)

However, with physical promotion after the latest heuristics adjustments:

; Assembly listing for method BenchCSharpApp.TestCustomVector4:Calculator(BenchCSharpApp.TestCustomVector4+MyVector4,BenchCSharpApp.TestCustomVector4+MyVector4):float
; Emitting BLENDED_CODE for X64 with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 16 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 6 ) byref -> rcx single-def
; V01 arg1 [V01,T01] ( 3, 6 ) byref -> rdx single-def
;* V02 loc0 [V02,T05] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF]
;* V03 loc1 [V03,T06] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF]
;# V04 OutArgs [V04 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V05 tmp1 [V05,T02] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "impAppendStmt"
;* V06 tmp2 [V06,T03] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "spilled call-like call argument"
;* V07 tmp3 [V07,T04] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "spilled call-like call argument"
;* V08 tmp4 [V08 ] ( 0, 0 ) simd16 -> zero-ref "impAppendStmt"
;* V09 tmp5 [V09 ] ( 0, 0 ) simd16 -> zero-ref "spilled call-like call argument"
;* V10 tmp6 [V10 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V11 tmp7 [V11 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V12 tmp8 [V12,T09] ( 0, 0 ) simd16 -> zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V13 tmp9 [V13 ] ( 0, 0 ) simd16 -> zero-ref "impAppendStmt"
;* V14 tmp10 [V14 ] ( 0, 0 ) simd16 -> zero-ref "spilled call-like call argument"
;* V15 tmp11 [V15 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V16 tmp12 [V16 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V17 tmp13 [V17,T10] ( 0, 0 ) simd16 -> zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V18 tmp14 [V18 ] ( 0, 0 ) simd16 -> zero-ref "impAppendStmt"
;* V19 tmp15 [V19 ] ( 0, 0 ) simd16 -> zero-ref "spilled call-like call argument"
;* V20 tmp16 [V20 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V21 tmp17 [V21 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V22 tmp18 [V22,T11] ( 0, 0 ) simd16 -> zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V23 tmp19 [V23 ] ( 0, 0 ) simd16 -> zero-ref "impAppendStmt"
;* V24 tmp20 [V24 ] ( 0, 0 ) simd16 -> zero-ref "spilled call-like call argument"
;* V25 tmp21 [V25 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V26 tmp22 [V26 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V27 tmp23 [V27,T12] ( 0, 0 ) simd16 -> zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V28 tmp24 [V28 ] ( 0, 0 ) simd16 -> zero-ref "impAppendStmt"
;* V29 tmp25 [V29 ] ( 0, 0 ) simd16 -> zero-ref "spilled call-like call argument"
;* V30 tmp26 [V30 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V31 tmp27 [V31 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V32 tmp28 [V32,T13] ( 0, 0 ) simd16 -> zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
; V33 tmp29 [V33,T14] ( 3, 3 ) simd16 -> mm0 ld-addr-op single-def "Inline stloc first use temp"
;* V34 tmp30 [V34 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
;* V35 tmp31 [V35 ] ( 0, 0 ) simd16 -> zero-ref "V10.[000..016)"
;* V36 tmp32 [V36 ] ( 0, 0 ) simd16 -> zero-ref "V11.[000..016)"
;* V37 tmp33 [V37 ] ( 0, 0 ) simd16 -> zero-ref "V15.[000..016)"
;* V38 tmp34 [V38 ] ( 0, 0 ) simd16 -> zero-ref "V16.[000..016)"
;* V39 tmp35 [V39 ] ( 0, 0 ) simd16 -> zero-ref "V20.[000..016)"
;* V40 tmp36 [V40 ] ( 0, 0 ) simd16 -> zero-ref "V21.[000..016)"
;* V41 tmp37 [V41 ] ( 0, 0 ) simd16 -> zero-ref "V25.[000..016)"
;* V42 tmp38 [V42 ] ( 0, 0 ) simd16 -> zero-ref "V26.[000..016)"
;* V43 tmp39 [V43 ] ( 0, 0 ) simd16 -> zero-ref "V30.[000..016)"
;* V44 tmp40 [V44 ] ( 0, 0 ) simd16 -> zero-ref "V31.[000..016)"
;* V45 tmp41 [V45 ] ( 0, 0 ) simd16 -> zero-ref "V34.[000..016)"
; V46 cse0 [V46,T15] ( 2, 2 ) simd16 -> mm2 "CSE - aggressive"
; V47 cse1 [V47,T16] ( 2, 2 ) simd16 -> mm3 "CSE - aggressive"
; V48 cse2 [V48,T17] ( 2, 2 ) simd16 -> mm1 "CSE - aggressive"
; V49 cse3 [V49,T18] ( 2, 2 ) simd16 -> mm0 "CSE - aggressive"
; V50 cse4 [V50,T19] ( 2, 2 ) simd16 -> mm0 "CSE - aggressive"
; V51 cse5 [V51,T07] ( 4, 4 ) simd16 -> mm0 "CSE - aggressive"
; V52 cse6 [V52,T08] ( 4, 4 ) simd16 -> mm1 "CSE - aggressive"
;
; Lcl frame size = 0
G_M42757_IG01: ;; offset=0000H
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M42757_IG02: ;; offset=0003H
vmovups xmm0, xmmword ptr [rcx]
vmovups xmm1, xmmword ptr [rdx]
vaddps xmm2, xmm0, xmm1
vmulps xmm3, xmm0, xmm1
vsubps xmm0, xmm0, xmm1
vaddps xmm1, xmm2, xmm3
vsubps xmm0, xmm1, xmm0
vdpps xmm0, xmm0, xmm0, -1
vsqrtss xmm0, xmm0, xmm0
;; size=38 bbWeight=1 PerfScore 46.00
G_M42757_IG03: ;; offset=0029H
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 42, prolog size 3, PerfScore 52.20, instruction count 11, allocated bytes for code 42 (MethodHash=8ac158fa) for method BenchCSharpApp.TestCustomVector4:Calculator(BenchCSharpApp.TestCustomVector4+MyVector4,BenchCSharpApp.TestCustomVector4+MyVector4):float

Physical promotion is not enabled by default, and even when enabled (edit: edited the assembly code with the result after #86043 -- which gets the ideal codegen)
It would be possible to get this in regular promotion by considering the custom struct type to have a simd16 field. You can sort of work around that manually by defining your struct in the following way:

public struct MyVector4
{
private Vector4 V;
public MyVector4(float x, float y, float z, float w)
{
X = x;
Y = y;
Z = z;
W = w;
}
[UnscopedRef]
public ref float X => ref V.X;
[UnscopedRef]
public ref float Y => ref V.Y;
[UnscopedRef]
public ref float Z => ref V.Z;
[UnscopedRef]
public ref float W => ref V.W;
}

It gets the ideal codegen on current main, and it should work in preview 4 and probably earlier versions of .NET too:

G_M33107_IG01: ;; offset=0000H
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M33107_IG02: ;; offset=0003H
vmovups xmm0, xmmword ptr [rdx]
vmovups xmm1, xmmword ptr [rcx]
vaddps xmm2, xmm1, xmm0
vmulps xmm3, xmm1, xmm0
vsubps xmm0, xmm1, xmm0
vaddps xmm1, xmm2, xmm3
vsubps xmm0, xmm1, xmm0
vdpps xmm0, xmm0, xmm0, -1
vsqrtss xmm0, xmm0, xmm0
;; size=38 bbWeight=1 PerfScore 46.00
G_M33107_IG03: ;; offset=0029H
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 42, prolog size 3, PerfScore 52.20, instruction count 11, allocated bytes for code 42 (MethodHash=f2247eac) for method TestCustomVector4:Calculator(TestCustomVector4+MyVector4,TestCustomVector4+MyVector4):float
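If it helps, a small usage sketch assuming the wrapper struct above (note that UnscopedRef comes from System.Diagnostics.CodeAnalysis): because X/Y/Z/W are ref-returning properties, existing call sites that assign to the components keep compiling with field-like syntax.

var v = new MyVector4(1f, 2f, 3f, 4f);
v.X = 5f;  // assignment goes through the ref-returning property, just like a field
float lenSq = v.X * v.X + v.Y * v.Y + v.Z * v.Z + v.W * v.W;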
Indeed, that could be an acceptable workaround; it might be enough for my case, thanks!
It's worth noting that you'll get better codegen for argument passing by using a
Likewise, you really don't want to expose references to the underlying fields if you can help it. That is, you probably don't want:

[UnscopedRef]
public ref float X => ref V.X;

The public fields are an anti-pattern in the  They also encourage devs to not treat these as "single values", which often leads to worse codegen. Vector types exist in a single register and should be treated as a single unit, much like an
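As a rough sketch of that advice (illustrative only, not an API proposed in this issue): keep the backing Vector4 private, expose the components as plain read-only values, and do all arithmetic on the whole vector.

using System.Numerics;

public struct MyVector4
{
    private Vector4 V;

    public MyVector4(float x, float y, float z, float w) => V = new Vector4(x, y, z, w);

    // Read-only component access: copies the lane value instead of exposing a reference.
    public float X => V.X;
    public float Y => V.Y;
    public float Z => V.Z;
    public float W => V.W;

    // Whole-vector operations keep the value as a single unit.
    public static MyVector4 operator +(MyVector4 a, MyVector4 b) => new() { V = a.V + b.V };
    public float Length() => V.Length();
}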
Thanks, agreed and aware of the inefficiency of member access. It's more about providing a transition with a relatively efficient implementation for an existing API that cannot be changed. 🙂
FWIW, I wouldn't expect false-positive address exposure in most cases with these properties. If you immediately load/store them, the JIT should be able to understand that no address exposure is occurring after inlining.
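For example, a pattern like the following (a sketch against the ref-returning property shape from the workaround above; the behaviour is what I'd expect, not a guarantee) loads and stores the component immediately, so no reference to the local escapes:

static MyVector4 OffsetX(MyVector4 v, float dx)
{
    v.X += dx;  // read through the ref-returning property, add, store straight back
    return v;
}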
The issue here isn't necessarily false address exposure, but that this kind of mutation goes against how the vectors actually operate at the hardware level, and it leads people toward accidentally writing code that performs poorly as SIMD.
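To illustrate the point with plain System.Numerics types (general SIMD guidance, not code from this issue): per-lane mutation tends to move the value between registers and memory, while operating on the whole vector keeps it in a single register.

using System.Numerics;

static class VectorStyle
{
    // Lane-by-lane mutation: a scalar read-modify-write on one component,
    // which tends to push the vector out of a SIMD register.
    public static Vector4 NudgeScalar(Vector4 v, float dx)
    {
        v.X += dx;
        return v;
    }

    // Whole-vector operation: a single SIMD add; the value stays one unit.
    public static Vector4 NudgeVector(Vector4 v, float dx)
        => v + new Vector4(dx, 0f, 0f, 0f);
}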