Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

volatile. prefix doesn't work on ldobj or stobj #91530

Open
hamarb123 opened this issue Sep 4, 2023 · 9 comments
Open

volatile. prefix doesn't work on ldobj or stobj #91530

hamarb123 opened this issue Sep 4, 2023 · 9 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@hamarb123
Copy link
Contributor

hamarb123 commented Sep 4, 2023

Description

I'm probably the first person insane enough to try this :)

In IL code you can use the volatile. prefix for any of ldind, stind, ldfld, stfld, ldobj, stobj, initblk, cpblk, ldsfld, and stsfld (III.2.6). Notably, nowhere does it place any form of restrictions on what the type can be for APIs like ldobj and stobj; in fact it even specifies behaviour for atomicity, which would not make sense if it wasn't supposed to work on non-builtin types They do not provide atomicity, other than that guaranteed by the specification of §I.12.6.6. (I.12.6.7).

I.12.6.6 Atomic reads and writes specifies A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §I.12.6.2) when all the write accesses to a location are the same size. which would mean that for custom types which are structs with multiple fields, it should just do the applicable volatile operation for each field (in an unspecified order).

The point of the above is that something like:

struct A
{
    int a, b;
}

should work with:

volatile. ldobj !!0
//and:
volatile. stobj !!0

when A or int or any other type is the generic parameter !!0; and for A it should generate something similar to what it would as if I'd done it directly with each field instead.

To be clear, I ran into this as a result of determining that this operation would actually be useful to something I was trying to implement lock-free.

Reproduction Steps

struct A
{
    public int a, b;
}

static T VolatileRead<T>(ref T reference)
{
    //ldarg.0
    //volatile.
    //ldobj !!0
    //ret
}

static void VolatileWrite<T>(ref T reference, T value)
{
    //ldarg.0
    //ldarg.1
    //volatile.
    //stobj !!0
    //ret
}

static A ReadVolatile1(ref A value) => VolatileRead(ref value);
static int ReadVolatile1(ref int value) => VolatileRead(ref value);
static A ReadVolatile2(ref A value)
{
    A result;
    result.a = Volatile.Read(ref value.a);
    result.b = Volatile.Read(ref value.b);
    return result;
}
static int ReadVolatile2(ref int value) => Volatile.Read(ref value);

static void WriteVolatile1(ref A value, A value2) => VolatileWrite(ref value, value2);
static void WriteVolatile1(ref int value, int value2) => VolatileWrite(ref value, value2);
static void WriteVolatile2(ref A value, A value2)
{
    Volatile.Write(ref value2.a, value.a);
    Volatile.Write(ref value2.b, value.b);
}
static void WriteVolatile2(ref int value, int value2) => Volatile.Write(ref value, value2);

Expected behavior

ReadVolatile1(int&) and ReadVolatile2(int&) have the same (or very similar) codegen, and similarly with A&. And same with Write.

Actual behavior

I get the following codegen from crossgen2
; Assembly listing for method Program:ReadVolatile1(byref):A
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M15271_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M15271_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x40000000004292b8]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x40000000004292b8]
        F9400161          ldr     x1, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M15271_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0020          br      x1
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=d3c5c458) for method Program:ReadVolatile1(byref):A
; ============================================================

; Assembly listing for method Program:ReadVolatile2(byref):A
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->   x0         single-def
;  V01 loc0         [V01    ] (  3,  3   )  struct ( 8) [fp+18H]   do-not-enreg[S] ld-addr-op
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V05 tmp3         [V05,T01] (  2,  2   )     int  ->  [fp+18H]   do-not-enreg[] V01.a(offs=0x00) P-DEP "field V01.a (fldOffset=0x0)"
;  V06 tmp4         [V06,T02] (  2,  2   )     int  ->  [fp+1CH]   do-not-enreg[] V01.b(offs=0x04) P-DEP "field V01.b (fldOffset=0x4)"
;
; Lcl frame size = 16

G_M36164_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M36164_IG02:              ;; offset=0008H
        88DFFC01          ldar    w1, [x0]
        B9001BA1          str     w1, [fp,#0x18]	// [V05 tmp3]
        B9400400          ldr     w0, [x0,#0x04]
        D50339BF          dmb     ishld
        B9001FA0          str     w0, [fp,#0x1C]	// [V06 tmp4]
        F9400FA0          ldr     x0, [fp,#0x18]	// [V01 loc0]
						;; size=24 bbWeight=1    PerfScore 20.00
G_M36164_IG03:              ;; offset=0020H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 40, prolog size 8, PerfScore 27.50, instruction count 10, allocated bytes for code 40 (MethodHash=85ab72bb) for method Program:ReadVolatile2(byref):A
; ============================================================

; Assembly listing for method Program:ReadVolatile1(byref):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M27221_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M27221_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x40000000004292e8]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x40000000004292e8]
        F9400161          ldr     x1, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M27221_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0020          br      x1
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=dc4895aa) for method Program:ReadVolatile1(byref):int
; ============================================================

; Assembly listing for method Program:ReadVolatile2(byref):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M42934_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M42934_IG02:              ;; offset=0008H
        88DFFC00          ldar    w0, [x0]
						;; size=4 bbWeight=1    PerfScore 3.00
G_M42934_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 20, prolog size 8, PerfScore 8.50, instruction count 5, allocated bytes for code 20 (MethodHash=9e525849) for method Program:ReadVolatile2(byref):int
; ============================================================

; Assembly listing for method Program:WriteVolatile1(byref,A)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )  struct ( 8) [fp+18H]   do-not-enreg[S] single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 16

G_M5886_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
        F9000FA1          str     x1, [fp,#0x18]
						;; size=12 bbWeight=1    PerfScore 2.50
G_M5886_IG02:              ;; offset=000CH
        F9400FA1          ldr     x1, [fp,#0x18]
        9000000B          adrp    x11, [HIGH RELOC #0x4000000000429358]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x4000000000429358]
        F9400162          ldr     x2, [x11]
						;; size=16 bbWeight=1    PerfScore 6.00
G_M5886_IG03:              ;; offset=001CH
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D61F0040          br      x2
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 12, PerfScore 14.10, instruction count 9, allocated bytes for code 36 (MethodHash=7866e901) for method Program:WriteVolatile1(byref,A)
; ============================================================

; Assembly listing for method Program:WriteVolatile2(byref,A)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->   x0         single-def
;  V01 arg1         [V01    ] (  4,  4   )  struct ( 8) [fp+18H]   do-not-enreg[XS] addr-exposed ld-addr-op single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  2,  4   )     int  ->   x1         "Inlining Arg"
;  V04 tmp2         [V04,T02] (  2,  4   )     int  ->   x0         "Inlining Arg"
;
; Lcl frame size = 16

G_M11869_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
        F9000FA1          str     x1, [fp,#0x18]
						;; size=12 bbWeight=1    PerfScore 2.50
G_M11869_IG02:              ;; offset=000CH
        B9400001          ldr     w1, [x0]
        D5033BBF          dmb     ish
        B9001BA1          str     w1, [fp,#0x18]
        B9400400          ldr     w0, [x0,#0x04]
        D5033BBF          dmb     ish
        B9001FA0          str     w0, [fp,#0x1C]
						;; size=24 bbWeight=1    PerfScore 28.00
G_M11869_IG03:              ;; offset=0024H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 44, prolog size 8, PerfScore 36.90, instruction count 11, allocated bytes for code 44 (MethodHash=597cd1a2) for method Program:WriteVolatile2(byref,A)
; ============================================================

; Assembly listing for method Program:WriteVolatile1(byref,int)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   x1         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M16972_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M16972_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x4000000000429388]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x4000000000429388]
        F9400162          ldr     x2, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M16972_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0040          br      x2
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=2db1bdb3) for method Program:WriteVolatile1(byref,int)
; ============================================================

; Assembly listing for method Program:WriteVolatile2(byref,int)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   x1         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M63279_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M63279_IG02:              ;; offset=0008H
        889FFC01          stlr    w1, [x0]
						;; size=4 bbWeight=1    PerfScore 1.00
G_M63279_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=abe008d0) for method Program:WriteVolatile2(byref,int)
; ============================================================

as can be seen above, the 1 variants don't appear to be volatile at all. But they should all have extremely similar codegen to the matching 2 variants.

Regression?

Probably not :)

Known Workarounds

For specific cases, it's obviously easy enough to work around if you have access to all the fields, or if it's simply a class type or unmanaged, but it's basically unsolvable in the general case since gc-refs could be anywhere in it, so you can't just loop over the memory.

Configuration

I got the codegen using disasmo. It's for .NET 7 on arm64 using crossgen2. I haven't got a locally built .NET 8, so I haven't checked it, but I doubt it's much better :) (it would be good if someone else was able to test this in .NET 8).
I was planning to use this on a .NET 6 project I have, so much for that idea lol. It probably won't be fixed until at least 9 at this point would be my guess (it would be great if we could fix it on 6, 7, and 8 too though 😄).

Other information

I haven't tested all of the other instructions, but it could easily be an issue specific to ldobj and stobj and the rest could be fine, this should be checked.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 4, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 4, 2023
@ghost
Copy link

ghost commented Sep 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I'm probably the first person insane enough to try this :)

In IL code you can use the volatile. prefix for any of ldind, stind, ldfld, stfld, ldobj, stobj, initblk, cpblk, ldsfld, and stsfld (III.2.6). Notably, nowhere does it place any form of restrictions on what the type can be for APIs like ldobj and stobj; in fact it even specifies behaviour for atomicity, which would not make sense if it wasn't supposed to work on non-builtin types They do not provide atomicity, other than that guaranteed by the specification of §I.12.6.6. (I.12.6.7).

I.12.6.6 Atomic reads and writes specifies A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §I.12.6.2) when all the write accesses to a location are the same size. which would mean that for custom types which are structs with multiple fields, it should just do the applicable volatile operation for each field (in an unspecified order).

The point of the above is that something like:

struct A
{
    int a, b;
}

should work with:

volatile. ldobj !!0
//and:
volatile. stobj !!0

when A is the generic parameter !!0; and it should generate something similar to what it would as if I'd done it directly with each field instead.

Reproduction Steps

struct A
{
    public int a, b;
}

static T VolatileRead<T>(ref T reference)
{
    //ldarg.0
    //volatile.
    //ldobj !!0
    //ret
}

static void VolatileWrite<T>(ref T reference, T value)
{
    //ldarg.0
    //ldarg.1
    //volatile.
    //stobj !!0
    //ret
}

static A ReadVolatile1(ref A value) => VolatileRead(ref value);
static int ReadVolatile1(ref int value) => VolatileRead(ref value);
static A ReadVolatile2(ref A value)
{
    A result;
    result.a = Volatile.Read(ref value.a);
    result.b = Volatile.Read(ref value.b);
    return result;
}
static int ReadVolatile2(ref int value) => Volatile.Read(ref value);

static void WriteVolatile1(ref A value, A value2) => VolatileWrite(ref value, value2);
static void WriteVolatile1(ref int value, int value2) => VolatileWrite(ref value, value2);
static void WriteVolatile2(ref A value, A value2)
{
    Volatile.Write(ref value2.a, value.a);
    Volatile.Write(ref value2.b, value.b);
}
static void WriteVolatile2(ref int value, int value2) => Volatile.Write(ref value, value2);

Expected behavior

ReadVolatile1(int&) and ReadVolatile2(int&) have the same (or very similar) codegen, and similarly with A&. And same with Write.

Actual behavior

I get the following codegen from crossgen2:

; Assembly listing for method Program:ReadVolatile1(byref):A
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M15271_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M15271_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x40000000004292b8]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x40000000004292b8]
        F9400161          ldr     x1, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M15271_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0020          br      x1
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=d3c5c458) for method Program:ReadVolatile1(byref):A
; ============================================================

; Assembly listing for method Program:ReadVolatile2(byref):A
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->   x0         single-def
;  V01 loc0         [V01    ] (  3,  3   )  struct ( 8) [fp+18H]   do-not-enreg[S] ld-addr-op
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V05 tmp3         [V05,T01] (  2,  2   )     int  ->  [fp+18H]   do-not-enreg[] V01.a(offs=0x00) P-DEP "field V01.a (fldOffset=0x0)"
;  V06 tmp4         [V06,T02] (  2,  2   )     int  ->  [fp+1CH]   do-not-enreg[] V01.b(offs=0x04) P-DEP "field V01.b (fldOffset=0x4)"
;
; Lcl frame size = 16

G_M36164_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M36164_IG02:              ;; offset=0008H
        88DFFC01          ldar    w1, [x0]
        B9001BA1          str     w1, [fp,#0x18]	// [V05 tmp3]
        B9400400          ldr     w0, [x0,#0x04]
        D50339BF          dmb     ishld
        B9001FA0          str     w0, [fp,#0x1C]	// [V06 tmp4]
        F9400FA0          ldr     x0, [fp,#0x18]	// [V01 loc0]
						;; size=24 bbWeight=1    PerfScore 20.00
G_M36164_IG03:              ;; offset=0020H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 40, prolog size 8, PerfScore 27.50, instruction count 10, allocated bytes for code 40 (MethodHash=85ab72bb) for method Program:ReadVolatile2(byref):A
; ============================================================

; Assembly listing for method Program:ReadVolatile1(byref):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M27221_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M27221_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x40000000004292e8]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x40000000004292e8]
        F9400161          ldr     x1, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M27221_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0020          br      x1
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=dc4895aa) for method Program:ReadVolatile1(byref):int
; ============================================================

; Assembly listing for method Program:ReadVolatile2(byref):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M42934_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M42934_IG02:              ;; offset=0008H
        88DFFC00          ldar    w0, [x0]
						;; size=4 bbWeight=1    PerfScore 3.00
G_M42934_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 20, prolog size 8, PerfScore 8.50, instruction count 5, allocated bytes for code 20 (MethodHash=9e525849) for method Program:ReadVolatile2(byref):int
; ============================================================

; Assembly listing for method Program:WriteVolatile1(byref,A)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )  struct ( 8) [fp+18H]   do-not-enreg[S] single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 16

G_M5886_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
        F9000FA1          str     x1, [fp,#0x18]
						;; size=12 bbWeight=1    PerfScore 2.50
G_M5886_IG02:              ;; offset=000CH
        F9400FA1          ldr     x1, [fp,#0x18]
        9000000B          adrp    x11, [HIGH RELOC #0x4000000000429358]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x4000000000429358]
        F9400162          ldr     x2, [x11]
						;; size=16 bbWeight=1    PerfScore 6.00
G_M5886_IG03:              ;; offset=001CH
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D61F0040          br      x2
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 12, PerfScore 14.10, instruction count 9, allocated bytes for code 36 (MethodHash=7866e901) for method Program:WriteVolatile1(byref,A)
; ============================================================

; Assembly listing for method Program:WriteVolatile2(byref,A)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->   x0         single-def
;  V01 arg1         [V01    ] (  4,  4   )  struct ( 8) [fp+18H]   do-not-enreg[XS] addr-exposed ld-addr-op single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  2,  4   )     int  ->   x1         "Inlining Arg"
;  V04 tmp2         [V04,T02] (  2,  4   )     int  ->   x0         "Inlining Arg"
;
; Lcl frame size = 16

G_M11869_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
        F9000FA1          str     x1, [fp,#0x18]
						;; size=12 bbWeight=1    PerfScore 2.50
G_M11869_IG02:              ;; offset=000CH
        B9400001          ldr     w1, [x0]
        D5033BBF          dmb     ish
        B9001BA1          str     w1, [fp,#0x18]
        B9400400          ldr     w0, [x0,#0x04]
        D5033BBF          dmb     ish
        B9001FA0          str     w0, [fp,#0x1C]
						;; size=24 bbWeight=1    PerfScore 28.00
G_M11869_IG03:              ;; offset=0024H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 44, prolog size 8, PerfScore 36.90, instruction count 11, allocated bytes for code 44 (MethodHash=597cd1a2) for method Program:WriteVolatile2(byref,A)
; ============================================================

; Assembly listing for method Program:WriteVolatile1(byref,int)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   x1         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M16972_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M16972_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x4000000000429388]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x4000000000429388]
        F9400162          ldr     x2, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M16972_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0040          br      x2
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 10.30, instruction count 7, allocated bytes for code 28 (MethodHash=2db1bdb3) for method Program:WriteVolatile1(byref,int)
; ============================================================

; Assembly listing for method Program:WriteVolatile2(byref,int)
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->   x0         single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   x1         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M63279_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M63279_IG02:              ;; offset=0008H
        889FFC01          stlr    w1, [x0]
						;; size=4 bbWeight=1    PerfScore 1.00
G_M63279_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=abe008d0) for method Program:WriteVolatile2(byref,int)
; ============================================================

as can be seen above, the 1 variants don't appear to be volatile at all. But they should all have extremely similar codegen to the matching 2 variants.

Regression?

Probably not :)

Known Workarounds

For specific cases, it's obviously easy enough to work around if you have access to all the fields, or if it's simply a class type or unmanaged, but it's basically unsolvable in the general case since gc-refs could be anywhere in it, so you can't just loop over the memory.

Configuration

I got the codegen using disasmo. It's for .NET 7 on arm64 using crossgen2. I haven't got a locally built .NET 8, so I haven't checked it, but I doubt it's much better :) (it would be good if someone else was able to test this in .NET 8).
I was planning to use this on a .NET 6 project I have, so much for that idea lol. It probably won't be fixed until at least 9 at this point would be my guess.

Other information

I haven't tested all of the other instructions, but it could easily be an issue specific to ldobj and stobj and the rest could be fine, this should be checked.

Author: hamarb123
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member

EgorBo commented Sep 4, 2023

It's probably doable with some efforts accross all runtimes, but since it's not exposed in C# (or any other .NET language) it's probably better to leave it as is and refer to this ECMA's update: https://github.com/dotnet/runtime/blob/main/docs/design/specs/Ecma-335-Augments.md#atomic-reads-and-writes

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 4, 2023

probably better to leave it as is

Not sure what you're referring to here - I don't expect atomicity guarantees (except for the builtin primitive types & pointers obviously), the issue is that the volatile guarantees are not met.

@EgorBo
Copy link
Member

EgorBo commented Sep 4, 2023

probably better to leave it as is

Not sure what you're referring to here - I don't expect atomicity guarantees (except for the builtin primitive types & pointers obviously), the issue is that the volatile guarantees are not met.

Does it make any sense without atomicity guarantees?

Also, I don't see from your codegen what's wrong - your *1 variants have calls which may or may not provide volatile guarantees - how did you know?

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 4, 2023

Does it make any sense without atomicity guarantees?

Yes, I had the example on the discord earlier (am happy to post it here if you would like).

Also, I don't see from your codegen what's wrong - your *1 variants have calls which may or may not provide volatile guarantees - how did you know?

I must have missed that, do you know what function it's calling (all mine here are aggressiveinlining)? Is there a way I could test whether it's working?

public static int Method1(void* address, int a) => VolatileRead(ref *(int*)address);
public static int Method1(void* address, short a) => *(int*)address;
public static int Method1(void* address, byte a) => System.Threading.Volatile.Read(ref *(int*)address);

gives

; Assembly listing for method Program:Method1(ulong,int):int
G_M25106_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M25106_IG02:              ;; offset=0008H
        9000000B          adrp    x11, [HIGH RELOC #0x40000000004292a8]      // function address
        9100016B          add     x11, x11, [LOW RELOC #0x40000000004292a8]
        F9400161          ldr     x1, [x11]
						;; size=12 bbWeight=1    PerfScore 4.00
G_M25106_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D61F0020          br      x1
						;; size=8 bbWeight=1    PerfScore 2.00



; Assembly listing for method Program:Method1(ulong,short):int
G_M46963_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M46963_IG02:              ;; offset=0008H
        B9400000          ldr     w0, [x0]
						;; size=4 bbWeight=1    PerfScore 3.00
G_M46963_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00



; Assembly listing for method Program:Method1(ulong,ubyte):int
G_M27038_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M27038_IG02:              ;; offset=0008H
        88DFFC00          ldar    w0, [x0]
						;; size=4 bbWeight=1    PerfScore 3.00
G_M27038_IG03:              ;; offset=000CH
        A8C17BFD          ldp     fp, lr, [sp],#0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

why is the one which calls VolatileRead seemingly so much worse? Shouldn't it just use ldar too?

@EgorBo
Copy link
Member

EgorBo commented Sep 4, 2023

why is the one which calls VolatileRead seemingly so much worse?

Presumably, it's a cross-assembly inlining in crossgen limitation (I assume your VolatileRead is written in pure IL in a separate module).

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 4, 2023

Presumably, it's a cross-assembly inlining in crossgen limitation

I think you're right, it seems to generate the correct code when I use clrjit_universal_arm64_x64 instead for int.

1 other thing to check, the expected thing for VolatileCopyBlock would be dmb ish before the call to bl CORINFO_HELP_MEMCPY & the rest is the same as without volatile. & similar for InitBlock?

It does seem to be not right for this random type I came up with though:

; Method Program:Method1(ulong,int):System.ValueTuple`3[int,ubyte,long]
G_M62410_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50

G_M62410_IG02:              ;; offset=0008H
        3DC00010          ldr     q16, [x0]
        3D8007B0          str     q16, [fp,#0x10]
        F9400BA0          ldr     x0, [fp,#0x10]
        F9400FA1          ldr     x1, [fp,#0x18]
						;; size=16 bbWeight=1    PerfScore 8.00

G_M62410_IG03:              ;; offset=0018H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00
; Total bytes of code: 32

; Method Program:Method1(ulong,short):System.ValueTuple`3[int,ubyte,long]
G_M11179_IG01:              ;; offset=0000H
        A9BE7BFD          stp     fp, lr, [sp,#-0x20]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50

G_M11179_IG02:              ;; offset=0008H
        3DC00010          ldr     q16, [x0]
        3D8007B0          str     q16, [fp,#0x10]
        F9400BA0          ldr     x0, [fp,#0x10]
        F9400FA1          ldr     x1, [fp,#0x18]
						;; size=16 bbWeight=1    PerfScore 8.00

G_M11179_IG03:              ;; offset=0018H
        A8C27BFD          ldp     fp, lr, [sp],#0x20
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00
; Total bytes of code: 32

for

public static (int, byte, long) Method1(void* address, int a) => VolatileRead(ref *((int, byte, long)*)address);
public static (int, byte, long) Method1(void* address, short a) => *((int, byte, long)*)address;

(feel free to inform me that I'm wrong if I am)

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 4, 2023

Just so it's clear to anyone reading, specifically what I would expect to happen with the volatile operations on non-primitives/pointers, or when it's unaligned (since volatile can also be combined with unaligned.), is:

  • The ordering guarantees (these are the most important)
  • The read/write operation can be considered as one whole operation, therefore the fields can be read/written in any order whichever is most practical
  • It provides the normal volatile guarantees: e.g. lack of elimination, etc.
  • It doesn't need to be atomic since that wouldn't be guaranteed for a normal read/write either for these types

Implementing the above (for cases which aren't already handled like primitive types & pointers) isn't too difficult, as you need only put the applicable half barrier before/after the standard read/write code - or use the applicable store/write instruction if there's only 1 of them (or the first for write, last for read) and there's a volatile version available of that operation.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 4, 2023

For volatile. cpblk/initblk, on arm we are currently using dmb ish when we could be using dmb ishst.

Additional notes from when I investigated it last: also volatile. initblk fails when it gets inlined, and volatile. cpblk seems to provide only a store barrier, which doesn't seem right.

@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Sep 6, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 6, 2023
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

3 participants