Improve performance of Span.Clear #51534

Closed

Conversation

GrabYourPitchforks
Member

Followup to #51365, this time focusing on Span<T>.Clear. The code comments explain why it's legal to use SIMD operations here and to have a mix of aligned + unaligned writes, even though we're potentially stomping on references.

Also clarified a comment in Span<T>.Fill.

/cc @carlossanlop @jozkee
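For readers following along, here is a minimal sketch (mine, written over a plain byte buffer rather than over references, and not the actual SpanHelpers.ClearWithReferences code) of the "one unaligned vector at the start, full vectors in the middle, one possibly overlapping unaligned vector at the end" pattern that shows up in the disassembly posted below:

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class VectorClearSketch
{
    public static void VectorClear(Span<byte> buffer)
    {
        int vecSize = Vector<byte>.Count;
        if (!Vector.IsHardwareAccelerated || buffer.Length < vecSize)
        {
            buffer.Clear(); // small or non-accelerated: fall back to the ordinary clear
            return;
        }

        ref byte dest = ref MemoryMarshal.GetReference(buffer);
        Vector<byte> zero = default;

        // Head: one unaligned vector write covering the first vecSize bytes.
        Unsafe.WriteUnaligned(ref dest, zero);

        // Middle: full-vector writes (the real code re-aligns here; this sketch does not).
        int i = vecSize;
        for (; i <= buffer.Length - vecSize; i += vecSize)
        {
            Unsafe.WriteUnaligned(ref Unsafe.Add(ref dest, i), zero);
        }

        // Tail: one final unaligned write ending exactly at the end of the buffer,
        // overlapping the previous write when the length isn't a vector multiple.
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dest, buffer.Length - vecSize), zero);
    }
}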

@ghost

ghost commented Apr 20, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @carlossanlop
See info in area-owners.md if you want to be subscribed.


Author: GrabYourPitchforks
Assignees: -
Labels: area-System.Memory, optimization, tenet-performance

Milestone: 6.0.0

@GrabYourPitchforks
Member Author

GrabYourPitchforks commented Apr 20, 2021

Perf results below for clearing a Span<object> of a given length. There is a very slight regression (fractions of a ns) for clearing small spans, but nothing I think we'd miss.

| Method | Job        | Toolchain | Length |      Mean |     Error |    StdDev | Ratio | RatioSD |
|------- |----------- |---------- |------- |----------:|----------:|----------:|------:|--------:|
| Clear  | Job-TAFADB | main      | 0      |  1.633 ns | 0.0144 ns | 0.0135 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 0      |  2.028 ns | 0.0117 ns | 0.0104 ns |  1.24 |    0.01 |
| Clear  | Job-TAFADB | main      | 1      |  1.833 ns | 0.0086 ns | 0.0081 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 1      |  2.037 ns | 0.0171 ns | 0.0133 ns |  1.11 |    0.01 |
| Clear  | Job-TAFADB | main      | 3      |  2.273 ns | 0.0143 ns | 0.0134 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 3      |  2.093 ns | 0.0392 ns | 0.0367 ns |  0.92 |    0.02 |
| Clear  | Job-TAFADB | main      | 7      |  1.612 ns | 0.0154 ns | 0.0136 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 7      |  1.844 ns | 0.0135 ns | 0.0119 ns |  1.14 |    0.01 |
| Clear  | Job-TAFADB | main      | 16     |  2.766 ns | 0.0152 ns | 0.0135 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 16     |  1.844 ns | 0.0116 ns | 0.0090 ns |  0.67 |    0.00 |
| Clear  | Job-TAFADB | main      | 32     |  6.345 ns | 0.0197 ns | 0.0184 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 32     |  2.357 ns | 0.0161 ns | 0.0151 ns |  0.37 |    0.00 |
| Clear  | Job-TAFADB | main      | 128    | 28.214 ns | 0.1197 ns | 0.1061 ns |  1.00 |    0.00 |
| Clear  | Job-YXPQWT | spanclear | 128    |  7.042 ns | 0.0274 ns | 0.0257 ns |  0.25 |    0.00 |
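For context, a minimal BenchmarkDotNet-style harness of the sort that could produce a table like the one above (my sketch; the Params list mirrors the Length column, but this is not the actual dotnet/performance benchmark):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class SpanClearBenchmarks
{
    private object[] _array;

    [Params(0, 1, 3, 7, 16, 32, 128)]
    public int Length;

    [GlobalSetup]
    public void Setup() => _array = new object[Length];

    [Benchmark]
    public void Clear() => _array.AsSpan().Clear();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<SpanClearBenchmarks>();
}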

Codegen (x64) and some brief inspection:

00007ffa`4b76cec0 c5f877                 vzeroupper 
00007ffa`4b76cec3 4883fa04               cmp     rdx, 4
00007ffa`4b76cec7 726a                   jb      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x73 (00007ffa`4b76cf33)
00007ffa`4b76cec9 c4e17c57c0             vxorps  ymm0, ymm0, ymm0
00007ffa`4b76cece c4e17d1101             vmovupd ymmword ptr [rcx], ymm0
00007ffa`4b76ced3 488bc1                 mov     rax, rcx
00007ffa`4b76ced6 4883e01f               and     rax, 1Fh
00007ffa`4b76ceda 488d54d0e0             lea     rdx, [rax+rdx*8-20h]
00007ffa`4b76cedf 4883c120               add     rcx, 20h
00007ffa`4b76cee3 482bc8                 sub     rcx, rax
00007ffa`4b76cee6 33c0                   xor     eax, eax
00007ffa`4b76cee8 4883fa40               cmp     rdx, 40h
00007ffa`4b76ceec 722a                   jb      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x58 (00007ffa`4b76cf18)
00007ffa`4b76ceee 4c8bc2                 mov     r8, rdx
00007ffa`4b76cef1 4983e0c0               and     r8, 0FFFFFFFFFFFFFFC0h
00007ffa`4b76cef5 6666660f1f840000000000 nop     word ptr [rax+rax]

; body of "write 2" loop
00007ffa`4b76cf00 c5fd110401             vmovupd ymmword ptr [rcx+rax], ymm0
00007ffa`4b76cf05 4c8d4820               lea     r9, [rax+20h]  ; this really should be folded into the instruction below, Span<T>.Fill exhibits same behavior
00007ffa`4b76cf09 c4a17d110409           vmovupd ymmword ptr [rcx+r9], ymm0
00007ffa`4b76cf0f 4883c040               add     rax, 40h
00007ffa`4b76cf13 493bc0                 cmp     rax, r8
00007ffa`4b76cf16 72e8                   jb      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x40 (00007ffa`4b76cf00)

00007ffa`4b76cf18 448bc2                 mov     r8d, edx  ; unnecessary mov, but cheap, so whatever
00007ffa`4b76cf1b 41f6c020               test    r8b, 20h
00007ffa`4b76cf1f 7405                   je      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x66 (00007ffa`4b76cf26)
00007ffa`4b76cf21 c5fd110401             vmovupd ymmword ptr [rcx+rax], ymm0
00007ffa`4b76cf26 4883c2e0               add     rdx, 0FFFFFFFFFFFFFFE0h  ; this really should be folded into the instruction below, Span<T>.Fill exhibits same behavior
00007ffa`4b76cf2a c5fd110411             vmovupd ymmword ptr [rcx+rdx], ymm0
00007ffa`4b76cf2f c5f877                 vzeroupper 
00007ffa`4b76cf32 c3                     ret     

;; start of "can't vectorize" logic
00007ffa`4b76cf33 33c0                   xor     eax, eax
00007ffa`4b76cf35 448bc2                 mov     r8d, edx
00007ffa`4b76cf38 41f6c002               test    r8b, 2
00007ffa`4b76cf3c 740e                   je      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x8c (00007ffa`4b76cf4c)
00007ffa`4b76cf3e 33c0                   xor     eax, eax  ; this is redundant but cheap
00007ffa`4b76cf40 488901                 mov     qword ptr [rcx], rax
00007ffa`4b76cf43 48894108               mov     qword ptr [rcx+8], rax
00007ffa`4b76cf47 b802000000             mov     eax, 2
00007ffa`4b76cf4c f6c201                 test    dl, 1
00007ffa`4b76cf4f 7406                   je      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x97 (00007ffa`4b76cf57)
00007ffa`4b76cf51 33d2                   xor     edx, edx
00007ffa`4b76cf53 488914c1               mov     qword ptr [rcx+rax*8], rdx
00007ffa`4b76cf57 c5f877                 vzeroupper 
00007ffa`4b76cf5a c3                     ret   

Unsafe.Add(ref ip, 3) = default;
Unsafe.Add(ref Unsafe.Add(ref ip, (nint)pointerSizeLength), -3) = default;
Unsafe.Add(ref Unsafe.Add(ref ip, (nint)pointerSizeLength), -2) = default;
if (!Vector.IsHardwareAccelerated || Vector<byte>.Count / sizeof(IntPtr) > 8) // JIT turns this into constant true or false
Member

I'm missing it... why is this check necessary?

Member Author

The JIT won't properly elide the following if without this check. Here's the codegen when this check and the similar one a few lines down are removed: the fallback logic becomes much larger. Note that blocks d07b - d0ba and d0c1 - d0e2 will never be executed. The explicit if in the code above hints to the JIT that it shouldn't worry about this scenario. (That's also the crux behind the "skip this check if..." comments above these lines, but I can wordsmith if necessary.)

00007ffa`30eed000 c5f877                 vzeroupper 
00007ffa`30eed003 4883fa04               cmp     rdx, 4
00007ffa`30eed007 726a                   jb      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x73 (00007ffa`30eed073)

;; BEGIN MAIN VECTOR LOGIC
00007ffa`30eed009 c4e17c57c0             vxorps  ymm0, ymm0, ymm0
00007ffa`30eed00e c4e17d1101             vmovupd ymmword ptr [rcx], ymm0
; <snip>
00007ffa`30eed06f c5f877                 vzeroupper 
00007ffa`30eed072 c3                     ret     
;; END MAIN VECTOR LOGIC

;; BEGIN FALLBACK LOGIC
00007ffa`30eed073 33c0                   xor     eax, eax
00007ffa`30eed075 4883fa08               cmp     rdx, 8  ; rdx can never be < 8 (see lines d003 - d007 earlier)
00007ffa`30eed079 7241                   jb      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0xbc (00007ffa`30eed0bc)
00007ffa`30eed07b 4c8bc2                 mov     r8, rdx
00007ffa`30eed07e 4983e0f8               and     r8, 0FFFFFFFFFFFFFFF8h
00007ffa`30eed082 4c8bc8                 mov     r9, rax
00007ffa`30eed085 49c1e103               shl     r9, 3
00007ffa`30eed089 4533d2                 xor     r10d, r10d
00007ffa`30eed08c 4e891409               mov     qword ptr [rcx+r9], r10
00007ffa`30eed090 4e89540908             mov     qword ptr [rcx+r9+8], r10
00007ffa`30eed095 4e89540910             mov     qword ptr [rcx+r9+10h], r10
00007ffa`30eed09a 4e89540918             mov     qword ptr [rcx+r9+18h], r10
00007ffa`30eed09f 4e89540920             mov     qword ptr [rcx+r9+20h], r10
00007ffa`30eed0a4 4e89540928             mov     qword ptr [rcx+r9+28h], r10
00007ffa`30eed0a9 4e89540930             mov     qword ptr [rcx+r9+30h], r10
00007ffa`30eed0ae 4e89540938             mov     qword ptr [rcx+r9+38h], r10
00007ffa`30eed0b3 4883c008               add     rax, 8
00007ffa`30eed0b7 493bc0                 cmp     rax, r8
00007ffa`30eed0ba 72c6                   jb      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x82 (00007ffa`30eed082)
00007ffa`30eed0bc f6c204                 test    dl, 4  ; rdx & 4 will always result in 0 (see lines d003 - d007 earlier)
00007ffa`30eed0bf 7421                   je      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0xe2 (00007ffa`30eed0e2)
00007ffa`30eed0c1 4c8bc0                 mov     r8, rax
00007ffa`30eed0c4 49c1e003               shl     r8, 3
00007ffa`30eed0c8 4533c9                 xor     r9d, r9d
00007ffa`30eed0cb 4e890c01               mov     qword ptr [rcx+r8], r9
00007ffa`30eed0cf 4e894c0108             mov     qword ptr [rcx+r8+8], r9
00007ffa`30eed0d4 4e894c0110             mov     qword ptr [rcx+r8+10h], r9
00007ffa`30eed0d9 4e894c0118             mov     qword ptr [rcx+r8+18h], r9
00007ffa`30eed0de 4883c004               add     rax, 4
00007ffa`30eed0e2 f6c202                 test    dl, 2
00007ffa`30eed0e5 7417                   je      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0xfe (00007ffa`30eed0fe)
00007ffa`30eed0e7 4c8bc0                 mov     r8, rax
00007ffa`30eed0ea 49c1e003               shl     r8, 3
00007ffa`30eed0ee 4533c9                 xor     r9d, r9d
00007ffa`30eed0f1 4e890c01               mov     qword ptr [rcx+r8], r9
00007ffa`30eed0f5 4e894c0108             mov     qword ptr [rcx+r8+8], r9
00007ffa`30eed0fa 4883c002               add     rax, 2
00007ffa`30eed0fe f6c201                 test    dl, 1
00007ffa`30eed101 7406                   je      System_Private_CoreLib!System.SpanHelpers.ClearWithReferences(IntPtr ByRef,  UIntPtr)+0x109 (00007ffa`30eed109)
00007ffa`30eed103 33d2                   xor     edx, edx
00007ffa`30eed105 488914c1               mov     qword ptr [rcx+rax*8], rdx
00007ffa`30eed109 c5f877                 vzeroupper 
00007ffa`30eed10c c3                     ret
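To illustrate the pattern under discussion (a simplified sketch, not the PR's exact code): Vector.IsHardwareAccelerated and Vector<byte>.Count are treated as constants by the JIT, so a guard built from them folds to a constant true or false, and the branch it guards is either kept whole or deleted whole. I've used IntPtr.Size below so the sketch compiles without an unsafe context; the excerpt above uses sizeof(IntPtr), which the JIT also treats as a constant.

using System;
using System.Numerics;

static class ConstantFoldSketch
{
    static void Clear(ref IntPtr ip, nuint pointerSizeLength)
    {
        // Both operands below are JIT-time constants, so this entire 'if' disappears
        // from the generated code; only the surviving branch is emitted.
        if (!Vector.IsHardwareAccelerated || Vector<byte>.Count / IntPtr.Size > 8)
        {
            // scalar fallback would go here
            return;
        }

        // vectorized path would go here
    }
}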

@danmoseley
Member

Is it necessary to check generated code for ARM64 too? That's arguably as important as x64 these days and I'm guessing codegen is still a bit less mature (?)

@GrabYourPitchforks
Member Author

Is it necessary to check generated code for ARM64 too?

I'm not yet skilled at reading arm64 code, so I wouldn't be able to pass judgment on it. I figured that the benchmarks repo would pick up any regressions.

goto Write1;
Vector<byte> zero = default;
ref byte refDataAsBytes = ref Unsafe.As<IntPtr, byte>(ref ip);
Unsafe.WriteUnaligned(ref refDataAsBytes, zero);
Member

This assumes that Unsafe.WriteUnaligned is atomic for hardware-accelerated vectors. It is not a safe assumption to make in general. It may be the reason for the failing Mono tests.

Member Author

It doesn't need to be atomic across the entire vector. It only needs to be atomic at native word granularity. @tannergooding might know more, but I believe every retail microprocessor on the market provides this guarantee.

Member

jkotas commented Apr 20, 2021

You are making an assumption about the instruction that this is going to compile into.

The compiler (LLVM in this case) is free to compile this into byte-at-a-time copy, especially with optimizations off.

Member

Also, the IL interpreter is unlikely to execute this atomically either.

Member

If you would like to keep this, I think you may need a #define ATOMIC_UNALIGNED_VECTOR_OPERATIONS that won't be set for Mono.

Member

Does the ARM spec actually guarantee this?

Yes for ARM64.

I'm not 100% sure how it handles ARM32, given that would be 4x 32-bit reads/writes. I probably need to dig further into the spec and the ARM32-specific sections.

Member

In particular:

  • Reads to SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as a pair of single-copy atomic 64-bit reads.
  • Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as a pair of single-copy atomic 64-bit writes.

Member

Also noting, I don't think this guarantees that something like LLVM won't rewrite the copy loop to be "more efficient". I'd imagine Mono is explicitly marking certain loops to prevent LLVM from doing such optimizations.

Member Author

GrabYourPitchforks commented Apr 20, 2021

What if we scrapped the Vector.IsHardwareAccelerated check and dropped down to raw SSE2 / ARM64 intrinsics instead? Presumably that wouldn't provide the opportunity for anybody to rewrite these methods in a non-compliant manner? Though I wonder if that would hurt perf since it'd require pinning (and the associated stack spill) to use the intrinsics.

Could also use Jan's suggestion of wrapping the vectorized code in #if CORECLR, bypassing the whole issue.

Member

tannergooding commented Apr 20, 2021

I imagine that depends on which instructions are being used and what optimizations LLVM is doing. It is conceivably "correct" in C/C++ code for it to recognize _mm_load_si128 and _mm_store_si128 as a "trivial copy loop" and to replace it with ERMSB.

(and https://godbolt.org/z/hsfE4nKnr shows that it does recognize the "improved" code and replace it with memcpy 😄)

@GrabYourPitchforks
Member Author

GrabYourPitchforks commented Apr 20, 2021

@SamMonoRT @imhameed Can you comment on the discussion going on above? A brief summary:

The original code, when calling Span<T>.Clear() or Array.Clear(T[], ...) where T is a reference type or contains references, would essentially reinterpret_cast the buffer as a Span<IntPtr> and execute theSpan[i] = IntPtr.Zero; over and over in a loop.

The new code, given this same scenario, tries to perform vectorized writes if possible. If vectorization is not hardware accelerated, it falls back to a Span<IntPtr> loop, same as the original code.

In all cases, the Span<T> element addresses are natural word-aligned.

The new code does not rely on vector writes being atomic. However, it does rely on the guarantee that if a vector is written to a natural word-aligned address, each individual natural word-sized write will be atomic, even if the operation as a whole is not. In other words, if Thread A is writing a vector to a natural word-aligned address, then some other Thread B which is reading natural word-sized values from natural word-aligned addresses will never observe that any individual natural word-sized value is itself torn.

Is this assumption valid given the way that Mono and LLVM operate? Is there ever a scenario where, given the example above, Thread B might read a "torn" natural word-sized value from a natural word-aligned address?

Furthermore, is there potentially a problem with the way the code was originally written? Is it possible for the compiler to say "oh, you're just doing a bunch of writes to a Span<IntPtr>, let me turn that into a memset for you," and for such a substitution to void our atomicity guarantees?
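For concreteness, the original scalar pattern described in the summary above looks roughly like this (my paraphrase of the shape, not the actual SpanHelpers source):

using System;
using System.Runtime.CompilerServices;

static class ScalarClearSketch
{
    // Clears 'pointerSizeLength' pointer-sized slots starting at 'ip',
    // one word-aligned, word-sized store per element.
    static void ClearWithReferencesScalar(ref IntPtr ip, nuint pointerSizeLength)
    {
        for (nuint i = 0; i < pointerSizeLength; i++)
        {
            Unsafe.Add(ref ip, (nint)i) = IntPtr.Zero;
        }
    }
}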

@vargaz
Contributor

vargaz commented Apr 20, 2021

Mono only supports Vector on platforms with hardware SIMD support for it, so I believe Vector code is going to be compiled to hardware SIMD instructions; there is no emulation on our side. LLVM might do some emulation, but only if the optimization/arch flags we pass to it are messed up, and that would be a bug.

@imhameed
Contributor

imhameed commented Apr 21, 2021

@SamMonoRT @imhameed Can you comment on the discussion going on above? A brief summary:

The original code, when calling Span<T>.Clear() or Array.Clear(T[], ...) where T is a reference type or contains references, would essentially reinterpret_cast the buffer as a Span<IntPtr> and execute theSpan[i] = IntPtr.Zero; over and over in a loop.

The new code, given this same scenario, tries to perform vectorized writes if possible. If vectorization is not hardware accelerated, it falls back to a Span<IntPtr> loop, same as the original code.

In all cases, the Span<T> element addresses are natural word-aligned.

The new code does not rely on vector writes being atomic. However, it does rely on the guarantee that if a vector is written to a natural word-aligned address, each individual natural word-sized write will be atomic, even if the operation as a whole is not. In other words, if Thread A is writing a vector to a natural word-aligned address, then some other Thread B which is reading natural word-sized values from natural word-aligned addresses will never observe that any individual natural word-sized value is itself torn.

Is this assumption valid given the way that Mono and LLVM operate? Is there ever a scenario where, given the example above, Thread B might read a "torn" natural word-sized value from a natural word-aligned address?

The non-LLVM Mono lane is also failing. Right now, Mono's non-LLVM code generator emits unrolled bytewise stores for unaligned stores:
intrinsics.c
memory-access.c
This would be straightforward to improve.

The Mono LLVM IR generator just emits store instructions with alignment 1 for Unsafe.WriteUnaligned. LLVM can recognize memory-filling loops and replace them with calls to an intrinsic memset function. Nothing special is done to avoid this. I don't know offhand what, if any, guarantees are provided about the code that is generated by this LLVM intrinsic. This probably violates I.12.6.6 in ECMA-335, so we ought not to do this. The last time I tried using load atomic and store atomic with unordered ordering, the performance and codegen quality of the resulting code was pretty bad. But maybe it's worth looking into again.

We could also use volatile LLVM loads/stores unconditionally for Unsafe.ReadUnaligned and Unsafe.WriteUnaligned (and the unaligned IL prefix) and remove the pass that replaces store loops with memset from our pass list. But there's still a hole: I don't know what optimization is done on apps submitted to Apple as bitcode rather than machine code.

The Intel SDM doesn't guarantee that larger-than-64-bit stores that are simd-width-aligned or natural-word-aligned are atomic or are decomposed into individual atomic natural-word-sized stores, although I'd be surprised if any relevant hardware does something weird here.

Furthermore, is there potentially a problem with the way the code was originally written? Is it possible for the compiler to say "oh, you're just doing a bunch of writes to a Span<IntPtr>, let me turn that into a memset for you," and for such a substitution to void our atomicity guarantees?

Yes, I think that is possible.

(An old, related issue: mono/mono#16716)

I think for now I'd suggest making this new code CoreCLR-only, so that Mono-specific issues can be figured out later.

@GrabYourPitchforks
Member Author

Some quick experimentation shows that the C++ compiler does recognize a difference when optimizing looped writes to uint64_t* addresses vs. std::atomic_uint64_t* addresses.

#include <stdint.h>
#include <cstddef>
#include <atomic>

void WriteLongs(uint64_t* a, size_t len)
{
    // compiler replaces this loop with a call to memset
    for (size_t i = 0; i < len; i++)
    {
        a[i] = 0;
    }
}

void WriteAtomicLongs(std::atomic_uint64_t* a, size_t len)
{
    // compiler unrolls loop as a series of QWORD writes
    for (size_t i = 0; i < len; i++)
    {
        a[i].store(0, std::memory_order_relaxed);
    }
}
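A rough managed analogue of the second function (my sketch, not something in this PR): Volatile.Write keeps each word-sized store as an individual store with release semantics, so a compiler that honors volatile semantics shouldn't be able to fuse the loop into a memset-style block operation. Whether the resulting codegen is acceptable is a separate question, as noted above for LLVM's atomic stores.

using System;
using System.Runtime.CompilerServices;
using System.Threading;

static class VolatileClearSketch
{
    static void ClearWordsVolatile(ref IntPtr ip, nuint pointerSizeLength)
    {
        for (nuint i = 0; i < pointerSizeLength; i++)
        {
            // Each iteration is an individual volatile (release) pointer-sized store.
            Volatile.Write(ref Unsafe.Add(ref ip, (nint)i), IntPtr.Zero);
        }
    }
}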

Comment on lines +454 to +457
if ((pointerSizeLength & 1) != 0)
{
    Unsafe.Add(ref ip, (nint)i) = default;
}
Member

Suggested change:
-if ((pointerSizeLength & 1) != 0)
-{
-    Unsafe.Add(ref ip, (nint)i) = default;
-}
+Unsafe.Add(ref ip, (nint)(pointerSizeLength - 1)) = default;

This can be written at the end directly. The same point as in #51365 (comment) applies here.

-test      dl,1
-je        short M01_L04
-mov       [rcx+rax],0x0
+mov       [rcx+rdx+0FFFF],0x0

Unfortunately in the other PR I forgot to look at the scalar path 😢

if ((numElements & 1) != 0)
{
    Unsafe.Add(ref refData, (nint)i) = value;
}

Can you please update this? (I don't want to submit a separate PR for this mini change while you're already on it.)
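Presumably the analogous change for the Fill scalar path would be something like the following (my reading of the suggestion, assuming numElements is known to be non-zero at that point):

Unsafe.Add(ref refData, (nint)(numElements - 1)) = value;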

@GrabYourPitchforks
Member Author

Here's another interesting discovery: JIT will emit vectorized zeroing code when clearing multiple reference fields at once. So it looks like JIT is implicitly relying on this "the processor will never tear native words" behavior?

using System;
using System.Runtime.CompilerServices;

namespace MyApp
{
    class Program
    {
        public static void Main() { }

        [SkipLocalsInit]
        static void WriteRefStruct(ref MyStruct s)
        {
            s = default;
        }
    }

    internal struct MyStruct
    {
        object o0;
        object o1;
        object o2;
        object o3;
        object o4;
        object o5;
        object o6;
        object o7;
    }
}
; Method MyApp.Program:WriteRefStruct(byref)
G_M3391_IG01:
       vzeroupper 
						;; bbWeight=1    PerfScore 1.00

G_M3391_IG02:
       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rcx], xmm0
       vmovdqu  xmmword ptr [rcx+16], xmm0
       vmovdqu  xmmword ptr [rcx+32], xmm0
       vmovdqu  xmmword ptr [rcx+48], xmm0
						;; bbWeight=1    PerfScore 4.33

G_M3391_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 27

{
    if (pointerSizeLength >= 8)
    {
        nuint stopLoopAtOffset = pointerSizeLength & ~(nuint)7;
Member

Although I like this pattern, I'm not sure if

nuint stopLoopAt = pointerSizeLength - 8;
do
{
    // ...
} while ((i += 8) <= stopLoopAt);

is a "bit better".

-mov       r9,rdx
-and       r9,0FFFFFFFFFFFFFFF8
+lea       r9,[rdx+0FFF8]

I doubt it will make a measurable difference in either side.

My thinking is:

  • The mov r9,rdx likely won't reach the CPU's backend, but the and needs to be executed.
  • The lea can be handled in the AGU.
  • It's 3 bytes less code (hooray).
  • It's a bit more readable, though I assume anyone reading code like this is aware of these tricks.

Out of interest: did you have a reason to choose the bit-hack?

The same applies to other places too (for the record 😉).

Member Author

The bit-twiddling hack is muscle memory, since it allows this pattern to work:

nuint i = 0;
nuint lastOffset = (whatever) & ~(nuint)7;
for (; i != lastOffset; i += 8)
{
    /* do work */
}

The JIT will use the and from the preceding line as the input to the jcc at the start of the loop, which allows repurposing the bit-twiddling as the "are you even large enough?" comparison in the first place.

Though to your point, I agree that if we already do the comparison upfront, this twiddling isn't needed. Or we could skip the earlier comparison, eagerly perform the bit-twiddling, and use the result of the twiddling as the comparison.
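A compact sketch of that last variant, using the same placeholder names as the snippet above (my illustration of the idea, not code from the PR): the masked value doubles as the "do we have at least 8 elements?" check, so no separate up-front comparison is needed.

nuint i = 0;
nuint stopLoopAtOffset = pointerSizeLength & ~(nuint)7;  // round down to a multiple of 8
if (i != stopLoopAtOffset)                               // also answers "are there at least 8 elements?"
{
    do
    {
        /* write 8 pointer-sized elements starting at offset i */
        i += 8;
    } while (i != stopLoopAtOffset);
}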

Member

Thanks!

@GrabYourPitchforks
Member Author

@jkotas @tannergooding What do you make of the following pattern?

            if (Vector.IsHardwareAccelerated && pointerSizeLength >= 8)
            {
                // We have enough data for at least one bulk (vectorized) write.

                nuint currentByteOffset = 0;
                nuint byteOffsetWhereCanPerformFinalLoopIteration = (pointerSizeLength - 8) * (nuint)IntPtr.Size;

                // Write 8 refs, which is 256 - 512 bits, which is 1 - 4 SIMD vectors.

                do
                {
                    Unsafe.AddByteOffset(ref Unsafe.As<IntPtr, EightRefs>(ref ip), currentByteOffset) = default;
                } while ((currentByteOffset += (nuint)Unsafe.SizeOf<EightRefs>()) <= byteOffsetWhereCanPerformFinalLoopIteration);

                // Write 4 refs, which is 128 - 256 bits, which is 1 - 2 SIMD vectors.

                if ((pointerSizeLength & 4) != 0)
                {
                    Unsafe.AddByteOffset(ref Unsafe.As<IntPtr, FourRefs>(ref ip), currentByteOffset) = default;
                    currentByteOffset += (nuint)Unsafe.SizeOf<FourRefs>();
                }

// etc.

        private readonly struct EightRefs
        {
            private readonly object? _data0;
            private readonly object? _data1;
            // snip
            private readonly object? _data7;
        }

        private readonly struct FourRefs
        {
            private readonly object? _data0;
            private readonly object? _data1;
            private readonly object? _data2;
            private readonly object? _data3;
        }

// etc.

One positive attribute is that it offloads the concern from us to the JIT. The JIT becomes wholly responsible for ensuring that writes to object fields are atomic. If it wants to continue to use the pattern identified at #51534 (comment) it's free to do so, but at that point it becomes wholly a JIT concern and this method has no say in the matter.

However, I'm nervous about saying "hey, ip can be treated as an EightRefs&!", since if somebody tries looking at the span and interpreting its values as object references, I don't know what will happen. AFAIK nobody does this today, but I don't know if any aspect of the runtime reserves the right to do that. In theory I could work around this by using ByReference<byte> instead of object for the fields, but this would require EightRefs and FourRefs to become ref structs, and then Unsafe.As wouldn't compile.

@jkotas
Member

jkotas commented Apr 21, 2021

I don't know what will happen. AFAIK nobody does this today, but I don't know if any aspect of the runtime reserves the right to do that

This method makes assumptions about how the GC and JIT work. There is no way to write it in a fully future proof way. For example, if we were to switch to a different GC contract (e.g. require write barriers for any object reference changes, not just non-null object reference changes), it may need to be rewritten.

I think it is ok to have this method conditionalized or ifdefed as necessary to make it work well for different configurations we care about.

@GrabYourPitchforks
Member Author

I've split off the "is the current JIT behavior even legal?" question into #51638.

@GrabYourPitchforks
Member Author

GrabYourPitchforks commented Apr 21, 2021

I'm closing this PR for now because there are too many open questions with respect to what behavior is legal and what isn't. I've opened another thread so that JIT & processor experts can chime in. But TBH I suspect that investigation will be low priority. Once those questions are answered we can loop back here and see if there's anything worth salvaging.

Clarification: I think this work is still worth doing if we can get consensus on what's safe vs. what's not safe. But I don't expect us to be able to make progress on this in the immediate future. Hence closing for now so it doesn't go stale.

@ghost ghost locked as resolved and limited conversation to collaborators May 21, 2021