
Unroll Buffer.Memmove for constant lengths #83638

Merged · 18 commits · Mar 21, 2023

Conversation

EgorBo
Member

@EgorBo EgorBo commented Mar 18, 2023

Unroll Buffer.Memmove for constant lengths. Various patterns improve, e.g.:

void StringLiteral(Span<char> span) => "False".CopyTo(span);

void Utf8Literal(Span<byte> span) => "Hello World"u8.CopyTo(span);

void SliceCopy(ReadOnlySpan<byte> src, Span<byte> dest) => src.Slice(0, 128).CopyTo(dest);

bool AsSpanCopy(byte[] a, byte[] b) => a.AsSpan(0, 100).TryCopyTo(b);

byte[] RvaToArray() => "hello world"u8.ToArray();

char[] StringToCharArray() => "Test string".ToCharArray();

Codegen example:

void CopySpan(Span<int> dst, ReadOnlySpan<int> src) => 
    src.Slice(0, 30).TryCopyTo(dst); // copy 30*sizeof(int)=120 bytes

Was:

; Method P:CopySpan(System.Span`1[int],System.ReadOnlySpan`1[int]):this
       push     rbp
       mov      rbp, rsp
       cmp      r8d, 30
       jb       SHORT G_M33314_IG06
       mov      rdi, rsi
       cmp      edx, 30
       jb       SHORT G_M33314_IG04
       mov      edx, 120
       mov      rsi, rcx
       call     [System.Buffer:Memmove(byref,byref,ulong)]  ;; memmove call
G_M33314_IG04:
       nop      
       pop      rbp
       ret      
G_M33314_IG06:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     

Now:

; Method P:CopySpan(System.Span`1[int],System.ReadOnlySpan`1[int]):this
       push     rbp
       vzeroupper 
       mov      rbp, rsp
       cmp      r8d, 30
       jb       SHORT G_M33314_IG05
       cmp      edx, 30
       jb       SHORT G_M33314_IG04
       vmovdqu  ymm0, ymmword ptr[rcx]
       vmovdqu  ymm1, ymmword ptr[rcx+20H]
       vmovdqu  ymm2, ymmword ptr[rcx+40H]
       vmovdqu  ymm3, ymmword ptr[rcx+58H]
       vmovdqu  ymmword ptr[rsi], ymm0
       vmovdqu  ymmword ptr[rsi+20H], ymm1
       vmovdqu  ymmword ptr[rsi+40H], ymm2
       vmovdqu  ymmword ptr[rsi+58H], ymm3
G_M33314_IG04:
       pop      rbp
       ret      
G_M33314_IG05:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     

The whole src is saved into temporary SIMD registers before any store happens, so we can ignore the fact that src and dst might overlap. Currently, we support unrolling for lengths from 1 to 128 bytes (or up to 256 bytes with AVX-512 once it's enabled).
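The load-everything-then-store-everything strategy is what makes the overlap question moot. A minimal C sketch of the idea, with scalar 8-byte temporaries standing in for the ymm registers (`copy16_unrolled` is a hypothetical helper for illustration, not anything the JIT actually emits):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a constant 16 bytes: perform ALL loads into temporaries before
 * ANY store, so the result is correct even when dst overlaps src
 * (memmove semantics), with no runtime overlap check needed. */
static void copy16_unrolled(uint8_t *dst, const uint8_t *src)
{
    uint64_t t0, t1;
    memcpy(&t0, src, 8);       /* loads: stand-ins for vmovdqu ymm, [src] */
    memcpy(&t1, src + 8, 8);
    memcpy(dst, &t0, 8);       /* stores happen only after every load */
    memcpy(dst + 8, &t1, 8);
}
```

A byte-by-byte forward copy would corrupt the tail for an overlapping `dst > src` copy; buffering the whole source first sidesteps that.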

Benchmarks:

public static IEnumerable<object[]> TestArgs()
{
    yield return new object[] { new byte[128], new byte[128] };
}

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice4(byte[] dst, byte[] src) => src.AsSpan(0, 4).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice10(byte[] dst, byte[] src) => src.AsSpan(0, 10).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice26(byte[] dst, byte[] src) => src.AsSpan(0, 26).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice64(byte[] dst, byte[] src) => src.AsSpan(0, 64).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice120(byte[] dst, byte[] src) => src.AsSpan(0, 120).CopyTo(dst);

// Overlapping

IntPtr nativeAlloc;

[GlobalSetup]
public void GlobalSetup() => nativeAlloc = (IntPtr)NativeMemory.AlignedAlloc(1024, 32); // aligned to reduce noise

[GlobalCleanup]
public void GlobalCleanup() => NativeMemory.AlignedFree((void*)nativeAlloc);

[Benchmark]
public void CopyConstSlice8_overlap() => 
    new Span<byte>((void*)nativeAlloc, 8).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 8));

[Benchmark]
public void CopyConstSlice32_overlap() => 
    new Span<byte>((void*)nativeAlloc, 32).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 32));

[Benchmark]
public void CopyConstSlice120_overlap() => 
    new Span<byte>((void*)nativeAlloc, 120).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 120));

Codegen diff for them: https://www.diffchecker.com/YLQJhBTE/

| Method | Toolchain | Mean | Ratio |
|-------------------------- |-------------------------- |----------:|------:|
| CopyConstSlice4 | \runtime-base\corerun.exe | 1.1995 ns | 3.52 |
| CopyConstSlice4 | \runtime-PR\corerun.exe | 0.3410 ns | 1.00 |
| CopyConstSlice10 | \runtime-base\corerun.exe | 1.2086 ns | 3.51 |
| CopyConstSlice10 | \runtime-PR\corerun.exe | 0.3445 ns | 1.00 |
| CopyConstSlice26 | \runtime-base\corerun.exe | 1.3104 ns | 2.50 |
| CopyConstSlice26 | \runtime-PR\corerun.exe | 0.5236 ns | 1.00 |
| CopyConstSlice64 | \runtime-base\corerun.exe | 1.4350 ns | 2.73 |
| CopyConstSlice64 | \runtime-PR\corerun.exe | 0.5253 ns | 1.00 |
| CopyConstSlice120 | \runtime-base\corerun.exe | 1.8053 ns | 3.39 |
| CopyConstSlice120 | \runtime-PR\corerun.exe | 0.5325 ns | 1.00 |
| CopyConstSlice8_overlap | \runtime-base\corerun.exe | 4.6385 ns | 1.95 |
| CopyConstSlice8_overlap | \runtime-PR\corerun.exe | 2.3922 ns | 1.00 |
| CopyConstSlice32_overlap | \runtime-base\corerun.exe | 4.0476 ns | 1.48 |
| CopyConstSlice32_overlap | \runtime-PR\corerun.exe | 2.7372 ns | 1.00 |
| CopyConstSlice120_overlap | \runtime-base\corerun.exe | 5.3873 ns | 1.63 |
| CopyConstSlice120_overlap | \runtime-PR\corerun.exe | 3.3125 ns | 1.00 |

Note that I was benchmarking on a very fast CPU (Ryzen 7950X, Windows 11 x64).

There are a few minor TODO-CQ notes in the code for future improvements (e.g. use ymm+GPR instead of two ymm registers to unroll 33 bytes). It's x64-only for now to simplify code review; I'll do arm64 right after this lands. 32-bit could also be implemented, but it adds extra complexity: on x86 we can only do byte-wide loads on specific registers, we'd need up to 4 GPR registers to handle 15 bytes, etc.
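The 120-byte codegen above also shows the overlapping-tail trick: the last ymm load/store pair sits at offset 0x58 (88 = 120 − 32), overlapping the previous 32-byte move, so a constant length that isn't a multiple of the vector width needs no scalar fixup. A hedged C sketch of the same idea at 8-byte granularity (`copy_8_to_16` is a hypothetical helper, not the JIT's code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (8 <= n <= 16) with exactly two 8-byte moves: the second
 * move starts at n - 8 and may overlap the first, mirroring how the JIT
 * places the final 32-byte ymm move at offset length - 32 for 120 bytes. */
static void copy_8_to_16(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);
    memcpy(&tail, src + n - 8, 8);  /* overlaps head when n < 16 */
    memcpy(dst, &head, 8);
    memcpy(dst + n - 8, &tail, 8);
}
```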

Motivation

To simplify hacks like these:

ulong fals_val = BitConverter.IsLittleEndian ? 0x73006C00610046ul : 0x460061006C0073ul; // "Fals"
MemoryMarshal.Write<ulong>(MemoryMarshal.AsBytes(destination), ref fals_val);
destination[4] = 'e';
charsWritten = 5;

cc @jkotas @stephentoub thoughts?

From my understanding, the "are overlapped" check is GC-safe and can be further improved (but not with Math.Abs). Also, I had to slightly improve IsKnownConstant, because when spans are involved we can only properly detect constants in late phases (after SSA/VN).

@ghost ghost assigned EgorBo Mar 18, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 18, 2023
@ghost

ghost commented Mar 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

If the pointers don't overlap and the size is known at JIT time, try to unroll the copy. Various shapes are supported, e.g.:

void WriteFalse(Span<char> span)
{
    "False".CopyTo(span);
}

void WriteHelloWorldUtf8(Span<byte> span)
{
    "Hello World"u8.CopyTo(span);
}

void CopyFirst128Bytes(ReadOnlySpan<byte> src, Span<byte> dest)
{
    src.Slice(0, 128).CopyTo(dest);
}

E.g. codegen for CopyFirst128Bytes: https://www.diffchecker.com/1JFxGxoF/

Benchmarks

static readonly char[] charBuffer = new char[1024];
static readonly byte[] byteBuffer1 = new byte[1024];
static readonly byte[] byteBuffer2 = new byte[1024];

[Benchmark] 
public void CopyConstSlice1() => byteBuffer1.AsSpan(0, 1).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice8() => byteBuffer1.AsSpan(0, 8).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice16() => byteBuffer1.AsSpan(0, 16).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice64() => byteBuffer1.AsSpan(0, 64).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice128() => byteBuffer1.AsSpan(0, 128).CopyTo(byteBuffer2);

|               Method |                 Toolchain |      Mean |
|--------------------- |-------------------------- |----------:|
|      CopyConstSlice1 | \runtime-Main\corerun.exe | 1.2544 ns |
|      CopyConstSlice1 |      \runtime\corerun.exe | 0.2108 ns |
|                      |                           |           |
|      CopyConstSlice8 | \runtime-Main\corerun.exe | 1.1305 ns |
|      CopyConstSlice8 |      \runtime\corerun.exe | 0.3892 ns |
|                      |                           |           |
|     CopyConstSlice16 | \runtime-Main\corerun.exe | 1.1323 ns |
|     CopyConstSlice16 |      \runtime\corerun.exe | 0.3480 ns |
|                      |                           |           |
|     CopyConstSlice64 | \runtime-Main\corerun.exe | 1.3809 ns |
|     CopyConstSlice64 |      \runtime\corerun.exe | 0.3970 ns |
|                      |                           |           |
|    CopyConstSlice128 | \runtime-Main\corerun.exe | 1.9375 ns |
|    CopyConstSlice128 |      \runtime\corerun.exe | 0.5410 ns |

Motivation

To simplify hacks like these:

ulong fals_val = BitConverter.IsLittleEndian ? 0x73006C00610046ul : 0x460061006C0073ul; // "Fals"
MemoryMarshal.Write<ulong>(MemoryMarshal.AsBytes(destination), ref fals_val);
destination[4] = 'e';
charsWritten = 5;

cc @jkotas @stephentoub thoughts?

From my understanding, the "are overlapped" check is GC-safe and can be further improved (but not with Math.Abs). Also, I had to slightly improve IsKnownConstant, because when spans are involved we can only properly detect constants in late phases (after SSA/VN).

Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@stephentoub
Member

Can we remove things like this?

// AppendLiteral is expected to be called by compiler-generated code with a literal string.
// By inlining it, the method body is exposed to the constant length of that literal, allowing the JIT to
// prune away the irrelevant cases. This effectively enables multiple implementations of AppendLiteral,
// special-cased on and optimized for the literal's length. We special-case lengths 1 and 2 because
// they're very common, e.g.
// 1: ' ', '.', '-', '\t', etc.
// 2: ", ", "0x", "=>", ": ", etc.
// and length 4 because it can similarly be done with a single read/write.
if (value.Length == 1)
{
    int pos = _pos;
    Span<char> chars = _chars;
    if ((uint)pos < (uint)chars.Length)
    {
        chars[pos] = value[0];
        _pos = pos + 1;
    }
    else
    {
        GrowThenCopyString(value);
    }
    return;
}

if (value.Length == 2)
{
    int pos = _pos;
    if ((uint)pos < _chars.Length - 1)
    {
        Unsafe.WriteUnaligned(
            ref Unsafe.As<char, byte>(ref Unsafe.Add(ref MemoryMarshal.GetReference(_chars), pos)),
            Unsafe.ReadUnaligned<int>(ref Unsafe.As<char, byte>(ref value.GetRawStringData())));
        _pos = pos + 2;
    }
    else
    {
        GrowThenCopyString(value);
    }
    return;
}

if (value.Length == 4)
{
    int pos = _pos;
    if ((uint)pos < _chars.Length - 3)
    {
        Unsafe.WriteUnaligned(
            ref Unsafe.As<char, byte>(ref Unsafe.Add(ref MemoryMarshal.GetReference(_chars), pos)),
            Unsafe.ReadUnaligned<long>(ref Unsafe.As<char, byte>(ref value.GetRawStringData())));
        _pos = pos + 4;
    }
    else
    {
        GrowThenCopyString(value);
    }
    return;
}

Review threads (outdated, resolved): src/coreclr/jit/morph.cpp (×2), src/coreclr/jit/valuenum.cpp, src/coreclr/jit/lowerxarch.cpp.
@EgorBo
Member Author

EgorBo commented Mar 18, 2023

Can we remove things like this?

Oh, good example, let me check.
Btw, this PR fixes #68103

@jkotas
Member

jkotas commented Mar 18, 2023

codegen for CopyFirst128Bytes: https://www.diffchecker.com/1JFxGxoF/

It would be nice to turn this into just this:

       cmp      dword ptr [rdx+08H], 128
       jb       SHORT G_M29327_IG07
       mov      rdx, bword ptr [rdx]
       mov      rcx, bword ptr [r8]
       mov      r8d, dword ptr [r8+08H]
       cmp      r8d, 128
       jb       SHORT G_M29327_IG08

       vmovdqu  ymm0, ymmword ptr[rdx]
       vmovdqu  ymm1, ymmword ptr[rdx+20H]
       vmovdqu  ymm2, ymmword ptr[rdx+40H]
       vmovdqu  ymm3, ymmword ptr[rdx+60H]
       vmovdqu  ymmword ptr[rcx], ymm0
       vmovdqu  ymmword ptr[rcx+20H], ymm1
       vmovdqu  ymmword ptr[rcx+40H], ymm2
       vmovdqu  ymmword ptr[rcx+60H], ymm3

       ret

@EgorBo
Member Author

EgorBo commented Mar 18, 2023

> It would be nice to turn this into just this: [codegen quoted above]

Right, but presumably it's not a simple change, since we'd also need to modify LSRA to tell it we're going to need up to 4 spare SIMD registers (plus scalars). We'd also need some separate phase after VN (but before lowering, since the VNs will be invalidated) to do these transformations, or to do them inside EarlyProp with some modifications.

I plan to prototype such a phase eventually, because it's the only way I can unroll SequenceEqual for utf8 literals.

@MichalPetryka
Contributor

> [quoting @jkotas's codegen suggestion and @EgorBo's reply above]

Could we, for now, assume that using up to 4 registers is fine, and change it later to only unroll when there are free ones?

@EgorBo
Member Author

EgorBo commented Mar 18, 2023

Ok, I'll try to do the whole thing in the JIT. It probably makes sense to still merge the IsKnownConstant improvements on their own, since it's now more powerful and able to detect constant Length for spans, etc.

@MichalPetryka
Contributor

Ok, I'll try to do the whole thing in the JIT. It probably makes sense to still merge the IsKnownConstant improvements on their own, since it's now more powerful and able to detect constant Length for spans, etc.

Does it make sense to split them off into a separate PR?

@EgorBo
Member Author

EgorBo commented Mar 19, 2023

Implemented in JIT, e.g.:

void Test(ReadOnlySpan<int> src, Span<int> dst) => 
    src.Slice(0, 30).CopyTo(dst);

it now emits (linux-x64):

; Method P:Test
G_M7025_IG01:
       push     rbp
       vzeroupper 
       mov      rbp, rsp
G_M7025_IG02:
       cmp      edx, 30
       jb       SHORT G_M7025_IG04
       cmp      r8d, 30
       jb       SHORT G_M7025_IG05
       vmovdqu  ymm0, ymmword ptr[rsi]
       vmovdqu  ymm1, ymmword ptr[rsi+20H]
       vmovdqu  ymm2, ymmword ptr[rsi+40H]
       vmovdqu  ymm3, ymmword ptr[rsi+58H]
       vmovdqu  ymmword ptr[rcx], ymm0
       vmovdqu  ymmword ptr[rcx+20H], ymm1
       vmovdqu  ymmword ptr[rcx+40H], ymm2
       vmovdqu  ymmword ptr[rcx+58H], ymm3
G_M7025_IG03:
       pop      rbp
       ret      
G_M7025_IG04:
       mov      rax, 0xD1FFAB1E      ; code for System.ThrowHelper:ThrowArgumentOutOfRangeException()
       call     [rax]System.ThrowHelper:ThrowArgumentOutOfRangeException()
       int3     
G_M7025_IG05:
       mov      rax, 0xD1FFAB1E      ; code for System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
       call     [rax]System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
       int3     
; Total bytes of code: 84

30 * sizeof(int) = 120 bytes = 4 ymm moves.
I did it in lowering, without optional optimizations for now (like proving that src and dst never actually overlap), and x64-only.

Does it make sense to split them off into a separate PR?

Yeah, I'll push the IsKnownConstant improvements separately once/if this lands.

@EgorBo EgorBo marked this pull request as draft March 19, 2023 02:01
@EgorBo
Member Author

EgorBo commented Mar 19, 2023

Or, in the case of the Boolean.cs hack:

bool TryWriteFalseUtf8(Span<byte> span)
{
    ReadOnlySpan<byte> str = "False"u8; // utf8 literal
    return str.TryCopyTo(span);
}

void WriteFalse(Span<char> span)
{
    string str = "False"; // utf16
    str.AsSpan().CopyTo(span);
}

; Method P:TryWriteFalseUtf8
G_M40086_IG01:
       push     rax
G_M40086_IG02:
       mov      rax, 0xD1FFAB1E          ;; RVA data for utf8 literal
       xor      edi, edi
       cmp      edx, 5
       jb       SHORT G_M40086_IG04
G_M40086_IG03:
       mov      edx, dword ptr [rax]     ;; TODO: could be folded to a constant (RVA[cns])
       mov      edi, dword ptr [rax+01H] ;; TODO: could be folded to a constant (RVA[cns])
       mov      dword ptr [rsi], edx
       mov      dword ptr [rsi+01H], edi
       mov      edi, 1
G_M40086_IG04:
       mov      eax, edi
G_M40086_IG05:
       add      rsp, 8
       ret      
; Total bytes of code: 40


; Method P:WriteFalse
G_M10586_IG01:
       push     rbp
       mov      rbp, rsp
G_M10586_IG02:
       mov      rax, 0xD1FFAB1E           ;; "False" frozen string literal
       add      rax, 12
       cmp      edx, 5
       jb       SHORT G_M10586_IG04
       mov      rdx, qword ptr [rax]      ;; TODO: could be folded to a constant (string[cns])
       mov      rdi, qword ptr [rax+02H]  ;; TODO: could be folded to a constant (string[cns])
       mov      qword ptr [rsi], rdx
       mov      qword ptr [rsi+02H], rdi
G_M10586_IG03:
       pop      rbp
       ret      
G_M10586_IG04:
       mov      rax, 0xD1FFAB1E
       call     [rax]System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
       int3     
; Total bytes of code: 52

(same for char, just CopyTo and many other patterns)

@EgorBo
Member Author

EgorBo commented Mar 19, 2023

Should be ready for review once CI passes; I checked GC info and tests. Will run jitstressregs and other outerloop pipelines, including gcstress.

New benchmarks: the benchmark code, the codegen diff for them (https://www.diffchecker.com/YLQJhBTE/), and the results table are the same as in the updated PR description above.

Comment on lines +51 to +52
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));
Contributor

Suggested change
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 95).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(95)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 96).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(96)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 97).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(97)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));

Member Author

@EgorBo EgorBo Mar 21, 2023

Will apply this in the arm64 follow-up PR if no further feedback is provided, to avoid spinning CI again.

Comment on lines +48 to +49
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));
Contributor

Suggested change
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 47).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(47)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 48).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(48)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 49).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(49)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));

Comment on lines +57 to +58
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));
Contributor

Suggested change
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 191).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(191)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 192).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(192)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 193).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(193)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));

Member Author

Thanks, will apply these in the follow-up PR.

@EgorBo
Member Author

EgorBo commented Mar 21, 2023

Failures seem to be #83655 (Mono-wasm)

Review threads (resolved): src/coreclr/jit/codegenxarch.cpp (×4), src/coreclr/jit/compiler.h.
@@ -1911,6 +1981,7 @@ void Lowering::LowerCall(GenTree* node)
JITDUMP("lowering call (after):\n");
DISPTREERANGE(BlockRange(), call);
JITDUMP("\n");
return nullptr;
Member

Couldn't this return call->gtNext and then we could avoid all the special-case checking for nullptr?

Member Author

I can't return call->gtNext here, because gtNext can be nullptr itself, so then it wouldn't be clear whether this method made any changes or not. In fact, I did that initially and hit an assert somewhere.

Review threads (resolved): src/coreclr/jit/lower.cpp, src/coreclr/jit/lsraxarch.cpp.
@EgorBo
Member Author

EgorBo commented Mar 24, 2023

@BruceForstall thanks for the feedback, addressed in #83740

@ghost ghost locked as resolved and limited conversation to collaborators Apr 23, 2023
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI