
Unroll Buffer.Memmove for constant lengths #83638

Merged · 18 commits · Mar 21, 2023

Conversation

EgorBo
Member

@EgorBo EgorBo commented Mar 18, 2023

Unroll Buffer.Memmove for constant lengths. Various patterns improve, e.g.:

void StringLiteral(Span<char> span) => "False".CopyTo(span);

void Utf8Literal(Span<byte> span) => "Hello World"u8.CopyTo(span);

void SliceCopy(ReadOnlySpan<byte> src, Span<byte> dest) => src.Slice(0, 128).CopyTo(dest);

bool AsSpanCopy(byte[] a, byte[] b) => a.AsSpan(0, 100).TryCopyTo(b);

byte[] RvaToArray() => "hello world"u8.ToArray();

char[] StringToCharArray() => "Test string".ToCharArray();

Codegen example:

void CopySpan(Span<int> dst, ReadOnlySpan<int> src) => 
    src.Slice(0, 30).TryCopyTo(dst); // copy 30*sizeof(int)=120 bytes

Was:

; Method P:CopySpan(System.Span`1[int],System.ReadOnlySpan`1[int]):this
       push     rbp
       mov      rbp, rsp
       cmp      r8d, 30
       jb       SHORT G_M33314_IG06
       mov      rdi, rsi
       cmp      edx, 30
       jb       SHORT G_M33314_IG04
       mov      edx, 120
       mov      rsi, rcx
       call     [System.Buffer:Memmove(byref,byref,ulong)]  ;; memmove call
G_M33314_IG04:
       nop      
       pop      rbp
       ret      
G_M33314_IG06:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     

Now:

; Method P:CopySpan(System.Span`1[int],System.ReadOnlySpan`1[int]):this
       push     rbp
       vzeroupper 
       mov      rbp, rsp
       cmp      r8d, 30
       jb       SHORT G_M33314_IG05
       cmp      edx, 30
       jb       SHORT G_M33314_IG04
       vmovdqu  ymm0, ymmword ptr[rcx]
       vmovdqu  ymm1, ymmword ptr[rcx+20H]
       vmovdqu  ymm2, ymmword ptr[rcx+40H]
       vmovdqu  ymm3, ymmword ptr[rcx+58H]
       vmovdqu  ymmword ptr[rsi], ymm0
       vmovdqu  ymmword ptr[rsi+20H], ymm1
       vmovdqu  ymmword ptr[rsi+40H], ymm2
       vmovdqu  ymmword ptr[rsi+58H], ymm3
G_M33314_IG04:
       pop      rbp
       ret      
G_M33314_IG05:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     

The whole src is saved into temporary SIMD registers before any store happens, so we can ignore the fact that src and dst might overlap. Currently, we support unrolling for lengths from 1 to 128 bytes (or up to 256 bytes with AVX-512 once it's enabled).
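The load-everything-then-store-everything strategy is what makes the overlap question moot. A minimal C sketch of the idea, with scalar 8-byte temporaries standing in for the ymm registers (`copy16_unrolled` is a hypothetical helper for illustration, not anything the JIT actually emits):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a constant 16 bytes: perform ALL loads into temporaries before
 * ANY store, so the result is correct even when dst overlaps src
 * (memmove semantics), with no runtime overlap check needed. */
static void copy16_unrolled(uint8_t *dst, const uint8_t *src)
{
    uint64_t t0, t1;
    memcpy(&t0, src, 8);       /* loads: stand-ins for vmovdqu ymm, [src] */
    memcpy(&t1, src + 8, 8);
    memcpy(dst, &t0, 8);       /* stores happen only after every load */
    memcpy(dst + 8, &t1, 8);
}
```

A byte-by-byte forward copy would corrupt the tail for an overlapping `dst > src` copy; buffering the whole source first sidesteps that.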

Benchmarks:

public static IEnumerable<object[]> TestArgs()
{
    yield return new object[] { new byte[128], new byte[128] };
}

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice4(byte[] dst, byte[] src) => src.AsSpan(0, 4).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice10(byte[] dst, byte[] src) => src.AsSpan(0, 10).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice26(byte[] dst, byte[] src) => src.AsSpan(0, 26).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice64(byte[] dst, byte[] src) => src.AsSpan(0, 64).CopyTo(dst);

[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice120(byte[] dst, byte[] src) => src.AsSpan(0, 120).CopyTo(dst);

// Overlapping

IntPtr nativeAlloc;

[GlobalSetup]
public void GlobalSetup() => nativeAlloc = (IntPtr)NativeMemory.AlignedAlloc(1024, 32); // aligned to reduce noise

[GlobalCleanup]
public void GlobalCleanup() => NativeMemory.AlignedFree((void*)nativeAlloc);

[Benchmark]
public void CopyConstSlice8_overlap() => 
    new Span<byte>((void*)nativeAlloc, 8).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 8));

[Benchmark]
public void CopyConstSlice32_overlap() => 
    new Span<byte>((void*)nativeAlloc, 32).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 32));

[Benchmark]
public void CopyConstSlice120_overlap() => 
    new Span<byte>((void*)nativeAlloc, 120).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 120));

Codegen diff for them: https://www.diffchecker.com/YLQJhBTE/

| Method | Toolchain | Mean | Ratio |
|-------------------------- |-------------------------- |----------:|------:|
| CopyConstSlice4 | \runtime-base\corerun.exe | 1.1995 ns | 3.52 |
| CopyConstSlice4 | \runtime-PR\corerun.exe | 0.3410 ns | 1.00 |
| CopyConstSlice10 | \runtime-base\corerun.exe | 1.2086 ns | 3.51 |
| CopyConstSlice10 | \runtime-PR\corerun.exe | 0.3445 ns | 1.00 |
| CopyConstSlice26 | \runtime-base\corerun.exe | 1.3104 ns | 2.50 |
| CopyConstSlice26 | \runtime-PR\corerun.exe | 0.5236 ns | 1.00 |
| CopyConstSlice64 | \runtime-base\corerun.exe | 1.4350 ns | 2.73 |
| CopyConstSlice64 | \runtime-PR\corerun.exe | 0.5253 ns | 1.00 |
| CopyConstSlice120 | \runtime-base\corerun.exe | 1.8053 ns | 3.39 |
| CopyConstSlice120 | \runtime-PR\corerun.exe | 0.5325 ns | 1.00 |
| CopyConstSlice8_overlap | \runtime-base\corerun.exe | 4.6385 ns | 1.95 |
| CopyConstSlice8_overlap | \runtime-PR\corerun.exe | 2.3922 ns | 1.00 |
| CopyConstSlice32_overlap | \runtime-base\corerun.exe | 4.0476 ns | 1.48 |
| CopyConstSlice32_overlap | \runtime-PR\corerun.exe | 2.7372 ns | 1.00 |
| CopyConstSlice120_overlap | \runtime-base\corerun.exe | 5.3873 ns | 1.63 |
| CopyConstSlice120_overlap | \runtime-PR\corerun.exe | 3.3125 ns | 1.00 |

Note that I was benchmarking on a very fast CPU (Ryzen 7950X, Windows 11 x64).

There are a few minor TODO-CQ notes in the code for future improvements (e.g. use ymm+GPR instead of two ymm registers to unroll 33 bytes). It's x64-only for now to simplify code review; I'll do arm64 right after this lands. 32-bit could also be implemented, but it adds extra complexity: on x86 we can only do byte-wide loads on specific registers, we'd need up to 4 GPR registers to handle 15 bytes, etc.
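The 120-byte codegen above also shows the overlapping-tail trick: the last ymm load/store pair sits at offset 0x58 (88 = 120 − 32), overlapping the previous 32-byte move, so a constant length that isn't a multiple of the vector width needs no scalar fixup. A hedged C sketch of the same idea at 8-byte granularity (`copy_8_to_16` is a hypothetical helper, not the JIT's code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (8 <= n <= 16) with exactly two 8-byte moves: the second
 * move starts at n - 8 and may overlap the first, mirroring how the JIT
 * places the final 32-byte ymm move at offset length - 32 for 120 bytes. */
static void copy_8_to_16(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);
    memcpy(&tail, src + n - 8, 8);  /* overlaps head when n < 16 */
    memcpy(dst, &head, 8);
    memcpy(dst + n - 8, &tail, 8);
}
```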

Motivation

To simplify hacks like these:

ulong fals_val = BitConverter.IsLittleEndian ? 0x73006C00610046ul : 0x460061006C0073ul; // "Fals"
MemoryMarshal.Write<ulong>(MemoryMarshal.AsBytes(destination), ref fals_val);
destination[4] = 'e';
charsWritten = 5;

cc @jkotas @stephentoub thoughts?

From my understanding, the "are overlapped" check is GC-safe and can be further improved (but not with Math.Abs). Also, I had to slightly improve IsKnownConstant, because when spans are involved we can only properly detect constants in late phases (after SSA/VN).

@ghost ghost assigned EgorBo Mar 18, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 18, 2023
@ghost

ghost commented Mar 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

If the pointers don't overlap and the size is known at JIT time, try to unroll the copy. Various shapes are supported, e.g.:

void WriteFalse(Span<char> span)
{
    "False".CopyTo(span);
}

void WriteHelloWorldUtf8(Span<byte> span)
{
    "Hello World"u8.CopyTo(span);
}

void CopyFirst128Bytes(ReadOnlySpan<byte> src, Span<byte> dest)
{
    src.Slice(0, 128).CopyTo(dest);
}

E.g. codegen for CopyFirst128Bytes: https://www.diffchecker.com/1JFxGxoF/

Benchmarks

static readonly char[] charBuffer = new char[1024];
static readonly byte[] byteBuffer1 = new byte[1024];
static readonly byte[] byteBuffer2 = new byte[1024];

[Benchmark] 
public void CopyConstSlice1() => byteBuffer1.AsSpan(0, 1).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice8() => byteBuffer1.AsSpan(0, 8).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice16() => byteBuffer1.AsSpan(0, 16).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice64() => byteBuffer1.AsSpan(0, 64).CopyTo(byteBuffer2);

[Benchmark] 
public void CopyConstSlice128() => byteBuffer1.AsSpan(0, 128).CopyTo(byteBuffer2);

|               Method |                 Toolchain |      Mean |
|--------------------- |-------------------------- |----------:|
|      CopyConstSlice1 | \runtime-Main\corerun.exe | 1.2544 ns |
|      CopyConstSlice1 |      \runtime\corerun.exe | 0.2108 ns |
|                      |                           |           |
|      CopyConstSlice8 | \runtime-Main\corerun.exe | 1.1305 ns |
|      CopyConstSlice8 |      \runtime\corerun.exe | 0.3892 ns |
|                      |                           |           |
|     CopyConstSlice16 | \runtime-Main\corerun.exe | 1.1323 ns |
|     CopyConstSlice16 |      \runtime\corerun.exe | 0.3480 ns |
|                      |                           |           |
|     CopyConstSlice64 | \runtime-Main\corerun.exe | 1.3809 ns |
|     CopyConstSlice64 |      \runtime\corerun.exe | 0.3970 ns |
|                      |                           |           |
|    CopyConstSlice128 | \runtime-Main\corerun.exe | 1.9375 ns |
|    CopyConstSlice128 |      \runtime\corerun.exe | 0.5410 ns |

Motivation

To simplify hacks like these:

ulong fals_val = BitConverter.IsLittleEndian ? 0x73006C00610046ul : 0x460061006C0073ul; // "Fals"
MemoryMarshal.Write<ulong>(MemoryMarshal.AsBytes(destination), ref fals_val);
destination[4] = 'e';
charsWritten = 5;

cc @jkotas @stephentoub thoughts?

From my understanding, the "are overlapped" check is GC-safe and can be further improved (but not with Math.Abs). Also, I had to slightly improve IsKnownConstant, because when spans are involved we can only properly detect constants in late phases (after SSA/VN).

Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@stephentoub
Member

Can we remove things like this?

// AppendLiteral is expected to be called by compiler-generated code with a literal string.
// By inlining it, the method body is exposed to the constant length of that literal, allowing the JIT to
// prune away the irrelevant cases. This effectively enables multiple implementations of AppendLiteral,
// special-cased on and optimized for the literal's length. We special-case lengths 1 and 2 because
// they're very common, e.g.
// 1: ' ', '.', '-', '\t', etc.
// 2: ", ", "0x", "=>", ": ", etc.
// and length 4 because it can similarly be done with a single read/write.
if (value.Length == 1)
{
    int pos = _pos;
    Span<char> chars = _chars;
    if ((uint)pos < (uint)chars.Length)
    {
        chars[pos] = value[0];
        _pos = pos + 1;
    }
    else
    {
        GrowThenCopyString(value);
    }
    return;
}

if (value.Length == 2)
{
    int pos = _pos;
    if ((uint)pos < _chars.Length - 1)
    {
        Unsafe.WriteUnaligned(
            ref Unsafe.As<char, byte>(ref Unsafe.Add(ref MemoryMarshal.GetReference(_chars), pos)),
            Unsafe.ReadUnaligned<int>(ref Unsafe.As<char, byte>(ref value.GetRawStringData())));
        _pos = pos + 2;
    }
    else
    {
        GrowThenCopyString(value);
    }
    return;
}

if (value.Length == 4)
{
    int pos = _pos;
    if ((uint)pos < _chars.Length - 3)
    {
        Unsafe.WriteUnaligned(
            ref Unsafe.As<char, byte>(ref Unsafe.Add(ref MemoryMarshal.GetReference(_chars), pos)),
            Unsafe.ReadUnaligned<long>(ref Unsafe.As<char, byte>(ref value.GetRawStringData())));
        _pos = pos + 4;
    }
    else
    {
        GrowThenCopyString(value);
    }
    return;
}

Review threads (outdated, resolved): src/coreclr/jit/morph.cpp (×2), src/coreclr/jit/valuenum.cpp, src/coreclr/jit/lowerxarch.cpp.
@EgorBo
Member Author

EgorBo commented Mar 18, 2023

Can we remove things like this?

Oh, good example, let me check.
Btw, this PR fixes #68103

@jkotas
Member

jkotas commented Mar 18, 2023

codegen for CopyFirst128Bytes: https://www.diffchecker.com/1JFxGxoF/

It would be nice to turn this into just this:

       cmp      dword ptr [rdx+08H], 128
       jb       SHORT G_M29327_IG07
       mov      rdx, bword ptr [rdx]
       mov      rcx, bword ptr [r8]
       mov      r8d, dword ptr [r8+08H]
       cmp      r8d, 128
       jb       SHORT G_M29327_IG08

       vmovdqu  ymm0, ymmword ptr[rdx]
       vmovdqu  ymm1, ymmword ptr[rdx+20H]
       vmovdqu  ymm2, ymmword ptr[rdx+40H]
       vmovdqu  ymm3, ymmword ptr[rdx+60H]
       vmovdqu  ymmword ptr[rcx], ymm0
       vmovdqu  ymmword ptr[rcx+20H], ymm1
       vmovdqu  ymmword ptr[rcx+40H], ymm2
       vmovdqu  ymmword ptr[rcx+60H], ymm3

       ret

@EgorBo
Member Author

EgorBo commented Mar 18, 2023

> It would be nice to turn this into just this: [codegen quoted above]

Right, but presumably it's not a simple change, since we'd also need to modify LSRA to tell it we're going to need up to 4 spare SIMD registers (plus scalars). We'd also need some separate phase after VN (but before lowering, since the VNs will be invalidated) to do these transformations, or to do them inside EarlyProp with some modifications.

I plan to prototype such a phase eventually, because it's the only way I can unroll SequenceEqual for utf8 literals.

@MichalPetryka
Contributor

> [quoting @jkotas's codegen suggestion and @EgorBo's reply above]

Could we, for now, assume that using up to 4 registers is fine, and change it later to only unroll when there are free ones?

@EgorBo
Member Author

EgorBo commented Mar 18, 2023

Ok, I'll try to do the whole thing in the JIT. It probably makes sense to still merge the IsKnownConstant improvements on their own, since it's now more powerful and able to detect constant Length for spans, etc.

@MichalPetryka
Contributor

Ok, I'll try to do the whole thing in the JIT. It probably makes sense to still merge the IsKnownConstant improvements on their own, since it's now more powerful and able to detect constant Length for spans, etc.

Does it make sense to split them off into a separate PR?

@EgorBo
Member Author

EgorBo commented Mar 19, 2023

Implemented in JIT, e.g.:

void Test(ReadOnlySpan<int> src, Span<int> dst) => 
    src.Slice(0, 30).CopyTo(dst);

it now emits (linux-x64):

; Method P:Test
G_M7025_IG01:
       push     rbp
       vzeroupper 
       mov      rbp, rsp
G_M7025_IG02:
       cmp      edx, 30
       jb       SHORT G_M7025_IG04
       cmp      r8d, 30
       jb       SHORT G_M7025_IG05
       vmovdqu  ymm0, ymmword ptr[rsi]
       vmovdqu  ymm1, ymmword ptr[rsi+20H]
       vmovdqu  ymm2, ymmword ptr[rsi+40H]
       vmovdqu  ymm3, ymmword ptr[rsi+58H]
       vmovdqu  ymmword ptr[rcx], ymm0
       vmovdqu  ymmword ptr[rcx+20H], ymm1
       vmovdqu  ymmword ptr[rcx+40H], ymm2
       vmovdqu  ymmword ptr[rcx+58H], ymm3
G_M7025_IG03:
       pop      rbp
       ret      
G_M7025_IG04:
       mov      rax, 0xD1FFAB1E      ; code for System.ThrowHelper:ThrowArgumentOutOfRangeException()
       call     [rax]System.ThrowHelper:ThrowArgumentOutOfRangeException()
       int3     
G_M7025_IG05:
       mov      rax, 0xD1FFAB1E      ; code for System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
       call     [rax]System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
       int3     
; Total bytes of code: 84

30 * sizeof(int) = 120 bytes = 4 ymm moves.
I did it in lowering, without optional optimizations for now (like proving that src and dst never actually overlap), and x64-only.

Does it make sense to split them off into a separate PR?

Yeah, I'll push the IsKnownConstant improvements separately once/if this lands.

@EgorBo EgorBo marked this pull request as draft March 19, 2023 02:01
@EgorBo
Member Author

EgorBo commented Mar 19, 2023

Or, in the case of the Boolean.cs hack:

bool TryWriteFalseUtf8(Span<byte> span)
{
    ReadOnlySpan<byte> str = "False"u8; // utf8 literal
    return str.TryCopyTo(span);
}

void WriteFalse(Span<char> span)
{
    string str = "False"; // utf16
    str.AsSpan().CopyTo(span);
}

; Method P:TryWriteFalseUtf8
G_M40086_IG01:
       push     rax
G_M40086_IG02:
       mov      rax, 0xD1FFAB1E          ;; RVA data for utf8 literal
       xor      edi, edi
       cmp      edx, 5
       jb       SHORT G_M40086_IG04
G_M40086_IG03:
       mov      edx, dword ptr [rax]     ;; TODO: could be folded to a constant (RVA[cns])
       mov      edi, dword ptr [rax+01H] ;; TODO: could be folded to a constant (RVA[cns])
       mov      dword ptr [rsi], edx
       mov      dword ptr [rsi+01H], edi
       mov      edi, 1
G_M40086_IG04:
       mov      eax, edi
G_M40086_IG05:
       add      rsp, 8
       ret      
; Total bytes of code: 40


; Method P:WriteFalse
G_M10586_IG01:
       push     rbp
       mov      rbp, rsp
G_M10586_IG02:
       mov      rax, 0xD1FFAB1E           ;; "False" frozen string literal
       add      rax, 12
       cmp      edx, 5
       jb       SHORT G_M10586_IG04
       mov      rdx, qword ptr [rax]      ;; TODO: could be folded to a constant (string[cns])
       mov      rdi, qword ptr [rax+02H]  ;; TODO: could be folded to a constant (string[cns])
       mov      qword ptr [rsi], rdx
       mov      qword ptr [rsi+02H], rdi
G_M10586_IG03:
       pop      rbp
       ret      
G_M10586_IG04:
       mov      rax, 0xD1FFAB1E
       call     [rax]System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
       int3     
; Total bytes of code: 52

(same for char, just CopyTo and many other patterns)

@EgorBo
Member Author

EgorBo commented Mar 19, 2023

Should be ready for review once CI passes; I checked GC info and tests. Will run jitstressregs and other outerloop pipelines, including gcstress.

New benchmarks: the benchmark code, the codegen diff for them (https://www.diffchecker.com/YLQJhBTE/), and the results table are the same as in the updated PR description above.

Comment on lines +51 to +52
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));
Contributor

Suggested change
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 95).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(95)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 96).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(96)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 97).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(97)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));

Member Author

@EgorBo EgorBo Mar 21, 2023

Will apply this in the arm64 follow-up PR if no further feedback is provided, to avoid spinning CI again.

Comment on lines +48 to +49
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));
Contributor

Suggested change
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 47).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(47)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 48).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(48)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 49).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(49)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));

Comment on lines +57 to +58
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));
Contributor

Suggested change
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 191).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(191)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 192).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(192)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 193).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(193)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));

Member Author

Thanks, will apply these in the follow-up PR.

@EgorBo
Member Author

EgorBo commented Mar 21, 2023

Failures seem to be #83655 (Mono-wasm)

Review threads (resolved): src/coreclr/jit/codegenxarch.cpp (×4), src/coreclr/jit/compiler.h.
@@ -1911,6 +1981,7 @@ void Lowering::LowerCall(GenTree* node)
JITDUMP("lowering call (after):\n");
DISPTREERANGE(BlockRange(), call);
JITDUMP("\n");
return nullptr;
Member

Couldn't this return call->gtNext and then we could avoid all the special-case checking for nullptr?

Member Author

I can't return call->gtNext here, because gtNext can be nullptr itself, so then it wouldn't be clear whether this method made any changes or not. In fact, I did that initially and hit an assert somewhere.

Review threads (resolved): src/coreclr/jit/lower.cpp, src/coreclr/jit/lsraxarch.cpp.
@EgorBo
Member Author

EgorBo commented Mar 24, 2023

@BruceForstall thanks for the feedback, addressed in #83740

@ghost ghost locked as resolved and limited conversation to collaborators Apr 23, 2023
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI