This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Memmove: Use Vector types rather than Custom blocks #25172

Closed
wants to merge 4 commits

Conversation

benaadams
Member

With "Allow pregenerating most HW intrinsics in CoreLib" #24917, we can use the Vector128 and Vector256 types and IsSupported check and still have the methods R2R AoT compiled for fast start; using 128-bit copies.

When the methods are re-compiled at Tier-1 they then upgrade to 256-bit copies.

This allows the best of both worlds.
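
Roughly, the pattern is just reinterpreting the byte refs as vector types and letting the JIT pick the widest copy it can (an illustrative sketch, not the exact CoreLib code; the names here are mine):

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;

    internal static class BlockCopySketch
    {
        // Copies a 32-byte "block"; the R2R (SSE2 baseline) code lowers this to two
        // 16-byte movdqu load/store pairs, while a Tier-1 recompile on AVX hardware
        // can emit a single 32-byte vmovupd load/store pair.
        internal static void CopyBlock32(ref byte dest, ref byte src)
        {
            Unsafe.As<byte, Vector256<byte>>(ref dest) = Unsafe.As<byte, Vector256<byte>>(ref src);
        }
    }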

/cc @jkotas @GrabYourPitchforks @tannergooding @MichalStrehovsky

Notes:

Doubled the maximum length handled by the managed copy after the upgrade, since the register size doubles.

The R2R version is almost identical; however, some leas creep in (I think @mikedn has a fix for this):

G_M6842_IG11:
       movdqu   xmm0, qword ptr [rdx]
       movdqu   qword ptr [rcx], xmm0
       movdqu   xmm0, qword ptr [rdx+16]
       movdqu   qword ptr [rcx+16], xmm0
       movdqu   xmm0, qword ptr [rdx+32]
       movdqu   qword ptr [rcx+32], xmm0
       movdqu   xmm0, qword ptr [rdx+48]
       movdqu   qword ptr [rcx+48], xmm0
       add      rcx, 64
       add      rdx, 64

becomes

G_M6856_IG11:
       mov      r11, rcx
       movdqu   xmm0, qword ptr [rdx]
       movdqu   qword ptr [r11], xmm0
       movdqu   xmm0, qword ptr [rdx+16]
       movdqu   qword ptr [r11+16], xmm0
       lea      r11, [rcx+32]
       lea      rsi, [rdx+32]
       movdqu   xmm0, qword ptr [rsi]
       movdqu   qword ptr [r11], xmm0
       movdqu   xmm0, qword ptr [rsi+16]
       movdqu   qword ptr [r11+16], xmm0
       add      rcx, 64
       add      rdx, 64

The Tier-1 version likewise:

       vmovupd  ymm0, ymmword ptr[rdx]
       vmovupd  ymmword ptr[rcx], ymm0
       lea      r11, bword ptr [rcx+32]
       lea      rsi, bword ptr [rdx+32]
       vmovupd  ymm0, ymmword ptr[rsi]
       vmovupd  ymmword ptr[r11], ymm0
       add      rcx, 64
       add      rdx, 64

@tannergooding
Member

When the methods are re-compiled at Tier-1 they then upgrade to 256-bit copies.

What is the measured difference between 128-bit only and also supporting 256-bit here?

@@ -158,7 +160,7 @@ internal static unsafe void Memmove(byte* dest, byte* src, nuint len)
 // Copy bytes which are multiples of 16 and leave the remainder for MCPY01 to handle.
 Debug.Assert(len > 16 && len <= 64);
 #if HAS_CUSTOM_BLOCKS
-*(Block16*)dest = *(Block16*)src; // [0,16]
+Unsafe.AsRef<Vector128<byte>>(dest) = Unsafe.AsRef<Vector128<byte>>(src); // [0,16]
Member


We already have a pointer, so why not explicitly Load? That would more easily allow use of the non-temporal loads, if desirable (and make it easier to test them).

Member Author


Don't know if non-temporal is desirable.

It would also mean every copy would need to be duplicated and guarded with an .IsSupported check, rather than just working?

Member


It would also mean every copy would need to be duplicated and guarded with an .IsSupported check, rather than just working?

I think providing a Load helper method could handle that and abstract away the x86 vs Arm vs fallback cases.
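
Something along these lines, perhaps (a hypothetical sketch; Load128 is not an existing helper):

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    internal static class MemmoveLoads
    {
        // Hypothetical helper: explicit load on x86, plain unaligned read elsewhere.
        // A non-temporal load could be swapped in behind the same signature
        // (and tested) without touching the call sites.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static unsafe Vector128<byte> Load128(byte* src)
        {
            if (Sse2.IsSupported)
                return Sse2.LoadVector128(src);
            return Unsafe.ReadUnaligned<Vector128<byte>>(src);
        }
    }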

@@ -372,7 +391,7 @@ private static void Memmove(ref byte dest, ref byte src, nuint len)
 if (len <= 32)
 goto MCPY01;
 #if HAS_CUSTOM_BLOCKS
-Unsafe.As<byte, Block16>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Block16>(ref Unsafe.Add(ref src, 16)); // [0,32]
+Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16)); // [0,32]
Member


Given that we pin for long arrays, and have handling for short arrays, is there any reason we can't just pin for the medium array length case as well?

Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16)); is hard to read and easy to get wrong.

Member Author


There are two different methods: one that takes pointers, and this one that takes refs and doesn't pin anything.

Member


It pins if it hits the native fallback case (long arrays), where it calls _Memmove.
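
i.e. roughly this shape (illustrative only, not the actual _Memmove code):

    using System;

    internal static class MemmoveSketch
    {
        // Illustrative only: once the length crosses the native threshold, the
        // ref-based path pins the refs and falls through to a pointer-based copy
        // (the real code calls the pointer-based Memmove / native memmove).
        internal static unsafe void CopyViaPinning(ref byte dest, ref byte src, nuint len)
        {
            fixed (byte* pDest = &dest)
            fixed (byte* pSrc = &src)
            {
                Buffer.MemoryCopy(pSrc, pDest, len, len);
            }
        }
    }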

@benaadams
Member Author

When the methods are re-compiled at Tier-1 they then upgrade to 256-bit copies.

What is the measured difference between 128-bit only and also supporting 256-bit here?

For >= Haswell the xmm and ymm instructions have the same clock-cycle cost (though one of the ymm ones has +1/3 latency):

[image: xmm vs ymm instruction latency/throughput table]

As it's operating on double the data size, it should be a net win?

I also wanted a good excuse to double the pinvoke switch-over length 😉
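
Something like (hypothetical constants; the real threshold and condition aren't shown here):

    using System.Runtime.Intrinsics.X86;

    internal static class MemmoveTuning
    {
        // Hypothetical sketch only: double the length at which the managed copy
        // gives up and P/Invokes native memmove when 256-bit copies are available.
        internal static nuint PInvokeThreshold => Avx.IsSupported ? (nuint)4096 : (nuint)2048;
    }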


@jkotas
Member

jkotas commented Jun 14, 2019

Any numbers from micro-benchmarks demonstrating the improvement?

@tannergooding
Member

As it's operating on double the data size, it should be a net win?

Some processors, especially newer ones, can dispatch two 128-bit instructions or one 256-bit instruction per cycle. There are some potential drawbacks to using the 256-bit instructions, especially when alignment is not guaranteed (and the minor downclocking that can occur). So, I was wondering if we had any numbers showing that the 256-bit path was worth the additional maintenance cost (or if just having a 128-bit path is sufficient for 95% of cases).

@benaadams
Member Author

So, I was wondering if we had any numbers showing that the 256-bit path was worth the additional maintenance cost

I can minimise the changes instead so it deletes more than it adds, as per the last commit [+8 -12]; so less maintenance cost than currently?

Will follow up with perf numbers

@@ -4,6 +4,8 @@
 
 #if AMD64 || ARM64 || (BIT32 && !ARM)
 #define HAS_CUSTOM_BLOCKS
+using Block16 = System.Runtime.Intrinsics.Vector128<byte>;
+using Block32 = System.Runtime.Intrinsics.Vector256<byte>;
Member


How is this going to behave on ARM64, where Vector256 is not a TYP_SIMD (is it two 16-byte copies or is it copying 4 ulongs, since that is what Vector256<T> has as its fields)?

What about on x86 when AVX is not supported?

Member Author


TYP_SIMD doesn't matter; as there is no field access, it will behave the same as any 32-byte struct (like the custom Block64 did previously) and use the largest R2R-able size, e.g. the R2R version for x64 is:

; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ...
G_M6895_IG11:
       movdqu   xmm0, qword ptr [rdx]
       movdqu   qword ptr [rcx], xmm0
       movdqu   xmm0, qword ptr [rdx+16]
       movdqu   qword ptr [rcx+16], xmm0

However, because it is a TYP_SIMD, the Tier-1 JIT x64 version becomes:

; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ...
G_M6895_IG11:
       vmovupd  ymm0, ymmword ptr[rdx]
       vmovupd  ymmword ptr[rcx], ymm0

Whereas it stays as 16-byte chunks with a regular struct.
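
For comparison, the previous custom blocks were just empty fixed-size structs, roughly like this (the exact attributes are from memory, so treat it as an approximation):

    using System.Runtime.InteropServices;

    internal static partial class BufferSketch
    {
        // Approximation of the old custom blocks: empty structs with a declared
        // size and no fields, so the JIT copies them as opaque blocks of bytes
        // rather than as SIMD values.
        [StructLayout(LayoutKind.Sequential, Size = 16)]
        private struct Block16 { }

        [StructLayout(LayoutKind.Sequential, Size = 64)]
        private struct Block64 { }
    }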

I'm not sure why ARM32 is not included, maybe it faults if not aligned?

Member


TYP_SIMD doesn't matter

I might be mistaken, but I believe that optimization depends on TYP_SIMD and therefore FEATURE_SIMD.

FEATURE_SIMD is not enabled for ARM32, which would explain why it is missing out.

Member Author


Checking how to get arm disasm

Member Author


Not sure how to do ARM; x86 is fine though

; Emitting BLENDED_CODE for generic X86 CPU - Windows
; ...

G_M6903_IG11:
       movdqu   xmm0, qword ptr [edx]
       movdqu   qword ptr [ecx], xmm0
       movdqu   xmm0, qword ptr [edx+16]
       movdqu   qword ptr [ecx+16], xmm0

Member Author


and

; Emitting BLENDED_CODE for generic X86 CPU - Windows
; Tier-1 compilation

G_M6903_IG11:
       vmovupd  ymm0, ymmword ptr[edx]
       vmovupd  ymmword ptr[ecx], ymm0

Member


If you have built x86 Windows, you should have an ARM32 altjit. This will show you what code the JIT would produce on ARM32.

For jit-diffs you can use this via --altjit armelnonjit.dll.

With corerun you can set COMPlus_AltJitName=armelnonjit.dll and COMPlus_AltJit=*.

For crossgen, COMPlus_AltJitName=armelnonjit.dll and COMPlus_AltJitNgen=*.

Member Author

@benaadams Jun 15, 2019


Thank you @AndyAyersMS

ARM32 with the blocks and without them seems to generate the same asm:

G_M6903_IG11:
            ldr     r1, [r5]
            str     r1, [r4]
            ldr     r1, [r5+4]
            str     r1, [r4+4]
            ldr     r1, [r5+8]
            str     r1, [r4+8]
            ldr     r1, [r5+12]
            str     r1, [r4+12]
            ; ...   etc

However, I don't know enough about ARM to take out the non-block path, so will leave it in.

Without NEON, a multi-register load/store pair seems fastest at +11% (though it also uses more registers):

      LDMIA r1!, {r3 - r10}
      STMIA r0!, {r3 - r10}

followed by the asm that is currently generated

      LDR r3, [r1], #4
      STR r3, [r0], #4

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

Member Author

@benaadams Jun 15, 2019


Actually, if it's the same, might as well remove the non-block path, as it's a lot of code?

@benaadams
Member Author

As it's operating on double the data size, it should be a net win?

Some processors, especially newer ones, can dispatch two 128-bit instructions or one 256-bit instruction per cycle.

Looks like 2x 256-bit loads and 1x 256-bit store per cycle; so need to tweak a little.

Skylake: L1 Cache Peak Bandwidth (bytes/cyc) => 96 (2x32B Load + 1*32B Store) (a)

Haswell: The L1 data cache can handle two 256-bit load and one 256-bit store operations each cycle. (b)

a. From Table 2-4. Cache Parameters of the Skylake Microarchitecture

b. From 2.2.4.1 Load and Store Operation Enhancements
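
One way to shape the inner loop around that (a rough sketch, not the actual loop): issue both 32-byte loads before the two stores so the loads can pair up while the stores trail behind.

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;

    internal static class CopyLoopSketch
    {
        // Rough sketch: copy 64 bytes per iteration, loading both vectors first so
        // the two loads can issue in the same cycle, with the stores following.
        internal static void Copy64ByteChunks(ref byte dest, ref byte src, nuint len)
        {
            nuint i = 0;
            while (i + 64 <= len)
            {
                Vector256<byte> lo = Unsafe.As<byte, Vector256<byte>>(ref Unsafe.Add(ref src, (nint)i));
                Vector256<byte> hi = Unsafe.As<byte, Vector256<byte>>(ref Unsafe.Add(ref src, (nint)(i + 32)));
                Unsafe.As<byte, Vector256<byte>>(ref Unsafe.Add(ref dest, (nint)i)) = lo;
                Unsafe.As<byte, Vector256<byte>>(ref Unsafe.Add(ref dest, (nint)(i + 32))) = hi;
                i += 64;
            }
            // Remainder (< 64 bytes) is handled elsewhere in the real code.
        }
    }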

@benaadams
Member Author

Hmm... can get 25+% for 512-8192+ bytes; however, the algorithm is outlandishly sensitive to changes, so trying to improve the lower end regresses the larger sizes. Not sure it's worth spending more time on it at this time.
