Memmove: Use Vector types rather than Custom blocks #25172
Conversation
What is the measured difference between 128-bit only and also supporting 256-bit here?
```diff
@@ -158,7 +160,7 @@ internal static unsafe void Memmove(byte* dest, byte* src, nuint len)
         // Copy bytes which are multiples of 16 and leave the remainder for MCPY01 to handle.
         Debug.Assert(len > 16 && len <= 64);
 #if HAS_CUSTOM_BLOCKS
-        *(Block16*)dest = *(Block16*)src; // [0,16]
+        Unsafe.AsRef<Vector128<byte>>(dest) = Unsafe.AsRef<Vector128<byte>>(src); // [0,16]
```
We already have a pointer, why not explicitly `Load`? That would more easily allow use of the non-temporal loads, if desirable (and make it easier to test them).
Don't know if non-temporal is desirable.
Also, wouldn't that mean every copy would need to be doubled and guarded with an `.IsSupported` check, rather than just working?
> Also would mean every copy would need to be doubled and guarded with an `.IsSupported` rather than just working?
I think providing a `Load` helper method could handle that, and abstract away the x86 vs Arm vs fallback cases.
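A sketch of what such a `Load` helper might look like (the name `Load16` and the surrounding class are hypothetical, not from this PR): it centralises the per-ISA choice so call sites stay a single line, and a non-temporal variant could be slotted into the same place if measurements justified it.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class BlockCopy
{
    // Hypothetical helper: hides the per-ISA choice behind one call site.
    // With Sse2 available this is an explicit unaligned vector load; a
    // non-temporal variant could be swapped in here for testing.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<byte> Load16(byte* src)
    {
        if (Sse2.IsSupported)
        {
            return Sse2.LoadVector128(src); // movdqu/movups
        }
        // Fallback: an unaligned read through Unsafe behaves the same.
        return Unsafe.ReadUnaligned<Vector128<byte>>(src);
    }
}
```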
```diff
@@ -372,7 +391,7 @@ private static void Memmove(ref byte dest, ref byte src, nuint len)
         if (len <= 32)
             goto MCPY01;
 #if HAS_CUSTOM_BLOCKS
-        Unsafe.As<byte, Block16>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Block16>(ref Unsafe.Add(ref src, 16)); // [0,32]
+        Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16)); // [0,32]
```
Given that we pin for long arrays, and have handling for short arrays, is there any reason we can't just pin for the medium array-length case as well? `Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16))` is hard to read and easy to get wrong.
There are two different methods: one that takes pointers, and this one that takes `ref` and doesn't pin anything.
It pins if it hits the native fallback case (long arrays), where it calls `_Memmove`.
For >= Haswell the xmm and ymm instructions have the same clock-cycle cost (though one of the ymm ones is +1/3 latency). As it's operating on double the data size it should be a net win? I also wanted a good excuse to double the pinvoke switch-over length 😉
Any numbers from micro-benchmarks demonstrating the improvement?
Some processors, especially newer ones, can dispatch two 128-bit instructions or one 256-bit instruction per cycle. There are some potential drawbacks to using the 256-bit instructions, especially when alignment is not guaranteed (and the minor downclocking that can occur). So I was wondering if we had any numbers showing that the 256-bit path is worth the additional maintenance cost (or whether a 128-bit path alone is sufficient for 95% of cases).
Can minimise the changes instead so it deletes more than it adds, as per the last commit [+8 −12]; so less maintenance cost than currently? Will follow up with perf numbers.
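For a rough first cut, those numbers could come from something like the following Stopwatch harness (sizes, iteration count, and output format are all illustrative; BenchmarkDotNet would be the more rigorous choice). It times `Buffer.MemoryCopy`, which routes through this `Memmove`, across the interesting length buckets:

```csharp
using System;
using System.Diagnostics;

internal static class MemmoveBench
{
    private static unsafe void Main()
    {
        byte[] srcArr = new byte[16384];
        byte[] dstArr = new byte[16384];

        // Lengths chosen to hit the short, medium, and long copy paths.
        foreach (int len in new[] { 16, 64, 256, 1024, 4096 })
        {
            fixed (byte* src = srcArr)
            fixed (byte* dst = dstArr)
            {
                const int iterations = 1_000_000;
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < iterations; i++)
                {
                    Buffer.MemoryCopy(src, dst, dstArr.Length, len);
                }
                sw.Stop();
                double nsPerCopy = sw.Elapsed.TotalMilliseconds / iterations * 1_000_000;
                Console.WriteLine($"{len,5} bytes: {nsPerCopy:F2} ns/copy");
            }
        }
    }
}
```

Running the same binary before and after tiering kicks in (or with `COMPlus_TieredCompilation` toggled) would separate the R2R 128-bit numbers from the Tier-1 256-bit ones.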
```diff
@@ -4,6 +4,8 @@

 #if AMD64 || ARM64 || (BIT32 && !ARM)
 #define HAS_CUSTOM_BLOCKS
+using Block16 = System.Runtime.Intrinsics.Vector128<byte>;
+using Block32 = System.Runtime.Intrinsics.Vector256<byte>;
```
How is this going to behave on ARM64, where `Vector256` is not a TYP_SIMD (is it two 16-byte copies, or is it copying 4 ulongs, since that is what `Vector256<T>` has as its fields)?
What about on x86 when AVX is not supported?
TYP_SIMD doesn't matter; as there is no field access it will behave the same as all 32-byte structs (like the Block64 previously) and use the largest R2R-able size, e.g. the R2R version for x64 is:
```asm
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ...
G_M6895_IG11:
       movdqu   xmm0, qword ptr [rdx]
       movdqu   qword ptr [rcx], xmm0
       movdqu   xmm0, qword ptr [rdx+16]
       movdqu   qword ptr [rcx+16], xmm0
```
However, because it is a TYP_SIMD, the Tier-1 JIT x64 becomes:
```asm
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ...
G_M6895_IG11:
       vmovupd  ymm0, ymmword ptr [rdx]
       vmovupd  ymmword ptr [rcx], ymm0
```
Whereas it stays as the 16-byte chunks with a regular struct.
I'm not sure why ARM32 is not included; maybe it faults if not aligned?
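For reference, the custom blocks being replaced were empty fixed-size structs, roughly like the following reconstruction (approximate, based on the pre-PR shape of Buffer.cs); copying through them emits a plain struct copy, which is why they never upgrade past 16-byte moves the way a `Vector256<byte>` copy can:

```csharp
using System.Runtime.InteropServices;

internal static unsafe class BufferShim
{
    // Approximate shape of the pre-PR custom blocks: empty structs whose
    // only purpose is their Size, so that `*(Block16*)dest = *(Block16*)src`
    // compiles to a 16-byte struct copy regardless of ISA support.
    [StructLayout(LayoutKind.Sequential, Size = 16)]
    private struct Block16 { }

    [StructLayout(LayoutKind.Sequential, Size = 64)]
    private struct Block64 { }

    private static void Copy16(byte* dest, byte* src)
        => *(Block16*)dest = *(Block16*)src; // [0,16]
}
```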
> TYP_SIMD doesn't matter
I might be mistaken, but I believe that optimization depends on TYP_SIMD and therefore FEATURE_SIMD.
FEATURE_SIMD is not enabled for ARM32, which would explain why it is missing out.
Checking how to get ARM disasm.
Not sure how to do ARM; x86 is fine though:
```asm
; Emitting BLENDED_CODE for generic X86 CPU - Windows
; ...
G_M6903_IG11:
       movdqu   xmm0, qword ptr [edx]
       movdqu   qword ptr [ecx], xmm0
       movdqu   xmm0, qword ptr [edx+16]
       movdqu   qword ptr [ecx+16], xmm0
```
and

```asm
; Emitting BLENDED_CODE for generic X86 CPU - Windows
; Tier-1 compilation
G_M6903_IG11:
       vmovupd  ymm0, ymmword ptr [edx]
       vmovupd  ymmword ptr [ecx], ymm0
```
If you have built x86 Windows, you should have an ARM32 altjit. This will show you what code the JIT would produce on ARM32.
For jit-diffs you can use this via `--altjit armelnonjit.dll`.
With corerun you can set `COMPlus_AltJitName=armelnonjit.dll` and `COMPlus_AltJit=*`.
For crossgen, `COMPlus_AltJitName=armelnonjit.dll` and `COMPlus_AltJitNgen=*`.
Thank you @AndyAyersMS
ARM32 with the blocks and without seems to generate the same asm:
```asm
G_M6903_IG11:
        ldr     r1, [r5]
        str     r1, [r4]
        ldr     r1, [r5+4]
        str     r1, [r4+4]
        ldr     r1, [r5+8]
        str     r1, [r4+8]
        ldr     r1, [r5+12]
        str     r1, [r4+12]
        ; ... etc
```
However, I don't know enough about ARM to take out the non-block path, so will leave it in.
Without NEON, a multi-register instruction seems fastest at +11% (though it also uses more registers):
```asm
LDMIA   r1!, {r3 - r10}
STMIA   r0!, {r3 - r10}
```
followed by the asm that is currently generated:

```asm
LDR     r3, [r1], #4
STR     r3, [r0], #4
```
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
Actually, if it's the same, might as well remove the non-block path, as it's a lot of code?
Looks like 2x 256-bit loads and 1x 256-bit store; so need to tweak a little.

- Skylake: L1 Cache Peak Bandwidth (bytes/cyc) => 96 (2x32B Load + 1x32B Store) (a)
- Haswell: The L1 data cache can handle two 256-bit load and one 256-bit store operations each cycle. (b)

a. From Table 2-4. Cache Parameters of the Skylake Microarchitecture
Hmm... can get 25+% for 512–8192+ bytes; however, the algo is outlandishly sensitive to changes, so trying to improve the lower end regresses the larger sizes. Not sure it's worth spending more time on it at this time.
With "Allow pregenerating most HW intrinsics in CoreLib" #24917, we can use the `Vector128` and `Vector256` types and `IsSupported` checks and still have the methods R2R AoT-compiled for fast start, using 128-bit copies. When the methods are re-compiled at Tier-1 they then upgrade to 256-bit copies.
This allows the best of both worlds.
/cc @jkotas @GrabYourPitchforks @tannergooding @MichalStrehovsky
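The tiering pattern described above can be sketched roughly as follows (the helper name `Copy32` and its class are hypothetical, not the PR's actual code): at R2R/AoT compile time the `IsSupported` check cannot be baked in, so the 128-bit fallback runs; once the method is re-jitted at Tier-1 the check folds to a constant and only one branch survives.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class CopyHelper
{
    // Hypothetical 32-byte copy: one 256-bit move where AVX is available,
    // otherwise two 128-bit moves. At Tier-1 the IsSupported branch is
    // resolved at JIT time, so there is no runtime check in the hot path.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void Copy32(ref byte dest, ref byte src)
    {
        if (Avx.IsSupported)
        {
            Unsafe.As<byte, Vector256<byte>>(ref dest) =
                Unsafe.As<byte, Vector256<byte>>(ref src);
        }
        else
        {
            Unsafe.As<byte, Vector128<byte>>(ref dest) =
                Unsafe.As<byte, Vector128<byte>>(ref src);
            Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref dest, 16)) =
                Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16));
        }
    }
}
```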
Notes:

- Doubled the length for managed copy after the upgrade, since the register size doubles.
- The R2R version is almost identical; however some `lea`s creep in (I think @mikedn has a fix for this); the Tier-1 likewise.