Rewrite Buffer.BlockCopy in C# #27216

jkotas · 2019-10-16T01:14:20Z

Fixes #27106

src/vm/jithelpers.cpp

jkotas · 2019-10-16T01:22:57Z

Performance results:

Before:

Method	numElements	Mean
CallBlockCopy	10	8.049 ns
CallBlockCopy	100	10.297 ns
CallBlockCopy	1000	30.870 ns
CallBlockCopy	10000	220.712 ns

After:

Method	numElements	Mean
CallBlockCopy	10	8.862 ns
CallBlockCopy	100	11.137 ns
CallBlockCopy	1000	41.078 ns
CallBlockCopy	10000	227.914 ns

The managed implementation is slightly slower on average (<5%), but it is GC pause friendly. Achieving the GC pause friendliness in the unmanaged implementation would add more significant overhead.

The larger delta for 1000 elements has to do with tuning of the managed Memcpy implementation that apparently performs worse for this specific block size on my machine.
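
For context, the call being measured looks roughly like the sketch below. The benchmark source is not part of this thread, so the element type (`int`) and the helper name `CopyInts` are assumptions made for illustration only. The one thing worth highlighting is that `Buffer.BlockCopy` takes its offsets and count in bytes, not elements.

```csharp
using System;

public static class BlockCopyExample
{
    // Hypothetical helper: copies the first `elementCount` ints of `source`
    // into a freshly allocated array using Buffer.BlockCopy.
    public static int[] CopyInts(int[] source, int elementCount)
    {
        var destination = new int[elementCount];

        // Buffer.BlockCopy works on bytes, so convert the element count.
        int byteCount = elementCount * sizeof(int);

        // Arguments: src, srcOffset (bytes), dst, dstOffset (bytes), count (bytes).
        Buffer.BlockCopy(source, 0, destination, 0, byteCount);
        return destination;
    }
}
```

Calling `BlockCopyExample.CopyInts(new[] { 1, 2, 3, 4 }, 3)` returns a three-element array holding the first three values of the source.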

src/vm/typehandle.inl

src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs

src/System.Private.CoreLib/shared/System/Buffer.cs

VSadov

NICE!

benaadams · 2019-10-16T03:19:48Z

> The managed implementation is slightly slower on average (<5%)

Memmove has scope to be improved; it's set up for ymm sizes, but it will only ever compile to xmm because it was implemented pre-intrinsics.

Maybe seeing if switching Block16 for Vector128 and Block64 for 2x Vector256 gives an improvement?

benaadams · 2019-10-16T03:38:16Z

> Maybe seeing if switching Block16 for Vector128 and Block64 for 2x Vector256 gives an improvement?

Even have a commit for it 😄 benaadams@5b90be6

jkotas · 2019-10-16T06:28:21Z

> benaadams/coreclr@5b90be6

That was #25172. Trying to hand-tune memmove in software is a losing battle. I hope that hardware will one day add a memmove instruction that is superior to hand-tuned implementations in every dimension.

benaadams · 2019-10-16T10:22:35Z

> benaadams/coreclr@5b90be6

> That was #25172.

😄 so it was; thought it strange I'd not done anything with it!

> I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

Agreed. It's such a common operation; I still don't understand why there isn't a simple instruction that can dispatch to the memory or DMA controllers to do the work when the sizes make sense, rather than having to write a CPU loop in software, so that the CPU can get on with other work or sleep...

tannergooding · 2019-10-16T13:11:12Z

> I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

I'm pretty sure this is what ERMSB support is meant to be, and it does basically what you ask. The two problems are that it isn't supported everywhere yet, and that it has some overhead that makes it undesirable for small copies.

AMD doesn't support ERMSB directly, but it does support block copies for the same instructions. It has a lot of notes on when it is/isn't beneficial to use.
We could expose hardware intrinsics for IsGenuineIntel, IsAuthenticAmd, and ERMSB to allow this to be correctly handled.

> Trying to hand-tune memmove in software is a losing battle.

I would agree that hand-tuning to get the best perf is likely a losing battle, but many of the rules around copying blocks of memory are well-defined and documented at this point (namely in the respective architecture manuals). It basically comes down to handling sizes less than 128 bytes and then everything else. The split at 128 bytes is defined because that is how much data a prefetch will grab.

If we exposed intrinsics for the above, we could just have some code like:

if (size < 128)
{
    // small copy
}
else if (Cpuid.IsGenuineIntel && Ermsb.IsSupported)
{
    // Cpuid and Ermsb are the proposed intrinsics; they are not exposed today
    Ermsb.MoveBytes(src, dst, size);
}
else if (size < threshold)
{
    // large copy in 64-byte chunks using non-temporal loads/stores
}
else
{
    // invoke native memcpy
}

This should provide overall decent performance and fairly closely match what is recommended by the architecture manuals and done by other memcpy implementations.
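
To make the branch order in the pseudocode above concrete, here is a minimal, compilable sketch of just the selection logic. No `Cpuid` or `Ermsb` intrinsics exist, so the CPU checks are passed in as plain booleans; `CopyStrategy`, `CopyDispatchSketch`, and `nativeThreshold` are hypothetical names introduced for illustration. It chooses a strategy only and does not copy anything.

```csharp
using System;

// Hypothetical labels for the four branches of the pseudocode.
public enum CopyStrategy
{
    Small,
    Ermsb,
    NonTemporalChunks,
    NativeMemcpy
}

public static class CopyDispatchSketch
{
    // The 128-byte split mentioned in the comment above.
    public const long SmallLimit = 128;

    // isGenuineIntel and ermsbSupported stand in for the proposed
    // (not yet existing) intrinsics; nativeThreshold is left to the caller
    // because the thread does not give a value for it.
    public static CopyStrategy Choose(
        long size, bool isGenuineIntel, bool ermsbSupported, long nativeThreshold)
    {
        if (size < SmallLimit)
            return CopyStrategy.Small;

        if (isGenuineIntel && ermsbSupported)
            return CopyStrategy.Ermsb;

        if (size < nativeThreshold)
            return CopyStrategy.NonTemporalChunks;

        return CopyStrategy.NativeMemcpy;
    }
}
```

Keeping the selection separate from the copy loops makes it possible to unit-test the thresholds without depending on the hardware the test runs on.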

benaadams · 2019-10-16T13:35:24Z

Compared to AVX, ERMSB isn't great below 128 bytes, and its +2% win starts at >= 2048 bytes.

Also, it sounds more sensitive to misalignment than SIMD (+20%).

benaadams · 2019-10-16T13:43:30Z

Saying that... I would be interested in a Cpuid.IsGenuineIntel test as some of the Ryzen latencies aren't great (e.g. _pdep_u32 and _pext_u32)

benaadams · 2019-10-17T16:15:26Z

Would this be a 3.1 (LTS) candidate?

jkotas · 2019-10-17T16:33:28Z

> Would this be a 3.1 (LTS) candidate?

Unlikely

AntonLapounov · 2019-10-18T13:11:44Z

> The managed implementation is slightly slower on average (<5%), but it is GC pause friendly.

(Corrected the question.) For 10K block size we fall back to a PInvoke, and for 10 to 1K block sizes the performance hit of the managed implementation seems more than 5%. Is the 1K size indeed somehow exceptional or is that actually evidence that we'd better lower MemmoveNativeThreshold to fall back to the PInvoke earlier (at least for your particular machine and the current managed implementation that does not take advantage of intrinsics)?
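
The per-size slowdown can be worked out directly from the Before and After tables earlier in the thread. The sketch below does the arithmetic; the class and method names are hypothetical, and the numbers are copied from those two tables.

```csharp
using System;

public static class SlowdownCalc
{
    // Relative slowdown of `afterNs` versus `beforeNs`, in percent.
    public static double PercentSlower(double beforeNs, double afterNs)
        => (afterNs - beforeNs) / beforeNs * 100.0;

    // Mean times (ns) for numElements = 10, 100, 1000, 10000,
    // taken from the Before and After tables in this thread.
    public static double[] FromTables()
    {
        double[] before = { 8.049, 10.297, 30.870, 220.712 };
        double[] after = { 8.862, 11.137, 41.078, 227.914 };

        var result = new double[before.Length];
        for (int i = 0; i < before.Length; i++)
        {
            result[i] = PercentSlower(before[i], after[i]);
        }
        return result;
    }
}
```

`SlowdownCalc.FromTables()` yields roughly 10%, 8%, 33%, and 3% for the four sizes, so only the 10000-element case is under 5%.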

jkotas · 2019-10-18T16:20:35Z

> 10 to 1K block sizes the performance hit of the managed implementation seems more than 5%

Yeah, you are right. The difference can be up to 20% around the MemmoveNativeThreshold. This memory copy implementation is used in many other places in managed code. If there is a good reason to look at tuning it again, I would rather do it in a separate change.

> we'd better lower MemmoveNativeThreshold to fall back to the PInvoke earlier

The current MemmoveNativeThreshold is 2k. The PInvoke overhead is still significant for this buffer size. On my machine, there is actually a performance drop as we cross this threshold and fall back to PInvoke. This would call for making this threshold higher.
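
One rough way to look for a drop like this on a given machine is to time copies just below and just above 2048 bytes. The sketch below is a crude `Stopwatch` probe, not a substitute for a proper benchmark harness. It assumes `Span<T>.CopyTo` is routed through the same managed Memmove path, which is not confirmed in this thread, and `ThresholdProbe` is a hypothetical name.

```csharp
using System;
using System.Diagnostics;

public static class ThresholdProbe
{
    // Returns the average time of one copy of `byteCount` bytes, in nanoseconds.
    // Results are noisy; compare trends across sizes rather than single values.
    public static double MeasureNsPerCopy(int byteCount, int iterations)
    {
        var source = new byte[byteCount];
        var destination = new byte[byteCount];

        // Warm-up copy so JIT compilation is not included in the timing.
        source.AsSpan().CopyTo(destination);

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            source.AsSpan().CopyTo(destination);
        }
        stopwatch.Stop();

        // Convert milliseconds to nanoseconds, then average over iterations.
        return stopwatch.Elapsed.TotalMilliseconds * 1_000_000.0 / iterations;
    }
}
```

Comparing, say, `MeasureNsPerCopy(2040, 1_000_000)` against `MeasureNsPerCopy(2056, 1_000_000)` gives a first indication of whether crossing the threshold costs anything on the machine at hand.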

tests/CoreFX/CoreFX.issues.rsp

* Rewrite Buffer.BlockCopy in C# (Fixes #27106)
* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

Robird · 2020-11-12T07:24:33Z

> benaadams/coreclr@5b90be6

> That was #25172. Trying to hand-tune memmove in software is a losing battle. I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

I think there actually is hardware DMA on a broad range of platforms, as an alternative to copying through CPU registers with load/store instructions.

jkotas requested a review from VSadov October 16, 2019 01:29

EgorBo reviewed Oct 16, 2019

src/System.Private.CoreLib/shared/System/Buffer.cs

stephentoub reviewed Oct 16, 2019

src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs

stephentoub reviewed Oct 16, 2019

src/vm/typehandle.inl

stephentoub approved these changes Oct 16, 2019

VSadov reviewed Oct 16, 2019

src/vm/typehandle.inl (outdated)

VSadov approved these changes Oct 16, 2019

jkotas added the "* NO MERGE *" label (the PR is not ready for merge yet; see discussion for detailed reasons) Oct 16, 2019

jkotas force-pushed the managed-blockcopy branch from f5aee30 to 485a405 Compare October 16, 2019 20:26

jkotas added 2 commits October 17, 2019 22:21

Rewrite Buffer.BlockCopy in C#

a1805ba

Fixes #27106

Workaround to enable type check optimizations for BlockCopy only

d14319c

jkotas force-pushed the managed-blockcopy branch from 206fab2 to d14319c Compare October 18, 2019 08:47

jkotas commented Oct 18, 2019

tests/CoreFX/CoreFX.issues.rsp (outdated)

Update tests/CoreFX/CoreFX.issues.rsp

fd21a86

jkotas merged commit 495a6b5 into dotnet:master Oct 18, 2019

VSadov removed the "* NO MERGE *" label (the PR is not ready for merge yet; see discussion for detailed reasons) Oct 18, 2019

EgorBo mentioned this pull request Oct 19, 2019

Convert Array.IsPrimitiveTypeArray to C# #27302

Closed

AustinWise mentioned this pull request Oct 25, 2019

Cleanup some code and comments leftover from BlockCopy refactor #27432

Merged

jkotas deleted the managed-blockcopy branch October 27, 2019 21:06
