Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Rewrite Buffer.BlockCopy in C# #27216

Merged
merged 3 commits into from
Oct 18, 2019
Merged

Rewrite Buffer.BlockCopy in C# #27216

merged 3 commits into from
Oct 18, 2019

Conversation

jkotas
Copy link
Member

@jkotas jkotas commented Oct 16, 2019

Fixes #27106

@jkotas
Copy link
Member Author

jkotas commented Oct 16, 2019

Performance results:

Before:

Method numElements Mean
CallBlockCopy 10 8.049 ns
CallBlockCopy 100 10.297 ns
CallBlockCopy 1000 30.870 ns
CallBlockCopy 10000 220.712 ns

After:

Method numElements Mean
CallBlockCopy 10 8.862 ns
CallBlockCopy 100 11.137 ns
CallBlockCopy 1000 41.078 ns
CallBlockCopy 10000 227.914 ns

The managed implementation is slightly slower on average (<5%), but it is GC pause friendly. Achieving the GC pause friendliness in the unmanaged implementation would add more significant overhead.

The larger delta for 1000 elements has to do with tuning of the managed Memcpy implementation that apparently performs worse for this specific block size on my machine.

@jkotas jkotas requested a review from VSadov October 16, 2019 01:29
src/vm/typehandle.inl Outdated Show resolved Hide resolved
Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE!

@benaadams
Copy link
Member

benaadams commented Oct 16, 2019

The managed implementation is slightly slower on average (<5%)

Memmove has scope to be improved; its setup for ymm sizes however it will only ever compile to xmm as it was implemented pre-intrinsics.

Maybe seeing if switching Block16 for Vector128 and Block64 for 2x Vector256 gives an improvement?

@benaadams
Copy link
Member

Maybe seeing if switching Block16 for Vector128 and Block64 for 2x Vector256 gives an improvement?

Even have a commit for it 😄 benaadams@5b90be6

@jkotas
Copy link
Member Author

jkotas commented Oct 16, 2019

benaadams/coreclr@5b90be6

That was #25172 . Trying to hand-tune memmove in software is a losing battle. I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

@benaadams
Copy link
Member

benaadams/coreclr@5b90be6

That was #25172 .

😄 so it was; thought it strange I'd not done anything with it!

I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

Agreed. its such a common operation; I still don't understand why isn't a simple instruction that then can dispatch to the memory or DMA controllers to do the work if the sizes makes sense, rather than have to write a CPU loop in software; and the the cpu can get on other other work or sleep...

@tannergooding
Copy link
Member

tannergooding commented Oct 16, 2019

I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

I'm pretty sure this is meant to be the ERMSB support and it does basically what you ask. Two problems are that it isn't supported everywhere yet and it has some overhead that makes it undesirable for small loops.

  • AMD doesn't support ERMSB directly, but it does support block copies for the same instructions. It has a lot of notes on when it is/isn't beneficial to use.
  • We could expose hardware intrinsics for IsGenuineIntel, IsAuthenticAmd, and ERMSB to allow this to be correctly handled.

Trying to hand-tune memmove in software is a losing battle.

I would agree that hand-tuning to have the best perf is likely a losing battle, but many of the rules around copying blocks of memory are well-defined and documented at this point (namely in the respective architecture manuals). It basically comes down to handling sizes less than 128 bytes and then everything else. The split at 128-bytes is defined because that is how much data a prefetch will grab.

If we exposed intrinsics for the above, we could just have some code like:

if (size < 128)
{
    // small copy
}
else if (Cpuid.IsGenuineIntel && Ermsb.IsSupported)
{
    Ermsb.MoveBytes(src, dst, count);
}
else if (size < threshold)
{
   // large copy in 64-byte chunks using non-temporal loads/stores
}
else
{
    // invoke native memcpy
}

This should provide overall decent performance and fairly closely match what is recommended by the architecture manuals and done by other memcpy implementations.

@benaadams
Copy link
Member

Compared to AVX; ERMSB isn't great below 128bytes and its +2% win starts at >= 2048 bytes

image

Also its sounds more sensitive to misalignment than SIMD (+20%)

image

@benaadams
Copy link
Member

Saying that... I would be interested in a Cpuid.IsGenuineIntel test as some of the Ryzen latencies aren't great (e.g. _pdep_u32 and _pext_u32)

@jkotas jkotas added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 16, 2019
@benaadams
Copy link
Member

Would this be a 3.1 (LTS) candidate?

@jkotas
Copy link
Member Author

jkotas commented Oct 17, 2019

Would this be a 3.1 (LTS) candidate?

Unlikely

@AntonLapounov
Copy link
Member

The managed implementation is slightly slower on average (<5%), but it is GC pause friendly.

(Corrected the question.) For 10K block size we fall back to a PInvoke, and for 10 to 1K block sizes the performance hit of the managed implementation seems more than 5%. Is the 1K size indeed somehow exceptional or is that actually evidence that we'd better lower MemmoveNativeThreshold to fall back to the PInvoke earlier (at least for your particular machine and the current managed implementation that does not take advantage of intrinsics)?

@jkotas
Copy link
Member Author

jkotas commented Oct 18, 2019

10 to 1K block sizes the performance hit of the managed implementation seems more than 5%

Yeah, you are right. The difference can be up to 20% around the MemmoveNativeThreshold. This memory copy implementation is used in many other places in managed code. If there is a good reason to look at tuning it again, I would rather do it in a separate change.

we'd better lower MemmoveNativeThreshold to fall back to the PInvoke earlier

The current MemmoveNativeThreshold is 2k. The PInvoke overhead is still significant for this buffer size. On my machine, there is actually a performance drop as we cross this threshold and fallback to PInvoke. This would call for making this threshold higher.

@jkotas jkotas merged commit 495a6b5 into dotnet:master Oct 18, 2019
@VSadov VSadov removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 18, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Oct 21, 2019
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Oct 21, 2019
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Oct 21, 2019
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
safern pushed a commit to dotnet/corefx that referenced this pull request Oct 21, 2019
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas added a commit to dotnet/corert that referenced this pull request Oct 22, 2019
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
marek-safar pushed a commit to mono/mono that referenced this pull request Oct 22, 2019
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@jkotas jkotas deleted the managed-blockcopy branch October 27, 2019 21:06
@Robird
Copy link

Robird commented Nov 12, 2020

benaadams/coreclr@5b90be6

That was #25172 . Trying to hand-tune memmove in software is a losing battle. I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

I think, acutly there is hardware DMA on broad platforms, rather than CPU register and load/store instructions.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Buffer::BlockCopy may spend too long without GC polling
8 participants