Performance results: Before:
After:
The managed implementation is slightly slower on average (<5%), but it is GC pause friendly. Achieving the same GC pause friendliness in the unmanaged implementation would add more significant overhead. The larger delta for 1000 elements comes from the tuning of the managed Memcpy implementation, which apparently performs worse for this specific block size on my machine.
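For readers unfamiliar with the API under discussion, here is a minimal sketch of what `Buffer.BlockCopy` does from the caller's side. It is not code from this PR; the class and method names (`BlockCopyDemo`, `CopyTwoInts`) are made up for illustration. The point it shows is that offsets and counts are expressed in bytes, not elements.

```csharp
using System;

public static class BlockCopyDemo
{
    // Hypothetical helper: copies the second and third ints of src
    // into the start of a new array of the same length.
    public static int[] CopyTwoInts(int[] src)
    {
        int[] dst = new int[src.Length];

        // srcOffset, dstOffset and count are all in BYTES.
        Buffer.BlockCopy(src, 1 * sizeof(int), dst, 0, 2 * sizeof(int));
        return dst;
    }

    public static void Main()
    {
        int[] result = CopyTwoInts(new[] { 1, 2, 3, 4 });
        Console.WriteLine(string.Join(",", result)); // prints 2,3,0,0
    }
}
```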
NICE!
Memmove has scope to be improved; it's set up for ymm sizes, but it will only ever compile to xmm because it was implemented pre-intrinsics. Maybe seeing if switching
Even have a commit for it 😄 benaadams@5b90be6
That was #25172. Trying to hand-tune memmove in software is a losing battle. I hope that hardware will one day add a memmove instruction that is superior to hand-tuned implementations in every dimension.
😄 so it was; thought it strange I'd not done anything with it!
Agreed. It's such a common operation; I still don't understand why there isn't a simple instruction that can dispatch to the memory or DMA controllers to do the work when the sizes make sense, rather than having to write a CPU loop in software; then the CPU could get on with other work or sleep...
I'm pretty sure this is meant to be the
I would agree that hand-tuning to have the best perf is likely a losing battle, but many of the rules around copying blocks of memory are well-defined and documented at this point (namely in the respective architecture manuals). It basically comes down to handling sizes less than 128 bytes and then everything else. The split at 128-bytes is defined because that is how much data a

If we exposed intrinsics for the above, we could just have some code like:

    if (size < 128)
    {
        // small copy
    }
    else if (Cpuid.IsGenuineIntel && Ermsb.IsSupported)
    {
        Ermsb.MoveBytes(src, dst, count);
    }
    else if (size < threshold)
    {
        // large copy in 64-byte chunks using non-temporal loads/stores
    }
    else
    {
        // invoke native memcpy
    }

This should provide overall decent performance and fairly closely match what is recommended by the architecture manuals and done by other memcpy implementations.
Saying that... I would be interested in a
Would this be a 3.1 (LTS) candidate?
Unlikely
(Corrected the question.) For 10K block size we fall back to a PInvoke, and for 10 to 1K block sizes the performance hit of the managed implementation seems more than 5%. Is the 1K size indeed somehow exceptional or is that actually evidence that we'd better lower
Yeah, you are right. The difference can be up to 20% around the MemmoveNativeThreshold. This memory copy implementation is used in many other places in managed code. If there is a good reason to look at tuning it again, I would rather do it in a separate change.
The current MemmoveNativeThreshold is 2k. The PInvoke overhead is still significant for this buffer size. On my machine, there is actually a performance drop as we cross this threshold and fall back to PInvoke. This would call for making this threshold higher.
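A rough way to look for the kind of drop described above is to time `Buffer.BlockCopy` at block sizes on either side of the threshold. The sketch below is only an assumption about how one might probe this, not the benchmark used for the numbers in this PR; the class name `ThresholdProbe`, the helper `MeasureNanosPerCopy`, and the chosen sizes and iteration counts are all illustrative. For real measurements a harness such as BenchmarkDotNet would give more trustworthy results.

```csharp
using System;
using System.Diagnostics;

public static class ThresholdProbe
{
    // Hypothetical helper: average time of one BlockCopy of `size` bytes,
    // in nanoseconds, measured over `iterations` copies.
    public static double MeasureNanosPerCopy(int size, int iterations)
    {
        byte[] src = new byte[size];
        byte[] dst = new byte[size];

        // Warm up so the JIT has compiled the call path before timing.
        for (int i = 0; i < 10_000; i++)
        {
            Buffer.BlockCopy(src, 0, dst, 0, size);
        }

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            Buffer.BlockCopy(src, 0, dst, 0, size);
        }
        sw.Stop();

        // Milliseconds -> nanoseconds, then divide by the number of copies.
        return sw.Elapsed.TotalMilliseconds * 1_000_000.0 / iterations;
    }

    public static void Main()
    {
        // Sizes picked to straddle a 2K threshold; adjust as needed.
        int[] sizes = { 512, 1024, 2048, 4096, 8192 };
        foreach (int size in sizes)
        {
            double ns = MeasureNanosPerCopy(size, 1_000_000);
            Console.WriteLine($"{size,6} bytes: {ns:F1} ns/copy");
        }
    }
}
```

A discontinuity in ns/copy between adjacent sizes would be consistent with the threshold effect described here, though a simple loop like this is sensitive to cache state and machine noise.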
* Rewrite Buffer.BlockCopy in C#

Fixes #27106

* Workaround to enable type check optimizations for BlockCopy only

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
I think there is actually hardware DMA on a broad range of platforms, which could be used rather than CPU registers and load/store instructions.
Fixes #27106