
'Buffer.Memmove' allow managed code path for compatible overlap cases #10282

Closed
glenn-slayden opened this issue May 4, 2018 · 10 comments
Labels
area-System.Runtime enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Comments

@glenn-slayden

(This issue was originally mentioned here)

Commit c6372c5 changed how overlap between the source and destination is detected in Buffer.Memmove:

https://github.com/dotnet/coreclr/blob/ea9bee5ac2f96a1ea6b202dc4094b8d418d9209c/src/mscorlib/src/System/Buffer.cs#L262

...became...

https://github.com/dotnet/coreclr/blob/c6372c5bfebd61470ea5d111f224d035b1d2ebdf/src/mscorlib/src/System/Buffer.cs#L271

The original code may seem puzzling since the unsigned subtraction and compare "incorrectly" fails to identify source/destination overlap in the case where the destination (address) precedes the source (address) in memory. Indeed c6372c5 clarifies this and would also seem to fix a bug, but how did the original work anyway?

The answer, one eventually realizes, is that the original code was stealthily incorporating a strong assumption about--and dependency upon--the copy direction of the managed Buffer.Memmove implementation which followed it.
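
For concreteness, here is a minimal sketch of that trick as I read it (illustrative names, not the actual Buffer.cs code). With an unsigned difference, a destination that precedes the source wraps around to a huge value, so the test only flags the one overlap case that a strictly forward copy would actually corrupt:

static unsafe bool ForwardCopyWouldCorrupt(byte* pSrc, byte* pDst, ulong len)
{
    // If pDst < pSrc the subtraction wraps to a value >= len, the test is
    // false, and the managed forward copy is allowed to proceed; the test is
    // true only when pDst lies within [pSrc, pSrc + len).
    return (ulong)(pDst - pSrc) < len;
}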

The unfortunate upshot is that, while c6372c5 certainly makes the code clearer and more readable, it now excludes all overlap cases and thus no longer allows the performance benefit of fully-managed operation (i.e., non-P/Invoke) in the 50% of cases where the overlap happens to be compatible with the copy direction implemented in Buffer.cs. I blame the original code for not documenting its entirely non-obvious design and/or operation.

Anyway, can we revisit this line (re-relax the overlap test) to restore the optimization that was lost here? That is, once again allow the managed code path to prevail instead of P/Invoke for the half of the detected overlaps which happen to comport with the copy direction that's hard-coded into Buffer.Memmove?

@glenn-slayden glenn-slayden changed the title 'Buffer.Memmove' allow managed code path for compatible overlap cases 'Buffer.Memmove' (re-)allow managed code path for compatible overlap cases May 4, 2018
@glenn-slayden glenn-slayden changed the title 'Buffer.Memmove' (re-)allow managed code path for compatible overlap cases 'Buffer.Memmove' allow managed code path for compatible overlap cases May 4, 2018
@RussKeldorph
Contributor

@jkotas @vkvenkat @fiigii

@jkotas
Member

jkotas commented May 4, 2018

The current implementation does not use uniform copy direction.

How would you propose to revisit this?

@glenn-slayden
Author

Ok, I did not discern that from the C# implementation body. If that's truly the case for the managed code path in Buffer.Memmove, then sorry for the trouble and feel free to close this issue.

@glenn-slayden
Author

glenn-slayden commented May 12, 2018

@jkotas Because you say "the current implementation does not use uniform copy direction," I just wanted to make sure you properly understood what's at issue here in dotnet/coreclr#17886.

Your remark, while it may well be accurate, surprised me, so I checked the HAS_CUSTOM_BLOCKS (x64) code path in buffer.cs and found that, seemingly contrary to your claim, it does manifest a forward copy direction: it always progresses forward through the (addresses of the) 16- or 64-byte blocks it processes.

And as we know, barring exceptional factors, memcpy/memmove code which demonstrates either a consistent forward--or reverse--copy direction is eligible to allow, with full correctness, exactly one out of the two possible cases of source/destination overlap.

So perhaps you were referring not to the precise specifics of the "uniform copy direction" claim, but rather to the more practical sense of the phrase--that is, whether the code does, in fact, correctly allow one of the overlap cases. Here I found that, despite the forward copy direction, there is an entangling issue in the current buffer.cs code as written: it performs potentially redundant copying of the final remainder bytes by right-abutting a final large-block move. It appears to be just this clean-up code for the last remaining bytes, outside the loop body of a potentially large move, that foils eligibility for (one of the two) src/dst overlap cases.

Is my understanding correct, namely that this single issue of right-aligning the final remainder block--since it may overlap a previously copied block--is the only factor that defeats what would otherwise allow enabling one of the overlap cases, and that this is what you meant when you asserted that the "current implementation does not use uniform copy direction"?

If so, one would have to agree that sacrificing the managed code path for fully half of all overlapping input cases--in favor of a single cute right-alignment trick on just the final block--raises a serious question about the current cost/benefit judgment. Especially since the trick likely involves some amount of redundant--and probably unaligned--copying... but that's another story.

I can provide more information on this issue, including my distilled-equivalent version of the HAS_CUSTOM_BLOCKS code path (which is much easier to follow), if requested.
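
In the meantime, here is a rough hand-written sketch of the pattern I'm describing (my own illustration of my reading of the code, not the actual Buffer.cs body); it assumes len >= BlockSize and uses System.Runtime.CompilerServices.Unsafe:

using System.Runtime.CompilerServices;

static unsafe void ForwardCopyWithAbuttedTail(byte* dst, byte* src, long len)
{
    const long BlockSize = 16;   // stand-in for the 16/64-byte custom blocks

    long i = 0;
    for (; i + BlockSize <= len; i += BlockSize)
        Unsafe.CopyBlockUnaligned(dst + i, src + i, (uint)BlockSize);

    // Clean-up: one final full block, right-abutted to the end of the range.
    // It re-copies up to BlockSize - 1 bytes that were already written, which
    // is harmless when src and dst do not overlap but defeats forward-overlap
    // correctness.
    if (i < len)
        Unsafe.CopyBlockUnaligned(dst + (len - BlockSize), src + (len - BlockSize), (uint)BlockSize);
}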

@jkotas
Member

jkotas commented May 12, 2018

Right, the redundant copying is the problem.

I do not mind fixing this - it is just not a simple tweak of the condition:

  • The fix should have zero effect on the existing non-overlapping cases.
  • The fix should ideally make all overlapping cases faster. The old code that made half of the overlapping cases fast was a potential performance trap.

@glenn-slayden
Author

glenn-slayden commented May 12, 2018

@jkotas Curious what you mean by "performance trap." Wouldn't making half the cases better always be preferable to keeping them all worse?

@jkotas
Member

jkotas commented May 12, 2018

I think one would expect that copying from higher address to lower address has about the same performance as copying from lower address to higher address. If one is significantly faster than other, it is a performance cliff that tends to take people by surprise.

It is sometimes better to have consistently worse performance than to have uneven performance. This case may not be the best example, but it is best to avoid it when you have a choice.

@glenn-slayden
Author

Ok. Since you mentioned that you are willing to improve this code, at some point I will try to post a more detailed proposal for your consideration here in this thread.

In addition to your two points, I think the code should give some consideration to the input alignment characteristics. A starting point for the set of possible alignment cases is the cross-product of all the following. This represents 64 cases, although some of them may collapse for equivalent treatment:

  • source alignment (1/4/8/16)
  • destination alignment (1/4/8/16)
  • src ^ dst interference (1/4/8/16)

But not as many are equivalent as one might hope; note Table 3-5 "Effect of Address Misalignment on Memcpy() Performance" on page 3-70 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual (April 2018 version), which suggests that, for the most perverse src ^ dst alignment cases, the destination should be given priority.

I also agree that the managed code should support both copy directions for the overlapped cases (see the sketch at the end of this comment). Note that the reverse direction--(ulong)(pDst - pSrc) < (uint)length--may still be slightly slower than forward. For one thing, in smaller-size reverse-direction overlap cases where rep movsq may be indicated, the Intel document notes:

"...setting the DF to force REP MOVSB to copy bytes from high towards low addresses will experience significant performance degradation." (p. 3-70)

Incidentally, the Intel document I cited has a wealth of information which is very relevant to this discussion, especially sections 3.7.5 through 3.7.6.3, as well as much discussion of SIMD, etc. tradeoffs. In the context of improving Buffer.Memmove, I think the points it raises should be as closely studied and deeply understood as possible.
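
To make the direction-selection point concrete, here is a bare-bones sketch (illustrative names, byte-at-a-time for clarity; a real candidate would of course work in blocks):

static unsafe void MemmoveSketch(byte* dst, byte* src, long len)
{
    if ((ulong)(dst - src) >= (ulong)len)
    {
        // dst precedes src, or the ranges do not overlap: forward copy is safe.
        for (long i = 0; i < len; i++)
            dst[i] = src[i];
    }
    else
    {
        // dst lies within [src, src + len): copy in reverse so source bytes
        // are read before they are overwritten.
        for (long i = len - 1; i >= 0; i--)
            dst[i] = src[i];
    }
}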

@jkotas
Member

jkotas commented May 12, 2018

The managed implementation is used only up to certain relatively small size to save the transition overhead. The C runtime implementation is used for larger sizes for which it becomes a lot more important to pay attention to things like alignment differences that you have mentioned. The assumption is that the C runtime implementation has handling for all these special cases so that we do not have to replicate all of them in C#.

Note that @vkvenkat and @fiigii work at Intel. I am sure that they were pretty well aware of Intel's own documentation on this subject and the trade-offs involved while working on what's currently in master.

@glenn-slayden
Author

glenn-slayden commented Jun 1, 2018

@jkotas commented:

The managed implementation is used only up to certain relatively small size to save the transition overhead.

Sure, but my experimentation so far shows that, thanks to the latest x64 JIT work and with precise use of System.Numerics.Vectors, it's actually possible to obtain JIT bytes that match (or even slightly improve upon) the unmanaged options currently used--across the entire size range. If my early numbers bear out, the evidence might argue for moving the threshold you mention from "relatively small" (currently 2048 bytes) to "not gigantic."
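
As a rough illustration of the kind of loop I mean (a sketch only, not my actual tuned candidate; it assumes System.Numerics and System.Runtime.CompilerServices.Unsafe, and handles only the forward, non-overlapping case):

using System.Numerics;
using System.Runtime.CompilerServices;

static unsafe void VectorCopyForward(byte* dst, byte* src, long len)
{
    long i = 0;
    int width = Vector<byte>.Count;   // 16 or 32 bytes depending on hardware

    // Main SIMD loop: unaligned vector load/store, which the JIT typically
    // lowers to plain unaligned SIMD moves.
    for (; i + width <= len; i += width)
        Unsafe.WriteUnaligned(dst + i, Unsafe.ReadUnaligned<Vector<byte>>(src + i));

    // Scalar tail for the remaining bytes.
    for (; i < len; i++)
        dst[i] = src[i];
}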

All the prominent native implementations of memcpy/memmove are included in my test battery: msvcrt!memcpy, ntdll!RtlMoveMemory (which is an alias for ntdll!memcpy), msvcr120_clr0400!memcpy, "Apex memmove", etc. In studying this wide range of native Windows options, and then a couple of GNU efforts as well, the only two "optimizations" that I find completely inaccessible to well-crafted IL are:

  1. the use of multi-byte NOP to ensure loop bytes don't straddle a cache line; and
  2. the use of prefetchnta for non-temporal operation, indicated for very large moves which are separated by more than a cache line.

On everything else, the JIT can end up emitting the exact same x64 bytes. For example, starting from live debugger bytes of ntdll.dll!memcpy I was able to deliberately get almost the exact same native image--except the prefetchnta--via precise use of C#.

Note that my evaluation harness also includes several fully managed alternatives, such as Array.Copy() (tested with byte[]; this one consistently scores very high, a surprise to me), System.Runtime.CompilerServices.Unsafe.Copy() and CopyUnaligned(), the System.Buffer APIs (both MemoryCopy and BlockCopy) as well as other Buffer.cs code from past and present, Mnemosyne, and finally of course an evolving contestant of my own.

Oh, and I almost forgot: two hilarious baseline versions--the following, plus its equivalent in managed pointers. (Spoiler alert: nothing to see here--at all.)

static unsafe void memcpy(byte* dst, byte* src, long cb)
{
    long i;
    if ((i = dst - src) == 0)    // same address: nothing to do
        return;

    if ((ulong)i < (ulong)cb)    // dst lies within (src, src + cb): copy backward
        for (dst += cb, src += cb; --cb >= 0;)
            *(--src + i) = *src;
    else                         // otherwise a forward byte-by-byte copy is safe
        while (--cb >= 0)
            *(src + i) = *src++;
}

The early results I mentioned suggest that exactingly-tuned managed code has the potential to essentially destroy the field, sometimes even across the board with regard to the input characteristics I'm independently logging. For example, in the following early test run, a method very similar to the current Buffer.cs code--but with its size threshold removed, and with two simple lines added to pre-align the destination to a 16-byte boundary--is trouncing all comers, native and cpblk alike, and even winning the 2MB size category, the largest measured for this run.
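
The pre-alignment amounts to something like the head loop in the following sketch (illustrative only; the measured candidate's exact code may differ and, as above, this handles only the non-overlapping case):

using System.Runtime.CompilerServices;

static unsafe void CopyWithAlignedDestination(byte* dst, byte* src, long len)
{
    // Head: copy single bytes until dst reaches a 16-byte boundary.
    long head = (16 - ((long)dst & 15)) & 15;
    if (head > len) head = len;
    for (long k = 0; k < head; k++)
        dst[k] = src[k];
    dst += head; src += head; len -= head;

    // Main loop: 16-byte blocks against an aligned destination, then a byte tail.
    long i = 0;
    for (; i + 16 <= len; i += 16)
        Unsafe.CopyBlockUnaligned(dst + i, src + i, 16);
    for (; i < len; i++)
        dst[i] = src[i];
}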

[image: memcpy results sample]

The columns in the output show independent properties of the 6,000 randomized copy operations that every unit must perform identically. For each of the 17, that's half a billion bytes per iteration. Categories include eleven logarithmically-sized buckets for copy size, five absolute alignment cases (each) for src and dst, five relative alignment cases for src^dst, and two cases for implied direction--but more importantly, two logging categories are broken out for the more specific overlap cases where direction actually matters. The latter hints at a point apparently lost on even some very serious online observers, as evidenced by at least one significant implementation--and a handful of unrelated discussions--asserting that the reverse direction must necessarily hold whenever src < dst.

Anyway, the impressive showing so far by several managed memcpy candidates has been quite surprising to me, and credit for this goes to some great work by the JIT and SIMD teams.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 17, 2020