
'Buffer.Memmove' allow managed code path for compatible overlap cases #10282

Closed
glenn-slayden opened this issue May 4, 2018 · 10 comments
Labels
area-System.Runtime enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Comments

@glenn-slayden

(This issue was originally mentioned here)

Commit c6372c5 changed how overlap between the source and destination is detected in Buffer.Memmove:

https://github.com/dotnet/coreclr/blob/ea9bee5ac2f96a1ea6b202dc4094b8d418d9209c/src/mscorlib/src/System/Buffer.cs#L262

...became...

https://github.com/dotnet/coreclr/blob/c6372c5bfebd61470ea5d111f224d035b1d2ebdf/src/mscorlib/src/System/Buffer.cs#L271

The original code may seem puzzling since the unsigned subtraction and compare "incorrectly" fails to identify source/destination overlap in the case where the destination (address) precedes the source (address) in memory. Indeed c6372c5 clarifies this and would also seem to fix a bug, but how did the original work anyway?

The answer, one eventually realizes, is that the original code was stealthily incorporating a strong assumption about--and dependency upon--the copy direction of the managed Buffer.Memmove implementation which followed it.
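
For concreteness, here is a minimal sketch of that trick as I read it (illustrative names, not the actual Buffer.cs code). With an unsigned difference, a destination that precedes the source wraps around to a huge value, so the test only flags the one overlap case that a strictly forward copy would actually corrupt:

static unsafe bool ForwardCopyWouldCorrupt(byte* pSrc, byte* pDst, ulong len)
{
    // If pDst < pSrc the subtraction wraps to a value >= len, the test is
    // false, and the managed forward copy is allowed to proceed; the test is
    // true only when pDst lies within [pSrc, pSrc + len).
    return (ulong)(pDst - pSrc) < len;
}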

The unfortunate upshot is that, while c6372c5 certainly makes the code clearer and more readable, it now excludes all overlap cases and thus no longer allows the performance benefit of fully-managed operation (i.e., non-P/Invoke) in the 50% of cases where the overlap happens to be compatible with the copy direction implemented in Buffer.cs. I blame the original code for not documenting its entirely non-obvious design and/or operation.

Anyway, can we revisit this line (re-relax the overlap test) to restore the optimization that was lost here? That is, once again allow the managed code path to prevail instead of P/Invoke for the half of the detected overlaps which happen to comport with the copy direction that's hard-coded into Buffer.Memmove?

@glenn-slayden glenn-slayden changed the title 'Buffer.Memmove' allow managed code path for compatible overlap cases 'Buffer.Memmove' (re-)allow managed code path for compatible overlap cases May 4, 2018
@glenn-slayden glenn-slayden changed the title 'Buffer.Memmove' (re-)allow managed code path for compatible overlap cases 'Buffer.Memmove' allow managed code path for compatible overlap cases May 4, 2018
@RussKeldorph
Contributor

@jkotas @vkvenkat @fiigii

@jkotas
Member

jkotas commented May 4, 2018

The current implementation does not use uniform copy direction.

How would you propose to revisit this?

@glenn-slayden
Author

Ok, I did not discern that from the C# implementation body. If that's truly the case for the managed code path in Buffer.Memmove, then sorry for the trouble and feel free to close this issue.

@glenn-slayden
Author

glenn-slayden commented May 12, 2018

@jkotas Because you say "the current implementation does not use uniform copy direction," I just wanted to make sure you properly understood what's at issue here in dotnet/coreclr#17886.

Your remark, while it may well be accurate, surprised me, so I checked the HAS_CUSTOM_BLOCKS (x64) code path in buffer.cs and found that, seemingly contrary to your claim, it does manifest a forward copy direction: it always progresses forward through the (addresses of the) 16- or 64-byte blocks it processes.

And as we know, barring exceptional factors, memcpy/memmove code which demonstrates either a consistent forward--or reverse--copy direction is eligible to allow, with full correctness, exactly one out of the two possible cases of source/destination overlap.

So perhaps you were referring not to the precise specifics of the "uniform copy direction" claim, but rather to the more practical sense of the phrase--that is, whether the code does, in fact, correctly allow one of the overlap cases. Here I found that, despite the forward copy direction, there is an entangling issue in the current buffer.cs code as written: it performs potentially redundant copying of the final remainder bytes by right-abutting a final large-block move. It appears to be just this clean-up code for the last remaining bytes, outside the loop body of a potentially large move, that foils eligibility for (one of the two) src/dst overlap cases.

Is my understanding correct, namely that this single issue of right-aligning the final remainder block--since it may overlap a previously copied block--is the only factor that defeats what would otherwise allow enabling one of the overlap cases, and that this is what you meant when you asserted that the "current implementation does not use uniform copy direction"?

If so, one would have to agree that sacrificing the managed code path for fully half of all overlapping input cases--in favor of a single cute right-alignment trick on just the final block--raises a serious question about the current cost/benefit judgment. Especially since the trick likely involves some amount of redundant--and probably unaligned--copying... but that's another story.

I can provide more information on this issue, including my distilled-equivalent version of the HAS_CUSTOM_BLOCKS code path (which is much easier to follow), if requested.
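
In the meantime, here is a rough hand-written sketch of the pattern I'm describing (my own illustration of my reading of the code, not the actual Buffer.cs body); it assumes len >= BlockSize and uses System.Runtime.CompilerServices.Unsafe:

using System.Runtime.CompilerServices;

static unsafe void ForwardCopyWithAbuttedTail(byte* dst, byte* src, long len)
{
    const long BlockSize = 16;   // stand-in for the 16/64-byte custom blocks

    long i = 0;
    for (; i + BlockSize <= len; i += BlockSize)
        Unsafe.CopyBlockUnaligned(dst + i, src + i, (uint)BlockSize);

    // Clean-up: one final full block, right-abutted to the end of the range.
    // It re-copies up to BlockSize - 1 bytes that were already written, which
    // is harmless when src and dst do not overlap but defeats forward-overlap
    // correctness.
    if (i < len)
        Unsafe.CopyBlockUnaligned(dst + (len - BlockSize), src + (len - BlockSize), (uint)BlockSize);
}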

@jkotas
Member

jkotas commented May 12, 2018

Right, the redundant copying is the problem.

I do not mind fixing this - it is just not a simple tweak of the condition:

  • The fix should have zero effect on the existing non-overlapping cases.
  • The fix should ideally make all overlapping cases faster. The old code that made half of the overlapping cases fast was a potential performance trap.

@glenn-slayden
Author

glenn-slayden commented May 12, 2018

@jkotas Curious what you mean by "performance trap." Wouldn't making half the cases better always be preferable to keeping them all worse?

@jkotas
Member

jkotas commented May 12, 2018

I think one would expect that copying from higher address to lower address has about the same performance as copying from lower address to higher address. If one is significantly faster than other, it is a performance cliff that tends to take people by surprise.

It is sometimes better to have consistently worse performance than to have uneven performance. This case may not be the best example, but it is best to avoid it when you have a choice.

@glenn-slayden
Author

Ok. Since you mentioned that you are willing to improve this code, at some point I will try to post a more detailed proposal for your consideration here in this thread.

In addition to your two points, I think the code should give some consideration to the input alignment characteristics. A starting point for the set of possible alignment cases is the cross-product of all the following. This represents 64 cases, although some of them may collapse for equivalent treatment:

  • source alignment (1/4/8/16)
  • destination alignment (1/4/8/16)
  • src ^ dst interference (1/4/8/16)

But not as many are equivalent as one might hope; note Table 3-5 "Effect of Address Misalignment on Memcpy() Performance" on page 3-70 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual (April 2018 version), which suggests that, for the most perverse src ^ dst alignment cases, the destination should be given priority.

I also agree that the managed code should support both copy directions for the overlapped cases (see the sketch at the end of this comment). Note that the reverse direction--(ulong)(pDst - pSrc) < (uint)length--may still be slightly slower than forward. For one thing, in smaller-size reverse-direction overlap cases where rep movsq may be indicated, the Intel document notes:

"...setting the DF to force REP MOVSB to copy bytes from high towards low addresses will experience significant performance degradation." (p. 3-70)

Incidentally, the Intel document I cited has a wealth of information which is very relevant to this discussion, especially sections 3.7.5 through 3.7.6.3, as well as much discussion of SIMD, etc. tradeoffs. In the context of improving Buffer.Memmove, I think the points it raises should be as closely studied and deeply understood as possible.
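
To make the direction-selection point concrete, here is a bare-bones sketch (illustrative names, byte-at-a-time for clarity; a real candidate would of course work in blocks):

static unsafe void MemmoveSketch(byte* dst, byte* src, long len)
{
    if ((ulong)(dst - src) >= (ulong)len)
    {
        // dst precedes src, or the ranges do not overlap: forward copy is safe.
        for (long i = 0; i < len; i++)
            dst[i] = src[i];
    }
    else
    {
        // dst lies within [src, src + len): copy in reverse so source bytes
        // are read before they are overwritten.
        for (long i = len - 1; i >= 0; i--)
            dst[i] = src[i];
    }
}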

@jkotas
Member

jkotas commented May 12, 2018

The managed implementation is used only up to certain relatively small size to save the transition overhead. The C runtime implementation is used for larger sizes for which it becomes a lot more important to pay attention to things like alignment differences that you have mentioned. The assumption is that the C runtime implementation has handling for all these special cases so that we do not have to replicate all of them in C#.

Note that @vkvenkat and @fiigii work at Intel. I am sure that they were pretty well aware of Intel's own documentation on this subject and the trade-offs involved while working on what's currently in master.

@glenn-slayden
Author

glenn-slayden commented Jun 1, 2018

@jkotas commented:

The managed implementation is used only up to certain relatively small size to save the transition overhead.

Sure, but my experimentation so far shows that, thanks to the latest x64 JIT work and with precise use of System.Numerics.Vectors, it's actually possible to obtain JIT bytes that match (or even slightly improve upon) the unmanaged options currently used--across the entire size range. If my early numbers bear out, the evidence might argue for moving the threshold you mention from "relatively small" (currently 2048 bytes) to "not gigantic."
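
As a rough illustration of the kind of loop I mean (a sketch only, not my actual tuned candidate; it assumes System.Numerics and System.Runtime.CompilerServices.Unsafe, and handles only the forward, non-overlapping case):

using System.Numerics;
using System.Runtime.CompilerServices;

static unsafe void VectorCopyForward(byte* dst, byte* src, long len)
{
    long i = 0;
    int width = Vector<byte>.Count;   // 16 or 32 bytes depending on hardware

    // Main SIMD loop: unaligned vector load/store, which the JIT typically
    // lowers to plain unaligned SIMD moves.
    for (; i + width <= len; i += width)
        Unsafe.WriteUnaligned(dst + i, Unsafe.ReadUnaligned<Vector<byte>>(src + i));

    // Scalar tail for the remaining bytes.
    for (; i < len; i++)
        dst[i] = src[i];
}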

All the prominent native implementations of memcpy/memmove are included in my test battery: msvcrt!memcpy, ntdll!RtlMoveMemory (which is an alias for ntdll!memcpy), msvcr120_clr0400!memcpy, "Apex memmove", etc. In studying this wide range of native Windows options, and then a couple of GNU efforts as well, the only two "optimizations" that I find completely inaccessible to well-crafted IL are:

  1. the use of multi-byte NOP to ensure loop bytes don't straddle a cache line; and
  2. the use of prefetchnta for non-temporal operation, indicated for very large moves which are separated by more than a cache line.

On everything else, the JIT can end up emitting the exact same x64 bytes. For example, starting from live debugger bytes of ntdll.dll!memcpy I was able to deliberately get almost the exact same native image--except the prefetchnta--via precise use of C#.

Note that my evaluation harness also includes several fully managed alternatives, such as Array.Copy() (tested with byte[]; this one consistently scores very high, a surprise to me), System.Runtime.CompilerServices.Unsafe.Copy() and CopyUnaligned(), the System.Buffer APIs (both MemoryCopy and BlockCopy) as well as other Buffer.cs code from past and present, Mnemosyne, and finally of course an evolving contestant of my own.

Oh, and I almost forgot: two hilarious baseline versions--the following, plus its equivalent in managed pointers. (Spoiler alert: nothing to see here--at all.)

static unsafe void memcpy(byte* dst, byte* src, long cb)
{
    long i;
    if ((i = dst - src) == 0)    // same address: nothing to do
        return;

    if ((ulong)i < (ulong)cb)    // dst lies within (src, src + cb): copy backward
        for (dst += cb, src += cb; --cb >= 0;)
            *(--src + i) = *src;
    else                         // otherwise a forward byte-by-byte copy is safe
        while (--cb >= 0)
            *(src + i) = *src++;
}

The early results I mentioned suggest that exactingly-tuned managed code has the potential to essentially destroy the field, sometimes even across the board with regard to the input characteristics I'm independently logging. For example, in the following early test run, a method very similar to the current Buffer.cs code--but with its size threshold removed, and with two simple lines added to pre-align the destination to a 16-byte boundary--is trouncing all comers, native and cpblk alike, and even winning the 2MB size category, the largest measured for this run.
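
The pre-alignment amounts to something like the head loop in the following sketch (illustrative only; the measured candidate's exact code may differ and, as above, this handles only the non-overlapping case):

using System.Runtime.CompilerServices;

static unsafe void CopyWithAlignedDestination(byte* dst, byte* src, long len)
{
    // Head: copy single bytes until dst reaches a 16-byte boundary.
    long head = (16 - ((long)dst & 15)) & 15;
    if (head > len) head = len;
    for (long k = 0; k < head; k++)
        dst[k] = src[k];
    dst += head; src += head; len -= head;

    // Main loop: 16-byte blocks against an aligned destination, then a byte tail.
    long i = 0;
    for (; i + 16 <= len; i += 16)
        Unsafe.CopyBlockUnaligned(dst + i, src + i, 16);
    for (; i < len; i++)
        dst[i] = src[i];
}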

[image: memcpy results sample]

The columns in the output show independent properties of the 6,000 randomized copy operations that every unit must perform identically. For each of the 17, that's half a billion bytes per iteration. Categories include eleven logarithmically-sized buckets for copy size, five absolute alignment cases (each) for src and dst, five relative alignment cases for src^dst, and two cases for implied direction--but more importantly, two logging categories are broken out for the more specific overlap cases where direction actually matters. The latter hints at a point apparently lost on even some very serious online observers, as evidenced by at least one significant implementation--and a handful of unrelated discussions--asserting that the reverse direction must necessarily hold whenever src < dst.

Anyway, the impressive showing so far by several managed memcpy candidates has been quite surprising to me, and credit for this goes to some great work by the JIT and SIMD teams.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 17, 2020