'Buffer.Memmove' allow managed code path for compatible overlap cases #10282
Comments
The current implementation does not use uniform copy direction. How would you propose to revisit this?
Ok, I did not discern that from the C# implementation body. If that's truly the case for the managed code path in Buffer.Memmove...
@jkotas Because you say "the current implementation does not use uniform copy direction," I just wanted to make sure we properly understand what's at issue here in dotnet/coreclr#17886. Your remark, while possible, surprised me, so I checked the Buffer.Memmove source, and as far as I can tell, barring exceptional factors, its main loop does copy in a single forward direction.

So perhaps you were referring not to the precise specifics of the 'uniform copy direction' claim, but rather to the more practical usage of the phrase--that is, whether the code does, in fact, correctly allow one of the overlap cases. Here I found that, despite the forward copy direction, there is indeed an entangling issue in the current Buffer.cs code as written: it deploys potentially redundant copying for the final remainder bytes by right-abutting a final large-block move. It appears that it is just this clean-up code for the last remaining bytes, outside the loop body of a potentially large move, that foils eligibility for (one of the two) src/dst overlap cases.

Is my understanding correct, namely that this single issue of right-aligning the final remainder block--since it may overlap a previously copied block--is the only factor which foils what would otherwise allow enabling one of the overlap cases, and that this is what you meant when you assert that the "current implementation does not use uniform copy direction"?

If so, one would have to agree that sacrificing the managed code path for fully half of all overlapping input cases--in favor of a single cute right-alignment trick on just the final block--raises a serious question about the current cost/benefit judgment. Especially since the trick likely involves some amount of redundant--and probably unaligned--copying... but that's another story. I can provide more information on this issue, including my distilled-equivalent version of the Buffer.Memmove implementation.
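To make that concrete, here is a minimal sketch of the pattern in question (my own distillation, not the actual Buffer.cs source; the names and the 16-byte block size are illustrative):

```csharp
// Minimal sketch (illustrative, not the actual Buffer.cs code).
// A strictly forward copy is safe for overlap whenever dst <= src,
// but a right-abutted final block quietly breaks that guarantee.
struct Block16 { public long A, B; }          // one 16-byte chunk

static unsafe void ForwardCopyRightAlignedTail(byte* dst, byte* src, nuint len)
{
    const uint BlockSize = 16;                // assumes len >= BlockSize

    nuint i = 0;
    for (; i + BlockSize <= len; i += BlockSize)
        *(Block16*)(dst + i) = *(Block16*)(src + i);   // strictly forward

    if (i < len)
    {
        // Right-abutted tail: copies the window [len - 16, len), re-copying
        // up to 15 bytes the loop already handled. When dst < src, some of
        // those source bytes may have just been overwritten by the loop
        // above -- this is what forfeits the dst-below-src overlap case.
        *(Block16*)(dst + len - BlockSize) = *(Block16*)(src + len - BlockSize);
    }
}
```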
Right, the redundant copying is the problem. I do not mind fixing this - it is just not a simple tweak of the condition:
@jkotas Curious what you mean by "performance trap." Wouldn't making half the cases better always be preferable to keeping them all worse?
I think one would expect that copying from a higher address to a lower address has about the same performance as copying from a lower address to a higher one. If one is significantly faster than the other, it is a performance cliff that tends to take people by surprise. It is sometimes better to have consistently worse performance than to have uneven performance. This case may not be the best example, but it is best to avoid the cliff when you have a choice.
Ok. Since you mentioned that you are willing to improve this code, at some point I will try to post a more detailed proposal for your consideration here in this thread. In addition to your two points, I think the code should give some consideration to the input alignment characteristics. A starting point for the set of possible alignment cases is the cross-product of all of the following, which represents 64 cases, although some of them may collapse for equivalent treatment:
But not as many are equivalent as one might hope; note Table 3-5, "Effect of Address Misalignment on Memcpy() Performance," on page 3-70 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual (April 2018 version), which suggests that, for the most perverse misalignment combinations, throughput can degrade substantially.

I also agree that the managed code should support both copy directions for the overlapped cases. Note that the reverse direction is exactly what the other (currently excluded) half of the overlap cases requires.
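For completeness, here is what direction selection could look like; a minimal sketch, byte-at-a-time for clarity (a real implementation would of course move wider blocks):

```csharp
// Illustrative sketch: pick the copy direction from the overlap geometry,
// keeping BOTH overlap halves eligible for a managed copy path.
static unsafe void MemmoveSketch(byte* dst, byte* src, nuint len)
{
    if (dst <= src || dst >= src + len)
    {
        // Forward copy: safe when dst is at/below src, or when the
        // ranges do not overlap at all.
        for (nuint i = 0; i < len; i++)
            dst[i] = src[i];
    }
    else
    {
        // Backward copy: required when src < dst < src + len, since a
        // forward copy would overwrite source bytes before reading them.
        for (nuint i = len; i > 0; i--)
            dst[i - 1] = src[i - 1];
    }
}
```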
Incidentally, the Intel document I cited has a wealth of information which is very relevant to this discussion, especially sections 3.7.5 through 3.7.6.3, as well as much discussion of SIMD tradeoffs and the like--all worth considering in the context of improving Buffer.Memmove.
The managed implementation is used only up to a certain relatively small size, to save the transition overhead. The C runtime implementation is used for larger sizes, for which it becomes a lot more important to pay attention to things like the alignment differences you have mentioned. The assumption is that the C runtime implementation has handling for all these special cases so that we do not have to replicate them in C#. Note that @vkvenkat and @fiigii work at Intel. I am sure that they were well aware of Intel's own documentation on this subject and the trade-offs involved while working on what's currently in master.
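So, in rough outline, the dispatch being described is something like the following sketch (the threshold constant and the P/Invoke declaration are illustrative assumptions, not CoreCLR's actual values; the native call here assumes Windows' msvcrt):

```csharp
using System.Runtime.InteropServices;

static unsafe class MemmoveDispatchSketch
{
    const uint PInvokeThreshold = 2048;       // illustrative cutoff only

    [DllImport("msvcrt", CallingConvention = CallingConvention.Cdecl)]
    static extern void* memmove(void* dest, void* src, nuint count);

    public static void Copy(byte* dest, byte* src, nuint len)
    {
        if (len > PInvokeThreshold)
        {
            // Large copies: pay the managed-to-native transition once and
            // let the C runtime handle alignment/overlap/SIMD special cases.
            memmove(dest, src, len);
            return;
        }

        // Small copies: stay in managed code to avoid the transition cost
        // (placeholder loop; the real code moves wide blocks).
        for (nuint i = 0; i < len; i++)
            dest[i] = src[i];
    }
}
```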
@jkotas commented:
Sure, but my experimentation so far shows that, thanks to the latest x64 JIT work and with precise use of the available unsafe constructs, managed code can go toe-to-toe with all the prominent native implementations of memcpy/memmove. On everything else, the JIT can end up emitting the exact same x64 bytes. For example, starting from live debugger bytes of ntdll.dll!memcpy I was able to deliberately get almost the exact same native image.

Note that my evaluation harness also includes several fully managed alternatives. Oh, and I almost forgot, two hilarious baseline versions: a trivial byte-at-a-time loop, plus its equivalent in managed pointers. (Spoiler alert: nothing to see here--at all)
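For scale, such a naive baseline is just this kind of thing (a minimal illustrative sketch, not necessarily the exact code in the harness; the managed-pointer twin uses System.Runtime.CompilerServices.Unsafe ref arithmetic):

```csharp
using System.Runtime.CompilerServices;

// Hilariously naive baselines, useful only as a performance floor.
static void NaiveCopy(byte[] dst, byte[] src, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i];                       // bounds-checked, byte at a time
}

// The "managed pointers" twin: the same loop expressed with ref arithmetic.
static void NaiveCopyManagedPointers(ref byte dst, ref byte src, int len)
{
    for (int i = 0; i < len; i++)
        Unsafe.Add(ref dst, i) = Unsafe.Add(ref src, i);
}
```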
The early results I mentioned show that exactingly-tuned managed code has the potential to essentially destroy the field, sometimes even across-the-board with regard to the input characteristics I'm independently logging. For example, in one early test run, a method very similar to the current Buffer.Memmove fares very well.

The columns in the output show independent properties of the 6,000 randomized copy operations every unit identically must perform. For each of the 17 units, that's half-a-billion bytes per iteration. Categories include eleven logarithmically-sized buckets for copy size and five absolute alignment cases (each) for source and destination.

Anyway, the impressive showing so far by several managed versions speaks for itself.
(This issue was originally mentioned here)
Commit c6372c5 changed how overlap between the source and destination is detected in Buffer.Memmove:

https://github.com/dotnet/coreclr/blob/ea9bee5ac2f96a1ea6b202dc4094b8d418d9209c/src/mscorlib/src/System/Buffer.cs#L262

...became...

https://github.com/dotnet/coreclr/blob/c6372c5bfebd61470ea5d111f224d035b1d2ebdf/src/mscorlib/src/System/Buffer.cs#L271
The original code may seem puzzling since the unsigned subtraction and compare "incorrectly" fails to identify source/destination overlap in the case where the destination (address) precedes the source (address) in memory. Indeed c6372c5 clarifies this and would also seem to fix a bug, but how did the original work anyway?
The answer, one eventually realizes, is that the original code was stealthily incorporating a strong assumption about--and dependency upon--the copy direction of the managed Buffer.Memmove implementation which followed it.

The unfortunate upshot is that, while c6372c5 certainly makes the code clearer and more readable, it now excludes all overlap cases and thus no longer allows the performance benefit of fully-managed operation (i.e., non-P/Invoke) in the 50% of cases where the overlap happens to be compatible with the copy direction implemented in Buffer.cs. I blame the original code for not documenting its entirely non-obvious design and operation.
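In sketch form (a paraphrase of the two checks, not the literal Buffer.cs lines), the trick works like this:

```csharp
// Paraphrased overlap tests (illustrative, not the literal Buffer.cs source).
// All arithmetic is unsigned, so when dest precedes src the subtraction
// wraps around to a huge value and the comparison succeeds.
static unsafe bool ForwardCopyIsSafe(byte* dest, byte* src, nuint len)
{
    // Original-style test: sends only src < dest < src + len to P/Invoke --
    // exactly the one case a forward copy would corrupt. When dest <= src,
    // (dest - src) wraps to a value >= len, so that overlap half stays
    // on the managed path.
    return (nuint)(dest - src) >= len;
}

static unsafe bool HasNoOverlapEitherWay(byte* dest, byte* src, nuint len)
{
    // c6372c5-style test: rejects overlap in both directions. Clearer, but
    // it forfeits the managed path for the forward-compatible half.
    return (nuint)(dest - src) >= len && (nuint)(src - dest) >= len;
}
```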
Anyway, can we revisit this line (re-relax the overlap test) to restore the optimization that was lost here? That is, once again allow the managed code path to prevail instead of P/Invoke for the half of the detected overlaps which happen to comport with the copy direction that's hard-coded into Buffer.Memmove?