-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Improve span copy of pointers and structs containing pointers #9999
Conversation
Change was to call into memmoveGCRefs when total byte count to copy is over a threshold. Threshold was determined by measurement. Left: before change, right: after change. Scores are iterations per millisecond. In this test, each iteration copies that many bytes (00008 mean 8 bytes total, so one pointer) (Span of class). Total is geometric mean. Numbers were taken with x64 build on Windows. Copy non-overlapping spans (destination starts at a higher address than source):
Copy backwards (destination buffer starts inside the source buffer, overlapping the last pointer):
|
I wasn't able to avoid the small regressions in the first few buckets. I tried moving the call to the bottom using goto, such that the additional check would/should be the only extra cost for those buckets, but it made everything much slower for some reason that doesn't make sense to me. |
There seems to be no change in perf in copying spans of types that don't contain references, just some noise. |
src/vm/comutilnative.cpp
Outdated
{ | ||
QCALL_CONTRACT; | ||
|
||
if (!IS_ALIGNED(dst, sizeof(dst)) || !IS_ALIGNED(src, sizeof(src))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never happen. It should be assert instead.
[MethodImpl(MethodImplOptions.NoInlining)] | ||
internal unsafe static bool RhCopyMemoryWithReferences<T>(ref T destination, ref T source, int elementCount) | ||
{ | ||
fixed (void* destinationPtr = &Unsafe.As<T, byte>(ref destination)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not need to pin these for the fcall - the fcall can take ref byte
directly.
src/mscorlib/src/System/Span.cs
Outdated
} | ||
} | ||
} | ||
else | ||
{ | ||
if ((nuint)elementsCount * (nuint)Unsafe.SizeOf<T>() >= 128 && | ||
RuntimeImports.RhCopyMemoryWithReferences(ref destination, ref source, elementsCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we need special casing for small sizes here. The FCall should be fine for all case - if it is called directly from here; and avoids unnecessary layers internally.
src/vm/comutilnative.cpp
Outdated
{ | ||
QCALL_CONTRACT; | ||
|
||
memset(dst, 0, length); | ||
} | ||
|
||
bool QCALLTYPE MemoryNative::CopyWithReferences(void *dst, void *src, size_t byteCount) | ||
{ | ||
QCALL_CONTRACT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot be QCall. You have to be in GC cooperative mode to do the copy of object references.
// Non-inlinable wrapper around the QCall that avoids poluting the fast path | ||
// with P/Invoke prolog/epilog. | ||
[MethodImpl(MethodImplOptions.NoInlining)] | ||
internal unsafe static bool RhCopyMemoryWithReferences<T>(ref T destination, ref T source, int elementCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called RhBulkMoveWithWriteBarrier in CoreRT - it maybe nice to call it the same.
src/vm/comutilnative.cpp
Outdated
return false; | ||
} | ||
|
||
memmoveGCRefs(dst, src, byteCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look how RhBulkMoveWithWriteBarrier is implemented in CoreRT to get everything 100% inlined and avoid any unnecessary calls. It would be nice to do the same here for best perf.
src/vm/comutilnative.cpp
Outdated
} | ||
|
||
memmoveGCRefs(dst, src, byteCount); | ||
return true; | ||
InlinedSetCardsAfterBulkCopy(reinterpret_cast<Object **>(dst), byteCount); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add FC_GC_POLL();
here to avoid GC starvation.
LGTM modulo one small comment. |
Sorry did not mean to push these commits, don't know how they got pushed... I'm still testing with various combinations as I haven't yet found a satisfactory solution |
If you are running into problems with Unsafe.Add not being inlined ... don't worry about it, the JIT folks are looking into it (it is a problem in other places too). I think what you got is how I think the implementation should look like. |
Thanks, made all of the changes from above and tweaked some things while I was comparing different versions. I still had to handle the one element case inline to avoid a ~7% regression there. Copy forwards:
Copy backwards:
|
92f53d3
to
7f9dd4a
Compare
SIZE_T *dptr = (SIZE_T *)dest; | ||
SIZE_T *sptr = (SIZE_T *)src; | ||
|
||
if ((len & sizeof(SIZE_T)) != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be aligning the destination pointer here. The misaligned writes have some penalty that would be best to avoid for longer copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it is worth doing the alignment only on AMD64 (or other platforms where we will have custom implementations). It may look better to put the AMD64 implementation under one big ifdef.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is there a similar penalty for unaligned reads, and if so, is it worse for writes? I couldn't find much info on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, There is some penalty for reads too, but it is worse for writes.
_mm_storeu_ps((float *)dptr, v); | ||
#else // !_AMD64_ || FEATURE_PAL | ||
// Read two values and write two values to hint the use of wide loads and stores | ||
SIZE_T p[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen this array hint to work anywhere? I would be surprised if there is a compiler smart enough to pick it up.
I think writing it this way would be equivalent and less magic:
SIZE_T t0 = sptr[0];
SIZE_T t1 = sptr[1];
dptr[0] = t0;
dptr[1] = t1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like MSVC and clang/llvm are taking the hint, I'll change it though
Fixes #9161 PR dotnet#9786 fixes perf of span copy of types that don't contain references
Added TODO, leaving fixing that for a separate PR
Made changes from above and enabled xmm intrinsics path on Unix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Should the CopyTo method be inlined? [MethodImpl(MethodImplOptions.AggressiveInlining)] public void CopyTo(Span<T> destination) Same with TryCopyTo? public bool TryCopyTo(Span<T> destination) |
These are small enough that the JIT should take care of inlining these without any special hints. |
…#9999) Improve span copy of pointers and structs containing pointers Fixes #9161 PR dotnet#9786 fixes perf of span copy of types that don't contain references
@kouvel do you still have the test harness you used for improving this code? |
Sorry for the delay, I have shared it here along with the test code I used: https://1drv.ms/f/s!AvtRwG9CobRTpl3WXmqgieJ6n5_f Edit Root\PerfTester\Run.bat first based on the instructions there, and then you should be able to run that and collect numbers. |
Fixes #9161
PR #9786 fixes perf of span copy of types that don't contain references