This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve span copy of pointers and structs containing pointers #9999

Merged
merged 7 commits into dotnet:master from the SpanCopyPerf branch on Mar 9, 2017

Conversation

@kouvel kouvel (Member) commented Mar 7, 2017

Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references

@kouvel kouvel (Member, Author) commented Mar 7, 2017

The change is to call into memmoveGCRefs when the total byte count to copy is over a threshold; the threshold was determined by measurement.
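
For background, here is a rough, self-contained sketch of what a GC-aware bulk move in the spirit of memmoveGCRefs has to do: move pointer-sized words (so each object reference is written in one piece), pick the copy direction based on overlap, and then update the GC card table for the destination range. The helper name for the card-table step is a placeholder; this is illustrative, not the actual CoreCLR implementation.

#include <cstddef>
#include <cstdint>

// Placeholder for the card-table update the VM performs after a bulk copy
// (InlinedSetCardsAfterBulkCopy in CoreCLR); omitted in this sketch.
static void SetCardsAfterBulkCopySketch(void** dst, size_t byteCount)
{
    (void)dst;
    (void)byteCount;
}

// Sketch: byteCount is assumed to be a multiple of the pointer size.
static void MemmoveGCRefsSketch(void* dst, void* src, size_t byteCount)
{
    uintptr_t* d = (uintptr_t*)dst;
    uintptr_t* s = (uintptr_t*)src;
    size_t words = byteCount / sizeof(uintptr_t);

    if (d <= s || (uint8_t*)d >= (uint8_t*)s + byteCount)
    {
        // No harmful overlap: copy forwards, one pointer-sized word at a time.
        for (size_t i = 0; i < words; i++)
            d[i] = s[i];
    }
    else
    {
        // Destination overlaps the tail of the source: copy backwards.
        for (size_t i = words; i > 0; i--)
            d[i - 1] = s[i - 1];
    }

    // Mark destination cards so the GC sees any cross-generation references
    // the copy may have created.
    SetCardsAfterBulkCopySketch((void**)dst, byteCount);
}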

Left: before the change, right: after the change. Scores are iterations per millisecond. In this test, each iteration copies the number of bytes in the row label (00008 means 8 bytes total, so one pointer), using a Span of a class type. Total is the geometric mean. Numbers were taken with an x64 build on Windows.

Copy non-overlapping spans (destination starts at a higher address than source):

Span copy pointer  Left score        Right score       ∆ Score   ∆ Score %  Comment
-----------------  ----------------  ----------------  --------  ---------  ---------
00008              108056.52 ±0.28%  103397.54 ±0.12%  -4658.99     -4.31%  Regressed
00016               81449.85 ±0.04%   78057.45 ±0.11%  -3392.40     -4.17%  Regressed
00032               53705.85 ±0.03%   52066.60 ±0.12%  -1639.25     -3.05%  Regressed
00064               31826.45 ±0.14%   31330.87 ±0.04%   -495.58     -1.56%  Regressed
00128               17565.26 ±0.16%   25008.12 ±0.23%   7442.86     42.37%  Improved
00256                8894.93 ±0.12%   19730.60 ±0.12%  10835.68    121.82%  Improved
00512                4654.36 ±0.23%   14095.66 ±0.25%   9441.31    202.85%  Improved
01024                2384.89 ±0.12%    9928.56 ±0.28%   7543.66    316.31%  Improved
04096                 610.08 ±0.10%    3745.38 ±0.11%   3135.30    513.91%  Improved
16384                 150.82 ±0.28%     924.38 ±0.25%    773.56    512.89%  Improved
65536                  38.20 ±0.12%     226.47 ±0.23%    188.27    492.85%  Improved
-----------------  ----------------  ----------------  --------  ---------  ---------
Total                5294.86 ±0.15%   11954.92 ±0.17%   6660.05    125.78%  Improved

Copy backwards (destination buffer starts inside the source buffer, overlapping the last pointer):

Span copy pointer backwards  Left score        Right score       ∆ Score   ∆ Score %  Comment
---------------------------  ----------------  ----------------  --------  ---------  ---------
00008                        161814.27 ±0.01%  168873.46 ±0.08%   7059.19      4.36%  Improved
00016                         81441.87 ±0.07%   77945.68 ±0.04%  -3496.18     -4.29%  Regressed
00032                         53622.73 ±0.04%   52146.69 ±0.01%  -1476.04     -2.75%  Regressed
00064                         31827.17 ±0.20%   31277.84 ±0.02%   -549.33     -1.73%  Regressed
00128                         17580.56 ±0.03%   27326.68 ±0.31%   9746.12     55.44%  Improved
00256                          8889.00 ±0.21%   24177.97 ±0.02%  15288.97    172.00%  Improved
00512                          4660.67 ±0.07%   20342.95 ±0.20%  15682.28    336.48%  Improved
01024                          2394.70 ±0.02%   14433.39 ±0.24%  12038.69    502.72%  Improved
04096                           609.64 ±0.06%    3305.12 ±0.23%   2695.47    442.14%  Improved
16384                           152.82 ±0.12%     895.95 ±0.09%    743.13    486.28%  Improved
65536                            38.31 ±0.03%     231.31 ±0.45%    193.00    503.82%  Improved
---------------------------  ----------------  ----------------  --------  ---------  ---------
Total                          5502.48 ±0.08%   13561.18 ±0.15%   8058.70    146.46%  Improved

@kouvel kouvel (Member, Author) commented Mar 7, 2017

I wasn't able to avoid the small regressions in the first few buckets. I tried moving the call to the bottom using a goto, so that the additional check should be the only extra cost for those buckets, but it made everything much slower, for a reason that doesn't make sense to me.

@kouvel kouvel (Member, Author) commented Mar 7, 2017

There seems to be no change in perf in copying spans of types that don't contain references, just some noise.

@kouvel kouvel (Member, Author) commented Mar 7, 2017

@jkotas @ahsonkhan

{
    QCALL_CONTRACT;

    if (!IS_ALIGNED(dst, sizeof(dst)) || !IS_ALIGNED(src, sizeof(src)))
@jkotas jkotas (Member) Mar 7, 2017

This should never happen. It should be an assert instead.
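
In other words, something along these lines (a sketch using plain C++ asserts for illustration; the VM code would use _ASSERTE and IS_ALIGNED instead):

#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: misaligned src/dst would indicate a bug in the managed caller,
// so check it only in debug builds instead of branching on it at runtime.
static bool IsPointerAligned(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) & (sizeof(void*) - 1)) == 0;
}

void CopyWithReferencesSketch(void* dst, void* src, size_t byteCount)
{
    assert(IsPointerAligned(dst) && IsPointerAligned(src));
    // ... copy proceeds unconditionally ...
    (void)byteCount;
}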

[MethodImpl(MethodImplOptions.NoInlining)]
internal unsafe static bool RhCopyMemoryWithReferences<T>(ref T destination, ref T source, int elementCount)
{
    fixed (void* destinationPtr = &Unsafe.As<T, byte>(ref destination))
Member

You do not need to pin these for the fcall - the fcall can take ref byte directly.

}
}
}
else
{
if ((nuint)elementsCount * (nuint)Unsafe.SizeOf<T>() >= 128 &&
RuntimeImports.RhCopyMemoryWithReferences(ref destination, ref source, elementsCount))
Member

I do not think we need special casing for small sizes here. The FCall should be fine for all cases if it is called directly from here, and it avoids unnecessary layers internally.

{
    QCALL_CONTRACT;

    memset(dst, 0, length);
}

bool QCALLTYPE MemoryNative::CopyWithReferences(void *dst, void *src, size_t byteCount)
{
    QCALL_CONTRACT;
Member

This cannot be a QCall. You have to be in GC cooperative mode to copy object references.

// Non-inlinable wrapper around the QCall that avoids polluting the fast path
// with P/Invoke prolog/epilog.
[MethodImpl(MethodImplOptions.NoInlining)]
internal unsafe static bool RhCopyMemoryWithReferences<T>(ref T destination, ref T source, int elementCount)
Member

It is called RhBulkMoveWithWriteBarrier in CoreRT - it may be nice to call it the same.

return false;
}

memmoveGCRefs(dst, src, byteCount);
Member

Take a look at how RhBulkMoveWithWriteBarrier is implemented in CoreRT to get everything 100% inlined and avoid any unnecessary calls. It would be nice to do the same here for best perf.

}

memmoveGCRefs(dst, src, byteCount);
return true;
InlinedSetCardsAfterBulkCopy(reinterpret_cast<Object **>(dst), byteCount);
}
Member

It would be nice to add FC_GC_POLL(); here to avoid GC starvation.
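
Putting this together with the earlier note about cooperative mode: an FCall-style entry point runs in GC cooperative mode, so it can copy object references directly, and polling at the end keeps a pending GC from being starved. The sketch below uses CoreCLR's FCall macros but is only an approximation of what such a method might look like, not the PR's final code; the method name reuses MemoryNative::CopyWithReferences from the snippet above, and this is VM-internal code that cannot compile standalone.

// Sketch only - illustrates where the FC_GC_POLL() goes.
FCIMPL3(void, MemoryNative::CopyWithReferences, void *dst, void *src, size_t byteCount)
{
    FCALL_CONTRACT;

    // FCalls run in cooperative mode, so object references can be copied
    // directly and the GC cannot move the buffers mid-copy.
    memmoveGCRefs(dst, src, byteCount);

    // Poll so a long copy on this thread does not starve a pending GC.
    FC_GC_POLL();
}
FCIMPLEND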

@jkotas jkotas (Member) commented Mar 8, 2017

LGTM modulo one small comment.

@jkotas jkotas closed this Mar 8, 2017
@jkotas jkotas reopened this Mar 8, 2017
@kouvel kouvel (Member, Author) commented Mar 8, 2017

Sorry, I did not mean to push these commits; I don't know how they got pushed. I'm still testing various combinations, as I haven't yet found a satisfactory solution.

@jkotas jkotas (Member) commented Mar 8, 2017

If you are running into problems with Unsafe.Add not being inlined, don't worry about it; the JIT folks are looking into it (it is a problem in other places too). I think what you have is how the implementation should look.

@kouvel kouvel (Member, Author) commented Mar 8, 2017

Thanks, I made all of the changes from above and tweaked some things while comparing different versions. I still had to handle the one-element case inline to avoid a ~7% regression there.

Copy forwards:

Span copy pointer  Left score        Right score       ∆ Score   ∆ Score %  Comment
-----------------  ----------------  ----------------  --------  ---------  --------
00008              109614.65 ±0.08%  119794.12 ±0.16%  10179.46      9.29%  Improved
00016               81315.36 ±0.03%   98287.19 ±0.02%  16971.83     20.87%  Improved
00032               53476.25 ±0.14%   92995.15 ±0.22%  39518.91     73.90%  Improved
00064               31764.28 ±0.21%   90630.96 ±0.23%  58866.69    185.32%  Improved
00128               17543.05 ±0.14%   79587.23 ±0.02%  62044.18    353.67%  Improved
00256                8889.78 ±0.11%   63449.10 ±0.07%  54559.31    613.73%  Improved
00512                4648.04 ±0.19%   44938.24 ±0.04%  40290.20    866.82%  Improved
01024                2386.51 ±0.06%   27487.00 ±0.03%  25100.49   1051.76%  Improved
04096                 608.05 ±0.06%    4899.76 ±0.44%   4291.71    705.82%  Improved
16384                 152.66 ±0.18%    1291.61 ±0.03%   1138.95    746.05%  Improved
65536                  38.12 ±0.17%     382.78 ±0.05%    344.66    904.12%  Improved
-----------------  ----------------  ----------------  --------  ---------  --------
Total                5299.97 ±0.12%   23967.94 ±0.12%  18667.97    352.23%  Improved

Copy backwards:

Span copy pointer backwards  Left score        Right score       ∆ Score   ∆ Score %  Comment
---------------------------  ----------------  ----------------  --------  ---------  --------
00008                        161482.34 ±0.08%  205250.05 ±0.01%  43767.71     27.10%  Improved
00016                         81176.62 ±0.20%   93344.65 ±0.04%  12168.03     14.99%  Improved
00032                         53529.77 ±0.05%   92916.95 ±0.06%  39387.18     73.58%  Improved
00064                         31769.12 ±0.06%   86905.38 ±0.04%  55136.26    173.55%  Improved
00128                         17571.13 ±0.06%   76314.84 ±0.02%  58743.72    334.32%  Improved
00256                          8892.42 ±0.10%   61281.16 ±0.07%  52388.75    589.14%  Improved
00512                          4656.13 ±0.22%   44139.38 ±0.02%  39483.25    847.98%  Improved
01024                          2388.44 ±0.12%   24521.85 ±0.22%  22133.42    926.69%  Improved
04096                           608.88 ±0.05%    6466.27 ±0.10%   5857.40    962.00%  Improved
16384                           152.48 ±0.22%    1571.80 ±0.23%   1419.32    930.85%  Improved
65536                            38.25 ±0.12%     184.14 ±0.14%    145.89    381.38%  Improved
---------------------------  ----------------  ----------------  --------  ---------  --------
Total                          5493.71 ±0.12%   23918.34 ±0.09%  18424.63    335.38%  Improved

@kouvel kouvel force-pushed the SpanCopyPerf branch 2 times, most recently from 92f53d3 to 7f9dd4a on March 8, 2017 13:39
SIZE_T *dptr = (SIZE_T *)dest;
SIZE_T *sptr = (SIZE_T *)src;

if ((len & sizeof(SIZE_T)) != 0)
Member

We should be aligning the destination pointer here. The misaligned writes have some penalty that would be best to avoid for longer copies.

Member

And it is worth doing the alignment only on AMD64 (or other platforms where we will have custom implementations). It may look better to put the AMD64 implementation under one big ifdef.

Member Author

Out of curiosity, is there a similar penalty for unaligned reads, and if so, is it worse for writes? I couldn't find much info on this.

Member

Yes, there is some penalty for reads too, but it is worse for writes.
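
A minimal, self-contained sketch of the kind of destination alignment being suggested: copy individual pointer-sized words until the destination reaches a 16-byte boundary, then fall into the wide SSE copy. This is illustrative only (names and structure are not the PR's actual code), and it assumes the length is a multiple of the pointer size, as it is for spans of object references.

#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Sketch: forward copy that aligns the destination (not the source) first,
// since misaligned wide stores carry a larger penalty than misaligned loads.
void CopyAlignDestSketch(void *dest, const void *src, size_t len)
{
    uintptr_t *dptr = (uintptr_t *)dest;
    const uintptr_t *sptr = (const uintptr_t *)src;

    // Copy single words until dptr sits on a 16-byte boundary.
    while ((((uintptr_t)dptr) & 0xF) != 0 && len >= sizeof(uintptr_t))
    {
        *dptr++ = *sptr++;
        len -= sizeof(uintptr_t);
    }

    // Wide copy: loads may still be unaligned, stores are now aligned.
    while (len >= 16)
    {
        __m128 v = _mm_loadu_ps((const float *)sptr);
        _mm_store_ps((float *)dptr, v);
        dptr += 16 / sizeof(uintptr_t);
        sptr += 16 / sizeof(uintptr_t);
        len -= 16;
    }

    // Copy any remaining pointer-sized words.
    while (len >= sizeof(uintptr_t))
    {
        *dptr++ = *sptr++;
        len -= sizeof(uintptr_t);
    }
}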

_mm_storeu_ps((float *)dptr, v);
#else // !_AMD64_ || FEATURE_PAL
// Read two values and write two values to hint the use of wide loads and stores
SIZE_T p[2];
Member

Have you seen this array hint work anywhere? I would be surprised if there is a compiler smart enough to pick it up.

I think writing it this way would be equivalent and less magic:

SIZE_T t0 = sptr[0];
SIZE_T t1 = sptr[1];
dptr[0] = t0;
dptr[1] = t1;

Member Author

It looks like MSVC and clang/LLVM are taking the hint; I'll change it, though.
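
Putting the pieces from this exchange together, here is a rough, self-contained sketch of how the portable (non-XMM) forward-copy path hangs together: move a single leading word when the byte count has the pointer-size bit set (as in the `if ((len & sizeof(SIZE_T)) != 0)` fragment above), then copy two words per iteration with explicit temporaries. Illustrative only, not the PR's exact code.

#include <cstddef>
#include <cstdint>

// Sketch: len is assumed to be a multiple of the pointer size (true when
// copying object references), and a forward copy is assumed to be safe.
void ForwardCopyWordsSketch(void *dest, const void *src, size_t len)
{
    uintptr_t *dptr = (uintptr_t *)dest;
    const uintptr_t *sptr = (const uintptr_t *)src;

    // If len holds an odd number of words, move one word first so the main
    // loop can always advance two words at a time.
    if ((len & sizeof(uintptr_t)) != 0)
    {
        *dptr++ = *sptr++;
        len -= sizeof(uintptr_t);
    }

    // Two explicit temporaries per iteration; per the discussion above, MSVC
    // and clang tend to turn this into wide loads/stores where profitable.
    while (len != 0)
    {
        uintptr_t t0 = sptr[0];
        uintptr_t t1 = sptr[1];
        dptr[0] = t0;
        dptr[1] = t1;
        dptr += 2;
        sptr += 2;
        len -= 2 * sizeof(uintptr_t);
    }
}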

kouvel added 4 commits March 8, 2017 11:30
Fixes #9161

PR dotnet#9786 fixes perf of span copy of types that don't contain references
Added TODO, leaving fixing that for a separate PR
@kouvel kouvel (Member, Author) commented Mar 9, 2017

Made the changes from above and enabled the XMM intrinsics path on Unix.

@jkotas jkotas (Member) left a comment

Nice!

@ahsonkhan ahsonkhan (Member) commented Mar 9, 2017

Should the CopyTo method be inlined? [MethodImpl(MethodImplOptions.AggressiveInlining)]
https://github.com/kouvel/coreclr/blob/65ef7295d8ae52b923b1f1cb00304a6c223db982/src/mscorlib/src/System/Span.cs#L281

public void CopyTo(Span<T> destination)

Same with TryCopyTo?
https://github.com/kouvel/coreclr/blob/65ef7295d8ae52b923b1f1cb00304a6c223db982/src/mscorlib/src/System/Span.cs#L295

public bool TryCopyTo(Span<T> destination)

@jkotas jkotas (Member) commented Mar 9, 2017

Should the CopyTo method be inlined

These are small enough that the JIT should take care of inlining them without any special hints.

@kouvel kouvel merged commit a6a7bde into dotnet:master Mar 9, 2017
@kouvel kouvel deleted the SpanCopyPerf branch March 9, 2017 21:12
jorive pushed a commit to guhuro/coreclr that referenced this pull request May 4, 2017
…#9999)

Improve span copy of pointers and structs containing pointers

Fixes #9161

PR dotnet#9786 fixes perf of span copy of types that don't contain references
@mjp41 mjp41 (Member) commented Aug 14, 2017

@kouvel do you still have the test harness you used for improving this code?

@kouvel kouvel (Member, Author) commented Aug 22, 2017

Sorry for the delay, I have shared it here along with the test code I used: https://1drv.ms/f/s!AvtRwG9CobRTpl3WXmqgieJ6n5_f

Edit Root\PerfTester\Run.bat first based on the instructions there, and then you should be able to run that and collect numbers.

@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
Successfully merging this pull request may close these issues:

Optimize Span Fill, Clear, and CopyTo method implementations