Improve span copy of pointers and structs containing pointers #9999

kouvel · 2017-03-07T05:48:30Z

Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references

kouvel · 2017-03-07T05:59:46Z

The change calls into memmoveGCRefs when the total byte count to copy is over a threshold. The threshold was determined by measurement.
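A minimal sketch of that threshold dispatch, written in portable C++ rather than the PR's actual C#/VM code. `BulkCopyStandIn` is a hypothetical placeholder for memmoveGCRefs (it only moves bytes and does no GC bookkeeping), `kBulkCopyThresholdBytes` is an illustrative name, and the value 128 is borrowed from the check in the Span.cs diff reviewed in this PR. Non-overlapping buffers are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative threshold; 128 mirrors the byte-count check in the reviewed diff.
constexpr std::size_t kBulkCopyThresholdBytes = 128;

// Hypothetical stand-in for the runtime's bulk routine. A real implementation
// would also perform GC bookkeeping (card marking); this one only moves bytes.
inline void BulkCopyStandIn(void** dst, void* const* src, std::size_t count) {
    std::memmove(dst, src, count * sizeof(void*));
}

// Copies `count` pointers from src to dst (assumed not to overlap).
// Returns true if the bulk path was taken, false if the inline loop ran.
inline bool CopyPointers(void** dst, void* const* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    if (count * sizeof(void*) >= kBulkCopyThresholdBytes) {
        BulkCopyStandIn(dst, src, count);
        return true;
    }
    // Small copies stay inline so they do not pay for the call.
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
    return false;
}
```

The only extra cost on the small-copy path is the size comparison, which is the trade-off behind the small regressions in the first few buckets.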

Left: before the change; right: after the change. Scores are iterations per millisecond. Each iteration copies the number of bytes shown in the first column from a Span of a class type (00008 means 8 bytes total, so one pointer). Total is the geometric mean. Numbers were taken with an x64 build on Windows.

Copy non-overlapping spans (destination starts at a higher address than source):

Span copy pointer  Left score        Right score       ∆ Score   ∆ Score %  Comment
-----------------  ----------------  ----------------  --------  ---------  ---------
00008              108056.52 ±0.28%  103397.54 ±0.12%  -4658.99     -4.31%  Regressed
00016               81449.85 ±0.04%   78057.45 ±0.11%  -3392.40     -4.17%  Regressed
00032               53705.85 ±0.03%   52066.60 ±0.12%  -1639.25     -3.05%  Regressed
00064               31826.45 ±0.14%   31330.87 ±0.04%   -495.58     -1.56%  Regressed
00128               17565.26 ±0.16%   25008.12 ±0.23%   7442.86     42.37%  Improved
00256                8894.93 ±0.12%   19730.60 ±0.12%  10835.68    121.82%  Improved
00512                4654.36 ±0.23%   14095.66 ±0.25%   9441.31    202.85%  Improved
01024                2384.89 ±0.12%    9928.56 ±0.28%   7543.66    316.31%  Improved
04096                 610.08 ±0.10%    3745.38 ±0.11%   3135.30    513.91%  Improved
16384                 150.82 ±0.28%     924.38 ±0.25%    773.56    512.89%  Improved
65536                  38.20 ±0.12%     226.47 ±0.23%    188.27    492.85%  Improved
-----------------  ----------------  ----------------  --------  ---------  ---------
Total                5294.86 ±0.15%   11954.92 ±0.17%   6660.05    125.78%  Improved
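For readers checking the Total row and the percentage column, here is a small sketch of the arithmetic. It assumes Total is the geometric mean of the eleven bucket scores shown and that the percentage is (right - left) / left; the function names are illustrative and are not taken from the test harness. Because the table values are rounded, the result should land close to, not exactly on, the reported figures.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean computed as exp(mean(log(x))), which avoids overflowing
// a running product when the scores are large.
inline double GeometricMean(const std::vector<double>& scores) {
    assert(!scores.empty());
    double logSum = 0.0;
    for (double s : scores) {
        logSum += std::log(s);
    }
    return std::exp(logSum / static_cast<double>(scores.size()));
}

// Relative change from the left (before) score to the right (after) score,
// expressed in percent.
inline double PercentChange(double left, double right) {
    return (right - left) / left * 100.0;
}

// Left-column scores copied from the non-overlapping table.
inline std::vector<double> LeftScoresNonOverlapping() {
    return {108056.52, 81449.85, 53705.85, 31826.45, 17565.26, 8894.93,
            4654.36,   2384.89,  610.08,   150.82,   38.20};
}
```

A geometric mean weights each bucket's relative change equally, so the large absolute scores of the 8-byte bucket do not dominate the total.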

Copy backwards (destination buffer starts inside the source buffer, overlapping the last pointer):

Span copy pointer backwards  Left score        Right score       ∆ Score   ∆ Score %  Comment
---------------------------  ----------------  ----------------  --------  ---------  ---------
00008                        161814.27 ±0.01%  168873.46 ±0.08%   7059.19      4.36%  Improved
00016                         81441.87 ±0.07%   77945.68 ±0.04%  -3496.18     -4.29%  Regressed
00032                         53622.73 ±0.04%   52146.69 ±0.01%  -1476.04     -2.75%  Regressed
00064                         31827.17 ±0.20%   31277.84 ±0.02%   -549.33     -1.73%  Regressed
00128                         17580.56 ±0.03%   27326.68 ±0.31%   9746.12     55.44%  Improved
00256                          8889.00 ±0.21%   24177.97 ±0.02%  15288.97    172.00%  Improved
00512                          4660.67 ±0.07%   20342.95 ±0.20%  15682.28    336.48%  Improved
01024                          2394.70 ±0.02%   14433.39 ±0.24%  12038.69    502.72%  Improved
04096                           609.64 ±0.06%    3305.12 ±0.23%   2695.47    442.14%  Improved
16384                           152.82 ±0.12%     895.95 ±0.09%    743.13    486.28%  Improved
65536                            38.31 ±0.03%     231.31 ±0.45%    193.00    503.82%  Improved
---------------------------  ----------------  ----------------  --------  ---------  ---------
Total                          5502.48 ±0.08%   13561.18 ±0.15%   8058.70    146.46%  Improved
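The backwards case exists because a destination that starts inside the source must be copied from the end, otherwise later source elements are overwritten before they are read. A minimal sketch of that direction choice, assuming pointer-sized elements; `MovePointers` is an illustrative name and this is not the runtime's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies `count` pointers from src to dst, handling overlap like memmove.
inline void MovePointers(void** dst, void* const* src, std::size_t count) {
    const auto d = reinterpret_cast<std::uintptr_t>(dst);
    const auto s = reinterpret_cast<std::uintptr_t>(src);
    // Unsigned subtraction wraps when dst is below src, so a single compare
    // covers both "dst before src" and "dst at or past the end of src".
    if (d - s >= count * sizeof(void*)) {
        for (std::size_t i = 0; i < count; ++i) {
            dst[i] = src[i];              // forward copy is safe
        }
    } else {
        for (std::size_t i = count; i-- > 0;) {
            dst[i] = src[i];              // dst starts inside src: copy from the end
        }
    }
}
```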

kouvel · 2017-03-07T06:01:47Z

I wasn't able to avoid the small regressions in the first few buckets. I tried moving the call to the bottom using goto, such that the additional check would/should be the only extra cost for those buckets, but it made everything much slower for some reason that doesn't make sense to me.

kouvel · 2017-03-07T06:04:38Z

There seems to be no change in perf in copying spans of types that don't contain references, just some noise.

kouvel · 2017-03-07T06:05:43Z

@jkotas @ahsonkhan

jkotas · 2017-03-07T06:13:11Z

src/vm/comutilnative.cpp

+{
+    QCALL_CONTRACT;
+
+    if (!IS_ALIGNED(dst, sizeof(dst)) || !IS_ALIGNED(src, sizeof(src)))


This should never happen. It should be an assert instead.

jkotas · 2017-03-07T06:15:30Z

src/mscorlib/src/System/Runtime/RuntimeImports.cs

+        [MethodImpl(MethodImplOptions.NoInlining)]
+        internal unsafe static bool RhCopyMemoryWithReferences<T>(ref T destination, ref T source, int elementCount)
+        {
+            fixed (void* destinationPtr = &Unsafe.As<T, byte>(ref destination))


You do not need to pin these for the fcall - the fcall can take ref byte directly.

jkotas · 2017-03-07T06:18:03Z

src/mscorlib/src/System/Span.cs

                    }
                }
            }
            else
            {
+                if ((nuint)elementsCount * (nuint)Unsafe.SizeOf<T>() >= 128 &&
+                    RuntimeImports.RhCopyMemoryWithReferences(ref destination, ref source, elementsCount))


I do not think we need special casing for small sizes here. The FCall should be fine for all cases, provided it is called directly from here and avoids unnecessary layers internally.

jkotas · 2017-03-07T06:18:43Z

src/vm/comutilnative.cpp

 {
    QCALL_CONTRACT;

    memset(dst, 0, length);
 }

+bool QCALLTYPE MemoryNative::CopyWithReferences(void *dst, void *src, size_t byteCount)
+{
+    QCALL_CONTRACT;


This cannot be QCall. You have to be in GC cooperative mode to do the copy of object references.

jkotas · 2017-03-07T06:21:57Z

src/mscorlib/src/System/Runtime/RuntimeImports.cs

+        // Non-inlinable wrapper around the QCall that avoids poluting the fast path
+        // with P/Invoke prolog/epilog.
+        [MethodImpl(MethodImplOptions.NoInlining)]
+        internal unsafe static bool RhCopyMemoryWithReferences<T>(ref T destination, ref T source, int elementCount)


It is called RhBulkMoveWithWriteBarrier in CoreRT - it may be nice to call it the same.

jkotas · 2017-03-07T06:23:19Z

src/vm/comutilnative.cpp

+        return false;
+    }
+
+    memmoveGCRefs(dst, src, byteCount);


Take a look at how RhBulkMoveWithWriteBarrier is implemented in CoreRT to get everything 100% inlined and avoid any unnecessary calls. It would be nice to do the same here for best perf.

jkotas · 2017-03-08T02:15:22Z

src/vm/comutilnative.cpp

    }

-    memmoveGCRefs(dst, src, byteCount);
-    return true;
+    InlinedSetCardsAfterBulkCopy(reinterpret_cast<Object **>(dst), byteCount);
 }


It would be nice to add FC_GC_POLL(); here to avoid GC starvation.

jkotas · 2017-03-08T02:15:37Z

LGTM modulo one small comment.

kouvel · 2017-03-08T03:19:37Z

Sorry did not mean to push these commits, don't know how they got pushed... I'm still testing with various combinations as I haven't yet found a satisfactory solution

jkotas · 2017-03-08T05:00:48Z

If you are running into problems with Unsafe.Add not being inlined ... don't worry about it, the JIT folks are looking into it (it is a problem in other places too). I think what you have is how the implementation should look.

kouvel · 2017-03-08T12:54:52Z

Thanks, made all of the changes from above and tweaked some things while I was comparing different versions. I still had to handle the one element case inline to avoid a ~7% regression there.

Copy forwards:

Span copy pointer  Left score        Right score       ∆ Score   ∆ Score %  Comment
-----------------  ----------------  ----------------  --------  ---------  --------
00008              109614.65 ±0.08%  119794.12 ±0.16%  10179.46      9.29%  Improved
00016               81315.36 ±0.03%   98287.19 ±0.02%  16971.83     20.87%  Improved
00032               53476.25 ±0.14%   92995.15 ±0.22%  39518.91     73.90%  Improved
00064               31764.28 ±0.21%   90630.96 ±0.23%  58866.69    185.32%  Improved
00128               17543.05 ±0.14%   79587.23 ±0.02%  62044.18    353.67%  Improved
00256                8889.78 ±0.11%   63449.10 ±0.07%  54559.31    613.73%  Improved
00512                4648.04 ±0.19%   44938.24 ±0.04%  40290.20    866.82%  Improved
01024                2386.51 ±0.06%   27487.00 ±0.03%  25100.49   1051.76%  Improved
04096                 608.05 ±0.06%    4899.76 ±0.44%   4291.71    705.82%  Improved
16384                 152.66 ±0.18%    1291.61 ±0.03%   1138.95    746.05%  Improved
65536                  38.12 ±0.17%     382.78 ±0.05%    344.66    904.12%  Improved
-----------------  ----------------  ----------------  --------  ---------  --------
Total                5299.97 ±0.12%   23967.94 ±0.12%  18667.97    352.23%  Improved

Copy backwards:

Span copy pointer backwards  Left score        Right score       ∆ Score   ∆ Score %  Comment
---------------------------  ----------------  ----------------  --------  ---------  --------
00008                        161482.34 ±0.08%  205250.05 ±0.01%  43767.71     27.10%  Improved
00016                         81176.62 ±0.20%   93344.65 ±0.04%  12168.03     14.99%  Improved
00032                         53529.77 ±0.05%   92916.95 ±0.06%  39387.18     73.58%  Improved
00064                         31769.12 ±0.06%   86905.38 ±0.04%  55136.26    173.55%  Improved
00128                         17571.13 ±0.06%   76314.84 ±0.02%  58743.72    334.32%  Improved
00256                          8892.42 ±0.10%   61281.16 ±0.07%  52388.75    589.14%  Improved
00512                          4656.13 ±0.22%   44139.38 ±0.02%  39483.25    847.98%  Improved
01024                          2388.44 ±0.12%   24521.85 ±0.22%  22133.42    926.69%  Improved
04096                           608.88 ±0.05%    6466.27 ±0.10%   5857.40    962.00%  Improved
16384                           152.48 ±0.22%    1571.80 ±0.23%   1419.32    930.85%  Improved
65536                            38.25 ±0.12%     184.14 ±0.14%    145.89    381.38%  Improved
---------------------------  ----------------  ----------------  --------  ---------  --------
Total                          5493.71 ±0.12%   23918.34 ±0.09%  18424.63    335.38%  Improved

jkotas · 2017-03-08T15:15:45Z

src/classlibnative/bcltype/arraynative.inl

+    SIZE_T *dptr = (SIZE_T *)dest;
+    SIZE_T *sptr = (SIZE_T *)src;
+
+    if ((len & sizeof(SIZE_T)) != 0)


We should be aligning the destination pointer here. The misaligned writes have some penalty that would be best to avoid for longer copies.

And it is worth doing the alignment only on AMD64 (or other platforms where we will have custom implementations). It may look better to put the AMD64 implementation under one big ifdef.

Out of curiosity, is there a similar penalty for unaligned reads, and if so, is it worse for writes? I couldn't find much info on this.

Yes, there is some penalty for reads too, but it is worse for writes.
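One way to picture the alignment suggestion: copy single words until the destination reaches a 16-byte boundary, then run the wide loop. The sketch below is an assumption-laden illustration in portable C++, not the code from arraynative.inl; `CopyWordsAlignedDest` and `kWideBytes` are hypothetical names, the buffers are assumed not to overlap, and the "wide" step is emulated with plain word copies instead of SSE stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies `count` pointer-sized words forward from src to dst.
// Returns how many words were copied before dst became 16-byte aligned.
inline std::size_t CopyWordsAlignedDest(std::uintptr_t* dst,
                                        const std::uintptr_t* src,
                                        std::size_t count) {
    constexpr std::size_t kWideBytes = 16;  // assumed wide-store width
    constexpr std::size_t kWordsPerWide = kWideBytes / sizeof(std::uintptr_t);
    assert(reinterpret_cast<std::uintptr_t>(dst) % sizeof(std::uintptr_t) == 0);

    // Head: peel single words until the destination is 16-byte aligned.
    std::size_t head = 0;
    while (count > 0 && reinterpret_cast<std::uintptr_t>(dst) % kWideBytes != 0) {
        *dst++ = *src++;
        --count;
        ++head;
    }
    // Body: one 16-byte chunk per iteration; every store here is aligned.
    while (count >= kWordsPerWide) {
        for (std::size_t i = 0; i < kWordsPerWide; ++i) {
            dst[i] = src[i];
        }
        dst += kWordsPerWide;
        src += kWordsPerWide;
        count -= kWordsPerWide;
    }
    // Tail: whatever is left after the last full chunk.
    while (count > 0) {
        *dst++ = *src++;
        --count;
    }
    return head;
}
```

Only the destination is aligned; the source may still be read unaligned in the body, which matches the point that misaligned writes cost more than misaligned reads.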

jkotas · 2017-03-08T15:34:33Z

src/classlibnative/bcltype/arraynative.inl

+        _mm_storeu_ps((float *)dptr, v);
+#else // !_AMD64_ || FEATURE_PAL
+        // Read two values and write two values to hint the use of wide loads and stores
+        SIZE_T p[2];


Have you seen this array hint work anywhere? I would be surprised if there is a compiler smart enough to pick it up.

I think writing it this way would be equivalent and less magic:

SIZE_T t0 = sptr[0]; SIZE_T t1 = sptr[1]; dptr[0] = t0; dptr[1] = t1;

It looks like MSVC and clang/LLVM are taking the hint; I'll change it though.
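For context, the two-temporary form suggested above might sit in a copy loop roughly like this. It is a sketch only: `CopyWordPairs` is a hypothetical name, `std::size_t` stands in for SIZE_T, and the buffers are assumed not to overlap and to hold an even number of words. Whether a compiler fuses the paired loads and stores into wide instructions depends on the compiler and target.

```cpp
#include <cassert>
#include <cstddef>

// Copies `count` words two at a time; `count` must be even.
inline void CopyWordPairs(std::size_t* dptr, const std::size_t* sptr,
                          std::size_t count) {
    assert(count % 2 == 0);
    for (; count != 0; count -= 2, dptr += 2, sptr += 2) {
        // Read both values before writing either, so the compiler is free
        // to combine them into one wide load and one wide store.
        std::size_t t0 = sptr[0];
        std::size_t t1 = sptr[1];
        dptr[0] = t0;
        dptr[1] = t1;
    }
}
```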

kouvel · 2017-03-09T04:03:43Z

Made changes from above and enabled xmm intrinsics path on Unix

jkotas

Nice!

ahsonkhan · 2017-03-09T05:43:37Z

Should the CopyTo method be inlined? [MethodImpl(MethodImplOptions.AggressiveInlining)]
https://github.com/kouvel/coreclr/blob/65ef7295d8ae52b923b1f1cb00304a6c223db982/src/mscorlib/src/System/Span.cs#L281

public void CopyTo(Span<T> destination)

Same with TryCopyTo?
https://github.com/kouvel/coreclr/blob/65ef7295d8ae52b923b1f1cb00304a6c223db982/src/mscorlib/src/System/Span.cs#L295

public bool TryCopyTo(Span<T> destination)

jkotas · 2017-03-09T05:49:32Z

Should the CopyTo method be inlined

These are small enough that the JIT should take care of inlining these without any special hints.

mjp41 · 2017-08-14T09:50:28Z

@kouvel do you still have the test harness you used for improving this code?

kouvel · 2017-08-22T18:02:26Z

Sorry for the delay, I have shared it here along with the test code I used: https://1drv.ms/f/s!AvtRwG9CobRTpl3WXmqgieJ6n5_f

Edit Root\PerfTester\Run.bat first based on the instructions there, and then you should be able to run that and collect numbers.

dnfclas added the cla-already-signed label Mar 7, 2017

jkotas reviewed Mar 7, 2017

jkotas reviewed Mar 8, 2017

jkotas closed this Mar 8, 2017

jkotas reopened this Mar 8, 2017

dnfclas added the cla-already-signed label Mar 8, 2017

kouvel force-pushed the SpanCopyPerf branch from 2451121 to b296a68 Compare March 8, 2017 03:21

kouvel force-pushed the SpanCopyPerf branch 2 times, most recently from 92f53d3 to 7f9dd4a Compare March 8, 2017 13:39

jkotas reviewed Mar 8, 2017

kouvel added 4 commits March 8, 2017 11:30

Improve span copy of pointers and structs containing pointers

75cc69f

Fixes #9161 PR dotnet#9786 fixes perf of span copy of types that don't contain references

Address feedback

47afec9

Add GC poll

989ea9a

Disable __m128 intrinsic usage on Unix for this PR

d879bdd

Added TODO, leaving fixing that for a separate PR

kouvel force-pushed the SpanCopyPerf branch from 7f9dd4a to 99068ae Compare March 9, 2017 04:03

kouvel force-pushed the SpanCopyPerf branch from 99068ae to 265e7ff Compare March 9, 2017 04:05

kouvel added 2 commits March 8, 2017 20:11

Align destination before loop

fb6ed34

Enable xmm intrinsics path on Unix

65ef729

kouvel force-pushed the SpanCopyPerf branch from 265e7ff to 65ef729 Compare March 9, 2017 04:11

jkotas approved these changes Mar 9, 2017

Fix copy forward loop

0de3e24

kouvel merged commit a6a7bde into dotnet:master Mar 9, 2017

kouvel deleted the SpanCopyPerf branch March 9, 2017 21:12

jkotas mentioned this pull request Mar 15, 2017

Make use of CopyBlock for non-overlapping spans dotnet/corefx#17063

Merged

karelz modified the milestone: 2.0.0 Aug 28, 2017
