
Improve SpanHelpers.ClearWithReferences for arm64 #93346

Closed · wants to merge 6 commits

Conversation

EgorBo (Member) commented Oct 11, 2023

Follow-up to #93214, now for Span<GC>.Clear().

I noticed that the existing implementation is slightly suboptimal on arm64: it relies on complex addressing modes and its writes are not aligned to 16 bytes. It turns out that 16-byte alignment makes a lot of sense even when we don't use SIMD - stp (a single instruction performing two 8-byte stores) benefits from it as well.
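To sketch the idea (a minimal sketch with hypothetical names, not the PR's actual code): do the first store unconditionally, round the reference up to the next 16-byte boundary, then clear two pointer-sized slots per iteration so each pair can fold into one stp:

```csharp
using System;
using System.Runtime.CompilerServices;

internal static unsafe class ClearSketch
{
    // Minimal sketch of the 16-byte alignment idea; names are hypothetical.
    public static void ClearWithReferencesSketch(ref IntPtr ip, nuint pointerSizeLength)
    {
        if (pointerSizeLength == 0)
            return;

        // Unconditional first store; re-zeroing an overlapped slot is harmless.
        ip = IntPtr.Zero;

        // The address is read only as an alignment heuristic; the byref is not
        // pinned, so a GC move can change it (affects speed, not correctness).
        nuint addr = (nuint)Unsafe.AsPointer(ref ip);
        nuint byteSkip = ((addr + 15) & ~(nuint)15) - addr; // 0 or 8 on 64-bit

        ref IntPtr cur = ref Unsafe.AddByteOffset(ref ip, byteSkip);
        nuint remaining = pointerSizeLength - (byteSkip / (nuint)sizeof(IntPtr));

        // Two adjacent pointer-sized zero stores per iteration: on arm64 each
        // pair can fold into a single 16-byte-aligned "stp xzr, xzr".
        while (remaining >= 2)
        {
            cur = IntPtr.Zero;
            Unsafe.Add(ref cur, 1) = IntPtr.Zero;
            cur = ref Unsafe.Add(ref cur, 2);
            remaining -= 2;
        }

        if (remaining != 0)
            cur = IntPtr.Zero; // odd trailing slot
    }
}
```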

Standalone benchmark (for simpler tests)

Apple M2 Max (also tested on Ampere):

| Method     | len  | Mean       | Ratio |
|----------- |----- |-----------:|------:|
| Clear_Base | 1    |   1.199 ns |  1.00 |
| Clear_PR   | 1    |   1.186 ns |  0.99 |
|            |      |            |       |
| Clear_Base | 5    |   1.215 ns |  1.00 |
| Clear_PR   | 5    |   1.215 ns |  1.00 |
|            |      |            |       |
| Clear_Base | 8    |   1.617 ns |  1.00 |
| Clear_PR   | 8    |   1.269 ns |  0.78 |
|            |      |            |       |
| Clear_Base | 16   |   2.405 ns |  1.00 |
| Clear_PR   | 16   |   2.061 ns |  0.86 |
|            |      |            |       |
| Clear_Base | 64   |   8.727 ns |  1.00 |
| Clear_PR   | 64   |   5.415 ns |  0.62 |
|            |      |            |       |
| Clear_Base | 512  |  70.081 ns |  1.00 |
| Clear_PR   | 512  |  35.482 ns |  0.51 |
|            |      |            |       |
| Clear_Base | 1024 | 140.080 ns |  1.00 |
| Clear_PR   | 1024 |  72.254 ns |  0.52 |
|            |      |            |       |
| Clear_Base | 4096 | 559.551 ns |  1.00 |
| Clear_PR   | 4096 | 333.439 ns |  0.60 |

Ryzen 7950X:

| Method     | len  | Mean        | Ratio |
|----------- |----- |------------:|------:|
| Clear_Base | 1    |   0.3854 ns |  1.00 |
| Clear_PR   | 1    |   0.3798 ns |  0.99 |
|            |      |             |       |
| Clear_Base | 5    |   0.5310 ns |  1.00 |
| Clear_PR   | 5    |   0.3907 ns |  0.74 |
|            |      |             |       |
| Clear_Base | 8    |   0.8995 ns |  1.00 |
| Clear_PR   | 8    |   0.5264 ns |  0.59 |
|            |      |             |       |
| Clear_Base | 16   |   1.4398 ns |  1.00 |
| Clear_PR   | 16   |   1.2039 ns |  0.84 |
|            |      |             |       |
| Clear_Base | 64   |   5.1603 ns |  1.00 |
| Clear_PR   | 64   |   5.1023 ns |  0.99 |
|            |      |             |       |
| Clear_Base | 512  |  46.0601 ns |  1.00 |
| Clear_PR   | 512  |  45.9100 ns |  1.00 |
|            |      |             |       |
| Clear_Base | 1024 |  94.9471 ns |  1.00 |
| Clear_PR   | 1024 |  94.7812 ns |  1.00 |
|            |      |             |       |
| Clear_Base | 4096 | 380.4915 ns |  1.00 |
| Clear_PR   | 4096 | 375.3246 ns |  0.99 |

For x64, it slightly improves small inputs. It seems that if we pin the input and use SIMD we could see a >2x improvement for large inputs, but I was not sure about doing pinning here (from the GC's point of view, atomicity is fine), so I left it as is. cc @jkotas
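For reference, a hedged sketch of that pinning idea (hypothetical helper, not part of this PR): pin the buffer so the GC cannot move it mid-loop, then zero 16 bytes per store:

```csharp
using System;
using System.Runtime.Intrinsics;

internal static unsafe class SimdClearSketch
{
    // Hedged sketch of the pin-and-SIMD idea discussed above; not adopted here.
    public static void Clear(ref IntPtr ip, nuint pointerSizeLength)
    {
        // Pinning keeps the buffer in place while raw SIMD stores run.
        fixed (IntPtr* p = &ip)
        {
            nuint i = 0;
            for (; i + 2 <= pointerSizeLength; i += 2)
            {
                // One 16-byte store clears two pointer-sized slots.
                *(Vector128<byte>*)(p + i) = Vector128<byte>.Zero;
            }
            if (i < pointerSizeLength)
                p[i] = IntPtr.Zero; // odd trailing slot
        }
    }
}
```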

ghost commented Oct 11, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.



Write1:
Debug.Assert(pointerSizeLength >= 1);
// This switch is expected to be lowered to a jump table

Member

This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in managed memcopy originally, but it was later removed.

Member Author (EgorBo)

> This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in managed memcopy originally, but it was later removed.

Should I revert to the previous shape? I was mostly interested in the 16-byte alignment (it's the reason I'm seeing nice results on arm64) and in a slightly different addressing mode in the main loop, to make it foldable into stp.

EgorBo (Member Author) Oct 11, 2023

Btw, presumably, PGO can re-shuffle the switch (or convert it to if-else) based on the profile. However, in the real world it likely has a mixed profile.

Member

> Should I revert to the previous shape?

Yes, unless you can prove that the switch with a jump table is better for real workloads.

For reference, dotnet/coreclr#9786 is the PR that got rid of the switch from memory copy. It went through extensive validation.

tannergooding (Member) Oct 13, 2023

> This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in managed memcopy originally, but it was later removed.

@jkotas, I'm not sure what you mean here, could you elaborate?

The hardware optimization manuals and the underlying native memcpy algorithms across all 3 major implementations (MSVC, GCC, Clang; for x64 and Arm64) use/recommend a jump table explicitly for this reason: an unconditional but unpredictable branch is better than a tight tree of conditional, still-unpredictable branches.

The premise is that a fallthrough jump table allows a tight, unrolled version of the handling for the trailing elements. It then takes one unpredicted branch to decide which of the immediately following, equidistant switch cases to jump to. These cases are often already in the instruction/decode cache, since they sit in the same or the following 32-byte code window.
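To make that shape concrete, here is a hypothetical trailing-element handler (illustrative only, not code from this PR): the switch lowers to one indirect branch that picks the entry case, and execution falls through the remaining stores via goto case:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

internal static class JumpTableSketch
{
    // Illustrative fallthrough jump table: each case does one store and
    // chains to the next, keeping the case bodies small and equidistant.
    public static void ClearSmall(ref IntPtr ip, nuint pointerSizeLength)
    {
        Debug.Assert(pointerSizeLength <= 8);
        switch ((int)pointerSizeLength)
        {
            case 8: Unsafe.Add(ref ip, 7) = IntPtr.Zero; goto case 7;
            case 7: Unsafe.Add(ref ip, 6) = IntPtr.Zero; goto case 6;
            case 6: Unsafe.Add(ref ip, 5) = IntPtr.Zero; goto case 5;
            case 5: Unsafe.Add(ref ip, 4) = IntPtr.Zero; goto case 4;
            case 4: Unsafe.Add(ref ip, 3) = IntPtr.Zero; goto case 3;
            case 3: Unsafe.Add(ref ip, 2) = IntPtr.Zero; goto case 2;
            case 2: Unsafe.Add(ref ip, 1) = IntPtr.Zero; goto case 1;
            case 1: ip = IntPtr.Zero; break;
        }
    }
}
```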

Member

dotnet/coreclr#9786 was done by Intel engineers to address bottlenecks observed in real workloads. Yes, it goes contrary to the guidance you have quoted.

Possible explanation: conditional jumps and indirect jumps use two different predictors. The relative performance of conditional vs. indirect jumps depends on the relative pressure on each predictor. .NET code tends to have more indirect jumps than average code, which means the indirect jump predictor sees more pressure in .NET code, making it more profitable to depend on the conditional jump predictor in the .NET memcpy implementation.

EgorBo (Member Author) commented Oct 11, 2023

Codegen on arm64:

New codegen:
; Method Prog:ClearWithReferences2(byref,ulong) (FullOpts)
G_M64195_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=8 PerfScore 12.00

G_M64195_IG02:  ;; offset=0x0008
            mov     x2, x1
            cmp     x2, #8
            bhi     G_M64195_IG12
						;; size=12 bbWeight=8 PerfScore 16.00

G_M64195_IG03:  ;; offset=0x0014
            cmp     w2, #8
            bhi     G_M64195_IG12
            mov     w1, w2
            adr     x2, [@RWD00]
            ldr     w2, [x2, x1, LSL #2]
            adr     x3, [G_M64195_IG02]
            add     x2, x2, x3
            br      x2
						;; size=32 bbWeight=4 PerfScore 30.00

G_M64195_IG04:  ;; offset=0x0034
            str     xzr, [x0]
            b       G_M64195_IG15
            align   [0 bytes for IG13]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]
						;; size=8 bbWeight=0.50 PerfScore 1.00

G_M64195_IG05:  ;; offset=0x003C
            stp     xzr, xzr, [x0]
            b       G_M64195_IG15
						;; size=8 bbWeight=0.50 PerfScore 1.00

G_M64195_IG06:  ;; offset=0x0044
            stp     xzr, xzr, [x0]
            str     xzr, [x0, #0x10]
            b       G_M64195_IG15
						;; size=12 bbWeight=0.50 PerfScore 1.50

G_M64195_IG07:  ;; offset=0x0050
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            b       G_M64195_IG15
						;; size=12 bbWeight=0.50 PerfScore 1.50

G_M64195_IG08:  ;; offset=0x005C
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            str     xzr, [x0, #0x20]
            b       G_M64195_IG15
						;; size=16 bbWeight=0.50 PerfScore 2.00

G_M64195_IG09:  ;; offset=0x006C
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            b       G_M64195_IG15
						;; size=16 bbWeight=0.50 PerfScore 2.00

G_M64195_IG10:  ;; offset=0x007C
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            str     xzr, [x0, #0x30]
            b       G_M64195_IG15
						;; size=20 bbWeight=0.50 PerfScore 2.50

G_M64195_IG11:  ;; offset=0x0090
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            stp     xzr, xzr, [x0, #0x30]
            b       G_M64195_IG15
						;; size=20 bbWeight=0.50 PerfScore 2.50

G_M64195_IG12:  ;; offset=0x00A4
            lsr     x2, x1, #3
						;; size=4 bbWeight=4 PerfScore 4.00

G_M64195_IG13:  ;; offset=0x00A8
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            stp     xzr, xzr, [x0, #0x30]
            add     x0, x0, #64
            sub     x2, x2, #1
            cbnz    x2, G_M64195_IG13
						;; size=28 bbWeight=32 PerfScore 192.00

G_M64195_IG14:  ;; offset=0x00C4
            and     x1, x1, #7
            b       G_M64195_IG02
						;; size=8 bbWeight=4 PerfScore 6.00

G_M64195_IG15:  ;; offset=0x00CC
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00
RWD00  	dd	000000C4h ; case G_M64195_IG15
       	dd	0000002Ch ; case G_M64195_IG04
       	dd	00000034h ; case G_M64195_IG05
       	dd	0000003Ch ; case G_M64195_IG06
       	dd	00000048h ; case G_M64195_IG07
       	dd	00000054h ; case G_M64195_IG08
       	dd	00000064h ; case G_M64195_IG09
       	dd	00000074h ; case G_M64195_IG10
       	dd	00000088h ; case G_M64195_IG11
; Total bytes of code: 212
Previous codegen:
; Method Prog:ClearWithReferences(byref,ulong) (FullOpts)
G_M45361_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=1 PerfScore 1.50

G_M45361_IG02:  ;; offset=0x0008
            cmp     x1, #8
            blo     G_M45361_IG04
            align   [0 bytes for IG03]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]
						;; size=8 bbWeight=1 PerfScore 1.50

G_M45361_IG03:  ;; offset=0x0010
            lsl     x2, x1, #3
            add     x3, x0, x2
            str     xzr, [x3, #-0x08]
            add     x3, x0, x2
            str     xzr, [x3, #-0x10]
            add     x3, x0, x2
            str     xzr, [x3, #-0x18]
            add     x3, x0, x2
            str     xzr, [x3, #-0x20]
            add     x3, x0, x2
            str     xzr, [x3, #-0x28]
            add     x3, x0, x2
            str     xzr, [x3, #-0x30]
            add     x3, x0, x2
            str     xzr, [x3, #-0x38]
            add     x2, x0, x2
            str     xzr, [x2, #-0x40]
            sub     x1, x1, #8
            cmp     x1, #8
            bhs     G_M45361_IG03
						;; size=80 bbWeight=4 PerfScore 60.00

G_M45361_IG04:  ;; offset=0x0060
            cmp     x1, #4
            bhs     G_M45361_IG07
						;; size=8 bbWeight=1 PerfScore 1.50

G_M45361_IG05:  ;; offset=0x0068
            cmp     x1, #2
            bhs     G_M45361_IG08
            cbnz    x1, G_M45361_IG09
						;; size=12 bbWeight=0.50 PerfScore 1.25

G_M45361_IG06:  ;; offset=0x0074
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=0.50 PerfScore 1.00

G_M45361_IG07:  ;; offset=0x007C
            stp     xzr, xzr, [x0, #0x10]
            lsl     x2, x1, #3
            add     x3, x0, x2
            str     xzr, [x3, #-0x18]
            add     x2, x0, x2
            str     xzr, [x2, #-0x10]
						;; size=24 bbWeight=0.50 PerfScore 2.50

G_M45361_IG08:  ;; offset=0x0094
            str     xzr, [x0, #0x08]
            lsl     x2, x1, #3
            add     x1, x0, x2
            str     xzr, [x1, #-0x08]
						;; size=16 bbWeight=0.50 PerfScore 1.75

G_M45361_IG09:  ;; offset=0x00A4
            str     xzr, [x0]
						;; size=4 bbWeight=0.50 PerfScore 0.50

G_M45361_IG10:  ;; offset=0x00A8
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=0.50 PerfScore 1.00
; Total bytes of code: 176

I can tune the JIT to fold

stp     xzr, xzr, [x0]
stp     xzr, xzr, [x0, #..]

to a SIMD store in the newly added LowerIndirCoalesceStore, but it seems fast enough as is.

EgorBo force-pushed the opt-clear-with-ref branch from 0d6d7b8 to 5f4c26e on October 11, 2023 22:20
EgorBo marked this pull request as draft on October 11, 2023 22:59
EgorBo (Member Author) commented Oct 13, 2023

I'm getting mixed results with this change, so I'll wait for fixes for #93382 (needed for the remainder) and #76067 (comment) (needed for the main loop) and try again afterwards.

ghost locked the conversation as resolved and limited it to collaborators on Nov 12, 2023