ARM64 intrinsic support for Vector64.Create() and Vector128.Create() #35590

kunalspathak · 2020-04-28T23:46:59Z

Added hardware intrinsic support for various overloads of Vector64.Create() and Vector128.Create():

Multiple arguments - The APIs that take multiple parameters, each set in its respective lane, are implemented in C# using AdvSimd.Insert.
Single argument - The APIs that take a single argument to be copied into all lanes are implemented in the JIT by generating dup/mov/fmov instructions.
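
The two strategies can be sketched with a small portable C++ model. This is only an illustration under stated assumptions, not the runtime's code and not real NEON: `Vec16`, `insert_lane`, `create_multi`, and `broadcast` are hypothetical names. `create_multi` mirrors the multiple-argument path (start from a scalar, then insert each remaining lane), and `broadcast` mirrors the single-argument path that the JIT handles with a dup-style instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit vector with 16 byte lanes.
using Vec16 = std::array<std::uint8_t, 16>;

// Models one AdvSimd.Insert / `ins v.b[lane], w`: replace one lane, keep the rest.
inline Vec16 insert_lane(Vec16 v, std::size_t lane, std::uint8_t value) {
    assert(lane < v.size());
    v[lane] = value;
    return v;
}

// Models the multiple-argument Create(): lane 0 first, then one insert per remaining lane.
inline Vec16 create_multi(const std::array<std::uint8_t, 16>& e) {
    // Upper lanes start as zero in this model; the real CreateScalarUnsafe leaves them undefined.
    Vec16 result{};
    result[0] = e[0];
    for (std::size_t i = 1; i < e.size(); ++i) {
        result = insert_lane(result, i, e[i]);
    }
    return result;
}

// Models the single-argument Create(): the same value in every lane.
inline Vec16 broadcast(std::uint8_t value) {
    Vec16 result;
    result.fill(value);
    return result;
}
```

In this model the multiple-argument path costs one insert per lane after the first, while the broadcast path is a single operation, which is why the single-argument overloads are worth handling directly in the JIT.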

While I was there, I noticed an edge case where we hit an assert when trying to emit int.MaxValue as an immediate. Fixed that as well.

Contributes to #33308 and #33496.
Fixes: #35821

kunalspathak · 2020-04-28T23:47:15Z

@dotnet/jit-contrib, @tannergooding

BruceForstall · 2020-04-29T00:16:35Z

cc @TamarChristinaArm

TamarChristinaArm · 2020-04-29T14:39:27Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

+                result = Arm.AdvSimd.Insert(result, 5, e5);
+                result = Arm.AdvSimd.Insert(result, 6, e6);
+                result = Arm.AdvSimd.Insert(result, 7, e7);
+                result = Arm.AdvSimd.Insert(result, 8, e8);


Not for this PR, but since we only have 8 argument registers, the rest will be passed on the stack.
In that case it would be easier to load the remaining 64 bits directly from the stack into the top half of the 128-bit vector.

i.e. do something like this:

and w0, w0, 255
fmov s0, w0
ins v0.b[1], w1
ins v0.b[2], w2
ins v0.b[3], w3
ins v0.b[4], w4
ins v0.b[5], w5
ins v0.b[6], w6
ins v0.b[7], w7
ld1 {v0.d}[1], [sp]

You can do something like this for a lot of the cases with more than 8 arguments. Do you know what code is generated here at the moment?

That's a good point which I didn't realize. I will explore your suggestion.

Generated code

53001C00 uxtb w0, w0
4E011C10 ins v16.b[0], w0
53001C20 uxtb w0, w1
4E031C10 ins v16.b[1], w0
53001C40 uxtb w0, w2
4E051C10 ins v16.b[2], w0
53001C60 uxtb w0, w3
4E071C10 ins v16.b[3], w0
53001C80 uxtb w0, w4
4E091C10 ins v16.b[4], w0
53001CA0 uxtb w0, w5
4E0B1C10 ins v16.b[5], w0
53001CC0 uxtb w0, w6
4E0D1C10 ins v16.b[6], w0
53001CE0 uxtb w0, w7
4E0F1C10 ins v16.b[7], w0
B94023A0 ldr w0, [fp,#32] // [V08 arg8]
53001C00 uxtb w0, w0
4E111C10 ins v16.b[8], w0
B9402BA0 ldr w0, [fp,#40] // [V09 arg9]
53001C00 uxtb w0, w0
4E131C10 ins v16.b[9], w0
B94033A0 ldr w0, [fp,#48] // [V10 arg10]
53001C00 uxtb w0, w0
4E151C10 ins v16.b[10], w0
B9403BA0 ldr w0, [fp,#56] // [V11 arg11]
53001C00 uxtb w0, w0
4E171C10 ins v16.b[11], w0
B94043A0 ldr w0, [fp,#64] // [V12 arg12]
53001C00 uxtb w0, w0
4E191C10 ins v16.b[12], w0
B9404BA0 ldr w0, [fp,#72] // [V13 arg13]
53001C00 uxtb w0, w0
4E1B1C10 ins v16.b[13], w0
B94053A0 ldr w0, [fp,#80] // [V14 arg14]
53001C00 uxtb w0, w0
4E1D1C10 ins v16.b[14], w0
B9405BA0 ldr w0, [fp,#88] // [V15 arg15]
53001C00 uxtb w0, w0
4EB01E08 mov v8.16b, v16.16b
4E1F1C08 ins v8.b[15], w0
D28D0500 movz x0, #0x6828
F2A6EF60 movk x0, #0x377b LSL #16
F2CFFFA0 movk x0, #0x7ffd LSL #32
6E084509 mov v9.d[0], v8.d[1]
97FF53EA bl CORINFO_HELP_NEWSFAST
6E180528 mov v8.d[1], v9.d[0]
3C808008 str q8, [x0,#8]

You can do something like this for a lot of the cases with more than 8 arguments.

Actually, this is the only API that has more than 8 arguments.

I wonder how hard it is going to be to get rid of the uxtb-s.

Yeah, I am not very familiar with this, but would love to get rid of them. Just to call out, they show up for byte, sbyte, short and ushort parameters.

Correct, because these types are usually "normalized" (i.e. sign- or zero-extended) to a 32-bit value.

Yeah, those indeed aren't needed. It would be good to get rid of them if possible, since they double the number of instructions.
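
A minimal portable C++ model of why the uxtb is redundant in front of an ins into a byte lane: the insert only consumes the low 8 bits of the source register, so zero-extending first cannot change what lands in the lane. The function names are hypothetical and only model the instructions' effect on values; the same reasoning would apply to 16-bit lanes with uxth/sxth.

```cpp
#include <cstdint>

// Models `uxtb w, w`: zero-extend the low byte into a 32-bit register.
constexpr std::uint32_t uxtb(std::uint32_t w) { return w & 0xFFu; }

// Models what `ins v.b[i], w` writes into the lane: only the low 8 bits of w.
constexpr std::uint8_t lane_value_from(std::uint32_t w) {
    return static_cast<std::uint8_t>(w);
}

// True when skipping the uxtb yields the same lane value for this register content.
constexpr bool uxtb_is_redundant(std::uint32_t w) {
    return lane_value_from(uxtb(w)) == lane_value_from(w);
}

// Upper bits of the register, whatever they hold, never reach the byte lane.
static_assert(uxtb_is_redundant(0xFFFFFF80u), "upper bits do not affect the lane");
```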

Also

B94053A0 ldr w0, [fp,#80] // [V14 arg14]
53001C00 uxtb w0, w0
4E1D1C10 ins v16.b[14], w0

Without the optimization I talked about above, this should ideally be

ld1 {v16.b}[14], [x1]

where x1 gets incremented, or you could use a … which avoids having to move between register files. But I also don't know how easy this is to do.

I briefly discussed various options with @echesakovMSFT and we concluded that it might not be straightforward. I would like to revisit it after we implement the other APIs. Opened #35688 to track it.

src/coreclr/src/jit/hwintrinsiccodegenarm64.cpp

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64.cs

src/coreclr/src/jit/lowerarmarch.cpp

src/coreclr/src/jit/emitarm64.cpp

echesakov

Looks good to me, with some questions/suggestions.

echesakov · 2020-05-01T20:34:56Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

@@ -752,6 +763,26 @@ public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3,
                return Sse2.UnpackLow(lo64, hi64).AsByte();                                         // <  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 >
            }

+            if (AdvSimd.IsSupported)
+            {
+                Vector128<byte> result = CreateScalarUnsafe(e0);


@kunalspathak You might want to add a comment here in the same fashion as is done for the Sse2 case.

Spoke offline and this is not needed.

kunalspathak · 2020-05-04T16:08:54Z

@tannergooding and @echesakovMSFT - Just FYI, while working on something else I found 2 issues for which we don't have test coverage today.

Vector64.Create((double)10) (any integer value cast to double)
Passing Vector64<double> or Vector64<long> as a parameter to a function.

I will fix these and add test coverage for them before merging.

Dotnet-GitSync-Bot added the area-CodeGen-coreclr label Apr 28, 2020

BruceForstall requested review from echesakov and tannergooding April 29, 2020 00:16

TamarChristinaArm reviewed Apr 29, 2020

ericstj mentioned this pull request Apr 29, 2020

Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610

Closed

echesakov reviewed Apr 29, 2020

tannergooding reviewed Apr 29, 2020

tannergooding reviewed Apr 29, 2020

echesakov mentioned this pull request Apr 30, 2020

[Arm64] Implement Vector64/128.CreateScalar() using AdvSimd.Insert #35300

Merged

kunalspathak mentioned this pull request May 1, 2020

Optimize Vector128.Create() that takes 15 arguments #35688

Open

echesakov approved these changes May 1, 2020

kunalspathak added 9 commits May 1, 2020 14:36

Make Vector64.Create() that takes multiple arguments use ARM64 intrinsic

4ad1eee

Make Vector128.Create() that takes multiple arguments use ARM64 intrinsic

523e67e

Intrinsify Vector64.Create() that takes single argument

ba09d5e

Intrinsify Vector64.Create() that takes single argument

6d62318

Fix edge case where int.MaxValue was failing if used as immediate

3d60b7a

Addressed review comments

660d721

added unit test

1b9c421

fix the return type of emitDecodeByteShiftedImm

dae7537

merge conflicts fixup

76cc65e

kunalspathak force-pushed the create-scalar-multiple branch from 1a86bc1 to 76cc65e Compare May 1, 2020 21:55

kunalspathak added 3 commits May 1, 2020 16:44

moved the assert at right location

6f59efc

Bug fixes

4707161

formatting

dbc8dde

tannergooding approved these changes May 4, 2020

Added test coverage

0c51436

kunalspathak added 2 commits May 4, 2020 17:41

Trimmed the test IL code

0a99bf3

Add required return in the test

0382dea

kunalspathak merged commit d23f1a2 into dotnet:master May 5, 2020

kunalspathak mentioned this pull request Jun 4, 2020

Optimize AsVector, AsVector128, GetUpper, As and WithElement with ARM64 intrinsics #37338

Merged

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
