Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64 intrinsic support for Vector64.Create() and Vector128.Create() #35590

Merged
merged 15 commits into from
May 5, 2020

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Apr 28, 2020

Added hardware intrinsic for various overloads of Vector64.Create() and Vector128.Create():

  • Multiple arguments - The APIs that takes multiple parameters to be set in respective lanes are implemented in C# using AdvSimd.Insert.
  • Single arguments - The APIs that takes single argument and should be copied in all lanes are implemented in JIT by generating dup/mov/fmov instructions.

While I was there, I noticed an edge case where we hit assert if trying to emit an immediate int.MaxValue. Fixed it as well.

Contributes to #33308 and #33496.
Fixes: #35821

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 28, 2020
@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib , @tannergooding

@BruceForstall
Copy link
Member

cc @TamarChristinaArm

result = Arm.AdvSimd.Insert(result, 5, e5);
result = Arm.AdvSimd.Insert(result, 6, e6);
result = Arm.AdvSimd.Insert(result, 7, e7);
result = Arm.AdvSimd.Insert(result, 8, e8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for this PR, but since we only have 8 argument registers the rest will be passed on the stack.
In which case would be easier to load the remaining 64 bits into a register directly from the stack into the top part of the 128-bit vector.

i.e. do something like this:

        and     w0, w0, 255
        fmov    s0, w0
        ins     v0.b[1], w1
        ins     v0.b[2], w2
        ins     v0.b[3], w3
        ins     v0.b[4], w4
        ins     v0.b[5], w5
        ins     v0.b[6], w6
        ins     v0.b[7], w7
        ld1     {v0.d}[1], [sp]

And you can do this is something like this for a lot of the cases with more than 8 arguments. Do you know what the code generates here at the moment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point which I didn't realize. I will explore your suggestion.

Generated code
        53001C00          uxtb    w0, w0
        4E011C10          ins     v16.b[0], w0
        53001C20          uxtb    w0, w1
        4E031C10          ins     v16.b[1], w0
        53001C40          uxtb    w0, w2
        4E051C10          ins     v16.b[2], w0
        53001C60          uxtb    w0, w3
        4E071C10          ins     v16.b[3], w0
        53001C80          uxtb    w0, w4
        4E091C10          ins     v16.b[4], w0
        53001CA0          uxtb    w0, w5
        4E0B1C10          ins     v16.b[5], w0
        53001CC0          uxtb    w0, w6
        4E0D1C10          ins     v16.b[6], w0
        53001CE0          uxtb    w0, w7
        4E0F1C10          ins     v16.b[7], w0
        B94023A0          ldr     w0, [fp,#32]  // [V08 arg8]
        53001C00          uxtb    w0, w0
        4E111C10          ins     v16.b[8], w0
        B9402BA0          ldr     w0, [fp,#40]  // [V09 arg9]
        53001C00          uxtb    w0, w0
        4E131C10          ins     v16.b[9], w0
        B94033A0          ldr     w0, [fp,#48]  // [V10 arg10]
        53001C00          uxtb    w0, w0
        4E151C10          ins     v16.b[10], w0
        B9403BA0          ldr     w0, [fp,#56]  // [V11 arg11]
        53001C00          uxtb    w0, w0
        4E171C10          ins     v16.b[11], w0
        B94043A0          ldr     w0, [fp,#64]  // [V12 arg12]
        53001C00          uxtb    w0, w0
        4E191C10          ins     v16.b[12], w0
        B9404BA0          ldr     w0, [fp,#72]  // [V13 arg13]
        53001C00          uxtb    w0, w0
        4E1B1C10          ins     v16.b[13], w0
        B94053A0          ldr     w0, [fp,#80]  // [V14 arg14]
        53001C00          uxtb    w0, w0
        4E1D1C10          ins     v16.b[14], w0
        B9405BA0          ldr     w0, [fp,#88]  // [V15 arg15]
        53001C00          uxtb    w0, w0
        4EB01E08          mov     v8.16b, v16.16b
        4E1F1C08          ins     v8.b[15], w0
        D28D0500          movz    x0, #0x6828
        F2A6EF60          movk    x0, #0x377b LSL #16
        F2CFFFA0          movk    x0, #0x7ffd LSL #32
        6E084509          mov     v9.d[0], v8.d[1]
        97FF53EA          bl      CORINFO_HELP_NEWSFAST
        6E180528          mov     v8.d[1], v9.d[0]
        3C808008          str     q8, [x0,#8]

And you can do this is something like this for a lot of the cases with more than 8 arguments.

Actually, this is the only API that has more than 8 arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how hard is goint to be to get rid of the uxtb-s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I am not much familiar with this, but would love to get rid of them. Just to call out, they show up for byte, sbyte, short and ushort parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I am not much familiar with this, but would love to get rid of them. Just to call out, they show up for byte, sbyte, short and ushort parameters.

Correct, because these types are usually "normalized" (i.e. sign- or zero-extended) to a 32 bit value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those aren't needed indeed, would be good to get rid of them if possible since they double the number of instructions.

Also

        B94053A0          ldr     w0, [fp,#80]  // [V14 arg14]
        53001C00          uxtb    w0, w0
        4E1D1C10          ins     v16.b[14], w0

Without the optimization I talked about above should ideally be

ld1 {v16.b[14]}, [x1]

where x1 gets incremented, or you could use a which prevents from having to move between register files. But also don't know how easy this is to do..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly discussed various options with @echesakovMSFT and concluded that it might not be straight forward. I would like to revisit it after we implement other APIs. Opened #35688 to track it.

Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good to me with some questions/suggestions.

src/coreclr/src/jit/emitarm64.cpp Show resolved Hide resolved
@@ -752,6 +763,26 @@ public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3,
return Sse2.UnpackLow(lo64, hi64).AsByte(); // < 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 >
}

if (AdvSimd.IsSupported)
{
Vector128<byte> result = CreateScalarUnsafe(e0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kunalspathak You might want to add comment here in the same fashion as it's done for Sse2 case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke offline and this is not needed.

@kunalspathak
Copy link
Member Author

@tannergooding and @echesakovMSFT - Just FYI, while working on something else I found 2 issues for which we don't have test coverage today.

  1. Vector64.Create((double)10) (any integer value casted to double)
  2. Passing Vector64<double> or Vector64<long> as parameter to a function.

I will fix and add test coverage for it before merging.

@kunalspathak kunalspathak merged commit d23f1a2 into dotnet:master May 5, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assertion failed '!"Didn't find a class handle for simdType"' when trying to pass Vector64<ulong> to function
6 participants