Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

kunalspathak · 2020-04-03T02:04:50Z

Add support of fmov in JIT to move to and from gp register into float/vector register. With that we will be able to generate that instruction when user calls CreateScalar() C# API.

See #33495 (comment) and needed for #33496 as well.

The text was updated successfully, but these errors were encountered:

kunalspathak · 2020-04-03T02:05:15Z

cc : @tannergooding , @echesakovMSFT , @BruceForstall

kunalspathak · 2020-04-07T00:12:28Z

This is the mapping that I can think of at this point. For below APIs that takes data types that are 32-bit or 64-bit variant, use fmov.

public static Vector64<int> CreateScalarUnsafe(int value);
public static Vector64<uint> CreateScalarUnsafe(uint value);
public static Vector64<float> CreateScalarUnsafe(float value);

public static Vector128<int> CreateScalarUnsafe(int value);
public static Vector128<uint> CreateScalarUnsafe(uint value);
public static Vector128<float> CreateScalarUnsafe(float value);

public static Vector128<long> CreateScalarUnsafe(long value);
public static Vector128<ulong> CreateScalarUnsafe(ulong value);
public static Vector128<double> CreateScalarUnsafe(double value);

fmov Sn/Dn, Xn/Wn

However for the ones, whose arguments are less than 32-bit variants, use ins. The reason being if the argument value is negative, we end up storing 0xFF from 2nd lane onwards if we use fmov.

public static Vector64<byte> CreateScalarUnsafe(byte value);
public static Vector64<sbyte> CreateScalarUnsafe(sbyte value);
public static Vector64<short> CreateScalarUnsafe(short value);
public static Vector64<ushort> CreateScalarUnsafe(ushort value);

public static Vector128<byte> CreateScalarUnsafe(byte value);
public static Vector128<sbyte> CreateScalarUnsafe(sbyte value);
public static Vector128<short> CreateScalarUnsafe(short value);
public static Vector128<ushort> CreateScalarUnsafe(ushort value);

ins Sn.b[0], w0 # sbyte
ins Sn.h[0], w0 # short

Note: We can use fmov for byte and ushort as well because it they don't have negatives, but I just thought that it will be more consistent to use same instruction for signed/unsigned variant of data type.

@TamarChristinaArm , @tannergooding , @echesakovMSFT

tannergooding · 2020-04-07T00:16:29Z

For Unsafe in particular, we are free to do whatever is most efficient. The value of the upper bits is "undefined". For CreateScalar however, we must ensure the upper bits (anything other than the first element) are zeroed.

BruceForstall · 2020-04-07T01:33:21Z

@kunalspathak As you work on System.Runtime.Intrinsics cases, can you please edit/update the top comment in #33308 to be more reflective of the sets of work to do there? When I created that issue, I was a little sloppy for this particular namespace, since there are so many overloads. I'd prefer that the "checkbox items" be split up as appropriate such that if a PR implements some subset, that becomes a checkbox we can check in the overall tracking issue.

kunalspathak · 2020-04-07T01:37:42Z

Sure, I will update that list as I discover them.

TamarChristinaArm · 2020-04-07T15:05:33Z

For Unsafe in particular, we are free to do whatever is most efficient. The value of the upper bits is "undefined". For CreateScalar however, we must ensure the upper bits (anything other than the first element) are zeroed.

Hmm is this because you can pass this type on to a Vector function by accident? There's no instruction that would use the register as a scalar input that would read the top bits.

C compilers usually don't care about this since you can't really do this easily, e.g.

int16x4_t square(int16x4_t c, short a) {
    return vmul_n_s16(c, a);
}

produces

        dup     v1.4h, w0
        mul     v0.4h, v0.4h, v1.h[0]
        ret

So we don't really care about the top bits going into a scalar operation or a by-element as far as I can tell.

tannergooding · 2020-04-07T15:54:33Z

Hmm is this because you can pass this type on to a Vector function by accident? There's no instruction that would use the register as a scalar input that would read the top bits.

CreateScalar takes a T (like byte, short, int, or float) and creates a Vector128<T>. It will set element 0 to the given value and sets elements 1 through element Vector128<T>.Count to be default(T) (which is zero).
CreateScalarUnsafe is similar, but rather than explicitly zeroing the upper elements, it leaves them as whatever value was already there.

The former is the default and it ensures you get deterministic results. The latter exists primarily for perf reasons as there may be scenarios where ensuring the upper bits get zeroed (either explicitly by another instruction or implicitly by an operation) may be undesirable. A couple examples are because the upper elements are unused or because you will be explicitly setting them in a later operation.

X86

On x86, the scalar operations preserve the upper elements by default. That is, if you do Vector128<float> Sse.AddScalar(Vector128<float> left, Vector128<float> right) (which does scalar addition on two Vector128<float>) the result is <left[0] + right[0], left[1], left[2], left[3]>.

What this means is that if you want to ensure "deterministic" results, then you have the following codegen:

; Vector128<float> CreateScalar(float value)
vxorps xmm0, xmm0, xmm0    ; Zero the xmm0 register
vmovss xmm0, xmm0, xmm1    ; Move scalar value (xmm1) into element 0 of xmm0; elements 1, 2, and 3 are taken from xmm0 which was set to zero

; Vector128<float> CreateScalarUnsafe(float value)
vmovss xmm0, xmm0, xmm1    ; Move scalar value (xmm1) into element 0 of xmm0; elements 1, 2, and 3 are taken from xmm0 and could be any value

The latter can also be fully elided from codegen (we can just use the value from xmm1 directly), etc.

ARM

As I understand it, ARM is basically the opposite. It zeroes the upper elements by default except for specific instructions like insert.
This means that for a number of cases CreateScalar and CreateScalarUnsafe will have identical codegen.

However, it seems like there are likely some cases (like byte, sbyte, short, and ushort) where we could simplify the codegen. That is, CreateScalar likely needs to be implemented as a zero-extension followed by an fmov. While CreateScalarUnsafe could likely be implemented just as an insert operation.

echesakov · 2020-04-20T21:20:44Z

This is the mapping that I can think of at this point. For below APIs that takes data types that are 32-bit or 64-bit variant, use fmov.
public static Vector64 CreateScalarUnsafe(int value);
public static Vector64 CreateScalarUnsafe(uint value);
public static Vector64 CreateScalarUnsafe(float value);

public static Vector128 CreateScalarUnsafe(int value);
public static Vector128 CreateScalarUnsafe(uint value);
public static Vector128 CreateScalarUnsafe(float value);

public static Vector128 CreateScalarUnsafe(long value);
public static Vector128 CreateScalarUnsafe(ulong value);
public static Vector128 CreateScalarUnsafe(double value);
fmov Sn/Dn, Xn/Wn
However for the ones, whose arguments are less than 32-bit variants, use ins. The reason being if the argument value is negative, we end up storing 0xFF from 2nd lane onwards if we use fmov.
public static Vector64 CreateScalarUnsafe(byte value);
public static Vector64 CreateScalarUnsafe(sbyte value);
public static Vector64 CreateScalarUnsafe(short value);
public static Vector64 CreateScalarUnsafe(ushort value);

public static Vector128 CreateScalarUnsafe(byte value);
public static Vector128 CreateScalarUnsafe(sbyte value);
public static Vector128 CreateScalarUnsafe(short value);
public static Vector128 CreateScalarUnsafe(ushort value);
ins Sn.b[0], w0 # sbyte
ins Sn.h[0], w0 # short
Note: We can use fmov for byte and ushort as well because it they don't have negatives, but I just thought that it will be more consistent to use same instruction for signed/unsigned variant of data type.

@kunalspathak

I gave more thought to this while working on #35030 and I believe the implementation of CreateScalarUnsafe should be

for byte, sbyte, short, ushort, int, uint, long, ulong

when op1 is CnsInt and encodable as an immediate of either mvni or movi then use the corresponding instruction (i.e. emit MOVI/MVNI Vd.T, #imm8)
when op1 is CnsInt and not encodable as an immediate of either mvni or movi do CodeGen::instGen_Set_Reg_To_Imm to enregister an immediate in a general-purpose register and go to 3)
when op1 is enregistered move op1Reg to a SIMD register by using INS Vd.T[0], op1Reg

for float or double

when op1 is CnsDbl and 0 - emit movi Vd, #0 (wonder if it is more efficient than 1)?)
when op1 is CnsDbl and encodable as an immediate of fmov then emit fmov Sd, #imm8 or fmov Dd, #imm8
when op1 is CnsDbl and not encodable as an immediate of fmov then either
2a) put op1 const in a constant pool, compute the address of the label with adrp followed by ldr Sd, [Xn] or ldr Dd, [Xn] (i.e. do what we usually do for GT_CNS_DBL) or
2b) contain op1 and reinterpret float/double constant as int/long constant, enregister the value in a general-purpose register with CodeGen::instGen_Set_Reg_To_Imm then emit FMOV Dd, Xd or FMOV Sd, Wd - the downside here is more instructions
@TamarChristinaArm Can you please advise what approach here is more efficient assuming that both approaches require an additional temporary general-purpose register?
when op1 is enregistered and targetReg is op1Reg then NOP - the value is already there
when op1 is enregistered and targetReg is not op1Reg then emit FMOV Dd, Dn or FMOV Sd, Sn

@kunalspathak Let me know if you plan to continue working on this or you want me to do this as continuation of #35030.

kunalspathak · 2020-04-20T22:05:09Z

Thanks @echesakovMSFT ! Let me take a stab at it. I will ping you if I hit any blocker.

TamarChristinaArm · 2020-04-21T11:45:02Z

@TamarChristinaArm Can you please advise what approach here is more efficient assuming that both approaches require an additional temporary general-purpose register?

@echesakovMSFT 2b is generally the preferred approach.. for Single this will (pretty much) always be better. For Double things are tricky. There is a cut-off point where creating a Double as an integer becomes more expensive than the literal pool load. In GCC we use the cost model to determine when this would be the case. https://godbolt.org/z/uDJWYj

If it possible to use the perfscore for this? Otherwise I would use a hard metric based on number of movs to give a rough estimation.

echesakov · 2020-04-21T19:56:31Z

2b is generally the preferred approach.. for Single this will (pretty much) always be better. For Double things are tricky. There is a cut-off point where creating a Double as an integer becomes more expensive than the literal pool load. In GCC we use the cost model to determine when this would be the case. https://godbolt.org/z/uDJWYj
If it possible to use the perfscore for this? Otherwise I would use a hard metric based on number of movs to give a rough estimation.

@TamarChristinaArm Thanks for the reply!

cc @dotnet/jit-contrib This is related to our discussion of "constant pool vs sequence of movs" we had yesterday.

Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Apr 3, 2020

kunalspathak mentioned this issue Apr 3, 2020

BitOperations arm64 intrinsic for LeadingZeroCount, TrailingZeroCount and Log2 #34486

Merged

BruceForstall added this to the 5.0 milestone Apr 3, 2020

BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 3, 2020

kunalspathak self-assigned this Apr 3, 2020

kunalspathak changed the title ~~Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalar API~~ Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API Apr 6, 2020

kunalspathak mentioned this issue Apr 6, 2020

Add support of CreateScalarUnsafe() for arm64 intrinsic in JIT #34579

Merged

echesakov mentioned this issue Apr 22, 2020

[Arm64] Implement Vector64/128.CreateScalar() using AdvSimd.Insert #35300

Merged

kunalspathak closed this as completed in #34579 Apr 27, 2020

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

kunalspathak commented Apr 3, 2020

kunalspathak commented Apr 3, 2020

kunalspathak commented Apr 7, 2020

tannergooding commented Apr 7, 2020

BruceForstall commented Apr 7, 2020

kunalspathak commented Apr 7, 2020

TamarChristinaArm commented Apr 7, 2020

tannergooding commented Apr 7, 2020

echesakov commented Apr 20, 2020

kunalspathak commented Apr 20, 2020

TamarChristinaArm commented Apr 21, 2020

echesakov commented Apr 21, 2020

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

Comments

kunalspathak commented Apr 3, 2020

kunalspathak commented Apr 3, 2020

kunalspathak commented Apr 7, 2020

tannergooding commented Apr 7, 2020

BruceForstall commented Apr 7, 2020

kunalspathak commented Apr 7, 2020

TamarChristinaArm commented Apr 7, 2020

tannergooding commented Apr 7, 2020

X86

ARM

echesakov commented Apr 20, 2020

kunalspathak commented Apr 20, 2020

TamarChristinaArm commented Apr 21, 2020

echesakov commented Apr 21, 2020