Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

Closed
kunalspathak opened this issue Apr 3, 2020 · 11 comments · Fixed by #34579
Closed

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

kunalspathak opened this issue Apr 3, 2020 · 11 comments · Fixed by #34579
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Copy link
Member

Add support of fmov in JIT to move to and from gp register into float/vector register. With that we will be able to generate that instruction when user calls CreateScalar() C# API.

See #33495 (comment) and needed for #33496 as well.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Apr 3, 2020
@kunalspathak
Copy link
Member Author

cc : @tannergooding , @echesakovMSFT , @BruceForstall

@BruceForstall BruceForstall added this to the 5.0 milestone Apr 3, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 3, 2020
@kunalspathak kunalspathak self-assigned this Apr 3, 2020
@kunalspathak kunalspathak changed the title Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalar API Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API Apr 6, 2020
@kunalspathak
Copy link
Member Author

This is the mapping that I can think of at this point. For below APIs that takes data types that are 32-bit or 64-bit variant, use fmov.

public static Vector64<int> CreateScalarUnsafe(int value);
public static Vector64<uint> CreateScalarUnsafe(uint value);
public static Vector64<float> CreateScalarUnsafe(float value);

public static Vector128<int> CreateScalarUnsafe(int value);
public static Vector128<uint> CreateScalarUnsafe(uint value);
public static Vector128<float> CreateScalarUnsafe(float value);

public static Vector128<long> CreateScalarUnsafe(long value);
public static Vector128<ulong> CreateScalarUnsafe(ulong value);
public static Vector128<double> CreateScalarUnsafe(double value);
fmov Sn/Dn, Xn/Wn

However for the ones, whose arguments are less than 32-bit variants, use ins. The reason being if the argument value is negative, we end up storing 0xFF from 2nd lane onwards if we use fmov.

public static Vector64<byte> CreateScalarUnsafe(byte value);
public static Vector64<sbyte> CreateScalarUnsafe(sbyte value);
public static Vector64<short> CreateScalarUnsafe(short value);
public static Vector64<ushort> CreateScalarUnsafe(ushort value);

public static Vector128<byte> CreateScalarUnsafe(byte value);
public static Vector128<sbyte> CreateScalarUnsafe(sbyte value);
public static Vector128<short> CreateScalarUnsafe(short value);
public static Vector128<ushort> CreateScalarUnsafe(ushort value);
ins Sn.b[0], w0 # sbyte
ins Sn.h[0], w0 # short

Note: We can use fmov for byte and ushort as well because it they don't have negatives, but I just thought that it will be more consistent to use same instruction for signed/unsigned variant of data type.

@TamarChristinaArm , @tannergooding , @echesakovMSFT

@tannergooding
Copy link
Member

For Unsafe in particular, we are free to do whatever is most efficient. The value of the upper bits is "undefined". For CreateScalar however, we must ensure the upper bits (anything other than the first element) are zeroed.

@BruceForstall
Copy link
Member

@kunalspathak As you work on System.Runtime.Intrinsics cases, can you please edit/update the top comment in #33308 to be more reflective of the sets of work to do there? When I created that issue, I was a little sloppy for this particular namespace, since there are so many overloads. I'd prefer that the "checkbox items" be split up as appropriate such that if a PR implements some subset, that becomes a checkbox we can check in the overall tracking issue.

@kunalspathak
Copy link
Member Author

Sure, I will update that list as I discover them.

@TamarChristinaArm
Copy link
Contributor

For Unsafe in particular, we are free to do whatever is most efficient. The value of the upper bits is "undefined". For CreateScalar however, we must ensure the upper bits (anything other than the first element) are zeroed.

Hmm is this because you can pass this type on to a Vector function by accident? There's no instruction that would use the register as a scalar input that would read the top bits.

C compilers usually don't care about this since you can't really do this easily, e.g.

int16x4_t square(int16x4_t c, short a) {
    return vmul_n_s16(c, a);
}

produces

        dup     v1.4h, w0
        mul     v0.4h, v0.4h, v1.h[0]
        ret

So we don't really care about the top bits going into a scalar operation or a by-element as far as I can tell.

@tannergooding
Copy link
Member

Hmm is this because you can pass this type on to a Vector function by accident? There's no instruction that would use the register as a scalar input that would read the top bits.

CreateScalar takes a T (like byte, short, int, or float) and creates a Vector128<T>. It will set element 0 to the given value and sets elements 1 through element Vector128<T>.Count to be default(T) (which is zero).
CreateScalarUnsafe is similar, but rather than explicitly zeroing the upper elements, it leaves them as whatever value was already there.

The former is the default and it ensures you get deterministic results. The latter exists primarily for perf reasons as there may be scenarios where ensuring the upper bits get zeroed (either explicitly by another instruction or implicitly by an operation) may be undesirable. A couple examples are because the upper elements are unused or because you will be explicitly setting them in a later operation.

X86

On x86, the scalar operations preserve the upper elements by default. That is, if you do Vector128<float> Sse.AddScalar(Vector128<float> left, Vector128<float> right) (which does scalar addition on two Vector128<float>) the result is <left[0] + right[0], left[1], left[2], left[3]>.

What this means is that if you want to ensure "deterministic" results, then you have the following codegen:

; Vector128<float> CreateScalar(float value)
vxorps xmm0, xmm0, xmm0    ; Zero the xmm0 register
vmovss xmm0, xmm0, xmm1    ; Move scalar value (xmm1) into element 0 of xmm0; elements 1, 2, and 3 are taken from xmm0 which was set to zero

; Vector128<float> CreateScalarUnsafe(float value)
vmovss xmm0, xmm0, xmm1    ; Move scalar value (xmm1) into element 0 of xmm0; elements 1, 2, and 3 are taken from xmm0 and could be any value

The latter can also be fully elided from codegen (we can just use the value from xmm1 directly), etc.

ARM

As I understand it, ARM is basically the opposite. It zeroes the upper elements by default except for specific instructions like insert.
This means that for a number of cases CreateScalar and CreateScalarUnsafe will have identical codegen.

However, it seems like there are likely some cases (like byte, sbyte, short, and ushort) where we could simplify the codegen. That is, CreateScalar likely needs to be implemented as a zero-extension followed by an fmov. While CreateScalarUnsafe could likely be implemented just as an insert operation.

@echesakov
Copy link
Contributor

This is the mapping that I can think of at this point. For below APIs that takes data types that are 32-bit or 64-bit variant, use fmov.
public static Vector64 CreateScalarUnsafe(int value);
public static Vector64 CreateScalarUnsafe(uint value);
public static Vector64 CreateScalarUnsafe(float value);

public static Vector128 CreateScalarUnsafe(int value);
public static Vector128 CreateScalarUnsafe(uint value);
public static Vector128 CreateScalarUnsafe(float value);

public static Vector128 CreateScalarUnsafe(long value);
public static Vector128 CreateScalarUnsafe(ulong value);
public static Vector128 CreateScalarUnsafe(double value);
fmov Sn/Dn, Xn/Wn
However for the ones, whose arguments are less than 32-bit variants, use ins. The reason being if the argument value is negative, we end up storing 0xFF from 2nd lane onwards if we use fmov.
public static Vector64 CreateScalarUnsafe(byte value);
public static Vector64 CreateScalarUnsafe(sbyte value);
public static Vector64 CreateScalarUnsafe(short value);
public static Vector64 CreateScalarUnsafe(ushort value);

public static Vector128 CreateScalarUnsafe(byte value);
public static Vector128 CreateScalarUnsafe(sbyte value);
public static Vector128 CreateScalarUnsafe(short value);
public static Vector128 CreateScalarUnsafe(ushort value);
ins Sn.b[0], w0 # sbyte
ins Sn.h[0], w0 # short
Note: We can use fmov for byte and ushort as well because it they don't have negatives, but I just thought that it will be more consistent to use same instruction for signed/unsigned variant of data type.

@kunalspathak

I gave more thought to this while working on #35030 and I believe the implementation of CreateScalarUnsafe should be

for byte, sbyte, short, ushort, int, uint, long, ulong

  1. when op1 is CnsInt and encodable as an immediate of either mvni or movi then use the corresponding instruction (i.e. emit MOVI/MVNI Vd.T, #imm8)
  2. when op1 is CnsInt and not encodable as an immediate of either mvni or movi do CodeGen::instGen_Set_Reg_To_Imm to enregister an immediate in a general-purpose register and go to 3)
  3. when op1 is enregistered move op1Reg to a SIMD register by using INS Vd.T[0], op1Reg

for float or double

  1. when op1 is CnsDbl and 0 - emit movi Vd, #0 (wonder if it is more efficient than 1)?)
  2. when op1 is CnsDbl and encodable as an immediate of fmov then emit fmov Sd, #imm8 or fmov Dd, #imm8
  3. when op1 is CnsDbl and not encodable as an immediate of fmov then either
    2a) put op1 const in a constant pool, compute the address of the label with adrp followed by ldr Sd, [Xn] or ldr Dd, [Xn] (i.e. do what we usually do for GT_CNS_DBL) or
    2b) contain op1 and reinterpret float/double constant as int/long constant, enregister the value in a general-purpose register with CodeGen::instGen_Set_Reg_To_Imm then emit FMOV Dd, Xd or FMOV Sd, Wd - the downside here is more instructions
    @TamarChristinaArm Can you please advise what approach here is more efficient assuming that both approaches require an additional temporary general-purpose register?
  4. when op1 is enregistered and targetReg is op1Reg then NOP - the value is already there
  5. when op1 is enregistered and targetReg is not op1Reg then emit FMOV Dd, Dn or FMOV Sd, Sn

@kunalspathak Let me know if you plan to continue working on this or you want me to do this as continuation of #35030.

@kunalspathak
Copy link
Member Author

Thanks @echesakovMSFT ! Let me take a stab at it. I will ping you if I hit any blocker.

@TamarChristinaArm
Copy link
Contributor

@TamarChristinaArm Can you please advise what approach here is more efficient assuming that both approaches require an additional temporary general-purpose register?

@echesakovMSFT 2b is generally the preferred approach.. for Single this will (pretty much) always be better. For Double things are tricky. There is a cut-off point where creating a Double as an integer becomes more expensive than the literal pool load. In GCC we use the cost model to determine when this would be the case. https://godbolt.org/z/uDJWYj

If it possible to use the perfscore for this? Otherwise I would use a hard metric based on number of movs to give a rough estimation.

@echesakov
Copy link
Contributor

2b is generally the preferred approach.. for Single this will (pretty much) always be better. For Double things are tricky. There is a cut-off point where creating a Double as an integer becomes more expensive than the literal pool load. In GCC we use the cost model to determine when this would be the case. https://godbolt.org/z/uDJWYj
If it possible to use the perfscore for this? Otherwise I would use a hard metric based on number of movs to give a rough estimation.

@TamarChristinaArm Thanks for the reply!

cc @dotnet/jit-contrib This is related to our discussion of "constant pool vs sequence of movs" we had yesterday.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants