
Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033

Closed
tannergooding opened this issue Mar 26, 2018 · 18 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (product code improvement that does not require public API changes/additions)
Milestone: Future

@tannergooding (Member)

For the case where the Vector64.Create, Vector128.Create, and Vector256.Create helper functions are called with all constant arguments, we should support emitting a constant which can be loaded from memory, rather than emitting a chain of shuffle or insert calls.

We may also find some benefit in doing the same for partial constants, as loading a partial constant and fixing up the remaining elements with a few inserts can still be faster than treating the value as entirely non-constant.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium
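
For illustration, here is a minimal sketch (assumed names and values, using today's Vector128.Create API shape) of the kind of call site this covers; every argument is a compile-time constant, so the whole vector could live in the data section and be loaded with a single instruction:

using System.Runtime.Intrinsics;

static class Example
{
    // All four arguments are literals. Instead of a chain of insert/shuffle
    // instructions, the JIT could materialize this as one 16-byte memory load.
    static Vector128<float> Coefficients() => Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);
}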

@tannergooding (Member, Author)

FYI. @CarolEidt, @fiigii

@fiigii (Contributor)

fiigii commented Mar 26, 2018

rather than emitting a chain of shuffle or insert calls.

We may need more scenarios and profiling data to support this statement...

@tannergooding (Member, Author)

We may need more scenarios and profiling data to support this statement...

It may be worth noting that MSVC, GCC, and Clang all do this.

They do it for both _mm_set1 and _mm_set. However, I imagine it is only beneficial for the latter (_mm_set), since the former (_mm_set1) is generally a single broadcast or shuffle.

I do agree we should do the appropriate profiling/etc, but the native compilers doing it provide a good precedent.
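
As a rough sketch of the two shapes being compared (hypothetical C# equivalents of _mm_set1 and _mm_set; the instruction sequences noted in the comments are typical, not guaranteed):

using System.Runtime.Intrinsics;

static class SetExamples
{
    // _mm_set1-style: one value splatted to every element. This is usually a
    // single broadcast or shuffle, so a memory constant may buy little here.
    static Vector128<float> AllSame() => Vector128.Create(5.0f);

    // _mm_set-style: four distinct constants. Without a data-section constant
    // this tends to become a chain of inserts; with one it is a single load.
    static Vector128<float> Distinct() => Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);
}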

@4creators (Contributor)

it is only beneficial for the latter (_mm_set), since the former (_mm_set1) is generally a single broadcast or shuffle

It's beneficial for both, and it's done by all native compilers. Shuffles are cheap, but broadcasts mostly are not.

@CarolEidt (Contributor)

I do agree we should do the appropriate profiling/etc, but the native compilers doing it provide a good precedent.

There are many things that native compilers do that the JIT doesn't. We need to justify this kind of work with actual scenarios.

@tannergooding (Member, Author)

tannergooding commented Mar 27, 2018

@CarolEidt, this is basically the same as: https://github.com/dotnet/coreclr/issues/17256 (just replace = new Vector4(1, 2, 3, 4) with = SetVector128(1, 2, 3, 4))

I can come up with a bunch of scenarios showing the bad codegen here and actual numbers if it's needed.

@fiigii (Contributor)

fiigii commented Mar 28, 2018

this is basically the same as: dotnet/coreclr#17256 (just replace = new Vector4(1, 2, 3, 4) with = SetVector128(1, 2, 3, 4))

The literals of float/double have to be read from memory, but integers can be encoded in the instructions. So, SetVector128/256 should have different codegen heuristics from Vector4.

@tannergooding (Member, Author)

The literals of float/double have to be read from memory

Right, so for float/double, it should be more efficient to do a single read, when all four values are constant.

but integers can be encoded in the instructions

Could you elaborate?

None of the integer SIMD load instructions that I can find (movd, movq, movdqa, movdqu, etc.) support immediates. So at best, you will be emitting a mov reg, imm64 followed by a pinsrq xmm, reg, imm per 8 bytes (on newer hardware; older hardware needs more complex code involving shifts).
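
For concreteness, a hedged sketch of an all-constant integer vector (hypothetical values; the instruction sequences in the comments are typical x64 codegen, not a guarantee):

using System.Runtime.Intrinsics;

static class IntegerExample
{
    // Built from general-purpose registers this is roughly:
    //   mov     rax, 0x0807060504030201
    //   vmovq   xmm0, rax
    //   mov     rax, 0x100F0E0D0C0B0A09
    //   vpinsrq xmm0, xmm0, rax, 1
    // whereas a data-section constant is a single vmovdqu/vmovdqa load.
    static Vector128<long> Packed() => Vector128.Create(0x0807060504030201L, 0x100F0E0D0C0B0A09L);
}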

@fiigii (Contributor)

fiigii commented Mar 28, 2018

Could you elaborate?
None of the integer SIMD load instructions that ...
you will be emitting a mov reg, imm64 ...

Yes, I meant the codegen of "evaluating the integer constants" rather than the SIMD instruction 😄

it should be more efficient to do a single read, when all four values are constant.

Yes, we would have to store each constant vector in a memory slot. If a program has many distinct constant vectors, that would consume a lot of memory. So there is a trade-off...

@mikedn (Contributor)

mikedn commented Mar 28, 2018

If a program has many distinct constant vectors, that would consume a lot of memory. There is a trade-off...

I doubt there can be so many distinct constant vectors in a program that it would matter. The real problem is that each method gets its own data section and thus its own constants, and even within the same data section constants are not deduplicated, so the same constant can be emitted multiple times. Add the lack of proper packing into the mix and things start to look a bit risky.

All this already happens today for float/double constants but at least those are much smaller...

@tannergooding (Member, Author)

If a program has many distinct constant vectors, that would consume a lot of memory.

I'm not so sure about this part. It generally takes more bytes to emit the insert/shift code than it does to store the raw bytes and read them from memory.

; Chain of inserts: ~38 bytes of code, plus 16 bytes of storage
vmovss    xmm0, dword ptr [rip+0x00]
vinsertps xmm0, xmm0, dword ptr [rip+0x04], 0x10
vinsertps xmm0, xmm0, dword ptr [rip+0x08], 0x20
vinsertps xmm0, xmm0, dword ptr [rip+0x0C], 0x30
; Single 16-byte load: ~8 bytes of code, plus 16 bytes of storage
vmovups   xmm0, xmmword ptr [rip+0x00]

So, even with "perfect" deduping of float constants (which we don't have), we still only have a code savings of ~2 bytes.

@mikedn (Contributor)

mikedn commented Mar 28, 2018

Hmm, right, SetVector tends to generate so much code that either way things aren't that great in terms of size. The concerns about data size would be more relevant to SetAll, which generates much less code, and yet native compilers tend to use memory constants for that as well. But then native compilers have the luxury of deduplicating constants...

@tannergooding changed the title from "Support emitting a constant for certain SetVector hardware intrinsic helpers" to "Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods" on Nov 21, 2019
@tannergooding (Member, Author)

I've modified this issue to track Vector*.Create, as the SetVector* methods no longer exist and were replaced by the Create methods above.

@CarolEidt (Contributor)

It looks like dotnet/coreclr#27909 gives us a scenario to motivate this.

@mikedn (Contributor)

mikedn commented Nov 22, 2019

Note that what I did with Vector4 & co. may be insufficient/too limited. The proper solution might require adding GT_CNS_SIMD.

@damageboy (Contributor)

I also have a bunch of constants that could use this.

I managed to turn these into hand-written ReadOnlySpan<byte> property getters + LoadDquVector256().

While the use of ReadOnlySpan<byte> initially looks risky, given that these intrinsics are x86-only anyway and the constants are obviously x86-only as well, it's an ugliness I'm willing to bear.

This mostly alleviates the pain of Vector256.Create() with constants, which is quite considerable.
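
A minimal sketch of that workaround, assuming a hypothetical 256-bit byte constant (the names and values are made up): a ReadOnlySpan<byte> over a constant array initializer is emitted by the C# compiler as static data in the assembly, so the getter is just a pointer into that data plus an unaligned load:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ConstantVectors
{
    // Hypothetical constant; Roslyn emits these bytes directly into the
    // assembly's data section, so no array is allocated at run time.
    private static ReadOnlySpan<byte> IdentityBytes => new byte[32]
    {
         0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
    };

    internal static unsafe Vector256<byte> Identity
    {
        get
        {
            fixed (byte* p = IdentityBytes)
            {
                return Avx.LoadDquVector256(p);
            }
        }
    }
}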

@msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
@msftgits added this to the Future milestone on Jan 31, 2020
@tannergooding (Member, Author)

I'd like to see if I can get this done as it should help codegen in a few places.

Would we want to try and mimic dotnet/coreclr#14393, which basically just handles this in lowering by replacing the node with a GT_CLS_VAR_ADDR and an emitted constant, or would we want to introduce a new GT_CNS_* node type (GT_CNS_SIMD, as suggested above, or GT_CNS_VEC might be good names) and handle this earlier (maybe in importation or morph)?

CC. @CarolEidt, @dotnet/jit-contrib

@tannergooding (Member, Author)

Resolved with #35857

@ghost locked this issue as resolved and limited conversation to collaborators on Dec 17, 2020