Lowering subset of Vector512 methods for avx512. #82953
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
|
Force-pushed from c81ec4a to d084305.
@DeepakRajendrakumaran Does this address a specific GitHub issue (work item) that is listed on #77034? |
This PR addresses part of #80814. I plan to add a few more methods here, including store, so there will be further commits this week. I will move this PR from draft to live once it's ready for review. |
Looks to be a small TP (throughput) regression, likely due to the additional case statements changing how the jump tables or other dispatch work. It's possible some of these would be better handled via flags or something similar long term, but I don't think that's something we strictly have to handle as part of this PR. Might be good as a separate PR and then taking this PR after if the up to |
Force-pushed from fba4927 to a50908a.
CC. @dotnet/jit-contrib, @dotnet/avx512-contrib for secondary review and merging |
Maybe because all the |
I can switch this around and check again. I'll have another set of methods to lower pretty soon and this should be easy to check |
Note that our SPMI throughput measurements are based on collections taken on the Helix test machines. I don't think these are AVX-512 capable machines, so they presumably hit the AVX2 paths. That seems like the path to optimize for in general, based on the expected customer installed base. Also note that TP measurements are done using a native JIT built without PGO. Theoretically, if all the code paths were hit in the training scenarios (unlikely), the native compiler would rearrange the order of branches. Final note: the TP measurements are instruction counts, not cycle counts. That is as close a proxy to time-based throughput as we can reliably get. |
|
This PR addresses the following issue: #80814. It currently supports lowering the following methods:
Vector512.Load()
Vector512.LoadUnsafe()
Vector512.LoadAligned()
Vector512.LoadAlignedNonTemporal()
Vector512.Store()
Vector512.StoreUnsafe()
Vector512.StoreAligned()
Vector512.StoreAlignedNonTemporal()
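
For readers unfamiliar with how the plain and aligned variants in the list above differ, here is a minimal portable C++ sketch of the contract. It is not the JIT's lowering code and not the .NET API: `Vec512f`, `load`, `load_aligned`, `store`, `store_aligned`, and `is_aligned_64` are hypothetical names chosen for illustration, and throwing on a misaligned address is an assumption of this sketch, not a statement about runtime behavior. The non-temporal variants are cache hints and are not modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical stand-in for a 512-bit vector: 16 floats = 64 bytes.
struct alignas(64) Vec512f {
    float lanes[16];
};

constexpr std::size_t kVec512Bytes = 64;

// True when p sits on a 64-byte boundary.
inline bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kVec512Bytes == 0;
}

// Unaligned load: any readable address is acceptable.
inline Vec512f load(const float* src) {
    Vec512f v;
    std::memcpy(v.lanes, src, kVec512Bytes);
    return v;
}

// Aligned load: this sketch rejects addresses that are not 64-byte aligned
// (an illustrative choice, see the lead-in).
inline Vec512f load_aligned(const float* src) {
    if (!is_aligned_64(src)) {
        throw std::invalid_argument("source is not 64-byte aligned");
    }
    return load(src);
}

// Unaligned store: any writable address is acceptable.
inline void store(float* dst, const Vec512f& v) {
    std::memcpy(dst, v.lanes, kVec512Bytes);
}

// Aligned store: same 64-byte requirement as load_aligned.
inline void store_aligned(float* dst, const Vec512f& v) {
    if (!is_aligned_64(dst)) {
        throw std::invalid_argument("destination is not 64-byte aligned");
    }
    store(dst, v);
}
```

With a buffer declared as `alignas(64) float buf[32]`, `load_aligned(buf)` succeeds, `load(buf + 1)` succeeds, and `load_aligned(buf + 1)` throws in this sketch because the address is only 4 bytes past a 64-byte boundary.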