
JIT ARM64-SVE: Add TrueMask and LoadVector #98218

Merged
merged 18 commits into from
Mar 12, 2024

Conversation

a74nh (Contributor) commented Feb 9, 2024

WIP patch to add TrueMask and LoadVector support

a74nh added 2 commits February 9, 2024 11:44
Change-Id: I285f8aba668409ca94e11be2489a6d9b50a4ec6e
Change-Id: I3ad4fd9a8d823cb43a9546ba6356006a0907ac57
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 9, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Feb 9, 2024
Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs: please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change.

ghost commented Feb 9, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

WIP patch to add TrueMask and LoadVector support

Author: a74nh
Assignees: -
Labels:

area-CodeGen-coreclr, new-api-needs-documentation, community-contribution

Milestone: -

a74nh (Contributor, Author) commented Feb 9, 2024

Wanted to show where I am with getting some API code working.

The intention here is to use TrueMask() to get a full predicate register, and then pass it into a load function.

The test in Sve_mine.cs is not intended for merging; it is only there until I get template testing working.

        [MethodImpl(MethodImplOptions.NoInlining)]
        public unsafe static Vector<byte> LoadVector_ImplicitMask(byte* address)
        {
            Vector<byte> mask = Sve.TrueMask(SveMaskPattern.All);
            return Sve.LoadVector(mask, address);
        }

This generates:

G_M32969_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x20]!
            mov     fp, sp
            str     x0, [fp, #0x18]	// [V00 arg0]
						;; size=12 bbWeight=1 PerfScore 2.50
G_M32969_IG02:  ;; offset=0x000C
            ptrue   p7.b
            ldr     x0, [fp, #0x18]	// [V00 arg0]
            ld1b    { z0.b }, p7/z, [x0]
						;; size=12 bbWeight=1 PerfScore 12.00
G_M32969_IG03:  ;; offset=0x0018
            ldp     fp, lr, [sp], #0x20
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00

When run, the test sometimes passes with:

1
2
3
....

And sometimes fails with:

0 1 != 0
0
1 2 != 0
0
2 3 != 0
0
....

I assume it's due either to the register allocation code not being done yet, or to something missing in the GC handling or related areas.

@kunalspathak @tannergooding

ryujit-bot commented:
Diff results for #98218

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.01% to -0.00%)
Collection PDIFF
coreclr_tests.run.linux.arm64.checked.mch -0.01%
libraries.crossgen2.linux.arm64.checked.mch -0.01%
libraries.pmi.linux.arm64.checked.mch -0.01%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch -0.01%

Throughput diffs for osx/arm64 ran on windows/x64

MinOpts (-0.01% to -0.00%)
Collection PDIFF
benchmarks.run.osx.arm64.checked.mch -0.01%
benchmarks.run_pgo.osx.arm64.checked.mch -0.01%
benchmarks.run_tiered.osx.arm64.checked.mch -0.01%
coreclr_tests.run.osx.arm64.checked.mch -0.01%
libraries.crossgen2.osx.arm64.checked.mch -0.01%
libraries.pmi.osx.arm64.checked.mch -0.01%
libraries_tests_no_tiered_compilation.run.osx.arm64.Release.mch -0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to -0.00%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch -0.01%
benchmarks.run_pgo.windows.arm64.checked.mch -0.01%
benchmarks.run_tiered.windows.arm64.checked.mch -0.01%
coreclr_tests.run.windows.arm64.checked.mch -0.01%
libraries.crossgen2.windows.arm64.checked.mch -0.01%
libraries.pmi.windows.arm64.checked.mch -0.01%
libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch -0.01%

Details here


kunalspathak (Member) left a comment:
looks good overall

// this output can be used as a per-element mask
HW_Flag_ReturnsPerElementMask = 0x10000,

// The intrinsic uses a mask in arg1 to select elements present in the result
Member:

arg1: Is it always the case?

Member:

Can we not just check for TYP_MASK to determine this?

a74nh (Contributor, Author):

arg1: Is it always the case?

Yes, that's the SVE convention: result, then mask, then inputs.

Can we not just check for TYP_MASK to determine this?

Ok, that sounds better. I can look and see how this would be done.

a74nh (Contributor, Author):

Can we not just check for TYP_MASK to determine this?

@tannergooding - Looking closer at this, I'm not quite sure what this would entail.

In hwintrinsiclistxarch.h, the only reference to a mask is the use of HW_Flag_ReturnsPerElementMask.

I can't see any obvious way for the JIT to know that the first arg of the method is expected to be a predicate mask, other than to use the enum or to hardcode it with case statements somewhere.

The JIT can check the type of the actual arg1 child node, but that only tells us what the type actually is, not what the expected type is. I imagine I'll have to write code that says: if the actual type and expected type don't match, then somehow convert arg1 to the expected type.

Member:

I imagine I'll have to write code that says if the actual type and expected type don't match, then somehow convert arg1 to the expected type.

Yes, basically.

Most intrinsics support masking optionally and so you'll have something similar to this https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19988-L20008. That is, you'll have some bool GenTree::isSveEmbeddedMaskingCompatibleHWIntrinsic() which likely looks up a flag in the hwintrinsiclistarm64.h table to see if that particular intrinsic supports embedded masking/predication.

There are then a handful of intrinsics which require masking. For example, SVE comparison intrinsics may always return a TYP_MASK, in which case you could either add a new entry to the table such as HW_Flag_ReturnsSveMask or explicitly handle it like xarch does here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L3985-L3999

There are then a handful of intrinsics which require mask inputs and which aren't recognized via pattern matching. You would likewise add a flag or manually handle the few of them like this: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L3970-L3983

The insertion of the ConvertVectorToMask and ConvertMaskToVector intrinsics is important since the user may have passed in something of the incorrect type. For example, it might've been a mask of bytes where we needed a mask of ints, or it might've been an actual vector where we needed a mask, and vice versa. Likewise, it ensures we don't need to check the type on every other intrinsic that does properly take a vector.

We then make this efficient in morph (see https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L10775-L10827) where we ensure that we aren't unnecessarily converting from mask to vector and back to mask, or vice versa. This allows things that take a mask to consume a produced mask directly and gives the optimal codegen expected in most scenarios.
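The conversion-folding step described here can be sketched abstractly. The following is an illustrative Python model, not the JIT's C++ implementation: the node names mirror the ConvertMaskToVector/ConvertVectorToMask intrinsics discussed above, while the Node class and fold_conversions function are hypothetical simplifications.

```python
# Illustrative sketch of folding redundant mask/vector conversions,
# modeled on the morph transformation described above. Node names
# mirror the JIT intrinsics; everything else is simplified.

class Node:
    def __init__(self, op, operand=None):
        self.op = op            # e.g. "ConvertMaskToVector", "LocalMask"
        self.operand = operand  # child node, if any

def fold_conversions(node):
    """Collapse ConvertVectorToMask(ConvertMaskToVector(x)) -> x
    (and the vector->mask->vector round trip), so something that
    takes a mask can consume a produced mask directly."""
    if node is None:
        return None
    node.operand = fold_conversions(node.operand)
    inverse = {"ConvertVectorToMask": "ConvertMaskToVector",
               "ConvertMaskToVector": "ConvertVectorToMask"}
    if (node.operand is not None
            and inverse.get(node.op) == node.operand.op):
        return node.operand.operand  # drop both conversions
    return node

# A mask that was converted to a vector and back to a mask:
tree = Node("ConvertVectorToMask",
            Node("ConvertMaskToVector", Node("LocalMask")))
folded = fold_conversions(tree)
```

In the real JIT this folding is only valid when element types line up; the sketch ignores that to show just the round-trip elimination.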

Member:

That was the comment around

We are notably missing and need to add a bit which handles the case where we have LCL_VAR TYP_SIMD = TYP_MASK because that can currently block the ability to consume a mask directly if it's multi-use. We ideally would have it stored as LCL_VAR TYP_MASK instead (even if the use manually hoisted as a Vector in C#/IL) and then have the things consume it as ConvertMaskToVector(LCL_VAR) if they actually needed a vector.

This shouldn't be overly complex to add, however, it's just not been done as of yet.

a74nh (Contributor, Author):

Right. That feels like it might touch quite a few files. Given the size of this PR, do you think it's worth keeping this PR as-is, and then putting the LCL_VAR TYP_MASK work in a follow-on, along with the lowering code?

Member:

then putting the LCL_VAR TYP_MASK in a follow on

Yes, I think this would even be the preferred route, given it's not required and is really its own isolated change.

along with the lowering code?

Which lowering code is this?


In general I think it's fine for this PR to be the basic plumbing of TYP_MASK support into the Arm64 side of the JIT. As long as TrueMask and LoadVector are minimally working as expected, I think we're golden, and we can extend that to other operations and enable optimizations separately. That is exactly what we did for xarch to help with review and scoping.

a74nh (Contributor, Author):

Which lowering code is this?

I added some code to remove the mask->vector->mask and vector->mask->vector conversions. But nothing in this PR uses it because of the lcl var issue, so I decided not to push it.

Will mark this as ready now.

a74nh (Contributor, Author):

Will mark this as ready now.

... but not quite yet, as I need #99049 to merge so I can remove it from this PR.

Resolved review threads: src/coreclr/jit/hwintrinsic.h, src/coreclr/jit/hwintrinsiclistarm64sve.h (outdated), src/coreclr/jit/hwintrinsiccodegenarm64.cpp (outdated)
case INS_sve_ld1h:
case INS_sve_ld1w:
case INS_sve_ld1d:
return emitIns_R_R_R_I(ins, size, reg1, reg2, reg3, 0, opt);
Member:

is there any reason why we can't call emitIns_R_R_R_I() directly from the caller?

a74nh (Contributor, Author):

There are two ways to do this. Without any special coding in the table, it'll just automatically use the R_R_R() version, because that's how many args there are in the intrinsic.

I see that elsewhere in NEON etc., this is how it's already done.

Alternatively, we could add an extra flag, something like HW_Flag_extra_zero_arg, or just hardcode it via the HW_Flag_specialcodegen route. That feels like a lot of extra code.

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Feb 9, 2024
a74nh (Contributor, Author) commented Feb 21, 2024

New version pushed:

  • Uses latest version of the API
  • My dummy tests have been removed. These were sometimes failing due to incorrect usage of GC pinning.
  • Now uses template tests. These pass on real SVE hardware.
  • Reflection test disabled for now. It requires mask register allocation before it will work, as the reflection lookup causes the mask variable to be stored to memory.
  • No fixes due to review comments yet
BEGIN EXECUTION
/home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false -p System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true HardwareIntrinsics_Arm_ro.dll ''
11:35:25.798 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_float()
Supported ISAs:
  AdvSimd:   True
  Aes:       True
  ArmBase:   True
  Crc32:     True
  Dp:        True
  Rdm:       True
  Sha1:      True
  Sha256:    True
  Sve:       True

Beginning scenario: RunBasicScenario_Load
11:35:25.891 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_float()
11:35:25.905 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_double()
Beginning scenario: RunBasicScenario_Load
11:35:25.914 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_double()
11:35:25.916 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_sbyte()
Beginning scenario: RunBasicScenario_Load
11:35:25.922 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_sbyte()
11:35:25.923 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_short()
Beginning scenario: RunBasicScenario_Load
11:35:25.929 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_short()
11:35:25.931 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_int()
Beginning scenario: RunBasicScenario_Load
11:35:25.937 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_int()
11:35:25.938 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_long()
Beginning scenario: RunBasicScenario_Load
11:35:25.945 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_long()
11:35:25.947 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_byte()
Beginning scenario: RunBasicScenario_Load
11:35:25.953 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_byte()
11:35:25.955 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_ushort()
Beginning scenario: RunBasicScenario_Load
11:35:25.961 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_ushort()
11:35:25.962 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_uint()
Beginning scenario: RunBasicScenario_Load
11:35:25.968 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_uint()
11:35:25.970 Running test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_ulong()
Beginning scenario: RunBasicScenario_Load
11:35:25.976 Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveLoadVector_ulong()
11:35:25.978 Running test: JIT/HardwareIntrinsics/Arm/ArmBase/Yield_ro/Yield_ro.dll
11:35:25.979 Passed test: JIT/HardwareIntrinsics/Arm/ArmBase/Yield_ro/Yield_ro.dll
Expected: 100
Actual: 100
END EXECUTION - PASSED

ryujit-bot commented:

Diff results for #98218

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

Overall (-0.01% to -0.00%)
Collection PDIFF
coreclr_tests.run.linux.arm64.checked.mch -0.01%
MinOpts (-0.01% to -0.00%)
Collection PDIFF
coreclr_tests.run.linux.arm64.checked.mch -0.01%
libraries.crossgen2.linux.arm64.checked.mch -0.01%
libraries.pmi.linux.arm64.checked.mch -0.01%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch -0.01%

Throughput diffs for osx/arm64 ran on windows/x64

Overall (-0.01% to -0.00%)
Collection PDIFF
coreclr_tests.run.osx.arm64.checked.mch -0.01%
MinOpts (-0.01% to +0.00%)
Collection PDIFF
coreclr_tests.run.osx.arm64.checked.mch -0.01%
libraries.crossgen2.osx.arm64.checked.mch -0.01%
libraries_tests_no_tiered_compilation.run.osx.arm64.Release.mch -0.01%

Throughput diffs for windows/arm64 ran on windows/x64

Overall (-0.01% to -0.00%)
Collection PDIFF
coreclr_tests.run.windows.arm64.checked.mch -0.01%
MinOpts (-0.01% to +0.00%)
Collection PDIFF
coreclr_tests.run.windows.arm64.checked.mch -0.01%
libraries.crossgen2.windows.arm64.checked.mch -0.01%
libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch -0.01%

Details here


case NI_Sve_CreateTrueMaskUInt16:
case NI_Sve_CreateTrueMaskUInt32:
case NI_Sve_CreateTrueMaskUInt64:
needBranchTargetReg = !intrin.op1->isContainedIntOrIImmed();
Member:

this creates internal register for "def". Make sure that we create an internal register for "use" as well. I forgot to do that in one place and fixed it in #98814.

a74nh (Contributor, Author):

this creates internal register for "def". Make sure that we create an internal register for "use" as well. I forgot to do that in one place and fixed it in #98814.

I think this is ok as is. The code will use all the generic functionality and work down to the buildInternalRegisterUses() call at the end of the function.

switch (intrin.id)
{
case NI_Sve_LoadVector:
srcCandidates = RBM_LOWMASK;
kunalspathak (Member) commented Feb 22, 2024:

Is RBM_LOWMASK true for all variants of ld1*, or are there some which could operate on a higher mask register? I am wondering how we can make this easy for the development of other APIs, so that developers don't have to think about which candidates to set for a given intrinsic.

a74nh (Contributor, Author):

Is RBM_LOWMASK true for all variants of ld1*, or are there some which could operate on a higher mask register?

Yes, all ld1* should be the same. We should be able to pull this information automatically.

I've pushed an update which adds HW_Flag_LowMaskedOperation instead of the switch. I'm fairly keen on pushing as much as we can into the table, as it reduces the number of files touched each time an API is added. But I'm a little concerned we'll run out of space for flags; at that point, we can either get creative with flag reuse or turn some flags back into a switch.
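The table-driven idea can be sketched like this. It is purely illustrative Python, not the JIT's code: the real table is the C++ macro table in hwintrinsiclistarm64sve.h, HW_Flag_LowMaskedOperation and RBM_LOWMASK follow the PR's naming, and the lookup function and register-set strings are hypothetical stand-ins.

```python
# Illustrative flag table: each intrinsic maps to a set of property
# flags, and register candidates are derived from the flags instead of
# a per-intrinsic switch statement.
HW_Flag_LowMaskedOperation = 0x1  # governing predicate must be in p0-p7

# Simplified stand-in for the hwintrinsiclistarm64sve.h table.
intrinsic_flags = {
    "NI_Sve_LoadVector": HW_Flag_LowMaskedOperation,
    "NI_Sve_CreateTrueMaskUInt64": 0,
}

RBM_LOWMASK = "p0-p7"    # placeholder for the low predicate register set
RBM_ALLMASK = "p0-p15"   # placeholder for the full predicate register set

def mask_candidates(intrinsic):
    """Derive predicate register candidates from the flag table."""
    flags = intrinsic_flags.get(intrinsic, 0)
    return RBM_LOWMASK if flags & HW_Flag_LowMaskedOperation else RBM_ALLMASK
```

With this shape, adding a new ld1* API only needs a table entry with the right flags; the register-allocation code itself stays untouched.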

a74nh (Contributor, Author) commented Feb 29, 2024

The updated version now produces the moves to/from predicates.

Test case dump with 4 annotations:

*************** After end code gen, before unwindEmit()
G_M22300_IG01:        ; func=00, offs=0x000000, size=0x001C, bbWeight=1, PerfScore 6.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0044: 000000      stp     fp, lr, [sp, #-0x50]!
IN0045: 000004      mov     fp, sp
IN0046: 000008      str     xzr, [fp, #0x30]	// [V01 loc0]
IN0047: 00000C      str     xzr, [fp, #0x38]	// [V01 loc0+0x08]
IN0048: 000010      str     xzr, [fp, #0x20]	// [V02 loc1]
IN0049: 000014      str     xzr, [fp, #0x28]	// [V02 loc1+0x08]
IN004a: 000018      str     x0, [fp, #0x48]	// [V00 this]

G_M22300_IG02:        ; offs=0x00001C, size=0x010C, bbWeight=1, PerfScore 98.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref

IN0001: 00001C      movz    x0, #472
IN0002: 000020      movk    x0, #0xA0B4 LSL #16
IN0003: 000024      movk    x0, #0xFFFF LSL #32
IN0004: 000028      movz    x1, #0xE4C0      // code for TestLibrary.TestFramework:BeginScenario(System.String)
IN0005: 00002C      movk    x1, #0x5631 LSL #16
IN0006: 000030      movk    x1, #0xFFFF LSL #32
IN0007: 000034      ldr     x1, [x1]
IN0008: 000038      blr     x1
IN0009: 00003C      ptrue   p7.d                            - CREATE TRUE MASK
IN000a: 000040      mov     z16.d, p7/z, #1           -  MOVE MASK IN PREDICATE TO Z
IN000b: 000044      str     q16, [fp, #0x30]	// [V01 loc0]
IN000c: 000048      ldr     x0, [fp, #0x48]	// [V00 this]
IN000d: 00004C      ldrsb   wzr, [x0]
IN000e: 000050      ldr     x0, [fp, #0x48]	// [V00 this]
IN000f: 000054      add     x0, x0, #16
IN0010: 000058      movz    x1, #0xE2E0      // code for JIT.HardwareIntrinsics.Arm._Sve.LoadUnaryOpTest__SveLoadVector_ulong+DataTable:get_inArray1Ptr():ulong:this
IN0011: 00005C      movk    x1, #0x5631 LSL #16
IN0012: 000060      movk    x1, #0xFFFF LSL #32
IN0013: 000064      ldr     x1, [x1]
IN0014: 000068      blr     x1
IN0015: 00006C      ldr     q16, [fp, #0x30]	// [V01 loc0]
IN0016: 000070      ptrue   p7.d                                           - CREATE EMBEDDED MASK FOR PREDICATE MOVE
IN0017: 000074      cmpne   p7.d, p7/z, z16.d, #0             - MOVE MASK FROM Z TO PREDICATE
IN0018: 000078      ld1d    { z16.d }, p7/z, [x0]
IN0019: 00007C      str     q16, [fp, #0x20]	// [V02 loc1]
IN001a: 000080      ldr     x0, [fp, #0x48]	// [V00 this]
IN001b: 000084      ldrsb   wzr, [x0]
IN001c: 000088      ldr     x0, [fp, #0x48]	// [V00 this]
IN001d: 00008C      add     x0, x0, #16
IN001e: 000090      movz    x1, #0xE2F8      // code for JIT.HardwareIntrinsics.Arm._Sve.LoadUnaryOpTest__SveLoadVector_ulong+DataTable:get_outArrayPtr():ulong:this
IN001f: 000094      movk    x1, #0x5631 LSL #16
IN0020: 000098      movk    x1, #0xFFFF LSL #32
IN0021: 00009C      ldr     x1, [x1]
IN0022: 0000A0      blr     x1
IN0023: 0000A4      ldr     q16, [fp, #0x20]	// [V02 loc1]
IN0024: 0000A8      str     q16, [x0]
IN0025: 0000AC      ldr     x0, [fp, #0x48]	// [V00 this]
IN0026: 0000B0      ldrsb   wzr, [x0]
IN0027: 0000B4      ldr     x0, [fp, #0x48]	// [V00 this]
IN0028: 0000B8      add     x0, x0, #16
IN0029: 0000BC      movz    x1, #0xE2E0      // code for JIT.HardwareIntrinsics.Arm._Sve.LoadUnaryOpTest__SveLoadVector_ulong+DataTable:get_inArray1Ptr():ulong:this
IN002a: 0000C0      movk    x1, #0x5631 LSL #16
IN002b: 0000C4      movk    x1, #0xFFFF LSL #32
IN002c: 0000C8      ldr     x1, [x1]
IN002d: 0000CC      blr     x1
IN002e: 0000D0      str     x0, [fp, #0x18]	// [V04 tmp1]
IN002f: 0000D4      ldr     x0, [fp, #0x48]	// [V00 this]
IN0030: 0000D8      ldrsb   wzr, [x0]
IN0031: 0000DC      ldr     x0, [fp, #0x48]	// [V00 this]
IN0032: 0000E0      add     x0, x0, #16
IN0033: 0000E4      movz    x1, #0xE2F8      // code for JIT.HardwareIntrinsics.Arm._Sve.LoadUnaryOpTest__SveLoadVector_ulong+DataTable:get_outArrayPtr():ulong:this
IN0034: 0000E8      movk    x1, #0x5631 LSL #16
IN0035: 0000EC      movk    x1, #0xFFFF LSL #32
IN0036: 0000F0      ldr     x1, [x1]
IN0037: 0000F4      blr     x1
IN0038: 0000F8      str     x0, [fp, #0x10]	// [V05 tmp2]
IN0039: 0000FC      ldr     x2, [fp, #0x10]	// [V05 tmp2]
IN003a: 000100      ldr     x1, [fp, #0x18]	// [V04 tmp1]
IN003b: 000104      ldr     x0, [fp, #0x48]	// [V00 this]
IN003c: 000108      movz    x3, #472
IN003d: 00010C      movk    x3, #0xA0B4 LSL #16
IN003e: 000110      movk    x3, #0xFFFF LSL #32
IN003f: 000114      movz    x4, #0xE3E8      // code for JIT.HardwareIntrinsics.Arm._Sve.LoadUnaryOpTest__SveLoadVector_ulong:ValidateResult(ulong,ulong,System.String):this
IN0040: 000118      movk    x4, #0x5631 LSL #16
IN0041: 00011C      movk    x4, #0xFFFF LSL #32
IN0042: 000120      ldr     x4, [x4]
IN0043: 000124      blr     x4

G_M22300_IG03:        ; offs=0x000128, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN004b: 000128      ldp     fp, lr, [sp], #0x50
IN004c: 00012C      ret     lr

Next I need to add the lowering code which spots mask moves and removes them.

@a74nh a74nh changed the title JIT ARM64-SVE: Add TrueMask JIT ARM64-SVE: Add TrueMask and LoadVector Feb 29, 2024
@a74nh a74nh marked this pull request as ready for review March 4, 2024 17:29
@a74nh a74nh marked this pull request as draft March 4, 2024 17:30
kunalspathak (Member) left a comment:

LGTM. Thanks!

@kunalspathak kunalspathak merged commit 17eb59c into dotnet:main Mar 12, 2024
170 of 191 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2024
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
4 participants