
ARM64: Fix lsra for AdvSimd_LoadAndInsertScalar #107786

Merged: 9 commits into dotnet:main, Sep 26, 2024

Conversation

Contributor

@a74nh commented Sep 13, 2024

Spotted during implementation of #107459

The single-register version of AdvSimd LoadAndInsertScalar is:
Vector128<X> LoadAndInsertScalar(Vector128<X> value, byte index, X* address);
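
For context, the instruction is read-modify-write: every lane except index flows from value into the result, which is why op1 (not the address) is the RMW operand. A minimal semantics sketch in C++, using a hypothetical Vec128 stand-in type (illustration only, neither JIT nor API code):

    #include <cstdint>

    // Hypothetical stand-in for Vector128<X>, for illustration only.
    template <typename X>
    struct Vec128
    {
        X lane[16 / sizeof(X)];
    };

    // The result starts as a copy of `value`, so the destination register
    // must already hold op1 when the load happens; op3 is only ever read
    // through as an address. This is the RMW relationship LSRA has to model.
    template <typename X>
    Vec128<X> LoadAndInsertScalarSketch(Vec128<X> value, uint8_t index, const X* address)
    {
        Vec128<X> result = value;       // lanes other than `index` come from op1
        result.lane[index] = *address;  // only the selected lane is loaded
        return result;
    }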

Currently, the code builds op1 as an address and op3 as an RMW node:

BuildAddrUses(intrin.op1);
BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);
BuildDelayFreeUses(intrin.op3, (tgtPrefOp2 ? intrin.op2 : intrin.op1), candidates);

Instead, it should be:

tgtPrefUse = BuildUse(intrin.op1);
BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);
BuildAddrUses(intrin.op3);
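
Annotated, the corrected sequence looks roughly like this (a sketch against the shape of BuildHWIntrinsic in lsraarm64.cpp; the srcCount bookkeeping is an assumption, not copied from the diff):

    // op1 is the vector whose remaining lanes flow into the result, so it
    // is the RMW source and receives the target preference.
    tgtPrefUse = BuildUse(intrin.op1);

    // op2 (the lane index) must stay live until the instruction executes,
    // so it is delay-freed against the RMW operand.
    srcCount += BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);

    // op3 is the pointer being loaded through, so it is built as an
    // address use rather than as an RMW operand.
    srcCount += BuildAddrUses(intrin.op3);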

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label Sep 13, 2024
@dotnet-policy-service bot added the community-contribution label Sep 13, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@a74nh marked this pull request as ready for review September 13, 2024 11:09
Contributor Author

@a74nh commented Sep 13, 2024

Checked in a debugger that the right functions are now called.

@dotnet/arm64-contrib @kunalspathak

@@ -3846,9 +3846,23 @@ int LinearScan::BuildDelayFreeUses(GenTree* node,
return 0;
}
}

Contributor Author

@kunalspathak: As suggested offline, I added some additional checks in the delay-slot logic. If the register type of the node does not match that of the delay-slot register, then do not add the delay-free use. This fixes the case where we call BuildDelayFreeUses() for op2, which is an integer value.

Instead, we could do these checks in BuildHWIntrinsic() so that BuildDelayFreeUses() is never called, but then BuildDelayFreeUses() would still need the same checks turned into asserts. So it seemed simpler to do it inside.

This will work nicely with the BuildHWIntrinsic() rewrite PR too.
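
The shape of that check, as a minimal sketch against the existing BuildDelayFreeUses entry point (the committed diff appears later in this thread and adds a multi-reg carve-out on top of this):

    // Sketch only: skip the delay-free bookkeeping when the use and the RMW
    // def can never share a register file (e.g. an integer lane index vs. a
    // vector destination), and build an ordinary use instead.
    if ((rmwNode != nullptr) && !varTypeUsesSameRegType(rmwNode->TypeGet(), node->TypeGet()))
    {
        return BuildOperandUses(node, candidates);
    }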

Member

Ideally this should be an assert, and the callers should not be calling delay-free uses for a different register type. Did you turn it into an assert and see how many places are hit?

Contributor Author

Difficult to test with an assert because: 1) all the hwintrinsic tests will fail at the first assert; 2) many of these issues arise with ConstantExpected APIs, and for a lot of those we are only testing with hardcoded constants, meaning the assert will never be hit.

Scanning the SVE API, all the methods we have that are RMW and have a ConstantExpected are:

AddRotateComplex
DotProductBySelectedScalar
ExtractVector
FusedMultiplyAddBySelectedScalar
FusedMultiplySubtractBySelectedScalar
MultiplyAddRotateComplex
MultiplyAddRotateComplexBySelectedScalar
MultiplyBySelectedScalar
SaturatingDecrementBy16BitElementCount
SaturatingDecrementBy32BitElementCount
SaturatingDecrementBy64BitElementCount
SaturatingDecrementBy8BitElementCount
SaturatingDecrementByActiveElementCount
SaturatingIncrementBy16BitElementCount
SaturatingIncrementBy32BitElementCount
SaturatingIncrementBy64BitElementCount
SaturatingIncrementBy8BitElementCount
SaturatingIncrementByActiveElementCount
ShiftRightArithmeticForDivide
TrigonometricMultiplyAddCoefficient

Then there are the AdvSimd ones, and possibly others.

Contributor Author

This is much easier to fix properly in the lsra rewrite PR, as the check can just be added to the main for loop in BuildHWIntrinsic():

        else if (delayFreeOp != nullptr && TheExtraChecksFromThisPR.....)
        {
            srcCount += BuildDelayFreeUses(operand, delayFreeOp, candidates);
        }
        else
        {
            srcCount += BuildOperandUses(operand, candidates);
        }

@kunalspathak (Member)

/azp run Fuzzlyn


Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak (Member) left a comment

Can you double-check the complete diff for this or similar methods that have zero-size diffs? Wondering why previously we decided to have a copy of x0 in both v0 and v17, but now we don't need it?

[screenshot: the zero-size assembly diff under discussion]

@@ -161,6 +165,7 @@ namespace JIT.HardwareIntrinsics.Arm
for (var i = 0; i < Op1ElementCount; i++) { _data1[i] = {NextValueOp1}; }
Unsafe.CopyBlockUnaligned(ref Unsafe.As<{Op1VectorType}<{Op1BaseType}>, byte>(ref _fld1), ref Unsafe.As<{Op1BaseType}, byte>(ref _data1[0]), (uint)Unsafe.SizeOf<{Op1VectorType}<{Op1BaseType}>>());

_fld2 = {ElementIndex};
Member

A lot of the tests that take an immediate value are missing this coverage. We should fix it some day.

Contributor Author

Added an issue: #108060

src/coreclr/jit/lsrabuild.cpp
// Multi-register nodes should not go via this route.
assert(!node->IsMultiRegNode());
if (rmwNode == nullptr || varTypeUsesSameRegType(rmwNode->TypeGet(), node->TypeGet()) ||
(rmwNode->IsMultiRegNode() && varTypeUsesFloatReg(node->TypeGet())))
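// Reading the condition: a delay-free use is needed only when `node` could
// end up in the same register file as the RMW def. A multi-reg RMW def
// always lives in vector registers, so a float/vector-typed use still gets
// the delay-free treatment, while an integer-typed use (e.g. a lane index)
// does not.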
Member

I did not understand the && varTypeUsesFloatReg(node->TypeGet()) here. Did you see code where having delay-free is OK for multi-reg nodes that are GPRs (if there is such a thing)?

Contributor Author

If the RMW node is a multi-register node, then it will always be in vector registers.

The normal node could be in a general register or a vector register. We want to discount the case where the node is in a general register, so varTypeUsesFloatReg() confirms that the node uses an FP, vector, or SVE vector register.

E.g.:

(Vector128<sbyte> Value1, Vector128<sbyte> Value2) LoadAndInsertScalar((Vector128<sbyte>, Vector128<sbyte>) values, [ConstantExpected(Max = (byte)(15))] byte index, sbyte* address)

The RMW node is the multi-register values tuple.
This code makes sure index is not delay-slotted.
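
To make that concrete, here is a toy model of the committed condition evaluated for this overload (hand-assigned booleans, purely illustrative, not JIT code):

    #include <cstdio>

    int main()
    {
        bool rmwIsNull        = false; // rmwNode is the multi-reg `values`
        bool sameRegType      = false; // vector def vs. integer `index`
        bool rmwIsMultiReg    = true;  // tuple return -> multi-reg node
        bool nodeUsesFloatReg = false; // `index` lives in a general register

        // Mirrors: rmwNode == nullptr || varTypeUsesSameRegType(..) ||
        //          (rmwNode->IsMultiRegNode() && varTypeUsesFloatReg(..))
        bool buildDelayFree = rmwIsNull || sameRegType || (rmwIsMultiReg && nodeUsesFloatReg);
        printf("delay-free use for index: %s\n", buildDelayFree ? "yes" : "no"); // prints: no
        return 0;
    }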

Member

I am wondering if we should have an assert that if it is a multi-reg node, then the node should use a float register?

Contributor Author

Added some asserts.

That then broke things because for delayFreeMultiple intrinsics we do:

BuildDelayFreeUses(use.GetNode(), intrinsicTree);

This passes the entire intrinsicTree down as the RMW node, which feels wrong. Fixed that to pass op1 instead.
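
A before/after sketch of that call site (assuming the surrounding loop over the intrinsic's uses; the srcCount bookkeeping is an assumption from the usual pattern):

    // Before: the whole HWIntrinsic tree was passed as the RMW node, which
    // the new multi-reg asserts flagged, since the tree is not the RMW def.
    srcCount += BuildDelayFreeUses(use.GetNode(), intrinsicTree);

    // After (sketch): op1 holds the actual RMW source register(s).
    srcCount += BuildDelayFreeUses(use.GetNode(), intrin.op1);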

@kunalspathak (Member)

Wondering why previously we decided to have a copy of x0 in both v0 and v17, but now we don't need it?

Other than this and the comment in #107786 (comment), we should be good.

Contributor Author

@a74nh commented Sep 23, 2024

Can you double-check the complete diff for this or similar methods that have zero-size diffs? Wondering why previously we decided to have a copy of x0 in both v0 and v17, but now we don't need it?

#37@69 is the function return value, #35@68 is the def of the INS

In base:

 Interval 17: simd16 (interfering uses) RefPositions {#35@68 #37@69} physReg:NA Preferences=[d0] Aversions=[]

 <RefPosition #33  @67  RefTypeUse <Ivl:13> BB02 regmask=[allFloat] minReg=1 last wt=100.00>
 <RefPosition #34  @67  RefTypeUse <Ivl:16> BB02 regmask=[x0-xip0 x19-x28] minReg=1 last delay wt=100.00>
 <RefPosition #35  @68  RefTypeDef <Ivl:17> HWINTRINSIC BB02 regmask=[allFloat] minReg=1 wt=400.00>
 <RefPosition #36  @69  RefTypeFixedReg <Reg:d0 > BB02 regmask=[d0] minReg=1 wt=100.00>
 <RefPosition #37  @69  RefTypeUse <Ivl:17> BB02 regmask=[d0] minReg=1 last fixed wt=100.00>

Because of this conflict, it allocates the def of the INS to d17 and then plants an extra MOV to move from the def to the function return register.

In diff, the conflict no longer exists:

 Interval 17: simd16 RefPositions {#35@68 #37@69} physReg:NA Preferences=[d0] Aversions=[] 

<RefPosition #33  @67  RefTypeUse <Ivl:13> BB02 regmask=[allFloat] minReg=1 last wt=100.00>
 <RefPosition #34  @67  RefTypeUse <Ivl:16> BB02 regmask=[x0-xip0 x19-x28] minReg=1 last wt=100.00>
 <RefPosition #35  @68  RefTypeDef <Ivl:17> HWINTRINSIC BB02 regmask=[d0] minReg=1 wt=400.00>
 <RefPosition #36  @69  RefTypeFixedReg <Reg:d0 > BB02 regmask=[d0] minReg=1 wt=100.00>
 <RefPosition #37  @69  RefTypeUse <Ivl:17> BB02 regmask=[d0] minReg=1 last fixed wt=100.00>

Which means it is able to allocate d0 for the def of INS. At codegen, it has to plant the MOV from d17 to d0 before generating the INS.

full log for base

full log for diff

@kunalspathak (Member)

@a74nh - seems the unit tests are failing. Can you take a look?

Contributor Author

@a74nh commented Sep 24, 2024

@a74nh - seems the unit tests are failing. Can you take a look?

I've restored the change to the call to BuildDelayFreeUses() and removed an assert. Seems the safest option for now. The rewrite PR will fix it better anyway.

@kunalspathak (Member) left a comment

LGTM

@kunalspathak merged commit d22c74d into dotnet:main Sep 26, 2024
117 checks passed
@a74nh deleted the lsra_insertscalar_github branch September 26, 2024 13:50
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
* ARM64: Fix lsra for AdvSimd_LoadAndInsertScalar

* Fix comment

* Expand LoadAndInsertScalar testing

* Only delayfree when register types match

* fix formatting

* Fix comment

* Check that multiregister nodes use fp registers

* Revert rmw assert