
ARM64-SVE: Implement IF_SVE_DW_2A, IF_SVE_DW_2B, IF_SVE_EB_1B #97800

Merged: 3 commits into dotnet:main from sve-pext, Feb 2, 2024

Conversation

amanasifkhalid (Member)

Part of #94549. Note that the preferred disassembly for FMOV <Zd>.<T>, #0.0 is MOV <Zd>.<T>, #0, per the SVE docs: FMOV is a pseudo-instruction of DUP, which in turn is aliased by MOV.

cstool output:

00C03825  mov   z0.b, #0
01C07825  mov   z1.h, #0
02C0B825  mov   z2.s, #0
03C0F825  mov   z3.d, #0
10702025  pext  p0.b, pn8[0]
31716025  pext  p1.h, pn9[1]
5272A025  pext  p2.s, pn0xA[2]
7373E025  pext  p3.d, pn0xB[3]
98742025  pext  { p8.b, p9.b }, pn0xC[0]
B9756025  pext  { p9.h, p10.h }, pn0xD[1]
DA74A025  pext  { p10.s, p11.s }, pn0xE[0]
FB75E025  pext  { p11.d, p12.d }, pn0xF[1]

JitDisasm output:

mov     z0.b, #0
mov     z1.h, #0
mov     z2.s, #0
mov     z3.d, #0
pext    p0.b, pn8[0]
pext    p1.h, pn9[1]
pext    p2.s, pn10[2]
pext    p3.d, pn11[3]
pext    { p8.b, p9.b }, pn12[0]
pext    { p9.h, p10.h }, pn13[1]
pext    { p10.s, p11.s }, pn14[0]
pext    { p11.d, p12.d }, pn15[1]
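To illustrate the aliasing described above, here is a minimal, self-contained C++ sketch (not the actual emitter code; all names are hypothetical) of how an instruction recorded as FMOV with a #0.0 immediate can be displayed under its preferred MOV alias:

```cpp
// Minimal sketch (not JIT code). FMOV <Zd>.<T>, #0.0 is a pseudo-instruction
// of DUP <Zd>.<T>, #0, and DUP-with-immediate is in turn aliased by MOV, so
// the preferred disassembly is "mov z<n>.<T>, #0".
#include <iostream>
#include <string>

enum class SveIns { Fmov, Dup, Mov }; // hypothetical instruction ids

std::string displayMnemonic(SveIns ins, double imm)
{
    // The encoder can keep the instruction recorded as FMOV; only the
    // printed mnemonic changes for the zero-immediate case.
    if (ins == SveIns::Fmov && imm == 0.0)
    {
        return "mov";
    }
    switch (ins)
    {
        case SveIns::Fmov: return "fmov";
        case SveIns::Dup:  return "dup";
        default:           return "mov";
    }
}

int main()
{
    std::cout << displayMnemonic(SveIns::Fmov, 0.0) << " z0.b, #0\n"; // "mov z0.b, #0"
}
```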

cc @dotnet/arm64-contrib

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Feb 1, 2024
@ghost assigned amanasifkhalid on Feb 1, 2024
@ghost commented on Feb 1, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@kunalspathak added the arm-sve label (work related to arm64 SVE/SVE2 support) on Feb 1, 2024
@ryujit-bot

Diff results for #97800

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

Overall (-0.00% to +0.01%)
Collection PDIFF
coreclr_tests.run.linux.arm64.checked.mch +0.01%
benchmarks.run_tiered.linux.arm64.checked.mch +0.01%
libraries_tests.run.linux.arm64.Release.mch +0.01%
MinOpts (+0.00% to +0.04%)
Collection PDIFF
coreclr_tests.run.linux.arm64.checked.mch +0.02%
benchmarks.run_tiered.linux.arm64.checked.mch +0.02%
libraries_tests.run.linux.arm64.Release.mch +0.02%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.02%
libraries.pmi.linux.arm64.checked.mch +0.01%
benchmarks.run_pgo.linux.arm64.checked.mch +0.02%
benchmarks.run.linux.arm64.checked.mch +0.02%
realworld.run.linux.arm64.checked.mch +0.04%

Details here


@a74nh (Contributor) left a comment:

LGTM

@@ -18300,6 +18377,14 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
break;
}

case IF_SVE_EB_1B: // ........xx...... ...........ddddd -- SVE broadcast integer immediate (unpredicated)
// ins is MOV for this encoding, as it is the preferred disassembly, so pass FMOV to emitInsCodeSve
Contributor:

This is annoying, and slightly different from the aliasing in other instructions.

Alternatively, keep it as FMOV in emitIns_R() and then convert to MOV in emitDispInsHelp().

I'm not sure either way is better. Happy to keep it as is here.

amanasifkhalid (Member, Author):

Alternatively, keep it as FMOV in emitIns_R() and then convert to MOV in emitDispInsHelp().

I'd like to keep the instruction as FMOV in emitIns_R, but to print it as a MOV in emitDispInsHelp, we would have to add a check for this case at the beginning of the method so we don't print the instruction before modifying it. My current approach is hacky, but it confines this behavior to the relevant switch case.
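As a rough illustration of the display-time alternative being discussed, here is a hypothetical simplification (the instruction record and helper below are stand-ins, not the real emitDispInsHelp):

```cpp
// Hypothetical sketch: keep FMOV in the emitted instruction record and
// rewrite the mnemonic at the very start of the display helper, before
// any text for the instruction has been printed.
#include <iostream>

enum class Ins { Fmov, Mov };
enum class Fmt { SVE_EB_1B, Other };

struct InstrRec // stand-in for the real instruction descriptor
{
    Ins ins;
    Fmt fmt;
    int imm;
};

void dispIns(InstrRec id)
{
    // The check must come first; once part of the line is printed under
    // the wrong mnemonic it is too late to swap it.
    if (id.ins == Ins::Fmov && id.fmt == Fmt::SVE_EB_1B && id.imm == 0)
    {
        id.ins = Ins::Mov;
    }
    std::cout << (id.ins == Ins::Mov ? "mov" : "fmov") << "     z0.b, #" << id.imm << "\n";
}

int main()
{
    dispIns({Ins::Fmov, Fmt::SVE_EB_1B, 0}); // prints "mov     z0.b, #0"
}
```

Either placement works; the trade-off is an extra check on every displayed instruction versus keeping the special case next to the one encoding it affects.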

Contributor:

This is fine by me. The comment describes why, so it's all good.

@amanasifkhalid (Member, Author)

@kunalspathak this change does have some TP impact. I suspect it comes from adding the insScalableOpts parameter to emitIns_R_R_I. Just pointing this out in case we're looking for reasons to split the SVE instructions out into separate emitIns methods.

@ryujit-bot

Diff results for #97800

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Details here


@ryujit-bot

Diff results for #97800

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

Overall (-0.00% to +0.01%)
Collection PDIFF
libraries_tests.run.linux.arm64.Release.mch +0.01%
benchmarks.run_tiered.linux.arm64.checked.mch +0.01%
coreclr_tests.run.linux.arm64.checked.mch +0.01%
MinOpts (+0.00% to +0.04%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch +0.02%
libraries_tests.run.linux.arm64.Release.mch +0.02%
realworld.run.linux.arm64.checked.mch +0.04%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.02%
benchmarks.run_tiered.linux.arm64.checked.mch +0.02%
libraries.pmi.linux.arm64.checked.mch +0.01%
benchmarks.run_pgo.linux.arm64.checked.mch +0.02%
coreclr_tests.run.linux.arm64.checked.mch +0.02%

Details here


@TIHan (Contributor) left a comment:

LGTM.

I wouldn't worry too much about the TP regressions as I think I solved a chunk of it in #97739 by separating the SVE format cases in emitOutputInstr into their own function.

We could also separate out the SVE instruction cases in the emitIns functions into their own emitInsSve, which could further help mitigate any TP regressions, but we don't have to do that now.
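A hedged sketch of the kind of split being suggested; the function and format names below are illustrative only, not the actual emitter entry points:

```cpp
// Route SVE formats to a dedicated handler so the base (non-SVE) switch
// stays small; the SVE-only cases live in their own function.
#include <cstddef>
#include <iostream>

enum class InsFormat { DI_1A, DR_2A, SVE_DW_2A, SVE_EB_1B }; // illustrative subset

bool isSveFormat(InsFormat fmt)
{
    return fmt == InsFormat::SVE_DW_2A || fmt == InsFormat::SVE_EB_1B;
}

size_t emitOutputInstrSveSketch(InsFormat fmt) // SVE cases only
{
    switch (fmt)
    {
        case InsFormat::SVE_DW_2A:
        case InsFormat::SVE_EB_1B: return 4; // A64 SVE encodings are 4 bytes
        default:                   return 0;
    }
}

size_t emitOutputInstrSketch(InsFormat fmt)
{
    if (isSveFormat(fmt))
    {
        return emitOutputInstrSveSketch(fmt); // forward early; base switch untouched
    }
    switch (fmt) // base formats only
    {
        case InsFormat::DI_1A:
        case InsFormat::DR_2A: return 4;
        default:               return 0;
    }
}

int main()
{
    std::cout << emitOutputInstrSketch(InsFormat::SVE_EB_1B) << "\n"; // 4
}
```

The same shape would presumably apply to the emitIns_* entry points if an emitInsSve split is pursued later.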

@amanasifkhalid (Member, Author)

Thanks for the reviews!

I wouldn't worry too much about the TP regressions as I think I solved a chunk of it in #97739 by separating the SVE format cases in emitOutputInstr into their own function.

I didn't realize we'd accumulated that much TP overhead from the SVE switch cases; >1% TP improvement is a lot...

We could also separate out the SVE instruction cases in the emitIns functions into their own emitInsSve, which could further help mitigate any TP regressions, but we don't have to do that now.

emitOutputInstr is probably the worst offender, but I'd be curious to see how much more we can claw back by separating out the emitIns methods.

@TIHan (Contributor) commented on Feb 2, 2024

I didn't realize we'd accumulated that much TP overhead from the SVE switch cases; >1% TP improvement is a lot...

The 1% TP improvement may not be accurate because it's running the ARM64 path on windows/x64 and it still counts the instructions generated for x64 too.

The TP numbers to really look at are for linux/arm64 running on linux/arm64, which improved modestly.

I'd be curious to see how much more we can claw back by separating out the emitIns methods.

I'd be curious as well. I already did it for emitIns_R_R_R_R in that PR, but it didn't change TP, probably because emitIns_R_R_R_R isn't common.

@amanasifkhalid merged commit f568f42 into dotnet:main on Feb 2, 2024
127 of 129 checks passed
@amanasifkhalid deleted the sve-pext branch on February 2, 2024 at 21:42
@github-actions bot locked and limited conversation to collaborators on Mar 4, 2024