ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130

kunalspathak · 2020-04-17T20:40:17Z

ldr     x2, [fp,#24]
ldr     x3, [fp,#32]

can be combined into a single ldp if the two loads read adjacent memory locations:

ldp x2, x3, [fp, #24]

I counted such ldr pairs in the framework libraries and found approximately 28K pairs in 13K methods.

Details:

ldr_ldr_fp_to_ldp.txt

category:cq
theme:optimization
skill-level:intermediate
cost:small
impact:medium


Dotnet-GitSync-Bot · 2020-04-17T20:40:20Z

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

BruceForstall · 2020-04-18T01:36:00Z

Related: #35132

BruceForstall · 2020-04-18T01:52:16Z

I noticed a few things from the attached file:

Sometimes the low offset comes first, sometimes it comes second in the instruction stream. A peep would need to handle both cases.
There are a few cases in the attached examples of non-contiguous loads that wouldn't be mergeable (e.g., [fp,#72] / [fp,#32])
Were there any cases of consecutive loads of the w sub-registers? Or the floating-point registers?
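The first two observations above can be sketched as a small mergeability check. This is a minimal illustration in Python, not the JIT's actual code: the `Load` type, the `try_merge` function, and the restriction to 64-bit x registers are assumptions made for the example. The offset range used is the 64-bit ldp immediate range (a multiple of 8 between -512 and 504).

```python
from dataclasses import dataclass
from typing import Optional

REG_SIZE = 8  # bytes; this sketch only models 64-bit x registers


@dataclass(frozen=True)
class Load:
    """Hypothetical model of `ldr dest, [base, #offset]`."""
    dest: str
    base: str
    offset: int


def try_merge(first: Load, second: Load) -> Optional[str]:
    """Return the ldp text for two consecutive loads, or None if not mergeable."""
    if first.base != second.base:
        return None
    if first.dest == second.dest:
        return None  # ldp with two identical destination registers is not allowed
    if first.dest == first.base:
        return None  # the first load overwrites the base used by the second
    # Handle both orders: the low offset may come first or second.
    lo, hi = (first, second) if first.offset < second.offset else (second, first)
    if hi.offset - lo.offset != REG_SIZE:
        return None  # non-contiguous, e.g. [fp,#72] / [fp,#32]
    # 64-bit ldp takes a signed 7-bit immediate scaled by 8.
    if lo.offset % REG_SIZE != 0 or not -512 <= lo.offset <= 504:
        return None
    return f"ldp {lo.dest}, {hi.dest}, [{lo.base}, #{lo.offset}]"


print(try_merge(Load("x2", "fp", 24), Load("x3", "fp", 32)))
print(try_merge(Load("x3", "fp", 32), Load("x2", "fp", 24)))
print(try_merge(Load("x2", "fp", 72), Load("x3", "fp", 32)))
```

Loads of w sub-registers or floating-point registers would need a different element size and offset range, which this sketch does not attempt.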

BruceForstall · 2020-04-18T01:56:35Z

Extending this to arm32, we could use STM with a register mask to collapse multiple stores (if the consecutive stores used increasing register numbers, and possibly other conditions were met).
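As a rough illustration of the register-number condition, the sketch below checks whether a run of arm32 word stores to the same base could map onto a single STMIA, which writes the listed registers to ascending addresses in ascending register-number order. The function name `stm_register_mask`, the requirement that the run start at offset 0, and the restriction to r0-r12 are simplifying assumptions for the example, not conditions taken from the JIT.

```python
from typing import List, Optional, Tuple


def stm_register_mask(stores: List[Tuple[int, int]]) -> Optional[int]:
    """stores: (register number, byte offset) for each `str rN, [base, #offset]`,
    in program order, all using the same base register.

    Returns the STM register mask (bit N set for rN), or None if the run
    does not fit this simplified STMIA pattern.
    """
    if len(stores) < 2:
        return None  # nothing to collapse
    mask = 0
    prev_reg = -1
    for index, (reg, offset) in enumerate(stores):
        if not 0 <= reg <= 12:
            return None  # assumption: keep sp/lr/pc out of the list
        if offset != 4 * index:
            return None  # words must be contiguous and start at the base address
        if reg <= prev_reg:
            return None  # STM stores lower-numbered registers at lower addresses
        mask |= 1 << reg
        prev_reg = reg
    return mask


print(stm_register_mask([(0, 0), (1, 4), (3, 8)]))
print(stm_register_mask([(1, 0), (0, 4)]))
```

A real implementation would also have to consider the base register appearing in the list, non-zero starting offsets (for example via a different addressing mode), and encoding limits.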

This change serves to address the following four GitHub tickets:

1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp dotnet#35130
2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp dotnet#35132
3. ARM64: Optimize pair of "str reg, [reg]" to stp dotnet#35133
4. ARM64: Optimize pair of "str reg, [fp]" to stp dotnet#35134

A technique was employed that involved detecting an optimisation opportunity as instruction sequences were being generated. The optimised instruction was then generated on top of the previous instruction, with no second instruction generated. Thus, there were no changes to instruction group size at “emission time” and no changes to jump instructions.

…77540)

* Replace successive "ldr" and "str" instructions with "ldp" and "stp"

  This change serves to address the following four GitHub tickets:

  1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130
  2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp #35132
  3. ARM64: Optimize pair of "str reg, [reg]" to stp #35133
  4. ARM64: Optimize pair of "str reg, [fp]" to stp #35134

  A technique was employed that involved detecting an optimisation opportunity as instruction sequences were being generated. The optimised instruction was then generated on top of the previous instruction, with no second instruction generated. Thus, there were no changes to instruction group size at “emission time” and no changes to jump instructions.
* No longer use a temporary buffer to build the optimized instruction.
* Addressed assorted review comments.
* Now optimizes ascending locations and descending locations with consecutive STR and LDR instructions.
* Modification to remove last instructions.
* Ongoing improvements to remove previously-emitted instruction during ldr / str optimization.
* Stopped optimization of consecutive instructions that straddled an instruction group boundary.
* Addressed code change requests in GitHub.
* Various fixes to ldp/stp optimization. Add code to update IP mappings when an instruction is removed.
* Delete unnecessary and incorrect assert
* Diagnostic change only, to confirm whether a theory is correct or not when chasing an error.
* Revert "Diagnostic change only, to confirm whether a theory is correct or". This reverts commit 4b0e51e.
* Do not merge. Temporarily removed calls to "codeGen->genIPmappingUpdateForRemovedInstruction()". Also, corrected minor bug in instruction numbering when removing instructions during optimization.
* Modifications to better update the IP mapping table for a replaced instruction.
* Minor formatting change.
* Check for out of range offsets
* Don't optimise during prolog/epilog
* Fix windows build error
* IGF_HAS_REMOVED_INSTR is ARM64 only
* Add OptimizeLdrStr function
* Fix formatting
* Ensure local variables are tracked
* Don't peephole local variables

Co-authored-by: Bruce Forstall <brucefo@microsoft.com>
Co-authored-by: Alan Hayward <alan.hayward@arm.com>
Co-authored-by: Alan Hayward <a74nh@users.noreply.github.com>
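The "generate on top of the previous instruction" idea described in the commit message can be sketched as follows. This is a toy model in Python, not the JIT's emitter: the `Emitter` class, its tuple-based instruction records, and the `in_prolog_or_epilog` flag are all hypothetical names chosen for the example. It models only 64-bit ldr, overwrites the last record in the current instruction group instead of appending, never looks across a group boundary, and skips the optimization in prolog/epilog.

```python
class Emitter:
    """Toy emitter: each instruction group is a list of instruction tuples."""

    def __init__(self):
        self.groups = [[]]
        self.in_prolog_or_epilog = False

    def new_group(self):
        # Starting a new group means the previous instruction is out of reach.
        self.groups.append([])

    def emit_ldr(self, dest, base, offset):
        group = self.groups[-1]
        cur = ("ldr", dest, base, offset)
        if group and not self.in_prolog_or_epilog:
            merged = self._try_pair(group[-1], cur)
            if merged is not None:
                group[-1] = merged  # replace in place; no second instruction
                return
        group.append(cur)

    @staticmethod
    def _try_pair(prev, cur):
        if prev[0] != "ldr":
            return None  # only a plain ldr can be paired
        _, d1, b1, o1 = prev
        _, d2, b2, o2 = cur
        if b1 != b2 or d1 == d2 or d1 == b1:
            return None
        if o2 - o1 == 8:        # ascending locations
            lo_dest, hi_dest, lo_off = d1, d2, o1
        elif o1 - o2 == 8:      # descending locations
            lo_dest, hi_dest, lo_off = d2, d1, o2
        else:
            return None
        if lo_off % 8 != 0 or not -512 <= lo_off <= 504:
            return None  # offset out of range for 64-bit ldp
        return ("ldp", lo_dest, hi_dest, b1, lo_off)


emitter = Emitter()
emitter.emit_ldr("x2", "fp", 24)
emitter.emit_ldr("x3", "fp", 32)
print(emitter.groups)
```

Because the merge happens while instructions are being recorded, the group never holds more instructions than are finally emitted, which is the property the commit message relies on to leave group sizes and jump instructions unchanged.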

kunalspathak · 2023-04-27T06:11:29Z

Fixed in various peepholes, latest being #85032.

Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label Apr 17, 2020

kunalspathak added the arch-arm64 and area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) labels Apr 17, 2020

BruceForstall mentioned this issue Apr 18, 2020

ARM64: Optimize pair of "str reg, [fp]" to stp #35134

Closed

BruceForstall added this to the Future milestone Apr 20, 2020

BruceForstall removed the untriaged (New issue has not been triaged by the area owner) label Apr 20, 2020

kunalspathak mentioned this issue May 5, 2020

Improving ARM64 Performance in .NET 5.0 – Closing the gap with x64 #35853

Closed

kunalspathak mentioned this issue Jun 18, 2020

Optimize WithUpper/WithLower with InsertSelectedScalar, SpanHelpers.Sequence APIs #38075

Merged

BruceForstall added the JitUntriaged (CLR JIT issues needing additional triage) label Oct 28, 2020

kunalspathak mentioned this issue Nov 9, 2020

[Arm64] Planned JIT work in .NET 6 #43629

Closed

29 tasks

kunalspathak modified the milestones: Future, 6.0.0 Nov 9, 2020

kunalspathak removed the JitUntriaged (CLR JIT issues needing additional triage) label Nov 9, 2020

JulieLeeMSFT modified the milestones: 6.0.0, Future Mar 23, 2021

echesakov mentioned this issue Jul 8, 2021

[Arm64] Peephole optimization opportunities #55365

Closed

8 tasks

kunalspathak mentioned this issue Sep 29, 2022

AdvSimd.Arm64.LoadPairVector128 has additional stack usage #75957

Open

SwapnilGaikwad mentioned this issue Oct 26, 2022

[ARM64/Linux] Inefficiencies when using initializing/cleaning unsafe pointers #12736

Closed

BruceForstall mentioned this issue Oct 27, 2022

Replace successive "ldr" and "str" instructions with "ldp" and "stp" #77540

Merged

kunalspathak mentioned this issue Apr 6, 2023

Perform ldr to ldp peephole optimization #84399

Merged

kunalspathak closed this as completed Apr 27, 2023

ghost locked as resolved and limited conversation to collaborators May 27, 2023

JulieLeeMSFT added this to .NET Core CodeGen Jun 5, 2024

JulieLeeMSFT moved this to Done in .NET Core CodeGen Jun 5, 2024
