Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130

Closed
kunalspathak opened this issue Apr 17, 2020 · 5 comments
Closed

ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130

kunalspathak opened this issue Apr 17, 2020 · 5 comments
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Apr 17, 2020

ldr     x2, [fp,#24]
ldr     x3, [fp,#32]

can be combined into ldp if the loads are happening from subsequent memory.

ldp x2, x3, [fp, #24]

I collected no. of such ldr pairs in framework libraries and found approx. 28K pairs in 13K methods.

Details:

ldr_ldr_fp_to_ldp.txt

category:cq
theme:optimization
skill-level:intermediate
cost:small
impact:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 17, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

@kunalspathak kunalspathak added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 17, 2020
@BruceForstall
Copy link
Member

Related: #35132

@BruceForstall
Copy link
Member

I noticed a few things from the attached file:

  • Sometimes the low offset comes first, sometimes it comes second in the instruction stream. A peep would need to handle both cases.
  • There are a few cases in the attached examples of non-contiguous loads that wouldn't be mergeable (e.g., [fp,#72] / [fp,#32])
  • Were there any cases of consecutive loads of the w sub-registers? Or the floating-point registers?

@BruceForstall
Copy link
Member

Extending this to arm32, we could use STM with a register mask to collapse multiple store (if the consecutive stores were using increasing register numbers, and possibly other conditions were met).

@BruceForstall BruceForstall added this to the Future milestone Apr 20, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 20, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@kunalspathak kunalspathak modified the milestones: Future, 6.0.0 Nov 9, 2020
@kunalspathak kunalspathak removed the JitUntriaged CLR JIT issues needing additional triage label Nov 9, 2020
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 6.0.0, Future Mar 23, 2021
AndyJGraham added a commit to AndyJGraham/runtime that referenced this issue Oct 27, 2022
This change serves to address the following four Github tickets:

    1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp dotnet#35130
    2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp dotnet#35132
    3. ARM64: Optimize pair of "str reg, [reg]" to stp dotnet#35133
    4. ARM64: Optimize pair of "str reg, [fp]" to stp  dotnet#35134

A technique was employed that involved detecting an optimisation
opportunity as instruction sequences were being generated.
The optimised instruction was then generated on top of the previous
instruction, with no second instruction generated. Thus, there were no
changes to instruction group size at “emission time” and no changes to
jump instructions.
BruceForstall pushed a commit to BruceForstall/runtime that referenced this issue Jan 12, 2023
This change serves to address the following four Github tickets:

    1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp dotnet#35130
    2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp dotnet#35132
    3. ARM64: Optimize pair of "str reg, [reg]" to stp dotnet#35133
    4. ARM64: Optimize pair of "str reg, [fp]" to stp  dotnet#35134

A technique was employed that involved detecting an optimisation
opportunity as instruction sequences were being generated.
The optimised instruction was then generated on top of the previous
instruction, with no second instruction generated. Thus, there were no
changes to instruction group size at “emission time” and no changes to
jump instructions.
kunalspathak pushed a commit that referenced this issue Jan 27, 2023
…77540)

* Replace successive "ldr" and "str" instructions with "ldp" and "stp"

This change serves to address the following four Github tickets:

    1. ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130
    2. ARM64: Optimize pair of "ldr reg, [reg]" to ldp #35132
    3. ARM64: Optimize pair of "str reg, [reg]" to stp #35133
    4. ARM64: Optimize pair of "str reg, [fp]" to stp  #35134

A technique was employed that involved detecting an optimisation
opportunity as instruction sequences were being generated.
The optimised instruction was then generated on top of the previous
instruction, with no second instruction generated. Thus, there were no
changes to instruction group size at “emission time” and no changes to
jump instructions.

* No longer use a temporary buffer to build the optimized instruction.

* Addressed assorted review comments.

* Now optimizes ascending locations and decending locations with
consecutive STR and LDR instructions.

* Modification to remove last instructions.

* Ongoing improvements to remove previously-emitted instruction
during ldr / str optimization.

* Stopped optimization of consecutive instructions that straddled an instruction group boundary.

* Addressed code change requests in GitHub.

* Various fixes to ldp/stp optimization

Add code to update IP mappings when an instruction is removed.

* Delete unnecessary and incorrect assert

* Diagnostic change only, to confirm whether a theory is correct or
not when chasing an error.

* Revert "Diagnostic change only, to confirm whether a theory is correct or"

This reverts commit 4b0e51e.

* Do not merge. Temporarily removed calls to
"codeGen->genIPmappingUpdateForRemovedInstruction()".

Also, corrected minor bug in instruction numbering
when removing instructions during optimization.

* Modifications to better update the IP mapping table for a replaced instruction.

* Minor formatting change.

* Check for out of range offsets

* Don't optimise during prolog/epilog

* Fix windows build error

* IGF_HAS_REMOVED_INSTR is ARM64 only

* Add OptimizeLdrStr function

* Fix formatting

* Ensure local variables are tracked

* Don't peephole local variables

Co-authored-by: Bruce Forstall <brucefo@microsoft.com>
Co-authored-by: Alan Hayward <alan.hayward@arm.com>
Co-authored-by: Alan Hayward <a74nh@users.noreply.github.com>
@kunalspathak
Copy link
Member Author

Fixed in various peepholes, latest being #85032.

@ghost ghost locked as resolved and limited conversation to collaborators May 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

No branches or pull requests

4 participants