Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strength reduction for add operations performed power of 2 times #34938

Closed
kunalspathak opened this issue Apr 14, 2020 · 10 comments
Closed

Strength reduction for add operations performed power of 2 times #34938

kunalspathak opened this issue Apr 14, 2020 · 10 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Apr 14, 2020

static uint i4(uint i) {
	return i + i + i + i;
}

Above code should obviously be return 4 * i, but today if we see something like this, we don't optimize it that way and perform add operations. We could optimize it to lsl if adds are performed in power of 2.
Today we generate:

G_M48040_IG02:
        0B000001          add     w1, w0, w0
        0B000021          add     w1, w1, w0
        0B000020          add     w0, w1, w0

We should generate:

        531E7400          lsl     w0, w0, #2

We never optimize even for 8 or 16 operations and thus emitting series of add operations.

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium

@kunalspathak kunalspathak added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 14, 2020
@kunalspathak kunalspathak self-assigned this Apr 14, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 14, 2020
@kunalspathak
Copy link
Member Author

@BruceForstall

@EgorBo
Copy link
Member

EgorBo commented Apr 14, 2020

I guess it's not arm-specific (could generate a single lea on x86 but it does not)

@kunalspathak
Copy link
Member Author

Yep, I agree. I have seen x86 too emitting lea r8d, [rcx+rcx]. Updated the title and label.

@kunalspathak kunalspathak changed the title ARM64: Strength reduction for add operations performed power of 2 times Strength reduction for add operations performed power of 2 times Apr 14, 2020
@BruceForstall BruceForstall added this to the Future milestone Apr 20, 2020
@BruceForstall BruceForstall added optimization and removed untriaged New issue has not been triaged by the area owner labels Apr 20, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020
@anthonycanino
Copy link
Contributor

@kunalspathak I'm interested in working on this issue and could use a little design advice if you don't mind.

My strategy to this peephole optimization is to reduce chains of add operations into a single mul operation, and then let the optimizer transform the mul operation into a more efficient one, as is done at: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L12426

So essentially, I can do something like Add(Add(Add(V0, V0), V0), V0) => Mul(V0, 3) and even Add(Mul(V0, 3), V0) => Mul(V0, 4), where V0 is some local var, but it is done as an additional walk over the tree before the preorder processing of each operand, i.e., done at https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L11545, with something like

if (opts.OptimizationEnabled() && fgGlobalMorph)
{
  tree = gtReduceStrength(tree);
}

/*-------------------------------------------------------------------------
 * Process the first operand, if any
 */

if (op1)
{
  // ...

at which point the subsequent (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L12426) optimization will transform Mul(V0, 4) => Lsh(V0, 2) etc.

I wanted to fold this in during the postorder processing, but the multiplication to shift optimization is conflicts with the strength reduction, i.e., Add(Add(Add(V0, V0), V0), V0) becomes Add(Add(Lsh(V0, 1), V0), V0), which becomes Add(Add(LSH(V0, 1), V0), V0).

I see three solutions, and I'm curious how they might fit in to the design and looking for some feedback:

  1. Keep the strength reduction peephole separate as I currently have done. But I think this conflicts with the natural order of the rest of the optimizations.
  2. Pass additional context from parent to child, similar to MorphAddrContext, so we can known whether to leave Mul(V0, 2) alone if its parent is of the shape Add(Mul(V0, 2), V0) etc.
  3. Handle cases like Add(Lsh(V0, 1), V0) , which would unwind the Lsh to a Mul, and eventually the outermost expression will stay in an optimized state.

@kunalspathak
Copy link
Member Author

@kunalspathak I'm interested in working on this issue

Thank you for volunteering.

Keep the strength reduction peephole separate as I currently have done. But I think this conflicts with the natural order of the rest of the optimizations.

You mean, doing it using gtReduceStrength() before the preorder processing of each operand? This approach looks reasonable and feels natural because we won't have to add extra processing to handle cases of Add(Mul(V0, 2), V0) or Add(Lsh(V0, 1), V0). @EgorBo - do you have any different opinion?

@anthonycanino
Copy link
Contributor

You mean, doing it using gtReduceStrength() before the preorder processing of each operand? This approach looks reasonable and feels natural because we won't have to add extra processing to handle cases of Add(Mul(V0, 2), V0) or Add(Lsh(V0, 1), V0). @EgorBo - do you have any different opinion?

Yes, that's what I meant --- use gtReduceStrength before preorder processing of each operand.

One thing though, is don't we want to handle cases like Add(Mul(V0, 2), V0), in the event that we have an expression like i * 2 + i (admittedly, it doesn't seem likely that would pop up), or are we strictly speaking about reducing add expressions in the shape of i + i + i + i etc.

@kunalspathak
Copy link
Member Author

One thing though, is don't we want to handle cases like Add(Mul(V0, 2), V0), in the event that we have an expression like i * 2 + i

We need to see what we generate for those code patterns. This issue is for cases like i + i + .....

@SingleAccretion
Copy link
Contributor

Yes, that's what I meant --- use gtReduceStrength before preorder processing of each operand.

Why before each operand? I would have expected it to live in the pre-order processing of GT_ADD (i. e. in the first big switch).

This is a tradeoff in complexity vs completeness. Pre-order will miss some things, but the code will be simpler. The way to evaluate this is to make both and check the diffs (which will also tell us how valuable this optimization is in general in real-world(ish) code).

@anthonycanino
Copy link
Contributor

One thing though, is don't we want to handle cases like Add(Mul(V0, 2), V0), in the event that we have an expression like i * 2 + i

We need to see what we generate for those code patterns. This issue is for cases like i + i + .....

Understood. Will work on it as such.

Yes, that's what I meant --- use gtReduceStrength before preorder processing of each operand.

Why before each operand? I would have expected it to live in the pre-order processing of GT_ADD (i. e. in the first big switch).

I think we are thinking the same thing, I meant to say in the pre-order processing of GT_ADD (first big switch).

Thanks for the feedback folks, will work on it.

@kunalspathak
Copy link
Member Author

I do see that we generate expected output.

G_M44653_IG02:              ;; offset=0008H
        531E7400          lsl     w0, w0, #2
G_M44653_IG02:              ;; offset=0000H
       8D048D00000000       lea      eax, [4*rcx]

@ghost ghost locked as resolved and limited conversation to collaborators Jan 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

6 participants