RyuJIT: Index Variable Widening optimization for array accesses #7312

Closed · sivarv opened this issue Jan 27, 2017 · 8 comments

Labels: area-CodeGen-coreclr, enhancement, optimization, tenet-performance
Milestone: 9.0.0

sivarv (Member) commented Jan 27, 2017

There are many Benchstone benchmarks that would benefit from IV (induction variable) widening, which the legacy JIT64 currently performs and RyuJIT does not.

From BenchmarksGame, Mandelbrot and NBody are two such examples.

category:cq
theme:loop-opt
skill-level:expert
cost:large
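
For context, a source-level sketch of what IV widening does (hypothetical method names; the JIT performs this transformation on its IR, not on C# source):

```csharp
// Before widening: the 32-bit index must be sign-extended (movsxd on x64)
// every iteration before it can be used in a 64-bit addressing mode.
static long SumNarrow(int[] a)
{
    long sum = 0;
    for (int i = 0; i < a.Length; i++)
        sum += a[i];
    return sum;
}

// After widening: the induction variable is 64-bit, so the per-iteration
// extension at the array access disappears.
static long SumWide(int[] a)
{
    long sum = 0;
    for (long i = 0; i < a.Length; i++)
        sum += a[i];
    return sum;
}
```

Note the trade-off visible even in this sketch: the widened loop now compares i against a.Length at 64 bits, which is the cost pentp raises later in the thread.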

JosephTremoulet (Contributor) commented:

Copying over @mikedn's comment from duplicate dotnet/coreclr#11608:

"An alternative solution for array indices might be to change signed casts to unsigned casts. I once tried that in dotnet/coreclr#8821 and it does improve performance in some cases. That approach unfortunately produced some regressions, but there may be a way to do this cheaply in assertion propagation."
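
For reference, the cast change written out at the source level (the proposal was for the JIT to perform this rewrite internally; the method is an illustrative example, not from the linked PRs):

```csharp
static int SumWithUnsignedIndex(int[] a)
{
    int sum = 0;
    for (int i = 0; i < a.Length; i++)
    {
        // C# permits uint array indices. With a uint index the JIT can
        // zero-extend (a plain 32-bit mov) rather than sign-extend (movsxd).
        // The bounds check already guarantees the index is in range.
        sum += a[(uint)i];
    }
    return sum;
}
```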

mikedn (Contributor) commented Jul 22, 2017

"there may be a way to do this cheaply in assertion propagation"

It is possible to have optAssertionProp_Cast check for OAK_NO_THROW assertions; they imply that the index is non-negative, so a signed cast can be turned into an unsigned cast. That means that instead of a movsxd instruction we end up generating a shorter, zero-latency mov instruction. This saves about 7K in a jit-diff over the framework assemblies, and a tiny loop I tested with becomes 50% faster.

Unfortunately this fails to work in the common case: for (int i = 0; i < a.Length; i++) ... a[i] .... Such loops are handled by loop cloning, which runs before assertion propagation and removes the range check.
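
To make the ordering problem concrete, here is a rough source-level rendering of the state after loop cloning, under the assumption that the cloning guard is provably true in this canonical shape and only the fast copy remains:

```csharp
static int Sum(int[] a)
{
    int sum = 0;
    for (int i = 0; i < a.Length; i++)
    {
        // Loop cloning has already deleted the range check for this access,
        // so when assertion propagation runs later there is no OAK_NO_THROW
        // assertion to find, and the signed cast on i stays signed (movsxd).
        sum += a[i];
    }
    return sum;
}
```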

JosephTremoulet (Contributor) commented:

I wonder if it would be worth having an operator like GT_ASSUME to feed info to assertion prop (and that would get removed by rationalize or something); then loop cloning could leave an assume in place of the removed bounds check.

mikedn (Contributor) commented Jul 25, 2017

We could probably have loop cloning preserve the range check and set GTF_ARR_BOUND_INBND so that it is removed at a later time. But that may mean that assertion prop will do more work, because it encounters more range checks.

I'll have to check if there are other assertion kinds that can be used for this; loop cloning conditions should generate some.

There's also the question of whether loop cloning should be done so early; that seems strange to me.

mikedn (Contributor) commented Jul 25, 2017

"I'll have to check if there are other assertion kinds that can be used for this; loop cloning conditions should generate some."

I'm dumb; in the common case there will be no loop cloning conditions. There are probably only two ways to get this trick to work after loop cloning:

  • keep the range checks (or replace them with something like GT_ASSUME)
  • have RangeCheck do the cast conversion from signed to unsigned - probably too expensive to be worth it

pentp (Contributor) commented Oct 19, 2020

There's some related discussion here as well: dotnet/coreclr#21553. In summary, instead of (or in addition to) simply widening:

  • Array/Span access is always bounds-checked, so the induction variable is guaranteed to be non-negative, and instead of movsxd the JIT could use just mov.
  • Ideally, the JIT could take advantage of the x64 (and arm64) 32-bit instruction semantics that implicitly zero the upper bits of the destination register. That would eliminate all of the existing sign-extension overhead, as well as the potential instruction-size overhead of induction variable widening, though in some cases it would introduce new overhead in the loop header to guarantee that the upper 32 bits are zero (see the sketch below).
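
A sketch of the x64 behavior the second bullet relies on (the assembly in the comments is illustrative, not actual RyuJIT output; offsets are hypothetical):

```csharp
static int Load(int[] a, int i)
{
    // On x64, writing a 32-bit register implicitly zeroes its upper 32 bits:
    //     mov  esi, edx        ; rsi's upper half is now zero
    //     add  esi, 1          ; result is still zero-extended in rsi
    // So if every definition of the index is a 32-bit instruction, the full
    // 64-bit register already holds the zero-extended value and can feed an
    // addressing mode directly, with no movsxd/mov at the use:
    //     mov  eax, dword ptr [rcx + rsi*4 + 0x10]
    return a[i];
}
```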

BruceForstall added the JitUntriaged label Oct 28, 2020
BruceForstall removed the JitUntriaged label Jan 15, 2021
pentp (Contributor) commented Sep 30, 2021

The expected benefits are somewhat reduced since #57970 removed movsxd uses.
Blindly widening the index variable (e.g., during importation) would probably reduce performance, because the length comparison would then need to be widened as well.

If we know that the index variable is 32-bit, could we just elide the zero-extension (cast) entirely, because any time we load it to a register (or increment/modify it) it will be implicitly zero-extended anyway? The only situation that would then need special handling is when the index comes from a narrowing cast; that should emit an actual mov (see the sketch below)...
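
A sketch of the narrowing-cast case (illustrative names):

```csharp
static int LoadTruncated(int[] a, long big)
{
    // Truncating long -> int is a no-op on the low 32 bits of the register,
    // so a JIT that elides the truncation leaves stale data in the upper
    // 32 bits. Before such a value can be used as a zero-extended 64-bit
    // index, an actual 32-bit mov (e.g. "mov ecx, ecx") is needed to clear
    // the upper half.
    int i = (int)big;
    return a[i];
}
```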

jakobbotsch (Member) commented:

Fixed by #97865

jakobbotsch self-assigned this Feb 28, 2024
jakobbotsch modified the milestones: Future → 9.0.0 on Feb 28, 2024
github-actions bot locked and limited conversation to collaborators Mar 30, 2024