
New compile time regressions (12 vs 13) on znver3 (LoopMicroOpBufferSize too large?) #50802

Closed
davidbolvansky opened this issue Aug 12, 2021 · 13 comments
Assignees: @RKSimon
Labels: backend:X86, backend:X86 Scheduler Models (Accuracy of X86 scheduler models), bugzilla (Issues migrated from bugzilla)

Comments

@davidbolvansky (Collaborator):

Bugzilla Link 51460
Version trunk
OS All
CC @alinas,@LebedevRI,@preames,@RKSimon,@MattPD,@nikic,@rotateright,@tstellar

Extended Description

Phoronix reported some significant compile-time regressions for mplayer and ffmpeg:

https://www.phoronix.com/scan.php?page=article&item=clang13-initial-epyc&num=4

@LebedevRI (Member):

This is a duplicate of #49928
This is expected, and will not be addressed for a release.

*** This bug has been marked as a duplicate of bug #49928 ***

@nikic (Contributor) commented Aug 12, 2021:

Shouldn't LoopMicroOpBufferSize for znver3 be reduced? While it may be correct from a micro-architectural perspective, the way it is used by LLVM is clearly inappropriate: LoopMicroOpBufferSize should be an upper bound on how much we partially unroll, but what we seem to be doing right now is to unroll maximally up to that size. That is okay if the buffer size is reasonably small, but not if it is very large; the result is massive code bloat for no benefit.

While there are scalability problems in other passes, this seems to be the part causing the immediate problem, and it should be addressed. As this value appears to be used only for this unrolling heuristic, I'd suggest dropping it to a more sensible value.

@LebedevRI (Member):

I agree that the cost/benefit model for full loop unrolling is broken. It seems we want both to aggressively fully unroll when we can, and to not do that, at the same time.

At the very least, I really don't think we should be doing full unrolling before inlining/vectorization, but IMHO we should not be fully unrolling at all, at least not unless we can tell that there will be benefits other than the absence of said loop.

+Reames: thoughts on changing the profitability heuristics for full unrolling?

@nikic (Contributor) commented Aug 12, 2021:

I don't think full unrolling is relevant here: LoopMicroOpBufferSize controls the PartialThreshold, which is used for partial and runtime unrolling.

Full unrolling is controlled by different thresholds and a more sophisticated cost model that takes into account how much simplification is expected from breaking the loop backedge. Fully unrolling a relatively large loop can still be beneficial if it results in significant simplification.

Partial and runtime unrolling have more tenuous profitability heuristics, and creating 4096-instruction loops because that's the micro-architectural loop buffer size is almost certainly not profitable. Runtime unrolling will at least limit itself to 8 unrolls by default, but partial unrolling will happily unroll all the way to the full threshold.

@davidbolvansky (Collaborator, Author):

> Shouldn't LoopMicroOpBufferSize for znver3 be reduced?

Yeah, there should be a more reasonable value, because as we see for the reported projects, the regressions are huge (and hardly come with 30% better runtime performance to “justify” them).

Reopened, renamed a bit.

@LebedevRI (Member):

Thank you everyone for your thoughts. It is always fascinating to read what people think.

As a word of caution, I would strongly advise against using random third-party numbers to pass hard judgments.

Unfortunately, my position still hasn't changed, and this is still WONTFIX. Although we can discuss temporarily reducing it, iff someone personally promises to resolve #49928.

Even halving the LoopMicroOpBufferSize (512->256) results in a number of statistically significant performance regressions (+ ~6%).

@nikic (Contributor) commented Aug 13, 2021:

To be clear, my primary concern here is (for once!) not compile-time, but the fact that this simply generates terrible code. Just look at this: https://c.godbolt.org/z/617TjjKhs The compiler shouldn't be unrolling random loops to hundreds of iterations.

While I'd love to see issues like #49928 resolved as a matter of general principle, they are only tangentially related to the problem at hand, which is way too aggressive unrolling. The overzealous unrolling must be fixed independently of any compile-time problems it may also be triggering.

@preames (Collaborator) commented Aug 16, 2021:

Sorry for the late response here, I was on vacation.

I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas.rst#unroll-heuristics if interested.

Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:

  • Cap the scheduler with a fixed upper bound to avoid complexity explosion with a link specifically to the bug to fix.
  • Introduce a "legacy feature for unrolling heuristic" TTI hook which used the old value on that platform, and defaults to the feature value on all other platforms.

I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken. A more constructive approach is to work around it for the moment, and let someone interested, with time and resources, tackle unrolling separately.

@LebedevRI (Member):

> Sorry for late response here, was on vacation.
>
> I started writing a long comment on unrolling heuristics, realized it was only vaguely relevant here, and decided to save it to my public notes for ease of later use. See https://github.com/preames/public-notes/blob/master/llvm-loop-opt-ideas.rst#unroll-heuristics if interested.
>
> Specifically for this bug, I think it's unfortunate that our unrolling heuristic is too tightly tied to an architectural feature, but I don't think we should avoid correcting the feature value simply because it causes some regressions on that architecture. I would suggest we look at some workarounds to preserve the previous behavior while letting the flag be correct. A few ideas:
>
>   • Cap the scheduler with a fixed upper bound to avoid complexity explosion, with a link specifically to the bug to fix.
>   • Introduce a "legacy feature for unrolling heuristic" TTI hook which used the old value on that platform, and defaults to the feature value on all other platforms.

To be honest, I'm not quite convinced there is anything to fix wrt the threshold itself: unrolled loops are great, and as long as the loop's body is less than 4K x86 uops (well, let's pretend that it maps 1:1 to 4K LLVM IR ops), it fits within the uop cache and we've also improved CPU dispatch performance.

So yes, if we unroll a 100-instruction loop 40 times and bloat it to 4000 ops, that's fine, ignoring the latent quadratic edge cases it exposes in the rest of the compiler. Of course, if you care about binary size, perhaps you should use -Os/-Oz :]

But then yes, we probably should reevaluate the cases where the code is strictly sequential, like the contrived example from nikic; those are perhaps not worth unrolling, at least not as much.

> I think it's unreasonable to ask Roman to rewrite the core partial unrolling heuristic just because he happened to trip across a case where it's broken. A more constructive approach is to work around it for the moment, and let someone interested with time and resources tackle unrolling separately.

The thing is, the current threshold (512) is already a compromise/workaround. :)

@tstellar (Collaborator):

Can we remove release-13.0.0 from the blocks field?

@LebedevRI (Member):

Thank you all for participating in this discussion!

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@RKSimon added the backend:X86 and backend:X86 Scheduler Models (Accuracy of X86 scheduler models) labels Aug 8, 2022
@llvmbot (Member) commented Aug 8, 2022:

@llvm/issue-subscribers-backend-x86

@RKSimon self-assigned this Aug 8, 2022
@RKSimon (Collaborator) commented May 22, 2024:

Closing this - the znver3/4 case was addressed by #91340 / 54e52aa

(The more general threshold limit at #67657 was reverted but should be re-committed at some point)

@RKSimon closed this as completed May 22, 2024