tiering makes regex-redux significantly slower #87753
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
I noticed in the public numbers that source generated regexes (6.cs) are significantly slower than ref-emit compiled regexes (5.cs). This doesn't show up in Benchmark.NET, only on the command line. With default settings, generated is 16% slower, even though it doesn't have to emit and compile any code at runtime:
Disabling tiered compilation makes the difference go away, both are 25% faster, and generated is slightly faster than compiled, as expected:
Standalone repro, clone https://github.com/danmoseley/repro1.git and run repro.bat. hyperfine is there as a convenient way to benchmark an exe, feel free to use something else. Is there any way to improve this, or is this just a limitation that shows up with a short lived app like this? BTW, for what it's worth here's native AOT numbers. I guess the regular configuration here is actually forced to the interpreter.
|
@stephentoub any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None (but slower than non-AOT)? |
In a variety of places we assume "Compiled" isn't literally "the only thing that's different is emitting MSIL" but rather "you're asking us to take more time to optimize throughput", and as such there are optimizations performed when Compiled is set that aren't related to emitting MSIL, like spending more time analyzing sets to determine the most optimal thing to search for as part of finding a starting position. I'd bet if you were to debug through you'd find the RegexFindOptimizations is different when you set Compiled vs None. |
Tiering does not do well on a (hopefully smallish) category of apps that are jit intensive, compute intensive, and short running. Crossgen2 is like this, for instance. Sometimes setting cc @EgorBo |
In the benchmark game, we don't really have an opportunity to set environment variables (such as DOTNET_TieredCompilation=0). Is there anything that can be done in the code to discourage tiering? |
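One in-code knob that exists today, though nobody in this thread proposes it, is `MethodImplOptions.AggressiveOptimization`. A method marked this way bypasses tiering and is jitted fully optimized on first call. The sketch below is only an illustration under assumptions: `HotPaths` and `CountMatches` are hypothetical names, and the attribute has to be applied per method, so it would not reach code emitted by the regex source generator unless the generator applied it itself. Such methods also give up whatever tiering would have learned from instrumentation.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a hot inner loop of a short-lived program.
Console.WriteLine(HotPaths.CountMatches("agggtaaa", 'a'));

static class HotPaths
{
    // AggressiveOptimization asks the JIT to skip tier0 for this method
    // and compile it with full optimizations the first time it is called.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static int CountMatches(ReadOnlySpan<char> input, char target)
    {
        int count = 0;
        foreach (char c in input)
        {
            if (c == target)
            {
                count++;
            }
        }
        return count;
    }
}
```

Whether this helps a given program still has to be measured; it trades startup JIT time for steady-state code quality on the annotated methods only.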
I think we are better off trying to fix tiering to handle these cases better. If we disable / discourage tiering then we end up losing perf if the app is not short lived. And I would guess it may be harder to determine if an app is going to be long running than to determine if the app would benefit from more aggressive tiering up. There are various ideas floating about:
|
I seem to remember at one point that recompilation for tiering was done with work items queued to the thread pool. Was that / is that still the case? If that is, that's going to be even more problematic for this particular test case, which might end up saturating the pool with work items, such that the tiering work items won't get a chance to execute promptly. |
CC @mangod9 PTAL. |
is the penalty here that
anyway, hopefully this standalone repro with relatively few methods involved is a convenient vehicle to investigate. |
Is this a regression? @kouvel, I don't believe tiering work items are queued to the regular ThreadPool, are they? |
If they're not then presumably this comment should be revised: runtime/src/coreclr/vm/tieredcompilation.cpp Lines 32 to 34 in 8b25fd3
|
@stephentoub would it make sense to use non-compiled mode for Regex during the first N seconds after app start, so we can make sure all the more important things are handled well? AFAIR we already have similar logic at the C# level for Expression trees, where we have a sort of C#-level call counting. The problem is that our promotion mechanism looks like this: |
RegexOptions.Compiled works fine and doesn't go through tiering because it uses dynamic methods which don't tier. The problem Dan is highlighting is with source generated regexes, at which point you're suggesting not using code in the developer's app, which would be speculating that the code there was in fact written by the source generator and is in fact identical in semantics to the interpreter, neither of which we can guarantee (nor would I feel comfortable doing such a substitution even if we could). |
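For readers unfamiliar with the two modes being compared, here is a minimal sketch of how each is declared. It assumes .NET 7 or later (for `GeneratedRegex` and `Regex.Count`); the `Variants` class and the pattern are illustrative and are not taken from 5.cs or 6.cs.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "xxagggtaaaxxtttaccctxx";

// Both engines should report the same number of matches for the same pattern.
Console.WriteLine(Variants.Compiled.Count(input));
Console.WriteLine(Variants.Generated().Count(input));

static partial class Variants
{
    // Illustrative pattern only.
    private const string Pattern = "agggtaaa|tttaccct";

    // Ref-emit mode: IL is emitted at runtime into dynamic methods,
    // which, per the comment above, do not go through tiering.
    public static readonly Regex Compiled = new Regex(Pattern, RegexOptions.Compiled);

    // Source-generated mode: the matching code is ordinary C# in the app's
    // assembly, so it is jitted and tiered like any other method.
    [GeneratedRegex(Pattern)]
    public static partial Regex Generated();
}
```

The two instances are semantically equivalent; the difference discussed in this issue is only in how and when their matching code gets compiled.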
Presumably, the source-generator based code should be prejittable, right? In that case I assume it's not an issue: if the user doesn't use R2R, we can probably guess they don't care about startup that much? Although short-lived benchmarks without R2R can suffer (maybe even with it, actually), since the conservative promotion algorithm applies here as well. |
This is one of the benchmarks from Benchmarks Games so we can't change how it is run. @EgorBo can you dig in when you have a chance and see if the various hypotheses above are correct? Until then we shouldn't speculate too much on what fixes we might need. |
for what it's worth I was looking at the fork of benchmark games that is a bit more active and takes regular github PR's. that doesn't change the problem though. |
@danmoseley you mentioned in the description "repro.bat" - was it supposed to be in the repro repository? |
Oops, added. But all it was doing was running the apps. |
Default: 1.258 ms (I also checked OSR, but disabling it didn't improve anything). So yes, it's the tier1 promotion problem, exactly the same as #83112 |
Tiering uses a separate background thread. There is another config var that was intended for those benchmark-like cases that wouldn't play well with tiering, |
If we're going to recommend and support that, exposing it in runtimeconfig.json would make that clear. It would also possibly be acceptable to the benchmark owners. Finally, it would avoid flowing to any spawned child apps. |
There are basically 3 scenarios:
The question is whether we can detect one of these scenarios without the user's help. Possible heuristics:
Also, maybe we can leave a sort of a staticpgo-like hint for future sessions that the previous one was short/long living compute-intense, etc? |
DOTNET_TC_AggressiveTiering=1 indeed gives a big improvement.
|
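When comparing runs like the ones above, it can help to have the process report which tiering knobs it was started with. The following is a rough sketch under assumptions: `TieringKnobs` is a hypothetical helper, the two environment variable names are the ones mentioned in this thread, and `System.Runtime.TieredCompilation` is the documented runtimeconfig.json property, assumed here to surface through `AppContext` when it is set.

```csharp
using System;

// Environment variables named in this thread.
foreach (string name in new[] { "DOTNET_TieredCompilation", "DOTNET_TC_AggressiveTiering" })
{
    string? value = Environment.GetEnvironmentVariable(name);
    Console.WriteLine($"{name} = {TieringKnobs.Describe(value)}");
}

// The runtimeconfig.json knob; TryGetSwitch returns false when it was never set.
if (AppContext.TryGetSwitch("System.Runtime.TieredCompilation", out bool enabled))
{
    Console.WriteLine($"System.Runtime.TieredCompilation = {enabled}");
}
else
{
    Console.WriteLine("System.Runtime.TieredCompilation = (unset)");
}

static class TieringKnobs
{
    // Hypothetical helper: render a missing variable explicitly.
    public static string Describe(string? value) => value ?? "(unset)";
}
```

Unlike an environment variable, a runtimeconfig.json setting belongs to one app and is not inherited by child processes, which is the property the comment above is after.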
Unfortunately, there is nothing we can do here as part of .NET 8.0 timeframe |
Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources? Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved? |
Oops, didn't notice this question. We try to improve Tier0's perf if it doesn't hurt JIT's throughput, but overall, I think, it's reasonable to expect 2-10x slower perf from code execution in Tier0 compared to Tier1. For this particular case ("short-living, compute-heavy workload") we have only two options:
|
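The effect described above can sometimes be seen inside a single process by timing the same work in consecutive batches: early batches may run unoptimized code and later ones promoted code. This is only a sketch under assumptions. `Work.Sum`, the batch sizes, and the array length are arbitrary, and whether any slowdown is visible depends on the runtime version, OSR, and the tiering settings in effect, so no expected timings are given.

```csharp
using System;
using System.Diagnostics;

const int Batches = 10;
const int CallsPerBatch = 200;

int[] data = new int[10_000];
for (int i = 0; i < data.Length; i++)
{
    data[i] = i;
}

for (int b = 0; b < Batches; b++)
{
    Stopwatch sw = Stopwatch.StartNew();
    long sink = 0;
    for (int c = 0; c < CallsPerBatch; c++)
    {
        sink += Work.Sum(data); // keep the result live so the call is not elided
    }
    sw.Stop();
    Console.WriteLine($"batch {b}: {sw.Elapsed.TotalMilliseconds:F2} ms (checksum {sink})");
}

static class Work
{
    // Arbitrary compute-bound method standing in for the benchmark's hot code.
    public static long Sum(int[] values)
    {
        long total = 0;
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }
}
```

Running this once with default settings and once with `DOTNET_TieredCompilation=0` gives a rough feel for how much of a short run is spent before promotion.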
I noticed in the public numbers that source generated regexes (6.cs) are significantly slower than ref-emit compiled regexes (5.cs). This is even though source generated mode doesn't have to emit and compile any code at runtime:
This doesn't show up in Benchmark.NET, only on the command line. With default settings, generated is 16% slower,
disabling tiered compilation makes the difference go away, both are 25% faster, and generated is slightly faster than compiled, as expected:
Standalone repro, clone https://github.com/danmoseley/repro1.git and run repro.bat. hyperfine is there as a convenient way to benchmark an exe, feel free to use something else.
Is there any way to improve this, or is this just a limitation that shows up with a short lived app like this?
== what follows is not relevant to this issue but just for comparison ==
BTW, for what it's worth here's native AOT numbers. I guess the regular configuration here is actually forced to the interpreter.
and interpreter and nonbacktracking using nativeAOT. I don't know why the interpreter is slower than "compiled" if the latter is using the interpreter as well.