tiering makes regex-redux significantly slower #87753

Open · danmoseley opened this issue Jun 19, 2023 · 26 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

@danmoseley (Member) commented Jun 19, 2023

I noticed in the public numbers that source-generated regexes (6.cs) are significantly slower than ref-emit compiled regexes (5.cs), even though source-generated mode doesn't have to emit and compile any code at runtime.

This doesn't show up in Benchmark.NET, only on the command line. With default settings, generated is 16% slower:

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.473 s ±  0.050 s    [User: 1.159 s, System: 0.228 s]
  Range (min … max):    1.383 s …  1.521 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.273 s ±  0.014 s    [User: 0.967 s, System: 0.237 s]
  Range (min … max):    1.262 s …  1.318 s    20 runs

Disabling tiered compilation makes the difference go away: both are ~25% faster, and generated is slightly faster than compiled, as expected:

C:\proj\rr>set DOTNET_TieredCompilation=0

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.064 s ±  0.019 s    [User: 0.733 s, System: 0.237 s]
  Range (min … max):    1.045 s …  1.119 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.079 s ±  0.007 s    [User: 0.758 s, System: 0.236 s]
  Range (min … max):    1.068 s …  1.098 s    20 runs

Standalone repro: clone https://github.com/danmoseley/repro1.git and run repro.bat. hyperfine is included as a convenient way to benchmark an exe; feel free to use something else.

Is there any way to improve this, or is this just a limitation that shows up with a short-lived app like this?
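For readers without the repro handy, the gen and reg modes presumably differ only in how the Regex instances are constructed; a minimal illustrative sketch (the pattern is one of the regex-redux variants, the type and method names are hypothetical):

using System.Text.RegularExpressions;

public static partial class Patterns
{
    // "gen": the source generator emits the matching code at build time,
    // so nothing has to be emitted or compiled at run time.
    [GeneratedRegex("agggtaaa|tttaccct")]
    public static partial Regex Variant1Generated();

    // "reg": RegexOptions.Compiled emits MSIL into a DynamicMethod at run time.
    public static Regex Variant1Compiled() =>
        new Regex("agggtaaa|tttaccct", RegexOptions.Compiled);
}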

== what follows is not directly relevant to this issue, just for comparison ==
For what it's worth, here are the native AOT numbers. I guess the regular configuration here is actually forced to use the interpreter.

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen
  Time (mean ± σ):      1.094 s ±  0.037 s    [User: 0.533 s, System: 0.335 s]
  Range (min … max):    1.030 s …  1.161 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg
  Time (mean ± σ):      1.393 s ±  0.021 s    [User: 0.888 s, System: 0.204 s]
  Range (min … max):    1.363 s …  1.435 s    20 runs

And here are the interpreter and non-backtracking numbers under native AOT. I don't know why the interpreter is slower than "compiled" if the latter is using the interpreter as well.

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe non"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe non
  Time (mean ± σ):      3.514 s ±  0.026 s    [User: 3.210 s, System: 0.207 s]
  Range (min … max):    3.476 s …  3.576 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe nbt"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe nbt
  Time (mean ± σ):      1.661 s ±  0.059 s    [User: 1.353 s, System: 0.214 s]
  Range (min … max):    1.606 s …  1.789 s    20 runs
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 19, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 19, 2023
@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 19, 2023
@ghost commented Jun 19, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@danmoseley (Member Author)

@stephentoub any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None (but slower than non-AOT)?

@stephentoub (Member)

any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None

In a variety of places we assume "Compiled" doesn't literally mean "the only thing that's different is emitting MSIL" but rather "you're asking us to spend more time optimizing for throughput". As such, there are optimizations performed when Compiled is set that aren't related to emitting MSIL, like spending more time analyzing sets to determine the best thing to search for as part of finding a starting position. I'd bet that if you were to debug through it, you'd find the RegexFindOptimizations is different when you set Compiled vs. None.
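A rough way to observe what's being described; this is a hypothetical micro-harness, not code from the repro:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

string haystack = new string('a', 1_000_000) + "needle";
var interpreted = new Regex("ne+dle");                          // RegexOptions.None
var compiled    = new Regex("ne+dle", RegexOptions.Compiled);   // also opts into extra up-front analysis

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100; i++) interpreted.IsMatch(haystack);
Console.WriteLine($"None:     {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < 100; i++) compiled.IsMatch(haystack);
Console.WriteLine($"Compiled: {sw.ElapsedMilliseconds} ms");

That extra analysis would be consistent with Compiled under native AOT being faster than None above, even though both end up running on the interpreter there.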

@AndyAyersMS (Member)

Tiering does not do well on a (hopefully smallish) category of apps that are JIT-intensive, compute-intensive, and short-running. Crossgen2 is like this, for instance.

Sometimes setting DOTNET_TC_CallCountingDelayMs=0 can help.

cc @EgorBo
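For the standalone repro above, that would look like the following (not measured here, just showing where the variable goes, in the same cmd.exe style as the earlier runs):

C:\proj\rr>set DOTNET_TC_CallCountingDelayMs=0

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"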

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 19, 2023
@danmoseley (Member Author)

In the benchmark game, we don't really have an opportunity to set environment variables (such as DOTNET_TieredCompilation=0). Is there anything that can be done in the code to discourage tiering?
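For reference, the closest existing knob that doesn't require an environment variable is the MSBuild-level switch, which flows into the generated runtimeconfig.json; note it disables tiering outright rather than merely discouraging it (fragments below are illustrative):

<!-- rr.csproj -->
<PropertyGroup>
  <TieredCompilation>false</TieredCompilation>
</PropertyGroup>

which ends up in the generated runtimeconfig.json as:

"configProperties": {
  "System.Runtime.TieredCompilation": false
}

Whether the benchmark harness would accept a project-file change like that is a separate question.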

@AndyAyersMS (Member)

I think we are better off trying to fix tiering to handle these cases better. If we disable or discourage tiering, then we end up losing perf if the app is not short-lived. And I would guess it may be harder to determine whether an app is going to be long-running than to determine whether it would benefit from more aggressive tiering up.

There are various ideas floating about:

  • bypass tiering for the subset of methods that won't benefit from tiering (e.g. most property get/set)
  • a priority queue (or similar) for rejitting, so we tier up the most important methods sooner
  • avoid some unnecessary rejittings (.NET 8 only): PGO: redundant Tier0-instr promotions #84517
  • (possibly) an adaptive call counting delay -- initially set it lower and back off if the app seems to still be in "startup mode"
  • adjustments to OSR thresholds

@stephentoub (Member)

I seem to remember that at one point recompilation for tiering was done with work items queued to the thread pool. Was that / is that still the case? If it is, that's going to be even more problematic for this particular test case, which might end up saturating the pool with work items such that the tiering work items won't get a chance to execute promptly.

@JulieLeeMSFT (Member)

CC @mangod9 PTAL.

@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Jun 19, 2023
@danmoseley (Member Author) commented Jun 19, 2023

Is the penalty here that:

  • the counting/instrumentation cost is not getting amortized by later improved codegen
  • similarly, the cost of repeating codegen; i.e., tiering is not the problem per se: if the jitting didn't happen until later it would be better, because it isn't getting amortized
  • the ref-emit code, while never tiered, is better optimized than the tier-0 regular code, and the regular code doesn't get rejitted early enough to compete
    or possibly
  • tiering is making bad choices about what to rejit, thus deoptimizing code
  • rejitted code itself is a deoptimization in some cases

Anyway, hopefully this standalone repro with relatively few methods involved is a convenient vehicle to investigate.

@mangod9 (Member) commented Jun 19, 2023

Is this a regression? @kouvel, I don't believe tiering work items are queued to the regular ThreadPool, are they?

@stephentoub (Member)

I don't believe tiering work items are queued to the regular ThreadPool, are they?

If they're not, then presumably this comment should be revised:

// be recompiled (m_methodsToOptimize). If there is currently no thread
// servicing our queue asynchronously then we use the runtime threadpool
// QueueUserWorkItem to recruit one. During the callback for each threadpool work

@EgorBo (Member) commented Jun 19, 2023

@stephentoub would it make sense to use non-compiled mode for Regex during the first N seconds of app start, so we can make sure all the more important things are well handled? AFAIR we already have similar logic at the C# level for Expression trees, where we have a sort of C#-level call counting.

The problem is that our promotion mechanism looks like this: a timer with a 100 ms interval fires and checks whether any new Tier0 compilations were requested since the last 100 ms interval. If there were any, the promotion (actually, the call-counting stub installation) is delayed until the next 100 ms tick. So a single Tier0 compilation may delay Tier1 promotion for, e.g., 100,000 methods (I've seen something like that in Bing). It is something we want to improve, we just don't know exactly how; e.g. the priority-based queue Andy mentioned.
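In pseudocode, the backoff being described is roughly the following (an illustrative sketch with made-up helper names, not the actual runtime code):

// Runs every ~100 ms while there are methods waiting for Tier1 promotion.
void OnCallCountingDelayTick()
{
    if (AnyNewTier0CompilationsSinceLastTick())
    {
        // The app still appears to be starting up: postpone installing the
        // call-counting stubs, which postpones Tier1 promotion for *all*
        // methods currently queued, not just the newly compiled ones.
        ScheduleNextTick();
        return;
    }

    // A quiet 100 ms interval: install the call-counting stubs so queued
    // methods start counting calls and get promoted to Tier1 at the threshold.
    InstallCallCountingStubsForQueuedMethods();
}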

@stephentoub (Member)

would it make sense to use non-compiled mode for Regex during the first N seconds of app start, so we can make sure all the more important things are well handled? AFAIR we already have similar logic at the C# level for Expression trees, where we have a sort of C#-level call counting.

RegexOptions.Compiled works fine and doesn't go through tiering because it uses dynamic methods which don't tier.

The problem Dan is highlighting is with source-generated regexes, at which point you're suggesting not using code that's in the developer's app. That would require speculating that the code there was in fact written by the source generator and is in fact identical in semantics to the interpreter, neither of which we can guarantee (nor would I feel comfortable doing such a substitution even if we could).
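For reference, a minimal sketch of the ref-emit mechanism being contrasted here; a DynamicMethod body is jitted once on first invocation and, per the comment above, doesn't participate in tiering (generic example, not the regex compiler's actual output):

using System;
using System.Reflection.Emit;

// Emit int Add(int a, int b) at run time, the same mechanism RegexOptions.Compiled uses.
var dm = new DynamicMethod("Add", typeof(int), new[] { typeof(int), typeof(int) });
ILGenerator il = dm.GetILGenerator();
il.Emit(OpCodes.Ldarg_0);
il.Emit(OpCodes.Ldarg_1);
il.Emit(OpCodes.Add);
il.Emit(OpCodes.Ret);

var add = (Func<int, int, int>)dm.CreateDelegate(typeof(Func<int, int, int>));
Console.WriteLine(add(2, 3)); // 5; the body is jitted once, with no tiered recompilation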

@EgorBo (Member) commented Jun 19, 2023

Presumably the source-generator-based code should be prejittable, right? In that case I assume it's not an issue: if the user doesn't use R2R, we can probably guess they don't care about startup that much.

Although, short-lived benchmarks without R2R can suffer (maybe even with it, actually); the conservative promotion algorithm applies here as well.

@AndyAyersMS (Member)

This is one of the benchmarks from the Benchmarks Game, so we can't change how it is run.

@EgorBo can you dig in when you have a chance and see if the various hypotheses above are correct? Until then we shouldn't speculate too much on what fixes we might need.

@danmoseley (Member Author)

For what it's worth, I was looking at the fork of the benchmarks game that is a bit more active and takes regular GitHub PRs. That doesn't change the problem, though.
hanabi1224/Programming-Language-Benchmarks#396

@EgorBo (Member) commented Jun 19, 2023

@danmoseley you mentioned "repro.bat" in the description; was it supposed to be in the repro repository?

@danmoseley (Member Author)

Oops, added. But all it was doing was running the apps.

@EgorBo (Member) commented Jun 19, 2023

@AndyAyersMS

Default: 1.258 s
Default, PGO=0: 1.276 s
Default, TC=0: 870.2 ms
Default, TC_CallCountingDelayMs=0: 843.3 ms
Default, TC_CallCountingDelayMs=0, PGO=0: 870.1 ms

(I also checked OSR, but disabling it didn't improve anything.)

So yes, it's the Tier1 promotion problem, exactly the same as #83112.

@kouvel (Member) commented Jun 19, 2023

I don't believe tiering work items are queued to the regular ThreadPool, are they?

Tiering uses a separate background thread.

There is another config var that was intended for benchmark-like cases that don't play well with tiering: TC_AggressiveTiering=1. That setting includes TC_CallCountingDelayMs=0 and also adjusts the call-count thresholds to be minimal, to tier up more quickly. If it would be useful, it could be made configurable through .runtimeconfig.json and the project file.

@danmoseley (Member Author)

If we're going to recommend and support that, exposing it in runtimeconfig would make that clear. It would also possibly be acceptable to the benchmark owners. Finally, it would avoid flowing to any spawned child apps.

@EgorBo (Member) commented Jun 19, 2023

There are basically 3 scenarios:

  1. We need to tier up as quickly as possible: short-lived, compute-heavy apps like crossgen and benchmarks. This might also be relevant to some web services, e.g. an Azure Function with a compute-heavy task.
  2. We need to start as quickly as possible, and background Tier1 JIT events might negatively impact that: most UI apps, WPF, and possibly some microservices where we need to serve the first request as soon as possible.
  3. We should delay the promotion even further compared to 2) so we can collect a better profile and optimize our long-lived app/service better (where we don't really care about startup). This is especially important because we don't support context-sensitive profiling, so if we profile and promote, say, Enumerable.Sum too early, we might bake in a profile that won't be valid for other future uses.

The question is whether we can detect one of these scenarios without the user's help. Possible heuristics:

  • is it a console app / WPF / ASP.NET
  • is GC server mode enabled
  • etc.

Also, maybe we could leave a static-PGO-like hint for future sessions that the previous session was short/long-lived, compute-intensive, etc.?

@danmoseley (Member Author)

DOTNET_TC_AggressiveTiering=1 indeed gives a big improvement.

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.374 s ±  0.033 s    [User: 1.055 s, System: 0.228 s]
  Range (min … max):    1.350 s …  1.471 s    20 runs

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.246 s ±  0.008 s    [User: 0.933 s, System: 0.230 s]
  Range (min … max):    1.236 s …  1.268 s    20 runs

C:\proj\rr>set DOTNET_TC_AggressiveTiering=1

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):     944.1 ms ±  15.4 ms    [User: 680.9 ms, System: 225.9 ms]
  Range (min … max):   917.3 ms … 970.7 ms    20 runs

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.113 s ±  0.007 s    [User: 1.093 s, System: 0.214 s]
  Range (min … max):    1.103 s …  1.133 s    20 runs

@EgorBo (Member) commented Jul 22, 2023

Unfortunately, there is nothing we can do here in the .NET 8.0 timeframe.

@EgorBo EgorBo modified the milestones: 8.0.0, 9.0.0 Jul 22, 2023
@danmoseley (Member Author)

Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources?

Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved?

@EgorBo (Member) commented Aug 14, 2023

Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources?

Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved?

Oops, I didn't notice this question. We try to improve Tier0's performance where it doesn't hurt the JIT's throughput, but overall I think it's reasonable to expect code execution in Tier0 to be 2-10x slower than in Tier1. For this particular case (a short-lived, compute-heavy workload) we have only two options:

  1. Expect a hint from the developer that an app needs no tiering, or needs to reach the top tier as fast as it can, e.g. via some runtimeconfig/MSBuild property.
  2. Try to guess that somehow; this one is tricky because we have the other types of scenarios described above.

@EgorBo EgorBo modified the milestones: 9.0.0, Future May 3, 2024