tiering makes regex-redux significantly slower #87753

Open · danmoseley opened this issue Jun 19, 2023 · 26 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

@danmoseley (Member) commented Jun 19, 2023

I noticed in the public numbers that source-generated regexes (6.cs) are significantly slower than ref-emit compiled regexes (5.cs), even though source-generated mode doesn't have to emit and compile any code at runtime.

This doesn't show up in Benchmark.NET, only on the command line. With default settings, generated is 16% slower:

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.473 s ±  0.050 s    [User: 1.159 s, System: 0.228 s]
  Range (min … max):    1.383 s …  1.521 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.273 s ±  0.014 s    [User: 0.967 s, System: 0.237 s]
  Range (min … max):    1.262 s …  1.318 s    20 runs

Disabling tiered compilation makes the difference go away: both are ~25% faster, and generated is slightly faster than compiled, as expected:

C:\proj\rr>set DOTNET_TieredCompilation=0

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.064 s ±  0.019 s    [User: 0.733 s, System: 0.237 s]
  Range (min … max):    1.045 s …  1.119 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.079 s ±  0.007 s    [User: 0.758 s, System: 0.236 s]
  Range (min … max):    1.068 s …  1.098 s    20 runs

Standalone repro: clone https://github.com/danmoseley/repro1.git and run repro.bat. hyperfine is included as a convenient way to benchmark an exe; feel free to use something else.

Is there any way to improve this, or is this just a limitation that shows up with a short-lived app like this?
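For readers without the repro handy, the gen and reg modes presumably differ only in how the Regex instances are constructed; a minimal illustrative sketch (the pattern is one of the regex-redux variants, the type and method names are hypothetical):

using System.Text.RegularExpressions;

public static partial class Patterns
{
    // "gen": the source generator emits the matching code at build time,
    // so nothing has to be emitted or compiled at run time.
    [GeneratedRegex("agggtaaa|tttaccct")]
    public static partial Regex Variant1Generated();

    // "reg": RegexOptions.Compiled emits MSIL into a DynamicMethod at run time.
    public static Regex Variant1Compiled() =>
        new Regex("agggtaaa|tttaccct", RegexOptions.Compiled);
}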

== what follows is not directly relevant to this issue, just for comparison ==
For what it's worth, here are the native AOT numbers. I guess the regular configuration here is actually forced to use the interpreter.

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen
  Time (mean ± σ):      1.094 s ±  0.037 s    [User: 0.533 s, System: 0.335 s]
  Range (min … max):    1.030 s …  1.161 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg
  Time (mean ± σ):      1.393 s ±  0.021 s    [User: 0.888 s, System: 0.204 s]
  Range (min … max):    1.363 s …  1.435 s    20 runs

And here are the interpreter and non-backtracking numbers under native AOT. I don't know why the interpreter is slower than "compiled" if the latter is using the interpreter as well.

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe non"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe non
  Time (mean ± σ):      3.514 s ±  0.026 s    [User: 3.210 s, System: 0.207 s]
  Range (min … max):    3.476 s …  3.576 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe nbt"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe nbt
  Time (mean ± σ):      1.661 s ±  0.059 s    [User: 1.353 s, System: 0.214 s]
  Range (min … max):    1.606 s …  1.789 s    20 runs
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 19, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 19, 2023
@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 19, 2023
@ghost commented Jun 19, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@danmoseley (Member Author)

@stephentoub any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None (but slower than non-AOT)?

@stephentoub (Member)

any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None

In a variety of places we assume "Compiled" doesn't literally mean "the only thing that's different is emitting MSIL" but rather "you're asking us to spend more time optimizing for throughput". As such, there are optimizations performed when Compiled is set that aren't related to emitting MSIL, like spending more time analyzing sets to determine the best thing to search for as part of finding a starting position. I'd bet that if you were to debug through it, you'd find the RegexFindOptimizations is different when you set Compiled vs. None.
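A rough way to observe what's being described; this is a hypothetical micro-harness, not code from the repro:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

string haystack = new string('a', 1_000_000) + "needle";
var interpreted = new Regex("ne+dle");                          // RegexOptions.None
var compiled    = new Regex("ne+dle", RegexOptions.Compiled);   // also opts into extra up-front analysis

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100; i++) interpreted.IsMatch(haystack);
Console.WriteLine($"None:     {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < 100; i++) compiled.IsMatch(haystack);
Console.WriteLine($"Compiled: {sw.ElapsedMilliseconds} ms");

That extra analysis would be consistent with Compiled under native AOT being faster than None above, even though both end up running on the interpreter there.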

@AndyAyersMS (Member)

Tiering does not do well on a (hopefully smallish) category of apps that are JIT-intensive, compute-intensive, and short-running. Crossgen2 is like this, for instance.

Sometimes setting DOTNET_TC_CallCountingDelayMs=0 can help.

cc @EgorBo
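For the standalone repro above, that would look like the following (not measured here, just showing where the variable goes, in the same cmd.exe style as the earlier runs):

C:\proj\rr>set DOTNET_TC_CallCountingDelayMs=0

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"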

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 19, 2023
@danmoseley (Member Author)

In the benchmark game, we don't really have an opportunity to set environment variables (such as DOTNET_TieredCompilation=0). Is there anything that can be done in the code to discourage tiering?
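For reference, the closest existing knob that doesn't require an environment variable is the MSBuild-level switch, which flows into the generated runtimeconfig.json; note it disables tiering outright rather than merely discouraging it (fragments below are illustrative):

<!-- rr.csproj -->
<PropertyGroup>
  <TieredCompilation>false</TieredCompilation>
</PropertyGroup>

which ends up in the generated runtimeconfig.json as:

"configProperties": {
  "System.Runtime.TieredCompilation": false
}

Whether the benchmark harness would accept a project-file change like that is a separate question.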

@AndyAyersMS (Member)

I think we are better off trying to fix tiering to handle these cases better. If we disable or discourage tiering, then we end up losing perf if the app is not short-lived. And I would guess it may be harder to determine whether an app is going to be long-running than to determine whether it would benefit from more aggressive tiering up.

There are various ideas floating about:

  • bypass tiering for the subset of methods that won't benefit from tiering (e.g. most property get/set)
  • a priority queue (or similar) for rejitting, so we tier up the most important methods sooner
  • avoid some unnecessary rejittings (.NET 8 only): PGO: redundant Tier0-instr promotions #84517
  • (possibly) an adaptive call counting delay -- initially set it lower and back off if the app seems to still be in "startup mode"
  • adjustments to OSR thresholds

@stephentoub (Member)

I seem to remember that at one point recompilation for tiering was done with work items queued to the thread pool. Was that / is that still the case? If it is, that's going to be even more problematic for this particular test case, which might end up saturating the pool with work items such that the tiering work items won't get a chance to execute promptly.

@JulieLeeMSFT (Member)

CC @mangod9 PTAL.

@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Jun 19, 2023
@danmoseley (Member Author) commented Jun 19, 2023

Is the penalty here that:

  • the counting/instrumentation cost is not getting amortized by later improved codegen
  • similarly, the cost of repeating codegen; i.e., tiering is not the problem per se: if the jitting didn't happen until later it would be better, because it isn't getting amortized
  • the ref-emit code, while never tiered, is better optimized than the tier-0 regular code, and the regular code doesn't get rejitted early enough to compete
    or possibly
  • tiering is making bad choices about what to rejit, thus deoptimizing code
  • rejitted code itself is a deoptimization in some cases

Anyway, hopefully this standalone repro with relatively few methods involved is a convenient vehicle to investigate.

@mangod9 (Member) commented Jun 19, 2023

Is this a regression? @kouvel, I don't believe tiering work items are queued to the regular ThreadPool, are they?

@stephentoub (Member)

I don't believe tiering work items are queued to the regular ThreadPool, are they?

If they're not, then presumably this comment should be revised:

// be recompiled (m_methodsToOptimize). If there is currently no thread
// servicing our queue asynchronously then we use the runtime threadpool
// QueueUserWorkItem to recruit one. During the callback for each threadpool work

@EgorBo (Member) commented Jun 19, 2023

@stephentoub would it make sense to use non-compiled mode for Regex during the first N seconds of app start, so we can make sure all the more important things are well handled? AFAIR we already have similar logic at the C# level for Expression trees, where we have a sort of C#-level call counting.

The problem is that our promotion mechanism looks like this: a timer with a 100 ms interval fires and checks whether any new Tier0 compilations were requested since the last 100 ms interval. If there were any, the promotion (actually, the call-counting stub installation) is delayed until the next 100 ms tick. So a single Tier0 compilation may delay Tier1 promotion for, e.g., 100,000 methods (I've seen something like that in Bing). It is something we want to improve, we just don't know exactly how; e.g. the priority-based queue Andy mentioned.
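In pseudocode, the backoff being described is roughly the following (an illustrative sketch with made-up helper names, not the actual runtime code):

// Runs every ~100 ms while there are methods waiting for Tier1 promotion.
void OnCallCountingDelayTick()
{
    if (AnyNewTier0CompilationsSinceLastTick())
    {
        // The app still appears to be starting up: postpone installing the
        // call-counting stubs, which postpones Tier1 promotion for *all*
        // methods currently queued, not just the newly compiled ones.
        ScheduleNextTick();
        return;
    }

    // A quiet 100 ms interval: install the call-counting stubs so queued
    // methods start counting calls and get promoted to Tier1 at the threshold.
    InstallCallCountingStubsForQueuedMethods();
}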

@stephentoub (Member)

would it make sense to use non-compiled mode for Regex during the first N seconds of app start, so we can make sure all the more important things are well handled? AFAIR we already have similar logic at the C# level for Expression trees, where we have a sort of C#-level call counting.

RegexOptions.Compiled works fine and doesn't go through tiering because it uses dynamic methods which don't tier.

The problem Dan is highlighting is with source-generated regexes, at which point you're suggesting not using code that's in the developer's app. That would require speculating that the code there was in fact written by the source generator and is in fact identical in semantics to the interpreter, neither of which we can guarantee (nor would I feel comfortable doing such a substitution even if we could).
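For reference, a minimal sketch of the ref-emit mechanism being contrasted here; a DynamicMethod body is jitted once on first invocation and, per the comment above, doesn't participate in tiering (generic example, not the regex compiler's actual output):

using System;
using System.Reflection.Emit;

// Emit int Add(int a, int b) at run time, the same mechanism RegexOptions.Compiled uses.
var dm = new DynamicMethod("Add", typeof(int), new[] { typeof(int), typeof(int) });
ILGenerator il = dm.GetILGenerator();
il.Emit(OpCodes.Ldarg_0);
il.Emit(OpCodes.Ldarg_1);
il.Emit(OpCodes.Add);
il.Emit(OpCodes.Ret);

var add = (Func<int, int, int>)dm.CreateDelegate(typeof(Func<int, int, int>));
Console.WriteLine(add(2, 3)); // 5; the body is jitted once, with no tiered recompilation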

@EgorBo (Member) commented Jun 19, 2023

Presumably the source-generator-based code should be prejittable, right? In that case I assume it's not an issue: if the user doesn't use R2R, we can probably guess they don't care about startup that much.

Although, short-lived benchmarks without R2R can suffer (maybe even with it, actually); the conservative promotion algorithm applies here as well.

@AndyAyersMS (Member)

This is one of the benchmarks from the Benchmarks Game, so we can't change how it is run.

@EgorBo can you dig in when you have a chance and see if the various hypotheses above are correct? Until then we shouldn't speculate too much on what fixes we might need.

@danmoseley (Member Author)

For what it's worth, I was looking at the fork of the benchmarks game that is a bit more active and takes regular GitHub PRs. That doesn't change the problem, though.
hanabi1224/Programming-Language-Benchmarks#396

@EgorBo (Member) commented Jun 19, 2023

@danmoseley you mentioned "repro.bat" in the description; was it supposed to be in the repro repository?

@danmoseley (Member Author)

Oops, added. But all it was doing was running the apps.

@EgorBo (Member) commented Jun 19, 2023

@AndyAyersMS

Default: 1.258 s
Default, PGO=0: 1.276 s
Default, TC=0: 870.2 ms
Default, TC_CallCountingDelayMs=0: 843.3 ms
Default, TC_CallCountingDelayMs=0, PGO=0: 870.1 ms

(I also checked OSR, but disabling it didn't improve anything.)

So yes, it's the Tier1 promotion problem, exactly the same as #83112.

@kouvel (Member) commented Jun 19, 2023

I don't believe tiering work items are queued to the regular ThreadPool, are they?

Tiering uses a separate background thread.

There is another config var that was intended for benchmark-like cases that don't play well with tiering: TC_AggressiveTiering=1. That setting includes TC_CallCountingDelayMs=0 and also adjusts the call-count thresholds to be minimal, to tier up more quickly. If it would be useful, it could be made configurable through .runtimeconfig.json and the project file.

@danmoseley (Member Author)

If we're going to recommend and support that, exposing it in runtimeconfig would make that clear. It would also possibly be acceptable to the benchmark owners. Finally, it would avoid flowing to any spawned child apps.

@EgorBo (Member) commented Jun 19, 2023

There are basically 3 scenarios:

  1. We need to tier up as quickly as possible: short-lived, compute-heavy apps like crossgen and benchmarks. This might also be relevant to some web services, e.g. an Azure Function with a compute-heavy task.
  2. We need to start as quickly as possible, and background Tier1 JIT events might negatively impact that: most UI apps, WPF, and possibly some microservices where we need to serve the first request as soon as possible.
  3. We should delay the promotion even further compared to 2) so we can collect a better profile and optimize our long-lived app/service better (where we don't really care about startup). This is especially important because we don't support context-sensitive profiling, so if we profile and promote, say, Enumerable.Sum too early, we might bake in a profile that won't be valid for other future uses.

The question is whether we can detect one of these scenarios without the user's help. Possible heuristics:

  • is it a console app / WPF / ASP.NET
  • is GC server mode enabled
  • etc.

Also, maybe we could leave a static-PGO-like hint for future sessions that the previous session was short/long-lived, compute-intensive, etc.?

@danmoseley (Member Author)

DOTNET_TC_AggressiveTiering=1 indeed gives a big improvement.

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.374 s ±  0.033 s    [User: 1.055 s, System: 0.228 s]
  Range (min … max):    1.350 s …  1.471 s    20 runs

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.246 s ±  0.008 s    [User: 0.933 s, System: 0.230 s]
  Range (min … max):    1.236 s …  1.268 s    20 runs

C:\proj\rr>set DOTNET_TC_AggressiveTiering=1

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):     944.1 ms ±  15.4 ms    [User: 680.9 ms, System: 225.9 ms]
  Range (min … max):   917.3 ms … 970.7 ms    20 runs

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.113 s ±  0.007 s    [User: 1.093 s, System: 0.214 s]
  Range (min … max):    1.103 s …  1.133 s    20 runs

@EgorBo (Member) commented Jul 22, 2023

Unfortunately, there is nothing we can do here in the .NET 8.0 timeframe.

@EgorBo EgorBo modified the milestones: 8.0.0, 9.0.0 Jul 22, 2023
@danmoseley (Member Author)

Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources?

Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved?

@EgorBo (Member) commented Aug 14, 2023

Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources?

Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved?

Oops, I didn't notice this question. We try to improve Tier0's performance where it doesn't hurt the JIT's throughput, but overall I think it's reasonable to expect code execution in Tier0 to be 2-10x slower than in Tier1. For this particular case (a short-lived, compute-heavy workload) we have only two options:

  1. Expect a hint from the developer that an app needs no tiering, or needs to reach the top tier as fast as it can, e.g. via some runtimeconfig/MSBuild property.
  2. Try to guess that somehow; this one is tricky because we have the other types of scenarios described above.

@EgorBo EgorBo modified the milestones: 9.0.0, Future May 3, 2024