Remove CPU cache contention for instrumented code #81932
Merged
Closes #76520
Problem
Tier0 with instrumentation demonstrates up to 5x lower RPS than the non-instrumented Tier0.
After various investigations we came to the conclusion that the main overhead is not from the counters/class probes themselves in the codegen, but from the cache contention they cause by updating the same memory addresses from different threads. Block counters and class probes are each responsible for roughly half of it. I verified that by doing two things:
1. inc [(reloc)] for block counters ("fake counters").
2. JIT_ClassProfile32 made a no-op in the VM while the JIT still emits calls to it for class probes.
Both steps restored the missing RPS from Tier0-instr. More experiments are listed here:
After some brainstorming with @jakobbotsch and @AndyAyersMS we detected a low-hanging fruit in HandleHistogramProfileRand: its random state can be kept in TLS. TLS random might slightly decrease the quality of the class histogram if the same method is invoked with different objects in parallel, but that sounds like a reasonable price for reducing the contention.
Benchmarks
Tier0 TE results. Both Base and Diff use PGO instrumentation; Diff includes this fix.
Plaintext-MVC (chart)
JSON-JSON (chart)
Up to +70% RPS. While this will not necessarily result in a faster "time to start/first request", it will definitely improve the throughput/latency of not-fully-warmed-up code. It also highlights a general problem we have in other places.
Alternative solutions
- Use a better source of random, e.g. cntvct_el0 on arm64. Related: #72387 (comment)