Remove CPU cache contention for instrumented code #81932

Merged 1 commit into dotnet:main on Feb 10, 2023

Conversation

EgorBo (Member) commented on Feb 10, 2023

Closes #76520

Problem

Tier0 with instrumentation demonstrates up to 5x lower RPS than the non-instrumented Tier0:

[chart: RPS of non-instrumented Tier0 vs. instrumented Tier0]

After various investigations we came to the conclusion that the main overhead comes not from the counters/class probes themselves in the codegen, but from the cache contention they cause by updating the same memory addresses from different threads. Block counters and class probes are each responsible for roughly half of it. I verified that by doing two things:

  1. Modified the JIT to update stack locals instead of emitting inc [reloc] for block counters ("fake counters").
  2. Made JIT_ClassProfile32 a no-op in the VM while the JIT still emits calls to it for class probes.

Both steps restored the missing RPS of the instrumented Tier0. More experiments are shown in the charts below:
[charts: additional experiment results]
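To make the contention concrete, here is a minimal standalone C++ sketch (not CoreCLR code; all names are illustrative) contrasting a shared counter, whose cache line bounces between cores the way a Tier0 block counter's does, with the private counter used in the "fake counters" experiment:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One word updated by every thread: the owning cache line ping-pongs
// between cores, just like a shared block counter in instrumented Tier0.
// (The JIT emits a plain `inc [addr]`, not an atomic op, but the cache
// line still has to move; std::atomic is used here only to keep the
// sketch free of data races.)
std::atomic<uint64_t> g_sharedCounter{0};

void ContendedLoop(int iters)
{
    for (int i = 0; i < iters; i++)
        g_sharedCounter.fetch_add(1, std::memory_order_relaxed);
}

// The "fake counters" variant: each thread bumps a private local, so no
// cache line is shared and the loop runs at full speed.
void PrivateLoop(int iters, uint64_t* out)
{
    uint64_t local = 0;
    for (int i = 0; i < iters; i++)
        local++;
    *out = local;
}

int main()
{
    constexpr int kThreads = 8;
    constexpr int kIters = 10'000'000;
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; t++)
        threads.emplace_back(ContendedLoop, kIters); // swap in PrivateLoop to compare
    for (auto& th : threads)
        th.join();
}
```

On a multi-core machine the contended variant runs many times slower than the private one, purely because of cache-line traffic.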

After some brainstorming with @jakobbotsch and @AndyAyersMS we found a low-hanging fruit in HandleHistogramProfileRand: switching its random state to thread-local storage (TLS). A TLS random might slightly decrease the quality of the class histogram when the same method is invoked with different objects in parallel, but that seems a reasonable price for removing the contention.
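A minimal sketch of the idea, assuming a Marsaglia xorshift64 generator with per-thread state (the names t_rngState and NextSample are illustrative, not the actual HandleHistogramProfileRand implementation):

```cpp
#include <cstdint>

// Per-thread RNG state: sampling no longer writes to a memory location
// shared across threads, so no cache line is contended.
static thread_local uint64_t t_rngState = 0;

static uint64_t NextSample()
{
    uint64_t x = t_rngState;
    if (x == 0)
    {
        // Lazily seed from the slot's own address so each thread starts
        // from a different, cheap-to-obtain nonzero value.
        x = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(&t_rngState)) | 1;
    }
    // Marsaglia xorshift64: a few cycles, state is a single 64-bit word.
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    t_rngState = x;
    return x;
}
```

Because the state lives in TLS, the sampling path touches only thread-private memory; the tradeoff is exactly the one noted above, where parallel invocations of the same method may sample less uniformly across threads.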

Benchmarks

Tier0 TE (TechEmpower) results. Both Base and Diff use PGO instrumentation; Diff includes this fix.

Plaintext-MVC

| load                   |                        Base |                        Diff |          |
| ---------------------- | --------------------------- | --------------------------- | -------- |
| Requests/sec           |                     130,722 |                     223,201 |  +70.74% |
| Latency 50th (ms)      |                       23.37 |                       10.89 |  -53.40% |
| Latency 75th (ms)      |                       32.82 |                       15.54 |  -52.65% |

JSON-JSON

| load                   |                        Base |                        Diff |          |
| ---------------------- | --------------------------- | --------------------------- | -------- |
| Requests/sec           |                     145,746 |                     249,565 |  +71.23% |
| Latency 50th (ms)      |                        1.75 |                        1.00 |  -42.86% |
| Latency 75th (ms)      |                        2.06 |                        1.20 |  -41.75% |

Up to +70% RPS. While this will not necessarily result in a faster time to start / time to first request, it will definitely improve the throughput and latency of code that is not yet fully warmed up. It also highlights a general problem we have in other places.

Alternative solutions

Use a better source of randomness, e.g. the cntvct_el0 counter-timer register on arm64. Related: #72387 (comment)
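For reference, a hypothetical sketch of reading that register (the function name is made up; this is not code from the PR). The register is readable from user mode in a few cycles and never touches shared memory, which is why it could serve as a per-core sampling source:

```cpp
#include <cstdint>

// Read the arm64 virtual counter-timer register via GCC/Clang inline
// assembly; MSVC exposes _ReadStatusReg for the same purpose.
static inline uint64_t ReadVirtualCounter()
{
#if defined(__aarch64__)
    uint64_t value;
    asm volatile("mrs %0, cntvct_el0" : "=r"(value));
    return value;
#else
    return 0; // not available on this architecture
#endif
}
```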

AndyAyersMS (Member) left a comment:

Nice!

EgorBo merged commit 0d95bf1 into dotnet:main on Feb 10, 2023
EgorBo deleted the tls-random-checkedsample branch on February 10, 2023 at 11:28
EgorBo (Member, Author) commented on Feb 11, 2023

[image]

The conversation was locked as resolved and limited to collaborators on Mar 13, 2023.
Development

Successfully merging this pull request may close these issues:

Optimize CheckSample (#76520)