Remove CPU cache contention for instrumented code #81932
Merged
Closes #76520
Problem
Tier0 with instrumentation demonstrates up to 5x lower RPS than the non-instrumented Tier0.
After various investigations we came to the conclusion that the main overhead is not from the counters/class probes themselves in the codegen, but from the cache contention they cause by updating the same memory addresses from different threads. Block counters and class probes are each responsible for roughly half of it. I verified that by doing two things:
1. inc [(reloc)] for block counters ("fake counters").
2. JIT_ClassProfile32 made a no-op in the VM while the JIT still emits calls to it for class probes.
Both steps restored the missing RPS from Tier0-instr. More experiments are listed here:
After some brainstorming with @jakobbotsch and @AndyAyersMS we detected a low-hanging fruit in HandleHistogramProfileRand: its random state can be kept in TLS. TLS random might slightly decrease the quality of the class histogram if the same method is invoked with different objects in parallel, but that sounds like a reasonable price for reducing the contention.
Benchmarks
Tier0 TE results. Both Base and Diff use PGO instrumentation; Diff includes this fix.
Plaintext-MVC (chart)
JSON-JSON (chart)
Up to +70% RPS. While this will not necessarily result in a faster "time to start/first request", it will definitely improve the throughput/latency of not-fully-warmed-up code. It also highlights a general problem we have in other places.
Alternative solutions
- Use a better source of random, e.g. cntvct_el0 on arm64. Related: #72387 (comment)