
JIT_CountProfile: Avoid BSR instruction for low counts #110258

Merged
merged 6 commits into dotnet:main on Dec 4, 2024

Conversation

sfiruch
Contributor

@sfiruch sfiruch commented Nov 28, 2024

I decided to create this PR after talking to @AndyAyersMS about it.

The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This PR changes them so that the counter is compared directly to 1<<threshold instead. This saves a branch and the BitScanReverse when the counter is below the threshold.

By not zero-initializing logCount we can prevent MSVC from issuing a pointless zero-write to the stack. Unfortunately, MSVC still insists on storing logCount to the stack for unknown reasons. Here's the x64 diff for JIT_CountProfile32:

  mov eax, DWORD PTR [rcx]
  mov r8, rcx
  mov r9d, 1

-  test eax, eax
-  jle SHORT $addCounter
+  cmp eax, 8192 ; 00002000H
+  jl SHORT $addCounter

+  mov ecx, DWORD PTR _tls_index
  bsr eax, eax
-  mov DWORD PTR logCount$1[rsp], 0
  mov DWORD PTR logCount$1[rsp], eax
-  cmp eax, 13
-  jb SHORT $addCounter

-  mov ecx, DWORD PTR _tls_index
  add eax, -12 ; fffffff4H
+  mov edx, eax
-  shlx r9d, r9d, eax
  mov rax, QWORD PTR gs:88
+  shlx r9d, r9d, eax
  mov edx, OFFSET FLAT:unsigned int `unsigned int HandleHistogramProfileRand(void)'::`2'::s_rng
  mov rcx, QWORD PTR [rax+rcx*8]
  add rcx, rdx
  mov edx, DWORD PTR [rcx]
  mov eax, edx
  shl eax, 13
  xor edx, eax
  mov eax, edx
  shr eax, 17
  xor edx, eax
  mov eax, edx
  shl eax, 5
  xor edx, eax
  lea eax, DWORD PTR [r9-1]
  mov DWORD PTR [rcx], edx
  test eax, edx
  jne SHORT $exit
$addCounter:
  lock add DWORD PTR [r8], r9d
$exit:
  ret 0

Codegen for LLVM and GCC looks decent (https://godbolt.org/z/ersf3KjTz).
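To make the control-flow change concrete, here is a simplified sketch of the old vs. new fast path. This is illustrative only: the function names, the threshold parameter, and the stand-in RNG are made up for this description; the real helpers are JIT_CountProfile32/JIT_CountProfile64 in src/coreclr/vm/jithelpers.cpp, where the threshold comes from g_pConfig->TieredPGO_ScalableCountThreshold().

#include <intrin.h>

// Stand-in for the runtime's HandleHistogramProfileRand: a xorshift RNG using the
// same shift constants visible in the asm above (thread-local state and the seed
// are simplified here).
static unsigned HandleHistogramProfileRand()
{
    static unsigned s_rng = 0x2A; // hypothetical seed; thread_local in the real code
    unsigned x = s_rng;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return s_rng = x;
}

// Old shape: compute log2(count) with BSR on every call, then compare to the threshold.
void CountProfile_Old(volatile long* pCounter, unsigned threshold)
{
    long count = *pCounter;
    long delta = 1;
    unsigned long logCount = 0;
    if (count > 0)
    {
        _BitScanReverse(&logCount, (unsigned long)count);   // BSR even for small counts
        if (logCount >= threshold)
        {
            // Approximate counting: add a larger delta with probability 1/delta.
            delta = 1 << (logCount - threshold + 1);
            if ((HandleHistogramProfileRand() & (delta - 1)) != 0)
                return;
        }
    }
    _InterlockedExchangeAdd(pCounter, delta);
}

// New shape: compare count directly against 1 << threshold; BSR only runs once the
// counter is already large enough for approximate counting.
void CountProfile_New(volatile long* pCounter, unsigned threshold)
{
    long count = *pCounter;
    long delta = 1;
    if (count >= (1 << threshold))
    {
        unsigned long logCount;                              // no zero-init needed
        _BitScanReverse(&logCount, (unsigned long)count);
        delta = 1 << (logCount - threshold + 1);
        if ((HandleHistogramProfileRand() & (delta - 1)) != 0)
            return;
    }
    _InterlockedExchangeAdd(pCounter, delta);
}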

The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead.
The dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) on Nov 28, 2024
@AndyAyersMS
Member

@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test.

@erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?

@AndyAyersMS
Member

Also, @jakobbotsch, you might find this interesting.

@sfiruch
Contributor Author

sfiruch commented Nov 29, 2024

@erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?

I found reports of similar cases on developercommunity.visualstudio.com [1], [2], where MSVC initially allocates a stack slot which is then optimized away by a later pass. But apparently this pass is not able to eliminate all stack accesses. Phase-ordering in compilers is tricky...

@EgorBo
Member

EgorBo commented Nov 30, 2024

@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test.

I won't be around for the next few weeks. I think it's fine to merge as long as it looks ok to you, and then watch the dashboard?

@AndyAyersMS
Member

At least so far the EgorBot results do not look compelling. However, I think we also need to up the exact-count threshold. After 8192 counts we switch over to approximate counting and always use BSR. Likely BDN hasn't even started measuring before we count that high, so we don't see the impact.

So maybe we could add DOTNET_TieredPGO_ScalableCountThreshold=1E? Recall this is hex, so we'll count up to 2^30 before switching to approximate counting.

Of course that means that with our current threshold, any benefit from this change is limited to the warmup period, but as we've seen with some apps, the overhead here matters too.
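For reference, a quick check of that arithmetic (a standalone snippet; the hex parsing of the env var is as described above):

// DOTNET_TieredPGO_ScalableCountThreshold is read as hex, so "1E" means 30.
// With the direct comparison from this PR, the cheap exact-counting path is taken
// while the counter is below 1 << 30, i.e. for roughly the first billion increments.
static_assert(0x1E == 30, "1E hex is 30 decimal");
static_assert((1u << 30) == 1'073'741'824u, "2^30 increments before approximate counting kicks in");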

@EgorBo
Member

EgorBo commented Nov 30, 2024

At least so far the EgorBot results do not look compelling.

I had a small issue in the bot for --profiler (results were fine).
So hot ASM for Main: https://gist.github.com/EgorBot/90e930ff81d3ab7ebeaf0b73a8d7ee10#file-base_asm_ded8b906-asm-L8-L52
hot ASM for this PR: https://gist.github.com/EgorBot/4be29734afc8d3925187304963f3d9c1#file-diff_asm_ded8b906-asm-L8-L53

link for the recent run: EgorBot/runtime-utils#183 (comment)

@EgorBo

This comment was marked as off-topic.

@sfiruch
Contributor Author

sfiruch commented Dec 1, 2024

The disassembly from EgorBot shows some interesting differences from what I get with my x64 Windows release build:

  • TLS accesses are not inlined
  • The value of g_pConfig->TieredPGO_ScalableCountThreshold() is not inlined as a constant¹

I don't understand enough to tell whether these differences are expected or not. In any case, I don't think either of those things affects the results in a meaningful way. But when I ran the benchmark on my machine, I saw that JIT_CountProfile32 accounts for less than 15% of CPU time in the benchmark (excluding startup/shutdown). The results were also quite noisy because of the threading involved. AFAIK, BDN focuses on steady-state, not the warm-up period. This PR mostly affects the warm-up period, and I'm not sure whether this is observable using BDN, even with DOTNET_TieredPGO_ScalableCountThreshold.

Here's the single-threaded, non-BDN workload I used to measure, where JIT_CountProfile32 accounts for 50% of CPU usage (excluding startup/shutdown):

Alternative workload code
var data = (new float[5000]).AsSpan();
for (var i = 0; i < data.Length; i++)
    data[i] = i;

var sum = 0f;
for (var i = 0; i < 200; i++) // repetitions can be tuned such that we don't start measuring tier1 code
{
    Random.Shared.Shuffle(data);
    sum += FindMedian(data);
}
Console.WriteLine(sum);

static float FindMedian(Span<float> data)
{
    static float QuickSelect(Span<float> data, int k)
    {
        var left = 0;
        var right = data.Length - 1;

        for (; ; )
        {
            if (left == right)
                return data[left];

            var pivotIndex = Partition(data, left, right);
            if (k == pivotIndex)
                return data[k];
            else if (k < pivotIndex)
                right = pivotIndex - 1;
            else
                left = pivotIndex + 1;
        }
    }

    static int Partition(Span<float> data, int left, int right)
    {
        var pivot = data[right];
        var i = left;

        for (var j = left; j < right; j++)
            if (data[j] <= pivot)
            {
                (data[i], data[j]) = (data[j], data[i]);
                i++;
            }

        (data[i], data[right]) = (data[right], data[i]);
        return i;
    }

    int n = data.Length;
    int middle = n / 2;

    if (n % 2 == 1)
    {
        return QuickSelect(data, middle);
    }
    else
    {
        var left = QuickSelect(data, middle - 1);
        var right = QuickSelect(data, middle);
        return (left + right) / 2f;
    }
}

And here's a profile of a local run: https://share.firefox.dev/3ZgkJ6d.

Footnotes

  1. I don't understand how a potentially configurable value could be inlined on my machine?!

@EgorBo
Member

EgorBo commented Dec 1, 2024

TLS accesses are not inlined

AFAIR, they're never inlined (__tls_get_addr) on Linux in shared libraries.

The value of g_pConfig->TieredPGO_ScalableCountThreshold() is not inlined as a constant¹

Because it's native code; how could it bake a runtime parameter in as a constant?

accounts for less than 15% of CPU time in the benchmark (excluding startup/shutdown). The results were also quite noisy because of the threading involved.

Do you mean the benchmark with Parallel? You can ignore it; it's bottlenecked in SpinWait.

@AndyAyersMS

This comment was marked as outdated.

@AndyAyersMS
Member

@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;

public class Bench
{
    int a = 0;
    int b = 0;
    int c = 0;
    int d = 0;
    int e = 0;

    [Benchmark]
    public int Test()
    {
        int zeros = 0;
        Parallel.For(0, 1024, i =>
        {
            int zeros = 0;
            int nonzeros = 0;
            if (a == 0)
                zeros++;
            else nonzeros++;
            if (b == 0)
                zeros++;
            else nonzeros++;
            if (c == 0)
                zeros++;
            else nonzeros++;
            if (d == 0)
                zeros++;
            else nonzeros++;
            if (e == 0)
                zeros++;
            else nonzeros++;
        });
        return zeros;
    }
}

@EgorBo
Member

EgorBo commented Dec 2, 2024

@AndyAyersMS I think Parallel.For here is not a good scenario - it is bottlenecked in SpinWait, because it has a big queue of small tasks. In order to check performance under cache contention we need a benchmark where we create, say, 16 threads that all do some long-running job.

@AndyAyersMS
Member

@AndyAyersMS I think Parallel.For here is not a good scenario - it is bottlenecked in SpinWait, because it has a big queue of small tasks. In order to check performance under cache contention we need a benchmark where we create, say, 16 threads that all do some long-running job.

I don't expect this PR to alter contention behavior; that is what the approximate counting (and 64-bit counters, when needed) is for. So this is more about the raw cost of counting, especially in the exact-counting regime.

@AndyAyersMS

This comment was marked as outdated.

@AndyAyersMS
Member

@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;

public class Bench
{
    int a = 0;
    int b = 0;
    int c = 0;
    int d = 0;
    int e = 0;

    [Benchmark]
    public int Test()
    {
        int zeros = 0;
        for (int i = 0; i < 1024; i++)
        {
            int zeros = 0;
            int nonzeros = 0;
            if (a == 0)
                zeros++;
            else nonzeros++;
            if (b == 0)
                zeros++;
            else nonzeros++;
            if (c == 0)
                zeros++;
            else nonzeros++;
            if (d == 0)
                zeros++;
            else nonzeros++;
            if (e == 0)
                zeros++;
            else nonzeros++;
        }
        return zeros;
    }
}

@AndyAyersMS
Member

I need a C# syntax checker integrated into GitHub comments.

@sfiruch
Contributor Author

sfiruch commented Dec 3, 2024

@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;

public class Bench
{
    int a = 0;
    int b = 0;
    int c = 0;
    int d = 0;
    int e = 0;

    [Benchmark]
    public int Test()
    {
        int zeros = 0;
        for (int i = 0; i < 1024; i++)
        {
            int nonzeros = 0;
            if (a == 0)
                zeros++;
            else nonzeros++;
            if (b == 0)
                zeros++;
            else nonzeros++;
            if (c == 0)
                zeros++;
            else nonzeros++;
            if (d == 0)
                zeros++;
            else nonzeros++;
            if (e == 0)
                zeros++;
            else nonzeros++;
        }
        return zeros;
    }
}

@sfiruch
Contributor Author

sfiruch commented Dec 3, 2024

Results are looking good for x64, meh for Cobalt. Feel free to merge if you like it.

@sfiruch sfiruch marked this pull request as ready for review December 3, 2024 15:14
Member

@AndyAyersMS AndyAyersMS left a comment


Yes, the results look good.

src/coreclr/vm/jithelpers.cpp (review thread outdated; resolved)
@AndyAyersMS
Member

Copying this for posterity

BenchmarkDotNet v0.14.0, Ubuntu 24.04.1 LTS (Noble Numbat)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-SYGFHD : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-NQWYCW : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_TC_CallCounting=0,DOTNET_TieredPGO_InstrumentOnlyHotCode=0,DOTNET_TieredPGO_ScalableCountThreshold=1E
| Method | Toolchain | Mean     | Error     | Ratio |
|--------|-----------|---------:|----------:|------:|
| Test   | Main      | 8.718 μs | 0.0499 μs |  1.00 |
| Test   | PR        | 7.518 μs | 0.0251 μs |  0.86 |

@AndyAyersMS AndyAyersMS merged commit f7974ea into dotnet:main Dec 4, 2024
86 of 90 checks passed
@AndyAyersMS
Member

@sfiruch thank you!

@sfiruch
Contributor Author

sfiruch commented Dec 4, 2024

@sfiruch thank you!

Thank you and your team for making it easy to contribute a tiny bit!

@erozenfeld
Member

erozenfeld commented Dec 4, 2024

@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test.

@erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?

This is the same issue as https://developercommunity.visualstudio.com/t/MSVC-not-optimizing-away-trivial-useless/10618521 and I fixed it in October. We were too conservative with volatiles that affected writes to address-taken non-volatiles in the same function. The fix shipped in 17.12. Looks like https://godbolt.org/ is still on 17.10.
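For context, here is a minimal sketch of the pattern being described (a hypothetical repro shape inferred from the explanation above, not the code from the linked report): a store to an address-taken, otherwise-unused non-volatile local in a function that also performs volatile accesses.

#include <intrin.h>

// Hypothetical repro shape: per the explanation above, older MSVC (before 17.12)
// could keep the dead stack store to the address-taken local `logCount` because the
// same function also touches a volatile location; 17.12 reportedly eliminates it.
void CountSketch(volatile long* pCounter)
{
    long count = *pCounter;                               // volatile read
    if (count >= (1 << 13))
    {
        unsigned long logCount;                           // address-taken, never read afterwards
        _BitScanReverse(&logCount, (unsigned long)count);
    }
    _InterlockedExchangeAdd(pCounter, 1);                 // interlocked (volatile) write
}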

eduardo-vp pushed a commit to eduardo-vp/runtime that referenced this pull request Dec 5, 2024
The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead.
Labels: area-VM-coreclr, community-contribution