
JIT_CountProfile: Avoid BSR instruction for low counts #110258

Merged
merged 6 commits into dotnet:main on Dec 4, 2024

Conversation

sfiruch
Contributor

@sfiruch sfiruch commented Nov 28, 2024

I decided to create this PR after talking to @AndyAyersMS about it.

The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This PR changes them so that the counter is compared directly to 1<<threshold instead. This saves a branch and the BitScanReverse when the counter is below the threshold.

By not zero-initializing logCount we can prevent MSVC from issuing a pointless zero-write to the stack. Unfortunately, MSVC still insists on storing logCount to the stack for unknown reasons. Here's the x64 diff for JIT_CountProfile32:

  mov eax, DWORD PTR [rcx]
  mov r8, rcx
  mov r9d, 1

-  test eax, eax
-  jle SHORT $addCounter
+  cmp eax, 8192 ; 00002000H
+  jl SHORT $addCounter

+  mov ecx, DWORD PTR _tls_index
  bsr eax, eax
-  mov DWORD PTR logCount$1[rsp], 0
  mov DWORD PTR logCount$1[rsp], eax
-  cmp eax, 13
-  jb SHORT $addCounter

-  mov ecx, DWORD PTR _tls_index
  add eax, -12 ; fffffff4H
+  mov edx, eax
-  shlx r9d, r9d, eax
  mov rax, QWORD PTR gs:88
+  shlx r9d, r9d, eax
  mov edx, OFFSET FLAT:unsigned int `unsigned int HandleHistogramProfileRand(void)'::`2'::s_rng
  mov rcx, QWORD PTR [rax+rcx*8]
  add rcx, rdx
  mov edx, DWORD PTR [rcx]
  mov eax, edx
  shl eax, 13
  xor edx, eax
  mov eax, edx
  shr eax, 17
  xor edx, eax
  mov eax, edx
  shl eax, 5
  xor edx, eax
  lea eax, DWORD PTR [r9-1]
  mov DWORD PTR [rcx], edx
  test eax, edx
  jne SHORT $exit
$addCounter:
  lock add DWORD PTR [r8], r9d
$exit:
  ret 0

Codegen for LLVM and GCC looks decent (https://godbolt.org/z/ersf3KjTz).
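To make the control-flow change concrete, here is a simplified sketch of the old vs. new fast path. This is illustrative only: the function names, the threshold parameter, and the stand-in RNG are made up for this description; the real helpers are JIT_CountProfile32/JIT_CountProfile64 in src/coreclr/vm/jithelpers.cpp, where the threshold comes from g_pConfig->TieredPGO_ScalableCountThreshold().

#include <intrin.h>

// Stand-in for the runtime's HandleHistogramProfileRand: a xorshift RNG using the
// same shift constants visible in the asm above (thread-local state and the seed
// are simplified here).
static unsigned HandleHistogramProfileRand()
{
    static unsigned s_rng = 0x2A; // hypothetical seed; thread_local in the real code
    unsigned x = s_rng;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return s_rng = x;
}

// Old shape: compute log2(count) with BSR on every call, then compare to the threshold.
void CountProfile_Old(volatile long* pCounter, unsigned threshold)
{
    long count = *pCounter;
    long delta = 1;
    unsigned long logCount = 0;
    if (count > 0)
    {
        _BitScanReverse(&logCount, (unsigned long)count);   // BSR even for small counts
        if (logCount >= threshold)
        {
            // Approximate counting: add a larger delta with probability 1/delta.
            delta = 1 << (logCount - threshold + 1);
            if ((HandleHistogramProfileRand() & (delta - 1)) != 0)
                return;
        }
    }
    _InterlockedExchangeAdd(pCounter, delta);
}

// New shape: compare count directly against 1 << threshold; BSR only runs once the
// counter is already large enough for approximate counting.
void CountProfile_New(volatile long* pCounter, unsigned threshold)
{
    long count = *pCounter;
    long delta = 1;
    if (count >= (1 << threshold))
    {
        unsigned long logCount;                              // no zero-init needed
        _BitScanReverse(&logCount, (unsigned long)count);
        delta = 1 << (logCount - threshold + 1);
        if ((HandleHistogramProfileRand() & (delta - 1)) != 0)
            return;
    }
    _InterlockedExchangeAdd(pCounter, delta);
}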

The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead.
The dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) on Nov 28, 2024
@AndyAyersMS
Member

@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test.

@erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?

@AndyAyersMS
Member

Also, @jakobbotsch, you might find this interesting.

@sfiruch
Contributor Author

sfiruch commented Nov 29, 2024

@erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?

I found reports of similar cases on developercommunity.visualstudio.com [1], [2], where MSVC initially allocates a stack slot which is then optimized away by a later pass. But apparently this pass is not able to eliminate all stack accesses. Phase-ordering in compilers is tricky...

@EgorBo
Member

EgorBo commented Nov 30, 2024

@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test.

I won't be around for the next few weeks. I think it's fine to merge as long as it looks ok to you, and then watch the dashboard?

@AndyAyersMS
Member

At least so far the EgorBot results do not look compelling. However, I think we also need to up the exact-count threshold. After 8192 counts we switch over to approximate counting and always use BSR. Likely BDN hasn't even started measuring before we count that high, so we don't see the impact.

So maybe we could add DOTNET_TieredPGO_ScalableCountThreshold=1E? Recall this is hex, so we'll count up to 2^30 before switching to approximate counting.

Of course that means that with our current threshold, any benefit from this change is limited to the warmup period, but as we've seen with some apps, the overhead here matters too.
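For reference, a quick check of that arithmetic (a standalone snippet; the hex parsing of the env var is as described above):

// DOTNET_TieredPGO_ScalableCountThreshold is read as hex, so "1E" means 30.
// With the direct comparison from this PR, the cheap exact-counting path is taken
// while the counter is below 1 << 30, i.e. for roughly the first billion increments.
static_assert(0x1E == 30, "1E hex is 30 decimal");
static_assert((1u << 30) == 1'073'741'824u, "2^30 increments before approximate counting kicks in");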

@EgorBo
Member

EgorBo commented Nov 30, 2024

At least so far the EgorBot results do not look compelling.

I had a small issue in the bot for --profiler (results were fine).
So hot ASM for Main: https://gist.github.com/EgorBot/90e930ff81d3ab7ebeaf0b73a8d7ee10#file-base_asm_ded8b906-asm-L8-L52
hot ASM for this PR: https://gist.github.com/EgorBot/4be29734afc8d3925187304963f3d9c1#file-diff_asm_ded8b906-asm-L8-L53

link for the recent run: EgorBot/runtime-utils#183 (comment)

@EgorBo

This comment was marked as off-topic.

@sfiruch
Contributor Author

sfiruch commented Dec 1, 2024

The disassembly from EgorBot shows some interesting differences from what I get with my x64 Windows release build:

  • TLS accesses are not inlined
  • The value of g_pConfig->TieredPGO_ScalableCountThreshold() is not inlined as a constant¹

I don't understand enough to tell whether these differences are expected or not. In any case, I don't think either of those things affects the results in a meaningful way. But when I ran the benchmark on my machine, I saw that JIT_CountProfile32 accounts for less than 15% of CPU time in the benchmark (excluding startup/shutdown). The results were also quite noisy because of the threading involved. AFAIK, BDN focuses on steady-state, not the warm-up period. This PR mostly affects the warm-up period, and I'm not sure whether this is observable using BDN, even with DOTNET_TieredPGO_ScalableCountThreshold.

Here's the single-threaded, non-BDN workload I used to measure, where JIT_CountProfile32 accounts for 50% of CPU usage (excluding startup/shutdown):

Alternative workload code
var data = (new float[5000]).AsSpan();
for (var i = 0; i < data.Length; i++)
    data[i] = i;

var sum = 0f;
for (var i = 0; i < 200; i++) // repetitions can be tuned such that we don't start measuring tier1 code
{
    Random.Shared.Shuffle(data);
    sum += FindMedian(data);
}
Console.WriteLine(sum);

static float FindMedian(Span<float> data)
{
    static float QuickSelect(Span<float> data, int k)
    {
        var left = 0;
        var right = data.Length - 1;

        for (; ; )
        {
            if (left == right)
                return data[left];

            var pivotIndex = Partition(data, left, right);
            if (k == pivotIndex)
                return data[k];
            else if (k < pivotIndex)
                right = pivotIndex - 1;
            else
                left = pivotIndex + 1;
        }
    }

    static int Partition(Span<float> data, int left, int right)
    {
        var pivot = data[right];
        var i = left;

        for (var j = left; j < right; j++)
            if (data[j] <= pivot)
            {
                (data[i], data[j]) = (data[j], data[i]);
                i++;
            }

        (data[i], data[right]) = (data[right], data[i]);
        return i;
    }

    int n = data.Length;
    int middle = n / 2;

    if (n % 2 == 1)
    {
        return QuickSelect(data, middle);
    }
    else
    {
        var left = QuickSelect(data, middle - 1);
        var right = QuickSelect(data, middle);
        return (left + right) / 2f;
    }
}

And here's a profile of a local run: https://share.firefox.dev/3ZgkJ6d.

Footnotes

  1. I don't understand how a potentially configurable value could be inlined on my machine?!

@EgorBo
Member

EgorBo commented Dec 1, 2024

TLS accesses are not inlined

AFAIR, they're never inlined (__tls_get_addr) on Linux in shared libraries.

The value of g_pConfig->TieredPGO_ScalableCountThreshold() is not inlined as a constant¹

Because it's native code; how could it bake a runtime parameter in as a constant?

accounts for less than 15% of CPU time in the benchmark (excluding startup/shutdown). The results were also quite noisy because of the threading involved.

Do you mean the benchmark with Parallel? You can ignore it; it's bottlenecked in SpinWait.

@AndyAyersMS

This comment was marked as outdated.

@AndyAyersMS
Member

@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;

public class Bench
{
    int a = 0;
    int b = 0;
    int c = 0;
    int d = 0;
    int e = 0;

    [Benchmark]
    public int Test()
    {
        int zeros = 0;
        Parallel.For(0, 1024, i =>
        {
            int zeros = 0;
            int nonzeros = 0;
            if (a == 0)
                zeros++;
            else nonzeros++;
            if (b == 0)
                zeros++;
            else nonzeros++;
            if (c == 0)
                zeros++;
            else nonzeros++;
            if (d == 0)
                zeros++;
            else nonzeros++;
            if (e == 0)
                zeros++;
            else nonzeros++;
        });
        return zeros;
    }
}

@EgorBo
Member

EgorBo commented Dec 2, 2024

@AndyAyersMS I think Parallel.For here is not a good scenario - it is bottlenecked in SpinWait, because it has a big queue of small tasks. In order to check performance under cache contention we need a benchmark where we create, say, 16 threads that all do some long-running job.

@AndyAyersMS
Member

@AndyAyersMS I think Parallel.For here is not a good scenario - it is bottlenecked in SpinWait, because it has a big queue of small tasks. In order to check performance under cache contention we need a benchmark where we create, say, 16 threads that all do some long-running job.

I don't expect this PR to alter contention behavior; that is what the approximate counting (and 64-bit counters, when needed) is for. So this is more about the raw cost of counting, especially in the exact-counting regime.

@AndyAyersMS

This comment was marked as outdated.

@AndyAyersMS
Member

@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;

public class Bench
{
    int a = 0;
    int b = 0;
    int c = 0;
    int d = 0;
    int e = 0;

    [Benchmark]
    public int Test()
    {
        int zeros = 0;
        for (int i = 0; i < 1024; i++)
        {
            int zeros = 0;
            int nonzeros = 0;
            if (a == 0)
                zeros++;
            else nonzeros++;
            if (b == 0)
                zeros++;
            else nonzeros++;
            if (c == 0)
                zeros++;
            else nonzeros++;
            if (d == 0)
                zeros++;
            else nonzeros++;
            if (e == 0)
                zeros++;
            else nonzeros++;
        }
        return zeros;
    }
}

@AndyAyersMS
Member

I need a C# syntax checker integrated into GitHub comments.

@sfiruch
Contributor Author

sfiruch commented Dec 3, 2024

@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;

public class Bench
{
    int a = 0;
    int b = 0;
    int c = 0;
    int d = 0;
    int e = 0;

    [Benchmark]
    public int Test()
    {
        int zeros = 0;
        for (int i = 0; i < 1024; i++)
        {
            int nonzeros = 0;
            if (a == 0)
                zeros++;
            else nonzeros++;
            if (b == 0)
                zeros++;
            else nonzeros++;
            if (c == 0)
                zeros++;
            else nonzeros++;
            if (d == 0)
                zeros++;
            else nonzeros++;
            if (e == 0)
                zeros++;
            else nonzeros++;
        }
        return zeros;
    }
}

@sfiruch
Contributor Author

sfiruch commented Dec 3, 2024

Results are looking good for x64, meh for Cobalt. Feel free to merge if you like it.

@sfiruch sfiruch marked this pull request as ready for review December 3, 2024 15:14
Member

@AndyAyersMS AndyAyersMS left a comment


Yes, the results look good.

src/coreclr/vm/jithelpers.cpp (review thread outdated; resolved)
@AndyAyersMS
Member

Copying this for posterity

BenchmarkDotNet v0.14.0, Ubuntu 24.04.1 LTS (Noble Numbat)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-SYGFHD : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-NQWYCW : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_TC_CallCounting=0,DOTNET_TieredPGO_InstrumentOnlyHotCode=0,DOTNET_TieredPGO_ScalableCountThreshold=1E
| Method | Toolchain | Mean     | Error     | Ratio |
|--------|-----------|---------:|----------:|------:|
| Test   | Main      | 8.718 μs | 0.0499 μs |  1.00 |
| Test   | PR        | 7.518 μs | 0.0251 μs |  0.86 |

@AndyAyersMS AndyAyersMS merged commit f7974ea into dotnet:main Dec 4, 2024
86 of 90 checks passed
@AndyAyersMS
Member

@sfiruch thank you!

@sfiruch
Contributor Author

sfiruch commented Dec 4, 2024

@sfiruch thank you!

Thank you and your team for making it easy to contribute a tiny bit!

@erozenfeld
Member

erozenfeld commented Dec 4, 2024

@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test.

@erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?

This is the same issue as https://developercommunity.visualstudio.com/t/MSVC-not-optimizing-away-trivial-useless/10618521 and I fixed it in October. We were too conservative with volatiles that affected writes to address-taken non-volatiles in the same function. The fix shipped in 17.12. Looks like https://godbolt.org/ is still on 17.10.
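For context, here is a minimal sketch of the pattern being described (a hypothetical repro shape inferred from the explanation above, not the code from the linked report): a store to an address-taken, otherwise-unused non-volatile local in a function that also performs volatile accesses.

#include <intrin.h>

// Hypothetical repro shape: per the explanation above, older MSVC (before 17.12)
// could keep the dead stack store to the address-taken local `logCount` because the
// same function also touches a volatile location; 17.12 reportedly eliminates it.
void CountSketch(volatile long* pCounter)
{
    long count = *pCounter;                               // volatile read
    if (count >= (1 << 13))
    {
        unsigned long logCount;                           // address-taken, never read afterwards
        _BitScanReverse(&logCount, (unsigned long)count);
    }
    _InterlockedExchangeAdd(pCounter, 1);                 // interlocked (volatile) write
}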

eduardo-vp pushed a commit to eduardo-vp/runtime that referenced this pull request Dec 5, 2024
The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of the counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead.
Labels: area-VM-coreclr, community-contribution