JIT_CountProfile: Avoid BSR instruction for low counts #110258
Conversation
The JIT_CountProfile32 and JIT_CountProfile64 functions used BitScanReverse to compute the log2 of a counter and then compared that to a threshold. This changes it so that the counter is compared directly to `1<<threshold` instead. This saves a branch and the BitScanReverse when the counter is below the threshold.

By not zero-initializing `logCount` we can prevent MSVC from issuing a pointless zero-write to the stack. Unfortunately, MSVC still insists on storing `logCount` to the stack for unknown reasons. Here's the x64 diff for `JIT_CountProfile32`:

Codegen for LLVM and GCC looks decent (https://godbolt.org/z/ersf3KjTz).

I decided to create this PR after talking to @AndyAyersMS about it.
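For context, here is a rough sketch of the change in plain C++. This is a paraphrase of the counting logic rather than the actual CoreCLR helper; the function name, the `threshold` parameter, and the elided approximate-counting branch are simplifications of mine.

```cpp
// Hedged sketch, not the real JIT_CountProfile32. "threshold" stands in for the
// log2 scalable-count threshold (e.g. 13, i.e. exact counting up to 8192).
void CountProfile32_Sketch(volatile long* pCounter, unsigned threshold)
{
    long count = *pCounter;

    // Before: unconditionally compute log2(count) with BitScanReverse and then
    // compare that to the threshold:
    //   unsigned long logCount;
    //   _BitScanReverse(&logCount, (unsigned long)count | 1);
    //   if (logCount < threshold) { *pCounter = count + 1; return; }

    // After: compare the counter directly against 1 << threshold, so the common
    // low-count path needs neither the bit scan nor its stack-allocated output.
    if ((unsigned long)count < (1ul << threshold))
    {
        *pCounter = count + 1;   // exact counting regime
        return;
    }

    // Above the threshold the helper switches to approximate counting, which
    // still needs log2(count) to scale the update probability (elided here).
}
```

The real helpers also have a 64-bit counter variant and the probabilistic update path; the sketch only shows the shape of the fast path that this PR changes.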
@EgorBo FYI. We should try this on the TechEmpower Tier0+instr overhead test. @erozenfeld (if you're not too busy with other stuff) perhaps you can help us understand why MSVC is stack allocating here...?
Also @jakobbotsch you might find this interesting.
I found reports for similar cases on developercommunity.visualstudio.com [1], [2], where MSVC initially allocates a stack slot, which is optimized by a later pass. But apparently this pass is not able to eliminate all stack accesses. Phase-ordering in compilers is tricky...
At least so far the EgorBot results do not look compelling. However, I think we also need to up the exact count threshold: after 8192 counts we switch over to approximate counting and always BSR. Likely BDN hasn't even started measuring before we count that high, so we don't see the impact. So maybe we could add […]. Of course that means that with our current threshold, any benefit from this change is limited to the warmup period, but as we've seen with some apps, the overhead here matters too.
I had a small issue in the bot for --profiler (results were fine). Link for the recent run: EgorBot/runtime-utils#183 (comment)
The disassembly from EgorBot shows some interesting differences from what I get with my x64 Windows release build:
I don't understand enough to tell whether these differences are expected or not. In any case, I don't think that either of those things affects the results in a meaningful way. But when I ran the benchmark on my machine, I saw that […]. Here's the single-threaded, non-BDN workload I used to measure, where […]:

Alternative workload code:
var data = (new float[5000]).AsSpan();
for (var i = 0; i < data.Length; i++)
data[i] = i;
var sum = 0f;
for(var i=0; i < 200; i++) //repetitions can be tuned such that we don't start measuring tier1 code
{
Random.Shared.Shuffle(data);
sum += FindMedian(data);
}
Console.WriteLine(sum);
static float FindMedian(Span<float> data)
{
static float QuickSelect(Span<float> data, int k)
{
var left = 0;
var right = data.Length - 1;
for (; ; )
{
if (left == right)
return data[left];
var pivotIndex = Partition(data, left, right);
if (k == pivotIndex)
return data[k];
else if (k < pivotIndex)
right = pivotIndex - 1;
else
left = pivotIndex + 1;
}
}
static int Partition(Span<float> data, int left, int right)
{
var pivot = data[right];
var i = left;
for (var j = left; j < right; j++)
if (data[j] <= pivot)
{
(data[i], data[j]) = (data[j], data[i]);
i++;
}
(data[i], data[right]) = (data[right], data[i]);
return i;
}
int n = data.Length;
int middle = n / 2;
if (n % 2 == 1)
{
return QuickSelect(data, middle);
}
else
{
var left = QuickSelect(data, middle - 1);
var right = QuickSelect(data, middle);
return (left + right) / 2f;
}
}

And here's a profile of a local run: https://share.firefox.dev/3ZgkJ6d.
Afair, they're never inlined (__tls_get_addr) on Linux in shared libraries.
Because it's native code, how can it bake a runtime parameter in as a constant?
Do you mean the benchmark with Parallel? You can ignore it; it's bottlenecked in SpinWait.
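A brief aside on the first point, since the footnote it answers did not survive extraction: on Linux, a `thread_local` access from code built as a position-independent shared library is typically lowered to a call to `__tls_get_addr` (general-dynamic TLS model), and that call lives in the dynamic loader, so the compiler cannot inline it. A minimal sketch, assuming a GCC/Clang build with `-fPIC -shared`:

```cpp
// tls_sketch.cpp -- hypothetical example, not code from this PR.
// Built with: g++ -O2 -fPIC -shared tls_sketch.cpp -o libtls_sketch.so
// the increment below typically compiles to a call to __tls_get_addr
// rather than an inlined thread-pointer-relative access.
thread_local int t_counter = 0;

int bump_counter()
{
    return ++t_counter;   // TLS access -> __tls_get_addr call on Linux/x64
}
```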
@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;
public class Bench
{
int a = 0;
int b = 0;
int c = 0;
int d = 0;
int e = 0;
[Benchmark]
public int Test()
{
int zeros = 0;
Parallel.For(0, 1024, i =>
{
int zeros = 0;
int nonzeros = 0;
if (a == 0)
zeros++;
else nonzeros++;
if (b == 0)
zeros++;
else nonzeros++;
if (c == 0)
zeros++;
else nonzeros++;
if (d == 0)
zeros++;
else nonzeros++;
if (e == 0)
zeros++;
else nonzeros++;
});
return zeros;
}
}
@AndyAyersMS I think Parallel.For here is not a good scenario - it is bottlenecked in SpinWait, because it has a big queue of small tasks. In order to check performance under cache contention we need a benchmark where we create, say, 16 threads, and all of them do some long-running job.
I don't expect this PR to alter contention behavior; that is what the approximate counting (and 64-bit counters, when needed) is for. So this is more about the raw cost of counting, especially for the exact counting regime.
@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;
public class Bench
{
int a = 0;
int b = 0;
int c = 0;
int d = 0;
int e = 0;
[Benchmark]
public int Test()
{
int zeros = 0;
for (int i = 0; i < 1024; i++)
{
int zeros = 0;
int nonzeros = 0;
if (a == 0)
zeros++;
else nonzeros++;
if (b == 0)
zeros++;
else nonzeros++;
if (c == 0)
zeros++;
else nonzeros++;
if (d == 0)
zeros++;
else nonzeros++;
if (e == 0)
zeros++;
else nonzeros++;
}
return zeros;
}
}
I need a C# syntax checker integrated into GitHub comments.
@EgorBot -intel -arm -profiler --envvars DOTNET_TC_CallCounting:0 DOTNET_TieredPGO_InstrumentOnlyHotCode:0 DOTNET_TieredPGO_ScalableCountThreshold:1E

using BenchmarkDotNet.Attributes;
public class Bench
{
int a = 0;
int b = 0;
int c = 0;
int d = 0;
int e = 0;
[Benchmark]
public int Test()
{
int zeros = 0;
for (int i = 0; i < 1024; i++)
{
int nonzeros = 0;
if (a == 0)
zeros++;
else nonzeros++;
if (b == 0)
zeros++;
else nonzeros++;
if (c == 0)
zeros++;
else nonzeros++;
if (d == 0)
zeros++;
else nonzeros++;
if (e == 0)
zeros++;
else nonzeros++;
}
return zeros;
}
}
Results are looking good for x64, meh for Cobalt. Feel free to merge if you like it.
Yes, the results look good.
Copying this for posterity
@sfiruch thank you!
Thank you and your team for making it easy to contribute a tiny bit!
This is the same issue as https://developercommunity.visualstudio.com/t/MSVC-not-optimizing-away-trivial-useless/10618521, and I fixed it in October. We were too conservative with volatiles, which affected writes to address-taken non-volatiles in the same function. The fix shipped in 17.12. It looks like https://godbolt.org/ is still on 17.10.
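A minimal sketch of the pattern being described, assuming the shape of the helper from this PR rather than the test case in the linked report (the names and the exact snippet are mine): an address-taken non-volatile local next to a volatile store, where pre-17.12 MSVC reportedly kept a redundant stack store of the local.

```cpp
#include <intrin.h>

// Hypothetical repro sketch; whether this exact snippet shows the redundant
// store on older MSVC versions is an assumption, not something verified here.
void count_sketch(volatile long* pCounter, unsigned threshold)
{
    long count = *pCounter;
    unsigned long logCount;                        // address-taken, not zero-initialized
    _BitScanReverse(&logCount, (unsigned long)count | 1);
    if (logCount < threshold)
        *pCounter = count + 1;                     // volatile store in the same function
}
```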