ConcurrentQueue spending excess time in SpinWait #44077
Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley
cc: @kouvel
Thanks @alexcovington, some of those are large differences indeed. Since …
@kouvel would it be useful to run the TechEmpower benchmarks as well? I could run them on Citrine, AMD, and ARM machines using your and @alexcovington's changes.
Yes :)
That would be great @adamsitnik, thanks!
@kouvel I'm seeing some changes, but not nearly as drastic as changing the threshold (numbers based off the Skylake):
Ryzen:
I can email you the EPYC numbers, but the change for EPYC is about the same as Ryzen in this case. Edit: Pasted the wrong numbers for Ryzen; updated my comment above.
Thanks @alexcovington. Looks like it's just the contention in the same-thread tests. Since there are only two threads in the test doing enqueue-dequeue in a loop, the Sleep(1) would basically turn the test into a single-threaded test for the most part. In the different-thread test the two threads would mostly not be contending with one another. I'm still leaning towards not adding the Sleep(1) due to the long delays it can add. Do any of you think the same-thread test is realistic enough to be worth optimizing for? Maybe the test can be modified to be a bit more realistic, like for each iteration to do a batch of enqueues, then some random work, then a batch of dequeues with a bit of random work in-between dequeues.
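A sketch of the more realistic benchmark shape suggested above: each iteration does a batch of enqueues, some random work, then a batch of dequeues with random work in between. The batch size, iteration count, and `BusyWork` helper are illustrative choices, not part of the existing dotnet/performance test:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class BatchedQueueBenchmark
{
    const int BatchSize = 16;       // illustrative batch size
    const int Iterations = 10_000;  // illustrative iteration count

    static readonly ConcurrentQueue<int> s_queue = new ConcurrentQueue<int>();

    // Stand-in for "a bit of random work" between queue operations.
    static void BusyWork(Random random) => Thread.SpinWait(random.Next(1, 64));

    static void RunIteration(Random random)
    {
        // Batch of enqueues.
        for (int i = 0; i < BatchSize; i++)
            s_queue.Enqueue(i);

        // Some random work before draining.
        BusyWork(random);

        // Batch of dequeues, with random work in-between dequeues.
        for (int i = 0; i < BatchSize; i++)
        {
            while (!s_queue.TryDequeue(out _)) { }
            BusyWork(random);
        }
    }

    static void Main()
    {
        // Two threads, matching the two-thread shape of the existing test.
        Task t1 = Task.Run(() => { var r = new Random(1); for (int i = 0; i < Iterations; i++) RunIteration(r); });
        Task t2 = Task.Run(() => { var r = new Random(2); for (int i = 0; i < Iterations; i++) RunIteration(r); });
        Task.WaitAll(t1, t2);
        Console.WriteLine("done");
    }
}
```

Under this shape the two threads contend only intermittently, so the cost of a `Sleep(1)` fallback should show up far less than in the tight enqueue-dequeue loop.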
We haven't run the TechEmpower tests with these comparisons so that may also be interesting. |
@kouvel I ran the TechEmpower benchmark using Crank and the following options:

```shell
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml \
  --config amd.benchmarks.yml \
  --profile $PROFILE \
  --scenario plaintext \
  --application.framework netcoreapp5.0 \
  --application.options.outputFiles /path/to/runtime/$ARTIFACT_DIR/bin/testhost/** \
  --output $PROFILE.$ARTIFACT_DIR.json
```

I just want to confirm this is the right way to run the TechEmpower benchmarks with a local build? I'll be sending the numbers in an internal email thread shortly.
@alexcovington that looks correct, and it would be better to confirm you get the same numbers when building the assets without your changes or when using the nightly runtime. I assume right now it's using rc2, because these are the latest published bits. You can get the rtm ones (release branch) with … You can also use both saved results to generate a comparison table:
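The specific commands were lost from this copy of the thread; assuming the current crank CLI, saving each run to a file and diffing the two looks roughly like this (everything except `--output` and the `compare` subcommand is a placeholder for the full command shown earlier):

```shell
# Save one result file per configuration (other flags as in the full command above)
crank ... --output baseline.json   # build without the change
crank ... --output patched.json    # build with the change

# Produce a side-by-side comparison table from the two saved result files
crank compare baseline.json patched.json
```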
I've run the TechEmpower benchmarks using a copy of … So far I was able to get only the results from the 12 and 28 core Intel x64 machines:
I don't see a significant difference except for the reproducible +10k for Fortunes with @kouvel's changes.
@sebastienros when I am trying to use the AMD machine I am unable to ping it. Is it offline? For the Mono machine I am getting a different error with this configuration:
```yml
jobs:
  db:
    endpoints:
      - http://asp-citrine-amd:5001
  application:
    endpoints:
      - http://asp-mono-lin:5001
    variables:
      databaseServer: 10.0.0.106
  load:
    endpoints:
      - http://asp-mono-load:5002
    variables:
      serverUri: http://asp-mono-lin
```
@sebastienros is there anything I could do to make it work?
@adamsitnik the AMD machine is not available anymore. NIC issues: AMD (Alex) couldn't repro the problem, Mellanox (card brand) support didn't help because it's a Dell machine, and the labs people closed the ticket. Next step is for me to contact Dell and hope they can diagnose it. As for Mono, I know they have moved the machines and got new IPs, but I don't think they gave me the new values or updated the records. I will ask to get the records updated, but you shouldn't mix a Citrine machine with the Mono machine if you have other options; I can't guarantee the stability and efficiency of the network between these machines.
@adamsitnik it might be worth using a scenario that explicitly exercises the structure changed in this PR; I don't know how much Kestrel/Json relies on it, even indirectly (did a profile say?). And maybe just create a more realistic usage of the concurrent queue within a web app? That might make sense for any concurrent data structure, btw, though I haven't checked how the micro benchmarks are built.
Forgot to mention that crank can also run BenchmarkDotNet benchmarks now, without any change to the benchmark. Here is the documentation about the feature, pointing to some examples using the dotnet/performance repo: https://github.com/dotnet/crank/blob/master/docs/microbenchmarks.md This means you can use the labs machines to run the benchmarks, not just the machines you have access to, or run the micro benchmarks and the TE ones on the same machines.
@sebastienros I want to use a machine that has more cores than the Citrine machine to see how @alexcovington's proposal affects #36447, where profiles have proven that …
Description
I've noticed that when a `ConcurrentQueue` instance has many enqueuers/dequeuers, there is a lot of extra time spent in `SpinWait.SpinOnce`. This seems to be because the `SpinWait.SpinOnce` call is passing the optional parameter `sleep1Threshold: -1`, which disables the call to `Thread.Sleep` that a thread would eventually make after spinning for too long.

When I change the parameter from `sleep1Threshold: -1` to `sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration`, I see a significant increase in performance on some of my machines in certain microbenchmark cases. I ran the benchmarks against local builds from the `release/5.0-rc2` branch, with and without the change to the sleeping behavior. The microbenchmarks I ran are from the dotnet/performance repository, and can be reproduced with:

I've also included BenchmarkDotNet results from the dotnet/performance microbenchmarks to show the effect of the change below.
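The exact reproduction command was elided from this copy; assuming the usual dotnet/performance workflow, running just the ConcurrentQueue microbenchmarks against a locally built runtime looks roughly like this (the filter pattern and corerun path are illustrative):

```shell
git clone https://github.com/dotnet/performance
cd performance/src/benchmarks/micro

# Run only the ConcurrentQueue benchmarks against a local runtime build;
# --corerun points BenchmarkDotNet at the locally built corerun, so the same
# command can be repeated with and without the sleep1Threshold change.
dotnet run -c Release -f netcoreapp5.0 \
  --filter '*ConcurrentQueue*' \
  --corerun /path/to/runtime/artifacts/bin/testhost/.../corerun
```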
Configuration
Each machine is an x64 machine running Ubuntu 20.04. More information in the BenchmarkDotNet results.
Regression?
It looks like this was changed to improve performance back in .NET Core 3.0, based on this merge.
Data
Skylake
Ryzen
Analysis
Here is the change I made on my fork. This branch is based off `master`, so BenchmarkDotNet may complain about versioning if you just clone this branch alone. I can push a branch off of `release/5.0-rc2` if that would be convenient.

Basically, changing the threshold value used in `ConcurrentQueueSegment` from `-1` to a value that allows threads to sleep seems to help threads spend less time spin-waiting. I played around with a few values and found `Thread.OptimalMaxSpinWaitsPerSpinIteration` gave me the best result, but this was just blindly guessing with various values and may not be the most optimal. Removing the parameter entirely to allow for the default behavior of `SpinWait.SpinOnce()` also improved performance, but not as much as using the `Thread.OptimalMaxSpinWaitsPerSpinIteration` value.

I'm wondering if there is a case where `-1` is still optimal, or could this be changed?

Please let me know if I can include any other information or clarify anything above.
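Concretely, the proposal amounts to changing the `sleep1Threshold` argument passed to `SpinWait.SpinOnce` inside `ConcurrentQueueSegment`. The loop below is a simplified sketch of that pattern, not the actual runtime source; since `Thread.OptimalMaxSpinWaitsPerSpinIteration` is internal to CoreLib, the sketch uses a small literal as a stand-in:

```csharp
using System;
using System.Threading;

static class SpinExample
{
    // Simplified sketch of the spin-loop pattern used in ConcurrentQueueSegment.
    //
    // Current behavior:   spinner.SpinOnce(sleep1Threshold: -1);
    //   -1 disables the Thread.Sleep(1) fallback entirely, so a contended
    //   thread keeps spinning/yielding no matter how long it waits.
    //
    // Proposed behavior:  pass a non-negative threshold so that after enough
    //   spin iterations the thread falls back to Thread.Sleep(1).
    public static void SpinUntil(Func<bool> condition)
    {
        SpinWait spinner = default;
        while (!condition())
        {
            // Stand-in literal for Thread.OptimalMaxSpinWaitsPerSpinIteration,
            // which is not accessible outside System.Private.CoreLib.
            spinner.SpinOnce(sleep1Threshold: 8);
        }
    }
}
```

The trade-off discussed in this thread is visible in the sketch: with the sleep fallback enabled, heavily contended threads stop burning CPU in `SpinWait`, but any thread that does reach the threshold can be parked for a full timer tick, which is the long-delay risk @kouvel raises above.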
Edit: Needed to remove EPYC results, but this problem does appear on EPYC with similar results to Ryzen. Please see internal email thread for those numbers.