ConcurrentQueueSegment allows spinning threads to sleep. #44265
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
...libraries/System.Private.CoreLib/src/System/Collections/Concurrent/ConcurrentQueueSegment.cs
What about just using
That's a good point. Passing no argument is still an improvement and is in the same ballpark for Skylake and Ryzen, but on EPYC the improvement isn't as great as when passing a threshold: EPYC sees ~150-200% more requests/sec with a threshold and ~80% more requests/sec with no argument. So I'd prefer to pass the threshold value, if possible. Here are the Skylake and Ryzen numbers comparing the current release (default), a sleep threshold of 8, and no sleep threshold argument (noarg):
[Result tables for Ryzen and Skylake not preserved]
Thanks, @alexcovington. @kouvel, this makes me wonder if we need to revisit
e.g. whether it should be the same "8" that @alexcovington has landed on here...? We can certainly check in the 8 that's used here, but such magic values being thrown around do make me a little nervous.
Possibly, but that would involve a lot more testing on anything that uses the
I'm not sure that one number would work best for everything. I have had to tweak spin counts for different cases before, depending on how expensive the following wait is. It could also vary based on how the data structure is used and how much contention it sees.
Ok. Let's get this in but then follow up. We use -1 as well in ConcurrentStack, BlockingCollection, ManualResetEventSlim, SemaphoreSlim, SpinLock, Barrier, CountdownEvent, and Task. @alexcovington, is this something you'd be interested in helping with? If not, totally fine, just figured I'd ask :)
Mainly, if a proper wait follows the spin-wait, then there wouldn't be much benefit in doing Sleep(1). I think some of those fall into that category, where avoiding the sleep is probably OK.
@stephentoub I'd be happy to help 😄. Just let me know how I can contribute.
Thanks, @alexcovington. I think the work would "just" be to look at the other uses of SpinWait.SpinOnce(-1) (you can see all of them here: https://source.dot.net/#System.Private.CoreLib/SpinWait.cs,e030659599d0fa3f,references) and decide if any should be changed to either SpinOnce() or SpinOnce(someOtherValue). We could also look at existing uses of the parameterless SpinOnce (https://source.dot.net/#System.Private.CoreLib/SpinWait.cs,39bd72970cc926fe,references), though that seems less important given how much closer that was to the ideal throughput in your tests of ConcurrentQueue. As @kouvel says, some of them are probably fine as is, but I expect some might warrant a change, e.g. ConcurrentStack's usage is similar to ConcurrentQueue's. It's also fine to decide everything is good the way it is, or the difference is negligible enough to not be worth the effort. I just see us changing one usage and want to make sure we've at least thought about the others and whether they're relevant.
@stephentoub Makes sense. I'll start poking around and will post a new issue if I find anything.
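As a rough illustration of the three call shapes discussed in this thread (`SpinOnce(-1)`, parameterless `SpinOnce()`, and `SpinOnce(someOtherValue)`), here is a minimal sketch. `SpinSketch` and `SpinUntil` are hypothetical names invented for this example and are not code from the PR or from `ConcurrentQueueSegment`; only `SpinWait`, its `SpinOnce` overloads, and its `Count` property are existing .NET APIs.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration only; not code from the PR.
public static class SpinSketch
{
    // Spins until `condition` returns true and reports how many times
    // SpinOnce was called. `sleep1Threshold` selects the call shape:
    //   null -> SpinOnce()    (default behavior; may eventually use Thread.Sleep(1))
    //   -1   -> SpinOnce(-1)  (use of Thread.Sleep(1) is disabled)
    //   n>=0 -> SpinOnce(n)   (Thread.Sleep(1) may be used after a minimum spin count of n)
    public static int SpinUntil(Func<bool> condition, int? sleep1Threshold)
    {
        var spinner = new SpinWait();
        while (!condition())
        {
            if (sleep1Threshold is int threshold)
                spinner.SpinOnce(threshold);
            else
                spinner.SpinOnce();
        }
        return spinner.Count;
    }
}
```

Under these assumptions, `SpinSketch.SpinUntil(condition, 8)` would correspond to the threshold of 8 landed on in this PR, `SpinSketch.SpinUntil(condition, -1)` to the `SpinOnce(-1)` usages listed above, and `SpinSketch.SpinUntil(condition, null)` to the "noarg" variant from the benchmarks.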
- dotnet#44265 seems to have caused large regressions on Windows and Linux-arm64. During that change we had tested adding the `Sleep(1)` to some `ConcurrentQueue` operations in contending cases, and not spin-waiting at all in forward-progressing cases. Not spin-waiting at all where possible in contending cases seemed to be better or equal for the most part (compared with spin-waiting without `Sleep(1)`), so I have removed spin-waiting in forward-progressing cases in `ConcurrentQueue`.
- There were some regressions from the portable thread pool on Windows. I have moved and tweaked a slight delay that I had added early on. After later changes it had lost its original purpose; with these changes it serves that purpose again and seems to close some of the gap, though maybe not all of it in some tests. We'll check the graphs after this change and see if there is more to investigate. There are also other things to improve on Windows; many of those may be separate from the portable thread pool, but some may be relevant to the changes in perf characteristics.
Proposal to fix this issue. The `SpinWait` instances in `ConcurrentQueueSegment` do not allow enqueuers/dequeuers to sleep when there is contention, causing a lot of time spent busy waiting.

This change would essentially undo this merge and reimplement a threshold value for the spinners. Originally I used `Thread.OptimalMaxSpinWaitsPerSpinIteration` as the threshold value, which improved throughput on my EPYC machine significantly, but I've found that using a constant value of `8` results in very similar performance, and that is what `Thread.OptimalMaxSpinWaitsPerSpinIteration` usually evaluates to anyway.

I ran both the Microbenchmark and TechEmpower benchmarks to evaluate the impact of the change. Here are the results for my Ryzen and Skylake machines:
Microbenchmarks
- Base: current `master` branch
- Diff: this change
[Result tables for Ryzen and Skylake not preserved]

TechEmpower
- Default: current `master` branch
- x.x.8: sleep threshold value of `8`
[Result tables for Plaintext, Fortunes, and Json on Ryzen and Skylake not preserved]
This change is more significant on high-core count CPUs and impacts my EPYC machine the most, but I cannot post the numbers publicly. Please let me know if anyone would like to review them and I can send an internal email with results.
Please let me know if I can clarify or expand on any of the above.