ConcurrentQueue spending excess time in SpinWait #44077

alexcovington · 2020-10-30T18:02:42Z

Description

I've noticed that when a ConcurrentQueue instance has many enqueuers/dequeuers, there is a lot of extra time spent in SpinWait.SpinOnce. This seems to be because the SpinWait.SpinOnce call is passing the optional parameter sleep1Threshold: -1, which disables the call to Thread.Sleep that a thread would eventually call after spinning too long.

When I change the parameter from sleep1Threshold: -1 to sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration, I see a significant increase in performance on some of my machines in certain microbenchmark cases. I ran the benchmarks against local builds from the release/5.0-rc2 branch, with and without the change to the sleeping behavior. The microbenchmarks I ran against are from the dotnet/performance repository, and can be reproduced with:

sudo python3 ./script/benchmarks_ci.py -c Release -f netcoreapp5.0 --filter '*ConcurrentQueue*' --corerun $CORERUN_PATH --bdn-artifacts $BDN_ARTIFACTS_DIR

I've also included BenchmarkDotNet results from the dotnet/performance Microbenchmarks to show the effect of the change below.

Configuration

Each machine is an x64 machine running Ubuntu 20.04. More information in the BenchmarkDotNet results.

Regression?

It looks like this was changed to improve performance back in .NET 3.0 based on this merge.

Data

Skylake

Base (SpinOnce(sleep1Threshold: -1))
------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-EQUTZA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-ISAALG : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |              Mean |            Error |            StdDev |            Median |               Min |               Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |------------------:|-----------------:|------------------:|------------------:|------------------:|------------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |       ? |          72.31 ns |         0.056 ns |          0.164 ns |          72.24 ns |          72.08 ns |          72.75 ns | 0.1376 |      - |     - |     576 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |       ? |          85.67 ns |         0.249 ns |          0.723 ns |          86.14 ns |          84.69 ns |          86.65 ns | 0.1988 |      - |     - |     832 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |   512 |       ? |          20.71 ns |         0.001 ns |          0.002 ns |          20.71 ns |          20.71 ns |          20.72 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |   512 |       ? |          21.20 ns |         0.001 ns |          0.003 ns |          21.20 ns |          21.19 ns |          21.21 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       7,447.65 ns |         1.621 ns |          4.755 ns |       7,448.25 ns |       7,437.28 ns |       7,460.69 ns | 1.0432 |      - |     - |    4448 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       8,138.52 ns |         1.695 ns |          4.809 ns |       8,139.33 ns |       8,124.25 ns |       8,148.08 ns | 2.0214 | 0.0978 |     - |    8544 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       4,358.00 ns |         4.003 ns |         11.290 ns |       4,358.73 ns |       4,331.23 ns |       4,383.26 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,604.88 ns |         2.521 ns |          7.272 ns |       5,603.75 ns |       5,586.93 ns |       5,625.49 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       7,811.29 ns |         1.145 ns |          3.190 ns |       7,811.23 ns |       7,799.31 ns |       7,819.26 ns | 2.3137 |      - |     - |    9792 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-EQUTZA |                 Default |                 Default |           16 |           1 |     ? |     512 |       8,553.62 ns |         2.325 ns |          6.166 ns |       8,555.10 ns |       8,537.96 ns |       8,569.89 ns | 4.2855 |      - |     - |   17984 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  31,107,778.63 ns | 2,041,343.278 ns |  5,690,449.134 ns |  28,001,835.00 ns |  26,040,926.50 ns |  49,191,892.50 ns |      - |      - |     - |    9168 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  30,865,184.17 ns | 1,633,527.591 ns |  4,444,132.284 ns |  28,404,088.50 ns |  27,454,580.00 ns |  44,963,492.00 ns |      - |      - |     - |  526032 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 187,964,095.59 ns | 4,103,621.184 ns | 11,970,455.607 ns | 189,390,999.50 ns | 144,958,019.50 ns | 212,779,102.50 ns |      - |      - |     - |    1192 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-ISAALG |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 180,518,506.05 ns | 4,279,085.954 ns | 12,616,981.480 ns | 181,038,632.00 ns | 141,594,659.00 ns | 205,936,025.00 ns |      - |      - |     - |    4008 B |


Diff (SpinOnce(sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration))
----------------------------------------------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-MIAVZY : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-SIKWHO : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |             Mean |            Error |           StdDev |           Median |              Min |              Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|-----------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |       ? |         87.06 ns |         0.049 ns |         0.143 ns |         87.12 ns |         86.76 ns |         87.38 ns | 0.1395 |      - |     - |     584 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |       ? |        101.25 ns |         0.017 ns |         0.043 ns |        101.25 ns |        101.16 ns |        101.42 ns | 0.2007 |      - |     - |     840 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |   512 |       ? |         21.28 ns |         0.001 ns |         0.004 ns |         21.28 ns |         21.27 ns |         21.29 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |   512 |       ? |         21.27 ns |         0.001 ns |         0.002 ns |         21.27 ns |         21.26 ns |         21.27 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      7,717.92 ns |         2.011 ns |         5.834 ns |      7,717.93 ns |      7,704.51 ns |      7,730.39 ns | 1.0483 |      - |     - |    4456 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      8,386.39 ns |         1.475 ns |         4.185 ns |      8,386.13 ns |      8,378.25 ns |      8,397.01 ns | 2.0161 | 0.1008 |     - |    8552 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      4,867.26 ns |         1.113 ns |         3.229 ns |      4,867.12 ns |      4,862.70 ns |      4,876.69 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      5,658.29 ns |         2.533 ns |         7.469 ns |      5,656.20 ns |      5,647.76 ns |      5,677.85 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      8,204.22 ns |         1.247 ns |         3.578 ns |      8,204.07 ns |      8,195.18 ns |      8,212.79 ns | 2.3306 |      - |     - |    9840 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-MIAVZY |                 Default |                 Default |           16 |           1 |     ? |     512 |      8,603.39 ns |         2.441 ns |         6.641 ns |      8,601.63 ns |      8,596.32 ns |      8,626.47 ns | 4.3044 |      - |     - |   18032 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 31,377,565.45 ns | 1,724,624.611 ns | 4,920,451.059 ns | 29,505,311.50 ns | 26,032,254.00 ns | 44,129,165.00 ns |      - |      - |     - |  526880 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 31,061,566.93 ns | 1,654,339.740 ns | 4,638,945.997 ns | 28,762,799.00 ns | 27,510,170.00 ns | 45,084,255.00 ns |      - |      - |     - | 1050656 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 83,496,964.09 ns |   415,818.700 ns | 1,166,000.215 ns | 83,076,257.00 ns | 81,850,532.00 ns | 86,766,334.00 ns |      - |      - |     - |     424 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-SIKWHO |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 88,636,063.77 ns |   350,925.444 ns | 1,006,870.882 ns | 88,184,104.00 ns | 87,145,891.00 ns | 91,002,926.00 ns |      - |      - |     - |     424 B |


Comparison
----------
summary:
better: 2, geomean: 2.148
worse: 3, geomean: 1.039
total diff: 5

| Slower                                                                   | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------------ | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.CreateAddAndClear<Int32>.ConcurrentQueue(Size: 512)   |      1.05 |          7804.02 |          8202.04 |         |
| System.Collections.CtorFromCollection<Int32>.ConcurrentQueue(Size: 512)  |      1.04 |          7442.15 |          7713.63 |         |
| System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512) |      1.03 |          8136.48 |          8383.06 |         |

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      2.24 |     187936495.50 |      83897866.50 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      2.06 |     182276745.00 |      88472010.00 |         |

Ryzen

Base (SpinOnce(sleep1Threshold: -1))
------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-YGFIWA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-JNBMSA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |              Mean |             Error |             StdDev |            Median |               Min |               Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |------------------:|------------------:|-------------------:|------------------:|------------------:|------------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |       ? |          85.04 ns |          0.543 ns |           1.577 ns |          85.18 ns |          82.04 ns |          88.88 ns | 0.0342 |      - |     - |     576 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |       ? |          95.75 ns |          0.059 ns |           0.165 ns |          95.74 ns |          95.22 ns |          96.24 ns | 0.0496 |      - |     - |     832 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |   512 |       ? |          12.31 ns |          0.003 ns |           0.007 ns |          12.31 ns |          12.30 ns |          12.33 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |   512 |       ? |          13.45 ns |          0.079 ns |           0.234 ns |          13.55 ns |          12.95 ns |          13.87 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,151.24 ns |          1.689 ns |           4.764 ns |       5,149.44 ns |       5,144.51 ns |       5,165.32 ns | 0.2479 |      - |     - |    4448 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,992.60 ns |          6.465 ns |          18.340 ns |       5,986.52 ns |       5,972.87 ns |       6,039.16 ns | 0.5035 | 0.0240 |     - |    8544 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       4,392.32 ns |          0.578 ns |           1.611 ns |       4,392.13 ns |       4,389.04 ns |       4,397.64 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,370.14 ns |          0.403 ns |           1.163 ns |       5,369.99 ns |       5,367.94 ns |       5,373.05 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,042.72 ns |          2.979 ns |           8.401 ns |       5,040.23 ns |       5,028.93 ns |       5,066.91 ns | 0.5662 |      - |     - |    9792 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-YGFIWA |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,811.73 ns |          4.256 ns |          12.142 ns |       5,808.91 ns |       5,789.65 ns |       5,846.74 ns | 1.0692 | 0.0697 |     - |   17984 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  19,289,620.33 ns |  1,373,898.353 ns |   3,807,065.866 ns |  19,122,269.00 ns |  14,069,073.00 ns |  30,219,178.00 ns |      - |      - |     - | 2100176 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  23,454,660.90 ns |    781,060.146 ns |   2,071,260.706 ns |  23,092,573.50 ns |  18,199,132.00 ns |  30,285,417.00 ns |      - |      - |     - |   33776 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 567,072,512.67 ns | 64,097,635.471 ns | 188,993,324.379 ns | 659,905,224.50 ns |  63,575,588.00 ns | 755,952,744.00 ns |      - |      - |     - |    9128 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-JNBMSA |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 595,239,541.73 ns | 54,895,214.567 ns | 161,859,778.713 ns | 685,948,586.00 ns | 134,833,576.50 ns | 783,141,323.50 ns |      - |      - |     - |   16808 B |


Diff (SpinOnce(sleep1Threshold: Thread.OptimalMaxSpinWaitsPerSpinIteration))
----------------------------------------------------------------------------

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=6.0.100-alpha.1.20528.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  Job-AHLXGP : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-OCAWOY : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
InvocationCount=1  IterationCount=100  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  

|                     Namespace |                                  Type |          Method |        Job | MaxWarmupIterationCount | MinWarmupIterationCount | UnrollFactor | WarmupCount | Count |    Size |              Mean |             Error |            StdDev |            Median |              Min |               Max |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------------ |-------------------------------------- |---------------- |----------- |------------------------ |------------------------ |------------- |------------ |------ |-------- |------------------:|------------------:|------------------:|------------------:|-----------------:|------------------:|-------:|-------:|------:|----------:|
|            System.Collections |                CtorDefaultSize<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |       ? |          81.89 ns |          0.501 ns |          1.445 ns |          81.75 ns |         78.85 ns |          85.34 ns | 0.0347 |      - |     - |     584 B |
|            System.Collections |               CtorDefaultSize<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |       ? |          98.97 ns |          0.164 ns |          0.470 ns |          98.98 ns |         98.20 ns |         100.32 ns | 0.0500 |      - |     - |     840 B |
|      System.Collections.Tests |         Add_Remove_SteadyState<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |   512 |       ? |          12.36 ns |          0.011 ns |          0.032 ns |          12.35 ns |         12.32 ns |          12.42 ns |      - |      - |     - |         - |
|      System.Collections.Tests |        Add_Remove_SteadyState<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |   512 |       ? |          13.69 ns |          0.035 ns |          0.102 ns |          13.69 ns |         13.38 ns |          13.95 ns |      - |      - |     - |         - |
|            System.Collections |             CtorFromCollection<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,160.01 ns |          1.392 ns |          3.764 ns |       5,159.72 ns |      5,149.85 ns |       5,169.32 ns | 0.2483 |      - |     - |    4456 B |
|            System.Collections |            CtorFromCollection<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       6,018.71 ns |         10.216 ns |         29.638 ns |       6,007.39 ns |      5,985.17 ns |       6,104.02 ns | 0.5048 | 0.0240 |     - |    8552 B |
|            System.Collections |                 IterateForEach<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       4,390.67 ns |          0.744 ns |          2.097 ns |       4,389.89 ns |      4,387.79 ns |       4,397.29 ns |      - |      - |     - |      72 B |
|            System.Collections |                IterateForEach<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,369.40 ns |          0.349 ns |          0.979 ns |       5,369.37 ns |      5,367.52 ns |       5,372.03 ns |      - |      - |     - |      72 B |
|            System.Collections |              CreateAddAndClear<Int32> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,069.14 ns |          2.806 ns |          7.869 ns |       5,069.77 ns |      5,053.56 ns |       5,086.62 ns | 0.5684 |      - |     - |    9840 B |
|            System.Collections |             CreateAddAndClear<String> | ConcurrentQueue | Job-AHLXGP |                 Default |                 Default |           16 |           1 |     ? |     512 |       5,833.39 ns |          4.384 ns |         12.508 ns |       5,833.01 ns |      5,786.71 ns |       5,865.83 ns | 1.0740 | 0.0700 |     - |   18032 B |
| System.Collections.Concurrent |  AddRemoveFromDifferentThreads<Int32> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  18,958,972.16 ns |  1,357,806.277 ns |  3,851,870.408 ns |  18,124,957.50 ns | 14,039,125.50 ns |  30,775,269.50 ns |      - |      - |     - |  527168 B |
| System.Collections.Concurrent | AddRemoveFromDifferentThreads<String> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  22,155,658.25 ns |    955,012.487 ns |  2,646,335.104 ns |  22,400,435.00 ns | 14,281,531.00 ns |  28,474,005.00 ns |      - |      - |     - |   33528 B |
| System.Collections.Concurrent |       AddRemoveFromSameThreads<Int32> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 | 109,724,591.48 ns | 10,191,909.446 ns | 29,568,577.342 ns | 108,815,782.50 ns | 51,159,775.50 ns | 180,713,226.50 ns |      - |      - |     - |    2488 B |
| System.Collections.Concurrent |      AddRemoveFromSameThreads<String> | ConcurrentQueue | Job-OCAWOY |                      10 |                       6 |            1 |          -1 |     ? | 2000000 |  97,721,116.74 ns |  8,854,378.736 ns | 25,404,872.403 ns |  98,165,676.50 ns | 52,096,356.50 ns | 150,054,499.50 ns |      - |      - |     - |    8384 B |


Comparison
-------------

summary:
better: 3, geomean: 3.522
total diff: 3

No Slower results for the provided threshold = 1% and noise filter = 50ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      6.99 |     685948586.00 |      98165676.50 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      6.06 |     659905224.50 |     108815782.50 |         |
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<String>.ConcurrentQu |      1.03 |      23092573.50 |      22400435.00 |         |

Analysis

Here is the change I made on my fork. This branch is based off master, so BenchmarkDotNet may complain about versioning if you just clone this branch alone. I can push a branch off of release/5.0-rc2 if that would be convenient.

Basically, changing the threshold value used in ConcurrentQueueSegment from -1 to a value that allows threads to sleep seems to help threads spend less time spin-waiting. I played around with a few values and found Thread.OptimalMaxSpinWaitsPerSpinIteration gave me the best result, but this was just blindly guessing with various values and may not be the most optimal. Removing the parameter entirely to allow for default behavior with SpinWait.SpinOnce() also improved performance, but not as much as using the Thread.OptimalMaxSpinWaitsPerSpinIteration value.

I'm wondering if there is a case where -1 is still optimal, or could this be changed?

Please let me know if I can include any other information or clarify anything above.

Edit: Needed to remove EPYC results, but this problem does appear on EPYC with similar results to Ryzen. Please see internal email thread for those numbers.

The text was updated successfully, but these errors were encountered:

ghost · 2020-10-30T18:03:00Z

Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley
See info in area-owners.md if you want to be subscribed.

stephentoub · 2020-10-30T19:33:51Z

cc: @kouvel

kouvel · 2020-10-30T21:48:54Z

Thanks @alexcovington, some of those are large differences indeed. Since Sleep(1) can cause very long delays, I'm curious if removing some of the spin-wait paths in ConcurrentQueue instead would also help. Could you please try this change in my branch called CqSpinWaitFix and see how it compares to the baseline on your Ryzen and EPYC machines? There would be more frequent memory/interlocked operations and it may not improve as much as with the Sleep(1), but I'm hoping that the slower times are mostly because of spin-wait lag.

adamsitnik · 2020-11-02T16:39:03Z

@kouvel would it be useful to run the TechEmpower benchmarks as well? I could run them on Citrine, AMD, and ARM machines using your and @alexcovington changes.

stephentoub · 2020-11-02T16:41:13Z

would it be useful to run the TechEmpower benchmarks as well?

Yes :)

kouvel · 2020-11-02T16:57:46Z

That would be great @adamsitnik, thanks!

alexcovington · 2020-11-02T19:55:06Z

@kouvel I'm seeing some changes, but not nearly as drastic as chanigng the threhsold (numbers based off the release/5.0-rc2 branch):

Skylake

summary:
better: 4, geomean: 1.051
worse: 2, geomean: 1.923
total diff: 6

| Slower                                                                           | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      1.93 |     184350066.00 |     355616451.50 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      1.92 |     181463522.00 |     347893063.50 | bimodal |

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.CtorFromCollection<Int32>.ConcurrentQueue(Size: 512)          |      1.07 |          7449.98 |          6942.39 |         |
| System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512)         |      1.06 |          8127.04 |          7656.38 |         |
| System.Collections.CreateAddAndClear<String>.ConcurrentQueue(Size: 512)          |      1.04 |          8559.30 |          8245.53 |         |
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<String>.ConcurrentQu |      1.03 |      29232848.00 |      28355177.00 |         |

Ryzen

summary:
better: 7, geomean: 1.139
worse: 2, geomean: 1.345
total diff: 9

| Slower                                                                           | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Concurrent.AddRemoveFromSameThreads<Int32>.ConcurrentQueue(Si |      1.43 |     582391563.50 |     830691209.00 |         |
| System.Collections.Concurrent.AddRemoveFromSameThreads<String>.ConcurrentQueue(S |      1.27 |     693116413.50 |     879139261.00 |         |

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<String>.ConcurrentQu |      1.50 |      23355512.00 |      15603404.00 | several?|
| System.Collections.Concurrent.AddRemoveFromDifferentThreads<Int32>.ConcurrentQue |      1.16 |      16426934.50 |      14183772.50 | several?|
| System.Collections.CreateAddAndClear<Int32>.ConcurrentQueue(Size: 512)           |      1.15 |          5029.00 |          4376.70 |         |
| System.Collections.CtorFromCollection<Int32>.ConcurrentQueue(Size: 512)          |      1.09 |          5156.07 |          4724.18 |         |
| System.Collections.CreateAddAndClear<String>.ConcurrentQueue(Size: 512)          |      1.07 |          5794.20 |          5391.80 |         |
| System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512)         |      1.04 |          5984.43 |          5736.33 |         |
| System.Collections.IterateForEach<String>.ConcurrentQueue(Size: 512)             |      1.02 |          5372.92 |          5280.67 |         |

I can email you the EPYC numbers, but the change for EPYC is about the same as Ryzen in this case.

Edit: Pasted the wrong numbers for Ryzen, updated my comment above.

kouvel · 2020-11-02T22:00:03Z

Thanks @alexcovington. Looks like it's just the contention in the same-thread tests. Since there are only two threads in the test doing enqueue-dequeue in a loop, the Sleep(1) would basically turn the test into a single-threaded test for the most part. In the different-thread test the two threads would mostly not be contending with one another. I'm still leaning towards not adding the Sleep(1) due to the long delays it can add.

Do any of you think the same-thread test is realistic enough to be worth optimizing for? Maybe the test can be modified to be a bit more realistic, like for each iteration to do a batch of enqueues, then some random work, then a batch of dequeues with a bit of random work in-between dequeues.

kouvel · 2020-11-02T22:01:19Z

We haven't run the TechEmpower tests with these comparisons so that may also be interesting.

alexcovington · 2020-11-02T23:50:24Z

@kouvel I ran the TechEmpower benchmark using Crank and the following options:

$ crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml --config amd.benchmarks.yml --profile $PROFILE --scenario plaintext --application.framework netcoreapp5.0 --application.options.outputFiles /path/to/runtime/$ARTIFACT_DIR/bin/testhost/\*\* --output $PROFILE.$ARTIFACT_DIR.json

I just want to confirm this is the right way to run the TechEmpower benchmarks with a local build? I'll be sending the numbers in an internal email thread shortly.

sebastienros · 2020-11-03T00:06:06Z

@alexcovington that looks correct, and it would be better to confirm you get the same number when building the assets without your changes or when using the nightly runtime.

I assume right now it's using rc2, because these are the latest published bits. You can get the rtm ones (release branch) with --application.framework net5.0 --application.channel edge. The runtime version should be rendered in the output.

You can also use both saved results to generate a comparison table: crank compare file1.json file2.json and share it here.

adamsitnik · 2020-11-03T10:25:32Z

I've run the TechEmpower benchmarks using a copy of System.Private.CoreLib.dll from RC2 branch, RC2 branch with @alexcovington changes applied and RC2 branch with @kouvel changes applied.

So far I was able to get only the 12 and 28 core Intel x64 machines results:

Machine	Benchmark	RC2	Alex	Kount
Perf (Intel x64 12 cores)	Plaintext	5,874,635	5,867,430	5,856,361
	JSON	635,586	637,510	635,513
	Fortunes

Citrine (Intel x64 28 cores)	Plaintext	10,883,034	10,715,820	10,911,054
	JSON	1,204,613	1,208,411	1,208,898
	Fortunes	419,920	418,355	428,360

I don't see a significant difference except for the reproducible +10k for Fortunes with @kouvel changes

adamsitnik · 2020-11-03T10:29:10Z

@sebastienros when I am trying to use the aspnet-citrine-amd machine I get the following error:

Job failed at runtime:
WRK Client
args: -c 512 http://10.0.0.106:5000/json --latency -d 15s -w 15s -t 32 --header Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7 --header Connection: keep-alive
[STDERR] Unhandled exception. System.Net.Http.HttpRequestException: No route to host

and I am unable to ping the machine. Is it offline?

For the Mono machine I am getting a different error:

  mono:
    jobs:
      db:
        endpoints: 
          - http://asp-citrine-amd:5001
      application:
        endpoints: 
          - http://asp-mono-lin:5001
        variables:
          databaseServer: 10.0.0.106
      load:
        endpoints: 
          - http://asp-mono-load:5002
        variables:
          serverUri: http://asp-mono-lin

The specified endpoint url 'http://asp-mono-lin:5001' for 'application' is invalid or not responsive: "The operation was canceled."

@sebastienros is there anything I could do to make it work?

sebastienros · 2020-11-03T18:11:34Z

@adamsitnik the amd machine is not available anymore. Nic issues, AMD (Alex) couldn't repro the problem, Mellanox (card brand) support didn't help because it's a Dell machine, and the labs people closed the ticket. Next step is for me to contact Dell and hope they can diagnose it.

Then for mono, I know they have moved the machines and got new IPs, but I don't think they gave me the new values or updated the records. I will ask to get the records updated, but you shouldn't use mix a citrine machine with the mono machine if you have other options. I can't guaranty the stability and efficiency of the network between these machines.

sebastienros · 2020-11-03T18:17:34Z

@adamsitnik might be worth using a scenario that explicitly uses the structure that is changed in this PR, and I don't know how much Kestrel/Json relies on it, even indirectly (did a profile say?). And maybe just create a more realistic usage of the concurrent queue within a web app? Might make sense for any concurrent data structure btw, though I haven't checked how the micro benchmarks are built.

sebastienros · 2020-11-03T18:21:38Z

Forgot to mention that crank can also run BND benchmarks now, without any change on the benchmark. Here is the documentation about the feature, pointing to some example using the dotnet/performance repos: https://github.com/dotnet/crank/blob/master/docs/microbenchmarks.md

This means you can use the labs machines to run the benchmarks, not just the machines you have access to. Or run the micro benchmarks and the TE ones on the same machines.

adamsitnik · 2020-11-03T19:14:22Z

@sebastienros I want to use a machine that has more cores than the Citrine machine to see how @alexcovington proposal affects #36447 where profiles have proven that ConcurentQueue can be a performance bottleneck. I just hope that it's going to improve our scalability for machines with many cores.

alexcovington added the tenet-performance Performance related issue label Oct 30, 2020

Dotnet-GitSync-Bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Oct 30, 2020

alexcovington mentioned this issue Nov 4, 2020

ConcurrentQueueSegment allows spinning threads to sleep. #44265

Merged

stephentoub closed this as completed in #44265 Nov 5, 2020

ghost locked as resolved and limited conversation to collaborators Dec 6, 2020

eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Apr 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ConcurrentQueue spending excess time in SpinWait #44077

ConcurrentQueue spending excess time in SpinWait #44077

alexcovington commented Oct 30, 2020 •

edited

Loading

ghost commented Oct 30, 2020

stephentoub commented Oct 30, 2020

kouvel commented Oct 30, 2020

adamsitnik commented Nov 2, 2020

stephentoub commented Nov 2, 2020

kouvel commented Nov 2, 2020

alexcovington commented Nov 2, 2020 •

edited

Loading

kouvel commented Nov 2, 2020

kouvel commented Nov 2, 2020

alexcovington commented Nov 2, 2020

sebastienros commented Nov 3, 2020

adamsitnik commented Nov 3, 2020

adamsitnik commented Nov 3, 2020

sebastienros commented Nov 3, 2020

sebastienros commented Nov 3, 2020

sebastienros commented Nov 3, 2020

adamsitnik commented Nov 3, 2020

ConcurrentQueue spending excess time in SpinWait #44077

ConcurrentQueue spending excess time in SpinWait #44077

Comments

alexcovington commented Oct 30, 2020 • edited Loading

Description

Configuration

Regression?

Data

Analysis

ghost commented Oct 30, 2020

stephentoub commented Oct 30, 2020

kouvel commented Oct 30, 2020

adamsitnik commented Nov 2, 2020

stephentoub commented Nov 2, 2020

kouvel commented Nov 2, 2020

alexcovington commented Nov 2, 2020 • edited Loading

kouvel commented Nov 2, 2020

kouvel commented Nov 2, 2020

alexcovington commented Nov 2, 2020

sebastienros commented Nov 3, 2020

adamsitnik commented Nov 3, 2020

adamsitnik commented Nov 3, 2020

sebastienros commented Nov 3, 2020

sebastienros commented Nov 3, 2020

sebastienros commented Nov 3, 2020

adamsitnik commented Nov 3, 2020

alexcovington commented Oct 30, 2020 •

edited

Loading

alexcovington commented Nov 2, 2020 •

edited

Loading