
ConcurrentQueue is HOT in TechEmpower profiles for machines with MANY cores #36447

Open · adamsitnik opened this issue May 14, 2020 · 14 comments

Labels: area-System.Collections · backlog-cleanup-candidate (An inactive issue that has been marked for automated closure) · os-linux (Linux OS, any supported distro) · tenet-performance (Performance related issue)
Milestone: Future

Comments

@adamsitnik (Member)

In #35330 and #35800 we changed a crucial part of the Sockets implementation on Linux, which allowed us to get some really nice gains across the vast majority of benchmarks and hardware configurations.

Today I've looked into the AMD numbers (the AMD machine was off when we were working on the recent changes), and it looks like 5.0 is overall much faster than 3.1, but the recent changes have slightly decreased performance:

[benchmark results chart]

I've quickly profiled it by running the following command:

dotnet run -- --server http://$secret1 --client $secret2 --connections 512 --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.json.json --scenario JsonPlatform  --sdk 5.0.100-preview.5.20264.2 --runtime  5.0.0-preview.6.20262.14  --collect-trace

15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:

[profile screenshot]

This machine has 46 cores. Even if I set the number of epoll threads to the old value, we never get 100% CPU utilization. In fact, even before our recent changes we never did. I've even run the Netty benchmarks, and it also struggles: it consumes only 39% of the CPU.

On an Intel machine with 56 cores we spent a similar amount of time in these two methods:

[profile screenshot]

But despite that, the recent changes have almost doubled the throughput on this machine.

On a 28-core Intel machine it's less than 3% in total:

[profile screenshot]

I believe that this phenomenon requires further investigation.

Perhaps we should give a single-producer, multiple-consumer concurrent queue a try (a suggestion @stephentoub made a while ago)?

cc @stephentoub @kouvel @tannergooding @tmds @benaadams

@adamsitnik adamsitnik added os-linux Linux OS (any supported distro) tenet-performance Performance related issue labels May 14, 2020
@adamsitnik adamsitnik added this to the 5.0 milestone May 14, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels May 14, 2020
@ghost commented May 14, 2020

Tagging subscribers to this area: @eiriktsarpalis
Notify danmosemsft if you want to be subscribed.

@stephentoub (Member)

Perhaps we should give a single-producer-multiple-consumer concurrent queue a try

You could experiment quickly with that by using https://github.com/dotnet/runtime/blob/master/src/libraries/Common/src/System/Collections/Concurrent/SingleProducerConsumerQueue.cs and guarding the TryDequeue calls with a SpinLock or Monitor. It'd require more investigation to find, or design and implement, a new SPMC queue.
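
For illustration, a rough sketch of what that experiment might look like, assuming the internal SingleProducerConsumerQueue<T> from the linked file has been copied into the project; the SpmcQueue<T> wrapper name and shape below are illustrative, not an existing API:

```csharp
using System.Collections.Concurrent; // namespace of the copied SingleProducerConsumerQueue<T>
using System.Threading;

// Single-producer/multiple-consumer sketch: the underlying queue tolerates one
// unsynchronized producer and one consumer, so the consumer side is serialized
// with a SpinLock to allow multiple dequeuing threads.
internal sealed class SpmcQueue<T>
{
    private readonly SingleProducerConsumerQueue<T> _queue = new SingleProducerConsumerQueue<T>();
    private SpinLock _dequeueLock = new SpinLock(enableThreadOwnerTracking: false);

    // Single producer: no extra synchronization needed on the enqueue side.
    public void Enqueue(T item) => _queue.Enqueue(item);

    // Multiple consumers: only one at a time may touch the single-consumer side.
    public bool TryDequeue(out T item)
    {
        bool lockTaken = false;
        try
        {
            _dequeueLock.Enter(ref lockTaken);
            return _queue.TryDequeue(out item);
        }
        finally
        {
            if (lockTaken)
            {
                _dequeueLock.Exit(useMemoryBarrier: false);
            }
        }
    }
}
```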

@benaadams (Member)

Plaintext platform also dropped; latency is much lower, but CPU is ~50% on AMD

[benchmark charts]

Also dropped a bit on the smaller Intel

[benchmark chart]

@benaadams (Member) commented May 14, 2020

What are the callers of ConcurrentQueue in the inclusive stacks? It's used in several places (including the ThreadPool and IOQueue).

@stephentoub (Member)

What are the callers of ConcurrentQueue in the inclusive stacks? It's used in several places (including the ThreadPool and IOQueue).

That's a good question; I just assumed Adam was referring to the queue added as part of the previous change, and that's what my comments were in reference to. But you're right that these could be coming from multiple other places.

@jkotas (Member) commented May 14, 2020

You may be seeing the effect of AMD being more NUMA than Intel. If that is the case, we should see similar results on ARM64, and we are likely to see a similar trend on Intel as it becomes more NUMA by adding more cores. NUMA is the laws of physics: it just does not work well for 100+ cores to all go through the same memory pipe.

We may need to make the thread pool and the socket engines NUMA-aware to address this scalability bottleneck. The other option is to tell people to run one instance per NUMA node for best performance. (Running one instance per NUMA node is likely to provide better scalability in most cases even if we do the NUMA optimizations.)

NUMA is configurable on AMD machines. To get more data, you can experiment with different NUMA configurations to see how they affect the results.

Some more info: https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf

@adamsitnik (Member, Author)

What are the callers of ConcurrentQueue in the inclusive stacks? It's used in several places

That's a very good question.

[profile screenshots: inclusive stacks]

I am going to test it with a different number of IOQueue threads, which is currently limited to 16; a sketch of how to override that follows below.

https://github.com/dotnet/aspnetcore/blob/cef755fd8251a1a869b7b892603adb772993546b/src/Servers/Kestrel/Transport.Sockets/src/SocketTransportOptions.cs#L17
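
For reference, a minimal sketch of overriding that limit through SocketTransportOptions.IOQueueCount when building a Kestrel host; Startup here stands in for the application's usual startup class:

```csharp
using System;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.UseSockets(options =>
                {
                    // Default is capped at 16 (Math.Min(Environment.ProcessorCount, 16));
                    // try half the core count instead, as in the results below.
                    options.IOQueueCount = Environment.ProcessorCount / 2;
                });
                webBuilder.UseStartup<Startup>();
            })
            .Build()
            .Run();
}
```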

@adamsitnik (Member, Author)

I've finished running the benchmarks with the IOQueue thread count set to Environment.ProcessorCount / 2 for JSONPlatform with 512 connections:

| Machine | IOQueue = 16 (default) | IOQueue = ProcessorCount / 2 |
| --- | --- | --- |
| Citrine (28 cores) | 1,146,236 | 1,126,055 |
| Mono (56 cores) | 1,047,780 | 1,225,314 |
| AMD (46 cores) | 676,425 | 738,437 |

@tmds (Member) commented May 14, 2020

15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:

Maybe also interesting to investigate: 8.6% spent in CLRLifoSemaphore::Wait.

we never get 100% CPU utilization.

Can we figure out why we go idle?

You may be seeing the effect of AMD being more NUMA than Intel.

NUMA makes synchronization (like locks and Interlocked operations) more costly, right?

@jkotas (Member) commented May 14, 2020

NUMA makes synchronization (like locks and Interlocked operations) more costly, right?

Yep, and it is not just synchronization: any memory access is more expensive when different NUMA nodes are accessing the same memory.
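
For illustration, a minimal, self-contained sketch of the kind of cross-core contention being discussed: many threads doing Interlocked operations on one shared cache line. The cost grows as more cores contend, and it grows further when the contending cores sit on different CCXs, sockets, or NUMA nodes (the sketch does not pin threads to nodes; it only shows the contended access pattern):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ContentionSketch
{
    // Every thread increments this single field, so the cache line holding it
    // bounces between cores (and, on big machines, between NUMA nodes).
    private static long _sharedCounter;

    static void Main()
    {
        const int OpsPerThread = 5_000_000;
        int threadCount = Environment.ProcessorCount;

        var sw = Stopwatch.StartNew();
        var tasks = new Task[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < OpsPerThread; i++)
                {
                    // Contended atomic: each increment needs exclusive ownership
                    // of the cache line, which must be shipped between cores.
                    Interlocked.Increment(ref _sharedCounter);
                }
            });
        }
        Task.WaitAll(tasks);
        sw.Stop();

        Console.WriteLine(
            $"{threadCount} threads, {Interlocked.Read(ref _sharedCounter):N0} increments, {sw.ElapsedMilliseconds} ms");
    }
}
```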

@rducom commented May 23, 2020

any memory access is more expensive when different NUMA nodes are accessing the same memory.

This NUMA data locality / affinity subject could be a dedicated GitHub issue, since there are a lot of sub-topics and consequences (and there are already some issues about it).

On one side, processors are becoming less and less monolithic on both the AMD and Intel sides (Ryzen inter-CCX latency is twice the intra-CCX latency); on the other side, we are using more and more multi-socket platforms (2, 4, and 8 sockets are the new standard).
https://twitter.com/IanCutress/status/1226519413978992640
https://www.anandtech.com/show/15785/the-intel-comet-lake-review-skylake-we-go-again/4
Our current production servers are actually 2-socket Xeons, and we keep our VMware VMs' core counts below one socket's worth of cores (same for RAM) to avoid the inter-socket latency penalty in non-NUMA-aware workloads (like .NET). Our only VMs using all cores (both sockets) are the SQL Server ones, since SQL Server is NUMA aware.

@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Jun 24, 2020
@adamsitnik adamsitnik modified the milestones: 5.0.0, Future Jun 24, 2020
@adamsitnik (Member, Author)

Based on an offline discussion that contained EPYC numbers that can't be shared publicly, I can just say that #44265 helped with this issue.

@sebastienros please let me know when our high-core count CPU machines are online again ;)

@ghost commented Oct 29, 2021

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.

@sebastienros (Member)

I want @kouvel, not the bot, to decide whether to close this.

@ghost ghost removed the no-recent-activity label Oct 29, 2021
@eiriktsarpalis eiriktsarpalis added the backlog-cleanup-candidate An inactive issue that has been marked for automated closure. label Feb 18, 2022