
ConcurrentQueue is HOT in TechEmpower profiles for machines with MANY cores #36447

Open · adamsitnik opened this issue May 14, 2020 · 14 comments

Labels: area-System.Collections · backlog-cleanup-candidate (An inactive issue that has been marked for automated closure) · os-linux (Linux OS, any supported distro) · tenet-performance (Performance related issue)
Milestone: Future

Comments

@adamsitnik (Member)

In #35330 and #35800 we changed a crucial part of the Sockets implementation on Linux, which allowed us to get some really nice gains across the vast majority of benchmarks and hardware configurations.

Today I've looked into the AMD numbers (the AMD machine was off when we were working on the recent changes), and it looks like 5.0 is overall much faster than 3.1, but the recent changes have slightly decreased performance:

[benchmark results chart]

I've quickly profiled it by running the following command:

dotnet run -- --server http://$secret1 --client $secret2 --connections 512 --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.json.json --scenario JsonPlatform  --sdk 5.0.100-preview.5.20264.2 --runtime  5.0.0-preview.6.20262.14  --collect-trace

15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:

[profile screenshot]

This machine has 46 cores. Even if I set the number of epoll threads to the old value, we never get 100% CPU utilization. In fact, even before our recent changes we never did. I've even run the Netty benchmarks, and it also struggles: it consumes only 39% of the CPU.

On an Intel machine with 56 cores we spent a similar amount of time in these two methods:

[profile screenshot]

But despite that, the recent changes have almost doubled the throughput on this machine.

On a 28-core Intel machine it's less than 3% in total:

[profile screenshot]

I believe that this phenomenon requires further investigation.

Perhaps we should give a single-producer, multiple-consumer concurrent queue a try (a suggestion @stephentoub made a while ago)?

cc @stephentoub @kouvel @tannergooding @tmds @benaadams

@adamsitnik adamsitnik added os-linux Linux OS (any supported distro) tenet-performance Performance related issue labels May 14, 2020
@adamsitnik adamsitnik added this to the 5.0 milestone May 14, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels May 14, 2020
@ghost commented May 14, 2020

Tagging subscribers to this area: @eiriktsarpalis
Notify danmosemsft if you want to be subscribed.

@stephentoub (Member)

Perhaps we should give a single-producer-multiple-consumer concurrent queue a try

You could experiment quickly with that by using https://github.com/dotnet/runtime/blob/master/src/libraries/Common/src/System/Collections/Concurrent/SingleProducerConsumerQueue.cs and guarding the TryDequeue calls with a SpinLock or Monitor. It'd require more investigation to find, or design and implement, a new SPMC queue.
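
For illustration, a rough sketch of what that experiment might look like, assuming the internal SingleProducerConsumerQueue<T> from the linked file has been copied into the project; the SpmcQueue<T> wrapper name and shape below are illustrative, not an existing API:

```csharp
using System.Collections.Concurrent; // namespace of the copied SingleProducerConsumerQueue<T>
using System.Threading;

// Single-producer/multiple-consumer sketch: the underlying queue tolerates one
// unsynchronized producer and one consumer, so the consumer side is serialized
// with a SpinLock to allow multiple dequeuing threads.
internal sealed class SpmcQueue<T>
{
    private readonly SingleProducerConsumerQueue<T> _queue = new SingleProducerConsumerQueue<T>();
    private SpinLock _dequeueLock = new SpinLock(enableThreadOwnerTracking: false);

    // Single producer: no extra synchronization needed on the enqueue side.
    public void Enqueue(T item) => _queue.Enqueue(item);

    // Multiple consumers: only one at a time may touch the single-consumer side.
    public bool TryDequeue(out T item)
    {
        bool lockTaken = false;
        try
        {
            _dequeueLock.Enter(ref lockTaken);
            return _queue.TryDequeue(out item);
        }
        finally
        {
            if (lockTaken)
            {
                _dequeueLock.Exit(useMemoryBarrier: false);
            }
        }
    }
}
```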

@benaadams (Member)

Plaintext platform also dropped; latency is much lower, but CPU is ~50% on AMD

[benchmark charts]

Also dropped a bit on the smaller Intel

[benchmark chart]

@benaadams (Member) commented May 14, 2020

What are the callers of ConcurrentQueue in the inclusive stacks? It's used in several places (including the ThreadPool and IOQueue).

@stephentoub (Member)

What are the callers of ConcurrentQueue in the inclusive stacks? It's used in several places (including the ThreadPool and IOQueue).

That's a good question; I just assumed Adam was referring to the queue added as part of the previous change, and that's what my comments were in reference to. But you're right that these could be coming from multiple other places.

@jkotas (Member) commented May 14, 2020

You may be seeing the effect of AMD being more NUMA than Intel. If that is the case, we should see similar results on ARM64, and we are likely to see a similar trend on Intel as it becomes more NUMA by adding more cores. NUMA is the laws of physics: it just does not work well for 100+ cores to all go through the same memory pipe.

We may need to make the thread pool and the socket engines NUMA-aware to address this scalability bottleneck. The other option is to tell people to run one instance per NUMA node for best performance. (Running one instance per NUMA node is likely to provide better scalability in most cases even if we do the NUMA optimizations.)

NUMA is configurable on AMD machines. To get more data, you can experiment with different NUMA configurations to see how they affect the results.

Some more info: https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf

@adamsitnik (Member, Author)

What are the callers of ConcurrentQueue in the inclusive stacks? It's used in several places

That's a very good question.

[profile screenshots: inclusive stacks]

I am going to test it with a different number of IOQueue threads, which is currently limited to 16; a sketch of how to override that follows below.

https://github.com/dotnet/aspnetcore/blob/cef755fd8251a1a869b7b892603adb772993546b/src/Servers/Kestrel/Transport.Sockets/src/SocketTransportOptions.cs#L17
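
For reference, a minimal sketch of overriding that limit through SocketTransportOptions.IOQueueCount when building a Kestrel host; Startup here stands in for the application's usual startup class:

```csharp
using System;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.UseSockets(options =>
                {
                    // Default is capped at 16 (Math.Min(Environment.ProcessorCount, 16));
                    // try half the core count instead, as in the results below.
                    options.IOQueueCount = Environment.ProcessorCount / 2;
                });
                webBuilder.UseStartup<Startup>();
            })
            .Build()
            .Run();
}
```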

@adamsitnik (Member, Author)

I've finished running the benchmarks with the IOQueue thread count set to Environment.ProcessorCount / 2 for JSONPlatform with 512 connections:

| Machine | IOQueue = 16 (default) | IOQueue = ProcessorCount / 2 |
| --- | --- | --- |
| Citrine (28 cores) | 1,146,236 | 1,126,055 |
| Mono (56 cores) | 1,047,780 | 1,225,314 |
| AMD (46 cores) | 676,425 | 738,437 |

@tmds (Member) commented May 14, 2020

15% of the total exclusive CPU time is spent in two ConcurrentQueue methods:

Maybe also interesting to investigate: 8.6% spent in CLRLifoSemaphore::Wait.

we never get 100% CPU utilization.

Can we figure out why we go idle?

You may be seeing the effect of AMD being more NUMA than Intel.

NUMA makes synchronization (like locks and Interlocked operations) more costly, right?

@jkotas (Member) commented May 14, 2020

NUMA makes synchronization (like locks and Interlocked operations) more costly, right?

Yep, and it is not just synchronization: any memory access is more expensive when different NUMA nodes are accessing the same memory.
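
For illustration, a minimal, self-contained sketch of the kind of cross-core contention being discussed: many threads doing Interlocked operations on one shared cache line. The cost grows as more cores contend, and it grows further when the contending cores sit on different CCXs, sockets, or NUMA nodes (the sketch does not pin threads to nodes; it only shows the contended access pattern):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ContentionSketch
{
    // Every thread increments this single field, so the cache line holding it
    // bounces between cores (and, on big machines, between NUMA nodes).
    private static long _sharedCounter;

    static void Main()
    {
        const int OpsPerThread = 5_000_000;
        int threadCount = Environment.ProcessorCount;

        var sw = Stopwatch.StartNew();
        var tasks = new Task[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < OpsPerThread; i++)
                {
                    // Contended atomic: each increment needs exclusive ownership
                    // of the cache line, which must be shipped between cores.
                    Interlocked.Increment(ref _sharedCounter);
                }
            });
        }
        Task.WaitAll(tasks);
        sw.Stop();

        Console.WriteLine(
            $"{threadCount} threads, {Interlocked.Read(ref _sharedCounter):N0} increments, {sw.ElapsedMilliseconds} ms");
    }
}
```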

@rducom commented May 23, 2020

any memory access is more expensive when different NUMA nodes are accessing the same memory.

This NUMA data locality / affinity subject could be a dedicated GitHub issue, since there are a lot of sub-topics and consequences (and there are already some issues about it).

On one side, processors are becoming less and less monolithic on both the AMD and Intel sides (Ryzen inter-CCX latency is twice the intra-CCX latency); on the other side, we are using more and more multi-socket platforms (2, 4, and 8 sockets are the new standard).
https://twitter.com/IanCutress/status/1226519413978992640
https://www.anandtech.com/show/15785/the-intel-comet-lake-review-skylake-we-go-again/4
Our current production servers are actually 2-socket Xeons, and we keep our VMware VMs' core counts below one socket's worth of cores (same for RAM) to avoid the inter-socket latency penalty in non-NUMA-aware workloads (like .NET). Our only VMs using all cores (both sockets) are the SQL Server ones, since SQL Server is NUMA aware.

@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Jun 24, 2020
@adamsitnik adamsitnik modified the milestones: 5.0.0, Future Jun 24, 2020
@adamsitnik (Member, Author)

Based on an offline discussion that contained EPYC numbers that can't be shared publicly, I can just say that #44265 helped with this issue.

@sebastienros please let me know when our high-core count CPU machines are online again ;)

@ghost commented Oct 29, 2021

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.

@sebastienros (Member)

I want @kouvel, not the bot, to decide whether to close this.

@ghost ghost removed the no-recent-activity label Oct 29, 2021
@eiriktsarpalis eiriktsarpalis added the backlog-cleanup-candidate An inactive issue that has been marked for automated closure. label Feb 18, 2022