ConcurrentQueue is HOT in TechEmpower profiles for machines with MANY cores #36447
Comments
Tagging subscribers to this area: @eiriktsarpalis
You could experiment quickly with that by using https://github.com/dotnet/runtime/blob/master/src/libraries/Common/src/System/Collections/Concurrent/SingleProducerConsumerQueue.cs and guarding the TryDequeue calls with a SpinLock or Monitor. It'd require more investigation to find or design and implement a new SPMC queue.
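A minimal sketch of that experiment, assuming the internal queue linked above exposes Enqueue/TryDequeue; the wrapper type and member names below are illustrative, not actual runtime code:

```csharp
// Sketch only: wrap a single-producer/single-consumer queue so that multiple
// consumers can call TryDequeue, serialized by a SpinLock.
// SingleProducerConsumerQueue<T> stands in for the internal type linked above;
// its Enqueue/TryDequeue shape is assumed.
using System.Threading;

internal sealed class GuardedSpmcQueue<T>
{
    private readonly SingleProducerConsumerQueue<T> _queue = new();
    private SpinLock _dequeueLock = new(enableThreadOwnerTracking: false); // not readonly: SpinLock is a mutable struct

    // The single producer (e.g. the epoll thread) enqueues with no extra synchronization.
    public void Enqueue(T item) => _queue.Enqueue(item);

    // Any number of consumers may dequeue; the SpinLock makes them take turns,
    // so only the consumer side pays for cross-thread coordination.
    public bool TryDequeue(out T item)
    {
        bool lockTaken = false;
        try
        {
            _dequeueLock.Enter(ref lockTaken);
            return _queue.TryDequeue(out item);
        }
        finally
        {
            if (lockTaken)
                _dequeueLock.Exit(useMemoryBarrier: false);
        }
    }
}
```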
Which are the callers of
That's a good question; I just assumed Adam was referring to the queue added as part of the previous change, and that's what my comments were in reference to. But you're right that these could be coming from multiple other places.
You may be seeing the effect of AMD being more NUMA than Intel. If that is the case, we should see similar results on ARM64, and we are likely to see a similar trend on Intel as it becomes more NUMA by adding more cores. NUMA is laws of physics: it just does not work well for 100+ cores to be going through the same memory pipe. We may need to make the thread pool and the socket engines NUMA-aware to address this scalability bottleneck. The other option is to tell people to run one instance per NUMA node for best performance. (Running one instance per NUMA node is likely to provide better scalability in most cases even if we do the NUMA optimizations.) NUMA is configurable on AMD machines. To get more data, you can experiment with different NUMA configurations to see how they affect the results. Some more info: https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf
That's a very good question. I am going to test it with different numbers of
I've finished running the benchmarks with IOQueue thread count set to
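For reference, a hedged sketch of how the IOQueue thread count is typically varied for experiments like this; IOQueueCount is a setting on Kestrel's SocketTransportOptions, and the value and Startup class below are purely illustrative:

```csharp
// Illustrative only: configure the IOQueue count on Kestrel's sockets transport.
// The value 16 is an example, not the number used in the benchmark runs above.
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder
                    .UseStartup<Startup>() // Startup is a placeholder for the benchmark app's startup class
                    .UseSockets(options =>
                    {
                        // Number of ConcurrentQueue-backed I/O queues the transport dispatches to;
                        // 0 makes the transport schedule work directly on the thread pool instead.
                        options.IOQueueCount = 16;
                    });
            })
            .Build()
            .Run();
}
```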
Maybe also interesting to investigate: 8.6% spent in
Can we figure out why we go idle?
NUMA makes synchronization (like locks and Interlocked operations) more costly, right?
Yep, it is not just synchronization - any memory access is more expensive when different NUMA nodes are accessing the same memory.
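As a rough illustration of that cost (not a benchmark from this thread), the sketch below hammers one shared counter from all cores and then padded per-thread counters; on a multi-node machine the shared case pays for the contended cache line bouncing between nodes:

```csharp
// Minimal sketch: contended vs. uncontended atomic increments.
// Thread count, padding, and iteration count are illustrative only.
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class NumaContentionSketch
{
    const int Iterations = 10_000_000;

    static void Main()
    {
        int threads = Environment.ProcessorCount;

        long shared = 0;
        var sw = Stopwatch.StartNew();
        Parallel.For(0, threads, _ =>
        {
            for (int i = 0; i < Iterations; i++)
                Interlocked.Increment(ref shared); // every thread fights over one cache line
        });
        Console.WriteLine($"shared counter:      {sw.ElapsedMilliseconds} ms");

        // Pad each slot to its own cache line so threads don't invalidate each other.
        long[] perThread = new long[threads * 16];
        sw.Restart();
        Parallel.For(0, threads, t =>
        {
            for (int i = 0; i < Iterations; i++)
                Interlocked.Increment(ref perThread[t * 16]); // core-local cache line
        });
        Console.WriteLine($"per-thread counters: {sw.ElapsedMilliseconds} ms");
    }
}
```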
This NUMA data locality / affinity subject could be a dedicated GH issue, since there are lots of sub-topics and consequences (and some existing issues about it). On one side, processors are becoming less and less monolithic on both the AMD and Intel sides (Ryzen inter-CCX latency is twice the intra-CCX latency); on the other side, we are using more and more multi-socket platforms (2, 4, 8 sockets are the new standard).
Based on an offline discussion that contained EPYC numbers that can't be shared publicly, I can just say that #44265 helped with this issue. @sebastienros please let me know when our high-core-count CPU machines are online again ;)
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process. This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.
I want @kouvel to decide whether to close this or not, not the bot.
In #35330 and #35800 we changed a crucial part of the Sockets implementation on Linux, which allowed us to get some really nice gains in the vast majority of benchmarks and hardware configurations.
Today I've looked into the AMD numbers (the AMD machine was off when we were working on the recent changes) and it looks like, overall, 5.0 is much faster than 3.1, but the recent changes have slightly decreased performance.
I've quickly profiled it by running the following command:
dotnet run -- --server http://$secret1 --client $secret2 --connections 512 --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.json.json --scenario JsonPlatform --sdk 5.0.100-preview.5.20264.2 --runtime 5.0.0-preview.6.20262.14 --collect-trace
15% of the total exclusive CPU time is spent in two ConcurrentQueue methods. This machine has 46 cores. Even if I set the number of epoll threads to the old value, we never get 100% CPU utilization. In fact, even before our recent changes we never did. I've even run the Netty benchmarks and it's also struggling - it's consuming only 39% of CPU.
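For context, a simplified sketch of the kind of dispatch structure those ConcurrentQueue Enqueue/TryDequeue calls typically come from (loosely modeled on Kestrel's IOQueue scheduler, not copied from it):

```csharp
// Simplified sketch of a ConcurrentQueue-backed I/O dispatch queue:
// completions are enqueued and drained on the thread pool.
using System;
using System.Collections.Concurrent;
using System.Threading;

internal sealed class IoQueueSketch
{
    private readonly ConcurrentQueue<Action> _work = new();
    private int _doingWork; // 0 = idle, 1 = a drain is scheduled or running

    public void Schedule(Action callback)
    {
        _work.Enqueue(callback); // the Enqueue that shows up hot in the profile

        // Only schedule a drain if one isn't already pending.
        if (Interlocked.CompareExchange(ref _doingWork, 1, 0) == 0)
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((IoQueueSketch)s!).Drain(), this);
    }

    private void Drain()
    {
        while (_work.TryDequeue(out Action? callback)) // the TryDequeue that shows up hot
            callback();

        Volatile.Write(ref _doingWork, 0);

        // If more work raced in after we saw the queue empty, reschedule a drain.
        if (!_work.IsEmpty && Interlocked.CompareExchange(ref _doingWork, 1, 0) == 0)
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((IoQueueSketch)s!).Drain(), this);
    }
}
```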
On an Intel machine with 56 cores we spent a similar amount of time in these two methods.
But despite that, the recent changes have almost doubled the throughput on this machine.
On a 28-core Intel machine it's less than 3% in total.
I believe that this phenomenon requires further investigation.
Perhaps we should give a single-producer-multiple-consumer concurrent queue a try (a suggestion from @stephentoub from a while ago)?
cc @stephentoub @kouvel @tannergooding @tmds @benaadams