
win32threadpool.cpp should use GetQueuedCompletionStatusEx rather than GetQueuedCompletionStatus #11314

Closed
benaadams opened this issue Oct 24, 2018 · 19 comments


@benaadams
Member

Pulling back multiple events (when available) per kernel call to dispatch rather than a single one per kernel call.

Also maybe use a completion port per core rather than a single completion port (to reduce contention)? (Not sure what strategy to use for attaching handles though; probably something better than round robin)
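For anyone unfamiliar with the API shape, here is a minimal sketch of what the batched dequeue looks like at the Win32 level. DispatchCompletion is a placeholder standing in for whatever the thread pool does per completion; it is not the actual win32threadpool.cpp dispatch path, and the batch size of 64 is just an illustrative tuning knob.

#include <windows.h>

// Placeholder: stands in for the per-completion dispatch work.
void DispatchCompletion(ULONG_PTR completionKey, DWORD numBytes, LPOVERLAPPED pOverlapped);

void DrainCompletionPort(HANDLE iocp)
{
    OVERLAPPED_ENTRY entries[64]; // batch size would be a tuning knob
    ULONG removed = 0;

    // One kernel transition can return up to 64 completions instead of one.
    while (GetQueuedCompletionStatusEx(iocp, entries, ARRAYSIZE(entries),
                                       &removed, INFINITE, FALSE))
    {
        for (ULONG i = 0; i < removed; i++)
        {
            // Each entry carries the same data GetQueuedCompletionStatus
            // would have returned one call at a time.
            DispatchCompletion(entries[i].lpCompletionKey,
                               entries[i].dwNumberOfBytesTransferred,
                               entries[i].lpOverlapped);
        }
    }
}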

/cc @stephentoub @kouvel @vancem

@davidfowl
Member

cc @geoffkizer, who did some experimentation here and didn't see much difference in plaintext. On Windows all of the IO queues pull from the same port (we have a single global IO completion port), so it's essentially a single-producer, multiple-consumer model, vs libuv where there's a unique port per loop thread polling for multiple items. I'm not sure that explains the differences we saw though...

@stephentoub
Member

Pulling back multiple events (when available) per kernel call to dispatch rather than a single one per kernel call.

This also means that either a) some events may have been able to run on other threads but are now waiting to be run by the thread that dequeued them while it processes other items, or b) additional queueing costs will be incurred to queue the execution of those additional items to the worker pool.

should use

Are you investigating this?

@benaadams
Member Author

Are you investigating this?

I'm thinking about it; not sure of a good way to objectively evaluate the approach, or what the right approach is atm.

Multiple completion ports would probably be good (though it risks being imbalanced), but then offloading to the main threadpool would reintroduce the contention it was meant to reduce.

@benaadams
Member Author

For GetQueuedCompletionStatusEx, checking the averages and min/max of ulNumEntriesRemoved to see how often it is above 1 might be interesting (or show it's not worth it).
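
A rough instrumentation sketch of that measurement, assuming a hypothetical drain loop; the counter names are made up, and dispatch of the dequeued entries is omitted:

#include <windows.h>

// Hypothetical counters, purely for measurement.
static LONG64 s_calls = 0;
static LONG64 s_completions = 0;
static LONG64 s_maxBatch = 0;

void MeasureBatchSize(HANDLE iocp)
{
    OVERLAPPED_ENTRY entries[64];
    ULONG removed = 0;

    if (GetQueuedCompletionStatusEx(iocp, entries, ARRAYSIZE(entries),
                                    &removed, INFINITE, FALSE))
    {
        InterlockedIncrement64(&s_calls);
        InterlockedExchangeAdd64(&s_completions, (LONG64)removed);
        if ((LONG64)removed > s_maxBatch)
            s_maxBatch = removed; // racy, but good enough for a rough signal

        // Average batch size is s_completions / s_calls; if it stays at ~1,
        // batching via GQCSEx is unlikely to pay for itself.
        // (Dispatch of entries[0..removed) omitted here.)
    }
}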

@benaadams
Member Author

Though there is easier low hanging fruit to pick first... https://github.com/dotnet/coreclr/issues/19672

@JeffCyr
Contributor

JeffCyr commented Oct 25, 2018

This also means that either a) some events may have been able to run on other threads but are now waiting to be run by the thread that dequeued them while it processes other items, or b) additional queueing costs will be incurred to queue the execution of those additional items to the worker pool.

Could it work by implementing a work-stealing queue for IO pool threads, similar to the worker pool threads? So if a batch bigger than 1 is fetched from the IO completion port, it goes to a thread-local queue that other idle threads can consume.

@stephentoub
Member

Could it work by implementing a work stealing queue for IO pool threads similar to the worker pool threads?

That's one form of "additional queueing costs will be incurred to queue the execution of those additional items" (I shouldn't have been specific about "worker pool"). But that adds non-trivial costs, makes it likely that certain work items will get starved when they otherwise would have been processed sooner, etc., and I'm skeptical it would really help. The work-stealing benefit for the worker pool is primarily based on the notion that a given piece of work is being split into multiple pieces, so it's the work item being processed that's generating subwork, preferred by this thread but made available to others just in case... with the suggested approach, the work items would likely be unrelated. Someone could experiment with it of course.

@stephentoub
Member

Though there is easier low hanging fruit to pick first...

Are you working through some trace/profile? What app?

@GSPP

GSPP commented Oct 27, 2018

Completion ports are not meant to be sharded. The single completion port design is what's recommended.

@geoffkizer
Contributor

@benaadams (or anyone else interested):

I have some code that implements this. It's Socket-specific but that's likely good enough for testing.

https://github.com/geoffkizer/corefx/tree/socketcustomiocp

There are env vars for controlling different parameters like # of IOCPs and # of completions to retrieve per GQCSEx call.

As @davidfowl mentioned above, I tried playing around with this on Plaintext but the wins were tiny at best (basically noise). In particular, sharding the IOCP doesn't seem to help; presumably the kernel is already doing a good job here internally. Batching with GQCSEx doesn't really seem to help either; the overhead specifically in GQCS doesn't seem to be that high.

That said, I do think this is worth additional investigation.

Be aware that this code replaces the existing IOCP infrastructure with a custom implementation. I compared the current coreclr implementation vs this custom implementation with equivalent configuration (batch size = 1, completion ports = 1) and didn't really see much difference, from which I conclude that the current IOCP implementation is reasonably efficient. If you do find scenarios/configurations where you see wins with this code, it would be good to understand whether the wins are due to the specific configuration or the IOCP implementation or both.

@benaadams
Member Author

Mostly I was wondering whether the OverlappedData.CheckVMForIOPacket call in this loop could be skipped with GQCSEx:

internal static unsafe void PerformIOCompletionCallback(uint errorCode, uint numBytes, NativeOverlapped* pNativeOverlapped)
{
    do
    {
        OverlappedData overlapped = OverlappedData.GetOverlappedFromNative(pNativeOverlapped);

        if (overlapped._callback is IOCompletionCallback iocb)
        {
            // We got here because of UnsafePack (or) Pack with EC flow suppressed
            iocb(errorCode, numBytes, pNativeOverlapped);
        }
        else
        {
            // We got here because of Pack
            var helper = (_IOCompletionCallback)overlapped._callback;
            helper._errorCode = errorCode;
            helper._numBytes = numBytes;
            helper._pNativeOverlapped = pNativeOverlapped;
            ExecutionContext.RunInternal(helper._executionContext, _ccb, helper);
        }

        // Quickly check the VM again, to see if a packet has arrived.
        OverlappedData.CheckVMForIOPacket(out pNativeOverlapped, out errorCode, out numBytes);
    } while (pNativeOverlapped != null);
}

@geoffkizer
Contributor

This code, as I understand it, is a legacy of when we supported multiple AppDomains per process. There's a single thread pool per process and a single IOCP. Thus at a low level (below this code), we're doing the usual GQCS logic, and when a completion arrives, we "enter" the appropriate AppDomain and execute the callback. But "entering" the AppDomain used to have some nontrivial cost, so to optimize the common case of a single AppDomain, we also have the logic above, which does a 0-timeout GQCS and checks if the received completion (if any) is in the same AppDomain. If so, keep looping without having to switch AppDomains.

This isn't ideal, because if there's no completion to retrieve immediately, then the GQCS is pointless and just adds extra overhead. However, in practice it's not clear that this matters. In benchmark scenarios, there's usually a completion available.
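
For readers following along, here is a conceptual sketch of that 0-timeout check: after dispatching a completion, the thread polls the port without waiting so it can keep draining packets without going back through the full wait path. The names are illustrative, not the actual coreclr code, which lives on the unmanaged side of the thread pool.

#include <windows.h>

bool TryGetNextPacket(HANDLE iocp, DWORD* errorCode, DWORD* numBytes,
                      ULONG_PTR* key, LPOVERLAPPED* pOverlapped)
{
    *errorCode = ERROR_SUCCESS;
    if (!GetQueuedCompletionStatus(iocp, numBytes, key, pOverlapped, 0 /* don't wait */))
    {
        // A NULL OVERLAPPED means nothing was dequeued (the zero timeout expired);
        // otherwise a completion was dequeued but the I/O operation itself failed.
        if (*pOverlapped == NULL)
            return false;
        *errorCode = GetLastError();
    }
    return true;
}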

I played around with trying to change this at one point, but didn't see any useful results.

The complexity here is part of why I've tried to prototype other IOCP approaches by simply replacing the whole IOCP/thread pool implementation instead of modifying it in place.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 9, 2020
@sandersaares

sandersaares commented Aug 27, 2020

We have a scenario where our app processes very large quantities of packets, using IOCP to retrieve and process batches of packets on particular threads. This app is currently written in C++ to ensure we have the fine grained control we need. Making greater use of .NET for more developer-friendly business logic and higher-level feature implementation is something that keeps popping up now and then. The app is quite performance-critical, however. Basically a packet switch.

A quick experiment with a naive .NET Core sockets implementation shows less than 50% of the performance we get from our C++ implementation. It is better than I would have expected a few years ago, but not enough. I suspect that if we could have greater control over IOCP handling - in particular, but not exclusively, taking more items at once from the completion port - we could eliminate one obstacle to greater use of .NET in our suite of services. Our sockets are each also associated with specific CPU cores, which reduces chatter across cores - we do not want our completion handling distributed across different cores.

I have only done a cursory examination of how this is implemented in .NET Core but from what I understand, completions are simply handled by the thread pool, triggered by single-item retrievals from the completion port on whatever thread gets the task.

To better meet our needs, we would be looking for functionality here that allows us to:

  1. associate a specific completion port with a specific Socket
  2. have completions from each completion port be processed on a specific thread, retrieved in bulk via GetQueuedCompletionStatusEx
  3. do so as efficiently as possible and, ideally, without allocating memory

Of course, we could always simply implement this ourselves via PInvoke but given that we already have a C++ implementation, this does not seem like a worthwhile investment. I hope this feedback can help bring .NET Core closer to our needs one day.
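
A rough sketch of what items 1 and 2 above amount to with the raw Win32 APIs (roughly what a PInvoke or C++ implementation would do today); this is not an existing .NET API, and names like PortLoop and ProcessPacket are placeholders. Thread affinity and error handling are omitted for brevity.

#include <winsock2.h>
#include <windows.h>

// Placeholder for the app's per-packet handling.
void ProcessPacket(SOCKET s, DWORD numBytes, LPOVERLAPPED pOverlapped);

struct PortLoop
{
    HANDLE iocp;
};

PortLoop* CreatePortLoop()
{
    PortLoop* loop = new PortLoop();
    // Concurrency value of 1: completions on this port are intended for a single thread.
    loop->iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 1);
    return loop;
}

// 1. Associate a specific socket with a specific completion port.
void AssociateSocket(PortLoop* loop, SOCKET s)
{
    CreateIoCompletionPort((HANDLE)s, loop->iocp, (ULONG_PTR)s, 0);
}

// 2. One dedicated thread per port (optionally affinitized to a core),
//    draining completions in bulk with GetQueuedCompletionStatusEx.
DWORD WINAPI PortThread(LPVOID arg)
{
    PortLoop* loop = (PortLoop*)arg;
    OVERLAPPED_ENTRY entries[128];
    ULONG removed;

    while (GetQueuedCompletionStatusEx(loop->iocp, entries, ARRAYSIZE(entries),
                                       &removed, INFINITE, FALSE))
    {
        for (ULONG i = 0; i < removed; i++)
        {
            ProcessPacket((SOCKET)entries[i].lpCompletionKey,
                          entries[i].dwNumberOfBytesTransferred,
                          entries[i].lpOverlapped);
        }
    }
    return 0;
}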

(Perhaps not the ideal thread to post this in - I will see if I can split it into more individual requests.)

@geoffkizer
Contributor

@sandersaares Thanks for posting this. As you can see in the discussion above, we've explored approaches like this in the past, but we haven't landed on a solution here.

I would love to get your benchmark code and play with it a bit. Can you post it somewhere or email it to me?

@sandersaares

The app I used for benchmarking the C# Socket performance is https://github.com/sandersaares/Shubar/

This app can generate the test input streams: grinch.zip

In my scenario I used 5 machines running the script in the zip to generate load for 1 benchmark app.

@davidfowl
Member

Ah, this is UDP; there are some low-hanging-fruit optimizations that can be made (#33418), though that's likely only part of the story here.

@geoffkizer
Contributor

The specific issue for UDP optimization is here: #30797

Feedback welcome.

@kouvel
Member

kouvel commented Aug 28, 2020

It might be interesting to try using GetQueuedCompletionStatusEx again and having it work similarly to how epoll threads work on Linux, only posting to the IO completion port rather than to the worker thread pool (to preserve the current prioritization).

@kouvel
Member

kouvel commented Apr 25, 2022

GetQueuedCompletionStatusEx is used after #64834. Using multiple IOCPs didn't seem to improve perf.

@kouvel kouvel closed this as completed Apr 25, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 26, 2022