Optimize threadpool paths in CoreCLR used by ASP.NET #8299

Closed

vancem opened this issue Jun 6, 2017 · 12 comments

Comments

@vancem (Contributor) commented Jun 6, 2017

See the profile

\\clrmain\tools\PerfView.exe \\clrmain\public\PerfInvestigations\17-01-25.AspNet.TechEmpower\Data.05-30\json.wrk.Asp.2.0.sockets.CPU.1040K.18s.coreclr.perfView.xml.zip

You will see that

module coreclr <<coreclr!Thread::intermediateThreadProc>>

uses 1351 msec (approximately 1-2% of the total CPU time in a TechEmpower benchmark). There is just a lot of 'goo' between the OS getting a callback on a socket and the managed code getting called.
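For context, the managed end of this path is the SocketAsyncEventArgs completion callback; a minimal loopback sketch of that pattern is below (purely illustrative, not how Kestrel itself is structured):

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

class SaeaSketch
{
    static void Main()
    {
        // Loopback socket pair so the receive completes through the async I/O path.
        var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
        listener.Listen(1);

        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        client.Connect(listener.LocalEndPoint);
        Socket server = listener.Accept();

        using var done = new ManualResetEventSlim();
        var args = new SocketAsyncEventArgs();
        args.SetBuffer(new byte[64], 0, 64);
        args.Completed += (_, e) =>
        {
            // This handler is the managed code that eventually runs after the OS
            // signals the socket completion; everything before it is the 'goo'.
            Console.WriteLine($"Received {e.BytesTransferred} bytes");
            done.Set();
        };

        if (!server.ReceiveAsync(args))
            done.Set(); // completed synchronously; Completed is not raised in that case

        client.Send(new byte[] { 1, 2, 3 });
        done.Wait();
    }
}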

@stephentoub @geoffkizer

@geoffkizer (Contributor)

To be clear, this is Linux-specific, right?

@vancem (Contributor, Author) commented Jun 6, 2017

Actually we want both. The traces I have are actually for Windows, but you make a good point that we care about Linux more.

@geoffkizer (Contributor)

Sorry, I should have just looked at the trace :)

This is with thread dispatch in Kestrel, right? It looks like some of this is related to thread dispatch. That said, most seems related to I/O completion.

This is particularly worrisome:

Name Inc % Inc Exc % Exc

  • coreclr!ThreadpoolMgr::CompletionPortThreadStart 38.4 4,139 2.1 227
    |+ coreclr!BindIoCompletionCallbackStub 33.6 3,618 0.8 85
    ||+ coreclr!ManagedThreadBase::ThreadPool 31.0 3,339 0.3 28
    |||+ coreclr!ManagedThreadBase_DispatchOuter 30.7 3,311 0.3 35
    ||| + coreclr!ManagedThreadBase_DispatchMiddle 30.1 3,248 0.4 43
    ||| |+ coreclr!ManagedThreadBase_DispatchInner 28.9 3,110 0.3 37
    ||| ||+ coreclr!BindIoCompletionCallBack_Worker 28.5 3,071 0.8 90
    ||| |||+ coreclr!DispatchCallSimple 27.4 2,953 0.6 63
    ||| ||||+ coreclr!CallDescrWorkerInternal 26.7 2,874 0.6 61
    ||| |||||+ system.private.corelib.ni!_IOCompletionCallback.PerformIOCompletionCallback 26.1 2,813 0.0 0
    ||| ||||| + system.net.sockets!System.Net.Sockets.SocketAsyncEventArgs+<>c.<.cctor>b__200_0(UInt32, UInt32, System.Threading.NativeOverlapped*) 24.6 2,650 0.0 0

Between coreclr!ThreadpoolMgr::CompletionPortThreadStart (38.4% inclusive) and system.net.sockets!System.Net.Sockets.SocketAsyncEventArgs+<>c.<.cctor>b__200_0(UInt32, UInt32, System.Threading.NativeOverlapped*) (24.6% inclusive), something is taking up 13.8% of CPU time. That seems huge.

That said, in past runs I have usually not seen this so high. (Those were typically Plaintext runs or similar.) Do you see this consistently across runs?

@kouvel who may have some insight here.

@geoffkizer (Contributor)

BTW @vancem: You said this is 1-2%, but if I'm reading the trace right, it's actually more like 13%, right? 1351 out of 10774? Am I missing something?

@vancem (Contributor, Author) commented Jun 6, 2017

The trace above was filtered to show only the costs in coreclr.dll.
Here is the complete trace that the filtered one was derived from:

\\clrmain\tools\PerfView.exe \\clrmain\public\PerfInvestigations\17-01-25.AspNet.TechEmpower\Data.05-30\json.wrk.Asp.2.0.sockets.CPU.1040K.18s.etl.zip

Yes, it is 13% of the cost in CoreCLR (which is a lot), but the total CPU for the process is 71K msec (not 10K), so you have to divide by roughly 7, which is what gets you to the 1-2% (1,351 / ~71,000 ≈ 1.9%).

@danmoseley (Member)

@stephentoub @vancem is there still work here? This is marked 2.1 and I wonder whether it is superseded or should be marked future etc.

@stephentoub (Member)

It's still relevant.

@jkotas jkotas changed the title Optimize I/O thread paths int CoreCLR used by ASP.NET Optimize threadpool paths int CoreCLR used by ASP.NET Dec 2, 2017
@jkotas jkotas changed the title Optimize threadpool paths int CoreCLR used by ASP.NET Optimize threadpool paths in CoreCLR used by ASP.NET Dec 2, 2017
@benaadams (Member) commented Nov 26, 2018

Might be reading this wrong, but is it resolving the managed callsite each time?

If so, is this due to the AppDomain being given as the first param to ManagedThreadBase::ThreadPool, which then calls ManagedThreadBase_FullTransitionWithAD?

e.g.

ArgIteratorTemplate<ArgIteratorBase>::ForceSigWalk
ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg
MetaSig::GetReturnTypeNormalized
SigPointer::PeekElemTypeNormalized
MetaSig::Init
SigParser::SkipExactlyOne
CorSigUncompressData
MethodDesc::IsVtableMethod
MscorlibBinder::GetSignatureLocal
MethodDesc::GetMethodEntryPoint

void ManagedThreadBase::ThreadPool(ADID pAppDomain, Context::ADCallBackFcnType pTarget, LPVOID args)
{
    WRAPPER_NO_CONTRACT;
    ManagedThreadBase_FullTransitionWithAD(pAppDomain, pTarget, args, ThreadPoolThread);
}

from

    coreclr.dll!ThreadpoolMgr::WorkerThreadStart
  |- coreclr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem
  |    |- coreclr.dll!ManagedThreadBase_DispatchOuter
  |    |    |- coreclr.dll!ManagedThreadBase_DispatchMiddle
  |    |    |    |- coreclr.dll!QueueUserWorkItemManagedCallback
  |    |    |    |    |- coreclr.dll!MethodDescCallSite::CallTargetWorker
 *|    |    |    |    |    |- coreclr.dll!CallDescrWorkerInternal  *** ThreadPoolWorkQueue.Dispatch
  |    |    |    |    |    |- coreclr.dll!__chkstk
  |    |    |    |    |    |- coreclr.dll!memcpy
  |    |    |    |    |    |- coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg
  |    |    |    |    |    |- coreclr.dll!MoveSmall8
  |    |    |    |    |    |- coreclr.dll!__security_check_cookie
  |    |    |    |    |- coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::ForceSigWalk
  |    |    |    |    |    |- coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg
  |    |    |    |    |    |    |- coreclr.dll!MetaSig::GetReturnTypeNormalized
  |    |    |    |    |    |    |    |- coreclr.dll!SigPointer::PeekElemTypeNormalized
  |    |    |    |    |- coreclr.dll!MetaSig::Init
  |    |    |    |    |    |- coreclr.dll!SigParser::SkipExactlyOne
  |    |    |    |    |    |- coreclr.dll!CorSigUncompressData
  |    |    |    |    |- coreclr.dll!MethodDesc::IsVtableMethod
  |    |    |    |    |- coreclr.dll!MscorlibBinder::GetSignatureLocal
  |    |    |    |    |- coreclr.dll!MethodDesc::GetMethodEntryPoint
  |    |    |    |- ntdll.dll!LdrpDispatchUserCallTarget
  |- coreclr.dll!McTemplateCoU0qqh
  |- coreclr.dll!CLRLifoSemaphore::WaitForSignal
  |    |- KernelBase.dll!GetQueuedCompletionStatus
  |    |    |- ntdll.dll!NtRemoveIoCompletion
  |    |- kernel32.dll!GetQueuedCompletionStatusStub
  |- ntdll.dll!LdrpDispatchUserCallTarget
  |- KernelBase.dll!SetThreadLocale
  |- KernelBase.dll!GetThreadLocale
  |- coreclr.dll!ManagedPerAppDomainTPCount::TakeActiveRequest
  |- coreclr.dll!SVR::GCHeap::IsGCInProgressHelper
  |- kernel32.dll!GetTickCountKernel32
  |- kernel32.dll!SleepEx
  |- coreclr.dll!GCCoop::GCCoop
  |- coreclr.dll!CExecutionEngine::ClrSleepEx
  |- kernel32.dll!GetThreadLocaleStub
  |- kernel32.dll!SetThreadLocaleStub
  |- coreclr.dll!UnManagedPerAppDomainTPCount::DispatchWorkItem
  |- coreclr.dll!ManagedPerAppDomainTPCount::IsRequestPending
  |- coreclr.dll!ThreadpoolMgr::MaybeAddWorkingWorker
  |- coreclr.dll!CLRLifoSemaphore::Wait
  |- coreclr.dll!GCCoop::~GCCoop

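The pattern in question, resolving call-site and signature information on every dispatch rather than once, can be illustrated at the managed level with reflection; the sketch below is only an analogy to the native MethodDescCallSite path, not the same machinery.

using System;
using System.Diagnostics;
using System.Reflection;

class CallSiteCacheSketch
{
    public static int Work(int x) => x + 1;

    static void Main()
    {
        const int N = 1_000_000;
        MethodInfo mi = typeof(CallSiteCacheSketch).GetMethod(nameof(Work));

        // Re-resolve the call on every invocation (argument boxing, signature checks each time).
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            mi.Invoke(null, new object[] { i });
        Console.WriteLine($"Invoke per call: {sw.ElapsedMilliseconds} ms");

        // Resolve once, cache a typed call site, and reuse it.
        var cached = (Func<int, int>)mi.CreateDelegate(typeof(Func<int, int>));
        sw.Restart();
        for (int i = 0; i < N; i++)
            cached(i);
        Console.WriteLine($"Cached delegate: {sw.ElapsedMilliseconds} ms");
    }
}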

@benaadams (Member)

The native preamble looks to be about 5% of the cost of running something on a ThreadPool thread:

  100%   ManagedPerAppDomainTPCount::DispatchWorkItem  •  16,951 ms  •  coreclr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem
    99.6%   ManagedThreadBase_DispatchOuter  •  16,882 ms  •  coreclr.dll!ManagedThreadBase_DispatchOuter
      99.3%   ManagedThreadBase_DispatchMiddle  •  16,827 ms  •  coreclr.dll!ManagedThreadBase_DispatchMiddle
        98.9%   QueueUserWorkItemManagedCallback  •  16,764 ms  •  coreclr.dll!QueueUserWorkItemManagedCallback
          96.2%   MethodDescCallSite::CallTargetWorker  •  16,305 ms  •  coreclr.dll!MethodDescCallSite::CallTargetWorker
            95.3%   CallDescrWorkerInternal  •  16,153 ms  •  coreclr.dll!CallDescrWorkerInternal
---->       ► 94.8%   Dispatch  •  16,073 ms  •  System.Threading.ThreadPoolWorkQueue.Dispatch()
            ► 0.13%   PerformWaitCallback  •  22 ms  •  System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
          ► 0.15%   __chkstk  •  25 ms  •  coreclr.dll!__chkstk
          ► 0.12%   memcpy  •  20 ms  •  coreclr.dll!memcpy
          ► 0.07%   ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg  •  11 ms  •  coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg
          ► 0.01%   __security_check_cookie  •  2.5 ms  •  coreclr.dll!__security_check_cookie
        ► 0.44%   ArgIteratorTemplate<ArgIteratorBase>::ForceSigWalk  •  75 ms  •  coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::ForceSigWalk
        ► 0.40%   MetaSig::Init  •  68 ms  •  coreclr.dll!MetaSig::Init
        ► 0.34%   MethodDesc::IsVtableMethod  •  57 ms  •  coreclr.dll!MethodDesc::IsVtableMethod
        ► 0.22%   MscorlibBinder::GetSignatureLocal  •  37 ms  •  coreclr.dll!MscorlibBinder::GetSignatureLocal
        ► 0.13%   MethodDesc::GetMethodEntryPoint  •  22 ms  •  coreclr.dll!MethodDesc::GetMethodEntryPoint
      ► 0.07%   ntdll.dll  •  12 ms

i.e. about 5% is already lost by the time it gets to System.Threading.ThreadPoolWorkQueue.Dispatch()
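One rough way to put a number on that per-item cost from the managed side is to queue a large batch of no-op work items and time them; a minimal sketch (the item count and no-op payload are just illustrative):

using System;
using System.Diagnostics;
using System.Threading;

class DispatchOverheadSketch
{
    static void Main()
    {
        const int N = 1_000_000;
        using var done = new CountdownEvent(N);
        WaitCallback nop = _ => done.Signal();   // near-empty work item body

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            ThreadPool.UnsafeQueueUserWorkItem(nop, null);
        done.Wait();
        sw.Stop();

        // With an empty callback, almost all of the measured time is queue plus
        // dispatch overhead (managed and native), not useful work.
        double nsPerItem = sw.Elapsed.TotalMilliseconds * 1_000_000 / N;
        Console.WriteLine($"{N:N0} items in {sw.ElapsedMilliseconds} ms ({nsPerItem:F0} ns/item)");
    }
}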

@benaadams (Member)

  100%   ThreadpoolMgr::WorkerThreadStart  •  193,478 ms  •  coreclr.dll!ThreadpoolMgr::WorkerThreadStart
  ► 53.7%   KernelBase.dll  •  103,971 ms
  ► 22.3%   CLRLifoSemaphore::Wait  •  43,085 ms  •  coreclr.dll!CLRLifoSemaphore::Wait
    8.76%   ManagedPerAppDomainTPCount::DispatchWorkItem  •  16,951 ms  •  coreclr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem
      8.73%   ManagedThreadBase_DispatchOuter  •  16,882 ms  •  coreclr.dll!ManagedThreadBase_DispatchOuter
        8.70%   ManagedThreadBase_DispatchMiddle  •  16,827 ms  •  coreclr.dll!ManagedThreadBase_DispatchMiddle
          8.66%   QueueUserWorkItemManagedCallback  •  16,764 ms  •  coreclr.dll!QueueUserWorkItemManagedCallback
            8.43%   MethodDescCallSite::CallTargetWorker  •  16,305 ms  •  coreclr.dll!MethodDescCallSite::CallTargetWorker
              8.35%   CallDescrWorkerInternal  •  16,153 ms  •  coreclr.dll!CallDescrWorkerInternal
---->         ► 8.31%   Dispatch  •  16,073 ms  •  System.Threading.ThreadPoolWorkQueue.Dispatch()
              ► 0.01%   [Unknown]  •  23 ms
              ► 0.01%   PerformWaitCallback  •  22 ms  •  System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
                0.01%   [Waiting for CPU]  •  13 ms
            ► 0.02%   [Unknown]  •  40 ms
            ► 0.01%   __chkstk  •  25 ms  •  coreclr.dll!__chkstk
            ► 0.01%   memcpy  •  20 ms  •  coreclr.dll!memcpy
              0.01%   [Waiting for CPU]  •  19 ms
            ► 0.01%   ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg  •  11 ms  •  coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg
            ► 0.00%   __security_check_cookie  •  2.5 ms  •  coreclr.dll!__security_check_cookie
          ► 0.06%   [Unknown]  •  113 ms
          ► 0.04%   ArgIteratorTemplate<ArgIteratorBase>::ForceSigWalk  •  75 ms  •  coreclr.dll!ArgIteratorTemplate<ArgIteratorBase>::ForceSigWalk
          ► 0.04%   MetaSig::Init  •  68 ms  •  coreclr.dll!MetaSig::Init
          ► 0.03%   MethodDesc::IsVtableMethod  •  57 ms  •  coreclr.dll!MethodDesc::IsVtableMethod
          ► 0.02%   MscorlibBinder::GetSignatureLocal  •  37 ms  •  coreclr.dll!MscorlibBinder::GetSignatureLocal
            0.02%   [Waiting for CPU]  •  33 ms
          ► 0.01%   MethodDesc::GetMethodEntryPoint  •  22 ms  •  coreclr.dll!MethodDesc::GetMethodEntryPoint
        ► 0.01%   [Unknown]  •  20 ms
          0.01%   [Waiting for CPU]  •  13 ms
        ► 0.01%   ntdll.dll  •  12 ms
      ► 0.01%   [Unknown]  •  18 ms
        0.01%   [Waiting for CPU]  •  14 ms
      0.01%   [Waiting for CPU]  •  20 ms
    ► 0.01%   [Unknown]  •  17 ms
  ► 7.52%   CLRLifoSemaphore::WaitForSignal  •  14,550 ms  •  coreclr.dll!CLRLifoSemaphore::WaitForSignal
  ► 6.19%   [Unknown]  •  11,977 ms
  ► 0.74%   DestroyThread  •  1,433 ms  •  coreclr.dll!DestroyThread
    0.24%   [Waiting for CPU]  •  471 ms
  ► 0.05%   ntdll.dll  •  99 ms
  ► 0.03%   UnManagedPerAppDomainTPCount::TakeActiveRequest  •  53 ms  •  coreclr.dll!UnManagedPerAppDomainTPCount::TakeActiveRequest
  ► 0.02%   kernel32.dll  •  47 ms
  ► 0.01%   WKS::GCHeap::IsGCInProgressHelper  •  25 ms  •  coreclr.dll!WKS::GCHeap::IsGCInProgressHelper
  ► 0.01%   GCCoop::GCCoop  •  15 ms  •  coreclr.dll!GCCoop::GCCoop
  ► 0.01%   GCHeapUtilities::WaitForGCCompletion  •  13 ms  •  coreclr.dll!GCHeapUtilities::WaitForGCCompletion
  ► 0.01%   UnManagedPerAppDomainTPCount::DispatchWorkItem  •  10 ms  •  coreclr.dll!UnManagedPerAppDomainTPCount::DispatchWorkItem
  ► 0.00%   GCCoop::~GCCoop  •  6.1 ms  •  coreclr.dll!GCCoop::~GCCoop
  ► 0.00%   CExecutionEngine::ClrSleepEx  •  5.9 ms  •  coreclr.dll!CExecutionEngine::ClrSleepEx
  ► 0.00%   ManagedPerAppDomainTPCount::IsRequestPending  •  4.9 ms  •  coreclr.dll!ManagedPerAppDomainTPCount::IsRequestPending
  ► 0.00%   ThreadpoolMgr::MaybeAddWorkingWorker  •  3.2 ms  •  coreclr.dll!ThreadpoolMgr::MaybeAddWorkingWorker
  ► 0.00%   coreclr.dll  •  0.1 ms

If you go up another level, then it loses 91.69% before it hits ThreadPoolWorkQueue.Dispatch; but I assume that's sleep time or something?

@benaadams (Member)

Linux is more like

Inc % Exc %                                                                                                         
100.0   0.0 + 2.27.so <<libpthread-2.27.so!start_thread>>                                                           
100.0  31.9  + libcoreclr.so <<libcoreclr.so!unknown>>                                                              
 38.3   4.8   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue::Dispatch()>>  
 21.8   4.1   + 2.27.so <<libc-2.27.so!__sched_yield>>                                                              
  4.4   0.6   + libcoreclrtraceptprovider.so <<libcoreclrtraceptprovider.so!unknown>>                               
  0.6   0.6   + kernel.kallsyms <<kernel.kallsyms!swapgs_restore_regs_and_return_to_usermode>>                      
  0.5   0.5   + 2.27.so <<ld-2.27.so!__tls_get_addr>>                                                               
  0.5   0.5   + tracepoint.so.0.0.0 <<liblttng-ust-tracepoint.so.0.0.0!tp_rcu_read_lock_bp>>                        
  0.4   0.4   + unknown <<unknown!/tmp/perf-4281.map>>                                                              
  0.3   0.1   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue::EnsureCurrentT
  0.1   0.1   + tracepoint.so.0.0.0 <<liblttng-ust-tracepoint.so.0.0.0!tp_rcu_read_unlock_bp>>                      
  0.1   0.1   + 2.27.so <<libpthread-2.27.so!__pthread_getspecific>>                                                
  0.1   0.1   +  <<stub GenerateResolveStub !/tmp/perf-4281.map>>                                                   
  0.1   0.1   +  <<stub AllocateTemporaryEntryPoints !/tmp/perf-4281.map>>                                          
  0.1   0.1   + ust.so.0.0.0 <<liblttng-ust.so.0.0.0!unknown>>                                                      
  0.1   0.1   + 2.27.so <<libc-2.27.so!unknown>>                                                                    
  0.1   0.1   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.QueueUserWorkItemCallbackDefaultCon
  0.1   0.1   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.QueueUserWorkItemCallback::ExecuteW
  0.1   0.1   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading._ThreadPoolWaitCallback::PerformWai
  0.0   0.0   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue::Dequeue(class 
  0.0   0.0   + 2.27.so <<libc-2.27.so!__clock_gettime>>                                                            
  0.0   0.0   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.QueueUserWorkItemCallbackDefaultCon
  0.0   0.0   + 2.27.so <<libpthread-2.27.so!__lll_unlock_wake>>                                                    
  0.0   0.0   + System.Net.Sockets <<System.Net.Sockets!System.Net.Sockets.SocketAsyncEventArgs::TransferCompletionC
  0.0   0.0   + tracepoint.so.0.0.0 <<liblttng-ust-tracepoint.so.0.0.0!tp_rcu_dereference_sym_bp>>                  
  0.0   0.0   + System.Net.Sockets <<System.Net.Sockets!System.Net.Sockets.SocketAsyncEventArgs::CompletionCallback(
  0.0   0.0   + tracepoint.so.0.0.0 <<liblttng-ust-tracepoint.so.0.0.0!__tls_get_addr@plt>>                         
  0.0   0.0   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.ExecutionContext::RunInternal(class
  0.0   0.0   + 2.27.so <<libpthread-2.27.so!__pthread_mutex_lock>>                                                 
  0.0   0.0   +  <<stub AllocateTemporaryEntryPoints !/tmp/perf-4281.map>>                                          
  0.0   0.0   + kernel.kallsyms <<kernel.kallsyms!hyperv_callback_vector>>                                          
  0.0   0.0   + System.Net.Sockets <<System.Net.Sockets!System.Net.Sockets.SocketAsyncContext+OperationQueue`1[Syste
  0.0   0.0   +  <<stub AllocateTemporaryEntryPoints !/tmp/perf-4281.map>>                                          
  0.0   0.0   + System.Net.Sockets <<System.Net.Sockets!System.Net.Sockets.SocketAsyncEventArgs::FinishOperationAsyn
  0.0   0.0   + System.Private.CoreLib <<System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue::EnsureThreadRe
  0.0   0.0   + System.Net.Sockets <<System.Net.Sockets!System.Net.Sockets.SocketAsyncContext+OperationQueue`1+<>c[S
  0.0   0.0   +  <<stub AllocateTemporaryEntryPoints !/tmp/perf-4281.map>>                                          

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 9, 2020
@kouvel (Member) commented Apr 25, 2022

I think most of the overhead has been eliminated by now through several changes. One was to remove the overhead of entering/exiting an AppDomain; there may have been others. A couple of other changes moved to a mostly managed implementation of the thread pool, which avoids the frequent long call chain, though there wasn't much overhead left there by then.
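The core idea of a mostly managed dispatch loop is that the worker thread dequeues and invokes work items without a native-to-managed transition per item. A highly simplified conceptual sketch (this is not the actual thread pool implementation):

using System;
using System.Collections.Concurrent;
using System.Threading;

// Conceptual sketch only: the dispatch loop stays entirely in managed code, so each
// work item costs a dequeue plus a delegate invocation rather than a trip through a
// native-to-managed call chain. Not the real thread pool implementation.
class ManagedDispatchLoopSketch
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();

    public ManagedDispatchLoopSketch(int workers)
    {
        for (int i = 0; i < workers; i++)
            new Thread(WorkerLoop) { IsBackground = true }.Start();
    }

    public void Queue(Action work) => _queue.Add(work);

    private void WorkerLoop()
    {
        foreach (Action work in _queue.GetConsumingEnumerable())
        {
            try { work(); }
            catch { /* a real pool would surface exceptions appropriately */ }
        }
    }
}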

@kouvel kouvel closed this as completed Apr 25, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 26, 2022