Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve software exception handling performance #108480

Merged
merged 5 commits into from
Oct 4, 2024

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Oct 2, 2024

I have recently discovered that large part of the time spent in the EH while handling software exception is in the RtlLookupFunctionEntry Windows API. That API is called when unwinding native (non-managed) frames on stack. The current implementation of the IL_Throw JIT helper that is used to throw managed exceptions needs three unwinds to get to the managed caller. Due to the two passes of EH, it means there are six calls made to that API per throw.
This change reduces that to just a single unwind, which leads to about 12% improvement in single threaded scenarios and about 30% improvement in multi-threaded one for a case when an exception is thrown over 10 managed frames and then caught on Windows. It also leads to similar improvements in async exception handling performance. I have also run the dotnet/performance EH tests and they have shown the same improvements.

An additional benefit of this change is removal of the helper method frame from IL_Throw and IL_Rethrow, which contributes to the current effort of getting rid of the helper method frames.

In a follow up change, I am planning to make similar change to the FCThrow helper and its variants, further reducing the usage of the helper method frames.

I have made this change for the new EH only.

@janvorli janvorli added this to the 10.0.0 milestone Oct 2, 2024
@janvorli janvorli requested a review from jkotas October 2, 2024 12:30
@janvorli janvorli self-assigned this Oct 2, 2024
@jkotas
Copy link
Member

jkotas commented Oct 2, 2024

I am planning to make similar change to the FCThrow helper and its variants,

This should not be needed. We are working on deleting FCThrow.

#ifndef DACCESS_COMPILE
//
// Init a new frame
// funcCallDepth: the number of frames to skip when looking for the return address.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: funCallDepth vs. funcCallDepth

Do we need this argument at all? It is always 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be passed in > 1 for the follow up change that will modify FCThrow to use this new way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need to worry about FCThrows. They will be gone soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I have missed your comment on the plan to remove the FCThrow helpers. Then I can simplify it.

src/coreclr/vm/fcall.h Outdated Show resolved Hide resolved
@janvorli
Copy link
Member Author

janvorli commented Oct 2, 2024

I've also measured the performance improvement on Linux under WSL2 on the same machine where I've ran the Windows test. The single threaded results are exactly the same as the ones on Windows, but for multi-threaded case, this change improves the performance 3 times! The same is true for multi-threaded async EH.

Comment on lines 908 to 909
#define FC_CAN_TRIGGER_GC_HAVE_THREAD(thread)
#define FC_CAN_TRIGGER_GC_HAVE_THREADEND(thread)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#define FC_CAN_TRIGGER_GC_HAVE_THREAD(thread)
#define FC_CAN_TRIGGER_GC_HAVE_THREADEND(thread)

These are debug-only macros. We do not need special optimized variants that take Thread*.

(Also, passing around current Thread* mattered a lot before current thread getter was converted to a regular thread static. It matters a lot less now.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, passing around current Thread* mattered a lot before current thread getter was converted to a regular thread static. It matters a lot less now

I have measured a noticeable difference in perf when the GetThread was called multiple times in the IL_Throw.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are debug-only macros. We do not need special optimized variants that take Thread*.

Ok, makes sense

Copy link
Member Author

@janvorli janvorli Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, they are actually not debug only, they are defined as empty for !ENABLE_CONTRACTS Right, the contracts are debug only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have measured a noticeable difference in perf when the GetThread was called multiple times in the IL_Throw.

GetThread should be just a couple of instructions and the C++ compiler should be able to CSE redundant calls. I would not expect that we are at the point where a couple more instructions in IL_Throw makes a noticeable difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, they are actually not debug only, they are defined as empty for !ENABLE_CONTRACTS

ENABLE_CONTRACTS == DEBUG

@@ -799,6 +799,22 @@ LPVOID __FCThrowArgument(LPVOID me, enum RuntimeExceptionKind reKind, LPCWSTR ar
#define HELPER_METHOD_FRAME_GET_RETURN_ADDRESS() \
( static_cast<UINT_PTR>( (__helperframe.InsureInit(NULL)), (__helperframe.MachineState()->GetRetAddr()) ) )

#define EXCEPTION_METHOD_FRAME_BEGIN(funCallDepth) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this needs to be wrapped in a macro. Inlining the code to two places where it is needed would be more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be used in more places with the follow up changes - in the FCThrow variants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I've missed your comment on the fact that FCThrow will go away. I am not sure what the plan is though. For example, the JIT_Div throws uses the FCThrow when there is division by zero or overflow. And there are other throwing helpers like JIT_ThrowMethodAccessException, JIT_ThrowFieldAccessException. And also some throw helpers that are QCALLs now ExceptionNative_ThrowEntryPointNotFoundException that I think could also get this improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the JIT_Div throws uses the FCThrow when there is division by zero or overflow.

We will convert them into the mix of C#, FCalls without HMF, and QCalls as needed. For example, check #102129 that converted multiplication helpers that followed very similar pattern.

And also some throw helpers that are QCALLs now ExceptionNative_ThrowEntryPointNotFoundException that I think could also get this improvement.

Any QCall can throw exception. If we were to improve something about QCalls, it should be a scheme that applies to all QCalls. My preference would be switch QCalls to an ordinary exception marshalling scheme (ie it is what code outside runtime needs to do to marshal exceptions between managed/unmanaged): store the exception on the unmanaged side in thread-local storage and rethrow it on the managed side.

If we care enough about improving perf of specific exception-throwing QCalls, we should just move more of them to C#. There is no point in throwing C++ exception, catching it, and rethrowing it as managed exception. We can throw the managed exception directly from managed code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense, so I have expanded the macros to the two locations they were used in in my last commit.

@jkotas
Copy link
Member

jkotas commented Oct 2, 2024

This change reduces that to just a single unwind,

I am wondering what it would take to get rid of the one remaining unwind. It would require asm helper, but we have that asm helper for native AOT already. Can we re-use it (maybe with some ifdefs for now) to get rid of the one remaining unwind?

We are likely going to share some of the native AOT asm helpers with CoreCLR to get some of the HMFs.

@janvorli
Copy link
Member Author

janvorli commented Oct 2, 2024

I am wondering what it would take to get rid of the one remaining unwind.

I think it should not be difficult. The asm helper would likely be quite different from the NativeAOT one though, so it doesn't seem like it would make sense sharing it. I'll try to make such change, but I'd prefer doing it as a separate change.

I have recently discovered that large part of the time spent in the EH
while handling software exception is in the RtlLookupFunctionEntry
Windows API. That API is called when unwinding native (non-managed)
frames on stack. The current implementation of the IL_Throw JIT helper
that is used to throw managed exceptions needs three unwinds to get to
the managed caller. Due to the two passes of EH, it means there are six
calls made to that API per throw.
This change reduces that to just a single unwind, which leads to about
12% improvement in single threaded scenarios and about 30% improvement
in multi threaded one for a case when an exception is thrown over 10
managed frames and then caught. It also leads to similar improvements in
async exception handling performance.

An additional benefit of this change is removal of the helper method
frame from IL_Throw and IL_Rethrow, which contributes to the current
effort of getting rid of the helper method frames.

In a follow up changes, I am planning to make similar change to the
FCThrow helper and its variants, further reducing the usage of the
helper method frames.

I have made this change for the new EH only.
@janvorli janvorli force-pushed the improve-software-eh-performance branch from 9559a8a to f941262 Compare October 3, 2024 13:13
@janvorli
Copy link
Member Author

janvorli commented Oct 3, 2024

It seems like I've broken something while making the changes based on the PR feedback

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jkotas jkotas added the tenet-performance Performance related issue label Oct 3, 2024
@janvorli
Copy link
Member Author

janvorli commented Oct 3, 2024

/azp run runtime-coreclr outerloop gcstress0x3-gcstress0xc

Copy link

No pipelines are associated with this pull request.

@janvorli
Copy link
Member Author

janvorli commented Oct 4, 2024

/azp run runtime-coreclr outerloop, runtime-coreclr gcstress0x3-gcstress0xc

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@janvorli janvorli merged commit 1fa1745 into dotnet:main Oct 4, 2024
156 of 158 checks passed
sirntar pushed a commit to sirntar/runtime that referenced this pull request Oct 8, 2024
* Improve software exception handling performance

I have recently discovered that large part of the time spent in the EH
while handling software exception is in the RtlLookupFunctionEntry
Windows API. That API is called when unwinding native (non-managed)
frames on stack. The current implementation of the IL_Throw JIT helper
that is used to throw managed exceptions needs three unwinds to get to
the managed caller. Due to the two passes of EH, it means there are six
calls made to that API per throw.
This change reduces that to just a single unwind, which leads to about
12% improvement in single threaded scenarios and about 30% improvement
in multi threaded one for a case when an exception is thrown over 10
managed frames and then caught. It also leads to similar improvements in
async exception handling performance. On Linux, the multi-threaded gain is much higher,
I've measured 3 fold improvement.

An additional benefit of this change is removal of the helper method
frame from IL_Throw and IL_Rethrow, which contributes to the current
effort of getting rid of the helper method frames.

I have made this change for the new EH only.
@EgorBo
Copy link
Member

EgorBo commented Oct 10, 2024

@janvorli quite a few improvements in dotnet/performance, nice

@github-actions github-actions bot locked and limited conversation to collaborators Nov 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants