Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in ILRewriter for a .NET 6 app in w3wp #5799

Closed
dustinsoftware opened this issue Jul 23, 2024 · 5 comments · Fixed by #5836
Closed

Crash in ILRewriter for a .NET 6 app in w3wp #5799

dustinsoftware opened this issue Jul 23, 2024 · 5 comments · Fixed by #5836

Comments

@dustinsoftware
Copy link

Describe the bug
We are observing a crash in the datadog .net tracer when updating from 2.20 to 2.55. 2.49 is also impacted. Unfortunately we are only able to repro this on one of our internal apps so I don't have a repo to share on this public issue, although I can provide a full crash dump and logs through a private channel if that would be helpful.

The code in question uses the Datadog.Trace.Annotations library, and the [Trace] attribute is on the method that is present in the stack when the crash occurs.

The method is an overload:

        public async Task PushUpdate(Guid deliveryId, DeliveryStatusUpdate statusUpdate, DeliveryStatusUpdateSource updateSource = DeliveryStatusUpdateSource.Webhook)
        {
...
        }

        public async Task PushUpdate(DeliveryStatusUpdate statusUpdate, DeliveryStatusUpdateSource updateSource = DeliveryStatusUpdateSource.Webhook)
        {
...
        }

        [Trace] // Trace applies to method name not signature, all three methods will be instrumented.
        public async Task PushUpdate(Delivery delivery, DeliveryStatusUpdate statusUpdate, DeliveryStatusUpdateSource updateSource = DeliveryStatusUpdateSource.Webhook)
        {
...
        }

When the crash occurs, we see this in the Application logs:

Faulting application name: w3wp.exe, version: 8.5.9600.16384, time stamp: 0x5215df96
Faulting module name: coreclr.dll, version: 6.0.3224.31407, time stamp: 0x666c84f2
Exception code: 0xc0000005
Fault offset: 0x00000000001d4089
Faulting process id: 0x66a8
Faulting application start time: 0x01dadd0fc8fe1bfc
Faulting application path: c:\\windows\\system32\\inetsrv\\w3wp.exe
Faulting module path: C:\\Program Files\\dotnet\\shared\\Microsoft.NETCore.App\\6.0.32\\coreclr.dll
Report Id: 20f7d9c8-4903-11ef-8191-1204dc7bc319
Faulting package full name: 
Faulting package-relative application ID: "

Capturing a full process dump and loading in visual studio, we see:

Unhandled exception at 0x00007FF88569386D (Datadog.Tracer.Native.dll) in w3wp.exe_240723_150807.dmp: 0xC0000005: Access violation reading location 0x0000000000000014.

The full stack with symbols loaded (2.49)

>	[Inline Frame] Datadog.Tracer.Native.dll!ILRewriter::Export::__l71::<lambda_1>::operator()(EHClause) Line 687	C++
 	[Inline Frame] Datadog.Tracer.Native.dll!std::_Insertion_sort_unchecked(EHClause * const) Line 7912	C++
 	Datadog.Tracer.Native.dll!std::_Sort_unchecked<EHClause *,`ILRewriter::Export'::`71'::<lambda_1>>(EHClause * _First, EHClause * _Last, __int64 _Ideal, ILRewriter::Export::__l71::<lambda_1> _Pred) Line 8037	C++
 	[Inline Frame] Datadog.Tracer.Native.dll!std::sort(EHClause * const _Last, EHClause * const) Line 8067	C++
 	Datadog.Tracer.Native.dll!ILRewriter::Export() Line 691	C++
 	Datadog.Tracer.Native.dll!trace::TracerMethodRewriter::Rewrite(trace::RejitHandlerModule * moduleHandler, trace::RejitHandlerModuleMethod * methodHandler, ICorProfilerFunctionControl * pFunctionControl) Line 696	C++
 	Datadog.Tracer.Native.dll!trace::RejitHandler::NotifyReJITParameters(unsigned __int64 moduleId, unsigned int methodId, ICorProfilerFunctionControl * pFunctionControl) Line 622	C++
 	Datadog.Tracer.Native.dll!trace::CorProfiler::GetReJITParameters(unsigned __int64 moduleId, unsigned int methodId, ICorProfilerFunctionControl * pFunctionControl) Line 3884	C++

To Reproduce

  • Install the datadog .NET tracer version either 2.49 or 2.55
  • Run procdump64 -ma -e 1 -l -f c0000005 -g <pid> to capture the dump when the crash happens
  • Run the affected endpoint, observe w3wp crash
  • Set DD_TRACE_ENABLED=0 in web.config
  • Observe no crash

Screenshots
Screenshot 2024-07-23 at 11 57 58

Runtime environment (please complete the following information):

  • Instrumentation mode: msi
  • Tracer version: 2.55
  • OS: Windows Server 2012 R2
  • CLR: .NET 6.0.32

Additional context
We have several apps using 2.49 in production and have only observed this issue on .NET 6 in this one test environment.

Unfortunately, later in the day when we went to capture a process dump for 2.55, we discovered that the crash was no longer reproducible. We only captured a process dump with the 2.49 tracer in memory. We'll keep an eye out for the crash with 2.55 installed again, and capture a process dump when that occurs.

@kevingosse
Copy link
Collaborator

Hello 🙂 Sorry for the crash and thanks for the thorough preliminary investigation. It would really help if you could send us the memory dump so I can look at the method being rewritten. Any chance you could open a support ticket, to have a private channel to send the dump?

@dustinsoftware
Copy link
Author

Reported as https://help.datadoghq.com/hc/requests/1782753

Spotted something very curious in the logs right before the crash:

7/23/24 03:09:23.844 PM [35196|32292] [info] RegisterCallTargetDefinitions: Added 397 call targets (enabled: 393, enabled categories: 1) 
07/23/24 03:09:23.955 PM [35196|36572] [info] Method enqueued for ReJIT for Olo.Dispatch.Logic.LiveStatusUpdater.PushUpdate(18446744073709551615 params).
07/23/24 03:09:23.955 PM [35196|36572] [info] Method enqueued for ReJIT for Olo.Dispatch.Logic.LiveStatusUpdater.PushUpdate(18446744073709551615 params).
07/23/24 03:09:23.956 PM [35196|36572] [info] Method enqueued for ReJIT for Olo.Dispatch.Logic.LiveStatusUpdater.PushUpdate(18446744073709551615 params).
07/23/24 03:09:23.956 PM [35196|36572] [info] Method enqueued for ReJIT for Olo.Dispatch.Logic.Events.StatusTransitionReader.HandleEventMessage(18446744073709551615 params).
07/23/24 03:09:23.956 PM [35196|36572] [info] Request ReJIT done for 4 methods

@dustinsoftware
Copy link
Author

I was able to repro on 2.55 again this morning with a fresh set of eyes, so I've attached a crash dump to the support ticket for that as well. :)

@dustinsoftware
Copy link
Author

Update: we were able to work around the crash by setting the app pool mode to "No Managed Code", since this is a .NET 6 app. The default IIS app pool setting is unfortunate for new .NET apps, and was missed in this one testing environment.

The app works fine otherwise with the tracer disabled and the incorrect ".NET CLR" setting for the app pool.

image

@kevingosse
Copy link
Collaborator

kevingosse commented Jul 31, 2024

Thanks for the dumps. We're unclear on the root cause (though running both .NET Framework and .NET Core in the same pool is known to mess up with our instrumentation, so that could be it), but we've identified the logic error that caused the tracer to crash instead of handling the error condition gracefully. We're working on a fix.

kevingosse added a commit that referenced this issue Aug 2, 2024
## Summary of changes

Fix error checking when emitting the CallTargetBubbleUpException code.
As a bonus, add synchronization when creating the metadata tokens.

## Reason for change

If for some reason we fail to create the metadata tokens, we could
create an invalid EH clause, which causes an access violation later on.

## Implementation details

Converted the calltarget_tokens mutex to a recursive mutex, because the
`EnsureBaseCalltargetTokens` methods call each other.

## Test coverage

Hard to test, but coincidentally I discovered that the the added
synchronization fixes a crash when running powershell with SentinelOne
activated.

## Other details

Fixes #5799
kevingosse added a commit that referenced this issue Aug 8, 2024
## Summary of changes

Fix error checking when emitting the CallTargetBubbleUpException code.
As a bonus, add synchronization when creating the metadata tokens.

## Reason for change

If for some reason we fail to create the metadata tokens, we could
create an invalid EH clause, which causes an access violation later on.

## Implementation details

Converted the calltarget_tokens mutex to a recursive mutex, because the
`EnsureBaseCalltargetTokens` methods call each other.

## Test coverage

Hard to test, but coincidentally I discovered that the the added
synchronization fixes a crash when running powershell with SentinelOne
activated.

## Other details

Fixes #5799
Backport of #5836 to v2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants