-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash tracking #5451
Crash tracking #5451
Conversation
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing the following branches/commits: Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (5451) - mean (75ms) : 63, 87
. : milestone, 75,
master - mean (75ms) : 64, 86
. : milestone, 75,
section CallTarget+Inlining+NGEN
This PR (5451) - mean (989ms) : 958, 1019
. : milestone, 989,
master - mean (998ms) : 959, 1036
. : milestone, 998,
gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (5451) - mean (110ms) : 108, 113
. : milestone, 110,
master - mean (110ms) : 106, 113
. : milestone, 110,
section CallTarget+Inlining+NGEN
This PR (5451) - mean (698ms) : 668, 728
. : milestone, 698,
master - mean (701ms) : 673, 728
. : milestone, 701,
gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (5451) - mean (92ms) : 90, 94
. : milestone, 92,
master - mean (94ms) : 90, 98
. : milestone, 94,
section CallTarget+Inlining+NGEN
This PR (5451) - mean (653ms) : 630, 677
. : milestone, 653,
master - mean (655ms) : 630, 679
. : milestone, 655,
gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (5451) - mean (191ms) : 185, 197
. : milestone, 191,
master - mean (190ms) : 186, 194
. : milestone, 190,
section CallTarget+Inlining+NGEN
This PR (5451) - mean (1,079ms) : 1051, 1106
. : milestone, 1079,
master - mean (1,079ms) : 1052, 1106
. : milestone, 1079,
gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (5451) - mean (277ms) : 270, 283
. : milestone, 277,
master - mean (276ms) : 271, 281
. : milestone, 276,
section CallTarget+Inlining+NGEN
This PR (5451) - mean (862ms) : 838, 886
. : milestone, 862,
master - mean (864ms) : 841, 888
. : milestone, 864,
gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat X
axisFormat %s
todayMarker off
section Baseline
This PR (5451) - mean (265ms) : 260, 271
. : milestone, 265,
master - mean (265ms) : 261, 269
. : milestone, 265,
section CallTarget+Inlining+NGEN
This PR (5451) - mean (855ms) : 822, 888
. : milestone, 855,
master - mean (845ms) : 824, 866
. : milestone, 845,
|
Datadog ReportBranch report: ✅ 0 Failed, 331892 Passed, 1836 Skipped, 14h 56m 57.81s Total Time |
Throughput/Crank Report:zap:Throughput results for AspNetCoreSimpleController comparing the following branches/commits: Cases where throughput results for the PR are worse than latest master (5% drop or greater), results are shown in red. Note that these results are based on a single point-in-time result for each branch. For full results, see one of the many, many dashboards! gantt
title Throughput Linux x64 (Total requests)
dateFormat X
axisFormat %s
section Baseline
This PR (5451) (11.872M) : 0, 11872182
master (11.849M) : 0, 11849466
benchmarks/2.9.0 (11.905M) : 0, 11905297
section Automatic
This PR (5451) (8.035M) : 0, 8035235
master (7.908M) : 0, 7907936
benchmarks/2.9.0 (8.378M) : 0, 8378100
section Trace stats
master (8.335M) : 0, 8334859
section Manual
This PR (5451) (10.208M) : 0, 10207518
master (10.049M) : 0, 10048746
section Manual + Automatic
This PR (5451) (7.591M) : 0, 7590568
master (7.580M) : 0, 7579532
section Version Conflict
master (6.733M) : 0, 6733343
gantt
title Throughput Linux arm64 (Total requests)
dateFormat X
axisFormat %s
section Baseline
This PR (5451) (9.651M) : 0, 9651452
master (9.620M) : 0, 9619876
benchmarks/2.9.0 (9.517M) : 0, 9516587
section Automatic
This PR (5451) (6.637M) : 0, 6637280
master (6.446M) : 0, 6446273
section Trace stats
master (6.992M) : 0, 6992127
section Manual
This PR (5451) (8.223M) : 0, 8222877
master (8.285M) : 0, 8285417
section Manual + Automatic
This PR (5451) (6.269M) : 0, 6269138
master (6.251M) : 0, 6251406
section Version Conflict
master (5.629M) : 0, 5629094
gantt
title Throughput Windows x64 (Total requests)
dateFormat X
axisFormat %s
section Baseline
This PR (5451) (9.697M) : 0, 9697494
master (9.953M) : 0, 9953121
benchmarks/2.9.0 (10.013M) : 0, 10013461
section Automatic
This PR (5451) (6.990M) : 0, 6989856
master (7.141M) : 0, 7141172
benchmarks/2.9.0 (7.404M) : 0, 7403989
section Trace stats
master (7.366M) : 0, 7366104
section Manual
This PR (5451) (8.593M) : 0, 8592975
master (8.746M) : 0, 8745815
section Manual + Automatic
This PR (5451) (6.698M) : 0, 6698134
master (6.776M) : 0, 6776058
section Version Conflict
master (6.176M) : 0, 6175582
|
Benchmarks Report for tracer 🐌Benchmarks for #5451 compared to master:
The following thresholds were used for comparing the benchmark speeds:
Allocation changes below 0.5% are ignored. Benchmark detailsBenchmarks.Trace.ActivityBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.GraphQLBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.SpanBenchmark - Same speed ✔️ Same allocations ✔️Raw results
Benchmarks.Trace.TraceAnnotationsBenchmark - Same speed ✔️ Same allocations ✔️Raw results
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:blindfold:
|
||
private static bool ShouldRedactFrame(string? assemblyName) | ||
{ | ||
// It would be nice to get those names directly from the source-generated InstrumentationDefinitions.IsInstrumentedAssembly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move this out of the source generation into a Nuke step? 🤔 That would make it easier to share, but adds a bit of extra noise dev time when adding an integration (like the trimming/nullability files that we all forget to update). Not this PR obviously, but this is going to go out of date as soon as a new integration is added, and we can't expect people to just remember to update this, especially if there's no explicit tests for it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a job for future-one-of-us
{ | ||
var value = Environment.GetEnvironmentVariable(ConfigurationKeys.Telemetry.Enabled); | ||
|
||
if (string.Equals(value, "false", StringComparison.OrdinalIgnoreCase) || value == "0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh, surprised we don't have a helper for this on our IConfigurationSource
in dd_dotnet
|
||
private static bool IsTelemetryEnabled() | ||
{ | ||
var value = Environment.GetEnvironmentVariable(ConfigurationKeys.Telemetry.Enabled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should be using the full ConfigurationSource
here, incase the variable is set elsewhere (e.g. in json). But maybe that's too much overhead/risk to be thinking about at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not sure it's worth the complexity
// DD_TRACE_CRASH_HANDLER_PASSTHROUGH environment variable, which codifies the result of the | ||
// "was COMPlus_DbgEnableMiniDump set?" check. | ||
|
||
SkipOn.Platform(SkipOn.PlatformValue.MacOs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of this works on Windows either, right? If it does, we should add the RunOnWindows
trait so that we test it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's linux only.
|
||
File.Exists(reportFile.Path).Should().BeTrue(); | ||
|
||
var report = JObject.Parse(reportFile.GetContent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's worth snapshot testing this? Probably not due to too much variation in runtimes, but just a thought 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost everything will be different from one run to the other, so yuck
]; | ||
} | ||
|
||
private class TemporaryFile : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Seems like something genuinely useful to put into TestHelpers
return Path.Combine(EnvironmentHelper.GetMonitoringHomePath(), rid, RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "dd-dotnet.exe" : "dd-dotnet"); | ||
} | ||
|
||
internal static bool IsAlpine() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CI, we also just set a variable that you can check 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I prefer this way. At least it doesn't break in mysterious way when you run it locally and forget to set the environment variable.
|
||
auto resolved = resolveManagedCallstack(tid, context, &managedCallstack, &numberOfManagedFrames); | ||
|
||
std::vector<StackFrame> managedFrames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can call reserve(numberOfManagedFrames)
on managedFrames
to preallocate and avoid temporary allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4c07518
{ | ||
for (int i = 0; i < numberOfManagedFrames; i++) | ||
{ | ||
auto managedFrame = managedCallstack[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use auto const& managedFrame
to avoid the copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4c07518
stackFrame.symbolAddress = managedFrame.symbolAddress; | ||
stackFrame.isSuspicious = managedFrame.isSuspicious; | ||
|
||
managedFrames.push_back(stackFrame); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use std::move(stackFrame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4c07518
} | ||
|
||
// TODO: Check if the stacktrace is from the tracer or the profiler | ||
stackFrame.isSuspicious = false; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on line 238, you can use std::move(stackFrame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4c07518
|
||
std::vector<StackFrame> CrashReportingLinux::MergeFrames(const std::vector<StackFrame>& nativeFrames, const std::vector<StackFrame>& managedFrames) | ||
{ | ||
std::vector<StackFrame> result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can call reserve(std::max(nativeFrames.size(), managedFrames.size())
to preallocate and avoid temporary allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4c07518
4c07518
to
c64dafe
Compare
c64dafe
to
5bb8348
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superb job
Summary of changes
Monitor crashes in instrumented applications. If there is suspicion that the crash might be caused by us, send a crash report through telemetry logs.
We currently only support Linux.
The crash handler is enabled by setting the
DD_TRACE_CRASH_HANDLER
environment variable and making it point to dd-dotnet. The goal is to have SSI automatically set that environment variable.Reason for change
When we crash customer applications, it can take them days/weeks before they narrow it down to Datadog and open a support ticket. With this, hopefully we can be more proactive.
Implementation details
On Linux, there is no good way to know if a .NET application is crashing (listening to signals is not enough because .NET use segfault signals even for NullReferenceException that are going to be caught). The solution we found is to use LD_PRELOAD to hook the
execve
syscall to find out if .NET is trying to callcreatedump
. If that's the case, we redirect the call todd-dotnet
instead.In
dd-dotnet
, we use ClrMD to find the managed exception that caused the crash (if it's a managed crash). To resolve the callstack, we use some native code (written in the profiler) relying on libunwind. The native code callbacks the managed code to resolve the managed frames using ClrMD.All the data we collect is sent to libdatadog, which is responsible of formatting and sending the crash report.
There are currently two heuristics implemented to try to figure out if we caused the crash:
InvalidProgramException
,VerificationException
,MissingMethodException
,BadImageFormatException
), or if the exception is in theDatadog.*
namespace (for instance for ducktyping exceptions)BlockingMiddleware
or theTaskContinuationGenerator
(because they "stay" in the callstack even if they do nothing)The crash tracker tries to be discreet, and only display information if we think we caused the crash (or if there's a really unexpected error).
We still want customers to be able to generate crash dumps by setting
COMPlus_DbgEnableMiniDump
. If it was set, dd-dotnet callscreatedump
after generating the crash report. BecauseDatadog.Linux.ApiWrapper
will always setCOMPlus_DbgEnableMiniDump
(because otherwise .NET won't call createdump, and so there is noexecve
call to hook), we set theDD_TRACE_CRASH_HANDLER_PASSTHROUGH
environment variable. dd-dotnet uses that information to know ifCOMPlus_DbgEnableMiniDump
was originally set or just added by us, and to know if it must forward the call to createdump.When telemetry is explicitly disabled, we disable the crash report.
Test coverage
Added some tests to check that dd-dotnet is correctly invoked. I will add more tests to validate the report itself.
Other details
Please be very careful about reviewing the changes in
Datadog.Linux.ApiWrapper
. Be nitpicky, let nothing slide. If you're feeling lazy, think the PR is too big, and decide to review only one file, that's the one. It runs in all applications so it's paramount that this part of the code is perfect.Things left before merging: