[Exception Replay] Normalized exception hashing for more fine-grained aggregation #5872

GreenMatan · 2024-08-10T11:49:43Z

Summary of changes

In Exception Replay, an exception can be in one of several phases. Two of these phases are Done and Invalidated.

Done: The exception has already been captured and is waiting to be woken up in the next epoch.
Invalidated: None of the frames could be captured; tags are reported that help explain why, to assist troubleshooting.

To detect these phases as quickly as possible, a cache lookup is performed before any intensive calculation on the System.Exception object itself, since this execution path is hot and runs as part of unwinding a request with an exception. The lookup key used to be simply an Fnv1a hash of exception.ToString() for the exception reaching the service root span.
Basing the hash on exception.ToString() led to scenarios where two identical exceptions seemed different because of one of several factors: non-deterministic participating frames, exception messages, PDB info (file path + line number), etc.
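
To make the problem concrete, here is a minimal sketch, not the tracer's actual code. It assumes a 32-bit FNV-1a computed over the UTF-16 code units of the string (the variant and encoding the tracer really uses may differ), and the two exception strings are hypothetical. They differ only in the message text, yet they are hashed as two unrelated inputs.

```csharp
using System;

static class Fnv1aDemo
{
    // Standard 32-bit FNV-1a parameters.
    private const uint OffsetBasis = 2166136261;
    private const uint Prime = 16777619;

    // Hashes each UTF-16 code unit as two bytes (low byte, then high byte).
    // Assumption: the tracer's real variant and input encoding may differ.
    public static uint Hash(string text)
    {
        uint hash = OffsetBasis;
        foreach (char c in text)
        {
            hash = unchecked((hash ^ (byte)c) * Prime);
            hash = unchecked((hash ^ (byte)(c >> 8)) * Prime);
        }

        return hash;
    }

    public static void Main()
    {
        // Two hypothetical occurrences of the "same" exception:
        // same type and same frame, but a different message.
        string first = "System.InvalidOperationException: Order 17 not found\n" +
                       "   at Shop.Orders.Load(Int32 id) in C:\\src\\Orders.cs:line 42";
        string second = "System.InvalidOperationException: Order 99 not found\n" +
                        "   at Shop.Orders.Load(Int32 id) in C:\\src\\Orders.cs:line 42";

        // The inputs differ, so the two hashes will almost certainly differ,
        // and a cache keyed on them would treat these as separate exceptions.
        Console.WriteLine(Hash(first) == Hash(second));
    }
}
```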

To determine quickly which phase an exception is in without performing costly computations, a new way of hashing is required, one that cleanses exceptions so that two similar exceptions fall into the same case even though their stack traces are not identical. The new algorithm should also be as performant as possible, with as few temporary allocations as possible, because it runs every time a service root span is finalized with an exception.

Reason for change

Improve the experience of Exception Replay in cases where we failed to report an exception because we could not determine whether it was in the Done / Invalidated phase, as a result of its previous occurrence looking slightly different.

Implementation details

A new class, ExceptionNormalizer, has been added. It takes as input the string representation of the exception alongside its outermost exception type and one level of inner exception. It cleanses the exception of the aforementioned attributes and computes a more fine-grained hash that should have a better distribution based on the actual exception, leaving out all the non-relevant bits that might differ.
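
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the ExceptionNormalizer from this PR: the class name NormalizerSketch, the method NormalizeAndHash, and the rules it applies (keep only lines starting with an English "at " prefix, cut everything from " in " onwards) are hypothetical. It feeds the type names and the cleansed frames into a running FNV-1a hash using spans, so no intermediate strings are allocated.

```csharp
using System;

static class NormalizerSketch
{
    private const uint OffsetBasis = 2166136261;
    private const uint Prime = 16777619;

    // Folds a span of chars into a running 32-bit FNV-1a hash.
    private static uint Add(uint hash, ReadOnlySpan<char> text)
    {
        foreach (char c in text)
        {
            hash = unchecked((hash ^ (byte)c) * Prime);
            hash = unchecked((hash ^ (byte)(c >> 8)) * Prime);
        }

        return hash;
    }

    // Hypothetical signature: exception string, outer type name, optional inner type name.
    public static uint NormalizeAndHash(string exceptionString, string outerType, string innerType = "")
    {
        uint hash = Add(OffsetBasis, outerType.AsSpan());
        hash = Add(hash, innerType.AsSpan());

        ReadOnlySpan<char> rest = exceptionString.AsSpan();
        while (!rest.IsEmpty)
        {
            // Take the next line without allocating a substring.
            int newline = rest.IndexOf('\n');
            ReadOnlySpan<char> line = newline < 0 ? rest : rest.Slice(0, newline);
            rest = newline < 0 ? ReadOnlySpan<char>.Empty : rest.Slice(newline + 1);

            line = line.Trim();
            if (!line.StartsWith("at ".AsSpan(), StringComparison.Ordinal))
            {
                continue; // message lines and separators are ignored
            }

            // Drop PDB info such as " in C:\src\File.cs:line 42".
            int fileInfo = line.IndexOf(" in ".AsSpan(), StringComparison.Ordinal);
            if (fileInfo >= 0)
            {
                line = line.Slice(0, fileInfo);
            }

            hash = Add(hash, line);
        }

        return hash;
    }

    public static void Main()
    {
        const string type = "System.InvalidOperationException";
        string first = type + ": Order 17 not found\n   at Shop.Orders.Load(Int32 id) in C:\\src\\Orders.cs:line 42";
        string second = type + ": Order 99 not found\n   at Shop.Orders.Load(Int32 id) in D:\\build\\Orders.cs:line 57";

        // Only the type name and "at Shop.Orders.Load(Int32 id)" are hashed in both cases.
        Console.WriteLine(NormalizeAndHash(first, type) == NormalizeAndHash(second, type)); // prints True
    }
}
```

A real normalizer has more to deal with, for example localized stack trace prefixes, non-deterministic frames, and messages that span several lines, all of which this sketch ignores.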

Test coverage

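ExceptionNormalizerTests (https://github.com/DataDog/dd-trace-dotnet/blob/821e4860632a8fcb258bbbe74506249cb6865659/tracer/test/Datadog.Trace.Debugger.IntegrationTests/ExceptionNormalizerTests.cs) with approvals on the hash and the string representing the cleansed stack trace.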

Other details

Fixes #DEBUG-2674

andrewlock · 2024-08-10T12:19:16Z

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than the latest master results are shown in red. The following thresholds were used for comparing the execution times:

Welch's t-test with a significance level of 5%
Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (74ms)  : 65, 82
     .   : milestone, 74,
    master - mean (73ms)  : 63, 84
     .   : milestone, 73,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (1,072ms)  : 1049, 1094
     .   : milestone, 1072,
    master - mean (1,068ms)  : 1047, 1088
     .   : milestone, 1068,

gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (110ms)  : 105, 115
     .   : milestone, 110,
    master - mean (109ms)  : 105, 114
     .   : milestone, 109,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (749ms)  : 728, 770
     .   : milestone, 749,
    master - mean (748ms)  : 725, 771
     .   : milestone, 748,

gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (93ms)  : 89, 96
     .   : milestone, 93,
    master - mean (92ms)  : 90, 95
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (706ms)  : 688, 725
     .   : milestone, 706,
    master - mean (702ms)  : 685, 719
     .   : milestone, 702,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (192ms)  : 188, 195
     .   : milestone, 192,
    master - mean (193ms)  : 188, 198
     .   : milestone, 193,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (1,167ms)  : 1144, 1190
     .   : milestone, 1167,
    master - mean (1,169ms)  : 1137, 1200
     .   : milestone, 1169,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (277ms)  : 272, 281
     .   : milestone, 277,
    master - mean (276ms)  : 272, 281
     .   : milestone, 276,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (926ms)  : 902, 950
     .   : milestone, 926,
    master - mean (918ms)  : 897, 939
     .   : milestone, 918,

gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (266ms)  : 261, 270
     .   : milestone, 266,
    master - mean (266ms)  : 262, 270
     .   : milestone, 266,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (906ms)  : 883, 928
     .   : milestone, 906,
    master - mean (903ms)  : 884, 921
     .   : milestone, 903,

datadog-ddstaging · 2024-08-10T12:26:25Z

Datadog Report

Branch report: matang/exception-replay-hashing
Commit report: 9951543
Test service: dd-trace-dotnet

✅ 0 Failed, 354796 Passed, 2258 Skipped, 23h 12m 45.69s Total Time

andrewlock · 2024-08-10T13:28:41Z

Benchmarks Report for tracer 🐌

Benchmarks for #5872 compared to master:

1 benchmark is faster, with geometric mean 1.116
3 benchmarks are slower, with geometric mean 1.126
1 benchmark has more allocations

The following thresholds were used for comparing the benchmark speeds:

Mann–Whitney U test with a significance level of 5%
Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.ActivityBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Gen 2	Allocated
master	`StartStopWithChild`	net6.0	7.78μs	41.8ns	304ns	0.015	0.0075	0	5.43 KB
master	`StartStopWithChild`	netcoreapp3.1	9.96μs	53.9ns	310ns	0.0192	0.00962	0	5.62 KB
master	`StartStopWithChild`	net472	15.9μs	45ns	168ns	1.02	0.293	0.1	6.06 KB
#5872	`StartStopWithChild`	net6.0	7.57μs	39.2ns	207ns	0.0198	0.0079	0	5.43 KB
#5872	`StartStopWithChild`	netcoreapp3.1	10μs	52.5ns	252ns	0.0244	0.00974	0	5.62 KB
#5872	`StartStopWithChild`	net472	15.9μs	63.9ns	248ns	1.03	0.325	0.0952	6.08 KB

Benchmarks.Trace.AgentWriterBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #5872

Benchmark	diff/base	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces‑net6.0	1.122	441,019.33	494,748.88

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`WriteAndFlushEnrichedTraces`	net6.0	441μs	379ns	1.47μs	0	2.7 KB
master	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	625μs	463ns	1.79μs	0	2.7 KB
master	`WriteAndFlushEnrichedTraces`	net472	851μs	337ns	1.22μs	0.425	3.3 KB
#5872	`WriteAndFlushEnrichedTraces`	net6.0	495μs	222ns	831ns	0	2.7 KB
#5872	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	642μs	749ns	2.9μs	0	2.7 KB
#5872	`WriteAndFlushEnrichedTraces`	net472	849μs	616ns	2.3μs	0.419	3.3 KB

Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendRequest`	net6.0	197μs	1.15μs	10μs	0.194	18.45 KB
master	`SendRequest`	netcoreapp3.1	221μs	1.24μs	7.93μs	0.208	20.61 KB
master	`SendRequest`	net472	0ns	0ns	0ns	0	0 b
#5872	`SendRequest`	net6.0	198μs	1.12μs	7.61μs	0.2	18.45 KB
#5872	`SendRequest`	netcoreapp3.1	223μs	1.24μs	8.16μs	0.21	20.61 KB
#5872	`SendRequest`	net472	0.00706ns	0.00204ns	0.00792ns	0	0 b

Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark - Same speed ✔️ More allocations ⚠️

More allocations ⚠️ in #5872

Benchmark	Base Allocated	Diff Allocated	Change	Change %
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces‑net6.0	41.52 KB	41.77 KB	250 B	0.60%

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Gen 2	Allocated
master	`WriteAndFlushEnrichedTraces`	net6.0	582μs	2.94μs	13.8μs	0.573	0	0	41.52 KB
master	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	683μs	3.75μs	22.2μs	0.324	0	0	41.93 KB
master	`WriteAndFlushEnrichedTraces`	net472	853μs	4.08μs	16.8μs	8.33	2.5	0.417	53.28 KB
#5872	`WriteAndFlushEnrichedTraces`	net6.0	585μs	2.97μs	14.5μs	0.573	0	0	41.77 KB
#5872	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	726μs	3.62μs	17μs	0.351	0	0	41.73 KB
#5872	`WriteAndFlushEnrichedTraces`	net472	845μs	2.61μs	9.76μs	8.08	2.55	0.425	53.27 KB

Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`ExecuteNonQuery`	net6.0	1.25μs	1.32ns	4.94ns	0.0143	1.02 KB
master	`ExecuteNonQuery`	netcoreapp3.1	1.76μs	2.79ns	10.8ns	0.0131	1.02 KB
master	`ExecuteNonQuery`	net472	2.01μs	1.23ns	4.28ns	0.156	987 B
#5872	`ExecuteNonQuery`	net6.0	1.26μs	0.953ns	3.57ns	0.014	1.02 KB
#5872	`ExecuteNonQuery`	netcoreapp3.1	1.75μs	1.21ns	4.68ns	0.0141	1.02 KB
#5872	`ExecuteNonQuery`	net472	2.03μs	1.76ns	6.8ns	0.157	987 B

Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`CallElasticsearch`	net6.0	1.14μs	0.551ns	2.06ns	0.0137	976 B
master	`CallElasticsearch`	netcoreapp3.1	1.51μs	1.46ns	5.27ns	0.0129	976 B
master	`CallElasticsearch`	net472	2.47μs	2.52ns	9.75ns	0.158	995 B
master	`CallElasticsearchAsync`	net6.0	1.24μs	0.384ns	1.44ns	0.0137	952 B
master	`CallElasticsearchAsync`	netcoreapp3.1	1.64μs	1.24ns	4.64ns	0.014	1.02 KB
master	`CallElasticsearchAsync`	net472	2.69μs	1.66ns	6.42ns	0.167	1.05 KB
#5872	`CallElasticsearch`	net6.0	1.24μs	0.578ns	2.09ns	0.0137	976 B
#5872	`CallElasticsearch`	netcoreapp3.1	1.54μs	3.53ns	13.7ns	0.013	976 B
#5872	`CallElasticsearch`	net472	2.48μs	2.5ns	9.67ns	0.158	995 B
#5872	`CallElasticsearchAsync`	net6.0	1.29μs	0.517ns	1.93ns	0.0135	952 B
#5872	`CallElasticsearchAsync`	netcoreapp3.1	1.67μs	0.444ns	1.66ns	0.0141	1.02 KB
#5872	`CallElasticsearchAsync`	net472	2.59μs	1.36ns	5.07ns	0.166	1.05 KB

Benchmarks.Trace.GraphQLBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`ExecuteAsync`	net6.0	1.24μs	1.12ns	4.33ns	0.0136	952 B
master	`ExecuteAsync`	netcoreapp3.1	1.56μs	0.597ns	2.31ns	0.0125	952 B
master	`ExecuteAsync`	net472	1.77μs	0.499ns	1.87ns	0.145	915 B
#5872	`ExecuteAsync`	net6.0	1.22μs	3.34ns	12.9ns	0.0134	952 B
#5872	`ExecuteAsync`	netcoreapp3.1	1.61μs	0.638ns	2.21ns	0.0129	952 B
#5872	`ExecuteAsync`	net472	1.73μs	0.593ns	2.22ns	0.145	915 B

Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendAsync`	net6.0	4.17μs	1.97ns	7.65ns	0.0314	2.22 KB
master	`SendAsync`	netcoreapp3.1	5.11μs	1.45ns	5.44ns	0.0359	2.76 KB
master	`SendAsync`	net472	7.86μs	8.74ns	33.8ns	0.497	3.15 KB
#5872	`SendAsync`	net6.0	4.13μs	1.41ns	5.28ns	0.0308	2.22 KB
#5872	`SendAsync`	netcoreapp3.1	5.07μs	3.21ns	12ns	0.0354	2.76 KB
#5872	`SendAsync`	net472	7.82μs	2.31ns	8.96ns	0.5	3.15 KB

Benchmarks.Trace.ILoggerBenchmark - Faster 🎉 Same allocations ✔️

Faster 🎉 in #5872

Benchmark	base/diff	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.ILoggerBenchmark.EnrichedLog‑netcoreapp3.1	1.116	2,345.13	2,101.80

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net6.0	1.48μs	1.11ns	4.3ns	0.0226	1.64 KB
master	`EnrichedLog`	netcoreapp3.1	2.34μs	1.33ns	4.97ns	0.0223	1.64 KB
master	`EnrichedLog`	net472	2.74μs	0.723ns	2.71ns	0.249	1.57 KB
#5872	`EnrichedLog`	net6.0	1.57μs	0.871ns	3.37ns	0.0229	1.64 KB
#5872	`EnrichedLog`	netcoreapp3.1	2.1μs	0.932ns	3.49ns	0.0223	1.64 KB
#5872	`EnrichedLog`	net472	2.68μs	1.77ns	6.85ns	0.249	1.57 KB

Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Allocated
master	`EnrichedLog`	net6.0	117μs	235ns	912ns	0.0581	0	4.28 KB
master	`EnrichedLog`	netcoreapp3.1	120μs	173ns	669ns	0	0	4.28 KB
master	`EnrichedLog`	net472	150μs	126ns	471ns	0.678	0.226	4.46 KB
#5872	`EnrichedLog`	net6.0	116μs	149ns	577ns	0	0	4.28 KB
#5872	`EnrichedLog`	netcoreapp3.1	119μs	259ns	1μs	0	0	4.28 KB
#5872	`EnrichedLog`	net472	148μs	155ns	602ns	0.66	0.22	4.46 KB

Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net6.0	3.04μs	1.88ns	7.29ns	0.0304	2.2 KB
master	`EnrichedLog`	netcoreapp3.1	4.12μs	1.73ns	6.68ns	0.0296	2.2 KB
master	`EnrichedLog`	net472	4.98μs	1.46ns	5.66ns	0.32	2.02 KB
#5872	`EnrichedLog`	net6.0	3.07μs	0.761ns	2.95ns	0.0306	2.2 KB
#5872	`EnrichedLog`	netcoreapp3.1	4.12μs	1.32ns	5.12ns	0.0288	2.2 KB
#5872	`EnrichedLog`	net472	4.8μs	1.64ns	6.35ns	0.319	2.02 KB

Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Allocated
master	`SendReceive`	net6.0	1.35μs	0.629ns	2.44ns	0.0161	0	1.14 KB
master	`SendReceive`	netcoreapp3.1	1.81μs	1.65ns	6.38ns	0.0154	0	1.14 KB
master	`SendReceive`	net472	2.18μs	1.14ns	4.4ns	0.183	0.00109	1.16 KB
#5872	`SendReceive`	net6.0	1.4μs	1.01ns	3.9ns	0.0161	0	1.14 KB
#5872	`SendReceive`	netcoreapp3.1	1.66μs	0.722ns	2.7ns	0.015	0	1.14 KB
#5872	`SendReceive`	net472	2.15μs	0.526ns	1.82ns	0.183	0.00108	1.16 KB

Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net6.0	2.87μs	0.871ns	3.26ns	0.0215	1.6 KB
master	`EnrichedLog`	netcoreapp3.1	3.83μs	3.39ns	13.1ns	0.0229	1.65 KB
master	`EnrichedLog`	net472	4.42μs	1.99ns	7.44ns	0.323	2.04 KB
#5872	`EnrichedLog`	net6.0	2.77μs	0.977ns	3.79ns	0.0222	1.6 KB
#5872	`EnrichedLog`	netcoreapp3.1	3.88μs	1.82ns	7.06ns	0.0212	1.65 KB
#5872	`EnrichedLog`	net472	4.3μs	1.58ns	5.9ns	0.323	2.04 KB

Benchmarks.Trace.SpanBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #5872

Benchmark	diff/base	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.SpanBenchmark.StartFinishScope‑net6.0	1.133	472.18	535.09
Benchmarks.Trace.SpanBenchmark.StartFinishSpan‑net472	1.123	627.10	704.20

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`StartFinishSpan`	net6.0	409ns	0.145ns	0.561ns	0.00811	576 B
master	`StartFinishSpan`	netcoreapp3.1	589ns	0.232ns	0.867ns	0.00767	576 B
master	`StartFinishSpan`	net472	627ns	0.305ns	1.18ns	0.0915	578 B
master	`StartFinishScope`	net6.0	472ns	0.106ns	0.411ns	0.0099	696 B
master	`StartFinishScope`	netcoreapp3.1	740ns	0.351ns	1.36ns	0.00934	696 B
master	`StartFinishScope`	net472	862ns	0.449ns	1.74ns	0.104	658 B
#5872	`StartFinishSpan`	net6.0	414ns	0.142ns	0.551ns	0.00811	576 B
#5872	`StartFinishSpan`	netcoreapp3.1	560ns	0.832ns	3.22ns	0.00806	576 B
#5872	`StartFinishSpan`	net472	704ns	0.303ns	1.17ns	0.0918	578 B
#5872	`StartFinishScope`	net6.0	535ns	0.149ns	0.557ns	0.00991	696 B
#5872	`StartFinishScope`	netcoreapp3.1	706ns	0.244ns	0.912ns	0.00947	696 B
#5872	`StartFinishScope`	net472	821ns	0.266ns	0.996ns	0.104	658 B

Benchmarks.Trace.TraceAnnotationsBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`RunOnMethodBegin`	net6.0	673ns	0.338ns	1.31ns	0.00975	696 B
master	`RunOnMethodBegin`	netcoreapp3.1	911ns	0.42ns	1.63ns	0.00915	696 B
master	`RunOnMethodBegin`	net472	1.09μs	0.277ns	1.04ns	0.104	658 B
#5872	`RunOnMethodBegin`	net6.0	685ns	0.414ns	1.6ns	0.00966	696 B
#5872	`RunOnMethodBegin`	netcoreapp3.1	969ns	0.693ns	2.59ns	0.00958	696 B
#5872	`RunOnMethodBegin`	net472	1.12μs	0.212ns	0.82ns	0.104	658 B

dudikeleti

LGTM.
Left a few comments.

dudikeleti · 2024-08-12T12:27:41Z

tracer/src/Datadog.Trace/Debugger/ExceptionAutoInstrumentation/TestExceptionNormalizer.cs

+namespace Datadog.Trace.Debugger.ExceptionAutoInstrumentation
+{
+    /// <summary>
+    /// Important: Should only be used in testing. Not thread-safe.


Can we put this class in Datadog.Trace.Tests.Debugger?

Unfortunately, no. It uses ReadOnlySpan, which is weaved in as part of Datadog.Trace. If I try to move it elsewhere, I end up with an error saying that MemoryExtensions has default implementations. Do you have a workaround?

tracer/src/Datadog.Trace/Debugger/ExceptionAutoInstrumentation/ExceptionNormalizer.cs

… aggregation (#5872)

…ne-grained aggregation (#5872 -> v2) (#5890)

GreenMatan requested a review from a team as a code owner August 10, 2024 11:49

GreenMatan force-pushed the matang/exception-replay-hashing branch from 64bbcab to 821e486 Compare August 10, 2024 11:52

GreenMatan changed the title ~~[Exception Replay] Normalized exception hashing for more fine-grained aggregation + Improved diagnostic capabilities~~ [Exception Replay] Normalized exception hashing for more fine-grained aggregation Aug 10, 2024

GreenMatan force-pushed the matang/exception-replay-hashing branch from 821e486 to 6ae130f Compare August 10, 2024 12:24

GreenMatan force-pushed the matang/exception-replay-hashing branch 6 times, most recently from a433b60 to e54200c Compare August 12, 2024 12:08

dudikeleti approved these changes Aug 12, 2024

GreenMatan force-pushed the matang/exception-replay-hashing branch 10 times, most recently from d1d76e2 to d520dd7 Compare August 13, 2024 10:14

Hashing exceptions with normalization + improved diagnostic capabilities

9951543

GreenMatan force-pushed the matang/exception-replay-hashing branch from d520dd7 to 9951543 Compare August 13, 2024 10:21

GreenMatan merged commit ceb00d0 into master Aug 13, 2024
58 of 65 checks passed

GreenMatan deleted the matang/exception-replay-hashing branch August 13, 2024 12:36

github-actions bot added this to the vNext-v3 milestone Aug 13, 2024

andrewlock added the area:debugger label Aug 14, 2024
