[Exception Replay] Normalized exception hashing for more fine-grained aggregation #5872

Merged (1 commit) on Aug 13, 2024

@GreenMatan GreenMatan commented Aug 10, 2024

Summary of changes

In Exception Replay, an exception can be in one of several phases. Two of these phases are Done and Invalidated.

  • Done: The exception has already been captured and is waiting to be woken up in the next epoch.
  • Invalidated: None of the frames could be captured; tags are reported to help explain why, for troubleshooting purposes.

To detect those phases as quickly as possible, a cache lookup is performed before any intensive calculation on the System.Exception object itself, since this execution path is hot and runs as part of unwinding a request with an exception. The lookup key used to be simply an Fnv1a hash of exception.ToString() for the exception reaching the service root span.
Basing the hash on exception.ToString() led to scenarios where two identical exceptions seemed different because of one of several factors: non-deterministic participating frames, exception messages, PDB info (file path + line number), etc.
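
To illustrate the problem, here is a minimal sketch (not the tracer's code) that hashes two renderings of the same logical failure with a textbook 32-bit FNV-1a. The class name Fnv1aSketch, the sample stack traces, and the choice to hash UTF-8 bytes are assumptions made for this example; the tracer's own Fnv1a helper may differ.

```csharp
using System;
using System.Text;

// Two renderings of what is logically the same failure: only the message
// (a user id) and the PDB line number differ. Both strings are made up.
string first = "System.InvalidOperationException: User 17 not found\n" +
               "   at Shop.Users.Load(Int32 id) in C:\\src\\Users.cs:line 42";
string second = "System.InvalidOperationException: User 98 not found\n" +
                "   at Shop.Users.Load(Int32 id) in C:\\src\\Users.cs:line 45";

// Every byte of the text feeds the hash, so the differing message and line
// number are expected to produce two different cache keys.
Console.WriteLine(Fnv1aSketch.Hash32(first) == Fnv1aSketch.Hash32(second));

// Hypothetical helper: standard 32-bit FNV-1a over the UTF-8 bytes of a string.
static class Fnv1aSketch
{
    private const uint OffsetBasis = 2166136261;
    private const uint Prime = 16777619;

    public static uint Hash32(string text)
    {
        uint hash = OffsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;                        // FNV-1a: xor first...
            hash = unchecked(hash * Prime);   // ...then multiply, wrapping at 32 bits
        }

        return hash;
    }
}
```

With keys like these, the second occurrence misses the cache even though the first one already reached Done or Invalidated.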

To quickly determine which phase an exception is in without performing costly computations, a new way of hashing is required, one that cleanses the exception so that two similar exceptions fall into the same case even though their stack traces are not identical. The new algorithm should also be as performant as possible, with as few temporary allocations as possible, because it runs every time a service root span is finalized with an exception.
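
The fast path described above can be pictured with the following sketch. Everything here is hypothetical: the PhaseCacheSketch class, the InProgress phase, and the use of ConcurrentDictionary are assumptions for illustration only; just the Done and Invalidated names come from this PR.

```csharp
using System;
using System.Collections.Concurrent;

var cache = new PhaseCacheSketch();
uint key = 0xE40C292C; // stand-in for a normalized exception hash

Console.WriteLine(cache.CanSkip(key)); // False: unknown exception, do the full work
cache.Mark(key, Phase.Done);
Console.WriteLine(cache.CanSkip(key)); // True: already captured, skip the costly path

// Done and Invalidated are named in the PR; InProgress is a made-up placeholder
// for "any other phase".
enum Phase { InProgress, Done, Invalidated }

sealed class PhaseCacheSketch
{
    private readonly ConcurrentDictionary<uint, Phase> _phases = new();

    public void Mark(uint hash, Phase phase) => _phases[hash] = phase;

    // A single dictionary lookup decides whether the expensive inspection of the
    // System.Exception object can be skipped for this occurrence.
    public bool CanSkip(uint hash) =>
        _phases.TryGetValue(hash, out Phase phase) &&
        (phase == Phase.Done || phase == Phase.Invalidated);
}
```

The lookup is only as good as its key: if two occurrences of the same failure produce different hashes, the second one never hits the cache.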

Reason for change

Improve the experience of Exception Replay in cases where we failed to report an exception because we could not determine whether it was in the Done / Invalidated phases, as a result of its previous occurrence looking slightly different.

Implementation details

A new class, ExceptionNormalizer, has been added. It takes as input the string representation of the exception alongside its outermost exception type and one level of inner exception. It cleanses the exception of the aforementioned attributes and computes a more fine-grained hash that should have a better distribution based on the actual exception, leaving out all the non-relevant bits that might differ.
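
As a rough illustration of the idea (not the actual ExceptionNormalizer), the sketch below hashes the exception type names plus the "at ..." frame lines only, dropping messages and any " in <file>:line <n>" suffix. The class name, the method signature, the parsing rules, and the sample traces are all assumptions; it walks the text with ReadOnlySpan<char> to avoid intermediate strings, in the spirit of the low-allocation goal.

```csharp
using System;

const string type = "System.InvalidOperationException";
string first = type + ": User 17 not found\n" +
               "   at Shop.Users.Load(Int32 id) in C:\\src\\Users.cs:line 42\n" +
               "   at Shop.Api.Get(Int32 id) in C:\\src\\Api.cs:line 10";
string second = type + ": User 98 not found\n" +
                "   at Shop.Users.Load(Int32 id) in C:\\src\\Users.cs:line 45\n" +
                "   at Shop.Api.Get(Int32 id) in C:\\src\\Api.cs:line 10";

// Message and line numbers differ, but the normalized frames are identical.
Console.WriteLine(NormalizedHashSketch.Hash(first, type, "") ==
                  NormalizedHashSketch.Hash(second, type, "")); // True

// Hypothetical helper, not the tracer's ExceptionNormalizer.
static class NormalizedHashSketch
{
    private const uint OffsetBasis = 2166136261;
    private const uint Prime = 16777619;

    public static uint Hash(string exceptionText, string outerType, string innerType)
    {
        uint hash = Mix(OffsetBasis, outerType.AsSpan());
        hash = Mix(hash, innerType.AsSpan());

        ReadOnlySpan<char> rest = exceptionText.AsSpan();
        while (!rest.IsEmpty)
        {
            int newline = rest.IndexOf('\n');
            ReadOnlySpan<char> line = newline < 0 ? rest : rest.Slice(0, newline);
            rest = newline < 0 ? ReadOnlySpan<char>.Empty : rest.Slice(newline + 1);

            line = line.Trim();
            if (!line.StartsWith("at ".AsSpan(), StringComparison.Ordinal))
            {
                continue; // skip the "Type: message" line and other non-frame lines
            }

            // Drop the PDB suffix (" in <file>:line <n>") when present.
            int pdb = line.LastIndexOf(" in ".AsSpan());
            if (pdb >= 0 && line.Slice(pdb).IndexOf(":line ".AsSpan()) >= 0)
            {
                line = line.Slice(0, pdb);
            }

            hash = Mix(hash, line);
        }

        return hash;
    }

    // FNV-1a over the two bytes of each UTF-16 code unit, plus a separator byte.
    private static uint Mix(uint hash, ReadOnlySpan<char> text)
    {
        foreach (char c in text)
        {
            hash = unchecked((hash ^ (byte)c) * Prime);
            hash = unchecked((hash ^ (byte)(c >> 8)) * Prime);
        }

        return unchecked((hash ^ (byte)'\n') * Prime);
    }
}
```

This sketch does not handle non-deterministic frames, which the PR also lists as a source of differing hashes.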

Test coverage

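ExceptionNormalizerTests (https://github.com/DataDog/dd-trace-dotnet/blob/821e4860632a8fcb258bbbe74506249cb6865659/tracer/test/Datadog.Trace.Debugger.IntegrationTests/ExceptionNormalizerTests.cs), with approval tests covering the hash and the string representing the cleansed stack trace.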

Other details

Fixes #DEBUG-2674

@GreenMatan GreenMatan requested a review from a team as a code owner August 10, 2024 11:49
@GreenMatan GreenMatan force-pushed the matang/exception-replay-hashing branch from 64bbcab to 821e486 Compare August 10, 2024 11:52
@GreenMatan GreenMatan changed the title from "[Exception Replay] Normalized exception hashing for more fine-grained aggregation + Improved diagnostic capabilities" to "[Exception Replay] Normalized exception hashing for more fine-grained aggregation" Aug 10, 2024

andrewlock commented Aug 10, 2024

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than the latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (74ms)  : 65, 82
     .   : milestone, 74,
    master - mean (73ms)  : 63, 84
     .   : milestone, 73,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (1,072ms)  : 1049, 1094
     .   : milestone, 1072,
    master - mean (1,068ms)  : 1047, 1088
     .   : milestone, 1068,

gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (110ms)  : 105, 115
     .   : milestone, 110,
    master - mean (109ms)  : 105, 114
     .   : milestone, 109,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (749ms)  : 728, 770
     .   : milestone, 749,
    master - mean (748ms)  : 725, 771
     .   : milestone, 748,

gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (93ms)  : 89, 96
     .   : milestone, 93,
    master - mean (92ms)  : 90, 95
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (706ms)  : 688, 725
     .   : milestone, 706,
    master - mean (702ms)  : 685, 719
     .   : milestone, 702,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (192ms)  : 188, 195
     .   : milestone, 192,
    master - mean (193ms)  : 188, 198
     .   : milestone, 193,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (1,167ms)  : 1144, 1190
     .   : milestone, 1167,
    master - mean (1,169ms)  : 1137, 1200
     .   : milestone, 1169,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (277ms)  : 272, 281
     .   : milestone, 277,
    master - mean (276ms)  : 272, 281
     .   : milestone, 276,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (926ms)  : 902, 950
     .   : milestone, 926,
    master - mean (918ms)  : 897, 939
     .   : milestone, 918,

gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5872) - mean (266ms)  : 261, 270
     .   : milestone, 266,
    master - mean (266ms)  : 262, 270
     .   : milestone, 266,

    section CallTarget+Inlining+NGEN
    This PR (5872) - mean (906ms)  : 883, 928
     .   : milestone, 906,
    master - mean (903ms)  : 884, 921
     .   : milestone, 903,


@GreenMatan GreenMatan force-pushed the matang/exception-replay-hashing branch from 821e486 to 6ae130f Compare August 10, 2024 12:24

datadog-ddstaging bot commented Aug 10, 2024

Datadog Report

Branch report: matang/exception-replay-hashing
Commit report: 9951543
Test service: dd-trace-dotnet

✅ 0 Failed, 354796 Passed, 2258 Skipped, 23h 12m 45.69s Total Time


andrewlock commented Aug 10, 2024

Benchmarks Report for tracer 🐌

Benchmarks for #5872 compared to master:

  • 1 benchmarks are faster, with geometric mean 1.116
  • 3 benchmarks are slower, with geometric mean 1.126
  • 1 benchmarks have more allocations

The following thresholds were used for comparing the benchmark speeds:

  • Mann–Whitney U test with statistical test for significance of 5%
  • Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.
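
As a worked check of the summary above, the geometric mean of the three diff/base ratios listed in the "Slower" tables of this report (1.122, 1.133 and 1.123) can be recomputed as follows. The GeoMeanSketch helper is hypothetical and is not part of the benchmark tooling; the inputs are the rounded ratios as printed.

```csharp
using System;
using System.Globalization;
using System.Linq;

// diff/base ratios of the three benchmarks flagged as slower in this report.
double[] slowerRatios = { 1.122, 1.133, 1.123 };

Console.WriteLine(
    GeoMeanSketch.Of(slowerRatios).ToString("F3", CultureInfo.InvariantCulture)); // 1.126

static class GeoMeanSketch
{
    // Geometric mean computed as exp(mean(log(x))), which avoids multiplying
    // many ratios together directly.
    public static double Of(double[] ratios) =>
        Math.Exp(ratios.Sum(r => Math.Log(r)) / ratios.Length);
}
```

The result matches the 1.126 reported for the slower benchmarks; with a single faster benchmark, the geometric mean is simply its own ratio, 1.116.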

Benchmark details

Benchmarks.Trace.ActivityBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartStopWithChild net6.0 7.78μs 41.8ns 304ns 0.015 0.0075 0 5.43 KB
master StartStopWithChild netcoreapp3.1 9.96μs 53.9ns 310ns 0.0192 0.00962 0 5.62 KB
master StartStopWithChild net472 15.9μs 45ns 168ns 1.02 0.293 0.1 6.06 KB
#5872 StartStopWithChild net6.0 7.57μs 39.2ns 207ns 0.0198 0.0079 0 5.43 KB
#5872 StartStopWithChild netcoreapp3.1 10μs 52.5ns 252ns 0.0244 0.00974 0 5.62 KB
#5872 StartStopWithChild net472 15.9μs 63.9ns 248ns 1.03 0.325 0.0952 6.08 KB
Benchmarks.Trace.AgentWriterBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #5872

Benchmark diff/base Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces‑net6.0 1.122 441,019.33 494,748.88

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net6.0 441μs 379ns 1.47μs 0 0 0 2.7 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 625μs 463ns 1.79μs 0 0 0 2.7 KB
master WriteAndFlushEnrichedTraces net472 851μs 337ns 1.22μs 0.425 0 0 3.3 KB
#5872 WriteAndFlushEnrichedTraces net6.0 495μs 222ns 831ns 0 0 0 2.7 KB
#5872 WriteAndFlushEnrichedTraces netcoreapp3.1 642μs 749ns 2.9μs 0 0 0 2.7 KB
#5872 WriteAndFlushEnrichedTraces net472 849μs 616ns 2.3μs 0.419 0 0 3.3 KB
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendRequest net6.0 197μs 1.15μs 10μs 0.194 0 0 18.45 KB
master SendRequest netcoreapp3.1 221μs 1.24μs 7.93μs 0.208 0 0 20.61 KB
master SendRequest net472 0ns 0ns 0ns 0 0 0 0 b
#5872 SendRequest net6.0 198μs 1.12μs 7.61μs 0.2 0 0 18.45 KB
#5872 SendRequest netcoreapp3.1 223μs 1.24μs 8.16μs 0.21 0 0 20.61 KB
#5872 SendRequest net472 0.00706ns 0.00204ns 0.00792ns 0 0 0 0 b
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark - Same speed ✔️ More allocations ⚠️

More allocations ⚠️ in #5872

Benchmark Base Allocated Diff Allocated Change Change %
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces‑net6.0 41.52 KB 41.77 KB 250 B 0.60%

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net6.0 582μs 2.94μs 13.8μs 0.573 0 0 41.52 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 683μs 3.75μs 22.2μs 0.324 0 0 41.93 KB
master WriteAndFlushEnrichedTraces net472 853μs 4.08μs 16.8μs 8.33 2.5 0.417 53.28 KB
#5872 WriteAndFlushEnrichedTraces net6.0 585μs 2.97μs 14.5μs 0.573 0 0 41.77 KB
#5872 WriteAndFlushEnrichedTraces netcoreapp3.1 726μs 3.62μs 17μs 0.351 0 0 41.73 KB
#5872 WriteAndFlushEnrichedTraces net472 845μs 2.61μs 9.76μs 8.08 2.55 0.425 53.27 KB
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteNonQuery net6.0 1.25μs 1.32ns 4.94ns 0.0143 0 0 1.02 KB
master ExecuteNonQuery netcoreapp3.1 1.76μs 2.79ns 10.8ns 0.0131 0 0 1.02 KB
master ExecuteNonQuery net472 2.01μs 1.23ns 4.28ns 0.156 0 0 987 B
#5872 ExecuteNonQuery net6.0 1.26μs 0.953ns 3.57ns 0.014 0 0 1.02 KB
#5872 ExecuteNonQuery netcoreapp3.1 1.75μs 1.21ns 4.68ns 0.0141 0 0 1.02 KB
#5872 ExecuteNonQuery net472 2.03μs 1.76ns 6.8ns 0.157 0 0 987 B
Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master CallElasticsearch net6.0 1.14μs 0.551ns 2.06ns 0.0137 0 0 976 B
master CallElasticsearch netcoreapp3.1 1.51μs 1.46ns 5.27ns 0.0129 0 0 976 B
master CallElasticsearch net472 2.47μs 2.52ns 9.75ns 0.158 0 0 995 B
master CallElasticsearchAsync net6.0 1.24μs 0.384ns 1.44ns 0.0137 0 0 952 B
master CallElasticsearchAsync netcoreapp3.1 1.64μs 1.24ns 4.64ns 0.014 0 0 1.02 KB
master CallElasticsearchAsync net472 2.69μs 1.66ns 6.42ns 0.167 0 0 1.05 KB
#5872 CallElasticsearch net6.0 1.24μs 0.578ns 2.09ns 0.0137 0 0 976 B
#5872 CallElasticsearch netcoreapp3.1 1.54μs 3.53ns 13.7ns 0.013 0 0 976 B
#5872 CallElasticsearch net472 2.48μs 2.5ns 9.67ns 0.158 0 0 995 B
#5872 CallElasticsearchAsync net6.0 1.29μs 0.517ns 1.93ns 0.0135 0 0 952 B
#5872 CallElasticsearchAsync netcoreapp3.1 1.67μs 0.444ns 1.66ns 0.0141 0 0 1.02 KB
#5872 CallElasticsearchAsync net472 2.59μs 1.36ns 5.07ns 0.166 0 0 1.05 KB
Benchmarks.Trace.GraphQLBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteAsync net6.0 1.24μs 1.12ns 4.33ns 0.0136 0 0 952 B
master ExecuteAsync netcoreapp3.1 1.56μs 0.597ns 2.31ns 0.0125 0 0 952 B
master ExecuteAsync net472 1.77μs 0.499ns 1.87ns 0.145 0 0 915 B
#5872 ExecuteAsync net6.0 1.22μs 3.34ns 12.9ns 0.0134 0 0 952 B
#5872 ExecuteAsync netcoreapp3.1 1.61μs 0.638ns 2.21ns 0.0129 0 0 952 B
#5872 ExecuteAsync net472 1.73μs 0.593ns 2.22ns 0.145 0 0 915 B
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendAsync net6.0 4.17μs 1.97ns 7.65ns 0.0314 0 0 2.22 KB
master SendAsync netcoreapp3.1 5.11μs 1.45ns 5.44ns 0.0359 0 0 2.76 KB
master SendAsync net472 7.86μs 8.74ns 33.8ns 0.497 0 0 3.15 KB
#5872 SendAsync net6.0 4.13μs 1.41ns 5.28ns 0.0308 0 0 2.22 KB
#5872 SendAsync netcoreapp3.1 5.07μs 3.21ns 12ns 0.0354 0 0 2.76 KB
#5872 SendAsync net472 7.82μs 2.31ns 8.96ns 0.5 0 0 3.15 KB
Benchmarks.Trace.ILoggerBenchmark - Faster 🎉 Same allocations ✔️

Faster 🎉 in #5872

Benchmark base/diff Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.ILoggerBenchmark.EnrichedLog‑netcoreapp3.1 1.116 2,345.13 2,101.80

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 1.48μs 1.11ns 4.3ns 0.0226 0 0 1.64 KB
master EnrichedLog netcoreapp3.1 2.34μs 1.33ns 4.97ns 0.0223 0 0 1.64 KB
master EnrichedLog net472 2.74μs 0.723ns 2.71ns 0.249 0 0 1.57 KB
#5872 EnrichedLog net6.0 1.57μs 0.871ns 3.37ns 0.0229 0 0 1.64 KB
#5872 EnrichedLog netcoreapp3.1 2.1μs 0.932ns 3.49ns 0.0223 0 0 1.64 KB
#5872 EnrichedLog net472 2.68μs 1.77ns 6.85ns 0.249 0 0 1.57 KB
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 117μs 235ns 912ns 0.0581 0 0 4.28 KB
master EnrichedLog netcoreapp3.1 120μs 173ns 669ns 0 0 0 4.28 KB
master EnrichedLog net472 150μs 126ns 471ns 0.678 0.226 0 4.46 KB
#5872 EnrichedLog net6.0 116μs 149ns 577ns 0 0 0 4.28 KB
#5872 EnrichedLog netcoreapp3.1 119μs 259ns 1μs 0 0 0 4.28 KB
#5872 EnrichedLog net472 148μs 155ns 602ns 0.66 0.22 0 4.46 KB
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 3.04μs 1.88ns 7.29ns 0.0304 0 0 2.2 KB
master EnrichedLog netcoreapp3.1 4.12μs 1.73ns 6.68ns 0.0296 0 0 2.2 KB
master EnrichedLog net472 4.98μs 1.46ns 5.66ns 0.32 0 0 2.02 KB
#5872 EnrichedLog net6.0 3.07μs 0.761ns 2.95ns 0.0306 0 0 2.2 KB
#5872 EnrichedLog netcoreapp3.1 4.12μs 1.32ns 5.12ns 0.0288 0 0 2.2 KB
#5872 EnrichedLog net472 4.8μs 1.64ns 6.35ns 0.319 0 0 2.02 KB
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendReceive net6.0 1.35μs 0.629ns 2.44ns 0.0161 0 0 1.14 KB
master SendReceive netcoreapp3.1 1.81μs 1.65ns 6.38ns 0.0154 0 0 1.14 KB
master SendReceive net472 2.18μs 1.14ns 4.4ns 0.183 0.00109 0 1.16 KB
#5872 SendReceive net6.0 1.4μs 1.01ns 3.9ns 0.0161 0 0 1.14 KB
#5872 SendReceive netcoreapp3.1 1.66μs 0.722ns 2.7ns 0.015 0 0 1.14 KB
#5872 SendReceive net472 2.15μs 0.526ns 1.82ns 0.183 0.00108 0 1.16 KB
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 2.87μs 0.871ns 3.26ns 0.0215 0 0 1.6 KB
master EnrichedLog netcoreapp3.1 3.83μs 3.39ns 13.1ns 0.0229 0 0 1.65 KB
master EnrichedLog net472 4.42μs 1.99ns 7.44ns 0.323 0 0 2.04 KB
#5872 EnrichedLog net6.0 2.77μs 0.977ns 3.79ns 0.0222 0 0 1.6 KB
#5872 EnrichedLog netcoreapp3.1 3.88μs 1.82ns 7.06ns 0.0212 0 0 1.65 KB
#5872 EnrichedLog net472 4.3μs 1.58ns 5.9ns 0.323 0 0 2.04 KB
Benchmarks.Trace.SpanBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #5872

Benchmark diff/base Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.SpanBenchmark.StartFinishScope‑net6.0 1.133 472.18 535.09
Benchmarks.Trace.SpanBenchmark.StartFinishSpan‑net472 1.123 627.10 704.20

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartFinishSpan net6.0 409ns 0.145ns 0.561ns 0.00811 0 0 576 B
master StartFinishSpan netcoreapp3.1 589ns 0.232ns 0.867ns 0.00767 0 0 576 B
master StartFinishSpan net472 627ns 0.305ns 1.18ns 0.0915 0 0 578 B
master StartFinishScope net6.0 472ns 0.106ns 0.411ns 0.0099 0 0 696 B
master StartFinishScope netcoreapp3.1 740ns 0.351ns 1.36ns 0.00934 0 0 696 B
master StartFinishScope net472 862ns 0.449ns 1.74ns 0.104 0 0 658 B
#5872 StartFinishSpan net6.0 414ns 0.142ns 0.551ns 0.00811 0 0 576 B
#5872 StartFinishSpan netcoreapp3.1 560ns 0.832ns 3.22ns 0.00806 0 0 576 B
#5872 StartFinishSpan net472 704ns 0.303ns 1.17ns 0.0918 0 0 578 B
#5872 StartFinishScope net6.0 535ns 0.149ns 0.557ns 0.00991 0 0 696 B
#5872 StartFinishScope netcoreapp3.1 706ns 0.244ns 0.912ns 0.00947 0 0 696 B
#5872 StartFinishScope net472 821ns 0.266ns 0.996ns 0.104 0 0 658 B
Benchmarks.Trace.TraceAnnotationsBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master RunOnMethodBegin net6.0 673ns 0.338ns 1.31ns 0.00975 0 0 696 B
master RunOnMethodBegin netcoreapp3.1 911ns 0.42ns 1.63ns 0.00915 0 0 696 B
master RunOnMethodBegin net472 1.09μs 0.277ns 1.04ns 0.104 0 0 658 B
#5872 RunOnMethodBegin net6.0 685ns 0.414ns 1.6ns 0.00966 0 0 696 B
#5872 RunOnMethodBegin netcoreapp3.1 969ns 0.693ns 2.59ns 0.00958 0 0 696 B
#5872 RunOnMethodBegin net472 1.12μs 0.212ns 0.82ns 0.104 0 0 658 B

@GreenMatan GreenMatan force-pushed the matang/exception-replay-hashing branch 6 times, most recently from a433b60 to e54200c Compare August 12, 2024 12:08

@dudikeleti dudikeleti left a comment

LGTM.
Left a few comments.

namespace Datadog.Trace.Debugger.ExceptionAutoInstrumentation
{
/// <summary>
/// Important: Should only be used in testing. Not thread-safe.

Can we put this class in Datadog.Trace.Tests.Debugger?

@GreenMatan GreenMatan Aug 12, 2024

Unfortunately, no. It uses ReadOnlySpan, which is woven in as part of Datadog.Trace. If I try to move it elsewhere, I end up facing an error that MemoryExtensions has default implementations. Do you have a workaround?

@GreenMatan GreenMatan force-pushed the matang/exception-replay-hashing branch 10 times, most recently from d1d76e2 to d520dd7 Compare August 13, 2024 10:14
@GreenMatan GreenMatan force-pushed the matang/exception-replay-hashing branch from d520dd7 to 9951543 Compare August 13, 2024 10:21
@GreenMatan GreenMatan merged commit ceb00d0 into master Aug 13, 2024
58 of 65 checks passed
@GreenMatan GreenMatan deleted the matang/exception-replay-hashing branch August 13, 2024 12:36
@github-actions github-actions bot added this to the vNext-v3 milestone Aug 13, 2024
GreenMatan added a commit that referenced this pull request Aug 13, 2024
… aggregation (#5872)
andrewlock pushed a commit that referenced this pull request Aug 13, 2024
…ne-grained aggregation (#5872 -> v2) (#5890)