OpenTelemetry .NET Metrics

Best Practices

The following tutorials have demonstrated the best practices for using metrics with OpenTelemetry .NET:

Package Version

✔️ You should always use the System.Diagnostics.Metrics APIs from the latest stable version of System.Diagnostics.DiagnosticSource package, regardless of the .NET runtime version being used:

  • If you are using the latest stable version of OpenTelemetry .NET SDK, you do not have to worry about the version of System.Diagnostics.DiagnosticSource package because it is already taken care of for you via package dependency.
  • The .NET runtime team holds a high bar for backward compatibility on System.Diagnostics.DiagnosticSource, even during major version bumps, so compatibility is not a concern here.

Metrics API

Meter

🛑 You should avoid creating System.Diagnostics.Metrics.Meter too frequently. Meter is fairly expensive and meant to be reused throughout the application. For most applications, it can be modeled as a static readonly field (e.g. Program.cs) or as a singleton via dependency injection (e.g. Instrumentation.cs).

✔️ You should use dot-separated UpperCamelCase as the Meter.Name. In many cases, using the fully qualified class name might be a good option.

static readonly Meter MyMeter = new("MyCompany.MyProduct.MyLibrary", "1.0");

Instruments

✔️ You should understand and pick the right instrument type.

Note

.NET runtime has provided several instrument types based on the OpenTelemetry Specification. Picking the right instrument type for your use case is crucial to ensure the correct semantics and performance. Check the Instrument Selection section from the supplementary guidelines for more information.

| OpenTelemetry Specification | .NET Instrument Type |
| --- | --- |
| Asynchronous Counter | `ObservableCounter<T>` |
| Asynchronous Gauge | `ObservableGauge<T>` |
| Asynchronous UpDownCounter | `ObservableUpDownCounter<T>` |
| Counter | `Counter<T>` |
| Gauge (experimental) | N/A |
| Histogram | `Histogram<T>` |
| UpDownCounter | `UpDownCounter<T>` |

🛑 You should avoid creating instruments (e.g. Counter<T>) too frequently. Instruments are fairly expensive and meant to be reused throughout the application. For most applications, instruments can be modeled as static readonly fields (e.g. Program.cs) or as singletons via dependency injection (e.g. Instrumentation.cs).
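As a minimal sketch of the reuse guidance above, the following console program creates one Meter and one Counter<long> as static readonly fields and only records measurements on the hot path. The `FruitStore` class, the meter name, and the instrument name are hypothetical. A `MeterListener` from System.Diagnostics.Metrics stands in for the OpenTelemetry SDK here, purely so that the sample runs without extra packages.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class FruitStore
{
    // Created once per process and reused. The names are illustrative only.
    private static readonly Meter MyMeter = new("MyCompany.MyProduct.MyLibrary", "1.0");
    private static readonly Counter<long> FruitsSold = MyMeter.CreateCounter<long>("MyFruitCounter");

    // Hot path: records a measurement, never creates a Meter or an instrument.
    public static void Sell(string name, string color, long quantity) =>
        FruitsSold.Add(quantity, new("name", name), new("color", color));

    // Records two measurements and returns the total observed by a listener.
    public static long RunDemo()
    {
        long total = 0;

        // The listener plays the role of the SDK in this sketch.
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == MyMeter.Name)
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) => total += value);
        listener.Start();

        Sell("apple", "red", 2);
        Sell("lemon", "yellow", 3);
        return total;
    }

    public static void Main() => Console.WriteLine(RunDemo()); // prints 5
}
```

In an application that uses dependency injection, the same two fields would typically live in a singleton class instead of a static one.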

🛑 You should avoid invalid instrument names.

Note

OpenTelemetry will not collect metrics from instruments that are using invalid names. Refer to the OpenTelemetry Specification for the valid syntax.

🛑 You should avoid changing the order of tags while reporting measurements.

Warning

The last line of code performs poorly because its tags are not in the same order as in the preceding calls:

counter.Add(2, new("name", "apple"), new("color", "red"));
counter.Add(3, new("name", "lime"), new("color", "green"));
counter.Add(5, new("name", "lemon"), new("color", "yellow"));
counter.Add(8, new("color", "yellow"), new("name", "lemon")); // bad perf

✔️ You should use TagList properly to achieve the best performance.

There are two different ways of passing tags to an instrument API:

  • Pass the tags directly to the instrument API:

    counter.Add(100, new("Key1", "Value1"), new("Key2", "Value2"));
  • Use TagList:

    var tags = new TagList
    {
        { "DimName1", "DimValue1" },
        { "DimName2", "DimValue2" },
        { "DimName3", "DimValue3" },
        { "DimName4", "DimValue4" },
    };
    
    counter.Add(100, tags);

Here is the rule of thumb:

  • When reporting measurements with 3 tags or fewer, pass the tags directly to the instrument API.
  • When reporting measurements with 4 to 8 tags (inclusive), use TagList to avoid heap allocation if avoiding GC pressure is a primary performance goal. For high-performance code that considers reducing CPU utilization (e.g. to reduce latency or to save battery) more important than optimizing memory allocations, use a profiler and stress tests to determine which approach is better. Here are some metrics benchmark results for reference.
  • When reporting measurements with more than 8 tags, the two approaches have very similar CPU performance and heap allocation. TagList is recommended due to its better readability and maintainability.

Note

When reporting measurements with more than 8 tags, the API allocates memory on the hot code path. You SHOULD try to keep the number of tags less than or equal to 8. If you are exceeding this, check if you can model some of the tags as Resource, as shown here.

MeterProvider Management

🛑 You should avoid creating MeterProvider instances too frequently. MeterProvider is fairly expensive and meant to be reused throughout the application. For most applications, one MeterProvider instance per process is sufficient.

graph LR

subgraph Meter A
  InstrumentX
end

subgraph Meter B
  InstrumentY
  InstrumentZ
end

subgraph Meter Provider 2
  MetricReader2
  MetricExporter2
  MetricReader3
  MetricExporter3
end

subgraph Meter Provider 1
  MetricReader1
  MetricExporter1
end

InstrumentX --> | Measurements | MetricReader1
InstrumentY --> | Measurements | MetricReader1 --> MetricExporter1
InstrumentZ --> | Measurements | MetricReader2 --> MetricExporter2
InstrumentZ --> | Measurements | MetricReader3 --> MetricExporter3

✔️ You should properly manage the lifecycle of MeterProvider instances if they are created by you.

Here is the rule of thumb when managing the lifecycle of MeterProvider:

Memory Management

In OpenTelemetry, measurements are reported via the metrics API. The SDK aggregates metrics using certain algorithms and memory management strategies to achieve good performance and efficiency. Here are the rules which OpenTelemetry .NET follows while implementing the metrics aggregation logic:

  1. Pre-Aggregation: aggregation occurs within the SDK.
  2. Cardinality Limits: the aggregation logic respects cardinality limits, so the SDK does not use an unbounded amount of memory when there is a cardinality explosion.
  3. Memory Preallocation: the memory used by aggregation logic is allocated during the SDK initialization, so the SDK does not have to allocate memory on-the-fly. This is to avoid garbage collection being triggered on the hot code path.

Example

Let us take the following example:

  • During the time range (T0, T1]:
    • value = 1, name = apple, color = red
    • value = 2, name = lemon, color = yellow
  • During the time range (T1, T2]:
    • no fruit has been received
  • During the time range (T2, T3]:
    • value = 5, name = apple, color = red
    • value = 2, name = apple, color = green
    • value = 4, name = lemon, color = yellow
    • value = 2, name = lemon, color = yellow
    • value = 1, name = lemon, color = yellow
    • value = 3, name = lemon, color = yellow

If we aggregate and export the metrics using Cumulative Aggregation Temporality:

  • (T0, T1]
    • attributes: {name = apple, color = red}, count: 1
    • attributes: {name = lemon, color = yellow}, count: 2
  • (T0, T2]
    • attributes: {name = apple, color = red}, count: 1
    • attributes: {name = lemon, color = yellow}, count: 2
  • (T0, T3]
    • attributes: {name = apple, color = red}, count: 6
    • attributes: {name = apple, color = green}, count: 2
    • attributes: {name = lemon, color = yellow}, count: 12

If we aggregate and export the metrics using Delta Aggregation Temporality:

  • (T0, T1]
    • attributes: {name = apple, color = red}, count: 1
    • attributes: {name = lemon, color = yellow}, count: 2
  • (T1, T2]
    • nothing, since no measurement was received
  • (T2, T3]
    • attributes: {name = apple, color = red}, count: 5
    • attributes: {name = apple, color = green}, count: 2
    • attributes: {name = lemon, color = yellow}, count: 10
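The two temporalities can be reproduced with a few lines of plain C#. The sketch below is not how the SDK is implemented. It only replays the fruit measurements from the example and sums them per attribute set, either since T0 (cumulative) or within a single interval (delta). The `FruitAggregation` class and its members are hypothetical names, and intervals 1, 2 and 3 stand for (T0, T1], (T1, T2] and (T2, T3].

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FruitAggregation
{
    // One measurement: the interval it was reported in, its value and its tags.
    public record Measurement(int Interval, long Value, string Name, string Color);

    public static readonly Measurement[] Measurements =
    {
        new(1, 1, "apple", "red"),
        new(1, 2, "lemon", "yellow"),
        new(3, 5, "apple", "red"),
        new(3, 2, "apple", "green"),
        new(3, 4, "lemon", "yellow"),
        new(3, 2, "lemon", "yellow"),
        new(3, 1, "lemon", "yellow"),
        new(3, 3, "lemon", "yellow"),
    };

    // Sums the values per (name, color) for all intervals in [first, last].
    public static Dictionary<(string Name, string Color), long> Sum(int first, int last) =>
        Measurements
            .Where(m => m.Interval >= first && m.Interval <= last)
            .GroupBy(m => (m.Name, m.Color))
            .ToDictionary(g => g.Key, g => g.Sum(m => m.Value));

    // Cumulative: everything reported since T0 up to the end of the interval.
    public static Dictionary<(string Name, string Color), long> Cumulative(int interval) =>
        Sum(1, interval);

    // Delta: only what was reported inside the interval itself.
    public static Dictionary<(string Name, string Color), long> Delta(int interval) =>
        Sum(interval, interval);

    public static void Main()
    {
        for (int i = 1; i <= 3; i++)
        {
            Console.WriteLine($"Interval {i}");
            foreach (var (key, sum) in Cumulative(i))
                Console.WriteLine($"  cumulative {key.Name}/{key.Color}: {sum}");
            foreach (var (key, sum) in Delta(i))
                Console.WriteLine($"  delta      {key.Name}/{key.Color}: {sum}");
        }
    }
}
```

For the last interval, the cumulative sum for lemon/yellow is 12 while the delta sum is 10, matching the lists above.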

Pre-Aggregation

Taking the fruit example, there are 6 measurements reported during (T2, T3]. Instead of exporting every individual measurement event, the SDK aggregates them and only exports the summarized results. This approach, as illustrated in the following diagram, is called pre-aggregation:

graph LR

subgraph SDK
  Instrument --> | Measurements | Pre-Aggregation[Pre-Aggregation]
end

subgraph Collector
  Aggregation
end

Pre-Aggregation --> | Metrics | Aggregation

Pre-aggregation brings several benefits:

  1. Although the amount of calculation remains the same, the amount of data transmitted can be significantly reduced using pre-aggregation, thus improving the overall efficiency.
  2. Pre-aggregation makes it possible to apply cardinality limits during SDK initialization. Combined with memory preallocation, this makes the metrics data collection behavior more predictable (e.g. a server under denial-of-service attack would still produce a constant volume of metrics data, rather than flooding the observability system with a large volume of measurement events).

There are cases where users might want to export raw measurement events instead of using pre-aggregation, as illustrated in the following diagram. OpenTelemetry does not support this scenario at the moment. If you are interested, please join the discussion by replying to this feature ask.

graph LR

subgraph SDK
  Instrument
end

subgraph Collector
  Aggregation
end

Instrument --> | Measurements | Aggregation

Cardinality Limits

The number of unique combinations of attributes is called cardinality. Taking the fruit example, if we know that we can only have apple/lemon as the name, red/yellow/green as the color, then we can say the cardinality is 6. No matter how many apples and lemons we have, we can always use the following table to summarize the total number of fruits based on the name and color.

| Name | Color | Count |
| --- | --- | --- |
| apple | red | 6 |
| apple | yellow | 0 |
| apple | green | 2 |
| lemon | red | 0 |
| lemon | yellow | 12 |
| lemon | green | 0 |

In other words, we know how much storage and network are needed to collect and transmit these metrics, regardless of the traffic pattern.

In real world applications, the cardinality can be extremely high. Imagine if we have a long running service and we collect metrics with 7 attributes and each attribute can have 30 different values. We might eventually end up having to remember the complete set of all 21,870,000,000 combinations! This cardinality explosion is a well-known challenge in the metrics space. For example, it can cause surprisingly high costs in the observability system, or even be leveraged by hackers to launch a denial-of-service attack.
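The worst-case cardinality is simply the product of the number of possible values of each attribute. A small sketch of that arithmetic, using a hypothetical `Cardinality.WorstCase` helper, covers both the fruit example and the 7-attribute case:

```csharp
using System;
using System.Linq;

public static class Cardinality
{
    // Worst-case number of distinct attribute combinations: the product of
    // the number of possible values of each attribute.
    public static long WorstCase(params int[] valuesPerAttribute) =>
        valuesPerAttribute.Aggregate(1L, (product, n) => checked(product * n));

    public static void Main()
    {
        // Fruit example: 2 names x 3 colors.
        Console.WriteLine(WorstCase(2, 3)); // prints 6

        // 7 attributes with 30 possible values each.
        Console.WriteLine(WorstCase(Enumerable.Repeat(30, 7).ToArray())); // prints 21870000000
    }
}
```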

Cardinality limit is a throttling mechanism that allows the metrics collection system to behave predictably and reliably when excessive cardinality occurs, whether it is due to a malicious attack or a developer's coding mistake.

OpenTelemetry has a default cardinality limit of 2000 per metric. This limit can be configured at the individual metric level using the View API and the MetricStreamConfiguration.CardinalityLimit setting. Refer to this doc for more information.
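A per-metric limit might be configured along the following lines. This is a sketch that assumes the OpenTelemetry and OpenTelemetry.Exporter.Console NuGet packages are referenced. The meter name, the instrument name and the limit of 10 are illustrative, and depending on the SDK version the `CardinalityLimit` property may only be available as an experimental API.

```csharp
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Hypothetical meter and instrument names, used for illustration only.
var meter = new Meter("MyCompany.MyProduct.MyLibrary", "1.0");
var counter = meter.CreateCounter<long>("MyFruitCounter");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter(meter.Name)
    // The view overrides the default limit for this one metric only.
    .AddView(
        instrumentName: "MyFruitCounter",
        new MetricStreamConfiguration { CardinalityLimit = 10 })
    .AddConsoleExporter()
    .Build();

counter.Add(1, new("name", "apple"), new("color", "red"));
```

Metrics that are not matched by a view keep the default limit.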

Given a metric, once the cardinality limit is reached, any new measurement which cannot be independently aggregated because of the limit will be dropped or aggregated using the overflow attribute (if enabled). When NOT using the overflow attribute feature, a warning is written to the self-diagnostic log the first time an overflow is detected for a given metric.

Note

Overflow attribute was introduced in OpenTelemetry .NET 1.6.0-rc.1. It is currently an experimental feature which can be turned on by setting the environment variable OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE=true. Once the OpenTelemetry Specification becomes stable, this feature will be turned on by default.

When Delta Aggregation Temporality is used, it is possible to choose a smaller cardinality limit by allowing the SDK to reclaim unused metric points.

Note

The reclaim unused metric points feature was introduced in OpenTelemetry .NET 1.7.0-alpha.1. It is currently an experimental feature which can be turned on by setting the environment variable OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS=true. Once the OpenTelemetry Specification becomes stable, this feature will be turned on by default.

Memory Preallocation

OpenTelemetry .NET SDK aims to avoid memory allocation on the hot code path. When this is combined with proper use of Metrics API, heap allocation can be avoided on the hot code path. Refer to the metrics benchmark results to learn more.

✔️ You should measure memory allocation on the hot code path, and ideally avoid any heap allocation while using the metrics API and SDK, especially when you use metrics to measure the performance of your application (for example, you do not want to spend 2 seconds doing garbage collection while measuring an operation which normally takes 10 milliseconds).

Metrics Correlation

In OpenTelemetry, metrics can be correlated to traces via exemplars. Check the Exemplars tutorial to learn more.

Metrics Enrichment

When metrics are collected, they are normally stored in a time series database. From a storage and consumption perspective, metrics can be multi-dimensional. Taking the fruit example, there are two dimensions: "name" and "color". For basic scenarios, all the dimensions can be reported during the Metrics API invocation. For less trivial scenarios, however, the dimensions can come from different sources:

Note

Instrument level tags support is not yet implemented in OpenTelemetry .NET since the OpenTelemetry Specification does not support it.

Here is the rule of thumb when modeling the dimensions:

  • If the dimension is static throughout the process lifetime (e.g. the name of the machine, data center):
    • If the dimension applies to all metrics, model it as Resource, or even better, let the collector add these dimensions if feasible (e.g. a collector running in the same data center should know the name of the data center, rather than relying on / trusting each service instance to report the data center name).
    • If the dimension applies to a subset of metrics (e.g. the version of a client library), model it as meter level tags.
  • If the dimension value is dynamic, report it via the Metrics API.
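The difference between a meter level tag and a dynamic tag can be sketched with the .NET API alone. The example assumes .NET 8 or a System.Diagnostics.DiagnosticSource 8.0+ package, where the Meter constructor accepts tags. The `EnrichmentDemo` class, the meter and instrument names, and the tag keys and values are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class EnrichmentDemo
{
    // Static dimension shared by every instrument of this meter,
    // modeled as a meter level tag.
    public static readonly Meter ClientMeter = new(
        "MyCompany.MyProduct.MyClientLibrary",
        "1.0",
        new KeyValuePair<string, object?>[] { new("client.library.version", "2.3.1") });

    private static readonly Counter<long> Requests =
        ClientMeter.CreateCounter<long>("MyRequestCounter");

    // Dynamic dimension: it changes per call, so it travels with the measurement.
    public static void RecordRequest(string statusCode) =>
        Requests.Add(1, new KeyValuePair<string, object?>("status_code", statusCode));

    public static void Main()
    {
        RecordRequest("200");

        // Print the static tags attached to the meter itself.
        foreach (var tag in ClientMeter.Tags ?? Array.Empty<KeyValuePair<string, object?>>())
        {
            Console.WriteLine($"{tag.Key} = {tag.Value}");
        }
    }
}
```

Dimensions that apply to all metrics would instead be attached as Resource attributes when the MeterProvider is configured, or added by the collector.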

Note

There were discussions around adding a new concept called MeasurementProcessor, which allows dimensions to be added to / removed from measurements dynamically. This idea did not gain traction due to the complexity and performance implications. Refer to this pull request for more context.

Common issues that lead to missing metrics

  • The Meter used to create the instruments is not added to the MeterProvider. Use the AddMeter method to enable processing for the required metrics.
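A minimal console sketch of the fix, assuming the OpenTelemetry and OpenTelemetry.Exporter.Console NuGet packages are referenced (the meter and instrument names are hypothetical):

```csharp
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Hypothetical meter and instrument names, used for illustration only.
var meter = new Meter("MyCompany.MyProduct.MyLibrary", "1.0");
var counter = meter.CreateCounter<long>("MyFruitCounter");

// One provider for the whole process. Disposing it at shutdown flushes
// the remaining metrics.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    // Without this call, the SDK ignores every instrument created from the meter.
    .AddMeter("MyCompany.MyProduct.MyLibrary")
    .AddConsoleExporter()
    .Build();

counter.Add(1, new("name", "apple"), new("color", "red"));
```

The string passed to AddMeter has to match the name given to the Meter constructor.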