Disable Cardinality limit / pre-aggregation? #5618

Open
johanstenberg92 opened this issue May 14, 2024 · 13 comments
Labels
documentation (Documentation related), metrics (Metrics signal related), question (Further information is requested)

Comments

@johanstenberg92

What is the question?

Hello,

Is there a way to disable the pre-aggregation and cardinality limit, and let the system that receives the metrics handle the throttling problem?

An alternative would be to heavily overestimate the cardinality limit, but then there's a concern about the initial memory allocation (which could also be better explained in the docs).

Our product has a backing system that can handle the cardinality we need, but we are concerned about hard-coding fixed limits into the apps that report to it, and about the resulting memory consumption, since some of our cardinalities are huge.

The documentation doesn't offer a solution for this scenario; do you have any advice? Thanks

Additional context

No response

johanstenberg92 added the question (Further information is requested) label on May 14, 2024
@cijothomas
Member

is there a way to disable the pre-aggregation and cardinality limit

No. A MeasurementProcessor, as a concept, would technically allow one to bypass all in-memory aggregation and export raw measurements directly, but no such thing exists in the spec.

The cardinality limit docs are here; they also describe an experimental feature to reclaim unused points.
https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits

Yes, the part about the upfront memory allocation is not very explicit in the doc, good callout. Feel free to send a PR if you are up for it; otherwise we'll do it.

(Note: one metric point is less than 100 bytes, so even with 100,000 extra metric points, it's only ~10 MB of extra memory. Do your own benchmarks and see if this is acceptable.)

We don't yet have a mechanism to monitor the utilization. Once that lands, it'll be easy to see how much is actually utilized vs. wasted.
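For a rough sense of scale, here is a back-of-envelope version of that estimate as a sketch; the ~100 bytes per MetricPoint is the figure quoted above, and the counts are illustrative assumptions, not measured values:

```csharp
// Back-of-envelope estimate of the upfront allocation discussed above.
// ~100 bytes per MetricPoint is the rough figure quoted in this thread;
// measure on your own runtime before relying on it.
const long bytesPerMetricPoint = 100;    // approximate
const long extraMetricPoints = 100_000;  // extra points per instrument (illustrative)
const long instrumentCount = 1;          // scale by how many instruments you create

long estimatedBytes = bytesPerMetricPoint * extraMetricPoints * instrumentCount;
System.Console.WriteLine(
    $"~{estimatedBytes / (1024.0 * 1024.0):F1} MB of upfront allocation");
// 100 B * 100,000 points * 1 instrument ≈ 9.5 MB (the "~10 MB" above)
```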

cijothomas added the documentation (Documentation related) and metrics (Metrics signal related) labels on May 15, 2024
@johanstenberg92
Author

Thank you for your response. Just to expand:

I previously used Datadog's "metrics without limits", where you essentially let the apps send whatever they can and then configure, in the central system, which dimensions you care about, without aggregating on the ones you don't. I feel a bit constrained with this solution, and I'm concerned about the burden of maintaining max-cardinality figures in the app and the potential memory risk.

That being said, we'll start experimenting. Thanks again.

@hugo-brito

From what I read in https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits it seems like the cardinality limit is tweakable but uniformly enforced across all metrics.

How would the SDK know about and pre-allocate all the needed objects for all the metrics, if these are unknown at the beginning of the program?

If we are to estimate a worst-case for the most complex metric, which will then dictate the memory allocation for all the other metrics, wouldn't it be more prudent to consider metric-specific cardinality? The "one size fits all" approach feels a bit lacking...

Furthermore, with the current approach we now have to maintain this cardinality limit ourselves: code changes will be needed if your cluster can suddenly fit double or triple the users.

So, in summary, it would be great to either set the cardinality limit per metric and/or emit the metrics raw.

@cijothomas
Member

How would the SDK know about and pre-allocate all the needed objects for all the metrics, if these are unknown at the beginning of the program?

Not at the beginning of the program; rather, whenever an instrument is created, the SDK pre-allocates 2000 MetricPoints for it by default.
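For illustration, a minimal sketch of where that allocation comes into play; the meter and instrument names are made up, and AddConsoleExporter is just a stand-in exporter from the OpenTelemetry.Exporter.Console package:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Minimal sketch: names below are illustrative.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp")
    .AddConsoleExporter()
    .Build();

var meter = new Meter("MyCompany.MyApp");

// Per the comment above, the SDK reserves room for up to 2000 MetricPoints
// (the default cardinality limit) for this instrument when it is created,
// not at program start.
var requests = meter.CreateCounter<long>("http.server.request_count");
requests.Add(1, new KeyValuePair<string, object?>("route", "/home"));
```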

@cijothomas
Member

dictate the memory allocation for all the other metrics, wouldn't it be more prudent to consider metric-specific cardinality? The "one size fits all" approach feels a bit lacking

So, in summary, it would be great to either set the cardinality limit per metric and/or emit the metrics raw.

You are right! The ability to set the cardinality limit per metric is already supported as an experimental feature, available in pre-release builds.

or emit the metrics raw.

This is not something we plan to offer until the spec allows it!
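For illustration, a sketch of what a per-metric limit might look like with that experimental feature. It assumes the MetricStreamConfiguration.CardinalityLimit property from the pre-release builds mentioned above; the exact shape may differ between SDK versions, and the meter/instrument names are made up:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Sketch only: assumes the experimental MetricStreamConfiguration.CardinalityLimit
// property available in pre-release builds; verify against your SDK version.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp")                      // hypothetical meter name
    .AddView(
        instrumentName: "http.server.request_count",  // hypothetical instrument
        new MetricStreamConfiguration
        {
            CardinalityLimit = 50_000,                // higher limit for this metric only
        })
    .AddConsoleExporter()
    .Build();
```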

@hugo-brito

With the current implementation, shouldn't there at least be a mechanism for us to know whether metrics are being dropped silently (due to a too-low cardinality limit)?

@cijothomas
Member

With the current implementation, shouldn't there at least be a mechanism for us to know whether metrics are being dropped silently (due to a too-low cardinality limit)?

There is an internal log emitted when the limit is hit for the first time; that is the current state. (It is not ideal. The overflow attribute will go a long way toward making this experience smoother, and once we expose a utilization metric, things will be much better than today.)
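For anyone who wants to surface that internal log today, here is a rough sketch using a plain .NET EventListener. It assumes the SDK's internal EventSource is named "OpenTelemetry-Sdk" (as used by its self-diagnostics); verify against the troubleshooting docs for your version:

```csharp
using System;
using System.Diagnostics.Tracing;

// Rough sketch of surfacing the SDK's internal warnings (e.g. the
// "cardinality limit hit" message mentioned above) in-process.
// Instantiate it once near startup:
//     using var listener = new OtelSdkEventListener();
internal sealed class OtelSdkEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "OpenTelemetry-Sdk")
        {
            this.EnableEvents(eventSource, EventLevel.Warning);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null
            ? string.Empty
            : string.Join(", ", eventData.Payload);
        Console.WriteLine($"[{eventData.Level}] {eventData.EventName}: {payload}");
    }
}
```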

@hugo-brito

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

@cijothomas
Member

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

#3880 This is the tracking issue! There were a few attempts in the past, but nothing got shipped. If you are passionate about this space, consider contributing and we can guide you through the process!
The linked issue can point you to the previous PRs attempting this, to see if you can pick it up.

@reyang
Member

reyang commented May 21, 2024

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

@hugo-brito https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits has captured some useful links. Note that there are lots of moving pieces, and the specification is still Experimental: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#cardinality-limits.

@okonaraddi-msft

(Note: one metric point is less than 100 bytes, so even with 100,000 extra metric points, it's only ~10 MB of extra memory. Do your own benchmarks and see if this is acceptable.)

Is there more info on where the 100-byte figure comes from?

I'm wondering if a metric point could be >100 bytes. What if there were many large key-value pairs (say 50 keys, each with a 50-character name and a 50-character string value) stored in the MetricPoint's Tags?

@cijothomas
Member

Is there more info on where the 100-byte figure comes from?

It's the size of the MetricPoint struct. (Of course, the data it points to could be much larger, since that depends on the size of the keys/values etc., but the MetricPoint struct itself is a fixed size.)
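If you want to sanity-check that figure on your own runtime, a quick sketch (MetricPoint is a public struct in OpenTelemetry.Metrics; Unsafe.SizeOf may require the System.Runtime.CompilerServices.Unsafe package on older target frameworks). Note it reports only the struct itself, not what its internal references point to:

```csharp
using System;
using System.Runtime.CompilerServices;
using OpenTelemetry.Metrics;

// Prints the size of the MetricPoint struct itself on the current runtime.
// This does not include the objects its internal references point to
// (tag keys/values, histogram buckets, and so on).
Console.WriteLine($"sizeof(MetricPoint) = {Unsafe.SizeOf<MetricPoint>()} bytes");
```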

@clupo

clupo commented Sep 17, 2024

@hugo-brito

Is there any guidance on how to expose such a metric? That way we could at least know if we're lowballing the max cardinality.

It appears there's an environment variable, OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE, that you can flip on to help with that:

https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#cardinality-limits

In my testing, the tag that shows up on the offending metrics is otel.metric.overflow: true.
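A small self-contained sketch of reproducing that observation, using the environment variable above and the default limit of 2000 mentioned earlier in the thread; the meter and instrument names are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Normally you'd set the variable in the process environment before startup;
// setting it in code here just keeps the sketch self-contained.
Environment.SetEnvironmentVariable(
    "OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE", "true");

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("OverflowDemo")          // illustrative meter name
    .AddConsoleExporter()              // from OpenTelemetry.Exporter.Console
    .Build();

var meter = new Meter("OverflowDemo");
var counter = meter.CreateCounter<long>("demo.hits");

// Emit more distinct tag combinations than the default limit of 2000, so the
// extras should be folded into a single series tagged otel.metric.overflow = true.
for (var i = 0; i < 3000; i++)
{
    counter.Add(1, new KeyValuePair<string, object?>("user.id", i));
}

meterProvider.ForceFlush();
```

With the console exporter, look for the series carrying the otel.metric.overflow tag in the output.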
