Support for synchronous Gauge and an UpDownCounter that can be Set() #2318
Comments
Hey @jmacd, I appreciate your effort on this matter. I was wondering if there was any progress regarding this issue? I'll try to provide some additional context. We are using the OpenTelemetry Go SDK to instrument our services. Our current workaround is to use a custom in-memory map (to achieve last-value aggregation) and an asynchronous gauge to report those map values, then flush the map every collection interval. Having to implement a custom data structure just to support a simple metric type (which exists in practically every other metrics stack) is both cumbersome and not intuitive at all. I think the argument that not implementing this improves monitoring efficiency is somewhat inconsistent with the overall approach of OpenTelemetry. Is there anything else that can be done to make this happen?
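The workaround described in this comment can be sketched roughly as follows. This is a hypothetical illustration in plain Python (the class and method names are invented, not the OpenTelemetry API): a thread-safe map keeps only the most recent value per attribute set, and an asynchronous gauge callback drains it on each collection interval.

```python
import threading


class LastValueMap:
    """Hypothetical sketch of the workaround above: a thread-safe map that
    keeps only the last value per attribute set, drained by an async gauge
    callback on each collection."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}  # attribute tuple -> (attributes, last value)

    def set(self, attributes, value):
        # Synchronous "set": overwrite whatever was recorded before.
        with self._lock:
            key = tuple(sorted(attributes.items()))
            self._values[key] = (attributes, value)

    def collect(self):
        # Called from the async gauge callback on each collection interval.
        with self._lock:
            observations = list(self._values.values())
            self._values.clear()  # flush, so idle series produce gaps
            return observations


queue_depth = LastValueMap()
queue_depth.set({"queue": "orders"}, 12)
queue_depth.set({"queue": "orders"}, 7)  # overwrites: last value wins
print(queue_depth.collect())  # [({'queue': 'orders'}, 7)]
print(queue_depth.collect())  # [] -> a gap in the series
```

This is exactly the boilerplate the commenter objects to having to maintain just to get last-value semantics.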
Related, since I asked about this in the dotnet implementation repo: open-telemetry/opentelemetry-dotnet#2362 (comment). Our use case is that we have millions of actors whose mailbox depth we would like to periodically report on, but only when the actor is actually processing messages. Having an observable gauge for each of those would be non-performant for our use cases; right now we're trying to use histograms and it's a bit of a hack.
From CNCF Slack we have a use-case described:
In this case, the application as built does not keep persistent state; it just observes changes to the current value of a cumulative sum or gauge quantity.
@Aaronontheweb why is the current "UpDownCounter" not good? Curious what exactly the metric about the actors is that you want to report.
@jmacd now that I see this example better, can that be an UpDownCounter where you do +1 on start and -1 on end? I know it is a bit of an abuse of the API, but I want to understand if this is good as a short-term solution until we get more proof that we need a sync Gauge.
Because I'm measuring the point-in-time size of the queue depth at any given time, rather than its rate of change - I can measure the rate at which actors process messages with a |
@Aaronontheweb "UpDownCounter" is not for "rate". The "UpDownCounter" is, as the name says, a counter (sum) that supports positive and negative values, so your backlog will do "+1" when an element is added and "-1" when one is removed. And you get the right value.
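The +1/-1 bookkeeping suggested here can be sketched in a few lines of plain Python (hypothetical names, not the OTel API). Note the hidden assumption: the sum only equals the queue depth if every enqueue and dequeue event is instrumented, which is what the following comments push back on.

```python
class UpDownCounterSketch:
    """Minimal sketch: an UpDownCounter is a sum that accepts
    negative increments (hypothetical class, not the OTel API)."""

    def __init__(self):
        self.value = 0

    def add(self, delta):
        self.value += delta


backlog = UpDownCounterSketch()
for _ in range(5):
    backlog.add(+1)  # an element was enqueued
backlog.add(-1)      # an element was dequeued
print(backlog.value)  # 4 -- equals the queue depth, if every event is seen
```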
I can't track when a value is added to it - I can only check it when the actor is running. |
I have a similar issue: when processing messages from e.g. Azure EventHub, I have the offset and enqueued time of all messages in the batch, and I get the last offset and last enqueued time from EventHub every time I fetch a new batch. There are many other cases where you might need a synchronous gauge. An example might be an IoT device where some module periodically reports values to the main control unit. This main control unit would like to record these values somewhere the moment it gets them. At this point there is no way to record gauge metrics this way. Sure, you could store them in local variables and then use an async gauge to collect the values on metrics collection, but it seems very convoluted.
Any updates on this as a consideration? |
Would really, really like to see this supported. Our use case: we run a set of tests against our service via an AWS Lambda every minute. We want to emit a gauge indicating whether the tests succeeded or not (basically a binary "1" or "0"). The code is much more complicated with an async gauge than a synchronous gauge since we just want to issue it once and right at the end of the test run, and because we're in a Lambda and need to make sure it emits while the Lambda is running. |
This is a top priority (IMO) and I will begin working on the (very straightforward) proposal in the coming few weeks. |
^^^ Here is the prototype I am working with.
Hi, I'd love to see support for synchronous Gauges as well! The lack of sync gauges can make it difficult to interface with existing libraries which push metrics to your app at specific intervals which they define. In this example (https://github.com/ty-elastic/videojs_observability), metrics are pushed to me which I then copy into a global map. From there, I have to coax otel into turning around and reading/reporting those metrics (some as gauges) asynchronously. At best, the timing is wrong (the data is timestamped at the time of the async reading rather than the time when it was pushed to me). A better approach would allow one to write gauge values synchronously with a proper timestamp; the last value is then reported when the exporter gathers metrics.
I've been assuming that the timestamps reported for synchronous gauges would be the time of collection, not the time the instrument was most recently "set". A deviation from this would be inconsistent with other instruments / aggregations. |
interesting.. that is a reasonable assumption for sync counters, but for sync gauges, it seems like ideally you would assert the timestamp and value together to ensure that, from an observability perspective, you can say "oh - this gauge was at this water mark at this specific time" rather than "this gauge was at this water mark sometime within this collection interval". for most apps, it probably matters not. for "real-time" apps (e.g., a video compression engine), that difference could be confusing.
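The two timestamp conventions being debated can be contrasted with a small sketch (plain Python, invented names, not the OTel API). Convention A stamps the point at collection time, consistent with other instruments; convention B, requested above, would preserve the time of the synchronous set.

```python
import time


class SyncGaugeSketch:
    """Sketch of a settable gauge that remembers when it was set,
    so either timestamp convention could be applied at collection."""

    def __init__(self):
        self._last = None  # (set_time, value)

    def set(self, value):
        self._last = (time.monotonic(), value)

    def collect(self):
        if self._last is None:
            return None  # no update this interval -> gap
        set_time, value = self._last
        return {
            "value": value,
            "set_time": set_time,                 # convention B: time of set()
            "collection_time": time.monotonic(),  # convention A: time of collect()
        }


g = SyncGaugeSketch()
g.set(42)
point = g.collect()
print(point["value"])                                 # 42
print(point["collection_time"] >= point["set_time"])  # True
```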
Any news on this? I think a lot of people need this. |
Just a quick comment to describe a use case of mine: we have an application that runs a scheduled task to count inconsistencies in our data. This computation might be long enough that we don't want to do it when the instrument is observed, but only on a schedule that the application itself defines and is responsible for. We used to use Prometheus gauges. With OpenTelemetry, this is not possible with gauges. Technically it would be possible to use a counter, but from a "functional" point of view, we are not computing delta values; we are computing a "total value" at a scheduled rate. Edit: I guess the thing that makes more sense in my case is to store the value on my end somehow (in memory) and read it async via the async gauge.
I have a few applications that do data processing and I have a desire to expose some of the internal variables as metrics around the internal health of the system. The metrics aren't time series as such; they are samples of a variable's value at a point in time that I desire to emit to the monitoring system for observation. For example, several batches of CSV files come in each day and I may want to observe the number of lines read from those files. The Gauge datatype looks like the correct type for publishing internal metrics because each sample is distinct from the previous sample. However, there's no way to publish a Gauge value directly (synchronously), like this:

```python
file_lines = open("/tmp/datafile.csv", "r").readlines()
metric_lines = meter.create_gauge("line-count")
metric_lines.set_value(len(file_lines))
```

It's unclear how I would go about emitting the value of a variable through an ObservableGauge without first publishing it into a global variable for sampling with the callback, which isn't a realistic solution:

```python
LINE_COUNT = 0

def observable_gauge_func(options):
    global LINE_COUNT
    yield Observation(LINE_COUNT, {})

file_lines = open("/tmp/datafile.csv", "r").readlines()
meter.create_observable_gauge("line-count", [observable_gauge_func])
LINE_COUNT = len(file_lines)
```
We also need this functionality very much. In our scenario, there are a large number of test cases to be executed, but for performance reasons they are only run once a day (configurable). We use a histogram to record whether each case is successful (0/1 binary). But now there are some cases whose return values we care about, and at this point a histogram is no longer appropriate, because some values within a bucket will be lost. Asynchronous gauges are also not suitable for us because we execute the cases through scheduling. Using an UpDownCounter for this feels strange.
+1 on the need for this synchronous gauge type. |
I could also benefit from a synchronous gauge type. Like others in this thread, I want to use it to measure queue depth and unique users in the queue across time. I'm using the Python library. Do let me know if there is another metrics instrument that could be used for this, but I didn't see one that would be a better fit. In the meantime, I have a simple Python class that wraps the asynchronous gauge so I can set point-in-time values without using the callback logic in my application code. It only reports out on the interval you have set on the metrics exporter. That means if I set a value, and then set another value before the initial value has been exported, the first will get dropped. There is probably a way to fix that, but for my use case that trade-off is fine.

```python
from typing import Iterable

from opentelemetry import metrics

# Alias to make it easier to access the Observation class.
# This takes in a single value (float or int) and an optional dictionary
# of attributes.
# https://opentelemetry-python.readthedocs.io/en/latest/api/metrics.html#opentelemetry.metrics.Observation
MetricsGaugeObservation = metrics.Observation


class MetricsGauge:
    def __init__(self, name: str, meter: metrics.Meter):
        self._name = name
        self._meter = meter
        self._internal_gauge = None
        self._observations = []

    def set_observations(self, observations: Iterable[MetricsGaugeObservation]):
        self._observations = observations
        # The async gauge starts reporting as soon as it is created. Create
        # the async gauge once a real value has been set to avoid creating noise
        # prior to values getting set.
        if self._internal_gauge is None:
            self._internal_gauge = self._meter.create_observable_gauge(
                name=self._name, callbacks=[self._gauge_callback]
            )

    def _gauge_callback(self, _options):
        return self._observations


# In application code
meter = get_meter(__name__)
gauge = MetricsGauge("gauge.name", meter)

# When the gauge value changes
new_value, new_attributes = func_to_get_gauge_value_and_attributes()
# This takes an iterable so that you can set multiple values and attributes
# for the same timestamp.
gauge.set_observations([MetricsGaugeObservation(new_value, new_attributes)])
```
@jmacd Did you ever get any traction on the proposal you mentioned in July? The lack of sync gauges is causing a massive headache for us in a service where state isn't saved, so there isn't a great way to handle callbacks. |
any update for this proposal? |
Here's my polyfill for JavaScript:

```typescript
// Types assumed to come from @opentelemetry/api.
import { Meter, Observable } from "@opentelemetry/api";

class OTelGauge {
  private currentValue: number;
  private gauge: Observable;

  constructor(meter: Meter, name: string, initValue?: number) {
    this.currentValue = initValue;
    this.gauge = meter.createObservableGauge(name);
    this.gauge.addCallback((result) => result.observe(this.currentValue));
  }

  observeValue(newValue: number) {
    this.currentValue = newValue;
  }
}
```
For queue tracking purposes this one is also really essential for us. |
Having a |
A synchronous version of the gauge would be helpful for our use case as well, as we need to emit metrics whose negative values are set as absolute values rather than by incrementing, and currently the only support for negative values is the asynchronous gauge/up-down-counter or the synchronous up-down-counter.
I've ported an application from the Prometheus client to OpenCensus. Both have a synchronous gauge implementation, so the switch wasn't that hard. Now OpenCensus is no longer supported as of July and recommends using OpenTelemetry. And here there is only an asynchronous gauge.
FYI, I've built a prototype java implementation of this here and am planning on opening a spec PR with a change proposal. |
@jmacd a very nicely written argument for how to implement Gauge, but I feel that there are not enough examples where the async/observer pattern does not work. The linked issue regarding the async up-down counter is very weak; I'm unable to understand the argument for why it needs to expose state...
…ence on metric point persistence (#3540)

Resolves #2318. Extends the metrics API to include a new synchronous gauge instrument, whose value can be set as measurements occur. This is useful when the measurements are made accessible via subscription, such as JFR events. There are interesting questions around the behavior of `start_time_unix_nano` and whether or not sync gauges continue to export series without new measurements.

This PR has gone through some iteration (available to see via commit history), but as it currently stands, I've made the following changes:

- Add new language stating that the metric reader's configurable aggregation temporality, as a function of instrument type, influences the persistence of metric points across successive collections, as described in this #3540 (comment). This influence occurs regardless of whether the resulting metric point has an aggregation temporality field. In other words, sync gauge persistence behavior is influenced by aggregation temporality selection.
- Add new language stating that the metric reader's configurable aggregation temporality, as a function of instrument type, influences the starting timestamp of metric points. When an instrument is cumulative, its starting timestamp will always be the same. When an instrument is delta, its starting timestamp advances with successive collections. This influence occurs regardless of whether the resulting metric point has an aggregation temporality field. In other words, sync gauge starting timestamp is influenced by aggregation temporality selection.

---------

Co-authored-by: Reiley Yang <reyang@microsoft.com>
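The start-timestamp rule described in this change can be illustrated with a small deterministic sketch (plain Python, invented names; the clock is an integer supplied by the caller): cumulative streams keep a fixed start time, while delta streams advance it on every collection.

```python
class GaugeStreamSketch:
    """Sketch of the start-timestamp behavior: cumulative streams keep a
    fixed start time; delta streams advance it on each collection.
    Hypothetical class, not SDK code."""

    def __init__(self, temporality, now):
        self.temporality = temporality  # "cumulative" or "delta"
        self.start = now

    def collect(self, now):
        point = {"start_time": self.start, "time": now}
        if self.temporality == "delta":
            self.start = now  # delta: start advances with each collection
        return point


cum = GaugeStreamSketch("cumulative", now=0)
delta = GaugeStreamSketch("delta", now=0)
print(cum.collect(10)["start_time"], cum.collect(20)["start_time"])      # 0 0
print(delta.collect(10)["start_time"], delta.collect(20)["start_time"])  # 0 10
```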
There are commits for Python, C++ and .Net - do you plan to provide it for other languages / SDKs like Golang etc.? |
My bad. Saw that this was picked up in #3540 |
I found that the synchronous Here's the official code of So I used this code and worked it around for now. Will replace it with the original synchronous |
What are you trying to achieve?
There are long-standing requests to support a settable Gauge, as this is an API both Prometheus and Statsd support.
In the OpenTelemetry data model, the same request arises for an UpDownCounter that can be set. Whereas an asynchronous UpDownCounter can be set from a callback, users are asking for an API to set a Gauge or cumulative UpDownCounter from synchronous code (and, for that matter, a cumulative Counter).
The OpenTelemetry data model says exactly what kind of point should be produced for these events (Gauges and cumulative Sums), so we know what the API should be and what the data should look like, but still there's been an objection. Let's focus on Gauge, since it has a longer history.
You can argue that by making the OTel Gauge instrument only support asynchronous use from a callback, we are improving the efficiency of a monitoring solution by preventing the user from making extraneous observations.
The counter-argument is that you're making the user's life more difficult, particularly in situations where they are holding a value and do not want to defer it until a callback. See, e.g., #2054.
One of the objections to having a synchronous Gauge is that depending on the collection interval, there may or may not be gaps in the series. However, the data model discusses gap handling, and the resulting series with gaps is well-defined. The result, some will argue, is working as intended.
The big distinction in terms of implementation between the synchronous and asynchronous instruments inside the SDK is whether an aggregator will see many updates per interval or is expected to see one update per interval. Our specification is not very strong on what should happen when multiple observations are made for a single instrument and attributes during a collection interval. The result is currently unspecified.
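One way to pin down the behavior left unspecified here is a last-value aggregator: many synchronous updates within one collection interval collapse to the final value, and an interval with no updates reads as a gap. A minimal sketch in plain Python (hypothetical, not SDK code):

```python
class LastValueAggregator:
    """Sketch: many updates per interval collapse to the last value;
    an interval with no updates yields a gap (None)."""

    _GAP = object()  # sentinel distinguishing "no update" from any real value

    def __init__(self):
        self._value = self._GAP

    def update(self, value):
        self._value = value  # later updates overwrite earlier ones

    def collect(self):
        value = self._value
        self._value = self._GAP  # reset so an idle interval reads as a gap
        return None if value is self._GAP else value


agg = LastValueAggregator()
for v in (1, 5, 3):
    agg.update(v)
print(agg.collect())  # 3 (last value wins)
print(agg.collect())  # None (gap)
```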
Proposal
This proposal would be easy to pull off in OTel-Go because we are leaning toward using similar APIs for synchronous and asynchronous instruments, i.e., in OTel-Go an asynchronous instrument is just like a synchronous instrument that you only call from a registered callback.
To add the requested APIs:

- `Observe(value)` for setting cumulative and current values, but used outside of registered callbacks

The result would be as expected. If the user does not update a synchronous gauge during a collection interval, there is missing data (i.e., a gap). If the user updates once or more between collection intervals, the last value will be observed.
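Under the OTel-Go framing above, where an asynchronous instrument is just a synchronous one called only from a registered callback, a single Observe(value) entry point can serve both paths. A rough sketch in plain Python (names invented for illustration, not the real API):

```python
class GaugeSketch:
    """Hypothetical sketch: the same observe(value) call works
    synchronously and from a registered callback; the last value
    observed before (or during) collection wins."""

    def __init__(self):
        self._callbacks = []
        self._last = None

    def observe(self, value):
        self._last = value

    def register_callback(self, cb):
        self._callbacks.append(cb)

    def collect(self):
        for cb in self._callbacks:
            cb(self)       # async path: the callback calls observe()
        return self._last  # either way, the last value is reported


g = GaugeSketch()
g.observe(10)              # synchronous path
print(g.collect())         # 10
g.register_callback(lambda gauge: gauge.observe(99))
print(g.collect())         # 99 (the callback ran at collection time)
```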
Overall, this seems like a small extension to the current specification, does not create any new instruments or APIs or data model, and gives users what they've been asking for.