Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics: Consider introducing dedicated measure instrument for timings #464

Open
jmacd opened this issue Feb 13, 2020 · 25 comments
Open

Metrics: Consider introducing dedicated measure instrument for timings #464

jmacd opened this issue Feb 13, 2020 · 25 comments
Labels
area:api Cross language API specification issue spec:metrics Related to the specification/metrics directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor

Comments

@jmacd
Copy link
Contributor

jmacd commented Feb 13, 2020

This text was considered for the 0.3 release, but held back. Consider this for 0.4.

### Option: Dedicated Measure instrument for timing measurements

As a language-optional feature, the API MAY support a dedicated
instrument for reporting timing measurements.  This kind of
instrument, with recommended name `Timer` (and `BoundTimer`), is
semantically equivalent to a Measure instrument.  Like the Measure
instrument, Timers support a `Record()` function.  The input value to
this instrument is in the language's conventional data type for timing
measurements.

Timer instruments MUST capture only the magnitude of the input value
(i.e., an absolute value).  When the user provides a negative value to
the Timer `Record()` function, the captured measurement is the number
stripped of its sign.

For example, in Go the API will accept a `time.Duration`, and in C++
the API will accept a `std::chrono::duration`.  These instruments
apply the correct units automatically, reducing the potential for
confusion over timing metric events.

See the comment: #430 (comment)

@andrewhsu
Copy link
Member

from the spec sig mtg today, sounds like this one could be suitable for v0.5 milestone

@carlosalberto carlosalberto modified the milestones: v0.5, v0.6 Jun 9, 2020
@bogdandrutu bogdandrutu added the area:api Cross language API specification issue label Jun 26, 2020
@jmacd
Copy link
Contributor Author

jmacd commented Jun 29, 2020

This issue is being highlighted in a comparison of Micrometer and OpenTelemetry.
(Not surprisingly!) We really need a timing instrument.

@jkwatson
Copy link
Contributor

I continue to think this is a good idea. It provides a very concrete instrument for doing what is likely the most common instrumentation metric recordings.

@nnanda2016
Copy link

@jmacd You mentioned few design points in the Gitter channel. Should they be updated here?

@jmacd
Copy link
Contributor Author

jmacd commented Jun 29, 2020

I wrote:

There are some design questions perhaps. Two basic APIs are:
(1) This is an instrument semantically like ValueRecorder that takes the builtin duration type, 
(2) this is an instrument that knows about clocks and can time you a section and metric it (which is a lot like a Span). 

I think we want (1) and to tell users of (2) they might consider using a Span instead.

@james-bebbington
Copy link
Member

I think we want (1) and to tell users of (2) they might consider using a Span instead.

If all we want is (1), couldn't we support an additional data type (i.e. int64, double, & duration) rather than creating a new instrument?

@reyang reyang added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 10, 2020
@bogdandrutu bogdandrutu added the priority:p3 Lowest priority level label Jul 24, 2020
@kenfinnigan
Copy link
Member

What's the status of this?

With respect to the comment above, I think (2) is still valid because not all spans are sampled. It's not possible to determine the latency and throughput of a method without timers being present for all executions.

@andrewhsu andrewhsu added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 20, 2020
@jmacd jmacd removed their assignment Feb 4, 2021
@reyang reyang self-assigned this Mar 3, 2021
@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Jun 1, 2021

My two cents:

Recording time values is not a first-class concept in OpenTelemetry, there is no component that supports this by default but users need to measure time on their own and record it using a Histogram.

My experience with measuring elapsed time is the following:

  1. It is hard to do it right (see later)
  2. It can be hard to test (see later)
  3. Handling the base time unit can be confusing for different backends (see later)
  4. It is the most used scenario in terms of recording metrics (please let us know if you have a different experience)

Please also notice that having a Timer can (should) also mean more than "just" recording a latency Histogram:

  1. A Counter for the rate of the recordings (by attributes)
  2. A Gauge for the max latency
  3. Configurable "percentile" values (e.g.: TP99)

Why measuring elapsed time is hard?

Since there is no clock abstraction currently in OTel, it is easy for users to calculate time values incorrectly (e.g.: using wall time instead of monotonic time) so depending on how do you get the current time, you might get into various issues.

Granularity

The method that you use maybe returns the current time in milliseconds but the granularity of the time value can depend on the underlying operating system and may be larger than the base unit. For example, many operating systems measure time in units of tens of milliseconds.

Let's say your OS measures time in chunks of 100ms. In this scenario there is no way to measure anything that needs to be more precise than this. So if an operation takes somewhere between 20 and 80ms, your measurement will always be either 0 or 100ms (but this precision can be worse, e.g.: 1s).

Time can change (non-monotonic)

The method that you use to get the current time might consider:

  • Leap seconds: so measurements could be off when this happens
  • The un-uniformness of Earth's rotation (slows down and speeds up in complicated ways): so time might not always flow uniformly, measurements could be off
  • Adjustments due to synchronizing the system clock (Network Time Protocol (NTP)): so measurements could be really off when this happens
  • Daylight Saving Time (DST): off by one hour?

If you want to have some fun, read this: https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time.

There might be other issues too but my point is: measuring elapsed time is hard (because of this languages/language-SDKs has support for measuring elapsed time).

Let me show two OpenTelemetry examples that demonstrate some of the problems above:

... not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

Why testing a timer is hard?

Because of the issues above and because of the uncertainty of the duration of operations, it can be hard to verify if your instrumentation is correct. Having a clock abstraction makes this easy since you can mock/fake the clock so that it will return what you want.

Why could the time unit be a problem?

Some metrics backends have a preference for a base time unit (e.g.: Prometheus expects times in seconds). This preference is a concern for exporting but recording should be possible in any time unit. Without having a first-class concept of a Timer, this doesn’t seem possible currently without parsing the String unit to determine if it is a time unit.

/cc: @reyang @jmacd @jkwatson @bogdandrutu

@jsuereth
Copy link
Contributor

jsuereth commented Jun 3, 2021

C++ also has std::chrono::high_resolution_clock. I'd be a fan of a Timer-specific instrument that simplifies measuring latency as I think this is a critical use case for metrics.

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Jun 11, 2021

Per our previous SIG meeting, here's a Java example: https://github.com/jonatan-ivanov/otel-spec-gh-464
It tries to demonstrate the followings:

  • The user is guided while they want to use a clock (monotonic and wall time, see: Clock)
  • The user is guided while they want to measure time (they don't need to the recorded values themselves, see: Timer)
  • Testing is easy using a fake clock (see: MockClock and TimerDemo)

Please check out the code and the "tests", here's an example to use the timer and fake the clock:

Timer timer = meter.createTimer("exampleTimer", "just an example");
Timer.Sample sample = timer.start(fakeClock);
fakeClock.add(Duration.ofSeconds(1));
long recordedDuration = timer.stop(sample);

assertThat(recordedDuration).isEqualTo(Duration.ofSeconds(1).toNanos());

This is just an example, there are other arrangements to do this, this is just an example (e.g.: timer methods and how the clock is injected, histogram is not used in the example, etc.).

@Aneurysm9
Copy link
Member

Here's a possible implementation in Go:

type TimerInstrument struct {
	syncInstrument
}

type Timer struct {
	mu sync.Mutex
	instr TimerInstrument
	start time.Time
	labels []attribute.KeyValue
}

func (t TimerInstrument) Start(ctx context.Context, labels ...attribute.KeyValue) *Timer {
	return &Timer{instr: t, start: time.Now(), labels: labels}
}

func (t TimerInstrument) Record(ctx context.Context, f func(context.Context) error, labels ...attribute.KeyValue) error {
	timer := t.Start(ctx, labels...)
	err := f(ctx)
	err2 := timer.Stop(ctx)

	if err == nil && err2 != nil {
		return fmt.Errorf("error stopping timer: %w", err2)
	} else if err != nil && err2 == nil {
		return fmt.Errorf("error executing timed function: %w", err)
	} else if err != nil && err2 != nil {
		return fmt.Errorf("error executing timed function (%s) and error stopping timer (%s)", err, err2)
	}

	return nil
}

func (t *Timer) Stop(ctx context.Context, labels ...attribute.KeyValue) error {
	t.mu.Lock()
	defer t.mu.Unlock()

	if t.start.IsZero() {
		return fmt.Errorf("timer not active, unable to stop")
	}

	dur := time.Since(t.start)
	t.instr.directRecord(ctx, number.NewInt64Number(int64(dur)), append(t.labels, labels...))
	t.start = time.Time{}

	return nil
}

It seems largely similar to what @victlu and @jonatan-ivanov have proposed, though it would have the user call Stop() on the value returned from Start() rather than passing that value back to the instrument. I'm not married to this approach, this is just what I've done in the past.

I think it will be important to be able to provide attributes both when starting and stopping a timer.

I've included a Record() method that could be used to record the time consumed by a function. I'm not sure how useful it is, but there was discussion earlier about adding such a capability and this demonstrates that it should be relatively easy once Start() and Stop() are implemented.

@jmacd
Copy link
Contributor Author

jmacd commented Jun 14, 2021

Here is a Go prototype as well. The thing I like about this design is that you can write a one-liner to record a duration.

https://github.com/lightstep/opentelemetry-prometheus-sidecar/blob/main/telemetry/timer.go

@anuraaga
Copy link
Contributor

anuraaga commented Jun 14, 2021

For reference, here's our current implementation of request metrics in Java

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/http/HttpServerMetrics.java#L154

At first I was quite sad not to have a Timer API which I would have just put into the Context directly - but since I ended up needing to propagate labels from start to end of the request, adding a long field became less of a deal. The discoverability of the API for Timer and ease of testing are of course still great reasons to have it though.

Edit: But if there were a similar class for the counter too, where add and sub don't operate on the meter but happens based on an intermediate object, then labels don't have to be propagated / grokked and that would be nice too. I think this issue may be not only about timings but about any metric that spans a request's beginning and end which includes active request counting as well.

@reyang reyang added release:after-ga Not required before GA release, and not going to work on before GA and removed release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs labels Sep 14, 2021
@reyang reyang removed this from the Metrics API/SDK Feature Freeze milestone Sep 14, 2021
@reyang reyang added this to the Metrics Future Release milestone Nov 7, 2021
@reyang reyang removed their assignment Nov 7, 2021
@donbourne
Copy link

Hello! I'm working on the next generation of the MicroProfile Metrics spec. We're intending to allow for any metrics API to be used as the implementation of our next API. One important part of that API will be (and has been in past releases of our spec) a Timer metric. So it would be great if OTel could find a solution for this issue so that OTel metrics could be used by implementers of the next version of the MicroProfile Metrics spec.

@trask
Copy link
Member

trask commented Apr 7, 2022

hi @donbourne! we are currently capturing "timing" metrics as histograms, e.g. http.server.duration

it's likely that a future "timer" instrument would just be a (very helpful) abstraction on top of histogram

since you would be providing your own Timer abstraction in MicroProfile metrics, would it work for you to bridge that directly to otel histogram for now?

@jmacd
Copy link
Contributor Author

jmacd commented Apr 7, 2022

Agreeing with @trask -- to me the reason for a dedicated timing instrument is to abstract away details about handling clocks and/or to have a more stopwatch-like API. If you already have a timing measurement, then just use a histogram with the correct units.

@donbourne
Copy link

thanks @trask and @jmacd , knowing that the plan for OTel is to continue to use a histogram for Timers is useful. I realize that different metrics APIs will use different ways of representing a Timer, so it would be difficult for us to try to standardize on the exact output expected for timers. That said, using a histogram is similar in spirit to what we've done in the past for Timers (and what Micrometer can do when using a Prometheus Meter Registry).

So while I agree that having a timing instrument seems very useful from a syntax perspective, I get your point that an implementer of MicroProfile Metrics could make do with a histogram (particularly if future timing instrument will be implemented on top of a histogram).

@MrAlias
Copy link
Contributor

MrAlias commented Feb 6, 2023

Related to open-telemetry/oteps#129

@EugeneKrapivin
Copy link

Hey all! interesting discussion, I'm migrating the organization I work in from multiple metric strategies (Metrics.Net and app-metrics to influxDB) to a single industry standard solution, which I'm happy to embrace.

This conversation caught my eye because OTEL doesn't define a "timer", and here you are discussing a possible implementation.

I'm afraid that defining a Timer as a useful abstraction on top of a histogram might lead to a confusion with developers that are not familiar with the spec and discussion leading to it.

How would such an instrument help me to get mean, min, max and quantiles?
I'm a bit unsure regarding the use of histogram for the measurement of timed operations, since it lacks precision.

@jsuereth
Copy link
Contributor

Histograms are designed for scenarios where you want mean, min, max and quantiles. The difference between a Histogram and other types of in-memory aggregations (like Summary/Counter) is that Histogram pushes as much data as reasonable (according to bucket boundaries) so the Quantile computation can happen server side.

Every histogram has these components:

  • sum of all values
  • count of all values
  • min of all values
  • max of all values
  • "bucket counts" (used to compute quantiles with sum/count/min/max to varying degrees of accuracy).

While there are many ways of calculating Quantiles, Histograms have become very popular due to better ability to handle time-alignment issues at query time over raw "gauge of quantile" type behavior we had before.

Hope that helps!

@jonatan-ivanov
Copy link
Member

I'm afraid that defining a Timer as a useful abstraction on top of a histogram might lead to a confusion with developers that are not familiar with the spec and discussion leading to it.

Wouldn't defining a Histogram lead to the same confusion with developers that are not familiar with the spec? :)

To me the biggest advantage of having a Timer is the ability of measuring elapsed time. Which seems to be surprisingly hard because it seems most of the users are not get it right even though the solution is surprisingly easy in most platforms (see above or the article I linked above, though time is hard in general).

There are more advantages like testing and handling time units too, see my comment above.

@dyladan dyladan added triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed release:after-ga Not required before GA release, and not going to work on before GA priority:p3 Lowest priority level labels Mar 29, 2024
@austinlparker austinlparker moved this to Spec Icebox in 🔭 Main Backlog Apr 16, 2024
@austinlparker austinlparker moved this from Spec - Accepted to Spec - Priority Backlog in 🔭 Main Backlog Apr 16, 2024
@fstab
Copy link
Member

fstab commented Apr 18, 2024

For reference: The Prometheus Java client library has a TimerApi, but that's just syntactic sugar for Histogram, Summary, and Gauge. https://github.com/prometheus/client_java/blob/main/prometheus-metrics-core/src/main/java/io/prometheus/metrics/core/datapoints/TimerApi.java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue spec:metrics Related to the specification/metrics directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor
Projects
Status: Spec - Priority Backlog
Development

No branches or pull requests