
Make metrics filterable based upon metadata #55

Closed
bryannaegele opened this issue Jan 14, 2020 · 26 comments · Fixed by #61
Labels
discussion, enhancement (New feature or request)

Comments

@bryannaegele
Contributor

bryannaegele commented Jan 14, 2020

The "prefix or not to prefix" issue seems to boil down to the inability to allow users to conditionally handle an event based upon a predicate other than the event name. The proposal would be to add a filter predicate option which would be used by reporters to conditionally handle the event.

This should allow for simpler APIs for library authors going forward and help some tracing vendors with their needs by discouraging the use of required prefixes or auto-generated prefixes. User-supplied prefixes should still be accepted for power users with tight performance requirements where auto-instrumentation by vendors is less of a concern.

Example

I am read/write splitting to my DB, which requires separate Ecto Repos. The latency requirements for my writes are much different from those for my reads, resulting in different bucket sizes for each.

Currently, I must utilize event prefixes provided by Ecto to distinguish between these repos.

distribution("ecto.writer.query.duration.ms",
  buckets: [5, 10, 20, 50, 100, 200, 500, 1000, 1500, 2000, 5000, 10000],
  event_name: [:writer, :repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
),
distribution("ecto.reader.query.duration.ms",
  buckets: [5, 10, 15, 25, 35, 50, 100, 200],
  event_name: [:reader, :repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
)

If I could instead conditionally handle an event based upon the metadata, the need for a prefix could be removed in most use cases.

distribution("ecto.writer.query.duration.ms",
  buckets: [5, 10, 20, 50, 100, 200, 500, 1000, 1500, 2000, 5000, 10000],
  event_name: [:repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  filter: fn meta ->
    match?(%{repo: MyApp.Repo}, meta)
  end,
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
),
distribution("ecto.reader.query.duration.ms",
  buckets: [5, 10, 15, 25, 35, 50, 100, 200],
  event_name: [:repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  filter: &match?(%{repo: MyApp.ReadOnlyRepo}, &1),
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
)

Reporters should honor the predicate in their handlers when the filter is a function. To maintain the highest performance, reporters could add a second event handler which processes metrics defined with a predicate to remove an additional function call, e.g. handle_event/4 and handle_filtered_event/4.
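A rough sketch of what that two-handler split could look like in a reporter (illustrative names; this assumes the proposed :filter option becomes a field on each metric struct, which is not settled yet):

defmodule MyReporter do
  # Split the metrics for one event into a fast path (no predicate)
  # and a filtered path, and attach a dedicated handler for each.
  def attach(event, metrics) do
    {filtered, plain} = Enum.split_with(metrics, & &1.filter)

    :telemetry.attach({__MODULE__, event, :plain}, event, &__MODULE__.handle_event/4, plain)
    :telemetry.attach({__MODULE__, event, :filtered}, event, &__MODULE__.handle_filtered_event/4, filtered)
  end

  # Fast path: record every metric without any predicate check.
  def handle_event(_event, measurements, metadata, metrics) do
    Enum.each(metrics, &record(&1, measurements, metadata))
  end

  # Filtered path: evaluate each metric's predicate before recording.
  def handle_filtered_event(_event, measurements, metadata, metrics) do
    for metric <- metrics, metric.filter.(metadata), do: record(metric, measurements, metadata)
  end

  # Reporter-specific recording would go here.
  defp record(_metric, _measurements, _metadata), do: :ok
end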

/cc @binaryseed @josevalim @arkgil @GregMefford @tsloughter @hauleth

@bryannaegele added the enhancement (New feature or request) and discussion labels on Jan 14, 2020
@arkgil
Contributor

arkgil commented Jan 14, 2020

I like this idea. It gives enough flexibility to support both the same-prefix and different-prefix cases. It would also be very useful for reporting only a subset of events as metrics.


To maintain the highest performance, reporters could add a second event handler which processes metrics defined with a predicate to remove an additional function call, e.g. handle_event/4 and handle_filtered_event/4.

@bryannaegele I'm not sure I understand this part - can you explain?

@hauleth

hauleth commented Jan 14, 2020

Simple question: is it possible to update a metric's filter after creation? Something like that might be feasible in some cases. If so, then maybe filter should be a list? Maybe we should look at how logger did it (see the sketch below):

  • filters is a list of {filter_fun(), Opts :: term()}
  • each handler has a default_action, either log or stop, which applies if none of the filters decides otherwise
  • filter_fun() takes 2 arguments - the event and Opts
  • filter_fun() can return:
    • stop - to prevent the event from being logged
    • ignore - this filter doesn't do anything and further filters should decide what to do
    • the event itself (possibly modified) - the event that will be sent to the handler
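For comparison, a minimal sketch of OTP's logger filter shape, called from Elixir via Erlang's :logger API (the filter id and handler id here are illustrative):

# A filter fun receives the log event and its opts, and returns
# :stop (drop), :ignore (let later filters decide), or the event.
drop_debug = fn
  %{level: :debug}, _opts -> :stop
  _event, _opts -> :ignore
end

# Filters are registered per handler as {fun, opts}; :ignore falls
# through to the handler's filter_default (:log or :stop).
:logger.add_handler_filter(:default, :drop_debug, {drop_debug, []})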

@bryannaegele
Contributor Author

To maintain the highest performance, reporters could add a second event handler which processes metrics defined with a predicate to remove an additional function call, e.g. handle_event/4 and handle_filtered_event/4.

@bryannaegele I'm not sure I understand this part - can you explain?

Sure. I had two concerns about that.

With one handler, the handler would need to check whether there is a predicate (or provide a default) and then execute it on every call. That isn't a huge hit, but it's not zero. More importantly, user-supplied predicates will almost always be anonymous functions, which do have a measurable performance hit. This pattern would avoid doing those checks on every call by segregating known filterable calls to another handler.

I suppose you could provide a default handler in your reporter that is a function capture rather than an anonymous function.

@bryannaegele
Contributor Author

bryannaegele commented Jan 14, 2020

Is that possible to update metric filter after creation?

I suppose you could if you used an MFA and your function was somehow dynamic. That doesn't sound like something you'd ever do in practice. What would be the use case?

As to filtering similarly to logger: we're only concerned with the question "should I record this event with this definition?", which is a simple predicate rather than something that modifies the event. Modification of metadata is already available via tag_values/1 (see the sketch below).

That does raise a good question, though, of whether to apply tag_values/1 before filter/1. 🤔
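For reference, a minimal sketch of tag_values/1 reshaping event metadata before tags are extracted (the metric name and the derived :repo_name tag are illustrative):

summary("ecto.query.duration",
  event_name: [:repo, :query],
  measurement: :total_time,
  unit: {:native, :millisecond},
  # tag_values reshapes the event metadata before tags are pulled out;
  # :repo_name is a derived key that doesn't exist in the raw metadata
  tag_values: &Map.put(&1, :repo_name, inspect(&1.repo)),
  tags: [:repo_name]
)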

@arkgil
Contributor

arkgil commented Jan 14, 2020

@bryannaegele thanks for clarification 👍

@binaryseed

Is it implied here that libraries move away from configurable event names? It seemed like in other discussions we're already past the point where we can change that...

@hauleth

hauleth commented Jan 15, 2020

That does raise a good question though of whether or not to apply tag_values/1 before filter/1. 🤔

Squash the two into one, like logger does. That would solve the ordering problem.

@bryannaegele
Contributor Author

Is it implied here that libraries move away from configurable event name? It seemed like in other discussions we're already past the point where we can change that...

We could definitely discourage requiring a custom prefix. That would cover the major libraries you're probably concerned with. I could possibly be swayed, but I still think power users should be able to prefix in complex cases for maximum performance.

@arkgil
Contributor

arkgil commented Jan 16, 2020

@bryannaegele is there a particular reason why filtering should be based on metadata only? I imagine measurements could be useful as well, e.g. for counting database queries which take above x milliseconds (although one could argue that that's the goal of distribution/2 😄 ).

@bryannaegele
Contributor Author

@arkgil I couldn't think of a situation where you'd do that with metrics since, as you mentioned, that's more a function of your end tooling.

It is a handy technique, however. I use it for slow-query logging in our services (you probably don't need it if you have tracing):

def handle_event([_repo, :repo, :query], measures, meta, _config) do
  threshold = Config.query_time_warn_threshold()
  total_time_ms = extract_time(measures, :total_time)

  if total_time_ms > threshold do
    query_time_ms = extract_time(measures, :query_time)
    decode_time_ms = extract_time(measures, :decode_time)
    queue_time_ms = extract_time(measures, :queue_time)
    idle_time_ms = extract_time(measures, :idle_time)

    query_data = %{
      title: "Slow Query",
      query: meta.query,
      params: Enum.map(meta.params, &inspect/1),
      total_time: "#{total_time_ms}ms",
      query_time: "#{query_time_ms}ms",
      decode_time: "#{decode_time_ms}ms",
      queue_time: "#{queue_time_ms}ms",
      idle_time: "#{idle_time_ms}ms",
      data_source: meta.source
    }

    # Log the slow query as structured JSON.
    query_data
    |> Jason.encode!()
    |> Logger.warn()
  end

  :ok
end

@arkgil
Contributor

arkgil commented Jan 21, 2020

Right, that makes sense. I'd be for implementing this in Telemetry.Metrics - it looks like a useful and generic (as in, usable across reporters) feature 👍

@arkgil
Contributor

arkgil commented Jan 21, 2020

That does raise a good question though of whether or not to apply tag_values/1 before filter/1. 🤔

I think we should filter first. The filter applies to events, while tag_values produces tags, which are part of the metric derived from that event.

@hkrutzer

hkrutzer commented Feb 5, 2020

Ecto doesn't always have a value for all its metrics, so currently I have to set 0 as a default, which obviously skews results - this would be a great addition. Is this waiting on a PR?

@arkgil
Contributor

arkgil commented Feb 6, 2020

@bryannaegele what @hkrutzer says would be a valid use case for filtering based on measurements.

@hkrutzer I think we need to flesh out the details before moving into implementation, but we're certainly looking for contributors here 👍

@bryannaegele
Contributor Author

@arkgil yes, definitely. This is a good use case which I have run into with Ecto, so I think it's good to support filtering on both.

My updated proposal is for the filter predicate to accept 2 arguments - measurements and metadata.

distribution("ecto.writer.query.duration.ms",
  buckets: [5, 10, 20, 50, 100, 200, 500, 1000, 1500, 2000, 5000, 10000],
  event_name: [:repo, :query],
  measurement: :query_time,
  unit: {:native, :millisecond},
  filter: fn _measurements, meta ->
    match?(%{repo: MyApp.Repo}, meta)
  end,
  description: "Ecto query duration - Time spent executing the query",
  tags: [:repo]
)

Thoughts?

@arkgil
Contributor

arkgil commented Feb 11, 2020

@bryannaegele looks great now 👌

@bryannaegele
Contributor Author

bryannaegele commented Feb 13, 2020

I was working on this last night and realized the described use case is supposed to be handled by reporters already, but I failed to do so in the Prometheus one.

Under https://github.com/beam-telemetry/telemetry_metrics/blob/master/docs/writing_reporters.md#reacting-to-events

Measurements are optional, so we must skip reporting that particular measurement if it is not available

If we have this as a standard for reporters and still think this feature is a good idea (I do), then this is just a bug in the Prometheus reporter. A minimal sketch of that rule is below.
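Sketch of the skip-if-missing guideline from writing_reporters.md (extract_measurement/2 and record/2 are hypothetical reporter helpers, not Telemetry.Metrics functions):

def handle_event(_event, measurements, _metadata, metrics) do
  for metric <- metrics do
    case extract_measurement(metric, measurements) do
      # Measurement not present for this event -> skip it, don't report 0.
      nil -> :ok
      value -> record(metric, value)
    end
  end

  :ok
end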

I don't see any harm in passing measurements to the filter, but I no longer see a need to.

The only "harm" I would see is not having a need for it while making filter functions more verbose where a simple capture would suffice, such as filter: &match?(%{read_only: true}, &1)

@arkgil
Contributor

arkgil commented Feb 13, 2020

@bryannaegele isn't the problem in the described use case that the values are not missing but set to 0, which skews the aggregations?

The only "harm" I would see is not having a need for it and making filter functions more verbose where a simple capture would suffice

I agree. Maybe this requires more thinking.

@bryannaegele
Contributor Author

They are set to nil.

@josevalim
Contributor

I agree that we can close this for now, as the use case is addressed. We purposely emit nil instead of zero, exactly to avoid skews.

@bryannaegele
Contributor Author

@josevalim the metadata filtering case is still relevant and removes the reliance on event prefixes. The remaining question was whether there's any reason to include measurements as an argument when we can't think of a use case.

@josevalim
Contributor

We can always ship with only the meta and, in the future, if we need the measurements, we can check the function arity and dispatch accordingly with both meta and measurements.
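A sketch of that arity-based dispatch (illustrative, not the shipped implementation):

# Support both filter shapes by checking the predicate's arity.
keep? =
  case metric.filter do
    nil -> true
    f when is_function(f, 1) -> f.(metadata)
    f when is_function(f, 2) -> f.(measurements, metadata)
  end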

@hkrutzer

Measurements are optional, so we must skip reporting that particular measurement if it is not available

So this code should work?

Telemetry.Metrics.summary("somerepo.query.idle_time_ms",
  event_name: "somerepo.query",
  measurement: fn x -> Map.get(x, :idle_time) end,
  tags: [:environment],
  unit: {:native, :millisecond}
)

This results in

[error] Handler {TelemetryMetricsStatsd.EventHandler, #PID<0.810.0>, [:somerepo, :query]} has failed and has been detached. Class=:error
Reason=:badarith
Stacktrace=[{:erlang, :*, [nil, 1.0e-6], []}, {Telemetry.Metrics, :"-maybe_convert_measurement/2-fun-0-", 3, [file: 'lib/telemetry_metrics.ex', line: 639]}

Am I doing it wrong?

@josevalim
Contributor

josevalim commented Feb 15, 2020

Hi @hkrutzer. Nope, that was a bug, thanks for the report. It has been fixed in master.

@hkrutzer

Thank you!

@bryannaegele
Contributor Author

@hkrutzer This is fixed in the Prometheus library but I see your example is from StatsD. If this is happening in the StatsD reporter, that will need to be fixed in there, as well.
