Different aggregation functions for the latency metrics #1741

vladzcloudius · 2022-05-18T15:44:11Z

Installation details
Panel Name: average XXX latency
Dashboard Name: Detailed
Scylla-Monitoring Version: 3.10.0
Scylla-Version: 2021.1.10-0.20220410.e8e681dee

Description
Changing an aggregation function to, for instance, max doesn't change the output result in a consistent way:

"sum" is a default but it definitely doesn't show a sum - it looks more like an average:

And with the "max" it looks quite the same as with the "sum":

The text was updated successfully, but these errors were encountered:

amnonh · 2022-05-19T07:04:21Z

In the detailed dashboard, we show the calculated latency. So for example, when we are showing a cluster view, we don't calcualte p99 for shards and than avg over it, instead, we calculate the actuall p99 of the cluster. So there's no aggreation.

Specificaly for avg latencies becasue it's mathematically possible to avg over avg, we can add aggregation function, ~~I'll do that~~
It's problematic, there's no point in summing over shard latencies, it's probably better to state that aggregation function does not apply to latencies.

vladzcloudius · 2022-05-24T21:39:59Z

It's problematic, there's no point in summing over shard latencies, it's probably better to state that aggregation function does not apply to latencies.

Not sure I understand the point above. Could you, please, clarify?
Also, note that in the issue description I'm showing that max function doesn't work as expected. And max definitely applies for any latency graph (both averages and Ps).

amnonh · 2022-05-25T06:31:06Z

I mean that min and max are useful, but I currently have no good way of applying it on latency without making the latencies graph unsuable by default.

amnonh · 2023-10-26T14:02:22Z

@vladzcloudius I've made an enhencement out of it to keep it open for future functionality. The question is what is the use case you are trying to solve?

amnonh · 2024-05-23T14:41:06Z

@tzach We had this discussion in the past about latency max. I think the overall understanding is that max is not relevant for latency, just quantile (for example, P99).
@vladzcloudius, can you articulate what you want to see?

tzach · 2024-05-23T18:41:13Z

Indeed, showing Max can lead to an outlier, when one random point affects the entire day or year of data.
I do not mind allowing users to run functions, as long as we do not show it by default.

vladzcloudius · 2024-05-23T22:39:01Z

@tzach We had this discussion in the past about latency max. I think the overall understanding is that max is not relevant for latency, just quantile (for example, P99). @vladzcloudius, can you articulate what you want to see?

@amnonh you are confusing things.
This issue is not about showing max latency (which is quite relevant BTW!).
It's about the regression in the already supported ability to change the aggregation function.
I've updated the issue subject to make it clearer.

max was used to demonstrate the regression.
In the opening message I'm showing how it's broken using (some) latency graph.
The expected output was supposed to be the maximum value (across all shards in my example) - however it wasn't.

amnonh · 2024-06-02T19:14:42Z

The solution will be: when the aggregation function is set to sum, we will have the current behavior (e.g. the actual latency of the node, dc or cluster).
On all other aggregation functions, we will have have the the function based on the shard latency.
For example, setting to max, with aggregation by instance, will show for each instance what is the max shard latency.

($func(wlatencya{by="instance,shard", instance=~"[[node]]|$^",cluster="$cluster", dc=~"$dc", shard=~"[[shard]]|$^",scheduling_group_name=~"$sg"}) by ([[by]]) unless on() (sum(count(wlatencya{by="instance,shard"}) by (instance,shard)) == $func(count(wlatencya{by="instance,shard"}) by (instance,shard)))) or on() wlatencya{by="[[by]]", instance=~"[[node]]|$^",cluster="$cluster", dc=~"$dc", shard=~"[[shard]]|$^",scheduling_group_name=~"$sg"}

vladzcloudius · 2024-06-03T17:04:31Z

The solution will be: when the aggregation function is set to sum, we will have the current behavior (e.g. the actual latency of the node, dc or cluster).

Aren't we currently showing the average value?

amnonh · 2024-06-03T17:30:38Z

The solution will be: when the aggregation function is set to sum, we will have the current behavior (e.g. the actual latency of the node, dc or cluster).

Aren't we currently showing the average value?

We are calculating the P50, P95 and P99 not simply averaging over the per-shard values, especially with P95 and P99 it could be a very different result than simple average.

amnonh · 2024-07-09T11:45:20Z

@vladzcloudius I'm working on an updated version, when looking at the latencies shard view (per scheduling group) do we want to remove the one that are zero?

vladzcloudius · 2024-07-10T00:13:05Z

@vladzcloudius I'm working on an updated version, when looking at the latencies shard view (per scheduling group) do we want to remove the one that are zero?

This will only be the case when corresponding groups are not used at all.
In such cases IMO you can remove always-zero ones.

vladzcloudius added the bug Something isn't working right label May 18, 2022

vladzcloudius assigned amnonh May 18, 2022

amnonh added enhancement New feature or request and removed bug Something isn't working right labels Oct 26, 2023

amnonh changed the title ~~Changing a different aggregation function doesn't work as expected~~ Have an option to show the latency max Oct 26, 2023

vladzcloudius added bug Something isn't working right and removed bug Something isn't working right labels Oct 26, 2023

vladzcloudius changed the title ~~Have an option to show the latency max~~ An option to change the aggregation function seems to be broken May 23, 2024

amnonh added this to the Monitoring 4.8 milestone Jun 2, 2024

amnonh changed the title ~~An option to change the aggregation function seems to be broken~~ Different aggregation functions for the latency metrics Jun 2, 2024

amnonh mentioned this issue Jul 10, 2024

scylla-detailed: Support latencies aggregation function #2350

Merged

amnonh closed this as completed in #2350 Jul 10, 2024

amnonh closed this as completed in 8d54238 Jul 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different aggregation functions for the latency metrics #1741

Different aggregation functions for the latency metrics #1741

vladzcloudius commented May 18, 2022 •

edited

Loading

amnonh commented May 19, 2022 •

edited

Loading

vladzcloudius commented May 24, 2022

amnonh commented May 25, 2022

amnonh commented Oct 26, 2023

amnonh commented May 23, 2024

tzach commented May 23, 2024

vladzcloudius commented May 23, 2024

amnonh commented Jun 2, 2024

vladzcloudius commented Jun 3, 2024

amnonh commented Jun 3, 2024

amnonh commented Jul 9, 2024

vladzcloudius commented Jul 10, 2024

Different aggregation functions for the latency metrics #1741

Different aggregation functions for the latency metrics #1741

Comments

vladzcloudius commented May 18, 2022 • edited Loading

amnonh commented May 19, 2022 • edited Loading

vladzcloudius commented May 24, 2022

amnonh commented May 25, 2022

amnonh commented Oct 26, 2023

amnonh commented May 23, 2024

tzach commented May 23, 2024

vladzcloudius commented May 23, 2024

amnonh commented Jun 2, 2024

vladzcloudius commented Jun 3, 2024

amnonh commented Jun 3, 2024

amnonh commented Jul 9, 2024

vladzcloudius commented Jul 10, 2024

vladzcloudius commented May 18, 2022 •

edited

Loading

amnonh commented May 19, 2022 •

edited

Loading