Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different aggregation functions for the latency metrics #1741

Closed
vladzcloudius opened this issue May 18, 2022 · 12 comments · Fixed by #2350
Closed

Different aggregation functions for the latency metrics #1741

vladzcloudius opened this issue May 18, 2022 · 12 comments · Fixed by #2350
Assignees
Labels
enhancement New feature or request

Comments

@vladzcloudius
Copy link
Contributor

vladzcloudius commented May 18, 2022

Installation details
Panel Name: average XXX latency
Dashboard Name: Detailed
Scylla-Monitoring Version: 3.10.0
Scylla-Version: 2021.1.10-0.20220410.e8e681dee

Description
Changing an aggregation function to, for instance, max doesn't change the output result in a consistent way:

image

"sum" is a default but it definitely doesn't show a sum - it looks more like an average:

image

And with the "max" it looks quite the same as with the "sum":
image

@vladzcloudius vladzcloudius added the bug Something isn't working right label May 18, 2022
@amnonh
Copy link
Collaborator

amnonh commented May 19, 2022

In the detailed dashboard, we show the calculated latency. So for example, when we are showing a cluster view, we don't calcualte p99 for shards and than avg over it, instead, we calculate the actuall p99 of the cluster. So there's no aggreation.

Specificaly for avg latencies becasue it's mathematically possible to avg over avg, we can add aggregation function, I'll do that
It's problematic, there's no point in summing over shard latencies, it's probably better to state that aggregation function does not apply to latencies.

@vladzcloudius
Copy link
Contributor Author

It's problematic, there's no point in summing over shard latencies, it's probably better to state that aggregation function does not apply to latencies.

Not sure I understand the point above. Could you, please, clarify?
Also, note that in the issue description I'm showing that max function doesn't work as expected. And max definitely applies for any latency graph (both averages and Ps).

@amnonh
Copy link
Collaborator

amnonh commented May 25, 2022

I mean that min and max are useful, but I currently have no good way of applying it on latency without making the latencies graph unsuable by default.

@amnonh amnonh added enhancement New feature or request and removed bug Something isn't working right labels Oct 26, 2023
@amnonh amnonh changed the title Changing a different aggregation function doesn't work as expected Have an option to show the latency max Oct 26, 2023
@amnonh
Copy link
Collaborator

amnonh commented Oct 26, 2023

@vladzcloudius I've made an enhencement out of it to keep it open for future functionality. The question is what is the use case you are trying to solve?

@vladzcloudius vladzcloudius added bug Something isn't working right and removed bug Something isn't working right labels Oct 26, 2023
@amnonh
Copy link
Collaborator

amnonh commented May 23, 2024

@tzach We had this discussion in the past about latency max. I think the overall understanding is that max is not relevant for latency, just quantile (for example, P99).
@vladzcloudius, can you articulate what you want to see?

@tzach
Copy link
Contributor

tzach commented May 23, 2024

Indeed, showing Max can lead to an outlier, when one random point affects the entire day or year of data.
I do not mind allowing users to run functions, as long as we do not show it by default.

@vladzcloudius vladzcloudius changed the title Have an option to show the latency max An option to change the aggregation function seems to be broken May 23, 2024
@vladzcloudius
Copy link
Contributor Author

@tzach We had this discussion in the past about latency max. I think the overall understanding is that max is not relevant for latency, just quantile (for example, P99). @vladzcloudius, can you articulate what you want to see?

@amnonh you are confusing things.
This issue is not about showing max latency (which is quite relevant BTW!).
It's about the regression in the already supported ability to change the aggregation function.
I've updated the issue subject to make it clearer.

max was used to demonstrate the regression.
In the opening message I'm showing how it's broken using (some) latency graph.
The expected output was supposed to be the maximum value (across all shards in my example) - however it wasn't.

@amnonh amnonh added this to the Monitoring 4.8 milestone Jun 2, 2024
@amnonh amnonh changed the title An option to change the aggregation function seems to be broken Different aggregation functions for the latency metrics Jun 2, 2024
@amnonh
Copy link
Collaborator

amnonh commented Jun 2, 2024

The solution will be: when the aggregation function is set to sum, we will have the current behavior (e.g. the actual latency of the node, dc or cluster).
On all other aggregation functions, we will have have the the function based on the shard latency.
For example, setting to max, with aggregation by instance, will show for each instance what is the max shard latency.

($func(wlatencya{by="instance,shard", instance=~"[[node]]|$^",cluster="$cluster", dc=~"$dc", shard=~"[[shard]]|$^",scheduling_group_name=~"$sg"}) by ([[by]]) unless on() (sum(count(wlatencya{by="instance,shard"}) by (instance,shard)) == $func(count(wlatencya{by="instance,shard"}) by (instance,shard)))) or on() wlatencya{by="[[by]]", instance=~"[[node]]|$^",cluster="$cluster", dc=~"$dc", shard=~"[[shard]]|$^",scheduling_group_name=~"$sg"} 

@vladzcloudius
Copy link
Contributor Author

The solution will be: when the aggregation function is set to sum, we will have the current behavior (e.g. the actual latency of the node, dc or cluster).

Aren't we currently showing the average value?

@amnonh
Copy link
Collaborator

amnonh commented Jun 3, 2024

The solution will be: when the aggregation function is set to sum, we will have the current behavior (e.g. the actual latency of the node, dc or cluster).

Aren't we currently showing the average value?

We are calculating the P50, P95 and P99 not simply averaging over the per-shard values, especially with P95 and P99 it could be a very different result than simple average.

@amnonh
Copy link
Collaborator

amnonh commented Jul 9, 2024

@vladzcloudius I'm working on an updated version, when looking at the latencies shard view (per scheduling group) do we want to remove the one that are zero?

@vladzcloudius
Copy link
Contributor Author

@vladzcloudius I'm working on an updated version, when looking at the latencies shard view (per scheduling group) do we want to remove the one that are zero?

This will only be the case when corresponding groups are not used at all.
In such cases IMO you can remove always-zero ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants