Alerting framework: Ability to do multiple queries and arithmetics between Metric threshold's metric values #145444

Closed
georgivalentinov opened this issue Nov 16, 2022 · 3 comments · Fixed by #148732
Labels
Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge"

georgivalentinov commented Nov 16, 2022

Describe the feature:

Metric threshold alerts depend on each metric already having the right high-level granularity to alert on something meaningful, e.g. container_cpu_limit_pct (container CPU utilization as a percentage of the limit). Unfortunately, many metrics are lower-level and need some arithmetic to become useful.
Currently we can apply basic aggregations (min, max, sum, cardinality, percentile, ...) to metric values, but nothing more complex.

They say a picture is worth a thousand words, so here's one simplified example (it's how Datadog does this):
Screenshot 2022-11-16 at 20 34 15

It shows a set of lower-level metrics from which one can still derive the same high-level "percent utilization of CPU limit" number by running multiple queries and doing some arithmetic on them.

Describe a specific use case for the feature:

As hinted above, most metrics are lower-level ones, which severely limits what the current Metric threshold rule can be used for.

Here's another example use case:

{
    "prometheus": {
        "labels": {
            "instance": "monitoring-kube-state-metrics:8080",
            "job": "prometheus",
            "namespace": "<my_namespace>",
            "horizontalpodautoscaler": "<my_hpa_name>"
        },
        "metrics": {
            "kube_horizontalpodautoscaler_spec_max_replicas": 5,
            "kube_horizontalpodautoscaler_status_current_replicas": 1,
            "kube_horizontalpodautoscaler_metadata_generation": 0,
            "kube_horizontalpodautoscaler_status_desired_replicas": 1,
            "kube_horizontalpodautoscaler_spec_min_replicas": 1
        }
    }
}

This is a snippet from a document with metrics collected from a Kubernetes horizontal pod autoscaler by Metricbeat's prometheus/collector metricset.
With the current capabilities, one can't compute a ratio, and thus the percentage of currently running replicas relative to the maximum possible (kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas * 100).
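
To make the requested arithmetic concrete, here is a minimal Python sketch that computes the ratio from the sample document above. It only illustrates the formula; the function name `replica_utilization_pct` and the zero-maximum guard are assumptions for illustration, not part of any Kibana API.

```python
# Sample document from the issue, trimmed to the two fields the formula needs.
doc = {
    "prometheus": {
        "metrics": {
            "kube_horizontalpodautoscaler_spec_max_replicas": 5,
            "kube_horizontalpodautoscaler_status_current_replicas": 1,
        }
    }
}


def replica_utilization_pct(document: dict) -> float:
    """Return current replicas as a percentage of the configured maximum.

    Hypothetical helper: it mirrors the formula
    current_replicas / max_replicas * 100 from the issue text.
    """
    metrics = document["prometheus"]["metrics"]
    current = metrics["kube_horizontalpodautoscaler_status_current_replicas"]
    maximum = metrics["kube_horizontalpodautoscaler_spec_max_replicas"]
    if maximum == 0:
        # Assumption: treat a zero maximum as an error rather than dividing.
        raise ValueError("max replicas is zero; the ratio is undefined")
    return current / maximum * 100


print(replica_utilization_pct(doc))  # 20.0
```

With the sample values (1 current replica, 5 maximum), the autoscaler is at 20% of its ceiling, which is the number a threshold would be compared against.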

A third example:

We have the following AWS ALB metrics at hand: HTTPCode_ELB_5XX_Count, HTTPCode_Target_5XX_Count and RequestCount. The first two represent HTTP 5xx errors for an application: one is reported by an AWS target group and the other by an AWS LB. The third represents the total HTTP requests processed by the same AWS LB.
Frequently one wants to sum the first two and compare that number to the total requests to get the percentage of 5xx errors across all traffic, so the formula is similar to the first example: (a + b) / c * 100. That value can then be compared to the warning and/or alert % thresholds.
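
As a rough illustration of this third use case, the sketch below evaluates (a + b) / c * 100 and compares the result to two thresholds. The sample counts, the threshold values, and the function names are made up for the example; only the formula comes from the text above.

```python
def error_rate_pct(elb_5xx: float, target_5xx: float, request_count: float) -> float:
    """Percentage of requests that ended in a 5xx: (a + b) / c * 100."""
    if request_count == 0:
        # Assumption: no traffic means no error rate to alert on.
        return 0.0
    return (elb_5xx + target_5xx) / request_count * 100


def classify(pct: float, warning_pct: float, alert_pct: float) -> str:
    """Map a percentage onto hypothetical warning/alert thresholds."""
    if pct >= alert_pct:
        return "alert"
    if pct >= warning_pct:
        return "warning"
    return "ok"


# Hypothetical sample values for HTTPCode_ELB_5XX_Count,
# HTTPCode_Target_5XX_Count and RequestCount over one evaluation window.
rate = error_rate_pct(elb_5xx=3, target_5xx=7, request_count=1000)
print(classify(rate, warning_pct=0.5, alert_pct=2.0))  # warning
```

Here 10 errors out of 1000 requests give a 1% error rate, which crosses the assumed 0.5% warning threshold but not the 2% alert threshold.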

Sorry if this has already been requested, but I couldn't find an existing issue for it.

botelastic bot added the needs-team label Nov 16, 2022
Dosant added the Team:ResponseOps label Nov 18, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

botelastic bot removed the needs-team label Nov 18, 2022
mikecote added the Team: Actionable Observability label and removed the Team:ResponseOps label Nov 18, 2022
@elasticmachine
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@maryam-saeidi
Member

cc @vinaychandrasekhar

simianhacker added a commit that referenced this issue Feb 1, 2023
## Summary

This PR closes #145444 by adding a custom equation editor to the Metric
Threshold rule. I also added support for custom metrics to the Metrics
Explorer API, which powers the preview chart in the rule editor.
Eventually we could do a follow-up PR to the Metrics Explorer UI to
expose this new functionality, but that is outside the scope of this PR.

### Notable changes with this PR

I changed the reason message for Metric Threshold rules which do not
have a group by. The original message would say something like
`system.cpu.user.pct is 82% in the last 1 min for all hosts. Alert when
> 81%.` I removed the `for all hosts` portion because the Metric
Threshold rule is not limited to just the concept of hosts; our users
rely on this rule as their "Swiss Army Knife" rule for all types of
data.

I also had to change the format of the `currentPeriod` bucket for the
Metric Threshold aggregation to support the "document count with KQL
filter" use case. One of the requirements of a `filter` aggregation is
that it must be a child of a multi-bucket aggregation. This is why I
converted it from a `filter` aggregation to a `filters` aggregation with
an `all` key for the time range query.
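
For readers unfamiliar with the two aggregation types, the following Python sketch builds a request fragment of the shape described above: a `filters` aggregation with a single named `all` filter holding the time range. It is not the PR's actual code. The `@timestamp` field, the `aggregation_A` sub-aggregation name, and the sample term query standing in for a converted KQL filter are all assumptions.

```python
def build_current_period_agg(time_from: str, time_to: str, kql_as_query: dict) -> dict:
    """Build a `currentPeriod` aggregation as a multi-bucket `filters` agg.

    Hypothetical helper. `kql_as_query` stands in for a KQL filter that has
    already been converted to Elasticsearch query DSL.
    """
    return {
        "currentPeriod": {
            # `filters` is a multi-bucket aggregation even with one named filter.
            "filters": {
                "filters": {
                    "all": {
                        "range": {
                            # Assumed timestamp field name.
                            "@timestamp": {"gte": time_from, "lte": time_to}
                        }
                    }
                }
            },
            "aggs": {
                # "Document count with KQL filter": the doc_count of this
                # single-bucket `filter` sub-aggregation is the metric value.
                "aggregation_A": {"filter": kql_as_query},
            },
        }
    }


aggs = build_current_period_agg(
    "now-1m", "now", {"term": {"http.response.status_code": 500}}
)
```

With named filters, Elasticsearch returns the buckets keyed by name, so the result for this sketch would be read from `currentPeriod.buckets.all`.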

I added basic validation for the equations with a regular expression
that just limits the characters to the allowed set: `A-Z, +, -, /, *, (,
), ?, !, &, :, |, >, <, =`. I feel like this is good enough for now. If
we want to expose some of the Painless `Math.*` libraries, then we can
follow up in a later release with a PegJS parser, which would do some
syntax validation as well.
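
A character allowlist like this can be sketched in a few lines of Python. This is not the regular expression from the PR: the pattern below is an assumption built from the list above, and it additionally permits digits, `.`, and whitespace so that an equation such as `(A + B) / C * 100` passes.

```python
import re

# Characters listed in the PR description, plus digits, '.', and whitespace.
# The extra characters are an assumption of this sketch, not from the PR text.
_ALLOWED_EQUATION = re.compile(r"[A-Z0-9.\s+\-/*()?!&:|><=]+")


def is_valid_equation(equation: str) -> bool:
    """Return True if the equation is non-empty and uses only allowed characters.

    This checks characters only. It does not check syntax, so an unbalanced
    expression such as "((A +" still passes.
    """
    return bool(equation) and _ALLOWED_EQUATION.fullmatch(equation) is not None


print(is_valid_equation("(A + B) / C * 100"))  # True
print(is_valid_equation("Math.max(A, B)"))     # False
```

The second call fails because lowercase letters and the comma are outside the allowlist, which is why exposing `Math.*` functions would need the parser-based follow-up mentioned above.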

### Rule with custom equation

<img width="538" alt="image"
src="https://user-images.githubusercontent.com/41702/213583128-1adbc405-828e-4571-aeb4-9900baeaabee.png">

### Rule with custom ratio equation

<img width="538" alt="image"
src="https://user-images.githubusercontent.com/41702/213583239-a39d15d2-7023-4daf-af97-cb25a9965433.png">


### Reason message with custom label


![image](https://user-images.githubusercontent.com/41702/211936062-4b696f0c-dfec-4e48-b89c-b0462fb5f7f0.png)

---------

Co-authored-by: Carlos Crespo <crespocarlos@users.noreply.github.com>
Co-authored-by: Maryam Saeidi <maryam.saeidi@elastic.co>
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this issue Feb 6, 2023