Alerting framework: Ability to do multiple queries and arithmetics between Metric threshold's metric values #145444

georgivalentinov · 2022-11-16T19:28:44Z

Describe the feature:

Metric threshold alerts depend on each metric already having the right high-level granularity to alert on something meaningful, e.g. container_cpu_limit_pct (container CPU utilization as a percentage of the limit). Unfortunately, many metrics out there are lower-level and need some arithmetic to become useful.
Currently we can apply basic aggregations (min, max, sum, cardinality, percentile, ...) to metric values, but nothing more complex.

They say a picture is worth a thousand words, so here's one simplified example (it's how Datadog does this):

It shows a set of lower-level metrics from which one can still derive the same high-level "percent utilization of CPU limit" number by running multiple queries and doing some arithmetic on them.

Describe a specific use case for the feature:

As hinted above, most metrics are lower-level ones, which severely limits what the current Metric threshold rule can be used for.

Here's another example use case:

{
    "prometheus": {
        "labels": {
            "instance": "monitoring-kube-state-metrics:8080",
            "job": "prometheus",
            "namespace": "<my_namespace>",
            "horizontalpodautoscaler": "<my_hpa_name>"
        },
        "metrics": {
            "kube_horizontalpodautoscaler_spec_max_replicas": 5,
            "kube_horizontalpodautoscaler_status_current_replicas": 1,
            "kube_horizontalpodautoscaler_metadata_generation": 0,
            "kube_horizontalpodautoscaler_status_desired_replicas": 1,
            "kube_horizontalpodautoscaler_spec_min_replicas": 1
        }
    }
}

This is a snippet from a document with metrics collected from a Kubernetes horizontal pod autoscaler by Metricbeat's prometheus/collector metricset.
With the current capabilities one can't compute a ratio, and thus the percentage of currently running replicas relative to the maximum possible (kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas * 100).
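To make the requested calculation concrete, here is a minimal Python sketch of the ratio applied to the document above. The function name `replica_utilization_pct` is hypothetical and exists only for illustration; it is not part of Kibana or Metricbeat.

```python
# Hypothetical helper: percentage of the maximum replica count currently running.
def replica_utilization_pct(doc: dict) -> float:
    metrics = doc["prometheus"]["metrics"]
    current = metrics["kube_horizontalpodautoscaler_status_current_replicas"]
    maximum = metrics["kube_horizontalpodautoscaler_spec_max_replicas"]
    if maximum == 0:
        # The ratio is undefined when no replicas are allowed.
        raise ValueError("spec_max_replicas is 0; ratio is undefined")
    return current / maximum * 100


# Values copied from the sample document above (labels omitted).
sample_doc = {
    "prometheus": {
        "metrics": {
            "kube_horizontalpodautoscaler_spec_max_replicas": 5,
            "kube_horizontalpodautoscaler_status_current_replicas": 1,
        }
    }
}

utilization = replica_utilization_pct(sample_doc)
print(f"{utilization:.1f}% of max replicas in use")
```

With the sample values this comes out at roughly 20%, which is the number one would want to compare against a threshold.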

A third example:

We have the following AWS ALB metrics at hand: HTTPCode_ELB_5XX_Count, HTTPCode_Target_5XX_Count and RequestCount. The first two represent HTTP 5xx errors for an application, where one is reported by an AWS target group and the other by an AWS LB. The third represents the total HTTP requests processed by the same AWS LB.
Frequently one wants to sum the first two and then compare that number to the total requests to get the percentage of 5xx errors across all traffic. The formula is similar to the previous example: (a + b) / c * 100. The resulting value can then be compared to the warning and/or alert % thresholds.
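As a rough sketch of that formula in Python: the function name, the sample counts, and the threshold values below are all made up for illustration, and returning `None` when there is no traffic is just one possible way to handle a zero denominator.

```python
from typing import Optional


# Hypothetical helper implementing (a + b) / c * 100.
def error_rate_pct(elb_5xx: float, target_5xx: float, request_count: float) -> Optional[float]:
    if request_count == 0:
        # No traffic in the window: the percentage is undefined (assumed handling).
        return None
    return (elb_5xx + target_5xx) / request_count * 100


# Made-up sample values for one evaluation window.
rate = error_rate_pct(elb_5xx=3, target_5xx=2, request_count=100)

# Made-up thresholds, in percent.
WARNING_PCT = 1.0
ALERT_PCT = 10.0

if rate is None:
    status = "no data"
elif rate > ALERT_PCT:
    status = "alert"
elif rate > WARNING_PCT:
    status = "warning"
else:
    status = "ok"

print(status)
```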

Sorry if all this has already been requested, but I couldn't find an existing issue for it.


elasticmachine · 2022-11-18T11:37:51Z

Pinging @elastic/response-ops (Team:ResponseOps)

elasticmachine · 2022-11-18T11:53:07Z

Pinging @elastic/actionable-observability (Team: Actionable Observability)

maryam-saeidi · 2022-11-23T15:42:12Z

cc @vinaychandrasekhar

## Summary

This PR closes #145444 by adding a custom equation editor to the Metric Threshold rule. I also added support for custom metrics to the Metrics Explorer API, which powers the preview chart on the rule editor. Eventually we could do a follow-up PR to the Metrics Explorer UI to expose this new functionality, which is outside the scope of this PR.

### Notable changes with this PR

I changed the reason message for Metric Threshold rules which do not have a group by. The original message would say something like `system.cpu.user.pct is 82% in the last 1 min for all hosts. Alert when > 81%.` I removed the `for all hosts` portion because the Metric Threshold rule is not limited to just the concept of hosts; our users rely on this rule as their "Swiss Army Knife" rule for all types of data.

I also had to change the format of the `currentPeriod` bucket for the Metric Threshold aggregation to support the "document count with KQL filter" use case. One of the requirements of a `filter` aggregation is that it must be a child of a multi-bucket aggregation. This is why I converted it from a `filter` aggregation to a `filters` aggregation with an `all` key for the time range query.

I added basic validation for the equations with a regular expression that just limits the characters to the allowable: `A-Z, +, -, /, *, (, ), ?, !, &, :, |, >, <, =`. I feel like for now this is good enough. If we want to expose some of the Painless `Math.*` libraries then we can follow up in a later release with a PegJS parser which would do some syntax validation as well.

### Rule with custom equation

<img width="538" alt="image" src="https://user-images.githubusercontent.com/41702/213583128-1adbc405-828e-4571-aeb4-9900baeaabee.png">

### Rule with custom ratio equation

<img width="538" alt="image" src="https://user-images.githubusercontent.com/41702/213583239-a39d15d2-7023-4daf-af97-cb25a9965433.png">

### Reason message with custom label

![image](https://user-images.githubusercontent.com/41702/211936062-4b696f0c-dfec-4e48-b89c-b0462fb5f7f0.png)

---------

Co-authored-by: Carlos Crespo <crespocarlos@users.noreply.github.com>
Co-authored-by: Maryam Saeidi <maryam.saeidi@elastic.co>
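The character-level validation described in the PR summary could look roughly like the following Python sketch. This is not the regular expression from the PR (which is not shown in this thread); the pattern and the function name `is_valid_equation` are illustrative. It also assumes that digits, decimal points and whitespace are allowed so that equations such as `(A + B) / C * 100` pass, even though the list in the summary does not name them.

```python
import re

# Characters listed in the PR summary: A-Z + - / * ( ) ? ! & : | > < =
# ASSUMPTION: digits, "." and whitespace are also permitted (not stated in the summary).
_ALLOWED_CHARS = re.compile(r"[A-Z0-9+\-/*()?!&:|><=.\s]+")


def is_valid_equation(equation: str) -> bool:
    """Return True if the equation is non-empty and uses only allowed characters.

    This checks characters only; it does not verify that the equation is
    syntactically well formed (e.g. balanced parentheses).
    """
    return bool(equation) and _ALLOWED_CHARS.fullmatch(equation) is not None


print(is_valid_equation("(A + B) / C * 100"))  # allowed characters only
print(is_valid_equation("a + b"))              # lowercase letters are rejected
```

A check like this only filters characters, which matches the "good enough for now" framing in the summary; a real parser would be needed to catch malformed expressions such as `A + * B`.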


botelastic bot added the needs-team Issues missing a team label label Nov 16, 2022

Dosant added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Nov 18, 2022

botelastic bot removed the needs-team Issues missing a team label label Nov 18, 2022

mikecote added Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" and removed Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Nov 18, 2022

simianhacker mentioned this issue Jan 11, 2023

Custom equation editor for Metric Threshold Rule #148732

Merged

simianhacker closed this as completed in #148732 Feb 1, 2023
