Alerting framework: Ability to do multiple queries and arithmetics between Metric threshold's metric values #145444
Labels
Team: Actionable Observability - DEPRECATED
For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge"
Comments
Dosant added the Team:ResponseOps label (Label for the ResponseOps team, formerly the Cases and Alerting teams) on Nov 18, 2022.
Pinging @elastic/response-ops (Team:ResponseOps)
mikecote added the Team: Actionable Observability - DEPRECATED label and removed the Team:ResponseOps label on Nov 18, 2022.
Pinging @elastic/actionable-observability (Team: Actionable Observability)
simianhacker added a commit that referenced this issue on Feb 1, 2023:
## Summary

This PR closes #145444 by adding a custom equation editor to the Metric Threshold rule. I also added support for custom metrics to the Metric Explorer API which powers the preview chart on the rule editor. Eventually we could do a follow up PR to the Metrics Explorer UI to expose this new functionality; which is outside the scope of this PR.

### Notable changes with this PR

I changed the reason message for Metric Threshold rules which do not have a group by. The original message would say something like `system.cpu.user.pct is 82% in the last 1 min for all hosts. Alert when > 81%.` I removed the `for all hosts` portion because the Metric Threshold rule is not limited to just the concept of hosts, our users rely on this rule as their "Swiss Army Knife" rule for all types of data.

I also had to change the format of the `currentPeriod` bucket for the Metric Threshold aggregation to support the "document count with KQL filter" use case. One of the requirements of a `filter` aggregation is that it must be a child of a multi-bucket aggregation. This is why I converted it from a `filter` aggregation to a `filters` aggregation with an `all` key for the time range query.

I added basic validation for the equations with a regular expression that just limits the characters to the allowable: `A-Z, +, -, /, *, (, ), ?, !, &, :, |, >, <, =`. I feel like for now this is good enough. If we want to expose some of the Painless `Math.*` libraries then we can follow up in a later release with a PegJS parser which would do some syntax validation as well.

### Rule with custom equation

<img width="538" alt="image" src="https://user-images.githubusercontent.com/41702/213583128-1adbc405-828e-4571-aeb4-9900baeaabee.png">

### Rule with custom ratio equation

<img width="538" alt="image" src="https://user-images.githubusercontent.com/41702/213583239-a39d15d2-7023-4daf-af97-cb25a9965433.png">

### Reason message with custom label

![image](https://user-images.githubusercontent.com/41702/211936062-4b696f0c-dfec-4e48-b89c-b0462fb5f7f0.png)

---------

Co-authored-by: Carlos Crespo <crespocarlos@users.noreply.github.com>
Co-authored-by: Maryam Saeidi <maryam.saeidi@elastic.co>
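The character-based equation validation described in the commit message above could look roughly like the following. This is a minimal sketch, not the code from the PR: the names `EQUATION_CHARS` and `isValidEquation` are hypothetical, and allowing whitespace, digits and `.` is an assumption made so that equations such as `(A + B) / C * 100` pass (the commit message lists only `A-Z` and the operator characters).

```typescript
// Hypothetical sketch; names and the exact pattern are not taken from the PR.
// Characters listed in the commit message: A-Z + - / * ( ) ? ! & : | > < =
// ASSUMPTION: whitespace, digits and "." are also allowed, so that numeric
// constants such as 100 can appear in an equation.
const EQUATION_CHARS = /^[A-Z+\-\/*()?!&:|><=\s\d.]+$/;

// Returns true when the equation contains only allowed characters.
// This is a character filter only; it does not check syntax, so an
// unbalanced "((A" would still pass.
function isValidEquation(equation: string): boolean {
  return EQUATION_CHARS.test(equation);
}

console.log(isValidEquation("(A + B) / C * 100")); // true
console.log(isValidEquation("A > 0 ? B / A : 0")); // true
console.log(isValidEquation("Math.max(A, B)")); // false: lowercase letters and ","
```

Because the check only filters characters, it cannot catch malformed expressions; that is the gap the commit message suggests a later parser-based validation could close.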
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this issue on Feb 6, 2023.
Describe the feature:
Metric threshold alerts depend on each metric already having the right high-level granularity in order to alert on something meaningful, e.g. `container_cpu_limit_pct` (for getting container CPU utilization as a percentage of the limit). Unfortunately, a lot of metrics out there are lower-level ones and need a bit of arithmetic to get to something useful. Currently we can do basic aggregations (min, max, sum, cardinality, percentile, ...) on metric values, but nothing more complex.
They say a picture is worth a thousand words, so here's one simplified example (it's how Datadog does this):
A set of lower-level metrics, with which one is still able to get the same high-level "percent utilization of cpu limit" number by doing multiple queries and some arithmetic on them.
Describe a specific use case for the feature:
As hinted above, most metrics are lower-level ones, which severely limits what the current Metric threshold rule can be used for.
Here's another example use case:
This is a snippet from a document with metrics collected from a Kubernetes horizontal pod autoscaler with Metricbeat's `prometheus/collector` metricset. With the current capabilities, one can't compute a ratio, and thus the percentage of currently running replicas compared to the maximum possible (`kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas * 100`).
A third example:
We have the following AWS ALB metrics at hand: `HTTPCode_ELB_5XX_Count`, `HTTPCode_Target_5XX_Count` and `RequestCount`. The first two represent HTTP 5xx errors for an application, where one is reported by an AWS target group and the other by an AWS LB. The third represents the total HTTP requests processed by the same AWS LB. Frequently one wants to sum the first two and then compare that number to the total requests in order to get the percentage of 5xx errors across all traffic, so it's a similar formula to the first example: `(a + b) / c * 100`. That value can then be compared to the warning and/or alert % thresholds.
Sorry if all this has already been requested, but I couldn't find an existing issue for it.
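To make the arithmetic in the last two examples concrete, here is a small sketch of the two formulas and of the comparison against warning/alert thresholds. Everything in it is illustrative: `percentOf`, `classify`, the sample metric values and the threshold values are hypothetical, and it says nothing about how the rule itself would evaluate an equation.

```typescript
// Severity levels used only in this sketch.
type Severity = "ok" | "warning" | "alert";

// Computes numerator / denominator * 100.
// Returns null when the denominator is 0, since the ratio is undefined then.
function percentOf(numerator: number, denominator: number): number | null {
  if (denominator === 0) {
    return null;
  }
  return (numerator / denominator) * 100;
}

// Compares a percentage against hypothetical warning and alert thresholds.
function classify(value: number, warningPct: number, alertPct: number): Severity {
  if (value >= alertPct) return "alert";
  if (value >= warningPct) return "warning";
  return "ok";
}

// Hypothetical sample values, not taken from real metric documents.
// current_replicas / max_replicas * 100
const replicaPct = percentOf(3, 4);
// (HTTPCode_ELB_5XX_Count + HTTPCode_Target_5XX_Count) / RequestCount * 100
const errorPct = percentOf(2 + 6, 64);

console.log(replicaPct); // 75
console.log(errorPct); // 12.5
if (errorPct !== null) {
  console.log(classify(errorPct, 5, 10)); // "alert"
}
```

The explicit zero-denominator case matters for the ALB example in particular, because `RequestCount` can be 0 in a quiet time window.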