
[Alerting] Investigate recurring task to detect broken rules #122984

Open · ymao1 opened this issue Jan 13, 2022 · 2 comments
Labels: blocked, estimate:needs-research, Feature:Alerting/RulesFramework, Team:ResponseOps

Comments

ymao1 (Contributor) commented Jan 13, 2022

As part of the research into identifying when rules stop running continuously, we have proposed creating a recurring task to detect and possibly repair broken rules. This single task could check for a variety of scenarios (a rough sketch of the first two checks follows the list):

  • Ensure there is a task manager task for each enabled alerting rule. If an enabled rule has no task, create one.
  • Ensure there is no more than one task manager task for each alerting rule. If multiple tasks exist for a rule, keep only the valid task document.
  • Ensure that rules and connectors are decryptable with the current encryption key.
  • Ensure that nothing has broken AAD for rules & connectors (manual SO updates, for example).
  • Ensure that rule type parameter validation succeeds for each rule.
  • Ensure there is a valid API key for each enabled rule. If an enabled rule has no API key, create one (is this possible as a background task without user interaction?).
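
To make the first two checks more concrete, here is a very rough sketch in TypeScript. The `RuleFinder` and `TaskStore` interfaces are hypothetical stand-ins for the real alerting and task manager clients (not actual Kibana APIs), and the real implementation would presumably run inside a registered task manager task:

```typescript
// Rough sketch only: RuleFinder/TaskStore are hypothetical placeholders for the
// real alerting and task manager clients, not actual Kibana plugin APIs.

interface Rule {
  id: string;
  enabled: boolean;
  scheduledTaskId?: string;
}

interface TaskDoc {
  id: string;
  taskType: string;
  params: { alertId: string };
}

interface RuleFinder {
  findEnabledRules(): Promise<Rule[]>;
}

interface TaskStore {
  findTasksByRuleId(ruleId: string): Promise<TaskDoc[]>;
  scheduleRuleTask(rule: Rule): Promise<TaskDoc>;
  removeTask(taskId: string): Promise<void>;
}

interface DetectionResult {
  rulesMissingTask: string[];
  rulesWithDuplicateTasks: string[];
}

// One pass of the recurring "detect broken rules" task, covering the first two
// checks above: every enabled rule should have exactly one task manager task.
async function detectBrokenRuleTasks(
  rules: RuleFinder,
  tasks: TaskStore,
  repair: boolean
): Promise<DetectionResult> {
  const result: DetectionResult = { rulesMissingTask: [], rulesWithDuplicateTasks: [] };

  for (const rule of await rules.findEnabledRules()) {
    const ruleTasks = await tasks.findTasksByRuleId(rule.id);

    if (ruleTasks.length === 0) {
      result.rulesMissingTask.push(rule.id);
      if (repair) {
        await tasks.scheduleRuleTask(rule);
      }
    } else if (ruleTasks.length > 1) {
      result.rulesWithDuplicateTasks.push(rule.id);
      if (repair) {
        // Keep the task the rule points at (or the first one found) and remove the rest.
        const keep = ruleTasks.find((t) => t.id === rule.scheduledTaskId) ?? ruleTasks[0];
        for (const task of ruleTasks) {
          if (task.id !== keep.id) {
            await tasks.removeTask(task.id);
          }
        }
      }
    }
  }

  return result;
}
```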

With the output of these checks, we could explore the following:

  • Storing the output somewhere and adding telemetry on this data.
  • Merging with the existing alerting health framework status checks to provide an API for users to get this information.
  • Pushing the outputs to Stack Monitoring in order to leverage the O11y of Alerting efforts to get rule information into Stack Monitoring.
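
For illustration only, the task's output could be folded into something like the shape below (reusing the `DetectionResult` type from the sketch above). The field names are made up and do not reflect an existing health or telemetry schema:

```typescript
// Hypothetical payload that could be persisted (e.g. in task state), reported
// via telemetry, or merged into the existing alerting health response.
interface BrokenRulesHealth {
  timestamp: string;
  checks: {
    rulesMissingTask: number;
    rulesWithDuplicateTasks: number;
    rulesFailingDecryption: number;
    rulesWithBrokenAAD: number;
    rulesFailingParamValidation: number;
    rulesMissingApiKey: number;
  };
}

function toHealthPayload(result: DetectionResult): BrokenRulesHealth {
  return {
    timestamp: new Date().toISOString(),
    checks: {
      rulesMissingTask: result.rulesMissingTask.length,
      rulesWithDuplicateTasks: result.rulesWithDuplicateTasks.length,
      // The remaining checks are not covered by the sketch above.
      rulesFailingDecryption: 0,
      rulesWithBrokenAAD: 0,
      rulesFailingParamValidation: 0,
      rulesMissingApiKey: 0,
    },
  };
}
```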
ymao1 added the Team:ResponseOps, Feature:Alerting/RulesFramework, and estimate:needs-research labels on Jan 13, 2022
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote (Contributor) commented

@chrisronline, @XavierM: during @ymao1's research, we identified some scenarios where rules end up in a state where they stop running and we should notify the user. Some of the other bullet points relate to "self-healing" these issues.

We should discuss the proper way to detect these scenarios and notify our users before starting the implementation. For example: is a recurring task the right approach for detection?

Is something like this near term for the O11y of Alerting project?
