
Stack Monitoring rule types failing due to empty buckets #120111

Closed
Tracked by #127224
gmmorris opened this issue Dec 1, 2021 · 9 comments · Fixed by #131332
Assignees: miltonhultgren
Labels: bug, Feature:Alerting, Feature:Alerting/RuleTypes, Feature:Stack Monitoring, Team:Infra Monitoring UI - DEPRECATED

Comments

@gmmorris (Contributor) commented Dec 1, 2021

Kibana version: 7.15.0

Describe the bug:
I'm seeing a lot of Stack Monitoring rules failing on cloud with the following errors:

Cannot destructure property 'buckets' of '(intermediate value)(intermediate value)(intermediate value)' as it is undefined.

and a couple of variations on:

Cannot read property 'buckets' of undefined

Cannot read property '_source' of undefined

I'm assuming there is JS code that assumes aggregations are always returned by Elasticsearch for these queries, but when the data set is empty the aggregation is omitted from the response, and that code then throws a null-pointer exception.

I've seen this happen with:

  1. monitoring_alert_jvm_memory_usage
  2. monitoring_alert_disk_usage
  3. monitoring_ccr_read_exceptions
  4. monitoring_alert_thread_pool_search_rejections
  5. monitoring_shard_size
  6. monitoring_alert_thread_pool_write_rejections

So this is likely the same problem repeating across all of these (see the sketch below).
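To illustrate the failure mode, here is a minimal sketch with a hypothetical response shape (not the actual rule code), assuming Elasticsearch omits the aggregation and hit contents when the data set is empty:

const emptyResponse: any = { hits: { total: { value: 0 }, hits: [] } };

// Throws: Cannot destructure property 'buckets' of '(intermediate value)...' as it is undefined
const { buckets } = emptyResponse.aggregations?.clusters;

// Throws: Cannot read property '_source' of undefined
const source = emptyResponse.hits.hits[0]._source;

Each access throws as soon as the expected data is missing, which matches the variations listed above.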

Acceptance Criteria

  • Stack monitoring rules should handle the case when response data is missing, and should not let execution fall through and cause a null-pointer exception error (e.g. "Cannot read property 'buckets' of undefined") in the rule execution logs.

Notes:

@gmmorris added the bug label on Dec 1, 2021
@botelastic added the needs-team label on Dec 1, 2021
@gmmorris added the Feature:Alerting, Feature:Alerting/RuleTypes, Feature:Stack Monitoring, and Team:Infra Monitoring UI - DEPRECATED labels and removed the needs-team label on Dec 1, 2021
@elasticmachine (Contributor)

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@simianhacker (Member) commented Dec 2, 2021

I believe this is the query that's failing to return data:

POST .monitoring-es-6-*,*:.monitoring-es-6-*,.monitoring-es-7-*,*:.monitoring-es-7-*,metricbeat-*,*:metricbeat-*/_search
{
  "size": 1000,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "cluster_stats"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-2m"
            }
          }
        }
      ]
    }
  },
  "_source": ["cluster_uuid"], 
  "collapse": {
    "field": "cluster_uuid"
  }
}
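For context, when none of those indices contain matching documents, the search comes back with an empty hit list, roughly like this (trimmed; exact fields vary by version):

{
  "took": 1,
  "timed_out": false,
  "hits": {
    "total": { "value": 0, "relation": "eq" },
    "hits": []
  }
}

Any code that then reads hits.hits[0]._source without a guard runs into the "Cannot read property '_source' of undefined" error from the issue description.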

@simianhacker (Member)

I did some investigation and found one of the cloud instances that we (Elastic) own. I checked it out and confirmed it doesn't actually have any Stack Monitoring data in the cluster, but it does have the Stack Monitoring rules enabled.

I also noticed clusters that looked to be production clusters shipping their monitoring data to a separate cluster. The monitoring clusters don't show up in our errors, but the production clusters do. I suspect that somehow these alerts are being created on clusters that either had Stack Monitoring data at one point OR were created by accident (probably in Stack Management).

@jasonrhodes How do you want to handle this? Maybe we should add a check in the Stack Monitoring executor that throws a "No Data" error instead of throwing the exceptions above?

@simianhacker (Member)

Here are the steps to reproduce this:

  1. Start with a clean (no data) cluster.
  2. Go to the Stack Monitoring page; it should show the "No Data" screen
  3. Click on "Alerts & Rules" in the upper right-hand corner
  4. Click on "Create default rules"
  5. Go to "Rules and Connectors" under "Stack Management"
  6. There should be 6 rules with errors (see below)

[Screenshot: the six default Stack Monitoring rules showing errors in Rules and Connectors]

To fix this, we need to check that data is present before we start destructuring variables off the response object. Here is an example of where the error manifests in the disk usage library function:

const { buckets: clusterBuckets } = response.aggregations?.clusters;

If you look at the errors in the alerts and then check the corresponding library function under https://github.com/elastic/kibana/tree/main/x-pack/plugins/monitoring/server/lib/alerts, you should be able to find the line of code where it destructures the response object.

In my opinion, these alerts should just do nothing when the data is missing instead of throwing errors or firing "No Data" alerts.
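A minimal sketch of that approach (hypothetical helper for illustration, not the actual Kibana code): check for the aggregation before destructuring and return early, so the rule simply has nothing to evaluate for that execution instead of throwing:

// Hypothetical helper; assumes `aggregations` can be missing entirely when
// the monitoring indices contain no data.
function getClusterBuckets(response: any): any[] {
  if (!response.aggregations || !response.aggregations.clusters) {
    // Empty data set: nothing to alert on, so bail out quietly.
    return [];
  }
  const { buckets } = response.aggregations.clusters;
  return buckets;
}

The executor can then treat an empty array as "no data" and skip scheduling actions for that run.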

@weltenwort (Member)

The log threshold rule has a similar problem right now (#119777), and we're still debating whether and how to communicate that situation to the user.

@simianhacker (Member)

Here is a second way this could happen: #121129

@jasonrhodes (Member) commented Apr 19, 2022

I think we should fix the null pointer exception on this bug and then leave the rest of the decision (should we actually alert the user to this problem?) until later. Filling up the Kibana logs with these errors doesn't seem like a good solution, no matter what we want to do re: communicating to the user.

@miltonhultgren miltonhultgren self-assigned this May 2, 2022
miltonhultgren added a commit that referenced this issue May 12, 2022
…131332)

* [Stack Monitoring] Prevent exceptions in rule when no data present (#120111)

* Remove optional chaining

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@miltonhultgren (Contributor)

The PR I merged only makes the rule fail silently (with some grace), but we still need to communicate that state to the user (to either populate the cluster with Stack Monitoring data or disable/delete the rules).
Should we create a new issue for that?

Bamieh pushed a commit to Bamieh/kibana that referenced this issue May 16, 2022
…lastic#131332)

* [Stack Monitoring] Prevent exceptions in rule when no data present (elastic#120111)

* Remove optional chaining

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@simianhacker (Member)

I think there's already an issue for improving the "missing data" alert. Most of the other SM alerts have the same behavior: they do nothing if the data is missing.
