
Stack Monitoring rule types failing due to empty buckets #120111

Closed
Tracked by #127224
gmmorris opened this issue Dec 1, 2021 · 9 comments · Fixed by #131332
Assignees: miltonhultgren
Labels: bug, Feature:Alerting, Feature:Alerting/RuleTypes, Feature:Stack Monitoring, Team:Infra Monitoring UI - DEPRECATED

Comments

@gmmorris (Contributor) commented Dec 1, 2021

Kibana version: 7.15.0

Describe the bug:
I'm seeing a lot of Stack Monitoring rules failing on cloud with the following errors:

Cannot destructure property 'buckets' of '(intermediate value)(intermediate value)(intermediate value)' as it is undefined.

and a couple of variations on:

Cannot read property 'buckets' of undefined

Cannot read property '_source' of undefined

I'm assuming there is JS code that assumes aggregations are always returned by Elasticsearch for these queries, but when the data set is empty the aggregation is omitted from the response, and that code then throws a null-pointer exception.

I've seen this happen with:

  1. monitoring_alert_jvm_memory_usage
  2. monitoring_alert_disk_usage
  3. monitoring_ccr_read_exceptions
  4. monitoring_alert_thread_pool_search_rejections
  5. monitoring_shard_size
  6. monitoring_alert_thread_pool_write_rejections

So this is likely the same problem repeating across all of these (see the sketch below).
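To illustrate the failure mode, here is a minimal sketch with a hypothetical response shape (not the actual rule code), assuming Elasticsearch omits the aggregation and hit contents when the data set is empty:

const emptyResponse: any = { hits: { total: { value: 0 }, hits: [] } };

// Throws: Cannot destructure property 'buckets' of '(intermediate value)...' as it is undefined
const { buckets } = emptyResponse.aggregations?.clusters;

// Throws: Cannot read property '_source' of undefined
const source = emptyResponse.hits.hits[0]._source;

Each access throws as soon as the expected data is missing, which matches the variations listed above.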

Acceptance Criteria

  • Stack monitoring rules should handle the case when response data is missing, and should not let execution fall through and cause a null-pointer exception error (e.g. "Cannot read property 'buckets' of undefined") in the rule execution logs.

Notes:

@gmmorris added the bug label on Dec 1, 2021
@botelastic added the needs-team label on Dec 1, 2021
@gmmorris added the Feature:Alerting, Feature:Alerting/RuleTypes, Feature:Stack Monitoring, and Team:Infra Monitoring UI - DEPRECATED labels and removed the needs-team label on Dec 1, 2021
@elasticmachine (Contributor)

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@simianhacker (Member) commented Dec 2, 2021

I believe this is the query that's failing to return data:

POST .monitoring-es-6-*,*:.monitoring-es-6-*,.monitoring-es-7-*,*:.monitoring-es-7-*,metricbeat-*,*:metricbeat-*/_search
{
  "size": 1000,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "cluster_stats"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-2m"
            }
          }
        }
      ]
    }
  },
  "_source": ["cluster_uuid"], 
  "collapse": {
    "field": "cluster_uuid"
  }
}
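For context, when none of those indices contain matching documents, the search comes back with an empty hit list, roughly like this (trimmed; exact fields vary by version):

{
  "took": 1,
  "timed_out": false,
  "hits": {
    "total": { "value": 0, "relation": "eq" },
    "hits": []
  }
}

Any code that then reads hits.hits[0]._source without a guard runs into the "Cannot read property '_source' of undefined" error from the issue description.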

@simianhacker (Member)

I did some investigation and found one of the cloud instances that we (Elastic) own. I checked it out and confirmed it doesn't actually have any Stack Monitoring data in the cluster, but it does have the Stack Monitoring rules enabled.

I also noticed clusters that looked to be production clusters shipping their monitoring data to a separate cluster. The monitoring clusters don't show up in our errors, but the production clusters do. I suspect that somehow these alerts are being created on clusters that either had Stack Monitoring data at one point OR were created by accident (probably in Stack Management).

@jasonrhodes How do you want to handle this? Maybe we should add a check in the Stack Monitoring executor that throws a "No Data" error instead of throwing the exceptions above?

@simianhacker (Member)

Here are the steps to reproduce this:

  1. Start with a clean (no data) cluster.
  2. Go to the Stack Monitoring page; it should show the "No Data" screen
  3. Click on "Alerts & Rules" in the upper right-hand corner
  4. Click on "Create default rules"
  5. Go to "Rules and Connectors" under "Stack Management"
  6. There should be 6 rules with errors (see below)

[Screenshot: the six default Stack Monitoring rules showing errors in Rules and Connectors]

To fix this, we need to check that data is present before we start destructuring variables off the response object. Here is an example of where the error manifests in the disk usage library function:

const { buckets: clusterBuckets } = response.aggregations?.clusters;

If you look at the errors in the alerts and then check the corresponding library function under https://github.com/elastic/kibana/tree/main/x-pack/plugins/monitoring/server/lib/alerts, you should be able to find the line of code where it destructures the response object.

In my opinion, these alerts should just do nothing when the data is missing instead of throwing errors or firing "No Data" alerts.
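A minimal sketch of that approach (hypothetical helper for illustration, not the actual Kibana code): check for the aggregation before destructuring and return early, so the rule simply has nothing to evaluate for that execution instead of throwing:

// Hypothetical helper; assumes `aggregations` can be missing entirely when
// the monitoring indices contain no data.
function getClusterBuckets(response: any): any[] {
  if (!response.aggregations || !response.aggregations.clusters) {
    // Empty data set: nothing to alert on, so bail out quietly.
    return [];
  }
  const { buckets } = response.aggregations.clusters;
  return buckets;
}

The executor can then treat an empty array as "no data" and skip scheduling actions for that run.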

@weltenwort (Member)

The log threshold rule has a similar problem right now (#119777), and we're still debating whether and how to communicate that situation to the user.

@simianhacker (Member)

Here is a second way this could happen: #121129

@jasonrhodes (Member) commented Apr 19, 2022

I think we should fix the null pointer exception on this bug and then leave the rest of the decision (should we actually alert the user to this problem?) until later. Filling up the Kibana logs with these errors doesn't seem like a good solution, no matter what we want to do re: communicating to the user.

@miltonhultgren miltonhultgren self-assigned this May 2, 2022
miltonhultgren added a commit that referenced this issue May 12, 2022
…131332)

* [Stack Monitoring] Prevent exceptions in rule when no data present (#120111)

* Remove optional chaining

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@miltonhultgren (Contributor)

The PR I merged only makes the rule fail silently (with some grace), but we still need to communicate that state to the user (to either populate the cluster with Stack Monitoring data or disable/delete the rules).
Should we create a new issue for that?

Bamieh pushed a commit to Bamieh/kibana that referenced this issue May 16, 2022
…lastic#131332)

* [Stack Monitoring] Prevent exceptions in rule when no data present (elastic#120111)

* Remove optional chaining

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@simianhacker (Member)

I think there's already an issue for improving the "missing data" alert. Most of the other SM alerts have the same behavior: they do nothing if the data is missing.
