[Stack Monitoring] Alerts firing for default values #105659

Closed · neptunian opened this issue Jul 14, 2021 · 15 comments
Labels: bug, Feature:Stack Monitoring, SM alerting improvements, Team:Infra Monitoring UI

Comments

@neptunian
Contributor

neptunian commented Jul 14, 2021

  • The default value of the xpack.searchable.snapshot.shared_cache.size Elasticsearch setting is 90% (ref), while the default disk usage threshold for Elasticsearch nodes in Kibana Stack Monitoring is 80%. This leads to false alerts because the frozen cache is allocated upfront (frozen nodes are expected to sit at 90% disk utilisation with default settings).
  • From APM: I noticed we got alerts for large shard size for our APM indices. They are using the default ILM policy rollover at 50GB, but the shard is at 55GB because rollover takes some time to trigger. Is it possible to coordinate the values so that the APM default doesn't trigger the alert?

Perhaps we should have an option in the rule, checked by default, to not include frozen tier nodes, or to not query for them at all if possible. We should also think about other default values that are triggering alerts.


Summarizing the discussion and the specific action items:

  • Fix #111889 with a 15-minute default
  • Change the default value for the "Large shard size" rule from 55gb to 75gb

This is to avoid false positives during a forcemerge, where the size of a shard can grow up to 2x in some scenarios.
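
As a rough sketch of what these proposed defaults do in practice: the parameter names and the averaging logic below are illustrative assumptions, not the actual Kibana rule implementation; the point is that a 15-minute window dilutes a short forcemerge spike that a 5-minute window would alert on.

```ts
// Illustrative only: parameter names and the averaging logic are assumptions,
// not the actual Kibana rule implementation or its saved-object schema.
interface LargeShardSizeParams {
  thresholdGb: number;     // proposed default: 75 (currently 55)
  lookbackMinutes: number; // proposed default: 15 (currently 5), tracked in #111889
}

const proposedDefaults: LargeShardSizeParams = { thresholdGb: 75, lookbackMinutes: 15 };

// `samples` is assumed to hold the average shard size (in GB) for each monitoring
// document inside the lookback window; averaging the whole window means a short
// forcemerge spike is diluted by the normal samples around it.
function shouldAlert(samples: number[], params: LargeShardSizeParams): boolean {
  if (samples.length === 0) return false;
  const avg = samples.reduce((sum, s) => sum + s, 0) / samples.length;
  return avg >= params.thresholdGb;
}

// Example: a ~50GB shard that briefly doubles to ~100GB during a forcemerge.
console.log(shouldAlert([100, 100], proposedDefaults));                  // true  – a short window only sees the merge spike
console.log(shouldAlert([51, 51, 51, 51, 100, 100], proposedDefaults));  // false – a longer window dilutes the spike
```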

@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@jasonrhodes
Member

@ravikesarwani do you have thoughts on these?

@ravikesarwani
Contributor

Yes, we need to handle the disk usage alert better for frozen nodes.

A few options I see:

  1. At a high level, we can check whether a data node has the data_frozen role and skip that node for the disk usage alert (see the sketch at the end of this comment). But a node can belong to multiple tiers, so this may miss real alerts in some cases. We can extend this concept further and create a separate rule that applies only to frozen nodes, while the existing one covers the rest of the nodes. That way users can tweak the values separately.
  2. We have [Monitoring] Add Filter in stack monitoring rules #96800, which we are planning to tackle in 7.16. This gives users a way to exclude certain nodes from the alert and also to create new rules that target specific nodes. I think this is the better first step for this issue.

I would like to handle (2) first and then see if there is something more that needs to be done.
With our 7.15 changes that allow creating new rules of the Stack Monitoring rule types, combined with filters for node and cluster, we will have a robust solution that handles many different use cases.
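
As a rough illustration of the role-based skip in option (1): the real Stack Monitoring rule evaluates disk usage from the .monitoring-es-* indices rather than the live node stats API, so this is only a sketch of the idea, and the client setup and threshold handling are assumptions.

```ts
// Sketch only: not how the Stack Monitoring rule is implemented.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });
const DISK_USAGE_THRESHOLD = 0.8; // the rule's current 80% default

async function nodesOverDiskThreshold(): Promise<string[]> {
  const stats = await es.nodes.stats({ metric: 'fs' }); // v8 client returns the body directly
  const offenders: string[] = [];

  for (const [id, node] of Object.entries(stats.nodes as Record<string, any>)) {
    // Frozen-tier nodes pre-allocate the shared cache (~90% of disk by default),
    // so judging them against an 80% threshold always fires; skip them here.
    if (node.roles?.includes('data_frozen')) continue;

    const total = node.fs?.total?.total_in_bytes ?? 0;
    const available = node.fs?.total?.available_in_bytes ?? 0;
    if (total > 0 && (total - available) / total > DISK_USAGE_THRESHOLD) {
      offenders.push(node.name ?? id);
    }
  }
  return offenders;
}
```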

@ravikesarwani
Contributor

ravikesarwani commented Aug 11, 2021

cc: @Leaf-Lin I am tagging you here as you opened 106

@ravikesarwani
Contributor

@neptunian The shard size alert in Stack Monitoring triggers by default when the average shard size is over 55GB. Most default policies roll over at 50GB. I don't quite understand the issue.
Is it that most rollovers can be delayed enough to cross the 55GB mark most of the time, or was what you saw more of an exception? I think the alert did its job: it alerted because the shard grew to 55GB, and our recommendation is to keep shards around 50GB.
The values can be tweaked by the user if, in a certain environment, they would like them to be higher.
At this point I am not sure I have enough evidence to change the default value for the shard size alert.

@neptunian
Contributor Author

neptunian commented Aug 19, 2021

@ravikesarwani The issue is that rollover takes some time to trigger, so even though it's set at 50GB the index hasn't actually rolled over yet and often hits 55GB, which fires the alert. In their words:

I noticed we got alerts for large shard size for our APM indices. They are using the default ILM policy rollover at 50GB, but it's 55GB because rollover takes some time to trigger. Is it possible to coordinate the values so that the APM default doesn't trigger the alert?

I'm inclined to agree with you, but I'm not sure how much of an exception it is. @henrikno, is it common or an exception that you get these alerts? And I don't think they necessarily want us to change the default size, but rather to coordinate it specifically for the APM indices.

@ravikesarwani
Contributor

A 50gb rollover is the default policy for many use cases, including logs and metrics (not just APM), because as a general rule Elasticsearch recommends keeping shards around 50gb. The default threshold of 55gb was chosen based on that general recommendation.

This helps us catch issues when, say, a rollover is failing. I wonder if what we saw here was an issue with the ILM policy (for some time), which is exactly the kind of thing the rule is meant to alert on.
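
For context, a hot-phase policy of the kind described here (rollover around 50GB, followed by a forcemerge) might look roughly like the sketch below, assuming the @elastic/elasticsearch JS client; the policy name is hypothetical and the values mirror the defaults discussed in this thread rather than the exact built-in managed policies.

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical policy name; values echo the 50gb rollover discussed above.
async function createDemoPolicy(): Promise<void> {
  await es.transport.request({
    method: 'PUT',
    path: '/_ilm/policy/logs-50gb-rollover-demo',
    body: {
      policy: {
        phases: {
          hot: {
            actions: {
              rollover: { max_size: '50gb', max_age: '30d' },
              // Forcemerge runs after rollover and is the step that can
              // temporarily inflate the shard's reported store size.
              forcemerge: { max_num_segments: 1 },
            },
          },
        },
      },
    },
  });
}
```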

@jasonrhodes
Member

@ravikesarwani can you comment on (a) whether you want to pursue this still, and (b) what the priority is, roughly?

@SpencerLN

We have been seeing this alert trigger frequently, but there seems to be no issue when we look at the index stats. We tracked this down to what I would consider a false positive when using ILM with a default configuration and the forcemerge action. The issue is that during a forcemerge the storage size of an index can increase significantly (up to double), causing the math of total storage / number of shards to show a size over the threshold.

For example, here is a chart for an index that we received an alert for earlier today. It has 6 total shards, 3 primary shards, and 3 replicas:
[chart: index_stats.total.store.size_in_bytes for the index over the day]

For most of the day you can see that index_stats.total.store.size_in_bytes was 325,911,976,659 (325,911,976,659 / 1024 / 1024 / 1024 / 6 = 50.58GB/shard), but then during the merge it reached 582,306,772,049 (582,306,772,049 / 1024 / 1024 / 1024 / 6 = 90.38GB/shard), triggering the alert.
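
For reference, the per-shard numbers above are just the total store size in bytes divided by the total shard count; a quick snippet to reproduce them:

```ts
// Reproduces the per-shard math above: bytes -> GB, divided across all 6 shards.
const TOTAL_SHARDS = 6; // 3 primaries + 3 replicas

const gbPerShard = (storeSizeInBytes: number): number =>
  storeSizeInBytes / 1024 / 1024 / 1024 / TOTAL_SHARDS;

console.log(gbPerShard(325_911_976_659).toFixed(1)); // "50.6" – just over the 50GB rollover target
console.log(gbPerShard(582_306_772_049).toFixed(1)); // "90.4" – the transient forcemerge peak that fired the alert
```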

I would expect that the default configuration for this alert should not trigger during routine cluster operations (ILM, etc.), especially when using it with other Elastic products (APM, Endgame, etc.) with default settings.

@ravikesarwani
Contributor

It looks like this alert is clearly not working with the default value (the condition is met if an index's average shard size is 55gb or higher in the last 5 minutes) and is generating false positives when the default rollover value of 50GB is used. Elasticsearch internal operations can temporarily double the shard size, and this can cause the false positives. If this is normal Elasticsearch behaviour, then we should change the default value for this alert.

It looks like we have two levers to tweak: the default threshold value (currently 55gb) and how far back to look (currently a 5-minute default). My take would be to increase both, but we need to find optimal values that will work most of the time without missing real issues.

@jakelandis can someone from ES confirm whether a shard temporarily reaching double its size is an expected scenario? If so, is there a suggested value to use as the default?

@SpencerLN If you change the alert configuration to 75gb with a 10-minute lookback, does that stop the false positives?

Based on the ES team's comment and some testing, we need to find values that work out of the box and change our alert defaults appropriately. I am a little hesitant to change our default to, say, 110gb, since then we may miss rollover failures for a while, but we may have to do that if we can't find a good default.
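
One way to picture the lookback lever: average the store size over the whole window instead of taking the latest sample. The sketch below is not the actual rule query; apart from index_stats.total.store.size_in_bytes (quoted earlier in this thread), the field names, the shard-count parameter, and the query shape are assumptions.

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Rough sketch, not the actual Stack Monitoring rule query: averaging over the
// lookback window means a short forcemerge spike is diluted by normal samples.
async function avgShardSizeGb(indexName: string, totalShards: number, lookback = '15m'): Promise<number> {
  const resp = await es.search({
    index: '.monitoring-es-*',
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'index_stats.index': indexName } },        // assumed field for the index name
          { range: { timestamp: { gte: `now-${lookback}` } } },
        ],
      },
    },
    aggs: {
      avg_store_bytes: { avg: { field: 'index_stats.total.store.size_in_bytes' } },
    },
  });

  const agg = resp.aggregations?.avg_store_bytes as { value: number | null } | undefined;
  return (agg?.value ?? 0) / 1024 / 1024 / 1024 / totalShards;
}
```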

@SpencerLN

@ravikesarwani unless I am missing something, I don't see a way to adjust the lookback period:
[screenshot: large shard size rule configuration (no lookback option shown)]

I've adjusted the threshold value to 75GB and will leave it to run over the weekend and see how things look.

@ravikesarwani
Contributor

@SpencerLN I forgot that this option is missing from this rule. I had opened #111889 for it, which we still have to fix. @jasonrhodes @neptunian, can we get #111889 fixed? In it I suggested increasing the default to 15 minutes so that short spikes (from internal ES operations) don't cause a false positive alert. I feel that alone will take care of this issue in most normal cases.

@SpencerLN

That looks great; being able to adjust the time period sounds like it should resolve this. Thank you!

@jakelandis
Contributor

@jakelandis can someone from ES confirm whether a shard temporarily reaching double its size is an expected scenario? If so, is there a suggested value to use as the default?

Yes, it is possible that the size can temporarily grow during a forcemerge. I am not sure about the specific upper bound, but it is likely ~2x, and in reality probably much less (especially in 7.15, via elastic/elasticsearch#76221). Without testing and digging in pretty deep, it is hard to recommend a specific value. I think 75GB will produce fewer false positives, but by the time a shard hits 75GB it might also be too late.
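
To make that tradeoff concrete (numbers assumed from the defaults discussed in this thread, not measured):

```ts
// A shard rolled over at ~50GB can briefly report up to ~2x its size during a forcemerge.
const rolloverTargetGb = 50;
const forcemergePeakGb = rolloverTargetGb * 2; // ~100GB, transient

console.log(forcemergePeakGb > 75);  // true  – a 75GB threshold with a short lookback can still fire mid-merge
console.log(forcemergePeakGb > 110); // false – ~110GB avoids merge spikes, but a failed rollover
                                     //         is then only noticed much later
```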

@jasonrhodes
Member

This issue has been replaced by #111889
