[Stack Monitoring] Alerts firing for default values #105659

Closed · neptunian opened this issue Jul 14, 2021 · 15 comments
Labels: bug, Feature:Stack Monitoring, SM alerting improvements, Team:Infra Monitoring UI

Comments

@neptunian
Contributor

neptunian commented Jul 14, 2021

  • The default value of the xpack.searchable.snapshot.shared_cache.size Elasticsearch setting is 90% (ref), while the default disk usage threshold for Elasticsearch nodes in Kibana Stack Monitoring is 80%. This leads to false alerts because the frozen cache is allocated upfront (frozen nodes are expected to sit at 90% disk utilisation with default settings).
  • From APM: I noticed we got alerts for large shard size for our APM indices. They are using the default ILM policy rollover at 50GB, but the shard is at 55GB because rollover takes some time to trigger. Is it possible to coordinate the values so that the APM default doesn't trigger the alert?

Perhaps we should have an option in the rule, checked by default, to not include frozen tier nodes, or to not query for them at all if possible. We should also think about other default values that are triggering alerts.


Summarizing the discussion and the specific action items:

  • Fix #111889 with a 15-minute default
  • Change the default value for the "Large shard size" rule from 55gb to 75gb

This is to avoid false positives during a forcemerge, where the size of a shard can grow up to 2x in some scenarios.
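
As a rough sketch of what these proposed defaults do in practice: the parameter names and the averaging logic below are illustrative assumptions, not the actual Kibana rule implementation; the point is that a 15-minute window dilutes a short forcemerge spike that a 5-minute window would alert on.

```ts
// Illustrative only: parameter names and the averaging logic are assumptions,
// not the actual Kibana rule implementation or its saved-object schema.
interface LargeShardSizeParams {
  thresholdGb: number;     // proposed default: 75 (currently 55)
  lookbackMinutes: number; // proposed default: 15 (currently 5), tracked in #111889
}

const proposedDefaults: LargeShardSizeParams = { thresholdGb: 75, lookbackMinutes: 15 };

// `samples` is assumed to hold the average shard size (in GB) for each monitoring
// document inside the lookback window; averaging the whole window means a short
// forcemerge spike is diluted by the normal samples around it.
function shouldAlert(samples: number[], params: LargeShardSizeParams): boolean {
  if (samples.length === 0) return false;
  const avg = samples.reduce((sum, s) => sum + s, 0) / samples.length;
  return avg >= params.thresholdGb;
}

// Example: a ~50GB shard that briefly doubles to ~100GB during a forcemerge.
console.log(shouldAlert([100, 100], proposedDefaults));                  // true  – a short window only sees the merge spike
console.log(shouldAlert([51, 51, 51, 51, 100, 100], proposedDefaults));  // false – a longer window dilutes the spike
```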

@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@jasonrhodes
Member

@ravikesarwani do you have thoughts on these?

@ravikesarwani
Contributor

Yes, we need to handle the disk usage alert better for frozen nodes.

A few options I see:

  1. At a high level, we can check whether a data node has the data_frozen role and skip that node for the disk usage alert (see the sketch at the end of this comment). But a node can belong to multiple tiers, so this may miss real alerts in some cases. We can extend this concept further and create a separate rule that applies only to frozen nodes, while the existing one covers the rest of the nodes. That way users can tweak the values separately.
  2. We have [Monitoring] Add Filter in stack monitoring rules #96800, which we are planning to tackle in 7.16. This gives users a way to exclude certain nodes from the alert and also to create new rules that target specific nodes. I think this is the better first step for this issue.

I would like to handle (2) first and then see if there is something more that needs to be done.
With our 7.15 changes that allow creating new rules of the Stack Monitoring rule types, combined with filters for node and cluster, we will have a robust solution that handles many different use cases.
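
As a rough illustration of the role-based skip in option (1): the real Stack Monitoring rule evaluates disk usage from the .monitoring-es-* indices rather than the live node stats API, so this is only a sketch of the idea, and the client setup and threshold handling are assumptions.

```ts
// Sketch only: not how the Stack Monitoring rule is implemented.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });
const DISK_USAGE_THRESHOLD = 0.8; // the rule's current 80% default

async function nodesOverDiskThreshold(): Promise<string[]> {
  const stats = await es.nodes.stats({ metric: 'fs' }); // v8 client returns the body directly
  const offenders: string[] = [];

  for (const [id, node] of Object.entries(stats.nodes as Record<string, any>)) {
    // Frozen-tier nodes pre-allocate the shared cache (~90% of disk by default),
    // so judging them against an 80% threshold always fires; skip them here.
    if (node.roles?.includes('data_frozen')) continue;

    const total = node.fs?.total?.total_in_bytes ?? 0;
    const available = node.fs?.total?.available_in_bytes ?? 0;
    if (total > 0 && (total - available) / total > DISK_USAGE_THRESHOLD) {
      offenders.push(node.name ?? id);
    }
  }
  return offenders;
}
```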

@ravikesarwani
Contributor

ravikesarwani commented Aug 11, 2021

cc: @Leaf-Lin I am tagging you here as you opened 106

@ravikesarwani
Contributor

@neptunian The shard size alert in Stack Monitoring triggers by default when the average shard size is over 55GB. Most default policies roll over at 50GB. I don't quite understand the issue.
Is it that most rollovers can be delayed enough to cross the 55GB mark most of the time, or was what you saw more of an exception? I think the alert did its job: it alerted because the shard grew to 55GB, and our recommendation is to keep shards around 50GB.
The values can be tweaked by the user if, in a certain environment, they would like them to be higher.
At this point I am not sure I have enough evidence to change the default value for the shard size alert.

@neptunian
Contributor Author

neptunian commented Aug 19, 2021

@ravikesarwani The issue is that rollover takes some time to trigger, so even though it's set at 50GB the index hasn't actually rolled over yet and often hits 55GB, which fires the alert. In their words:

I noticed we got alerts for large shard size for our APM indices. They are using the default ILM policy rollover at 50GB, but it's 55GB because rollover takes some time to trigger. Is it possible to coordinate the values so that the APM default doesn't trigger the alert?

I'm inclined to agree with you, but I'm not sure how much of an exception it is. @henrikno, is it common or an exception that you get these alerts? And I don't think they necessarily want us to change the default size, but rather to coordinate it specifically for the APM indices.

@ravikesarwani
Contributor

A 50gb rollover is the default policy for many use cases, including logs and metrics (not just APM), because as a general rule Elasticsearch recommends keeping shards around 50gb. The default threshold of 55gb was chosen based on that general recommendation.

This helps us catch issues when, say, a rollover is failing. I wonder if what we saw here was an issue with the ILM policy (for some time), which is exactly the kind of thing the rule is meant to alert on.
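
For context, a hot-phase policy of the kind described here (rollover around 50GB, followed by a forcemerge) might look roughly like the sketch below, assuming the @elastic/elasticsearch JS client; the policy name is hypothetical and the values mirror the defaults discussed in this thread rather than the exact built-in managed policies.

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical policy name; values echo the 50gb rollover discussed above.
async function createDemoPolicy(): Promise<void> {
  await es.transport.request({
    method: 'PUT',
    path: '/_ilm/policy/logs-50gb-rollover-demo',
    body: {
      policy: {
        phases: {
          hot: {
            actions: {
              rollover: { max_size: '50gb', max_age: '30d' },
              // Forcemerge runs after rollover and is the step that can
              // temporarily inflate the shard's reported store size.
              forcemerge: { max_num_segments: 1 },
            },
          },
        },
      },
    },
  });
}
```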

@jasonrhodes
Member

@ravikesarwani can you comment on (a) whether you want to pursue this still, and (b) what the priority is, roughly?

@SpencerLN

We have been seeing this alert trigger frequently, but there seems to be no issue when we look at the index stats. We tracked this down to what I would consider a false positive when using ILM with a default configuration and the forcemerge action. The issue is that during a forcemerge the storage size of an index can increase significantly (up to double), causing the math of total storage / number of shards to show a size over the threshold.

For example, here is a chart for an index that we received an alert for earlier today. It has 6 total shards, 3 primary shards, and 3 replicas:
[chart: index_stats.total.store.size_in_bytes for the index over the day]

For most of the day you can see that index_stats.total.store.size_in_bytes was 325,911,976,659 (325,911,976,659 / 1024 / 1024 / 1024 / 6 = 50.58GB/shard), but then during the merge it reached 582,306,772,049 (582,306,772,049 / 1024 / 1024 / 1024 / 6 = 90.38GB/shard), triggering the alert.
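
For reference, the per-shard numbers above are just the total store size in bytes divided by the total shard count; a quick snippet to reproduce them:

```ts
// Reproduces the per-shard math above: bytes -> GB, divided across all 6 shards.
const TOTAL_SHARDS = 6; // 3 primaries + 3 replicas

const gbPerShard = (storeSizeInBytes: number): number =>
  storeSizeInBytes / 1024 / 1024 / 1024 / TOTAL_SHARDS;

console.log(gbPerShard(325_911_976_659).toFixed(1)); // "50.6" – just over the 50GB rollover target
console.log(gbPerShard(582_306_772_049).toFixed(1)); // "90.4" – the transient forcemerge peak that fired the alert
```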

I would expect that the default configuration for this alert should not trigger during routine cluster operations (ILM, etc.), especially when using it with other Elastic products (APM, Endgame, etc.) with default settings.

@ravikesarwani
Contributor

It looks like this alert is clearly not working with the default value (the condition is met if an index's average shard size is 55gb or higher in the last 5 minutes) and is generating false positives when the default rollover value of 50GB is used. Elasticsearch internal operations can temporarily double the shard size, and this can cause the false positives. If this is normal Elasticsearch behaviour, then we should change the default value for this alert.

It looks like we have two levers to tweak: the default threshold value (currently 55gb) and how far back to look (currently a 5-minute default). My take would be to increase both, but we need to find optimal values that will work most of the time without missing real issues.

@jakelandis can someone from ES confirm whether a shard temporarily reaching double its size is an expected scenario? If so, is there a suggested value to use as the default?

@SpencerLN If you change the alert configuration to 75gb with a 10-minute lookback, does that stop the false positives?

Based on the ES team's comment and some testing, we need to find values that work out of the box and change our alert defaults appropriately. I am a little hesitant to change our default to, say, 110gb, since then we may miss rollover failures for a while, but we may have to do that if we can't find a good default.
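
One way to picture the lookback lever: average the store size over the whole window instead of taking the latest sample. The sketch below is not the actual rule query; apart from index_stats.total.store.size_in_bytes (quoted earlier in this thread), the field names, the shard-count parameter, and the query shape are assumptions.

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Rough sketch, not the actual Stack Monitoring rule query: averaging over the
// lookback window means a short forcemerge spike is diluted by normal samples.
async function avgShardSizeGb(indexName: string, totalShards: number, lookback = '15m'): Promise<number> {
  const resp = await es.search({
    index: '.monitoring-es-*',
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'index_stats.index': indexName } },        // assumed field for the index name
          { range: { timestamp: { gte: `now-${lookback}` } } },
        ],
      },
    },
    aggs: {
      avg_store_bytes: { avg: { field: 'index_stats.total.store.size_in_bytes' } },
    },
  });

  const agg = resp.aggregations?.avg_store_bytes as { value: number | null } | undefined;
  return (agg?.value ?? 0) / 1024 / 1024 / 1024 / totalShards;
}
```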

@SpencerLN

@ravikesarwani unless I am missing something, I don't see a way to adjust the lookback period:
[screenshot: large shard size rule configuration (no lookback option shown)]

I've adjusted the threshold value to 75GB and will leave it to run over the weekend and see how things look.

@ravikesarwani
Contributor

@SpencerLN I forgot that this option is missing from this rule. I had opened #111889 for it, which we still have to fix. @jasonrhodes @neptunian, can we get #111889 fixed? In it I suggested increasing the default to 15 minutes so that short spikes (from internal ES operations) don't cause a false positive alert. I feel that alone will take care of this issue in most normal cases.

@SpencerLN

That looks great; being able to adjust the time period sounds like it should resolve this. Thank you!

@jakelandis
Contributor

@jakelandis can someone from ES confirm whether a shard temporarily reaching double its size is an expected scenario? If so, is there a suggested value to use as the default?

Yes, it is possible that the size can temporarily grow during a forcemerge. I am not sure about the specific upper bound, but it is likely ~2x, and in reality probably much less (especially in 7.15, via elastic/elasticsearch#76221). Without testing and digging in pretty deep, it is hard to recommend a specific value. I think 75GB will produce fewer false positives, but by the time a shard hits 75GB it might also be too late.
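
To make that tradeoff concrete (numbers assumed from the defaults discussed in this thread, not measured):

```ts
// A shard rolled over at ~50GB can briefly report up to ~2x its size during a forcemerge.
const rolloverTargetGb = 50;
const forcemergePeakGb = rolloverTargetGb * 2; // ~100GB, transient

console.log(forcemergePeakGb > 75);  // true  – a 75GB threshold with a short lookback can still fire mid-merge
console.log(forcemergePeakGb > 110); // false – ~110GB avoids merge spikes, but a failed rollover
                                     //         is then only noticed much later
```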

@jasonrhodes
Member

This issue has been replaced by #111889
