[Stack Monitoring] Alerts firing for default values #105659
Comments
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
@ravikesarwani do you have thoughts on these?
Yes, we need to handle the disk usage alert better for frozen nodes. A few options I see:
I would like to handle (2) first and then see if there is something more that needs to be done.
cc: @Leaf-Lin I am tagging you here as you opened 106
@neptunian The shard size alert for SM by default triggers when the average shard size is over 55GB. Most default policies roll over at 50GB. I don't understand the issue?
@ravikesarwani The issue is that rollover probably takes some time to trigger, so even though it's set at 50GB the index hasn't actually rolled over yet and often hits 55GB and triggers the alert. In their words:
I think I'm inclined to agree with you, but I'm not sure how much of an exception it is. @henrikno do you get these alerts often, or is it an exception? And I don't think they necessarily want us to change the default size, but rather to somehow coordinate it specifically for the APM indices.
A 50GB rollover is the default policy for many use cases including logs and metrics (not just APM), because as a general rule ES recommends keeping shards around 50GB. The default threshold of 55GB was chosen based on that general recommendation. This helps us catch issues when, say, a rollover is failing. I wonder if what we saw here was some issue with the ILM policy (for some time), which is really the whole purpose of the rule to alert on.
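For reference, a minimal sketch of an ILM policy with the ~50GB rollover discussed above, assuming a local unsecured cluster; the policy name and exact action set are illustrative, not the shipped defaults:

```ts
// Minimal sketch, assuming a local unsecured cluster: an ILM policy whose hot
// phase rolls over at ~50GB, mirroring the "keep shards around 50GB" guidance.
// Policy name and action set are illustrative, not the shipped defaults.
const policy = {
  policy: {
    phases: {
      hot: {
        actions: {
          rollover: {
            // Rollover size; the shard-size rule's default alert threshold is 55GB.
            max_primary_shard_size: "50gb",
            max_age: "30d",
          },
        },
      },
    },
  },
};

await fetch("http://localhost:9200/_ilm/policy/example-50gb-rollover", {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(policy),
});
```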
@ravikesarwani can you comment on (a) whether you want to pursue this still, and (b) what the priority is, roughly?
Looks like this alert is clearly not working with the default value (the condition is met if an index's average shard size is 55GB or higher in the last 5 minutes) and is generating false positives when using the default rollover value of 50GB. Elasticsearch internal operations can temporarily double the shard size, and this can cause the false positives. If this is normal ES operation then we should change the default value for this alert.

Looks like we have two levers to tweak: the default threshold value (currently 55GB) and the timeframe to look back over (currently 5 minutes by default). My take would be to increase both, but we need to find optimal values that will work most of the time without missing real issues.

@jakelandis can someone from ES confirm whether the shard size temporarily reaching double capacity is an okay scenario?

@SpencerLN if you change the alert configuration to 75GB and look back 10 minutes, does that stop the false positives? Based on the ES team's comment and some testing we need to find values that work out of the box and change our alert defaults appropriately. I am a little hesitant to change our default to, say, 110GB, since then we may miss rollover failures for a little while, but we may have to do that if we can't find a good default to use.
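As a rough manual approximation of the rule's condition (not the rule's actual query, which runs against Monitoring data over the lookback window), the sketch below computes the average primary shard size per index straight from `_cat/shards` and reports indices at or above a threshold; the 55GB value and the local cluster URL are assumptions:

```ts
// Rough manual approximation of the rule's condition, not its actual query:
// average primary shard size per index from _cat/shards, reported when it is
// at or above a threshold. 55 mimics the rule's current default.
const ES = "http://localhost:9200"; // assumed local, unsecured cluster
const THRESHOLD_GB = 55;

interface CatShard {
  index: string;
  prirep: string;       // "p" = primary, "r" = replica
  store: string | null; // bytes as a string when bytes=b is set
}

const shards: CatShard[] = await fetch(
  `${ES}/_cat/shards?format=json&bytes=b&h=index,prirep,store`
).then((r) => r.json());

const totals = new Map<string, { bytes: number; count: number }>();
for (const s of shards) {
  if (s.prirep !== "p" || s.store == null) continue; // primaries with a known size only
  const t = totals.get(s.index) ?? { bytes: 0, count: 0 };
  t.bytes += Number(s.store);
  t.count += 1;
  totals.set(s.index, t);
}

for (const [index, { bytes, count }] of totals) {
  const avgGb = bytes / count / 1024 ** 3;
  if (avgGb >= THRESHOLD_GB) {
    console.log(`${index}: average primary shard ~${avgGb.toFixed(1)} GB`);
  }
}
```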
@ravikesarwani unless I am missing something, I don't see a way to adjust the lookback period. I've adjusted the threshold value to 75GB and will leave it to run over the weekend to see how things look.
@SpencerLN I forgot that this option is missing in this rule. I had opened #111889 for this, which we still have to fix. @jasonrhodes @neptunian can we get #111889 fixed? In it I have suggested increasing the default to 15 minutes so that short spikes (caused by internal ES operations) don't trigger a false positive alert. I feel that by itself will take care of this issue in most normal cases.
That looks great; being able to adjust the time period sounds like it should resolve this, thank you!
Yes, it is possible that during a forcemerge the size can temporarily grow. I am not sure about the specific upper bound, but it is likely ~2x, and in reality likely much less (especially in 7.15 via elastic/elasticsearch#76221). Without testing and digging in pretty deep it is hard to recommend a specific value. I think 75GB will produce fewer false positives, but by the time a shard hits 75GB it might also be too late.
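To illustrate the forcemerge behavior described above, a minimal sketch (assuming a local unsecured cluster and a hypothetical index name) that starts a force merge and polls the index's primary store size; while segments are being rewritten the on-disk size can sit above the steady-state size, which is what can trip a threshold set too close to the rollover size:

```ts
// Minimal sketch, assuming a local unsecured cluster and a hypothetical index
// name: start a force merge down to one segment, then poll the index's primary
// store size while the merge runs.
const ES = "http://localhost:9200";
const index = "my-index"; // hypothetical

// The force-merge request blocks until the merge completes, so start it
// without awaiting and poll in the meantime.
const merge = fetch(`${ES}/${index}/_forcemerge?max_num_segments=1`, { method: "POST" });

for (let i = 0; i < 5; i++) {
  const stats = await fetch(`${ES}/${index}/_stats/store`).then((r) => r.json());
  const bytes = stats.indices[index].primaries.store.size_in_bytes;
  console.log(`primary store size: ${(bytes / 1024 ** 3).toFixed(2)} GB`);
  await new Promise((resolve) => setTimeout(resolve, 10_000));
}

await merge; // let the merge request itself finish
```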
This issue has been replaced by #111889
Perhaps we should have an option in the rule, checked by default, to exclude frozen tier nodes, or not query for them at all if possible. We should also think about other default values that are triggering alerts.
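A rough sketch of the exclusion idea (not the rule's actual query): read per-node filesystem stats and skip nodes whose roles include `data_frozen`, since frozen-tier nodes intentionally keep their disks close to full for the searchable-snapshot cache. The threshold and cluster URL here are assumptions:

```ts
// Rough sketch of the exclusion idea, not the rule's actual query: skip nodes
// whose roles include "data_frozen" when evaluating disk usage.
const ES = "http://localhost:9200"; // assumed local, unsecured cluster
const DISK_USED_THRESHOLD = 0.8;    // 80%, illustrative only

const stats = await fetch(`${ES}/_nodes/stats/fs`).then((r) => r.json());

for (const node of Object.values(stats.nodes) as any[]) {
  if (node.roles.includes("data_frozen")) continue; // the opt-out discussed above
  const { total_in_bytes, available_in_bytes } = node.fs.total;
  const used = 1 - available_in_bytes / total_in_bytes;
  if (used >= DISK_USED_THRESHOLD) {
    console.log(`${node.name}: disk ${(used * 100).toFixed(0)}% used`);
  }
}
```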
Summarizing the discussion and specific action items:
This is to avoid false positives during forcemerge where the size of the shard can grow to 2x in some scenarios.