[Monitoring][Additional-Alerting] Shard Size #74820

Closed
igoristic opened this issue Aug 12, 2020 · 14 comments · Fixed by #89410

@igoristic
Contributor

igoristic commented Aug 12, 2020

Acceptance Criteria

  • Be able to define threshold value (in gigabytes)
  • Define the desired index pattern via Elastic's glob type standard defined here

Current "Next step" items

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@ravikesarwani
Contributor

Apply only to active shards (new data written in the last 7 days).

@igoristic
Contributor Author

@ravikesarwani cc: @chrisronline @jasonrhodes

I’m still trying to understand the UX flow and how we can provide useful feedback/notifications to the user.

I know we have discussed some of these points over zoom, but I think it would be helpful if we address some of these questions here...


- Do we give people the right levers for tuning this query?

I consider this a "per index" type of alert (whereas CPU, for example, is a "per node" type). However, each index pattern can have different tuning/ILM policies, which can result in different sharding/allocation behaviors. Because of this, I feel it's also important to define the index pattern(s) with their respective thresholds.


- What are the right defaults for the values in this alert?

One of the shards is more than 55gb

This fixed default doesn't really make sense (to me). Wouldn't it depend on the overall cluster size (and also the index's tuning/policies)? I don't know what the right number should be, but what makes sense to me is something dynamic/relative. So, for example (pseudocode): trigger the alert if an index's shard size is more than 10% of the overall cluster size, because their cluster size might be less than "55gb".
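To make the relative-threshold idea concrete, here is a minimal sketch. The function name, the shape of the input (a mapping from shard name to size in bytes), and the 10% default are all assumptions for illustration; nothing here reflects how the alert was actually implemented.

```python
GB = 1024 ** 3  # bytes per gigabyte (binary)


def oversized_shards_relative(shard_sizes, cluster_size_bytes, max_fraction=0.10):
    """Return the names of shards larger than max_fraction of the cluster size.

    shard_sizes: hypothetical mapping of shard name -> size in bytes.
    cluster_size_bytes: hypothetical total store size of the cluster.
    """
    if cluster_size_bytes <= 0:
        return []  # nothing sensible to compare against
    limit = cluster_size_bytes * max_fraction
    return sorted(name for name, size in shard_sizes.items() if size > limit)


# A 40 GB cluster: a fixed 55 GB threshold could never fire here,
# but a 10% relative threshold (4 GB) flags the 12 GB shard.
shards = {
    "logs-000001[0]": 12 * GB,
    "logs-000001[1]": 3 * GB,
    "metrics-000001[0]": 1 * GB,
}
print(oversized_shards_relative(shards, cluster_size_bytes=40 * GB))
```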

Should we still use the following threshold?

Detected too many shards per index

If so, what should be the default?


Also, here is a snippet from the doc (for additional context):
[Screenshot: Screen Shot 2021-01-13 at 11 24 32 AM]

@ravikesarwani
Copy link
Contributor

Do we give people the right levers for tuning this query?

From a UI perspective I would like control on 2 parameters by the user:

  • Shard size (default is > 55 GB)
  • Index pattern. The default is all indices (*), but the user can enter an optional index pattern to control which indices this alert applies to. The UI field should support wildcards, multiple entries, etc., as we have in other places.

What are the right defaults for the values in this alert?

Sharding strategy is a complex topic, with many different parameters affecting how many shards are okay and how large they should be.
I do not think we should try to boil the ocean with the alert in the first deliverable. What we are trying to provide is a general catch-all for scenarios where things are broken. The default recommendation from Elastic is to keep shards under 50 GB. This alert, with a default of > 55 GB, is trying to catch the exceptional scenarios where users may not have an ILM policy, or the ILM policy may be failing, etc.

As a further check, we talked about applying this alert only to active indices (new data written in the last 7 days). This makes sure we do not alert on large indices that are "cold/frozen", which may be larger than 55 GB and which users are okay with. We can provide a checkbox for the user to control this. It should be checked by default, but the user can uncheck it if they want. In that case the alert would apply to all indices matching the defined index pattern.
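A rough sketch of that rule, combining the fixed threshold with the "active only" checkbox. The record fields (`index`, `size_bytes`, `last_write`) are hypothetical; in particular, a per-index last-write timestamp is assumed to be available, which may not be the case in the monitoring data.

```python
from datetime import datetime, timedelta, timezone

GB = 1024 ** 3  # bytes per gigabyte (binary)


def shards_to_alert_on(shards, now, threshold_gb=55, only_active=True, active_days=7):
    """Return index names whose shard exceeds the threshold.

    When only_active is True (the proposed default), shards with no writes
    in the last active_days days are skipped.
    """
    cutoff = now - timedelta(days=active_days)
    flagged = []
    for shard in shards:
        if shard["size_bytes"] <= threshold_gb * GB:
            continue  # within the size limit
        if only_active and shard["last_write"] < cutoff:
            continue  # large but cold/frozen: do not alert
        flagged.append(shard["index"])
    return flagged


now = datetime(2021, 1, 15, tzinfo=timezone.utc)
sample = [
    {"index": "hot-logs", "size_bytes": 60 * GB, "last_write": now - timedelta(days=1)},
    {"index": "cold-logs", "size_bytes": 80 * GB, "last_write": now - timedelta(days=30)},
    {"index": "small-logs", "size_bytes": 10 * GB, "last_write": now},
]
print(shards_to_alert_on(sample, now))                     # checkbox checked
print(shards_to_alert_on(sample, now, only_active=False))  # checkbox unchecked
```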

@ravikesarwani
Contributor

@igoristic Thanks for raising these questions. Your points are valid.
Let me know if my response makes sense and/or if there are further questions.
I updated the doc with the above comments as well.

If you run into complexities implementing the above, let's raise them ASAP so that we can discuss and figure out alternative approaches. It would be good to get this out to customers, continue to get feedback, and improve based on the real-world scenarios and challenges they face.

@igoristic
Contributor Author

@ravikesarwani Thank you for addressing this!

So, to clarify: In the drawer UI we will only have the following inputs:

  • Our current default inputs like: Check every X, Notify every X, etc
  • range (date) - How far back we want to look
  • index-pattern (text) eg: .monitoring-*,metricbeat-*,data-*
  • threshold (number) X number of GB

And, we will only have 1x threshold and 1x index-pattern, right? So, they will not have the ability to set different thresholds per different index patterns? (but, maybe something to consider in the future)


As a further check we talked about applying this alert only to Active index (new data written in last 7 days). This makes sure that large indexes in the environment that are "cold/frozen"

We don't have this metric (as far as I'm aware of)

@chrisronline
Contributor

I wonder if we should support index pattern inclusion and exclusion. ES supports this (using a - sign in front of the index pattern), and I imagine some users might find it more useful to apply the alert to all but a single index. WDYT?
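As an illustration of inclusion/exclusion on a plain text field, here is a simplified sketch that resolves a comma-separated pattern against a list of known index names, treating a leading - as an exclusion applied to what has been selected so far. It uses Python's `fnmatch` for wildcard matching and is only an approximation of Elasticsearch's own resolution rules (for example, it does not treat hidden indices specially); `resolve_indices` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def resolve_indices(pattern, index_names):
    """Resolve e.g. "logs-*,-logs-b" against index_names, in order."""
    selected = []
    for part in pattern.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing comma
        if part.startswith("-"):
            # Exclusion: drop anything already selected that matches.
            selected = [n for n in selected if not fnmatchcase(n, part[1:])]
        else:
            # Inclusion: add matching names, keeping order and avoiding duplicates.
            for name in index_names:
                if fnmatchcase(name, part) and name not in selected:
                    selected.append(name)
    return selected


names = ["logs-a", "logs-b", "metricbeat-1", "data-2021"]
print(resolve_indices("*,-logs-b", names))          # all but a single index
print(resolve_indices("logs-*,metricbeat-*", names))  # multiple inclusions
```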

@ravikesarwani
Contributor

Our current default inputs like: Check every X, Notify every X, etc

Yes

range (date) - How far back we want to look

I didn't have this in my definition. I don't think "How far back we want to look" really applies for shard size.
We can say "Look at the average over last X minutes" to account for any fluctuations we may see in the shard size but I don't think this is a must in my viewpoint. WDYT?
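For reference, the "average over the last X minutes" variant could look roughly like the sketch below. The sample format (timestamp, size in GB) and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta


def average_size_over_window(samples, now, window_minutes):
    """Average the shard sizes sampled within the last window_minutes.

    samples: hypothetical list of (timestamp, size_gb) tuples.
    Returns None when no sample falls inside the window.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [size for ts, size in samples if cutoff <= ts <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


now = datetime(2021, 1, 15, 12, 0)
samples = [
    (now - timedelta(minutes=20), 50.0),  # outside a 5-minute window
    (now - timedelta(minutes=4), 54.0),
    (now - timedelta(minutes=1), 58.0),
]
avg = average_size_over_window(samples, now, window_minutes=5)
print(avg, avg > 55)  # compare the smoothed value against the threshold
```

Smoothing like this trades a short delay in firing for fewer alerts caused by brief spikes, which is why it is presented in the thread as optional rather than required.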

index-pattern (text) eg: .monitoring-*,metricbeat-*,data-*

threshold (number) X number of GB

Yes

And, we will only have 1x threshold and 1x index-pattern, right? So, they will not have the ability to set different thresholds per different index patterns? (but, maybe something to consider in the future)

Yes

I see the UI to look something like (other than the normal top and Actions section):

Notify when shard size is over "55 GB"
Apply to the following index pattern " "
I was wondering if it would be better for this UI to just show the index patterns already defined in the "Stack management->Index patterns" UI. The user could select one or more from the available list.
Providing a way for users to create a new index pattern here may be too much work and introduces a way for users to make mistakes. WDYT?

@ravikesarwani
Contributor

I wonder if we should support index pattern inclusion and exclusion. ES supports this (using a - sign in front of the index pattern) and I imagine some users might find it more useful to apply to all, but a single index. WDYT?

Supporting all the functionality around the creation of index patterns looks like a complex task. See doc.
That is why I was wondering whether, rather than supporting all of that functionality, we limit this UI to selecting already-created index patterns.
The user would create the index pattern in the "Stack management->Index patterns" UI. From the Edit Alert view they would just select one or more index patterns to apply.

@chrisronline
Contributor

I don't think we need to support the creation of index patterns, but we can support a simple text field to allow the user to specify them. I also don't mind if we source the list from known index patterns in Kibana. Either way, we can support inclusion and exclusion fairly easily if it's desirable from a product perspective.

@igoristic
Contributor Author

The doc starts out with...

"Kibana requires an index pattern to access the Elasticsearch data that you want to explore"

What if they don't want to "explore" the production data/cluster they're monitoring? These seem like different use cases, and we should keep them separate. I think providing an index pattern field with a default value of * (all indices) would suffice (for now) and would remove a lot of assumptions/complexity.

@ravikesarwani
Contributor

I was assuming that using the index patterns already created in stack management would be easier. I agree the main use described in the document is exploring the data in Kibana, but it serves our purpose very well too. That UI also has functionality like showing in real time which indices are selected as you type the filter, which is really great and removes lots of user errors.

If you think providing an edit box (include/exclude, wildcards, multiple entries ...), error checking (characters allowed/not allowed etc.) and related functionality is easier to develop, then go for it. Hopefully you will use some existing classes/functions and not try to reinvent the wheel here, to keep the code simple and the test cases contained.

cc: @jasonrhodes
Not sure Jason if you have some technical opinion on this one based on things you may have seen in Log and/or Metrics UI.

@chrisronline
Contributor

I imagine we'd do the same as Metrics does (as well as other observability solutions):

[Screenshot: Screen Shot 2021-01-16 at 4 17 35 PM]

[Screenshot: Screen Shot 2021-01-16 at 4 19 47 PM]

They don't have any validation (afaik) and are just a simple text field.

I'd imagine defaulting to * is a good idea and we can add exclusion in a later release, but I don't think the technical investment is anything more than small to add exclusions.

@ravikesarwani
Contributor

Let's use the text field and get this alert coded, tested and delivered.
