[Monitoring][Additional-Alerting] Shard Size #74820

igoristic · 2020-08-12T04:25:13Z

Acceptance Criteria

Be able to define threshold value (in gigabytes)
Define the desired index pattern via Elastic's glob type standard defined here

Current "Next step" items

Link the index to its advanced (and regular) metrics: .../elasticsearch/indices/${index}/advanced (within the SM app)
Doc link to https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html
Blog post to https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster


elasticmachine · 2020-08-12T04:25:15Z

Pinging @elastic/stack-monitoring (Team:Monitoring)

ravikesarwani · 2020-12-21T21:13:49Z

Apply only on the active shards - New data written in the last 7 days

igoristic · 2021-01-13T17:07:29Z

@ravikesarwani cc: @chrisronline @jasonrhodes

I’m still trying to understand the UX flow and how we can provide useful feedback/notifications to the user.

I know we have discussed some of these points over zoom, but I think it would be helpful if we address some of these questions here...

- Do we give people the right levers for tuning this query?

I consider this alert a "per index" type of metric (whereas CPU, for example, is a "per node" type). However, each index pattern can have different tuning/ILM policies, which can result in different sharding/allocation behaviors. Because of this, I feel it's also important to define the index pattern(s) with their respective thresholds.

- What are the right defaults for the values in this alert?

One of the shards is more than 55gb

This fixed default doesn't really make sense (to me). Wouldn't it depend on the overall cluster size (and also the index's tuning/policies)? I don't know what the right number should be, but what makes sense to me is something dynamic/relative. So, for example (pseudo): trigger the alert if an index's shard size is more than 10% of the overall cluster size, because their cluster size might be less than "55gb".
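To make the fixed-versus-relative comparison concrete, here is a minimal sketch. The `ShardStat` shape and both function names are hypothetical (they are not the Stack Monitoring data model), and "overall cluster size" is approximated as the sum of the shard sizes passed in.

```typescript
// Hypothetical shape for illustration only.
interface ShardStat {
  index: string;
  sizeBytes: number;
}

const GIB = 1024 * 1024 * 1024;

// Fixed rule: flag any shard larger than an absolute size in GB.
function overFixedThreshold(shards: ShardStat[], thresholdGb: number): ShardStat[] {
  return shards.filter((s) => s.sizeBytes > thresholdGb * GIB);
}

// Relative rule: flag any shard larger than a fraction of the summed shard size.
// Assumption: the sum of all shards stands in for "overall cluster size".
function overRelativeThreshold(shards: ShardStat[], fraction: number): ShardStat[] {
  const total = shards.reduce((sum, s) => sum + s.sizeBytes, 0);
  return shards.filter((s) => s.sizeBytes > total * fraction);
}
```

One thing the sketch makes visible: on a small cluster with only a few shards, a low fraction such as 10% would flag almost every shard, so a relative rule would probably need a floor as well.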

Should we still use the following threshold?

Detected too many shards per index

If so, what should be the default?

Also, here is a snippet from the doc (for additional context):

ravikesarwani · 2021-01-13T18:38:08Z

Do we give people the right levers for tuning this query?

From a UI perspective I would like the user to control 2 parameters:

Shard size (default is > 55 GB)
Index pattern. The default is all indices (*), but the user can enter an optional index pattern to control which indices this alert applies to. The UI field should support wildcards, multiple entries, etc., as we have in other places.

What are the right defaults for the values in this alert?

Sharding strategy is a complex topic with many different parameters affecting how many shards are okay and the size.
I do not think we should try to boil the ocean here with the alert in the first deliverable. What we are trying to provide is a general catch-all for scenarios where things are broken. The default recommendation from Elastic is to keep shards under 50 GB. This alert, with a default of > 55 GB, is trying to catch the exception scenarios where users may not have an ILM policy, or the ILM policy may be failing, etc.

As a further check we talked about applying this alert only to active indices (new data written in the last 7 days). This makes sure we do not alert on large indices that are "cold/frozen", which may be larger than 55 GB with the user's full knowledge. We can provide a checkbox for the user to control this. It should be checked by default, but the user can uncheck it if they want. In that case the alert would apply to all indices matching the defined index pattern.
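A rough sketch of how the threshold and the "active only" checkbox could combine. Everything here is an assumption for illustration: the `IndexShardInfo` shape, the function name, and in particular the `lastWriteMs` field, which presumes a "last write" timestamp is available per index.

```typescript
// Hypothetical shape; lastWriteMs assumes a per-index "last write" time exists.
interface IndexShardInfo {
  index: string;
  largestShardBytes: number;
  lastWriteMs: number; // epoch milliseconds of the most recent write
}

const BYTES_PER_GB = 1024 * 1024 * 1024;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the names of indices whose largest shard exceeds the threshold.
// When onlyActive is true, indices with no writes in the window are skipped.
function findLargeShardIndices(
  indices: IndexShardInfo[],
  thresholdGb: number,
  onlyActive: boolean,
  nowMs: number,
  activeWindowDays: number = 7
): string[] {
  const cutoffMs = nowMs - activeWindowDays * MS_PER_DAY;
  return indices
    .filter((i) => !onlyActive || i.lastWriteMs >= cutoffMs)
    .filter((i) => i.largestShardBytes > thresholdGb * BYTES_PER_GB)
    .map((i) => i.index);
}
```

Passing `nowMs` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.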

ravikesarwani · 2021-01-13T18:51:07Z

@igoristic Thanks for raising these questions. Your points are valid ones.
Let me know if my response makes sense and/or if there are further questions.
I updated the doc with the above comments as well.

If you run into complexities implementing the above, let's raise them ASAP so that we can discuss and figure out alternative approaches. It would be good to get this out to customers, continue to get feedback, and improve based on the real-world scenarios and challenges customers face.

igoristic · 2021-01-15T18:02:27Z

@ravikesarwani Thank you for addressing this!

So, to clarify: In the drawer UI we will only have the following inputs:

Our current default inputs like: Check every X, Notify every X, etc
range (date) - How far back we want to look
index-pattern (text) eg: .monitoring-*,metricbeat-*,data-*
threshold (number) X number of GB

And, we will only have 1x threshold and 1x index-pattern, right? So, they will not have the ability to set different thresholds per different index patterns? (but, maybe something to consider in the future)
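For reference, a single-threshold, single-pattern parameter object might look roughly like the sketch below. The interface, constant, and function names are hypothetical, and the defaults (55 GB, `*`) are the values floated in this thread rather than a confirmed implementation.

```typescript
// Hypothetical parameter shape: one threshold and one index-pattern string.
interface LargeShardParams {
  indexPattern: string; // comma-separated globs, e.g. ".monitoring-*,metricbeat-*"
  thresholdGb: number;
}

// Assumed defaults, taken from the values discussed in this thread.
const DEFAULT_LARGE_SHARD_PARAMS: LargeShardParams = {
  indexPattern: '*',
  thresholdGb: 55,
};

// Falls back to the defaults when a field is missing, blank, or not a positive number.
function resolveParams(input: Partial<LargeShardParams>): LargeShardParams {
  const indexPattern =
    input.indexPattern !== undefined && input.indexPattern.trim() !== ''
      ? input.indexPattern.trim()
      : DEFAULT_LARGE_SHARD_PARAMS.indexPattern;
  const thresholdGb =
    input.thresholdGb !== undefined && isFinite(input.thresholdGb) && input.thresholdGb > 0
      ? input.thresholdGb
      : DEFAULT_LARGE_SHARD_PARAMS.thresholdGb;
  return { indexPattern, thresholdGb };
}
```

Supporting per-pattern thresholds later would mostly mean turning this object into an array of the same shape.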

As a further check we talked about applying this alert only to Active index (new data written in last 7 days). This makes sure that large indexes in the environment that are "cold/frozen"

We don't have this metric (as far as I'm aware).

chrisronline · 2021-01-15T18:42:05Z

I wonder if we should support index pattern inclusion and exclusion. ES supports this (using a - sign in front of the index pattern), and I imagine some users might find it more useful to apply the alert to all but a single index. WDYT?
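A simplified sketch of what inclusion and exclusion could look like on our side if we filtered index names ourselves. Both helper names are hypothetical, only `*` is treated as a wildcard, and exclusions are applied regardless of their position in the list, which may differ from how Elasticsearch itself resolves such expressions.

```typescript
// Converts a glob with "*" wildcards into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  // Escape every regex metacharacter except "*", then expand "*" to ".*".
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// True when the index matches at least one inclusion glob and no exclusion glob.
// Entries starting with "-" are treated as exclusions (order-independent here).
function matchesPattern(index: string, pattern: string): boolean {
  const parts = pattern
    .split(',')
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const includes = parts.filter((p) => p.charAt(0) !== '-');
  const excludes = parts.filter((p) => p.charAt(0) === '-').map((p) => p.slice(1));
  const included = includes.some((g) => globToRegExp(g).test(index));
  const excluded = excludes.some((g) => globToRegExp(g).test(index));
  return included && !excluded;
}
```

With this, a pattern such as `*,-data-old` covers the "all but a single index" case.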

ravikesarwani · 2021-01-15T18:44:10Z

Our current default inputs like: Check every X, Notify every X, etc

Yes

range (date) - How far back we want to look

I didn't have this in my definition. I don't think "How far back we want to look" really applies for shard size.
We can say "Look at the average over last X minutes" to account for any fluctuations we may see in the shard size but I don't think this is a must in my viewpoint. WDYT?

index-pattern (text) eg: .monitoring-*,metricbeat-*,data-*

threshold (number) X number of GB

Yes

And, we will only have 1x threshold and 1x index-pattern, right? So, they will not have the ability to set different thresholds per different index patterns? (but, maybe something to consider in the future)

Yes

I see the UI to look something like (other than the normal top and Actions section):

Notify when shard size is over "55 GB"
Apply to the following index pattern " "
I was wondering if it would be better for this UI to just show the index patterns already defined in the "Stack management->Index patterns" UI. The user can select one or more from the available list.
Providing a way for users to create a new index pattern may be too much work here, and it introduces a way for users to make mistakes. WDYT?

ravikesarwani · 2021-01-15T19:12:39Z

I wonder if we should support index pattern inclusion and exclusion. ES supports this (using a - sign in front of the index pattern), and I imagine some users might find it more useful to apply the alert to all but a single index. WDYT?

Supporting all the functionality around creating an index pattern looks like a complex task. See doc.
That is why I was wondering whether, rather than supporting all of that functionality, we limit this UI to selecting already-created index patterns.
The user needs to create the index pattern in the "Stack management->Index patterns" UI. From the Edit Alert view they can just select one or more index patterns to apply.

chrisronline · 2021-01-15T19:56:00Z

I don't think we need to support the creation of index patterns, but we can support a simple text field to allow the user to specify them. I also don't mind if we source the list from known index patterns in Kibana. Either way, we can support inclusion and exclusion fairly easily if it's desirable from a product perspective.

igoristic · 2021-01-15T23:36:31Z

The doc starts out with...

"Kibana requires an index pattern to access the Elasticsearch data that you want to explore"

What if they don't want to "explore" the production data/cluster they're monitoring? Seems like these are different use cases, and we should keep them separate. I think providing an index pattern field with a default value of * (all indices) would suffice (for now) and remove a lot of assumptions/complexity

ravikesarwani · 2021-01-16T03:50:38Z

I was assuming that using the index patterns already created in Stack Management would be easier. I agree the main use described in the document is exploring data in Kibana, but it serves our purpose very well too. That UI also has functionality like showing in real time which indices are selected as you type the filter, which is really great and removes a lot of user errors.

If you think providing an edit box (include/exclude, wildcards, multiple entries ...), error checking (characters allowed/not allowed, etc.) and related functionality is easier to develop, then go for it. Hopefully you will use some existing classes/functions and not try to reinvent the wheel here, so that the code stays simple and the test cases contained.

cc: @jasonrhodes
Not sure Jason if you have some technical opinion on this one based on things you may have seen in Log and/or Metrics UI.

chrisronline · 2021-01-16T21:25:01Z

I imagine we'd do the same as Metrics does (as well as other observability solutions):

They don't have any validation (afaik) and are just a simple text field.

I'd imagine defaulting to * is a good idea and we can add exclusion in a later release, but I don't think the technical investment is anything more than small to add exclusions.

ravikesarwani · 2021-01-17T03:45:17Z

Let's use the text field and get this alert coded, tested and delivered.

igoristic added Meta Team:Monitoring Stack Monitoring team v8.0.0 Feature:Stack Monitoring v7.10.0 labels Aug 12, 2020

igoristic added this to the Stack Monitoring UI 7.10 milestone Aug 12, 2020

igoristic self-assigned this Aug 12, 2020

sgrodzicki removed this from the Stack Monitoring UI 7.10 milestone Sep 28, 2020

sgrodzicki removed Meta v7.10.0 v8.0.0 labels Oct 1, 2020

sgrodzicki added this to the Stack Monitoring UI 7.11 milestone Oct 1, 2020

igoristic modified the milestones: Stack Monitoring UI 7.11, Stack Monitoring UI 7.12 Dec 21, 2020

igoristic mentioned this issue Jan 27, 2021

[Monitoring][Alerting] Large shard alert #89410

Merged

igoristic closed this as completed in #89410 Feb 3, 2021
