Efficiently search against wildcard indices regardless of underlying indexing strategy #4342

Closed
rashidkpc opened this issue Jun 26, 2015 · 21 comments

Comments

@rashidkpc
Contributor

Details about this change

This change will not break existing index patterns.

See #4342 (comment) for more details.

The original description

Elasticsearch 1.6 introduced the _field_stats API which will, for the first time, allow us to search for indices that contain fields within a given range. For example, we can search for indices that contain an @timestamp between X and Y.

It still needs one enhancement before we can utilize it: elastic/elasticsearch#11187

This means that users will no longer be required to roll their indices at UTC midnight, nor use date patterns at all. They can effectively name indices whatever they want, and Kibana can automatically optimize requests by firing a pre-flight request for the matching indices. We might need to add some caching here, but it should greatly enhance usability.

Update: The implementation of the above enhancement is here: elastic/elasticsearch#11259
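
For illustration, here is a minimal sketch of what such a pre-flight request body could look like once index constraints are available. The endpoint shape (`POST /<pattern>/_field_stats?level=indices`), the `index_constraints` layout, and the `epoch_millis` format follow the Elasticsearch field stats documentation as I understand it; the helper names are hypothetical, and nothing here is taken from Kibana's actual code.

```python
from datetime import datetime, timezone


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)


def build_field_stats_request(time_field, start, end):
    """Build a request body for POST /<pattern>/_field_stats?level=indices.

    An index overlaps the range [start, end] when its largest value of
    time_field is >= start and its smallest value is <= end.
    (Hypothetical helper; the body layout is an assumption based on the
    field stats API docs.)
    """
    return {
        "fields": [time_field],
        "index_constraints": {
            time_field: {
                "max_value": {"gte": to_epoch_millis(start), "format": "epoch_millis"},
                "min_value": {"lte": to_epoch_millis(end), "format": "epoch_millis"},
            }
        },
    }


body = build_field_stats_request(
    "@timestamp",
    datetime(2015, 10, 17, tzinfo=timezone.utc),
    datetime(2015, 10, 23, 23, 59, 59, tzinfo=timezone.utc),
)
```

The body only describes the constraint; sending it, and caching the answer, is left out on purpose.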

@w33ble
Contributor

w33ble commented Sep 10, 2015

As noted in #4886, it would be useful to allow the user to specify the range for a given field and effectively tell Kibana how far back to look for matching indices.

@rashidkpc
Contributor Author

I wonder if it wouldn't make sense to just look all the way back, but do it in a stepped manner, with a progress bar?

@pjcard

pjcard commented Sep 23, 2015

Hi, thanks for linking my issue. Regarding unlimited lookback, I wonder how far that would scale? Personally, I was thinking of adding a cron job to update the mappings automatically each day; that might be another avenue to explore. My situation only occurred because I was unaware that there were new mappings needing indexing, so it took me so long to update them that they fell outside the default lookback. Ideally, something would have caused or prompted an update within the default time period, rather than the default time period being bigger.

@simianhacker
Member

Make sure we cover the use cases in #2017

@epixa
Contributor

epixa commented Oct 20, 2015

Since now there will only be one field that is affected by the time-based index checkbox, does anyone object to me moving that checkbox next to said field?

So this:

[Screenshot attached: 2015-10-20, 11:22:25 am]

@ruckc

ruckc commented Oct 23, 2015

Will we still have the ability to use timestamped indexes? Having timestamped indexes provides a trivial method to remove old data.

@epixa
Contributor

epixa commented Oct 23, 2015

@ruckc Any reason you wouldn't just adjust the time range in kibana to not look at "old data"?

@ruckc

ruckc commented Oct 23, 2015

@epixa We handle about 20-60 GB of indexed volume daily on a single node. The data loses relevancy extremely quickly (within days), so we only keep at most a few days to a week online, depending on the storage space available.

@ruckc

ruckc commented Oct 23, 2015

@epixa Even if the timestamped indexes were more of a workaround for the lack of the field_stats API, at this point they are probably a feature for more people than just my organization, who have built workflows that take advantage of them.

@epixa
Contributor

epixa commented Oct 23, 2015

@ruckc It's possible that I'm misunderstanding, but it seems to me that using the field stats api will work for your workflow. For your scenario, there would probably even be a very minor performance gain with the new setup.

You could still maintain separate indexes following some sort of time-based convention. In fact, I'd say it's probably a good idea to continue doing so.

Consider this hypothetical scenario: you maintain some logstash data in daily indices, and you want to retrieve any data that happens to be stored in the last 7 days.

How it currently works with pattern-based naming convention

Index pattern: [logstash-]YYYY-MM-DD
Time field: @timestamp

A list of possible index names is generated:

logstash-2015-10-17
logstash-2015-10-18
logstash-2015-10-19
logstash-2015-10-20
logstash-2015-10-21
logstash-2015-10-22
logstash-2015-10-23

Kibana does a search against all 7 of those indexes regardless of whether they actually exist. Any non-existent index is just treated as empty.
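
As a rough sketch of that name-generation step: the function below is hypothetical and uses Python's `strftime` as a stand-in for the date pattern, so it illustrates the idea rather than Kibana's implementation. The default format is chosen to reproduce the index names listed above.

```python
from datetime import date, timedelta


def expected_daily_indices(prefix, last_day, days, date_format="%Y-%m-%d"):
    """Return one index name per day for the `days` days ending at `last_day`.

    The names are derived purely from the calendar; no check is made
    that any of the indices actually exist.
    """
    first_day = last_day - timedelta(days=days - 1)
    return [
        prefix + (first_day + timedelta(days=offset)).strftime(date_format)
        for offset in range(days)
    ]


# Last 7 days as of 2015-10-23, matching the list above.
names = expected_daily_indices("logstash-", date(2015, 10, 23), 7)
```

Because the list comes from the pattern alone, every name in it is queried whether or not the index was ever created or has since been deleted.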

How it will work

Index pattern: logstash-*
Time field: @timestamp

An index list is generated. You've deleted all but the last 4 indexes because the data is no longer useful to you, so only the 4 indexes that actually exist are included:

logstash-2015-10-20
logstash-2015-10-21
logstash-2015-10-22
logstash-2015-10-23

Kibana does a search on those indexes that are known to exist.

Under the new setup, the strategy you use to generate and name indexes is not necessarily directly coupled to how kibana queries them. Let's say you start pulling in twice as much data, around 100GB a day. You could start storing data in half-day indices (logstash-2015-10-24-am, logstash-2015-10-24-pm) and you wouldn't need to change anything within kibana itself. Kibana would be able to search against that new indexing strategy without any intervention.
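
To make the decoupling concrete, here is a small sketch of the selection logic, assuming we already know the smallest and largest @timestamp stored in each existing index (which is the information the field stats API provides). The function and the sample data are made up for illustration; note that the index name is never parsed.

```python
from datetime import datetime


def indices_in_range(stats, start, end):
    """Return the names of indices whose data overlaps [start, end].

    `stats` maps index name -> (min_timestamp, max_timestamp).
    Results are ordered by their earliest timestamp.
    """
    hits = [
        (lo, name)
        for name, (lo, hi) in stats.items()
        if hi >= start and lo <= end
    ]
    return [name for _, name in sorted(hits)]


# Daily indices followed by half-day indices, as in the scenario above.
stats = {
    "logstash-2015-10-22": (datetime(2015, 10, 22, 0, 0), datetime(2015, 10, 22, 23, 59)),
    "logstash-2015-10-23": (datetime(2015, 10, 23, 0, 0), datetime(2015, 10, 23, 23, 59)),
    "logstash-2015-10-24-am": (datetime(2015, 10, 24, 0, 0), datetime(2015, 10, 24, 11, 59)),
    "logstash-2015-10-24-pm": (datetime(2015, 10, 24, 12, 0), datetime(2015, 10, 24, 23, 59)),
}

selected = indices_in_range(
    stats, datetime(2015, 10, 23, 18, 0), datetime(2015, 10, 24, 6, 0)
)
```

A query window that straddles the switch from daily to half-day indices picks up one index of each kind without any change to the index pattern.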

Does this make sense? Am I understanding your workflow correctly?

@ruckc

ruckc commented Oct 23, 2015

Yes that makes sense, and will work. I just wanted to ensure that Kibana would continue to support querying timestamped indexes as a whole.

@epixa
Contributor

epixa commented Oct 23, 2015

@ruckc Definitely! The only requirement for querying them will be that they have some similarity in their naming convention that you can represent with a wildcard index pattern (eg logstash-*)

@epixa
Contributor

epixa commented Oct 27, 2015

Many folks have expressed concern about the changes that will result from this ticket, so I wanted to spell out the implementation plan for this and provide a bit more detail about what these changes mean for time-based index patterns in Kibana.

Why?

Kibana is now smart enough to automatically determine which indices to search against based on your current specified time range for any wildcard index pattern. This means that any wildcard index pattern (e.g. logstash-*) that has a specific time field configured will automatically get the search optimizations that you used to only be able to get when you specified a time-based naming convention (e.g. [logstash-]YYYY.MM.DD).

This makes it easier to get up and running quickly with Kibana. A wildcard index pattern will now work for both small amounts of data and large amounts of data.

This also means that users can change their indexing strategies behind the scenes without having to create entirely new index patterns. For example, a user could change from having daily indexes to having hourly or even size-based indexes and their existing index pattern in Kibana will continue to work even when looking at a time range that spans the old and new indexes.

What is changing for 4.3?

All new and existing wildcard index patterns (e.g. logstash-*) that have a time field configured will have their searches optimized.

All new and existing index patterns created using a time-based naming convention (e.g. [logstash-]YYYY.MM.DD) will continue to work.

When creating a new index pattern, users will be discouraged from using time-based naming conventions via a deprecation warning on the form. Included along with the message will be a short description about how users can now use wildcard patterns to efficiently search against time-based indexes.

What is changing for 5.0?

The ability to use time-based naming conventions when creating new index patterns will be removed.

@epixa changed the title from "Deprecate timestamped indices, use _field_stats API" to "Efficiently search against wildcard indices regardless of what indexing strategy is used" on Oct 27, 2015
@epixa changed the title from "Efficiently search against wildcard indices regardless of what indexing strategy is used" to "Efficiently search against wildcard indices regardless of underlying indexing strategy" on Oct 27, 2015
@epixa
Contributor

epixa commented Oct 27, 2015

The last PR for this ticket just went into master.

@mac3384

mac3384 commented Mar 1, 2016

This might not be the appropriate place to ask this question, but once the ability to use a time-based naming convention on indices is removed, what will be the best approach to deleting old data? With the current approach, you can simply drop old indices based on the date in their name. But if all my data resides in a single index, how will I be able to delete data older than X days/months/etc.? Using Curator will no longer work in this case. Am I correct?

@epixa
Contributor

epixa commented Mar 1, 2016

@mac3384 You can (and probably should) still use time based indexing schemes for your data. Kibana is just now smart enough to intelligently query those indexes based on your currently selected time range for any wildcard index patterns you've created.
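
As a sketch of why name-based retention keeps working: if index names still carry a date, old indices can be picked out from the names alone. The helper and the regular expression below are illustrative assumptions (they are not Curator's logic), and the code only selects names; it deletes nothing.

```python
import re
from datetime import date, timedelta

# Matches a date such as 2015.10.17, 2015-10-17 or 20151017 anywhere in a name.
DATE_IN_NAME = re.compile(r"(\d{4})[.\-]?(\d{2})[.\-]?(\d{2})")


def indices_older_than(names, today, keep_days):
    """Return the index names whose embedded date is before today - keep_days.

    Names without a recognisable date (or with an impossible one) are skipped.
    """
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in names:
        match = DATE_IN_NAME.search(name)
        if not match:
            continue
        try:
            day = date(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # e.g. month 13: not a real date, leave the index alone
        if day < cutoff:
            old.append(name)
    return old


candidates = indices_older_than(
    ["logstash-2015.10.17", "logstash-2015.10.22", "logstash-2015.10.23", ".kibana"],
    today=date(2015, 10, 23),
    keep_days=4,
)
```

The resulting list could then be handed to whatever deletion tooling is already in place.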

@xande

xande commented Mar 24, 2016

@epixa, do you know what algorithm Kibana uses to query only the indices relevant to the selected time range?

Would it also work automatically with a naming scheme like log_somethinghere_20160130 (i.e. the date expressed as YYYYMMDD)?

UPDATE:
I think I got it. Kibana narrows down the search using _field_stats for each of the indices?

@epixa
Contributor

epixa commented Mar 24, 2016

@xande Your update is correct. Unless you specifically opt into the behavior when you create your index pattern, Kibana does not make any decisions about time ranges based on your index names. It uses the field_stats api to ask elasticsearch which indices have data in a given time range, and then it queries those indices specifically.
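
A rough sketch of the second half of that flow, turning a field stats answer into a search target. The response layout (`indices` → index name → `fields` → field → `min_value`) is my reading of the field stats API with `level=indices`, the sample response is invented, and the function name is hypothetical.

```python
def search_target(field_stats_response, time_field):
    """Return a comma-separated index list for use in a search URL.

    Indices are ordered by the smallest value of `time_field` they hold.
    Returns None when no index matched, leaving that case to the caller.
    """
    indices = field_stats_response.get("indices", {})
    ordered = sorted(
        indices,
        key=lambda name: indices[name]["fields"][time_field]["min_value"],
    )
    return ",".join(ordered) if ordered else None


# Invented sample response; values are epoch milliseconds.
response = {
    "indices": {
        "log_somethinghere_20160131": {
            "fields": {"@timestamp": {"min_value": 1454198400000, "max_value": 1454284799000}}
        },
        "log_somethinghere_20160130": {
            "fields": {"@timestamp": {"min_value": 1454112000000, "max_value": 1454198399000}}
        },
    }
}

target = search_target(response, "@timestamp")
```

The names are used only as opaque identifiers here, which is why a YYYYMMDD suffix works just as well as any other scheme.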

@JeremyColton

JeremyColton commented Jul 26, 2016

Hi @epixa, I am using ES 2.3.2 and Kibana 4.5. I am the only person using this ELK stack. My index pattern is 'logstash-*'. I didn't tick the 'Use event times to create index names [DEPRECATED]' checkbox. There is more text that says 'By default, searches against any time-based index pattern that contains a wildcard will automatically be expanded to query only the indices that contain data within the currently selected time range.'

When I query for 'today' in my dashboard, Kibana sends one request per visualization to ES using epoch times for 'today', so this seems to fit the text above. But ES queries every single index it has (~3 months). I have entries like the following in my elasticsearch.log for each daily index, e.g.:

[2016-07-26 08:40:27,877][DEBUG][action.search ] [Invisible Woman] [logstash-2016.06.11]

So this is an ES bug?

@epixa
Contributor

epixa commented Jul 26, 2016

@JeremyColton I'm not completely sure of the underlying ES implementation, to be honest, but it might be. Are you able to get me information about the network requests that Kibana makes using the network tab in your dev tools?

@JeremyColton

Kibana sent requests with epoch times for the last 24 hours.

However, I changed my index's shard number from 5 to 2 with no replicas.
I re-indexed my existing indices.
This problem then went away!
