
sparse fields track #19

Open
LucaWintergerst opened this issue Apr 20, 2017 · 7 comments
We should consider adding a track with very sparse fields, comparing how doc_values behave over time. This is particularly interesting once we move to lucene 7.

[screenshot: monitoring graph of indexing performance, 2017-04-20]

The run from 14:30-16:30 was with doc_values: false,
the one from 16:30-17:30 with doc_values: true,
and the very last run was with doc_values disabled for all fields.
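
For reference, doc_values is toggled per field in the index mapping. A minimal sketch of the setting used in these runs (index name, mapping type, and field name are hypothetical; the single-type layout assumes Elasticsearch 5.x, current when this issue was filed):

```json
PUT /sparse-test
{
  "mappings": {
    "doc": {
      "properties": {
        "metric_value": { "type": "float", "doc_values": false }
      }
    }
  }
}
```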

The data has around 2200 fields in total, split across 30 types. The one type with the most documents has 200-300 fields.
The decrease in performance between the first two runs is significant, at around 30-40%.
Furthermore, the indexing rate also keeps slowing down as more data gets indexed, which does not happen as much when doc_values are disabled (see run 1).

This test was run on the following hardware:
14 cores
4 SSDs, multiple data paths
30GB heap size

The cluster was CPU bound during all runs.

The merges can't keep up in the second run, and "indexing throttled" messages were showing up in the logs.

LucaWintergerst (Author) commented Apr 20, 2017

After re-running the test with just one type and 200 fields, the indexing rate did not change significantly; almost no change was visible in monitoring. Unfortunately I don't have exact numbers.

jpountz (Contributor) commented Apr 21, 2017

@LucaWintergerst FYI Elasticsearch master is on Lucene 7 since Tuesday.

The Nested track has sparse fields by design due to the use of nested fields: fields that exist in the parent do not exist in children and vice-versa. The geonames dataset also has the elevation field which is only present in 26% of documents.

I'd be fine with adding a track that has more sparse fields, but only if it is realistic. Our recommendation still is, and always will be, to model documents so that fields are dense whenever possible.

LucaWintergerst (Author) commented

I do understand and fully support our stance against sparse fields, and therefore our recommendation for dense documents, but from what we see our customers do, this does not always apply.
Often the source data only contains a subset of the fields that can appear in a document. While it is certainly possible to model the data or indices to counteract sparsity, this creates an additional overhead that a user might not be willing to pay without understanding why they should even care.

It can also be hard to defend this stance without having credible data in the form of benchmarks to convince the user otherwise.
How bad is sparsity really? How much does it impact indexing, searching, index size and so on?
I'm sure that you (or we) can answer these questions but I would still like to have data that I can show people to convince them otherwise.
Most users don't even know about sparsity until we tell them.
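
To make "how sparse is sparse" concrete, here is a small sketch (synthetic data, hypothetical field names) that generates Metricbeat-style documents, where each document uses only a few of the possible field groups, and then measures per-field density, the kind of number a benchmark track would want to control:

```python
import random

def make_docs(n, field_groups, fields_per_group, p_group=0.1):
    """Generate synthetic documents where each doc contains only a few
    of the possible field groups, mimicking Metricbeat-style sparsity."""
    docs = []
    for _ in range(n):
        doc = {}
        for g in range(field_groups):
            if random.random() < p_group:  # most groups absent in any given doc
                for f in range(fields_per_group):
                    doc[f"group{g}.field{f}"] = random.random()
        docs.append(doc)
    return docs

def field_density(docs):
    """Fraction of documents in which each field is present (1.0 = dense)."""
    counts = {}
    for doc in docs:
        for field in doc:
            counts[field] = counts.get(field, 0) + 1
    return {f: c / len(docs) for f, c in counts.items()}

random.seed(42)
docs = make_docs(1000, field_groups=30, fields_per_group=10)
density = field_density(docs)
print(f"{len(density)} distinct fields")
print(f"mean field density: {sum(density.values()) / len(density):.2f}")
```

With p_group=0.1, each field ends up in roughly 10% of documents, which is in the same ballpark as the elevation field in geonames (26%) that jpountz mentions above.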

cdahlqvist (Contributor) commented Apr 26, 2017

@jpountz @LucaWintergerst @tsg I think this is a very important benchmark due to how Beats currently organises data. Metricbeat stores data related to all types of metrics in a single index, where each metric type has a prefix. As far as I know the standard Metricbeat index template has well over 1000 fields defined and will probably only grow as new types of metrics are introduced. This type of data is likely to be generated at scale and will be sparse by design. The same also applies to Filebeat, which sends logs, as well as output from its pre-configured modules, to a single index.

jpountz (Contributor) commented Apr 26, 2017

My only ask is to keep the track realistic. :) Also maybe there are things to reconsider on the beats side to create fewer sparse fields.

tsg commented Apr 26, 2017

What would you think about adding a track with the data created by Metricbeat in its default configuration? We're working on improving the default configuration for 6.0, so I'd wait for that before doing it, but otherwise it seems to me like a pretty logical choice?

> Also maybe there are things to reconsider on the beats side to create fewer sparse fields.

It is possible and fairly easy to configure Metricbeat to create one index per module, in which case the data should be a lot denser. But we thought that the drawbacks of doing that (more complicated index management for the user, potential shard explosion) are too big to make it the default. I'd be curious to hear your thoughts on that.
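
For context, the per-module split described here can be approximated with a format-string index name in the Metricbeat output configuration, roughly like the following (a sketch only; the field reference used here is an assumption and the exact option names should be checked against the Beats docs for the version in use):

```yaml
output.elasticsearch:
  hosts: ["localhost:9200"]
  # One index per Metricbeat module, e.g. metricbeat-system-2017.04.26:
  # denser per-index mappings, but more indices and shards to manage.
  index: "metricbeat-%{[metricset.module]}-%{+yyyy.MM.dd}"
```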

I guess the same happens in many Logstash deployments as well, since multiple log types typically go into the same logstash-* index pattern.

pcsanwald (Contributor) commented
@tsg I'd be up for doing the work on the Rally side to add this track: I'm benchmarking a new aggregation and looking for a dataset that contains sparse values, so this kind of data could be quite useful. The thing I'd need is a substantial amount of Metricbeat data to use for the track: any thoughts here?
