
sparse fields track #19

Open
LucaWintergerst opened this issue Apr 20, 2017 · 7 comments
We should consider adding a track with very sparse fields, comparing how doc_values behave over time. This is particularly interesting once we move to lucene 7.

[screenshot: monitoring graph of indexing performance, 2017-04-20]

The run from 14:30-16:30 was with doc_values: false,
the one from 16:30-17:30 with doc_values: true,
and the very last run was with doc_values disabled for all fields.
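
For reference, doc_values is toggled per field in the index mapping. A minimal sketch of the setting used in these runs (index name, mapping type, and field name are hypothetical; the single-type layout assumes Elasticsearch 5.x, current when this issue was filed):

```json
PUT /sparse-test
{
  "mappings": {
    "doc": {
      "properties": {
        "metric_value": { "type": "float", "doc_values": false }
      }
    }
  }
}
```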

The data has around 2200 fields in total, split across 30 types. The one type with the most documents has 200-300 fields.
The decrease in performance between the first two runs is significant, at around 30-40%.
Furthermore, the indexing rate also keeps slowing down as more data gets indexed, which does not happen as much when doc_values are disabled (see run 1).

This test was run on the following hardware:
14 cores
4 SSDs, multiple data paths
30GB heap size

The cluster was CPU bound during all runs.

The merges can't keep up in the second run, and "indexing throttled" messages were showing up in the logs.

LucaWintergerst (Author) commented Apr 20, 2017

After re-running the test with just one type and 200 fields, the indexing rate did not change significantly; almost no change was visible in monitoring. Unfortunately I don't have exact numbers.

jpountz (Contributor) commented Apr 21, 2017

@LucaWintergerst FYI Elasticsearch master is on Lucene 7 since Tuesday.

The Nested track has sparse fields by design due to the use of nested fields: fields that exist in the parent do not exist in children and vice-versa. The geonames dataset also has the elevation field which is only present in 26% of documents.

I'd be fine with adding a track that has more sparse fields, but only if it is realistic. Our recommendation still is, and always will be, to model documents so that fields are dense whenever possible.

LucaWintergerst (Author) commented

I do understand and fully support our stance against sparse fields, and therefore our recommendation for dense documents, but from what we see our customers do, this does not always apply.
Often the source data only contains a subset of the fields that can appear in a document. While it is certainly possible to model the data or indices to counteract sparsity, this creates an additional overhead that a user might not be willing to pay without understanding why they should even care.

It can also be hard to defend this stance without having credible data in the form of benchmarks to convince the user otherwise.
How bad is sparsity really? How much does it impact indexing, searching, index size and so on?
I'm sure that you (or we) can answer these questions but I would still like to have data that I can show people to convince them otherwise.
Most users don't even know about sparsity until we tell them.
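
To make "how sparse is sparse" concrete, here is a small sketch (synthetic data, hypothetical field names) that generates Metricbeat-style documents, where each document uses only a few of the possible field groups, and then measures per-field density, the kind of number a benchmark track would want to control:

```python
import random

def make_docs(n, field_groups, fields_per_group, p_group=0.1):
    """Generate synthetic documents where each doc contains only a few
    of the possible field groups, mimicking Metricbeat-style sparsity."""
    docs = []
    for _ in range(n):
        doc = {}
        for g in range(field_groups):
            if random.random() < p_group:  # most groups absent in any given doc
                for f in range(fields_per_group):
                    doc[f"group{g}.field{f}"] = random.random()
        docs.append(doc)
    return docs

def field_density(docs):
    """Fraction of documents in which each field is present (1.0 = dense)."""
    counts = {}
    for doc in docs:
        for field in doc:
            counts[field] = counts.get(field, 0) + 1
    return {f: c / len(docs) for f, c in counts.items()}

random.seed(42)
docs = make_docs(1000, field_groups=30, fields_per_group=10)
density = field_density(docs)
print(f"{len(density)} distinct fields")
print(f"mean field density: {sum(density.values()) / len(density):.2f}")
```

With p_group=0.1, each field ends up in roughly 10% of documents, which is in the same ballpark as the elevation field in geonames (26%) that jpountz mentions above.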

cdahlqvist (Contributor) commented Apr 26, 2017

@jpountz @LucaWintergerst @tsg I think this is a very important benchmark due to how Beats currently organises data. Metricbeat stores data related to all types of metrics in a single index, where each metric type has a prefix. As far as I know the standard Metricbeat index template has well over 1000 fields defined and will probably only grow as new types of metrics are introduced. This type of data is likely to be generated at scale and will be sparse by design. The same also applies to Filebeat, which sends logs, as well as output from its pre-configured modules, to a single index.

jpountz (Contributor) commented Apr 26, 2017

My only ask is to keep the track realistic. :) Also maybe there are things to reconsider on the beats side to create fewer sparse fields.

tsg commented Apr 26, 2017

What would you think about adding a track with the data created by Metricbeat in its default configuration? We're working on improving the default configuration for 6.0, so I'd wait for that before doing it, but otherwise it seems to me like a pretty logical choice?

> Also maybe there are things to reconsider on the beats side to create fewer sparse fields.

It is possible and fairly easy to configure Metricbeat to create one index per module, in which case the data should be a lot denser. But we thought that the drawbacks of doing that (more complicated index management for the user, potential shard explosion) are too big to make it the default. I'd be curious to hear your thoughts on that.
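
For context, the per-module split described here can be approximated with a format-string index name in the Metricbeat output configuration, roughly like the following (a sketch only; the field reference used here is an assumption and the exact option names should be checked against the Beats docs for the version in use):

```yaml
output.elasticsearch:
  hosts: ["localhost:9200"]
  # One index per Metricbeat module, e.g. metricbeat-system-2017.04.26:
  # denser per-index mappings, but more indices and shards to manage.
  index: "metricbeat-%{[metricset.module]}-%{+yyyy.MM.dd}"
```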

I guess the same happens in many Logstash deployments as well, since multiple log types typically go into the same logstash-* index pattern.

pcsanwald (Contributor) commented
@tsg I'd be up for doing the work on the Rally side to add this track: I'm benchmarking a new aggregation and looking for a dataset that contains sparse values, so this kind of data could be quite useful. The thing I'd need is a substantial amount of Metricbeat data to use for the track: any thoughts here?
