sparse fields track #19
After re-running the test with just one type and 200 fields, the indexing rate did not change significantly. Almost no change was visible in monitoring. Unfortunately I don't have exact numbers.
@LucaWintergerst FYI Elasticsearch master has been on Lucene 7 since Tuesday. The nested track has sparse fields by design due to its use of nested documents. I'd be fine with adding a track that has more sparse fields, but only if it is realistic. Our recommendation still is and will always be to model documents in a way that fields are dense whenever possible.
I do understand and fully support our stance against sparse fields, and therefore our recommendation for dense documents, but from what we see our customers doing, this does not always apply. It can also be hard to defend this stance without credible data in the form of benchmarks to convince users otherwise.
@jpountz @LucaWintergerst @tsg I think this is a very important benchmark due to how Beats currently organises data. Metricbeat stores data related to all types of metrics in a single index, where each metric type has a prefix. As far as I know, the standard Metricbeat index template has well over 1000 fields defined, and that number will probably only grow as new types of metrics are introduced. This type of data is likely to be generated at scale and will be sparse by design. The same applies to Filebeat, which sends both raw logs and the output of its pre-configured modules to a single index.
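To make that sparsity concrete, here is a hypothetical sketch of two events from different modules landing in the same index (the index name and field paths are invented for illustration, not copied from the real Metricbeat template). Each document only populates the fields under its own module's prefix:

```
POST metricbeat-2017.05.01/_bulk
{"index":{}}
{"@timestamp":"2017-05-01T14:30:00Z","metricset":{"module":"system","name":"cpu"},"system":{"cpu":{"user":{"pct":0.42}}}}
{"index":{}}
{"@timestamp":"2017-05-01T14:30:10Z","metricset":{"module":"mysql","name":"status"},"mysql":{"status":{"threads":{"connected":12}}}}
```

The first document carries no mysql.* fields and the second no system.* fields, so with dozens of modules enabled, most of the 1000+ mapped fields are absent from most documents.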
My only ask is to keep the track realistic. :) Also, maybe there are things to reconsider on the Beats side to create fewer sparse fields.
What would you think about adding a track with the data created by Metricbeat in its default configuration? We're working on improving the default configuration for 6.0, so I'd wait for that before doing it, but otherwise it seems to me like a pretty logical choice?
It is possible and fairly easy to configure Metricbeat to create one index per module, in which case the data should be a lot denser. But we thought that the drawbacks of doing that (more complicated index management for the user, potential shard explosion) are too big to make it the default. I'd be curious to hear your thoughts on that. I guess the same happens in many Logstash deployments as well, since multiple log types typically go into the same index.
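As a rough sketch of what that per-module split could look like in metricbeat.yml, assuming the conditional `output.elasticsearch.indices` selector (treat the exact keys and field names as assumptions to verify against the Metricbeat version in use):

```yaml
# Sketch only, not a tested configuration: route events to one index per
# module instead of the single shared metricbeat-* index.
output.elasticsearch:
  hosts: ["localhost:9200"]
  indices:
    - index: "metricbeat-system-%{+yyyy.MM.dd}"
      when.contains:
        metricset.module: "system"
    - index: "metricbeat-mysql-%{+yyyy.MM.dd}"
      when.contains:
        metricset.module: "mysql"
```

Each resulting index would then only need mappings for its own module's fields (dense), at the cost of more indices and shards to manage, which is exactly the trade-off described above.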
@tsg I'd be up for doing the work on the rally side to add this track: I'm benchmarking a new aggregation and looking around for a dataset that contains sparse values, so this kind of data would be potentially quite useful. The thing I'd need is a substantial amount of metricbeat data to use for the track: any thoughts here? |
We should consider adding a track with very sparse fields, comparing how doc_values behave over time. This is particularly interesting once we move to Lucene 7.
- The run from 14:30–16:30 was with `doc_values: false` (this mapping toggle is sketched after the list)
- The run from 16:30–17:30 was with `doc_values: true`
- The very last run was with doc values disabled for all fields
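For reference, a minimal sketch of the mapping toggle used in these runs (the index and field names are invented; on 5.x-era Elasticsearch the properties would additionally be nested under a mapping type):

```
PUT doc-values-test
{
  "mappings": {
    "properties": {
      "cpu_user_pct": { "type": "float", "doc_values": false }
    }
  }
}
```

With doc_values disabled a field can still be searched, but it cannot be sorted or aggregated on, since no column-stride doc-values data is written per segment; that on-disk structure is exactly what these runs isolate.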
The data has around 2200 fields in total, split across 30 types. The type with the most documents has 200–300 fields.
The decrease in performance between the first two runs is significant, at around 30–40%.
Furthermore, the indexing rate keeps slowing down as more data gets indexed, which does not happen as much when doc_values are disabled (see run 1).
This test was run on the following hardware:
- 14 cores
- 4 SSDs, multiple data paths
- 30 GB heap size

The cluster was CPU-bound during all runs.
Merges couldn't keep up in the second run, and "indexing throttled" messages were showing up in the logs.