Adds a new auto-interval date histogram #26659
Conversation
@jpountz this is still WIP but I think most of the main bits are there, so it would be great to get a review to make sure it's going in the right direction.
It looks good to me overall. One thought I had when reviewing this PR is that you used deferring in order not to have to implement merging on all aggregations, but since histograms return all buckets all the time anyway, deferring is more likely to increase memory usage than to decrease it? Also should it be experimental for now?
public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
    try (IntArray oldDocCounts = docCounts) {
        docCounts = bigArrays.newIntArray(newNumBuckets, true);
        docCounts.fill(0, newNumBuckets, 0);
this is not needed since you passed true on the previous line
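For reference, a minimal sketch of the quoted lines with that redundant fill dropped, assuming (as the comment says) that `newIntArray(size, true)` already returns a zeroed array; the copy loop that follows in the real method is elided:

```java
public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
    try (IntArray oldDocCounts = docCounts) {
        // newIntArray(newNumBuckets, true) is assumed to hand back a zero-filled
        // array, so the explicit fill(0, newNumBuckets, 0) can simply be removed.
        docCounts = bigArrays.newIntArray(newNumBuckets, true);
        // ... fold oldDocCounts into docCounts according to mergeMap (elided) ...
    }
}
```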
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName() + " must be greater than 0 for [" + name + "]");
    }
    this.numBuckets = numBuckets;
    return this;
it should probably be > 1 actually, no? A single bucket is not useful for a histogram.
maybe add a soft limit for the number of buckets too, something like 1000?
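A hedged sketch of how the setter could look with both suggestions applied; the enclosing builder class, the `MAX_BUCKETS` constant and its value are assumptions, not code from this PR:

```java
// Hypothetical cap suggested in the review; 1000 is only an example value.
// Whether it should reject outright or merely warn (a true "soft" limit) is left open.
static final int MAX_BUCKETS = 1000;

public AutoDateHistogramAggregationBuilder numBuckets(int numBuckets) {
    if (numBuckets <= 1) {
        // a single bucket is not useful for a histogram, so require > 1
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName()
                + " must be greater than 1 for [" + name + "]");
    }
    if (numBuckets > MAX_BUCKETS) {
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName()
                + " must be less than or equal to " + MAX_BUCKETS + " for [" + name + "]");
    }
    this.numBuckets = numBuckets;
    return this;
}
```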
private final Rounding[] roundings;
private int roundingIdx = 0;

private LongHash bucketOrds;
since the number of buckets is contained anyway, we might want to use a LongToIntHashMap instead, which has less overhead
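If that route were taken, a rough sketch of the collect path using a primitive long-to-int map such as HPPC's `com.carrotsearch.hppc.LongIntHashMap` (which is presumably the kind of map meant here); the surrounding names are taken from the snippets above, the rest is assumption:

```java
// rounded timestamp -> dense bucket ordinal; stays small since buckets are bounded
private LongIntHashMap bucketOrds = new LongIntHashMap();

void collectValue(LeafBucketCollector sub, int doc, long value) throws IOException {
    long rounded = roundings[roundingIdx].round(value);
    if (bucketOrds.containsKey(rounded)) {
        collectExistingBucket(sub, doc, bucketOrds.get(rounded));
    } else {
        int bucketOrd = bucketOrds.size();   // next dense ordinal
        bucketOrds.put(rounded, bucketOrd);
        collectBucket(sub, doc, bucketOrd);
    }
}
```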
            collectExistingBucket(sub, doc, bucketOrd);
        } else {
            collectBucket(sub, doc, bucketOrd);
            while (bucketOrds.size() > targetBuckets) {
I think we also need to make sure we are not on the last level of rounding already? Otherwise passing numBuckets = 2 and 3 dates that are in 3 different centuries could cause out-of-bounds exceptions?
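In code, the guard could be as simple as stopping the promotion loop once the coarsest rounding is in use (variable names taken from the snippets above):

```java
// don't promote past the last (coarsest) rounding, even if still over the target
while (bucketOrds.size() > targetBuckets && roundingIdx < roundings.length - 1) {
    increaseRounding();
}
```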
        } else {
            collectBucket(sub, doc, bucketOrd);
            while (bucketOrds.size() > targetBuckets) {
                increaseRounding();
this operation does two things:
- compute the merge map (cheap since the number of buckets is small)
- recompute buffered buckets (potentially expensive since the number of docs is not bounded)
Yet the merge map alone is enough to know how many buckets we will have in the end, so I think it should do something like this instead. I'm unsure how often it would be applied in practice, but it doesn't make things more complex and would be safer?
if (bucketOrds.size() > targetBuckets) {
    long[] mergeMap;
    do {
        mergeMap = increaseRounding();
    } while (mergeMap.length > targetBuckets);
    // now renumber buckets
}
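A slightly more concrete sketch of that suggestion, assuming `increaseRounding()` is changed to only re-key the `LongHash` and return the old-to-new ordinal map, leaving the doc-count renumbering to a single `mergeBuckets` call once the loop settles; everything beyond the names in the snippets above is an assumption:

```java
private long[] increaseRounding() {
    roundingIdx++;
    LongHash newBucketOrds = new LongHash(bucketOrds.size(), bigArrays);
    long[] mergeMap = new long[(int) bucketOrds.size()];
    for (long oldOrd = 0; oldOrd < bucketOrds.size(); oldOrd++) {
        // cheap: only the keys are rewritten, per-bucket doc counts are untouched
        long newKey = roundings[roundingIdx].round(bucketOrds.get(oldOrd));
        long newOrd = newBucketOrds.add(newKey);
        if (newOrd < 0) {
            newOrd = -1 - newOrd; // key was already present
        }
        mergeMap[(int) oldOrd] = newOrd;
    }
    // (the previous bucketOrds would need to be released here)
    bucketOrds = newBucketOrds;
    return mergeMap;
}
```

The number of surviving buckets after a step is just `bucketOrds.size()`, so the caller can keep promoting the rounding until that drops to the target before touching any doc counts.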
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; when it has created more than the target number of buckets it merges these into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time. This aggregation intentionally does not support min_doc_count, offset and extended_bounds, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary. Closes #9572
closing in favour of #28993 (same code but branch living on origin so @pcsanwald and I can work on the branch together)
elastic#26659 was closed, not merged. (The followup PR didn't make it.)
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; when it has created more than the target number of buckets it merges these into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support min_doc_count, offset and extended_bounds, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

This is also the first aggregation to merge buckets at collection time, but having this aggregation will open the door to other such aggregations that merge buckets at collection time.
Things still to do:
- Collect `10 * target` buckets on the shards and then use the extra buckets to create buckets representing multiples of the interval, for the cases where we have too many buckets for one interval but if we increase the interval we get only 1 (or a few) bucket(s) (e.g. if the target is 10 buckets and we end up with 30 x minute-level buckets, increasing the rounding would give 1 x hour bucket, so instead can we merge every 3 minute-level buckets to get 10 x 3-minute buckets?) - this will probably be moved out into a separate issue to be tackled when this first pass is merged
- An `auto_histogram` aggregation to work on numeric (rather than date) fields - again this might be moved into a separate issue

Closes #9572
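Purely for illustration, here is a self-contained toy sketch of the collect-then-merge idea described above, using plain java.time, hypothetical class and variable names, and fixed-duration rounding only: documents are bucketed at the finest interval, and whenever the bucket count exceeds the target everything is merged into the next coarser interval.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.TreeMap;

public class AutoIntervalSketch {

    // Coarsening ladder, finest to coarsest. The real aggregation would use
    // calendar-aware roundings (up to years); fixed durations keep the toy simple.
    private static final Duration[] INTERVALS = {
            Duration.ofSeconds(1), Duration.ofMinutes(1), Duration.ofHours(1), Duration.ofDays(1)
    };

    public static void main(String[] args) {
        List<Instant> docs = List.of(
                Instant.parse("2018-03-01T10:15:03Z"),
                Instant.parse("2018-03-01T10:15:45Z"),
                Instant.parse("2018-03-01T10:47:10Z"),
                Instant.parse("2018-03-01T13:02:00Z"),
                Instant.parse("2018-03-02T09:30:00Z"));

        int targetBuckets = 3;
        int intervalIdx = 0;
        TreeMap<Long, Integer> buckets = new TreeMap<>(); // rounded epoch millis -> doc count

        for (Instant doc : docs) {
            buckets.merge(round(doc, INTERVALS[intervalIdx]), 1, Integer::sum);
            // Too many buckets: promote to the next coarser interval and merge,
            // the toy equivalent of increaseRounding()/mergeBuckets() in the PR.
            while (buckets.size() > targetBuckets && intervalIdx < INTERVALS.length - 1) {
                intervalIdx++;
                Duration coarser = INTERVALS[intervalIdx];
                TreeMap<Long, Integer> merged = new TreeMap<>();
                buckets.forEach((key, count) ->
                        merged.merge(round(Instant.ofEpochMilli(key), coarser), count, Integer::sum));
                buckets = merged;
            }
        }
        System.out.println("interval=" + INTERVALS[intervalIdx] + " buckets=" + buckets);
    }

    private static long round(Instant timestamp, Duration interval) {
        long millis = interval.toMillis();
        return Math.floorDiv(timestamp.toEpochMilli(), millis) * millis;
    }
}
```

For these five timestamps and a target of 3 the sketch settles on an hourly interval with three buckets; the real aggregator performs the same escalation with `Rounding` instances and merges `LongHash` ordinals (and deferred sub-aggregations) rather than map keys.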