Adds a new auto-interval date histogram #26659
Conversation
@jpountz this is still WIP but I think most of the main bits are there, so it would be great to get a review to make sure it's going in the right direction.
It looks good to me overall. One thought I had when reviewing this PR is that you used deferring in order not to have to implement merging on all aggregations, but since histograms return all buckets all the time anyway, deferring is more likely to increase memory usage than to decrease it? Also should it be experimental for now?
public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
    try (IntArray oldDocCounts = docCounts) {
        docCounts = bigArrays.newIntArray(newNumBuckets, true);
        docCounts.fill(0, newNumBuckets, 0);
this is not needed since you passed true on the previous line
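For reference, a minimal sketch of the quoted lines with that redundant fill dropped, assuming (as the comment says) that `newIntArray(size, true)` already returns a zeroed array; the copy loop that follows in the real method is elided:

```java
public final void mergeBuckets(long[] mergeMap, long newNumBuckets) {
    try (IntArray oldDocCounts = docCounts) {
        // newIntArray(newNumBuckets, true) is assumed to hand back a zero-filled
        // array, so the explicit fill(0, newNumBuckets, 0) can simply be removed.
        docCounts = bigArrays.newIntArray(newNumBuckets, true);
        // ... fold oldDocCounts into docCounts according to mergeMap (elided) ...
    }
}
```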
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName() + " must be greater than 0 for [" + name + "]");
    }
    this.numBuckets = numBuckets;
    return this;
it should probably be > 1 actually, no? A single bucket is not useful for a histogram.
maybe add a soft limit for the number of buckets too, something like 1000?
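A hedged sketch of how the setter could look with both suggestions applied; the enclosing builder class, the `MAX_BUCKETS` constant and its value are assumptions, not code from this PR:

```java
// Hypothetical cap suggested in the review; 1000 is only an example value.
// Whether it should reject outright or merely warn (a true "soft" limit) is left open.
static final int MAX_BUCKETS = 1000;

public AutoDateHistogramAggregationBuilder numBuckets(int numBuckets) {
    if (numBuckets <= 1) {
        // a single bucket is not useful for a histogram, so require > 1
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName()
                + " must be greater than 1 for [" + name + "]");
    }
    if (numBuckets > MAX_BUCKETS) {
        throw new IllegalArgumentException(NUM_BUCKETS_FIELD.getPreferredName()
                + " must be less than or equal to " + MAX_BUCKETS + " for [" + name + "]");
    }
    this.numBuckets = numBuckets;
    return this;
}
```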
private final Rounding[] roundings;
private int roundingIdx = 0;

private LongHash bucketOrds;
since the number of buckets is contained anyway, we might want to use a LongToIntHashMap instead, which has less overhead
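If that route were taken, a rough sketch of the collect path using a primitive long-to-int map such as HPPC's `com.carrotsearch.hppc.LongIntHashMap` (which is presumably the kind of map meant here); the surrounding names are taken from the snippets above, the rest is assumption:

```java
// rounded timestamp -> dense bucket ordinal; stays small since buckets are bounded
private LongIntHashMap bucketOrds = new LongIntHashMap();

void collectValue(LeafBucketCollector sub, int doc, long value) throws IOException {
    long rounded = roundings[roundingIdx].round(value);
    if (bucketOrds.containsKey(rounded)) {
        collectExistingBucket(sub, doc, bucketOrds.get(rounded));
    } else {
        int bucketOrd = bucketOrds.size();   // next dense ordinal
        bucketOrds.put(rounded, bucketOrd);
        collectBucket(sub, doc, bucketOrd);
    }
}
```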
            collectExistingBucket(sub, doc, bucketOrd);
        } else {
            collectBucket(sub, doc, bucketOrd);
            while (bucketOrds.size() > targetBuckets) {
I think we also need to make sure we are not on the last level of rounding already? Otherwise passing numBuckets = 2 and 3 dates that are in 3 different centuries could cause out-of-bounds exceptions?
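In code, the guard could be as simple as stopping the promotion loop once the coarsest rounding is in use (variable names taken from the snippets above):

```java
// don't promote past the last (coarsest) rounding, even if still over the target
while (bucketOrds.size() > targetBuckets && roundingIdx < roundings.length - 1) {
    increaseRounding();
}
```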
        } else {
            collectBucket(sub, doc, bucketOrd);
            while (bucketOrds.size() > targetBuckets) {
                increaseRounding();
this operation does two things:
- compute the merge map (cheap since the number of buckets is small)
- recompute buffered buckets (potentially expensive since the number of docs is not bounded)
Yet the merge map alone is enough to know how many buckets we will have in the end, so I think it should do something like this instead. I'm unsure how often it would be applied in practice, but it doesn't make things more complex and would be safer?
if (bucketOrds.size() > targetBuckets) {
    long[] mergeMap;
    do {
        mergeMap = increaseRounding();
    } while (mergeMap.length > targetBuckets);
    // now renumber buckets
}
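A slightly more concrete sketch of that suggestion, assuming `increaseRounding()` is changed to only re-key the `LongHash` and return the old-to-new ordinal map, leaving the doc-count renumbering to a single `mergeBuckets` call once the loop settles; everything beyond the names in the snippets above is an assumption:

```java
private long[] increaseRounding() {
    roundingIdx++;
    LongHash newBucketOrds = new LongHash(bucketOrds.size(), bigArrays);
    long[] mergeMap = new long[(int) bucketOrds.size()];
    for (long oldOrd = 0; oldOrd < bucketOrds.size(); oldOrd++) {
        // cheap: only the keys are rewritten, per-bucket doc counts are untouched
        long newKey = roundings[roundingIdx].round(bucketOrds.get(oldOrd));
        long newOrd = newBucketOrds.add(newKey);
        if (newOrd < 0) {
            newOrd = -1 - newOrd; // key was already present
        }
        mergeMap[(int) oldOrd] = newOrd;
    }
    // (the previous bucketOrds would need to be released here)
    bucketOrds = newBucketOrds;
    return mergeMap;
}
```

The number of surviving buckets after a step is just `bucketOrds.size()`, so the caller can keep promoting the rounding until that drops to the target before touching any doc counts.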
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; when it has created more than the target number of buckets it merges these into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time. This aggregation intentionally does not support min_doc_count, offset and extended_bounds, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary. Closes #9572
closing in favour of #28993 (same code but branch living on origin so @pcsanwald and I can work on the branch together)
elastic#26659 was closed, not merged. (The followup PR didn't make it.)
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; when it has created more than the target number of buckets it merges these into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support min_doc_count, offset and extended_bounds, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

This is also the first aggregation to merge buckets at collection time, but having this aggregation will open the door to other such aggregations that merge buckets at collection time.
Things still to do:
- Collect `10 * target` buckets on the shards and then use the extra buckets to create buckets representing multiples of the interval, for the cases where we have too many buckets for one interval but if we increase the interval we get only 1 (or a few) bucket(s) (e.g. if the target is 10 buckets and we end up with 30 x minute-level buckets, increasing the rounding would give 1 x hour bucket, so instead can we merge every 3 minute-level buckets to get 10 x 3-minute buckets?) - this will probably be moved out into a separate issue to be tackled when this first pass is merged
- An `auto_histogram` aggregation to work on numeric (rather than date) fields - again this might be moved into a separate issue

Closes #9572
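Purely for illustration, here is a self-contained toy sketch of the collect-then-merge idea described above, using plain java.time, hypothetical class and variable names, and fixed-duration rounding only: documents are bucketed at the finest interval, and whenever the bucket count exceeds the target everything is merged into the next coarser interval.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.TreeMap;

public class AutoIntervalSketch {

    // Coarsening ladder, finest to coarsest. The real aggregation would use
    // calendar-aware roundings (up to years); fixed durations keep the toy simple.
    private static final Duration[] INTERVALS = {
            Duration.ofSeconds(1), Duration.ofMinutes(1), Duration.ofHours(1), Duration.ofDays(1)
    };

    public static void main(String[] args) {
        List<Instant> docs = List.of(
                Instant.parse("2018-03-01T10:15:03Z"),
                Instant.parse("2018-03-01T10:15:45Z"),
                Instant.parse("2018-03-01T10:47:10Z"),
                Instant.parse("2018-03-01T13:02:00Z"),
                Instant.parse("2018-03-02T09:30:00Z"));

        int targetBuckets = 3;
        int intervalIdx = 0;
        TreeMap<Long, Integer> buckets = new TreeMap<>(); // rounded epoch millis -> doc count

        for (Instant doc : docs) {
            buckets.merge(round(doc, INTERVALS[intervalIdx]), 1, Integer::sum);
            // Too many buckets: promote to the next coarser interval and merge,
            // the toy equivalent of increaseRounding()/mergeBuckets() in the PR.
            while (buckets.size() > targetBuckets && intervalIdx < INTERVALS.length - 1) {
                intervalIdx++;
                Duration coarser = INTERVALS[intervalIdx];
                TreeMap<Long, Integer> merged = new TreeMap<>();
                buckets.forEach((key, count) ->
                        merged.merge(round(Instant.ofEpochMilli(key), coarser), count, Integer::sum));
                buckets = merged;
            }
        }
        System.out.println("interval=" + INTERVALS[intervalIdx] + " buckets=" + buckets);
    }

    private static long round(Instant timestamp, Duration interval) {
        long millis = interval.toMillis();
        return Math.floorDiv(timestamp.toEpochMilli(), millis) * millis;
    }
}
```

For these five timestamps and a target of 3 the sketch settles on an hourly interval with three buckets; the real aggregator performs the same escalation with `Rounding` instances and merges `LongHash` ordinals (and deferred sub-aggregations) rather than map keys.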