Support dynamic interval and fixed buckets for histogram aggregation #9572
Comments
It would indeed be a neat feature to dynamically guess the ranges. However, it seems difficult to do in a single request in a distributed environment, as we don't know the data on other shards. So for now as a workaround I would do two requests of the same query. The first one to get the ranges and the second one to perform the aggregation on those ranges. For more discussion on a similar feature request see: #9531 |
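As an illustration only (not from the original comment), here is a minimal sketch of that two-request workaround against a hypothetical numeric `price` field. The first request fetches the field's bounds with a `stats` aggregation:

```json
{
  "size": 0,
  "aggs": {
    "price_stats": { "stats": { "field": "price" } }
  }
}
```

The client can then derive an interval, for example `(max - min) / desired_bucket_count`, and issue the actual histogram with it:

```json
{
  "size": 0,
  "aggs": {
    "price_histogram": {
      "histogram": { "field": "price", "interval": 100 }
    }
  }
}
```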
Yeah, I was really hoping to avoid separate queries, as performance is fairly critical for us. |
Yes, but the second query should execute much faster thanks to the file system cache, not to mention any filters in the query that may also be cached. |
Hi,

```
{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "MM-dd-yyyy",
        "interval": "week"
      }
    }
  }
}
```

If we use a filter with date ranges as follows, starting from Jan 1 YEAR:

```
{
  "range": {
    "document.date": {
      "from": "2015-01-01",
      "to": "2015-02-10"
    }
  }
}
```

then in the result buckets we get an extra week from the past year's last month, Dec NDAY YEAR:

```
{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
          HERE => "key_as_string": "12-29-2014",
          "key": 1419811200000,
          "doc_count": 29
        },
        {
          "key_as_string": "01-05-2015",
          "key": 1420416000000,
          "doc_count": 39
        },
        {
          "key_as_string": "01-12-2015",
          "key": 1421020800000,
          "doc_count": 45
        },
        {
          "key_as_string": "01-19-2015",
          "key": 1421625600000,
          "doc_count": 55
        },
        {
          "key_as_string": "01-26-2015",
          "key": 1422230400000,
          "doc_count": 29
        }
      ]
    }
  }
}
```

Is this standard behaviour of Elasticsearch date histogram aggregations, or is this issue part of the current discussion? |
@imanhodjaev I believe that is because the date math code includes the entire week instead of a partial week. It does not truncate the edges of the range. |
I'd love to vote for this issue. It severely limits my use case for date_histogram. I'm doing date_histogram as a sub-aggregation to a terms aggregation, so I'm getting one histogram per term bucket. If I can't define the interval dynamically per bucket, then I have to do hundreds of queries instead of just one. At the moment, I don't care much about the cost of server-side computation. The client can't be expected to perform hundreds of separate queries in my real-time interactive application, so having this feature is critical. I've attempted to work around it using Groovy scripts to calculate the interval, which works, but once again I'm blocked, as I can't refer to the result of my script inside the `interval` argument of the date_histogram. Is there another workaround that doesn't multiply queries? Or can this feature be reconsidered? Thanks! |
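For context, a minimal sketch of that request shape, with assumed field names (`category`, `date`); the `interval` is a single static value for the whole request, so it cannot vary per term bucket:

```json
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "scores_by_date": {
          "date_histogram": { "field": "date", "interval": "week" }
        }
      }
    }
  }
}
```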
+1 on the original use case of dynamic range histogram buckets for numeric (non-datetime fields). This is a pretty common use case for commerce applications. |
Stalled by #12316. It is still possible to implement this feature by doing two requests as described above. |
+1 for a workaround without using a two-step request to Elasticsearch. Auto-calculating the ranges looks like a great idea, although I can see that implementing this can be tricky given the number of things that we should consider. |
Copying my comments from a duplicate issue: This would be a boon for Kibana because it would allow us to easily auto-scale the histogram visualization and protect users from accidentally creating too many buckets, crashing their browser in the process. We could accomplish this by issuing two requests from the client, the first grabbing the min/max values for the given field. But since the aggregation may have a search query and/or filters associated with it, we don't know how performant this would be. I imagine it would be more efficient to implement this option inside Elasticsearch so the calculations can be done in a single request. |
This would be similar to MongoDB $bucketAuto aggregation. I would love to see something similar in ES. |
This issue was discussed today; here is a brief summary: one way to implement it would be to request the min/max values first, like @Bargs mentioned. Doing it in Elasticsearch would have little benefit compared to doing it on the client side. Another way to implement it would be for each shard to adapt the interval on the fly and then, at reduce time, pick the larger interval. However, the latter would require us to add ways to reuse buckets, which we do not have today but may be required by other aggs such as clustering (including percentiles). The recommended way to work around the lack of support for dynamic intervals for now is to issue two requests to Elasticsearch: one to figure out the min/max values of the field, and another one with the histogram aggregation. |
@alexksikes Can you point me to an example of the two-query approach you are suggesting? |
* Adds a new auto-interval date histogram. This change adds a new type of histogram aggregation called `auto_date_histogram` where you specify the target number of buckets you require and it finds an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at a second-level interval; once it has created more than the target number of buckets, it merges them into minute-level buckets and continues collecting until it reaches the target number of buckets again. It keeps merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time. This aggregation intentionally does not support min_doc_count, offset and extended_bounds, to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but always operates in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that empty buckets are inserted where necessary. Closes #9572
* Adds documentation
* Added sub aggregator test
* Fixes failing docs test
* Brings branch up to date with master changes
* Trying to get tests to pass again
* Fixes multiBucketConsumer accounting
* Collects more buckets than needed on shards. This gives us more options at reduce time in terms of how we do the final merge of the buckets to produce the final result
* Revert "Collects more buckets than needed on shards". This reverts commit 993c782.
* Adds ability to merge within a rounding
* Fixes non-timezone doc test failure
* Fix time zone tests
* Iterates on tests
* Adds test case and documentation changes. Added some notes in the documentation about the intervals that can be returned. Also added a test case that utilises the merging of consecutive buckets
* Fixes performance bug. The bug meant that getAppropriateRounding took a huge amount of time if the range of the data was large but also sparsely populated. In these situations the rounding would be very low, so iterating through the rounding values from the min key to the max key took a long time (~120 seconds in one test). The solution is to add a rough estimate first, which chooses the rounding based just on the long values of the min and max keys alone but selects the rounding one lower than the one it thinks is appropriate, so that the accurate method can choose the final rounding taking into account the fact that intervals are not always fixed length. The commit also adds more tests
* Changes to only do complex reduction on final reduce
* Merge latest with master
* Correct tests and add a new test case for 10k buckets
* Refactor to perform bucket number check in innerBuild
* Correctly derive bucket setting, update tests to increase bucket threshold
* Fix checkstyle
* Address code review comments
* Add documentation for default buckets
* Fix typo
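As a usage illustration (the field name and bucket target are assumptions), the new aggregation is requested with a target bucket count instead of a fixed interval:

```json
{
  "aggs": {
    "dates_over_time": {
      "auto_date_histogram": {
        "field": "date",
        "buckets": 10
      }
    }
  }
}
```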
I believe this issue should still be open. #28993 only supports date histograms. There needs to be a similar PR for non-date, regular histograms. |
@melissachang you are absolutely right that we need to support numeric histograms, but I've created a separate issue for that, #31828. We can track that work using this new issue. Thank you! |
Just for the sake of completeness, here's a link for Sturges' rule, which is a sound approach to determining bin counts based merely on sample count, assuming a distribution reasonably close to a normal distribution. |
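For reference, Sturges' rule derives the bin count `k` from the sample count `n` as `k = ⌈log2(n)⌉ + 1`; for example, `n = 1000` gives `k = 10 + 1 = 11` bins.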
@monfera did you figure out a way to use Sturges' rule to determine the number of buckets? Someone else asked this question a few years ago as well: https://discuss.elastic.co/t/histogram-and-sturges-formula/14133 |
Implements a new histogram aggregation called `variable_width_histogram` which dynamically determines bucket intervals based on document groupings. These groups are determined by running a one-pass clustering algorithm on each shard and then reducing each shard's clusters using an agglomerative clustering algorithm. This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The algorithm was lightly inspired by [this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches a small number of documents to sample the data and determine initial clusters. Subsequent documents are then placed into one of these clusters, or a new one if they are an outlier. This algorithm is described in more detail in the aggregation's docs.

At reduce time, a [hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304) continually merges the closest buckets from all shards (based on their centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's min value is used as its key in the histogram. Furthermore, buckets are merged based on their centroids and not their bounds, so it is possible that adjacent buckets will overlap after reduction. Because each bucket's key is its min, this overlap is not shown in the final histogram. However, when such overlap occurs, we set the key of the bucket with the larger centroid to the midpoint between its minimum and the smaller bucket's maximum: `min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to increase the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In the future, resolving #50863 would let us solve this issue.

It doesn't make sense for this aggregation to support the `min_doc_count` parameter, since clusters are determined dynamically. The `order` parameter is not supported here to keep this large PR from becoming too complex.
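As a usage illustration (the field name and bucket target are assumptions), the aggregation takes a target number of buckets and determines the bucket bounds itself:

```json
{
  "aggs": {
    "price_clusters": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 5
      }
    }
  }
}
```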
I am not sure if I got it right, but here https://github.com/findinpath/elastic-price-range-aggregation/blob/master/src/test/kotlin/com/findinpath/PriceRangeAggregationTest.kt#L119 you can find an implementation of the two-query approach suggested. |
+1 for the Support dynamic interval and fixed buckets for histogram aggregation |
Wow, it's already 6 years and still waiting... |
So there is no dynamic interval yet? |
still waiting for this feature |
Welcome from 2023! So, we are still waiting :) |
This is basically the opposite of how the histogram currently acts: fixed interval and dynamic number of buckets.
My use case is that I would like to build a price range style aggregation for filtering products:
$0 - $100
$101 - $200
$201 - $300
$301 - $400
$401 - $500
While you can do this with the range agg if you know in advance the lower and upper bounds of all your prices, I can't see a way to do this dynamically based on the current query. The range agg also suffers from other shortcomings, such as generating buckets that may contain no documents.
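For comparison, a minimal sketch of that fixed `range` aggregation approach, assuming a `price` field and hard-coded bounds:

```json
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200, "to": 300 },
          { "from": 300, "to": 400 },
          { "from": 400, "to": 500 }
        ]
      }
    }
  }
}
```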
Ideally, I'd like to be able to specify the number of ranges / buckets I want (eg. 5) with a dynamic interval calculated by the aggregation.
min_doc_count could also come into effect here, to ensure that we don't generate buckets that contain no products (e.g. min_doc_count: 0), in which case the number of buckets just becomes a maximum number of potential buckets rather than a guaranteed amount.