
Support dynamic interval and fixed buckets for histogram aggregation #9572

Closed
rayward opened this issue Feb 4, 2015 · 23 comments

Comments

@rayward commented Feb 4, 2015

This is basically the opposite of how the histogram currently acts: fixed interval and dynamic number of buckets.

My use case is that I would like to build a price range style aggregation for filtering products:

$0 - $100
$101 - $200
$201 - $300
$301 - $400
$401 - $500

While you can do this with the range agg if you know the lower and upper bounds of all your prices in advance, I can't see a way to do this dynamically based on the current query. The range agg also suffers from other shortcomings, such as generating buckets that may contain no documents.
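
For reference, the fixed-bounds range aggregation would look roughly like this (assuming a numeric `price` field; the field name is only an illustration):

{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200, "to": 300 },
          { "from": 300, "to": 400 },
          { "from": 400, "to": 500 }
        ]
      }
    }
  }
}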

Ideally, I'd like to be able to specify the number of ranges/buckets I want (e.g. 5) with a dynamic interval calculated by the aggregation.

min_doc_count could also come into effect here: with a min_doc_count greater than 0 we wouldn't generate buckets that contain no products, in which case the requested number of buckets becomes a maximum number of potential buckets rather than a guaranteed amount.

@alexksikes (Contributor)

It would indeed be a neat feature to dynamically guess the ranges. However, it seems difficult to do in a single request in a distributed environment, as we don't know the data on other shards. So for now, as a workaround, I would do two requests with the same query: the first to get the ranges, and the second to perform the aggregation on those ranges.
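
To sketch the first of those two requests (assuming a numeric `price` field), fetch the bounds with min/max aggregations and derive the ranges or interval client-side before issuing the second request:

{
  "size": 0,
  "aggs": {
    "min_price": { "min": { "field": "price" } },
    "max_price": { "max": { "field": "price" } }
  }
}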

For more discussion on a similar feature request see: #9531

@rayward (Author) commented Feb 5, 2015

Yeah, I was really hoping to avoid separate queries, as performance is fairly critical for us.

@alexksikes (Contributor)

Yes, but the second query should execute much faster thanks to the file system cache, not to mention any filters in the query that may also be cached.

@sultaniman

Hi,
I am relatively new to Elasticsearch, but I found an interesting thing about date_histogram. We have an aggregation query with interval=week:

{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "MM-dd-yyyy",
        "interval": "week"
      }
    }
  }
}

If we use a filter with a date range starting from Jan 1, 2015:

{
  "range": {
    "document.date":  {
      "from": "2015-01-01",
      "to": "2015-02-10"
     }
  }
}

In the result buckets we get an extra week that starts in December of the previous year (Dec 29, 2014):

{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
          "key_as_string": "12-29-2014",   <== HERE
          "key": 1419811200000,
          "doc_count": 29
        },
        {
          "key_as_string": "01-05-2015",
          "key": 1420416000000,
          "doc_count": 39
        },
        {
          "key_as_string": "01-12-2015",
          "key": 1421020800000,
          "doc_count": 45
        },
        {
          "key_as_string": "01-19-2015",
          "key": 1421625600000,
          "doc_count": 55
        },
        {
          "key_as_string": "01-26-2015",
          "key": 1422230400000,
          "doc_count": 29
        }
      ]
    }
  }
}

Is this standard behaviour of Elasticsearch date histogram aggregations, or is this issue part of the current discussion?

@rjernst (Member) commented Feb 10, 2015

@imanhodjaev I believe that is because the date math code includes the entire week instead of a partial week; it does not truncate the edges of the range. Week buckets are keyed by the start of the week, so documents from Jan 1 to Jan 4, 2015 fall into the bucket that begins on Monday, Dec 29, 2014.

@OPtoss commented Apr 23, 2015

I'd love to vote for this issue. It's severely limiting what I can do with date_histograms. I'm doing a date_histogram as a sub-aggregation to a terms aggregation, so I'm getting one histogram per term bucket. If I can't define the interval dynamically per bucket, then I have to do hundreds of queries instead of just one.

At the moment, I don't care much about the cost of server-side computation. The client can't be expected to perform hundreds of separate queries in my real-time interactive application, so having this feature is critical.

I've attempted to work around it using Groovy scripts to calculate the interval, which works, but once again I'm blocked, as I can't refer to the result of my script inside the "interval" argument of the date_histogram.

Is there another workaround that doesn't multiply queries? Or can this feature be reconsidered? Thanks!

@ppf2 (Member) commented Jun 25, 2015

+1 on the original use case of dynamic-range histogram buckets for numeric (non-datetime) fields. This is a pretty common use case for commerce applications.

@jpountz (Contributor) commented Jul 31, 2015

Stalled by #12316. It is still possible to implement this feature by doing two requests as described above.

@gmoskovicz (Contributor)

+1 for a workaround that doesn't require a two-step request to Elasticsearch. Auto-calculating the ranges looks like a great idea, although I can see that implementing this can be tricky given the number of things we would need to consider.

@Bargs commented Sep 21, 2016

Copying my comments from a duplicate issue:

This would be a boon for Kibana because it would allow us to easily auto-scale the histogram visualization and protect users from accidentally creating too many buckets, crashing their browser in the process.

We could accomplish this by issuing two requests from the client, the first grabbing the min/max values for the given field. But since the aggregation may have a search query and/or filters associated with it, we don't know how performant this would be. I imagine it would be more efficient to implement this option inside Elasticsearch so the calculations can be done in a single request.

@bompus commented Dec 31, 2016

This would be similar to MongoDB's $bucketAuto aggregation stage. I would love to see something similar in ES.
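
For comparison, the MongoDB stage mentioned above looks roughly like this (the price field is just an illustration):

{ "$bucketAuto": { "groupBy": "$price", "buckets": 5 } }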

@jpountz (Contributor) commented Feb 27, 2017

This issue was discussed today; here is a brief summary: one way to implement it would be to request the min/max values first, like @Bargs mentioned. Doing it in Elasticsearch would have little benefit compared to doing it on the client side. Another way to implement it would be for each shard to adapt the interval on the fly and then pick the largest interval at reduce time. However, the latter would require us to add ways to reuse buckets, which we do not have today but may be required by other aggs such as clustering (including percentiles). The recommended way to work around the lack of support for dynamic intervals for now is to issue two requests to Elasticsearch: one to figure out the min/max values of the field, and another one with the histogram aggregation.
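
As a rough sketch, if the first request reported a min of 0 and a max of 500 and five buckets are wanted, the second request could use a client-derived fixed interval (the field name and values here are only illustrative):

{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 100,
        "min_doc_count": 1
      }
    }
  }
}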

@vaibhavtripathi

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

pcsanwald pushed a commit that referenced this issue Jul 13, 2018
* Adds a new auto-interval date histogram

This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require, and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; when it has created more than the target number of buckets it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It will keep merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support min_doc_count, offset and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

Closes #9572

* Adds documentation

* Added sub aggregator test

* Fixes failing docs test

* Brings branch up to date with master changes

* trying to get tests to pass again

* Fixes multiBucketConsumer accounting

* Collects more buckets than needed on shards

This gives us more options at reduce time in terms of how we do the
final merge of the buckets to produce the final result

* Revert "Collects more buckets than needed on shards"

This reverts commit 993c782.

* Adds ability to merge within a rounding

* Fixes non-timezone doc test failure

* Fix time zone tests

* iterates on tests

* Adds test case and documentation changes

Added some notes in the documentation about the intervals that can be
returned.

Also added a test case that utilises the merging of consecutive buckets

* Fixes performance bug

The bug meant that getting the appropriate rounding took a huge amount of time
if the range of the data was large but also sparsely populated. In
these situations the rounding would be very low, so iterating through
the rounding values from the min key to the max key took a long time
(~120 seconds in one test).

The solution is to add a rough estimate first which chooses the
rounding based just on the long values of the min and max keys alone
but selects the rounding one lower than the one it thinks is
appropriate so the accurate method can choose the final rounding taking
into account the fact that intervals are not always fixed length.

The commit also adds more tests

* Changes to only do complex reduction on final reduce

* merge latest with master

* correct tests and add a new test case for 10k buckets

* refactor to perform bucket number check in innerBuild

* correctly derive bucket setting, update tests to increase bucket threshold

* fix checkstyle

* address code review comments

* add documentation for default buckets

* fix typo
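
For reference, a minimal request using the new `auto_date_histogram` aggregation looks roughly like this (the field name and bucket count are placeholders):

{
  "aggs": {
    "scores_by_date": {
      "auto_date_histogram": {
        "field": "date",
        "buckets": 10
      }
    }
  }
}
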
pcsanwald pushed a commit that referenced this issue Jul 16, 2018
* Adds a new auto-interval date histogram

pcsanwald pushed a commit to pcsanwald/elasticsearch that referenced this issue Jul 16, 2018
* Adds a new auto-interval date histogram

@melissachang

I believe this issue should still be open. #28993 only supports date histograms. There needs to be a similar PR for non-date, regular histograms.

@pcsanwald (Contributor)

@melissachang you are absolutely right that we need to support numeric histograms; I've created a separate issue for that, #31828, and we can track that work there. Thank you!

@monfera commented Sep 14, 2018

Just for the sake of completeness, here's a link for Sturges' Rule, which is a sound approach to determining bin counts based merely on the sample count, assuming a distribution reasonably close to normal.
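
(For reference, Sturges' rule gives k = ⌈log2(n)⌉ + 1 bins for a sample of n values.)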

@Rapster commented Apr 26, 2019

@monfera did you figure out a way to use Sturges' rule to determine the number of buckets? Someone else asked this question a few years ago as well: https://discuss.elastic.co/t/histogram-and-sturges-formula/14133

nik9000 pushed a commit that referenced this issue Jun 23, 2020
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more detail in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increase the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving #50863
would let us solve this issue. 

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
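
For reference, a minimal request for this aggregation looks roughly like this (the field name and bucket count are placeholders):

{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 5
      }
    }
  }
}
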
nik9000 pushed a commit to nik9000/elasticsearch that referenced this issue Jun 23, 2020
Implements a new histogram aggregation called `variable_width_histogram`

nik9000 added a commit that referenced this issue Jun 25, 2020
Implements a new histogram aggregation called `variable_width_histogram`

Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
@findinpath

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

I am not sure if I got it right, but at https://github.com/findinpath/elastic-price-range-aggregation/blob/master/src/test/kotlin/com/findinpath/PriceRangeAggregationTest.kt#L119 you can find an implementation of the suggested two-query approach.

@mjose007 commented Oct 8, 2020

+1 for supporting a dynamic interval and fixed buckets for histogram aggregation.

@KPGDL commented Feb 2, 2022

Wow, it's already been 6 years and we're still waiting...

@tmushayahama

So there is no dynamic interval yet?

@shu-bham

Still waiting for this feature.

@PendalF89

Welcome from 2023! So, we are still waiting :)
