
Support dynamic interval and fixed buckets for histogram aggregation #9572

Closed
rayward opened this issue Feb 4, 2015 · 23 comments

Comments

@rayward commented Feb 4, 2015

This is basically the opposite of how the histogram currently acts: fixed interval and dynamic number of buckets.

My use case is that I would like to build a price range style aggregation for filtering products:

$0 - $100
$101 - $200
$201 - $300
$301 - $400
$401 - $500

While you can do this with the range agg if you know the lower and upper bounds of all your prices in advance, I can't see a way to do this dynamically based on the current query. The range agg also suffers from other shortcomings, such as generating buckets that may contain no documents.
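
For reference, the fixed-bounds range aggregation would look roughly like this (assuming a numeric `price` field; the field name is only an illustration):

{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200, "to": 300 },
          { "from": 300, "to": 400 },
          { "from": 400, "to": 500 }
        ]
      }
    }
  }
}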

Ideally, I'd like to be able to specify the number of ranges/buckets I want (e.g. 5) with a dynamic interval calculated by the aggregation.

min_doc_count could also come into effect here: with a min_doc_count greater than 0 we wouldn't generate buckets that contain no products, in which case the requested number of buckets becomes a maximum number of potential buckets rather than a guaranteed amount.

@alexksikes (Contributor)

It would indeed be a neat feature to dynamically guess the ranges. However, it seems difficult to do in a single request in a distributed environment, as we don't know the data on other shards. So for now, as a workaround, I would do two requests with the same query: the first to get the ranges, and the second to perform the aggregation on those ranges.
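
To sketch the first of those two requests (assuming a numeric `price` field), fetch the bounds with min/max aggregations and derive the ranges or interval client-side before issuing the second request:

{
  "size": 0,
  "aggs": {
    "min_price": { "min": { "field": "price" } },
    "max_price": { "max": { "field": "price" } }
  }
}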

For more discussion on a similar feature request see: #9531

@rayward (Author) commented Feb 5, 2015

Yeah, I was really hoping to avoid separate queries, as performance is fairly critical for us.

@alexksikes (Contributor)

Yes, but the second query should execute much faster thanks to the file system cache, not to mention any filters in the query that may also be cached.

@sultaniman

Hi,
I am relatively new to Elasticsearch, but I found an interesting thing about date_histogram. We have an aggregation query with interval=week:

{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "MM-dd-yyyy",
        "interval": "week"
      }
    }
  }
}

If we use a filter with a date range starting from Jan 1, 2015:

{
  "range": {
    "document.date":  {
      "from": "2015-01-01",
      "to": "2015-02-10"
     }
  }
}

In the result buckets we get an extra week that starts in December of the previous year (Dec 29, 2014):

{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
          "key_as_string": "12-29-2014",   <== HERE
          "key": 1419811200000,
          "doc_count": 29
        },
        {
          "key_as_string": "01-05-2015",
          "key": 1420416000000,
          "doc_count": 39
        },
        {
          "key_as_string": "01-12-2015",
          "key": 1421020800000,
          "doc_count": 45
        },
        {
          "key_as_string": "01-19-2015",
          "key": 1421625600000,
          "doc_count": 55
        },
        {
          "key_as_string": "01-26-2015",
          "key": 1422230400000,
          "doc_count": 29
        }
      ]
    }
  }
}

Is this standard behaviour of Elasticsearch date histogram aggregations, or is this issue part of the current discussion?

@rjernst (Member) commented Feb 10, 2015

@imanhodjaev I believe that is because the date math code includes the entire week instead of a partial week; it does not truncate the edges of the range. Week buckets are keyed by the start of the week, so documents from Jan 1 to Jan 4, 2015 fall into the bucket that begins on Monday, Dec 29, 2014.

@OPtoss commented Apr 23, 2015

I'd love to vote for this issue. It's severely limiting what I can do with date_histograms. I'm doing a date_histogram as a sub-aggregation to a terms aggregation, so I'm getting one histogram per term bucket. If I can't define the interval dynamically per bucket, then I have to do hundreds of queries instead of just one.

At the moment, I don't care much about the cost of server-side computation. The client can't be expected to perform hundreds of separate queries in my real-time interactive application, so having this feature is critical.

I've attempted to work around it using Groovy scripts to calculate the interval, which works, but once again I'm blocked, as I can't refer to the result of my script inside the "interval" argument of the date_histogram.

Is there another workaround that doesn't multiply queries? Or can this feature be reconsidered? Thanks!

@ppf2 (Member) commented Jun 25, 2015

+1 on the original use case of dynamic-range histogram buckets for numeric (non-datetime) fields. This is a pretty common use case for commerce applications.

@jpountz (Contributor) commented Jul 31, 2015

Stalled by #12316. It is still possible to implement this feature by doing two requests as described above.

@gmoskovicz (Contributor)

+1 for a workaround that doesn't require a two-step request to Elasticsearch. Auto-calculating the ranges looks like a great idea, although I can see that implementing this can be tricky given the number of things we would need to consider.

@Bargs commented Sep 21, 2016

Copying my comments from a duplicate issue:

This would be a boon for Kibana because it would allow us to easily auto-scale the histogram visualization and protect users from accidentally creating too many buckets, crashing their browser in the process.

We could accomplish this by issuing two requests from the client, the first grabbing the min/max values for the given field. But since the aggregation may have a search query and/or filters associated with it, we don't know how performant this would be. I imagine it would be more efficient to implement this option inside Elasticsearch so the calculations can be done in a single request.

@bompus commented Dec 31, 2016

This would be similar to MongoDB's $bucketAuto aggregation stage. I would love to see something similar in ES.
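
For comparison, the MongoDB stage mentioned above looks roughly like this (the price field is just an illustration):

{ "$bucketAuto": { "groupBy": "$price", "buckets": 5 } }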

@jpountz (Contributor) commented Feb 27, 2017

This issue was discussed today; here is a brief summary: one way to implement it would be to request the min/max values first, like @Bargs mentioned. Doing it in Elasticsearch would have little benefit compared to doing it on the client side. Another way to implement it would be for each shard to adapt the interval on the fly and then pick the largest interval at reduce time. However, the latter would require us to add ways to reuse buckets, which we do not have today but may be required by other aggs such as clustering (including percentiles). The recommended way to work around the lack of support for dynamic intervals for now is to issue two requests to Elasticsearch: one to figure out the min/max values of the field, and another one with the histogram aggregation.
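
As a rough sketch, if the first request reported a min of 0 and a max of 500 and five buckets are wanted, the second request could use a client-derived fixed interval (the field name and values here are only illustrative):

{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 100,
        "min_doc_count": 1
      }
    }
  }
}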

@vaibhavtripathi

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

pcsanwald pushed a commit that referenced this issue Jul 13, 2018
* Adds a new auto-interval date histogram

This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require, and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents into buckets at the second interval; when it has created more than the target number of buckets it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It will keep merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support min_doc_count, offset and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

Closes #9572

* Adds documentation

* Added sub aggregator test

* Fixes failing docs test

* Brings branch up to date with master changes

* trying to get tests to pass again

* Fixes multiBucketConsumer accounting

* Collects more buckets than needed on shards

This gives us more options at reduce time in terms of how we do the
final merge of the buckets to produce the final result

* Revert "Collects more buckets than needed on shards"

This reverts commit 993c782.

* Adds ability to merge within a rounding

* Fixes non-timezone doc test failure

* Fix time zone tests

* iterates on tests

* Adds test case and documentation changes

Added some notes in the documentation about the intervals that can be
returned.

Also added a test case that utilises the merging of consecutive buckets

* Fixes performance bug

The bug meant that getting the appropriate rounding took a huge amount of time
if the range of the data was large but also sparsely populated. In
these situations the rounding would be very low, so iterating through
the rounding values from the min key to the max key took a long time
(~120 seconds in one test).

The solution is to add a rough estimate first which chooses the
rounding based just on the long values of the min and max keys alone
but selects the rounding one lower than the one it thinks is
appropriate so the accurate method can choose the final rounding taking
into account the fact that intervals are not always fixed length.

The commit also adds more tests

* Changes to only do complex reduction on final reduce

* merge latest with master

* correct tests and add a new test case for 10k buckets

* refactor to perform bucket number check in innerBuild

* correctly derive bucket setting, update tests to increase bucket threshold

* fix checkstyle

* address code review comments

* add documentation for default buckets

* fix typo
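
For reference, a minimal request using the new `auto_date_histogram` aggregation looks roughly like this (the field name and bucket count are placeholders):

{
  "aggs": {
    "scores_by_date": {
      "auto_date_histogram": {
        "field": "date",
        "buckets": 10
      }
    }
  }
}
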
pcsanwald pushed a commit that referenced this issue Jul 16, 2018
* Adds a new auto-interval date histogram

pcsanwald pushed a commit to pcsanwald/elasticsearch that referenced this issue Jul 16, 2018
* Adds a new auto-interval date histogram

@melissachang

I believe this issue should still be open. #28993 only supports date histograms. There needs to be a similar PR for non-date, regular histograms.

@pcsanwald (Contributor)

@melissachang you are absolutely right that we need to support numeric histograms; I've created a separate issue for that, #31828, and we can track that work there. Thank you!

@monfera commented Sep 14, 2018

Just for the sake of completeness, here's a link for Sturges' Rule, which is a sound approach to determining bin counts based merely on the sample count, assuming a distribution reasonably close to normal.
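
(For reference, Sturges' rule gives k = ⌈log2(n)⌉ + 1 bins for a sample of n values.)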

@Rapster commented Apr 26, 2019

@monfera did you figure out a way to use Sturges' rule to determine the number of buckets? Someone else asked this question a few years ago as well: https://discuss.elastic.co/t/histogram-and-sturges-formula/14133

nik9000 pushed a commit that referenced this issue Jun 23, 2020
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more detail in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increase the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving #50863
would let us solve this issue. 

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
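
For reference, a minimal request for this aggregation looks roughly like this (the field name and bucket count are placeholders):

{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 5
      }
    }
  }
}
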
nik9000 pushed a commit to nik9000/elasticsearch that referenced this issue Jun 23, 2020
Implements a new histogram aggregation called `variable_width_histogram`

nik9000 added a commit that referenced this issue Jun 25, 2020
Implements a new histogram aggregation called `variable_width_histogram`

Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
@findinpath

@alexksikes Can you point me to an example of the two-query approach you are suggesting?

I am not sure if I got it right, but at https://github.com/findinpath/elastic-price-range-aggregation/blob/master/src/test/kotlin/com/findinpath/PriceRangeAggregationTest.kt#L119 you can find an implementation of the suggested two-query approach.

@mjose007 commented Oct 8, 2020

+1 for supporting a dynamic interval and fixed buckets for histogram aggregation.

@KPGDL commented Feb 2, 2022

Wow, it's already been 6 years and we're still waiting...

@tmushayahama

So there is no dynamic interval yet?

@shu-bham

Still waiting for this feature.

@PendalF89

Welcome from 2023! So, we are still waiting :)
