
Enhancement: Range agg specified as max bucket count rather than explicit ranges #24254

Closed · mrec opened this issue Apr 21, 2017 · 6 comments
Labels: :Analytics/Aggregations · >enhancement · high hanging fruit · Team:Analytics (meta label for analytical engine team (ESQL/Aggs/Geo))

Comments

mrec commented Apr 21, 2017

NOTE: This is very similar to #9572 about histogram, and is probably similarly blocked by #12316, but I didn't want to hijack that one given that it's a different aggregation. I also couldn't see any mention of the "swamping" problem described below as the second motivation.

A range aggregation is specified using an array of explicit bucket ranges:

"range" : {
    "field" : "price",
    "ranges" : [
        { "to" : 50 },
        { "from" : 50, "to" : 100 },
        { "from" : 100 }
    ]
}

I'm proposing an alternative which just requests a maximum number of buckets to return; syntax is open to bikeshedding but could be e.g.

"range" : {
    "field" : "price",
    "ranges" : 3
}

The motivation for this is twofold:

  1. One very common use case for range aggs is creating drilldown UI to augment a primary text-based search. The end-user doesn't enter the range boundaries; rather, the client application will do a calibration request with something like a percentiles agg, aiming for N buckets with doc counts as even as possible, and set the range bucket boundaries for the main request based on the results of that. The problem here is duplicate work; it means two round-trips to Elastic rather than one, both trips will need to apply the request's filtering criteria, and the dependency on those criteria means that the result of the percentiles agg can't usefully be cached. Doing the percentiles internally on the server would avoid the network overhead of the second call, and could hopefully avoid repeating the document filtering step.
  2. Occasionally the matching docs are swamped by one value, meaning that (for example) requesting [25,50,75] percentiles can return the same value for all three, making it almost useless for drilldown. We can work around this when using the Java client by binary-chopping the returned Percentiles until we find something useful, but this isn't possible via the REST API and won't be possible via the Java client either (see Feature gap between Java and HTTP APIs for Percentiles aggregation #23610) once it switches to REST transport.
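The calibration step in (1) and the swamping detection in (2) can be sketched client-side; the following is illustrative Python only, not part of any Elasticsearch client API, and all function names are hypothetical:

```python
# Sketch of the client-side calibration described above: turn the "values"
# map of a percentiles agg response into an explicit "ranges" array for a
# range agg, detecting the "swamped" case where percentiles collapse.
# Illustrative only; names are hypothetical.

def boundaries_from_percentiles(percentile_values, n_buckets):
    """percentile_values: e.g. {"33.0": 50.0, "66.0": 100.0}.
    Returns up to n_buckets - 1 distinct boundaries, or None when the
    values are too swamped to split docs into useful buckets."""
    distinct = sorted(set(percentile_values.values()))
    if len(distinct) < n_buckets - 1:
        # Swamped: the caller would binary-chop other percentiles here.
        return None
    return distinct[:n_buckets - 1]

def ranges_from_boundaries(bounds):
    """Turn boundaries like [50, 100] into an explicit ranges array."""
    ranges = [{"to": bounds[0]}]
    ranges += [{"from": lo, "to": hi} for lo, hi in zip(bounds, bounds[1:])]
    ranges.append({"from": bounds[-1]})
    return ranges
```

Feeding the boundaries [50, 100] back through `ranges_from_boundaries` reproduces the three explicit ranges shown in the first example above.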
markharwood (Contributor) commented

As you suggest, this is likely blocked because we would need multiple phases to select the appropriate bucketing strategy for use across multiple indices/shards.

Putting aside the challenges of implementing this efficiently, I imagine there would need to be some extra client criteria to control the aesthetics of the derived buckets. Given a website that sells spoons, cars and houses, it would need flexible bucketing depending on the individual query, i.e. tiers of rounding (prices snap to units of 1, 10, 1,000, 10,000, 100,000).
Often such sites implement "category snapping" to determine not just appropriate price ranges but other category-appropriate facets: "transmission type" applies for cars, "number of bedrooms" applies for houses, etc. In these situations it makes more sense to define a whole set of categories with the appropriate aggregations for each, and to implement a query-pre-processing system that determines the most probable category for a given query.

colings86 (Contributor) commented

This will be solved by #28993 when it's merged, so I'm closing this in favour of #9572, which is the issue for that change.

mrec (Author) commented Mar 13, 2018

Hi @colings86 - it's great to see that work happening, but from the PR description I'm not sure it addresses this request. Your change seems to be specifically about histograms, and is constrained to keep all intervals the same width. This ticket was about range for faceted navigation, where it's sometimes more valuable for all intervals to contain roughly the same doc count, and you don't necessarily care about them being the same width. See also @markharwood's example of "prices snap to either units of 1, 10, 1,000, 10,000, 100,000".

If I'm misunderstanding, apologies; I've only read the description, not the code diffs.

colings86 (Contributor) commented

@mrec sorry, you are right, your use case is slightly different so I'll reopen this issue.

Off the back of #28993 I was actually thinking about using a similar approach of merging buckets at collection time to create an aggregation for constant-density histograms, where we would aim to create buckets that have roughly the same number of documents in them. I have yet to convince myself whether this is just a case of being able to run k-means clustering on the numeric values, though, so it might be solved by #5512.
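As a toy illustration of the constant-density goal (equal doc counts per bucket), here is a Python sketch over an in-memory list of values; a real aggregator would merge buckets at collection time across shards, so this only shows the target output shape, and the function name is hypothetical:

```python
# Toy sketch of a constant-density histogram: split sorted values into
# n buckets holding roughly equal doc counts. Not how a distributed
# aggregator would work; this only illustrates the desired result.

def equal_count_buckets(values, n_buckets):
    vals = sorted(values)
    step = len(vals) / n_buckets  # target docs per bucket
    buckets = []
    for i in range(n_buckets):
        chunk = vals[round(i * step):round((i + 1) * step)]
        if chunk:
            buckets.append({"min": chunk[0], "max": chunk[-1],
                            "doc_count": len(chunk)})
    return buckets
```

Note the bucket widths vary while the doc counts stay even, which is the opposite trade-off to the fixed-width auto histogram of #28993.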

@colings86 colings86 reopened this Mar 13, 2018
@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
talevy (Contributor) commented Nov 4, 2020

Assuming it is sufficient for these N ranges to be variable width rather than uniform, then, as this comment suggests, the Variable Width Histogram is a great alternative.

Feel free to re-open if that is not the case and the ranges must be uniform.
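For reference, the variable width histogram mentioned above is requested by bucket count rather than by explicit ranges; in Elasticsearch 7.9+ the request looks roughly like this (the "price" field name is illustrative, echoing the examples above):

```json
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 3
      }
    }
  }
}
```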

@talevy talevy closed this as completed Nov 4, 2020
polyfractal (Contributor) commented

Related (constant width auto histo): #31828

7 participants