
Enhancement: Range agg specified as max bucket count rather than explicit ranges #24254

Closed · mrec opened this issue Apr 21, 2017 · 6 comments
Labels: :Analytics/Aggregations · >enhancement · high hanging fruit · Team:Analytics (meta label for analytical engine team (ESQL/Aggs/Geo))

Comments

mrec commented Apr 21, 2017

NOTE: This is very similar to #9572 about histogram, and is probably similarly blocked by #12316, but I didn't want to hijack that one given that it's a different aggregation. I also couldn't see any mention of the "swamping" problem described below as the second motivation.

A range aggregation is specified using an array of explicit bucket ranges:

"range" : {
    "field" : "price",
    "ranges" : [
        { "to" : 50 },
        { "from" : 50, "to" : 100 },
        { "from" : 100 }
    ]
}

I'm proposing an alternative which just requests a maximum number of buckets to return; syntax is open to bikeshedding but could be e.g.

"range" : {
    "field" : "price",
    "ranges" : 3
}

The motivation for this is twofold:

  1. One very common use case for range aggs is creating drilldown UI to augment a primary text-based search. The end-user doesn't enter the range boundaries; rather, the client application will do a calibration request with something like a percentiles agg, aiming for N buckets with doc counts as even as possible, and set the range bucket boundaries for the main request based on the results of that. The problem here is duplicate work; it means two round-trips to Elastic rather than one, both trips will need to apply the request's filtering criteria, and the dependency on those criteria means that the result of the percentiles agg can't usefully be cached. Doing the percentiles internally on the server would avoid the network overhead of the second call, and could hopefully avoid repeating the document filtering step.
  2. Occasionally the matching docs are swamped by one value, meaning that (for example) requesting [25,50,75] percentiles can return the same value for all three, making it almost useless for drilldown. We can work around this when using the Java client by binary-chopping the returned Percentiles until we find something useful, but this isn't possible via the REST API and won't be possible via the Java client either (see Feature gap between Java and HTTP APIs for Percentiles aggregation #23610) once it switches to REST transport.
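The calibration step in (1) and the swamping detection in (2) can be sketched client-side; the following is illustrative Python only, not part of any Elasticsearch client API, and all function names are hypothetical:

```python
# Sketch of the client-side calibration described above: turn the "values"
# map of a percentiles agg response into an explicit "ranges" array for a
# range agg, detecting the "swamped" case where percentiles collapse.
# Illustrative only; names are hypothetical.

def boundaries_from_percentiles(percentile_values, n_buckets):
    """percentile_values: e.g. {"33.0": 50.0, "66.0": 100.0}.
    Returns up to n_buckets - 1 distinct boundaries, or None when the
    values are too swamped to split docs into useful buckets."""
    distinct = sorted(set(percentile_values.values()))
    if len(distinct) < n_buckets - 1:
        # Swamped: the caller would binary-chop other percentiles here.
        return None
    return distinct[:n_buckets - 1]

def ranges_from_boundaries(bounds):
    """Turn boundaries like [50, 100] into an explicit ranges array."""
    ranges = [{"to": bounds[0]}]
    ranges += [{"from": lo, "to": hi} for lo, hi in zip(bounds, bounds[1:])]
    ranges.append({"from": bounds[-1]})
    return ranges
```

Feeding the boundaries [50, 100] back through `ranges_from_boundaries` reproduces the three explicit ranges shown in the first example above.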
markharwood (Contributor) commented

As you suggest, this is likely blocked because we would need multiple phases to select the appropriate bucketing strategy for use across multiple indices/shards.

Putting aside the challenges of implementing this efficiently, I imagine there would need to be some extra client criteria to control the aesthetics of the derived buckets. Given a website that sells spoons, cars and houses, it would need flexible bucketing depending on the individual query, i.e. tiers of rounding (prices snap to units of 1, 10, 1,000, 10,000, 100,000).
Often such sites implement "category snapping" to determine not just appropriate price ranges but other category-appropriate facets: "transmission type" applies for cars, "number of bedrooms" applies for houses, etc. In these situations it makes more sense to define a whole set of categories with the appropriate aggregations for each, and to implement a query-pre-processing system that determines the most probable category for a given query.

colings86 (Contributor) commented

This will be solved by #28993 when it's merged, so I'm closing this in favour of #9572, which is the issue for that change.

mrec (Author) commented Mar 13, 2018

Hi @colings86 - it's great to see that work happening, but from the PR description I'm not sure it addresses this request. Your change seems to be specifically about histograms, and is constrained to keep all intervals the same width. This ticket was about range for faceted navigation, where it's sometimes more valuable for all intervals to contain roughly the same doc count, and you don't necessarily care about them being the same width. See also @markharwood's example of "prices snap to either units of 1, 10, 1,000, 10,000, 100,000".

If I'm misunderstanding, apologies; I've only read the description, not the code diffs.

colings86 (Contributor) commented

@mrec sorry, you are right, your use case is slightly different so I'll reopen this issue.

Off the back of #28993 I was actually thinking about using a similar approach of merging buckets at collection time to create an aggregation for constant-density histograms, where we would aim to create buckets that have roughly the same number of documents in them. I have yet to convince myself whether this is just a case of being able to run k-means clustering on the numeric values, though, so it might be solved by #5512.
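As a toy illustration of the constant-density goal (equal doc counts per bucket), here is a Python sketch over an in-memory list of values; a real aggregator would merge buckets at collection time across shards, so this only shows the target output shape, and the function name is hypothetical:

```python
# Toy sketch of a constant-density histogram: split sorted values into
# n buckets holding roughly equal doc counts. Not how a distributed
# aggregator would work; this only illustrates the desired result.

def equal_count_buckets(values, n_buckets):
    vals = sorted(values)
    step = len(vals) / n_buckets  # target docs per bucket
    buckets = []
    for i in range(n_buckets):
        chunk = vals[round(i * step):round((i + 1) * step)]
        if chunk:
            buckets.append({"min": chunk[0], "max": chunk[-1],
                            "doc_count": len(chunk)})
    return buckets
```

Note the bucket widths vary while the doc counts stay even, which is the opposite trade-off to the fixed-width auto histogram of #28993.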

@colings86 colings86 reopened this Mar 13, 2018
@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
talevy (Contributor) commented Nov 4, 2020

Assuming it is sufficient for these N ranges to be variable width rather than uniform, then, as this comment suggests, the Variable Width Histogram is a great alternative.

Feel free to re-open if that is not the case and the ranges must be uniform.
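For reference, the variable width histogram mentioned above is requested by bucket count rather than by explicit ranges; in Elasticsearch 7.9+ the request looks roughly like this (the "price" field name is illustrative, echoing the examples above):

```json
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 3
      }
    }
  }
}
```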

@talevy talevy closed this as completed Nov 4, 2020
polyfractal (Contributor) commented

Related (constant width auto histo): #31828

7 participants