Enhancement: Range agg specified as max bucket count rather than explicit ranges #24254
Comments
As you suggest, this is likely blocked because we would need multiple phases to select the appropriate bucketing strategy for use across multiple indices/shards. Putting aside the challenges of implementing this efficiently, I imagine there would need to be some extra client criteria to control the aesthetics of derived buckets. A website that sells spoons, cars and houses would need flexible bucketing depending on the individual query, i.e. tiers of rounding (prices snap to units of 1, 10, 1,000, 10,000 or 100,000).
Hi @colings86 - it's great to see that work happening, but from the PR description I'm not sure it addresses this request. Your change seems to be specifically about histograms, and is constrained to keep all intervals the same width. This ticket was about the `range` aggregation. If I'm misunderstanding, apologies; I've only read the description, not the code diffs.
@mrec sorry, you are right, your use case is slightly different so I'll reopen this issue. Off the back of #28993 I was actually thinking about using a similar approach of merging buckets at collection time to create an aggregation for constant-density histograms, where we would aim to create buckets that have roughly the same number of documents in them. I have yet to convince myself whether this is just a case of being able to run k-means clustering on the numeric values, though; if it is, it might be solved by #5512.
Assuming that it is sufficient for these N ranges to be variable, and that they do not need to be uniform, this comment suggests the Variable Width Histogram is a great alternative. Feel free to re-open if that is not the case and the ranges must be uniform.
Related (constant width auto histo): #31828
NOTE: This is very similar to #9572 about `histogram`, and is probably similarly blocked by #12316, but I didn't want to hijack that one given that it's a different aggregation. I also couldn't see any mention of the "swamping" problem described below as the second motivation.

A range aggregation is specified using an array of explicit bucket ranges:
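For reference, a minimal sketch of such a request body, written as a Python dict that serializes to the JSON that Elasticsearch accepts. The `range` aggregation with `field` and `ranges` (`from`/`to`) is the standard syntax; the aggregation name `price_ranges`, the field `price`, and the boundary values are illustrative assumptions, not taken from this report.

```python
import json

# "price_ranges" and "price" are hypothetical names chosen for illustration.
range_request = {
    "size": 0,  # only the aggregation result is wanted, not the hits
    "aggs": {
        "price_ranges": {
            "range": {
                "field": "price",
                # Every bucket boundary has to be spelled out by the client.
                "ranges": [
                    {"to": 100},
                    {"from": 100, "to": 200},
                    {"from": 200},
                ],
            }
        }
    },
}

print(json.dumps(range_request, indent=2))
```

Each bucket includes its `from` value and excludes its `to` value, so the three ranges above do not overlap.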
I'm proposing an alternative which just requests a maximum number of buckets to return; syntax is open to bikeshedding but could be e.g.
The motivation for this is twofold:
- [...] `percentiles` agg, aiming for N buckets with doc counts as even as possible, and set the `range` bucket boundaries for the main request based on the results of that. The problem here is duplicate work: it means two round-trips to Elastic rather than one, both trips need to apply the request's filtering criteria, and the dependency on those criteria means that the result of the `percentiles` agg can't usefully be cached. Doing the `percentiles` internally on the server would avoid the network overhead of the second call, and could hopefully avoid repeating the document filtering step.
- [...] `[25,50,75]` percentiles can return the same value for all three, making it almost useless for drilldown. We can work around this when using the Java client by binary-chopping the returned `Percentiles` until we find something useful, but this isn't possible via the REST API and won't be possible via the Java client either (see "Feature gap between Java and HTTP APIs for Percentiles aggregation", #23610) once it switches to REST transport.
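A minimal sketch of the client-side step in the two-request workaround described above, assuming the percentile values have already been fetched by a first request. The helper name `ranges_from_percentiles` and the sample values are hypothetical; this is not code from the original report. It also illustrates the swamping case, where identical percentile values collapse into a single cut point.

```python
def ranges_from_percentiles(boundaries):
    """Turn percentile values into `range` agg buckets, dropping duplicates."""
    # Skewed data can make several percentiles return the same value;
    # de-duplicate and sort so that no bucket is empty by construction.
    cuts = sorted(set(boundaries))
    if not cuts:
        return []
    ranges = [{"to": cuts[0]}]  # open-ended first bucket
    for lo, hi in zip(cuts, cuts[1:]):
        ranges.append({"from": lo, "to": hi})
    ranges.append({"from": cuts[-1]})  # open-ended last bucket
    return ranges


# Well-spread percentiles give one bucket per gap plus the two open ends.
even = ranges_from_percentiles([10.0, 20.0, 40.0])
# Swamped data: the 25th, 50th and 75th percentiles are all the same value.
swamped = ranges_from_percentiles([5.0, 5.0, 5.0])

print(len(even))     # 4
print(len(swamped))  # 2
```

In the swamped case only two buckets survive, and because `from` is inclusive the dominant value lands entirely in the second one, which is why the result is of little use for drilldown.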