[Feature Request] Make use of dynamic pruning for faster cardinality aggregations #11959
Comments
cc @getsaurabh02 @msfroh, let me know your thoughts.
This is the idea from the blog post - https://www.elastic.co/blog/faster-cardinality-aggregations-dynamic-pruning
Here is my attempt at it: rishabhmaurya#74. Thanks @msfroh for suggesting a workaround to maintain the invariant of Conjunction DISI by lazily propagating the docID on
@kkmr and @msfroh, let me know your thoughts: is the algorithm below reasonable enough to proceed here? Here is the breakdown of the algorithm -
The code change contains a test which covers a happy case - https://github.com/rishabhmaurya/OpenSearch/pull/74/files#diff-8c88c6062265deccbf9f504a86750ae8f6e1ae53350f91f8a226e7886d6c3e7cR101
I will take this forward
Will work on this.
TODO
Is your feature request related to a problem? Please describe
Dynamic pruning algorithms work by dynamically adding negating filters to the query, as disjunctions, for values that are non-competitive (or that are found to be no longer competitive during query execution), in order to prune the search space.
One use case is cardinality aggregations. Instead of using a negative filter in Lucene, if the field is a low-cardinality field (to avoid an explosion of disjunctions), the query can be rewritten as a disjunction query over all the unique terms of the field. As matching documents are evaluated, once a field value has been encountered, its term can be safely removed from the disjunction query. This ensures that documents aren't processed twice for the same unique value of the field on which the cardinality aggregation is run. The query is also terminated early once all disjunctive filters are exhausted.
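To make the idea concrete, here is a minimal sketch in Python that simulates the pruning over in-memory postings lists. It is not Lucene or OpenSearch code: the function name `count_distinct_with_pruning`, the `postings` dict (term to sorted doc IDs), and the `matching_docs` set standing in for the user's query are all assumptions made for illustration.

```python
import heapq


def count_distinct_with_pruning(postings, matching_docs):
    """Count distinct terms among matching docs, pruning each term once seen.

    postings: dict mapping term -> sorted list of doc IDs containing that term
              (hypothetical stand-in for per-term postings).
    matching_docs: set of doc IDs matched by the user's query (hypothetical).
    Returns (distinct_count, postings_visited).
    """
    # One cursor per disjunction clause: (current doc ID, term, position).
    heap = [(docs[0], term, 0) for term, docs in postings.items() if docs]
    heapq.heapify(heap)

    seen_terms = set()
    postings_visited = 0

    # An empty heap means every clause is exhausted or pruned: early termination.
    while heap:
        doc, term, pos = heapq.heappop(heap)
        postings_visited += 1

        if doc in matching_docs:
            # First matching doc for this term: count it and drop the clause,
            # so the rest of its postings list is never visited.
            seen_terms.add(term)
            continue

        # No match yet for this term: advance its cursor if docs remain.
        docs = postings[term]
        if pos + 1 < len(docs):
            heapq.heappush(heap, (docs[pos + 1], term, pos + 1))

    return len(seen_terms), postings_visited
```

With postings `{"rain": [0, 1, 2, 3, 4, 5], "snow": [2, 6], "fog": [7]}` and matching docs `{1, 2, 3, 4, 5, 6}`, the sketch visits 4 of the 9 postings entries, because the "rain" and "snow" clauses are dropped at their first matching document.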
Describe the solution you'd like
This logic can easily be embedded into the Query.rewrite() method when the cardinality aggregation is the only aggregation and the field is a low-cardinality field. An upper bound on the field cardinality for a query can be estimated either from FieldReader size() or SortedSetDocValues getValueCount().
We need to run some benchmarks to know when to start enabling this optimization, as it may not be very helpful on smaller corpora. I propose running it against the noaa workload - https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/bdbd4bbd74fbf319398de1ca169f16744821bcde/noaa/operations/default.json#L765
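The eligibility check described above could look roughly like the sketch below. Everything here is an assumption for illustration: the function name, the `aggregation_types` list, and especially the `max_terms` default of 128, which is a placeholder that the proposed benchmarks would need to tune. A negative upper bound is treated as "unknown", since Lucene's `Terms.size()` returns -1 when the codec does not store the term count.

```python
def should_use_dynamic_pruning(aggregation_types, value_count_upper_bound,
                               max_terms=128):
    """Decide whether to rewrite the query as a prunable disjunction.

    aggregation_types: list of aggregation type names in the request
                       (hypothetical representation).
    value_count_upper_bound: estimated number of unique terms for the field;
                             negative means the estimate is unavailable.
    max_terms: hypothetical threshold, to be tuned by benchmarking.
    """
    # The rewrite only applies when cardinality is the sole aggregation.
    only_cardinality = aggregation_types == ["cardinality"]

    # Skip the rewrite if the bound is unknown or the field is not
    # low-cardinality, to avoid an explosion of disjunction clauses.
    low_cardinality = 0 <= value_count_upper_bound <= max_terms

    return only_cardinality and low_cardinality
```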
Related component
Search:Aggregations
Describe alternatives you've considered
No response
Additional context
No response