
[Feature Request] Make use of dynamic pruning for faster cardinality aggregations #11959

Closed
rishabhmaurya opened this issue Jan 21, 2024 · 6 comments · Fixed by #13821 · May be fixed by rishabhmaurya/OpenSearch#74
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations v2.15.0 Issues and PRs related to version 2.15.0

Comments

@rishabhmaurya
Contributor

rishabhmaurya commented Jan 21, 2024

Is your feature request related to a problem? Please describe

Dynamic pruning algorithms prune the search space by dynamically negating (filtering out) disjunction clauses that are non-competitive, or that are found to be no longer competitive during query execution.
One application is cardinality aggregations. Instead of using a negative filter in Lucene, if the field is a low-cardinality field (to avoid an explosion of disjunctions), the query can be rewritten as a disjunction over all the unique terms of that field. As matching documents are evaluated, once a field value has been encountered, its clause can be safely removed from the disjunction. This ensures that documents aren't processed twice for the same unique value of the field the cardinality aggregation runs on. The query can also be terminated early once all disjunctive clauses are exhausted.
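To make the flow concrete, here is a rough sketch of such a collection loop. The `PrunableDisjunction` interface is hypothetical, standing in for the pruning-capable disjunction this request proposes (stock Lucene has no such API); everything else is plain Lucene:

```java
import java.io.IOException;

import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

// Hypothetical: an iterator over the per-term disjunction that can also drop the
// sub-iterators whose terms occur in the current doc. Adding this capability is
// the core of the proposal; it does not exist in Lucene today.
interface PrunableDisjunction {
    DocIdSetIterator iterator();

    void removeSubIteratorsMatchingCurrentDoc() throws IOException;
}

final class PruningCardinalitySketch {
    /** Counts distinct values of the field among the docs matched by the disjunction. */
    static long countDistinct(PrunableDisjunction disjunction, SortedSetDocValues values) throws IOException {
        FixedBitSet seenOrds = new FixedBitSet((int) values.getValueCount());
        DocIdSetIterator docs = disjunction.iterator();
        for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
            if (values.advanceExact(doc)) {
                // record every value present on this doc
                for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
                    seenOrds.set((int) ord);
                }
            }
            // all terms on this doc are now counted, so their clauses are no longer
            // competitive; dropping them lets the disjunction skip ahead, and once it
            // is empty nextDoc() returns NO_MORE_DOCS and the loop terminates early
            disjunction.removeSubIteratorsMatchingCurrentDoc();
        }
        return seenOrds.cardinality();
    }
}
```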

Describe the solution you'd like

This logic can easily be embedded into the Query.rewrite() method when the cardinality aggregation is the only aggregation and the field is a low-cardinality field. The upper bound on the field's cardinality for a query can be estimated either from FieldReader#size() or from SortedSetDocValues#getValueCount().
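For example, a minimal sketch of such a per-segment precondition check; the threshold and the helper name are placeholders for illustration, and Terms#size() is the accessible counterpart of FieldReader#size():

```java
import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.index.Terms;

final class LowCardinalityCheck {
    // Hypothetical cutoff for "low cardinality"; the real value should come out of benchmarking.
    private static final long MAX_DISTINCT_VALUES = 1024;

    /** Returns true if the segment provably has few distinct values for {@code field}. */
    static boolean isLowCardinality(LeafReader reader, String field) throws IOException {
        Terms terms = reader.terms(field);
        if (terms != null && terms.size() != -1) {
            // terms dictionary knows its exact size (FieldReader#size() under the hood)
            return terms.size() <= MAX_DISTINCT_VALUES;
        }
        // fall back to the ordinal count of the doc values, an upper bound on the cardinality
        SortedSetDocValues values = DocValues.getSortedSet(reader, field);
        return values.getValueCount() <= MAX_DISTINCT_VALUES;
    }
}
```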

We need to run some benchmarks to know when to enable this optimization, as it may not be very helpful on smaller corpora. I propose running it against the noaa workload - https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/bdbd4bbd74fbf319398de1ca169f16744821bcde/noaa/operations/default.json#L765

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

No response

@rishabhmaurya rishabhmaurya added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 21, 2024
@rishabhmaurya
Contributor Author

rishabhmaurya commented Jan 21, 2024

cc @getsaurabh02 @msfroh let me know your thoughts

@rishabhmaurya
Contributor Author

This is the idea from the blog post - https://www.elastic.co/blog/faster-cardinality-aggregations-dynamic-pruning

@rishabhmaurya
Contributor Author

rishabhmaurya commented Feb 14, 2024

Here is my attempt at it: rishabhmaurya#74. Thanks @msfroh for suggesting the workaround to maintain the invariant of the conjunction DISI by lazily propagating the docID on next()/advance(), and for validating this approach.

@kkmr and @msfroh let me know your thoughts - is the algorithm below reasonable enough to proceed with here?
Also, can we optimize any of these steps?

Here is the breakdown of the algorithm -

1. Check all the preconditions for when this optimization can be enabled:
   1. The cardinality aggregation is the only aggregation.
   2. The field is a low-cardinality field.
   3. The field type is one of Keyword, Numeric?
   4. Other?
2. Once the preconditions are met, while collectors are created and picked for a given segment, create a DynamicPruningCollectorWrapper to wrap the collector with the optimization.
3. DynamicPruningCollectorWrapper enumerates all the terms of the given field and creates a DisjunctionWithDynamicPruningScorer, similar to DisjunctionScorer in Lucene, in conjunction with the parent query. DisjunctionWithDynamicPruningScorer should have the following capabilities in addition to what DisjunctionScorer has:
   1. #removeAllDISIsOnCurrentDoc() - removes all the sub-scorer DISIs pointing to the current doc. This is what drives dynamic pruning for the cardinality aggregation: once a term has been found, it is irrelevant for the rest of the search space, so that term's sub-scorer DISI can be safely removed from the list of sub-scorers to process.
   2. #removeAllDISIsOnCurrentDoc() breaks the invariant of the conjunction DISI, namely that the docIDs of all sub-scorers must be less than or equal to the docID the iterator currently points to. Removing elements from the priority queue triggers a heapify, which changes the top of the priority queue, i.e. the current docID of the sub-scorers. To address this, we wrap the iterator with SlowDocIdPropagatorDISI, which keeps reporting the last docID seen before #removeAllDISIsOnCurrentDoc() was called and only updates it when next()/advance() is called (see the sketch after this list).
4. When document collection starts and DynamicPruningCollectorWrapper is used, it collects all the documents at once by iterating over the query created in step 3.
5. Dynamic pruning step while collecting a document - when a match is found, all the terms of that document are enumerated and collected for the cardinality computation. Once done, the sub-scorer DISIs corresponding to those terms can be safely removed from the DisjunctionWithDynamicPruningScorer by calling removeAllDISIsOnCurrentDoc(). Once all docs are collected, we can throw CollectionTerminatedException right away to terminate the query early.
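For reference, here is a rough, illustrative sketch of the lazy docID propagation from step 3.2. The class name comes from the draft PR, but this simplified version is only an assumption about the shape of the wrapper, not the actual change:

```java
import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;

/**
 * Illustrative sketch: keeps reporting the last returned docID even if the wrapped
 * disjunction's internal position jumped forward when sub-scorer DISIs were removed,
 * and only catches up with the delegate on the next nextDoc()/advance() call.
 */
final class SlowDocIdPropagatorDISI extends DocIdSetIterator {
    private final DocIdSetIterator delegate;
    private int curDocId;

    SlowDocIdPropagatorDISI(DocIdSetIterator delegate, int curDocId) {
        this.delegate = delegate;
        this.curDocId = curDocId;
    }

    @Override
    public int docID() {
        return curDocId; // may lag behind delegate.docID() after a removal
    }

    @Override
    public int nextDoc() throws IOException {
        if (delegate.docID() > curDocId) {
            // the delegate already moved past us while sub-scorers were removed;
            // surface its current position instead of advancing it again
            curDocId = delegate.docID();
        } else {
            curDocId = delegate.nextDoc();
        }
        return curDocId;
    }

    @Override
    public int advance(int target) throws IOException {
        if (delegate.docID() >= target) {
            curDocId = delegate.docID();
        } else {
            curDocId = delegate.advance(target);
        }
        return curDocId;
    }

    @Override
    public long cost() {
        return delegate.cost();
    }
}
```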

The code change contains a test which covers a happy case - https://github.com/rishabhmaurya/OpenSearch/pull/74/files#diff-8c88c6062265deccbf9f504a86750ae8f6e1ae53350f91f8a226e7886d6c3e7cR101

@kkmr
Contributor

kkmr commented Feb 19, 2024

I will take this forward

@bowenlan-amzn
Member

Will work on this.

@bowenlan-amzn
Member

bowenlan-amzn commented May 28, 2024

TODO

  • Find a suitable benchmark for the cardinality aggregation
