[DRAFT] Cardinality aggregation dynamic pruning changes (to be used only for prototype and reference purpose, not intended to merge to main) #12323

@rishabhmaurya rishabhmaurya commented Feb 14, 2024

Changes to experiment with dynamic pruning for the cardinality aggregation, as described in #11959.

Here is a breakdown of the algorithm:

  1. Check all preconditions for enabling this optimization:
    1. The cardinality aggregation is the only aggregation.
    2. The field is a low-cardinality field.
    3. The field type is one of keyword, numeric?
    4. Other?
  2. Once the preconditions are met, when collectors are created and picked for a given segment, create a DynamicPruningCollectorWrapper to wrap the collector with the optimization.

  3. DynamicPruningCollectorWrapper enumerates all the terms for the given field and creates a DisjunctionWithDynamicPruningScorer, similar to DisjunctionScorer in Lucene, in conjunction with the parent query. In addition to what DisjunctionScorer offers, DisjunctionWithDynamicPruningScorer should have the following capabilities:

    1. #removeAllDISIsOnCurrentDoc() removes the DISIs of all sub-scorers positioned on the current doc. This is what enables dynamic pruning for the cardinality aggregation: once a term is found, it becomes irrelevant for the rest of the search space, so that term's sub-scorer DISI can be safely removed from the list of sub-scorers to process.
    2. #removeAllDISIsOnCurrentDoc() breaks the invariant of the conjunction DISI, i.e. that the docIDs of all sub-scorers should be less than or equal to the docID the iterator is currently pointing to. Removing elements from the priority queue triggers a heapify, which changes the top of the priority queue, and the top represents the current docID for the sub-scorers here. To address this, we wrap the iterator with SlowDocIdPropagatorDISI, which keeps the iterator pointing to the last docID from before #removeAllDISIsOnCurrentDoc() was called and updates this docID only when next() or advance() is called.
  4. When collection of documents starts and DynamicPruningCollectorWrapper is used, it collects all the documents at once by iterating over every document matched by the query created in step 3.

  5. Dynamic pruning step when collecting a document: when a match is found, all the terms of that document are enumerated and collected for the cardinality computation. Once done, the sub-scorer DISI corresponding to each of these terms can be safely removed from the DisjunctionWithDynamicPruningScorer by calling removeAllDISIsOnCurrentDoc(). Once all docs are collected, we can straightaway throw CollectionTerminatedException for early termination of the query.
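
To make the pruning idea concrete, here is a minimal sketch in plain Java. It is not the code from this PR and uses no Lucene types: `DynamicPruningSketch`, `TermCursor`, `Result`, and `run` are hypothetical names, each term's postings are modelled as a sorted `int[]`, and the parent query is assumed to match every document. It only illustrates the idea that a term's iterator can be dropped as soon as that term has been seen once.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of dynamic pruning for cardinality (hypothetical, no Lucene types). */
final class DynamicPruningSketch {

    /** Cursor over one term's sorted postings; stands in for a sub-scorer DISI. */
    static final class TermCursor {
        final int termOrd;
        final int[] docs;
        int pos = 0;

        TermCursor(int termOrd, int[] docs) {
            this.termOrd = termOrd;
            this.docs = docs;
        }

        int docID() {
            return docs[pos];
        }
    }

    /** Distinct terms seen and number of documents that had to be collected. */
    static final class Result {
        final int cardinality;
        final int docsCollected;

        Result(int cardinality, int docsCollected) {
            this.cardinality = cardinality;
            this.docsCollected = docsCollected;
        }
    }

    /**
     * Walks the disjunction of all term cursors in docID order.
     * With prune == true, a cursor is dropped once its term has been seen.
     */
    static Result run(int[][] postingsByTerm, boolean prune) {
        PriorityQueue<TermCursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.docID(), b.docID()));
        for (int t = 0; t < postingsByTerm.length; t++) {
            if (postingsByTerm[t].length > 0) {
                queue.add(new TermCursor(t, postingsByTerm[t]));
            }
        }

        BitSet seenTerms = new BitSet();
        int docsCollected = 0;
        while (!queue.isEmpty()) {
            int doc = queue.peek().docID();
            docsCollected++;

            // Pop every cursor positioned on the current doc.
            List<TermCursor> onDoc = new ArrayList<>();
            while (!queue.isEmpty() && queue.peek().docID() == doc) {
                onDoc.add(queue.poll());
            }

            for (TermCursor cursor : onDoc) {
                seenTerms.set(cursor.termOrd);
                if (prune) {
                    continue; // term already counted: never re-add its cursor
                }
                if (++cursor.pos < cursor.docs.length) {
                    queue.add(cursor); // no pruning: keep iterating this term
                }
            }
        }
        return new Result(seenTerms.cardinality(), docsCollected);
    }
}
```

Calling `DynamicPruningSketch.run(postings, true)` and `run(postings, false)` on the same input gives the same cardinality, while the pruned run can stop as soon as every term's cursor has been removed. Because the sketch pops cursors before mutating them, it sidesteps the heap-invariant problem that SlowDocIdPropagatorDISI addresses in the description above.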

Note: to be used only for prototype and reference purpose, not intended to merge to main. It may contain a lot of bugs and definitely doesn't cover all preconditions.

Related Issues

Resolves #11959


@github-actions bot added the enhancement, Search:Aggregations, and v2.13.0 labels on Feb 14, 2024

❌ Gradle check result for ea3e08c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

new SortedSetDocValuesField(fieldName, new BytesRef("5"))
));
}, card -> {
assertEquals(3.0, card.getValue(), 0);
Contributor Author

we should probably add an assertion on how many times collector.collect gets called, which should be 2 when dynamic pruning is applied vs 5 when it's not applied?
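
One way such an assertion could be supported is with a small counting wrapper around the collector. The sketch below is plain Java with hypothetical names (`CountingCollector`, `collectCalls`) and models a collector as an `IntConsumer` that receives doc IDs; in the real test the wrapper would presumably delegate to Lucene's `LeafCollector#collect(int)` instead, which is an assumption about how the test is wired.

```java
import java.util.function.IntConsumer;

/** Hypothetical wrapper that counts how often a doc-ID collector is invoked. */
final class CountingCollector implements IntConsumer {
    private final IntConsumer delegate;
    private int collectCalls;

    CountingCollector(IntConsumer delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(int doc) {
        collectCalls++;       // record the call before forwarding it
        delegate.accept(doc); // behaviour of the wrapped collector is unchanged
    }

    int collectCalls() {
        return collectCalls;
    }
}
```

A test could then run the aggregation once with and once without the optimization and compare `collectCalls()` against the expected number of collected documents in each case.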

Compatibility status:

Checks if related components are compatible with change ea3e08c

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

@andrross

to be used only for prototype and reference purpose, not intended to merge to main

Thanks @rishabhmaurya! Sorry, I'm going to nitpick about repository hygiene here and not about the specific content of this PR. Is there a reason you chose to publish this as a PR against main as opposed to a branch or PR on your user fork? Personally, I would prefer to keep PRs against the core repo as "code intended to be merged to main" and use issues + user forks for evaluating experiments and prototypes.

@rishabhmaurya

rishabhmaurya commented Feb 15, 2024

@andrross my bad, I should have created it against the fork. Let me do that now, thanks for pointing it out.
created - rishabhmaurya#74

Successfully merging this pull request may close these issues.

[Feature Request] Make use of dynamic pruning for faster cardinality aggregations