Refactor the filter rewrite optimization #14464

@bowenlan-amzn bowenlan-amzn commented Jun 19, 2024

Description

As more code comes into the filter rewrite optimization, it becomes harder to understand.
This not only makes code review slow and painful, it also slows down new contributors to this area. Hence this refactoring work.

Idea

The refactoring shouldn't change any business logic.
After the refactor, a reader can easily find all the important information just by reading the class docs and checking the public methods of each class.

  • Add only declarative code to the Aggregator, while keeping the optimization business logic in the new package.
    • The declarative code is the Context object, which has several public methods to invoke the optimization workflow, combined with a Bridge object that gives the optimization any access it needs to the data in the Aggregator.
    • The necessary data can be passed into methods of the Bridge. Compared to saving the field in the Bridge class, this is more readable because the method name tells you directly where the field is actually needed.
    • Besides providing access, the Bridge can also host/hide the optimization business logic.
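
The Context/Bridge split described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every class and method name here (`AggregatorBridgeSketch`, `FixedIntervalBridgeSketch`, `OptimizationContextSketch`, `tryOptimize`, `buildRanges`) is hypothetical, and plain `long` arrays stand in for the real segment data.

```java
import java.util.Arrays;
import java.util.function.BiConsumer;

/** Stands in for the Aggregator: it only wires things together (declarative code). */
final class BridgeSketchDemo {
    public static void main(String[] args) {
        int[] counts = new int[3];
        OptimizationContextSketch context =
            new OptimizationContextSketch(new FixedIntervalBridgeSketch(10));
        boolean optimized =
            context.tryOptimize(new long[] {1, 5, 12, 25}, (bucket, n) -> counts[bucket] += n);
        System.out.println(optimized + " " + Arrays.toString(counts)); // prints "true [2, 1, 1]"
    }
}

/** Hypothetical bridge: gives the optimization access to aggregator data. */
abstract class AggregatorBridgeSketch {
    /** Whether the shape of the aggregation allows a rewrite at all. */
    abstract boolean canOptimize();

    /** Inputs arrive as arguments, so the signature shows where they are needed. */
    abstract long[][] buildRanges(long minValue, long maxValue);
}

/** Hypothetical bridge for fixed-width buckets; it also hosts range-building logic. */
final class FixedIntervalBridgeSketch extends AggregatorBridgeSketch {
    private final long interval;

    FixedIntervalBridgeSketch(long interval) {
        this.interval = interval;
    }

    @Override
    boolean canOptimize() {
        return interval > 0;
    }

    @Override
    long[][] buildRanges(long minValue, long maxValue) {
        long start = Math.floorDiv(minValue, interval) * interval;
        int n = (int) ((maxValue - start) / interval) + 1;
        long[][] ranges = new long[n][2];
        for (int i = 0; i < n; i++) {
            ranges[i][0] = start + i * interval; // inclusive lower bound
            ranges[i][1] = ranges[i][0] + interval; // exclusive upper bound
        }
        return ranges;
    }
}

/** Hypothetical context: the only object the aggregator calls into. */
final class OptimizationContextSketch {
    private final AggregatorBridgeSketch bridge;

    OptimizationContextSketch(AggregatorBridgeSketch bridge) {
        this.bridge = bridge;
    }

    /** Returns true if buckets were filled, false if the caller should collect normally. */
    boolean tryOptimize(long[] sortedValues, BiConsumer<Integer, Integer> incrementDocCount) {
        if (!bridge.canOptimize() || sortedValues.length == 0) {
            return false;
        }
        long[][] ranges = bridge.buildRanges(sortedValues[0], sortedValues[sortedValues.length - 1]);
        for (int bucket = 0; bucket < ranges.length; bucket++) {
            int count = 0;
            for (long v : sortedValues) { // simple scan; the real code would use the index
                if (v >= ranges[bucket][0] && v < ranges[bucket][1]) {
                    count++;
                }
            }
            incrementDocCount.accept(bucket, count);
        }
        return true;
    }
}
```

In this sketch the aggregator side holds no optimization logic: it constructs the context with a bridge, calls one public method, and falls back to normal collection when that method returns false.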

Refactoring

  • Split the old, huge Helper class into independent components.
  • Tighten the member access modifiers of the components, leaving only the important methods public.
  • Remove unnecessary references from the components. For example, instead of passing SearchContext into the OptimizationContext, use the functions in AggregatorBridge to provide it whenever needed.

Why the name — filter rewrite optimization?

Filter in the OpenSearch world has a similar meaning to query, but it indicates that no relevance score is calculated.
Rewrite in the OpenSearch world can mean transforming an OpenSearch query into a Lucene query, or transforming a query so that it performs better.

Generally speaking, the optimization rewrites the aggregation into certain filters to improve performance. Aggregation execution is plain and simple iteration and collection over all matches, while filters can take advantage of the Lucene index to get the expected results in logarithmic or even constant time.
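
To illustrate the difference in cost, here is a minimal sketch that counts the values falling into one bucket range in two ways. It is only an analogy: a sorted `long` array with binary search stands in for the Lucene index (the real optimization traverses Lucene's point tree), and the class and method names are hypothetical.

```java
final class RangeCountSketch {
    public static void main(String[] args) {
        long[] sorted = {1, 3, 3, 7, 9, 12};
        // Both approaches agree on the count for the range [3, 9).
        System.out.println(countByIteration(sorted, 3, 9)); // prints 3
        System.out.println(countByIndex(sorted, 3, 9)); // prints 3
    }

    /** Plain aggregation style: visit every value, O(n). */
    static int countByIteration(long[] values, long lo, long hi) {
        int count = 0;
        for (long v : values) {
            if (v >= lo && v < hi) {
                count++;
            }
        }
        return count;
    }

    /** Index style: two binary searches over sorted values, O(log n). */
    static int countByIndex(long[] sorted, long lo, long hi) {
        return lowerBound(sorted, hi) - lowerBound(sorted, lo);
    }

    /** Index of the first element that is >= key, or sorted.length if none. */
    private static int lowerBound(long[] sorted, long key) {
        int left = 0;
        int right = sorted.length;
        while (left < right) {
            int mid = (left + right) >>> 1;
            if (sorted[mid] < key) {
                left = mid + 1;
            } else {
                right = mid;
            }
        }
        return left;
    }
}
```

The iteration version touches every value, while the indexed version only inspects a logarithmic number of positions regardless of how many values fall inside the range.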

Benchmark

Triggered with the new tool for running benchmarks from a PR: #14464 (comment)

Related Issues

Resolves #14435

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Split the single Helper classes and move the classes into a new package for any optimization we introduced for search path.
Rename the class name to make it more straightforward and general

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
refactor the canOptimize logic
sort out the basic rule about how to provide data from aggregator, and where to put common logic

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
refactor the data provider and try optimize logic

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@bowenlan-amzn bowenlan-amzn changed the title from "Refactor" to "Refactor the filter rewrite optimization" Jun 19, 2024
@github-actions github-actions bot added the Search:Aggregations and v2.16.0 labels Jun 19, 2024

❌ Gradle check result for 1a067ba: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
extract segment match all logic

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

✅ Gradle check result for 7c491b9: SUCCESS

codecov bot commented Jun 20, 2024

Codecov Report

Attention: Patch coverage is 85.25799% with 60 lines in your changes missing coverage. Please review.

Project coverage is 71.15%. Comparing base (97c1bf0) to head (86cacab).
Report is 3 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...ilterrewrite/FilterRewriteOptimizationContext.java | 82.60% | 9 Missing and 3 partials ⚠️ |
| ...arch/aggregations/bucket/filterrewrite/Helper.java | 85.00% | 4 Missing and 8 partials ⚠️ |
| ...tions/bucket/filterrewrite/PointTreeTraversal.java | 88.15% | 5 Missing and 4 partials ⚠️ |
| ...t/filterrewrite/DateHistogramAggregatorBridge.java | 87.03% | 1 Missing and 6 partials ⚠️ |
| ...ns/bucket/filterrewrite/RangeAggregatorBridge.java | 80.00% | 2 Missing and 5 partials ⚠️ |
| ...arch/aggregations/bucket/filterrewrite/Ranges.java | 75.00% | 2 Missing and 3 partials ⚠️ |
| ...egations/bucket/composite/CompositeAggregator.java | 86.95% | 2 Missing and 1 partial ⚠️ |
| ...ucket/filterrewrite/CompositeAggregatorBridge.java | 77.77% | 0 Missing and 2 partials ⚠️ |
| .../bucket/histogram/AutoDateHistogramAggregator.java | 94.73% | 1 Missing ⚠️ |
| ...ions/bucket/histogram/DateHistogramAggregator.java | 90.90% | 1 Missing ⚠️ |

... and 1 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14464      +/-   ##
============================================
- Coverage     71.74%   71.15%   -0.59%     
+ Complexity    62904    62235     -669     
============================================
  Files          5178     5185       +7     
  Lines        295167   295146      -21     
  Branches      42679    42660      -19     
============================================
- Hits         211774   210020    -1754     
- Misses        66011    67800    +1789     
+ Partials      17382    17326      -56     

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
inline class

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@bowenlan-amzn bowenlan-amzn force-pushed the 14435-refactor-range-agg-optimization branch from 9040f6f to e896927 Compare August 7, 2024 18:30

github-actions bot commented Aug 7, 2024

✅ Gradle check result for 9040f6f: SUCCESS

github-actions bot commented Aug 7, 2024

✅ Gradle check result for e896927: SUCCESS

- remove map of segment ranges, pass in by calling getRanges when needed
- use AtomicInteger for the debug info

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

github-actions bot commented Aug 8, 2024

✅ Gradle check result for 8962ee3: SUCCESS

github-actions bot commented Aug 8, 2024

❕ Gradle check result for 86cacab: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.common.util.concurrent.QueueResizableOpenSearchThreadPoolExecutorTests.classMethod
      1 org.opensearch.common.util.concurrent.QueueResizableOpenSearchThreadPoolExecutorTests.testResizeQueueDown

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@mch2 mch2 left a comment

Thanks for these changes, @bowenlan-amzn. I think this is much easier to follow than the original helper class. I think we can keep going with some cleanup, but my major concern regarding concurrent search appears resolved.

@mch2 mch2 added the backport 2.x label Aug 9, 2024
@mch2 mch2 merged commit 170ea27 into opensearch-project:main Aug 9, 2024
39 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 9, 2024
* Refactor

Split the single Helper classes and move the classes into a new package for any optimization we introduced for search path.
Rename the class name to make it more straightforward and general

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Refactor

refactor the canOptimize logic
sort out the basic rule about how to provide data from aggregator, and where to put common logic

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Refactor

refactor the data provider and try optimize logic

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Refactor

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Refactor

extract segment match all logic

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Refactor

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Refactor

inline class

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Fix a bug

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* address comment

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* prepareFromSegment now doesn't return Ranges

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* how it looks like when introduce interfaces

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* remove interface, clean up

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* improve doc

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* move multirangetraversal logic to helper

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* improve the refactor

package name -> filterrewrite
move tree traversal logic to new class
add documentation for important abstract methods
add sub class for composite aggregation bridge

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Address Marc's comments

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Address concurrent segment search concern

To save the ranges per segment, now change to a map that save ranges for segments separately.

The increment document function "incrementBucketDocCount" should already be thread safe, as it's the same method used by normal aggregation execution path

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* remove circular dependency

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Address comment

- remove map of segment ranges, pass in by calling getRanges when needed
- use AtomicInteger for the debug info

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

---------

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
(cherry picked from commit 170ea27)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@sandeshkr419

@mch2 @bowenlan-amzn We shouldn't skip changelog for these changes.

@bowenlan-amzn bowenlan-amzn removed the v2.16.0 label Aug 17, 2024
mch2 pushed a commit that referenced this pull request Aug 19, 2024
(cherry picked from commit 170ea27)

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
harshavamsi pushed a commit to harshavamsi/OpenSearch that referenced this pull request Aug 20, 2024
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024
Labels
backport 2.x, Performance, Search:Aggregations, skip-changelog, v2.17.0
Development

Successfully merging this pull request may close these issues.

Refactor FastFilterRewriteHelper
7 participants