Fix for missing HybridQuery results when concurrent segment search is enabled #800

Conversation

martin-gaievski
Member

@martin-gaievski martin-gaievski commented Jun 21, 2024

Description

Fixed a gap in the implementation for the case when a shard has multiple (6+) segments; in that case some hits were missing from the final hybrid query result.
The issue was caused by the wrong assumption that the collector manager can have only one hybrid query result collector in the reduce phase. That holds when the number of segments is < 6, but otherwise core passes multiple collectors, each with a portion of the results. We have to merge those results at the shard level to get the correct collection of hits.

Lucene code defines the limit per slice (the block that is processed by one collector): a maximum of 250,000 docs or 5 segments per slice.

The merge logic is a bit tricky because we have to deal with TopDocs that have been formatted into the special hybrid query format. Each subsequent collector result is merged into the query result one by one. On each merge we need to find the results of each sub-query, merge them separately, and then wrap them back into the hybrid query result format. A minimal sketch follows the example below.

Example:
TopDocs in query result:

query1: {doc1: 10, doc3: 5, doc5: 3}
query2: {doc2: 3, doc3: 1}

result from next block of segments:

query1: {doc2: 11, doc6: 2}
query2: {doc5: 2, doc6: 1}

merged result:

query1: {doc2: 11, doc1: 10, doc3: 5, doc5: 3, doc6: 2}
query2: {doc2: 3, doc5: 2, doc3: 1, doc6: 1}
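
As a rough illustration of that per-sub-query merge, here is a minimal, self-contained Java sketch. It only mirrors the worked example above; the class and method names are hypothetical and this is not the plugin's actual HybridQueryScoreDocsMerger code.

```java
// Hypothetical sketch of merging one sub-query's ScoreDocs from two collectors,
// keeping descending-score order (ties broken by doc id). Not the plugin's code.
import org.apache.lucene.search.ScoreDoc;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SubQueryScoreDocsMergeSketch {

    static ScoreDoc[] mergeSubQuery(ScoreDoc[] current, ScoreDoc[] next) {
        Comparator<ScoreDoc> byScoreDesc = Comparator.comparingDouble((ScoreDoc d) -> d.score)
            .reversed()
            .thenComparingInt(d -> d.doc);
        List<ScoreDoc> merged = new ArrayList<>(current.length + next.length);
        int i = 0, j = 0;
        // Classic two-pointer merge of two already sorted arrays.
        while (i < current.length && j < next.length) {
            if (byScoreDesc.compare(current[i], next[j]) <= 0) {
                merged.add(current[i++]);
            } else {
                merged.add(next[j++]);
            }
        }
        while (i < current.length) merged.add(current[i++]);
        while (j < next.length) merged.add(next[j++]);
        return merged.toArray(new ScoreDoc[0]);
    }

    public static void main(String[] args) {
        // query1 from the current query result: {doc1: 10, doc3: 5, doc5: 3}
        ScoreDoc[] current = { new ScoreDoc(1, 10f), new ScoreDoc(3, 5f), new ScoreDoc(5, 3f) };
        // query1 from the next block of segments: {doc2: 11, doc6: 2}
        ScoreDoc[] next = { new ScoreDoc(2, 11f), new ScoreDoc(6, 2f) };
        for (ScoreDoc d : mergeSubQuery(current, next)) {
            System.out.println("doc" + d.doc + ": " + d.score);
        }
        // prints doc2, doc1, doc3, doc5, doc6 - matching the merged result above
    }
}
```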

Added an extensive set of unit tests and an integration test that would fail with the old logic (the assertion on total hits fails because the actual number would be lower).

Issues Resolved

#799

Check List

  • New functionality includes testing.
    • [ ] All tests pass
  • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added the bug, backport 2.x, and hybrid search labels on Jun 21, 2024
@martin-gaievski martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch 2 times, most recently from 401f070 to 6570bcb Compare June 21, 2024 01:08
@martin-gaievski
Member Author

martin-gaievski commented Jun 21, 2024

The BWC check for 2.15 will keep failing because main is still pointing to the snapshot version, and per our process we change that after the release.
The security check CI action will also fail for JDK 11 and 17 because core has changed the minimum requirement for the security plugin to JDK 21; CI for JDK 21 is passing. Created a PR to address it for the plugin: #801

@martin-gaievski martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch 3 times, most recently from a3a09b1 to 1c8f50a Compare June 21, 2024 16:23

codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 82.27848% with 14 lines in your changes missing coverage. Please review.

Project coverage is 85.21%. Comparing base (7c54c86) to head (ed4ee13).
Report is 14 commits behind head on main.

Current head ed4ee13 differs from pull request most recent head d7bb73a

Please upload reports for the commit d7bb73a to get more accurate results.

Files Patch % Lines
...earch/search/query/HybridQueryScoreDocsMerger.java 82.14% 0 Missing and 5 partials ⚠️
...ralsearch/search/query/HybridCollectorManager.java 87.87% 2 Missing and 2 partials ⚠️
...arch/search/util/HybridSearchResultFormatUtil.java 33.33% 2 Missing and 2 partials ⚠️
...earch/neuralsearch/search/query/TopDocsMerger.java 91.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #800      +/-   ##
============================================
+ Coverage     85.02%   85.21%   +0.19%     
- Complexity      790      856      +66     
============================================
  Files            60       68       +8     
  Lines          2430     2686     +256     
  Branches        410      432      +22     
============================================
+ Hits           2066     2289     +223     
- Misses          202      222      +20     
- Partials        162      175      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

/**
 * Utility class for merging TopDocs and MaxScore across multiple search queries
 */
@RequiredArgsConstructor
public class TopDocsMerger {
Member


nit: HybridQueryTopDocsMerger. This class is exclusive to hybrid query.

Member Author


it's actually generic, there isn't any special logic for hybrid query. Will leave the name as it is now.

Member


The merge method is exclusive to hybrid query.

Member Author


Not really, it merges two TopDocsAndMaxScore objects; the only hybrid-query-specific part is in the ScoreDocs merger.
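
For reference, a generic merge of two TopDocsAndMaxScore objects could look roughly like the sketch below, with the hybrid-query-specific handling living entirely in the ScoreDoc[] merger that is passed in. The method and parameter names here are hypothetical, not the plugin's actual API.

```java
// Hypothetical sketch of a generic TopDocsAndMaxScore merge; all hybrid-specific
// logic is delegated to the scoreDocsMerger function. Not the plugin's actual code.
import java.util.function.BinaryOperator;

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;
import org.opensearch.common.lucene.search.TopDocsAndMaxScore;

public class GenericTopDocsMergeSketch {

    static TopDocsAndMaxScore merge(TopDocsAndMaxScore left,
                                    TopDocsAndMaxScore right,
                                    BinaryOperator<ScoreDoc[]> scoreDocsMerger) {
        // Total hits add up; the relation stays exact only if both sides were exact.
        long totalHitsValue = left.topDocs.totalHits.value + right.topDocs.totalHits.value;
        TotalHits.Relation relation =
            left.topDocs.totalHits.relation == TotalHits.Relation.EQUAL_TO
                    && right.topDocs.totalHits.relation == TotalHits.Relation.EQUAL_TO
                ? TotalHits.Relation.EQUAL_TO
                : TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
        // The hybrid-query formatting concerns are inside this delegate.
        ScoreDoc[] mergedScoreDocs = scoreDocsMerger.apply(left.topDocs.scoreDocs, right.topDocs.scoreDocs);
        float maxScore = Math.max(left.maxScore, right.maxScore);
        return new TopDocsAndMaxScore(new TopDocs(new TotalHits(totalHitsValue, relation), mergedScoreDocs), maxScore);
    }
}
```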

CHANGELOG.md review comment (outdated, resolved)
@martin-gaievski martin-gaievski changed the title from "Fixed merge logic for multiple collector result case" to "Fix for missing HybridQuery results when concurrent segment search is enabled" on Jun 24, 2024
@vibrantvarun
Member

Overall looks good to me. Just waiting for @navneet1v review.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from a853197 to 1ed247e Compare June 25, 2024 04:16
@navneet1v
Collaborator

navneet1v commented Jun 25, 2024

Lucene code defines the limit per slice (the block that is processed by one collector): a maximum of 250,000 docs or 5 segments per slice.

@martin-gaievski OpenSearch adds more control on top of this for how the segments are sliced. Ref: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms. Are you making any assumptions based on the Lucene slicing mechanism?

@martin-gaievski
Member Author

Lucene code defines the limit per slice (the block that is processed by one collector): a maximum of 250,000 docs or 5 segments per slice.

@martin-gaievski OpenSearch adds more control on top of this for how the segments are sliced. Ref: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms. Are you making any assumptions based on the Lucene slicing mechanism?

Thanks for sharing that link, I was not aware of the OpenSearch-specific approach. In any case, I'm not making any assumptions based on the number of segments or any other condition on how the slices are constructed. The main principle stays the same: there will be either one collector or many collectors with search results. Previously we handled results from a single collector only; now we can handle the scenario with multiple collectors. A small illustrative sketch of that reduce step is below.
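
To make the "one collector or many collectors" point concrete, the reduce step can simply fold the per-collector results into one shard-level result, regardless of how the slices were constructed. A tiny hedged sketch with hypothetical names, not the actual HybridCollectorManager code:

```java
// Hypothetical sketch: fold results from any number of collectors into one
// shard-level result. With a single collector the loop body never runs.
import java.util.List;
import java.util.function.BinaryOperator;

public class ShardLevelReduceSketch {

    static <R> R reduce(List<R> perCollectorResults, BinaryOperator<R> merger) {
        R merged = perCollectorResults.get(0);
        for (int i = 1; i < perCollectorResults.size(); i++) {
            merged = merger.apply(merged, perCollectorResults.get(i));
        }
        return merged;
    }
}
```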

@martin-gaievski martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from ed4ee13 to 41c5de5 Compare June 25, 2024 16:52
@martin-gaievski
Member Author

Ran a benchmark of https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/noaa_semantic_search on main with and without this change. Here are the results for the medium and large queries, where we have up to 8 segments and where the impact should be most visible:

With the change the performance is actually better. I'm running one more round for the baseline to make sure the data isn't flaky, but these are the preliminary results.

baseline:

   {
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.011052370071411,
     "mean": 2.021946349143982,
     "median": 2.018242835998535,
     "max": 2.0519936084747314,
     "unit": "ops/s"
    },    
    "service_time": {
     "50_0": 91.03496170043945,
     "90_0": 96.44083023071289,
     "99_0": 98.4694938659668,
     "100_0": 98.7366943359375,
     "mean": 91.10263374328613,
     "unit": "ms"
    },
    "error_rate": 0.0
   },
-----
  {
    "task": "hybrid-query-only-range-large-subset",
    "operation": "hybrid-query-only-range-large-subset",
    "throughput": {
     "min": 2.0065014362335205,
     "mean": 2.01287588596344,
     "median": 2.010721802711487,
     "max": 2.0304625034332275,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 188.2995376586914,
     "90_0": 204.8991470336914,
     "99_0": 218.81844329833984,
     "100_0": 222.81707763671875,
     "mean": 186.44740097045897,
     "unit": "ms"
    },
    "error_rate": 0.0
   },

with this change

   {
    "task": "hybrid-query-only-range",
    "operation": "hybrid-query-only-range",
    "throughput": {
     "min": 2.0113322734832764,
     "mean": 2.0225303030014037,
     "median": 2.0187013149261475,
     "max": 2.053476572036743,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 72.53012466430664,
     "90_0": 73.0995979309082,
     "99_0": 89.77497482299805,
     "100_0": 105.39061737060547,
     "mean": 72.93198219299316,
     "unit": "ms"
    },
    "error_rate": 0.0
   },
---
  {
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.0104377269744873,
     "mean": 2.0207511043548583,
     "median": 2.0172306299209595,
     "max": 2.0493412017822266,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 100.66146469116211,
     "90_0": 104.38032913208008,
     "99_0": 106.95694351196289,
     "100_0": 107.02909088134766,
     "mean": 100.12281967163086,
     "unit": "ms"
    },
    "error_rate": 0.0
   }

…or param

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the add_collector_result_merge_for_hybrid_query branch from 41c5de5 to d7bb73a Compare June 25, 2024 17:30
@vibrantvarun
Member

LGTM

@martin-gaievski martin-gaievski merged commit 25d2e82 into opensearch-project:main Jun 25, 2024
62 of 69 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 25, 2024
… enabled (#800)

* Adding merge logic for multiple collector result case

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 25d2e82)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 25, 2024
… enabled (#800)

* Adding merge logic for multiple collector result case

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 25d2e82)
@martin-gaievski
Member Author

I re-ran the benchmark; it shows no change for the large subset and a 5% delta for the medium subset. That looks more realistic, as we only add some computation.

baseline:

{
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.0109918117523193,
     "mean": 2.0218198919296264,
     "median": 2.0181103944778442,
     "max": 2.051851272583008,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 96.55980682373047,
     "90_0": 99.6461067199707,
     "99_0": 102.91871643066406,
     "100_0": 103.3681869506836,
     "mean": 95.76778579711915,
     "unit": "ms"
    },
    "duration": 62003.01305705216
   }
---
   {
    "task": "hybrid-query-only-range-large-subset",
    "operation": "hybrid-query-only-range-large-subset",
    "throughput": {
     "min": 2.0072591304779053,
     "mean": 2.014420256614685,
     "median": 2.0119868516921997,
     "max": 2.0340819358825684,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 196.891845703125,
     "90_0": 209.67161560058594,
     "99_0": 224.50284576416016,
     "100_0": 227.4385986328125,
     "mean": 195.78350692749024,
     "unit": "ms"
    },
    "error_rate": 0.0
   }

after the change

{
    "task": "hybrid-query-only-range-medium-subset",
    "operation": "hybrid-query-only-range-medium-subset",
    "throughput": {
     "min": 2.0046942234039307,
     "mean": 2.0093229246139526,
     "median": 2.0077611207962036,
     "max": 2.0221428871154785,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 88.03913879394531,
     "90_0": 93.1028938293457,
     "99_0": 104.01826095581055,
     "100_0": 106.14491271972656,
     "mean": 88.63995048522949,
     "unit": "ms"
    },
    "error_rate": 0.0
   },
---
   {
    "task": "hybrid-query-only-range-large-subset",
    "operation": "hybrid-query-only-range-large-subset",
    "throughput": {
     "min": 2.0080811977386475,
     "mean": 2.0161024808883665,
     "median": 2.0133992433547974,
     "max": 2.0380523204803467,
     "unit": "ops/s"
    },
    "service_time": {
     "50_0": 200.56720733642578,
     "90_0": 215.31163024902344,
     "99_0": 229.7359161376953,
     "100_0": 233.6544952392578,
     "mean": 196.9455467224121,
     "unit": "ms"
    },
    "error_rate": 0.0
   }

martin-gaievski added a commit that referenced this pull request Jun 25, 2024
… enabled (#800) (#805)

* Adding merge logic for multiple collector result case

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 25d2e82)

Co-authored-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski added a commit that referenced this pull request Jun 25, 2024
… enabled (#800) (#804)

* Adding merge logic for multiple collector result case

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
(cherry picked from commit 25d2e82)

Co-authored-by: Martin Gaievski <gaievski@amazon.com>
Labels
backport 2.x, backport 2.15, bug, hybrid search, v2.16.0