Replaced stream.findFirst by `for` loop for hybrid query #706

martin-gaievski · 2024-04-24T00:58:19Z

Description

Hybrid query is generally slower than other compound queries with similar child sub-queries/clauses. For instance if compared to Boolean it can be up to 12 times slower, depending on the dataset, query and index/cluster configuration. Check results of benchmark that I took for released 2.13 using noaa OSB workload, all time is in ms:

One sub-query that selects 11M documents

Bool: 98.1014
Hybrid: 973.683

One sub-query that selects 1.6K documents

Bool: 181.046
Hybrid: 90.1155

Three sub-query that select 15M documents

Bool: 117.505
Hybrid: 1458.8

Based on results of profiling most of the CPU time (35 to 40%) is taken by Stream.findFirst call in HybridQueryScorer.

That code is executed for each document returned by each of sub-query. That explains much longer execution time for queries that return larger sub-sets of a dataset.

That section of the code can be optimized to a plain for loop, plus the list of Integer is replaced by the plain array of ints. After optimization same code section takes 5 to 8% of overall execution time. Total time for clean hybrid query has been decreased 3-4 times for large sub-sets.

Below are detailed results for the same workload:

One sub-query that selects 11M documents

Bool: 84.7201
Hybrid: 256.799

One sub-query that selects 1.6K documents

Bool: 89.6258
Hybrid: 85.4563

Three sub-query that select 15M documents

Bool: 90.1481
Hybrid: 326.331

following were bool queries used in testing

Query 1
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "term": {
                          "station.country_code": "JA"
                      }
                  },
                  {
                      "range": {
                          "TRANGE": {
                              "gte": 0,
                              "lte": 30
                          }
                      }
                  },
                  {
                    "range": {
                        "date": {
                            "gte": "2016-06-04",
                            "format":"yyyy-MM-dd"
                        }
                    }
                  }
              ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                          "TRANGE": {
                              "gte": -100,
                              "lte": -50
                          }
                      }
                  }
              ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                        "TRANGE": {
                          "gte": 1,
                          "lte": 35
                        }
                      }
                  }
              ]
          }

equivalent hybrid queres are:

Query 1
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "term": {
                        "station.country_code": "JA"
                    }
                },
                {
                    "range": {
                        "TRANGE": {
                            "gte": 0,
                            "lte": 30
                        }
                    }
                },
                {
                  "range": {
                      "date": {
                          "gte": "2016-06-04",
                          "format":"yyyy-MM-dd"
                      }
                  }
                }
            ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                        "TRANGE": {
                          "gte": -100,
                          "lte": -50
                        }
                    }
                }
            ]
          }

Query 3
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                      "TRANGE": {
                        "gte": 1,
                        "lte": 35
                      }
                    }
                }
            ]
          }

Issues Resolved

#705

Check List

All tests pass
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

codecov · 2024-04-24T01:17:56Z

Codecov Report

Attention: Patch coverage is 58.82353% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 84.74%. Comparing base (cc6a6b2) to head (0376e9b).
Report is 12 commits behind head on main.

❗ Current head 0376e9b differs from pull request most recent head e34a7bc. Consider uploading reports for the commit e34a7bc to get more accurate results

Files	Patch %	Lines
...ensearch/neuralsearch/query/HybridQueryScorer.java	58.82%	4 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #706      +/-   ##
============================================
+ Coverage     84.04%   84.74%   +0.69%     
- Complexity      744      774      +30     
============================================
  Files            59       59              
  Lines          2313     2412      +99     
  Branches        374      405      +31     
============================================
+ Hits           1944     2044     +100     
+ Misses          214      203      -11     
- Partials        155      165      +10

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

CHANGELOG.md

navneet1v · 2024-04-24T01:34:33Z

This is an interesting improvement for such a small change. Really liked the deep-dive here.

Few things I would like you to do here just for bookkeeping purpose:

The numbers and details which are quoted in the description please update them in the Issue.
[Next step] Can you look at other places where we might be using streams in the hybrid query clause which may be adding some extra latency?

vibrantvarun · 2024-04-24T02:38:47Z

Great catch @martin-gaievski . LGTM.

chishui · 2024-04-24T05:21:58Z

This is a great improvement, curious to know if you have checked the profiling after this optimization to see if the CPU usage goes down and which one is the next culprit to CPU usage

Co-authored-by: Navneet Verma <vermanavneet003@gmail.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>

* Change stream.findFirst to for loop Signed-off-by: Martin Gaievski <gaievski@amazon.com> Co-authored-by: Navneet Verma <vermanavneet003@gmail.com> (cherry picked from commit b277b07)

opensearch-trigger-bot · 2024-04-24T15:34:28Z

The backport to 2.13 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.13 2.13
# Navigate to the new working tree
cd .worktrees/backport-2.13
# Create a new branch
git switch --create backport/backport-706-to-2.13
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b277b07e89e25f1abaa9de3a326fda3556dc8a77
# Push it to GitHub
git push --set-upstream origin backport/backport-706-to-2.13
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.13

Then, create a pull request where the base branch is 2.13 and the compare/head branch is backport/backport-706-to-2.13.

* Change stream.findFirst to for loop Signed-off-by: Martin Gaievski <gaievski@amazon.com> Co-authored-by: Navneet Verma <vermanavneet003@gmail.com> (cherry picked from commit b277b07)

* Change stream.findFirst to for loop Signed-off-by: Martin Gaievski <gaievski@amazon.com> Co-authored-by: Navneet Verma <vermanavneet003@gmail.com> (cherry picked from commit b277b07) Co-authored-by: Martin Gaievski <gaievski@amazon.com>

* Change stream.findFirst to for loop Signed-off-by: Martin Gaievski <gaievski@amazon.com> Co-authored-by: Navneet Verma <vermanavneet003@gmail.com> (cherry picked from commit b277b07)

martin-gaievski · 2024-04-24T16:29:26Z

This is an interesting improvement for such a small change. Really liked the deep-dive here.

Few things I would like you to do here just for bookkeeping purpose:

The numbers and details which are quoted in the description please update them in the Issue.

[Next step] Can you look at other places where we might be using streams in the hybrid query clause which may be adding some extra latency?

Done
Sure, we need to scan the code for similar calls. I haven't noticed any other places in code with similar problem that significantly add to performance of the query. For now the next step is to exclude TopDocsCollector from the flow when there is a hybrid query, that section has second highest latency.

martin-gaievski · 2024-04-24T16:35:40Z

This is a great improvement, curious to know if you have checked the profiling after this optimization to see if the CPU usage goes down and which one is the next culprit to CPU usage

Next slowest section after Stream.findFirst is extra doc collector. It's added by core for all the queries it consumes resources (per my results it's from 40 to 75% of CPU time) but for hybrid query those results are just ignored. Current feasible approach requires changes in both core and the plugin.
I need to check how CPU usage is after the fix.

martin-gaievski added backport 2.x Label will add auto workflow to backport PR to 2.x branch v2.14.0 hybrid query performance optimization labels Apr 24, 2024

martin-gaievski changed the title ~~Change stream.findFirst to for loop~~ Replace stream.findFirst by for loop Apr 24, 2024

Change stream.findFirst to for loop

348c1a8

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski force-pushed the performance_optimization_hybrid_query_replace_streaming_with_for_loop branch from 0376e9b to 348c1a8 Compare April 24, 2024 01:03

martin-gaievski marked this pull request as ready for review April 24, 2024 01:19

martin-gaievski requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners April 24, 2024 01:19

martin-gaievski changed the title ~~Replace stream.findFirst by for loop~~ Replaced stream.findFirst by for loop for hybrid query Apr 24, 2024

navneet1v reviewed Apr 24, 2024

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

vibrantvarun approved these changes Apr 24, 2024

View reviewed changes

Update CHANGELOG.md

e34a7bc

Co-authored-by: Navneet Verma <vermanavneet003@gmail.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski added the backport 2.13 label Apr 24, 2024

navneet1v approved these changes Apr 24, 2024

View reviewed changes

martin-gaievski merged commit b277b07 into opensearch-project:main Apr 24, 2024
73 checks passed

opensearch-trigger-bot bot mentioned this pull request Apr 24, 2024

[Backport 2.x] Replaced stream.findFirst by for loop for hybrid query #707

Merged

martin-gaievski mentioned this pull request Apr 24, 2024

[Backport 2.13] Replaced stream.findFirst by for loop for hybrid query #708

Merged

martin-gaievski added the hybrid search label Apr 26, 2024

navneet1v mentioned this pull request May 17, 2024

Implement parallel execution of sub-queries for hybrid search #749

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replaced stream.findFirst by `for` loop for hybrid query #706

Replaced stream.findFirst by `for` loop for hybrid query #706

martin-gaievski commented Apr 24, 2024 •

edited

Loading

codecov bot commented Apr 24, 2024 •

edited

Loading

navneet1v commented Apr 24, 2024

vibrantvarun commented Apr 24, 2024

chishui commented Apr 24, 2024

opensearch-trigger-bot bot commented Apr 24, 2024

martin-gaievski commented Apr 24, 2024

martin-gaievski commented Apr 24, 2024

Replaced stream.findFirst by for loop for hybrid query #706

Replaced stream.findFirst by for loop for hybrid query #706

Conversation

martin-gaievski commented Apr 24, 2024 • edited Loading

Description

Issues Resolved

Check List

codecov bot commented Apr 24, 2024 • edited Loading

Codecov Report

navneet1v commented Apr 24, 2024

vibrantvarun commented Apr 24, 2024

chishui commented Apr 24, 2024

opensearch-trigger-bot bot commented Apr 24, 2024

martin-gaievski commented Apr 24, 2024

martin-gaievski commented Apr 24, 2024

Replaced stream.findFirst by `for` loop for hybrid query #706

Replaced stream.findFirst by `for` loop for hybrid query #706

martin-gaievski commented Apr 24, 2024 •

edited

Loading

codecov bot commented Apr 24, 2024 •

edited

Loading