-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address inconsistent scoring in hybrid query results #998
Address inconsistent scoring in hybrid query results #998
Conversation
dce2dd3
to
2edefec
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #998 +/- ##
============================================
+ Coverage 80.29% 80.35% +0.06%
Complexity 997 997
============================================
Files 78 78
Lines 3405 3411 +6
Branches 577 578 +1
============================================
+ Hits 2734 2741 +7
Misses 427 427
+ Partials 244 243 -1 ☔ View full report in Codecov by Sentry. |
} | ||
|
||
@SneakyThrows | ||
public void testHybridScores_withTwoPhaseIterator() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test case would fail with old code
@@ -30,7 +31,7 @@ | |||
* corresponds to order of sub-queries in an input Hybrid query. | |||
*/ | |||
@Log4j2 | |||
public final class HybridQueryScorer extends Scorer { | |||
public class HybridQueryScorer extends Scorer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to remove final
to make this class "mockable"
@@ -187,7 +189,7 @@ public int docID() { | |||
*/ | |||
public float[] hybridScores() throws IOException { | |||
float[] scores = new float[numSubqueries]; | |||
DisiWrapper topList = subScorersPQ.topList(); | |||
DisiWrapper topList = getSubMatches(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is main change for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain here what is the difference between subScorersPQ.topList();
and getSubMatches()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
critical difference is the case when we do have two phase iterator, this can happen when underlying doc iterator uses approximation.
In this case getSubMatches() will return approximation based iterator while subScorersPQ.topList() will be exact Disi iterator. Iterator with approximation can skip some doc ids and jump to next block if current block cannot give competitive score, while disi iterator will always go one by one.
This isn't a problem for simpler cases when there is no two phase iterators. We're seeing two phase iterators often for multi_match
query cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Just wanted to avoid the case where an engineer change this back to subScorersPQ.topList() in the future and introduce same bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, let me put that as a comment
@@ -108,12 +108,17 @@ public void collect(int doc) throws IOException { | |||
} | |||
// Increment total hit count which represents unique doc found on the shard | |||
totalHits++; | |||
hitsThresholdChecker.incrementHitCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a gap team noted while working on the fix, not directly related to reported issue
32a9754
to
ccc5809
Compare
@peterzhuamazon
I've tried to remove ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION param as per this suggestion, it's not helping unfortunately. |
src/main/java/org/opensearch/neuralsearch/search/collector/HybridTopScoreDocCollector.java
Outdated
Show resolved
Hide resolved
693dde6
to
0df0b43
Compare
|
Hi @martin-gaievski you can now rebase and fix the conflicts before getting it merged. Thanks. |
f53f598
to
8e650b3
Compare
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
8e650b3
to
bbe5bac
Compare
Signed-off-by: Martin Gaievski <gaievski@amazon.com> (cherry picked from commit 3c7f275)
Description
For hybrid query we are using
DocIdSetIterator
to iterate over doc ids to get the hybrid scores. That is a problem when sub-queries became complex and system switches to iterator with approximation and two phase approach (introduced in 2.13 in scope of #624).When both
two-phase
andDocIdSetIterator
are in use there is a high chance they will go out of sync and point to different doc ids. This leads toe_o_f_exception
described in the GH issue.Big thanks to @msfroh who pointed that out in opensearch-project/OpenSearch#16660 (comment)
I've added unit tests, but main verification is done manually, as per steps described in here as part of GH issue, mainly it's because some manual steps by installation of analyzer plugin.
Here are request/response after setting up things with
nori
analyser:response
I also tested with concurrent segment search, got similar response which is expected
Related Issues
#964
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.