-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Support for Multi Values in innerHit for Nested k-NN Fields in Lucene and FAISS #2283
Conversation
36cab2b
to
80998eb
Compare
src/main/java/org/opensearch/knn/index/query/ExactSearcher.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reviewing the lucene lib classes as of now and it will take some time for me get through them but publishing the comments here. to kick start the discussion on some of the other comments in the code
src/main/java/org/opensearch/knn/index/query/common/DocAndScoreQuery.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/query/nativelib/NativeEngineKnnVectorQuery.java
Outdated
Show resolved
Hide resolved
if (createQueryRequest.getRescoreContext().isPresent()) { | ||
return new NativeEngineKnnVectorQuery(knnQuery, QUERY_UTILS, isInnerHitQuery); | ||
} else if (ENGINES_SUPPORTING_MULTI_VECTORS.contains(knnEngine) && isInnerHitQuery) { | ||
return new NativeEngineKnnVectorQuery(knnQuery, QUERY_UTILS, isInnerHitQuery); | ||
} else { | ||
return knnQuery; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this logic. We can always call the NativeEngineKnnVectorQuery. Since we are not doing running the query rewrites.
@shatejas , @jmazanec15 was there a reason we kept the logic of to send query in different paths with NativeEngineKnnVectorQuery and knnQuery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to isolate disk based vector search as a precaution, we can always call NativeQuery. If we do that we should consider if we need KNNQuery and KNNWeight classes as it makes the code convoluted with NativeEngine again delegating to another query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shatejas at-least for this PR, I would like us to track and simplify this logic and then make be take another PR for removing the KNNQuery
. Atleast for now I think if we want to remove the KNNQuery
it will be a big refactor which is completely out of scope for this PR. Open for suggestions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can create a separate PR for the change if needed, making it easier to revert in case of any unforeseen issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you say a separate PR you mean for the simplification of this condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
src/main/java/org/opensearch/knn/index/query/lucenelib/NestedKnnVectorInnerHitQuery.java
Outdated
Show resolved
Hide resolved
) throws IOException { | ||
// Construct query | ||
List<Callable<TopDocs>> nestedQueryTasks = new ArrayList<>(leafReaderContexts.size()); | ||
Weight filterWeight = getFilterWeight(indexSearcher); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filter query seems to be executing twice in the flow (one in rewrite and another in here). Its redundant and might add to latencies.
Is there an alternative solution where the support for innerhits can be added in existing lucene queries instead? there might be optimizations like single execution of filter query, not creating Doc and score query multiple times, that can be leveraged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FilterWeight in AbstractKnnVectorQuery
class need to be stored in variable and it should be accessible from child class. Then, we can reuse it.
src/main/java/org/opensearch/knn/index/query/nativelib/NativeEngineKnnVectorQuery.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/query/common/DocAndScoreQuery.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/query/ExactSearcher.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/query/iterators/GroupedNestedDocIdSetIterator.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/query/iterators/GroupedNestedDocIdSetIterator.java
Outdated
Show resolved
Hide resolved
661cff7
to
2b1c552
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me. Just check this thing, since we are not returning all the child documents of the parent docs, will this results into same behavior where if 1 parent child docs are better than other parent child docs, will Opensearch returns just 1 parent doc to customer or it will return 2 parent docs to customers.
Even when multiple nested documents are returned per parent document, they are joined back to the parent document, ensuring that the final parent document count remains unaffected. It has been confirmed that, in such cases, the result will still include 2 parent documents. Create Index With 2 shards
Ingest 4 documents with 10 nested docs per each
Check documents are distributed across 2 shards
Search
ResultConfirmed that 4 result is returned properly.
|
@heemin32 thanks for confirming. Do we have a similar IT test added? also were you able to figure out in the code where this translation of child to parent docs is happening and ensuring that are picking up the parent docs == size only. |
Let me add one. There was a hidden bug that I missed as well. The translation of child to parent docs is happening in |
Signed-off-by: Heemin Kim <heemin@amazon.com>
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2283-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 88792e42f121b050f2fc9cf32b039052aab62128
# Push it to GitHub
git push --set-upstream origin backport/backport-2283-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x Then, create a pull request where the |
Signed-off-by: Heemin Kim <heemin@amazon.com> (cherry picked from commit 88792e4)
Signed-off-by: Heemin Kim <heemin@amazon.com> (cherry picked from commit 88792e4)
Signed-off-by: Heemin Kim <heemin@amazon.com>
Description
This PR introduces support for returning all nested fields with their scores inside innerHit for nested k-NN fields, applicable to both Lucene and FAISS engines.
The implementation involves executing a search request across all segments and collecting results at the shard level, similar to the approach used in disk-based k-NN searches. After reducing the results to the top k, we retrieve all sibling documents associated with these results. Using the IDs of the retrieved sibling documents as filtered document IDs, we perform another exact search to score them comprehensively.
Here are additional explanations for the changes made:
Related Issues
Resolves #2249
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.