Add Support for Multi Values in innerHit for Nested k-NN Fields in Lucene and FAISS #2283

heemin32 · 2024-11-20T17:59:33Z

Description

This PR introduces support for returning all nested fields with their scores inside innerHit for nested k-NN fields, applicable to both Lucene and FAISS engines.

The implementation involves executing a search request across all segments and collecting results at the shard level, similar to the approach used in disk-based k-NN searches. After reducing the results to the top k, we retrieve all sibling documents associated with these results. Using the IDs of the retrieved sibling documents as filtered document IDs, we perform another exact search to score them comprehensively.
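The flow described above can be sketched in plain Java without Lucene. This is a minimal illustration under stated assumptions, not the code in this PR: the class `InnerHitSketch`, the `parentOf` array, and both helper names are hypothetical, phase 1 is simulated with a brute-force scan rather than an ANN search, and the score uses the 1 / (1 + squared L2 distance) convention.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the two-phase inner-hit flow; not the plugin's classes.
class InnerHitSketch {
    // Score for the l2 space: 1 / (1 + squared Euclidean distance).
    static float score(float[] query, float[] vector) {
        float d = 0f;
        for (int i = 0; i < query.length; i++) {
            float diff = query[i] - vector[i];
            d += diff * diff;
        }
        return 1f / (1f + d);
    }

    // Phase 1: keep the best child per parent, then reduce to the top k of those children.
    static List<Integer> topKBestChildren(float[][] children, int[] parentOf, float[] query, int k) {
        Map<Integer, Integer> bestChildOfParent = new HashMap<>();
        for (int child = 0; child < children.length; child++) {
            Integer current = bestChildOfParent.get(parentOf[child]);
            if (current == null || score(query, children[child]) > score(query, children[current])) {
                bestChildOfParent.put(parentOf[child], child);
            }
        }
        List<Integer> ranked = new ArrayList<>(bestChildOfParent.values());
        ranked.sort((a, b) -> Float.compare(score(query, children[b]), score(query, children[a])));
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }

    // Phase 2: collect every sibling of the top-k children and score each one exactly.
    static Map<Integer, Float> expandAndScore(float[][] children, int[] parentOf, float[] query, List<Integer> topChildren) {
        Set<Integer> parents = new HashSet<>();
        for (int child : topChildren) {
            parents.add(parentOf[child]);
        }
        Map<Integer, Float> scores = new TreeMap<>();
        for (int child = 0; child < children.length; child++) {
            if (parents.contains(parentOf[child])) {
                scores.put(child, score(query, children[child]));
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        float[][] children = { { 1f, 1f }, { 2f, 2f }, { 10f, 10f }, { 11f, 11f }, { 100f, 100f } };
        int[] parentOf = { 0, 0, 1, 1, 2 };
        float[] query = { 10f, 10f };
        List<Integer> top = topKBestChildren(children, parentOf, query, 2);
        System.out.println("top-k children: " + top);
        System.out.println("all siblings with scores: " + expandAndScore(children, parentOf, query, top));
    }
}
```

In the sketch, the sibling set produced by phase 2 plays the role of the filtered document IDs that the description passes to the exact search.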

Here are additional explanations for the changes made:

Added JsonPath 2.8.0 as a dependency used only for integration testing. Version 2.9.0 has a dependency conflict with SLF4J.
Adopted a composite approach in NestedKnnVectorInnerHitQuery.java to enable code reuse between byte vectors and float vectors.
Replaced the use of BitSet with DocIdSetIterator for filteredDocId to eliminate the overhead of converting from an iterator to a BitSet and back to an iterator.

Related Issues

Resolves #2249

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

src/main/java/org/opensearch/knn/index/query/ExactSearcher.java

navneet1v

I am still reviewing the Lucene lib classes and it will take some time for me to get through them, but I am publishing my comments so far to kick-start the discussion on other parts of the code.

src/main/java/org/opensearch/knn/index/query/common/QueryUtils.java

src/main/java/org/opensearch/knn/index/query/common/DocAndScoreQuery.java

src/main/java/org/opensearch/knn/index/query/nativelib/NativeEngineKnnVectorQuery.java

navneet1v · 2024-11-22T01:55:28Z

src/main/java/org/opensearch/knn/index/query/KNNQueryFactory.java

+            if (createQueryRequest.getRescoreContext().isPresent()) {
+                return new NativeEngineKnnVectorQuery(knnQuery, QUERY_UTILS, isInnerHitQuery);
+            } else if (ENGINES_SUPPORTING_MULTI_VECTORS.contains(knnEngine) && isInnerHitQuery) {
+                return new NativeEngineKnnVectorQuery(knnQuery, QUERY_UTILS, isInnerHitQuery);
+            } else {
+                return knnQuery;
+            }


I think we can simplify this logic. We can always call the NativeEngineKnnVectorQuery, since we are not running the query rewrites.

@shatejas, @jmazanec15, was there a reason we kept the logic that sends the query down different paths, NativeEngineKnnVectorQuery versus knnQuery?

It was to isolate disk-based vector search as a precaution; we can always call NativeQuery. If we do that, we should consider whether we still need the KNNQuery and KNNWeight classes, since having NativeEngine delegate to yet another query makes the code convoluted.

@shatejas, at least for this PR, I would like us to track and simplify this logic, and then maybe take up removing KNNQuery in another PR. For now, I think removing KNNQuery would be a big refactor that is completely out of scope for this PR. Open to suggestions here.

I can create a separate PR for the change if needed, making it easier to revert in case of any unforeseen issues.

When you say a separate PR, do you mean for the simplification of this condition?
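Short of always returning NativeEngineKnnVectorQuery, the first two branches in the diff above return the same object, so they can be merged without changing behavior. A minimal sketch with hypothetical stand-in types: `Engine`, `Path`, and the contents of the engine set are assumptions for illustration, not the plugin's classes.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the branch in KNNQueryFactory; types are illustrative only.
class QueryPathSketch {
    enum Engine { FAISS, LUCENE, NMSLIB }
    enum Path { NATIVE_ENGINE_QUERY, KNN_QUERY }

    // Assumption: the set of engines is chosen for illustration only.
    static final Set<Engine> ENGINES_SUPPORTING_MULTI_VECTORS = EnumSet.of(Engine.FAISS, Engine.LUCENE);

    // Mirrors the three-branch shape shown in the diff.
    static Path original(boolean rescorePresent, Engine engine, boolean isInnerHitQuery) {
        if (rescorePresent) {
            return Path.NATIVE_ENGINE_QUERY;
        } else if (ENGINES_SUPPORTING_MULTI_VECTORS.contains(engine) && isInnerHitQuery) {
            return Path.NATIVE_ENGINE_QUERY;
        } else {
            return Path.KNN_QUERY;
        }
    }

    // Same decision with the two identical branches merged into one condition.
    static Path simplified(boolean rescorePresent, Engine engine, boolean isInnerHitQuery) {
        boolean needsNativeQuery = rescorePresent
            || (isInnerHitQuery && ENGINES_SUPPORTING_MULTI_VECTORS.contains(engine));
        return needsNativeQuery ? Path.NATIVE_ENGINE_QUERY : Path.KNN_QUERY;
    }

    public static void main(String[] args) {
        for (Engine engine : Engine.values()) {
            System.out.println(engine + " innerHit=true rescore=false -> " + simplified(false, engine, true));
        }
    }
}
```

Always returning the native query, as suggested in the thread, would reduce `simplified` to a single return statement; whether that is safe depends on the disk-based search concern raised above.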

src/main/java/org/opensearch/knn/index/query/ResultUtil.java

src/main/java/org/opensearch/knn/index/query/lucenelib/NestedKnnVectorInnerHitQuery.java

shatejas · 2024-11-25T16:53:45Z

src/main/java/org/opensearch/knn/index/query/lucenelib/NestedKnnVectorInnerHitQuery.java

+    ) throws IOException {
+        // Construct query
+        List<Callable<TopDocs>> nestedQueryTasks = new ArrayList<>(leafReaderContexts.size());
+        Weight filterWeight = getFilterWeight(indexSearcher);


The filter query seems to execute twice in this flow (once in rewrite and again here). It's redundant and might add latency.

Is there an alternative solution where support for inner hits is added to the existing Lucene queries instead? There might be optimizations to leverage, such as executing the filter query only once and not creating the DocAndScoreQuery multiple times.

The filter weight in the AbstractKnnVectorQuery class needs to be stored in a variable that is accessible from the child class. Then we can reuse it.

src/main/java/org/opensearch/knn/index/query/nativelib/NativeEngineKnnVectorQuery.java

src/main/java/org/opensearch/knn/index/query/common/DocAndScoreQuery.java

src/main/java/org/opensearch/knn/index/query/ExactSearcher.java

src/main/java/org/opensearch/knn/index/query/KNNWeight.java

src/main/java/org/opensearch/knn/index/query/iterators/GroupedNestedDocIdSetIterator.java

navneet1v

Code looks good to me. Just check one thing: since we are not returning all the child documents of the parent docs, will this result in the same behavior? If one parent's child docs score better than another parent's child docs, will OpenSearch return just 1 parent doc to the customer, or will it return 2 parent docs?

heemin32 · 2024-12-10T01:01:21Z

Even when multiple nested documents are returned per parent document, they are joined back to the parent document, ensuring that the final parent document count remains unaffected. It has been confirmed that, in such cases, the result will still include 2 parent documents.

Create Index With 2 shards

PUT /my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 2
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 3,
            "space_type": "l2",
            "method": {
              "name": "hnsw",
              "engine": "faiss"
            }
          }
        }
      }
    }
  }
}

Ingest 4 documents with 10 nested docs each

PUT /_bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"nested_field":[{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"nested_field":[{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{"nested_field":[{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
{"nested_field":[{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]}]}

Check documents are distributed across 2 shards

GET /_cat/shards/my-knn-index-1
my-knn-index-1 0 p STARTED    33 4.8kb 127.0.0.1 integTest-0
my-knn-index-1 0 r UNASSIGNED                    
my-knn-index-1 1 p STARTED    11 4.5kb 127.0.0.1 integTest-0
my-knn-index-1 1 r UNASSIGNED

Search

GET /my-knn-index-1/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [
              10,
              10,
              10
            ],
            "k": 4,
            "expand_nested_docs": true
          }
        }
      },
      "score_mode": "max"
    }
  }
}

Result

Confirmed that all 4 parent documents are returned.

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 1.0
      },
      {
        "_index": "my-knn-index-1",
        "_id": "1",
        "_score": 0.0040983604
      },
      {
        "_index": "my-knn-index-1",
        "_id": "3",
        "_score": 4.115057E-5
      },
      {
        "_index": "my-knn-index-1",
        "_id": "4",
        "_score": 3.4010122E-7
      }
    ]
  }
}
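The scores in this response are consistent with the k-NN plugin's `l2` scoring, score = 1 / (1 + d), where d is the squared Euclidean distance between the query and the best-matching nested vector. The following sketch recomputes them for the four ingested documents; the class and method names are hypothetical, and the arithmetic is done in double precision, so the last digits can differ slightly from the float values above.

```java
// Recomputes the l2-space scores for the example documents; names are illustrative.
class L2ScoreSketch {
    static double squaredL2(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // score = 1 / (1 + squared L2 distance)
    static double score(double[] query, double[] vector) {
        return 1.0 / (1.0 + squaredL2(query, vector));
    }

    public static void main(String[] args) {
        double[] query = { 10, 10, 10 };
        double[][] docs = { { 1, 1, 1 }, { 10, 10, 10 }, { 100, 100, 100 }, { 1000, 1000, 1000 } };
        for (int i = 0; i < docs.length; i++) {
            System.out.printf("_id %d -> %.7e%n", i + 1, score(query, docs[i]));
        }
    }
}
```

Because every nested vector within a parent is identical in this test data, `score_mode: max` yields the same value as any single child's score.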

navneet1v · 2024-12-10T01:33:54Z

@heemin32 thanks for confirming. Do we have a similar IT added? Also, were you able to find where in the code this translation of child docs to parent docs happens, and where it ensures that we pick up only as many parent docs as size?

heemin32 · 2024-12-10T20:16:35Z

Let me add one. There was a hidden bug that I missed as well. The translation of child to parent docs is happening in NestedQueryBuilder -> OpenSearchToParentBlockJoinQuery -> ToParentBlockJoinQuery.
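To illustrate why the parent count is unaffected, here is a simplified sketch of a child-to-parent join with `score_mode: max`. It is not how ToParentBlockJoinQuery is implemented (Lucene joins on document blocks rather than on a map); the class `ParentJoinSketch` and its inputs are hypothetical and only model the observable result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of joining child hits to parents with score_mode = max.
class ParentJoinSketch {
    // parentIds[i] and scores[i] describe the i-th child hit.
    static LinkedHashMap<Integer, Float> joinToParents(int[] parentIds, float[] scores, int size) {
        // Collapse all children of a parent into one entry holding the max child score.
        Map<Integer, Float> bestPerParent = new HashMap<>();
        for (int i = 0; i < parentIds.length; i++) {
            bestPerParent.merge(parentIds[i], scores[i], Float::max);
        }
        // Rank parents by score, best first, and keep at most `size` of them.
        List<Map.Entry<Integer, Float>> ranked = new ArrayList<>(bestPerParent.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        LinkedHashMap<Integer, Float> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> entry : ranked) {
            if (result.size() == size) {
                break;
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Three children of parent 1 all outscore both children of parent 2.
        int[] parentIds = { 1, 1, 1, 2, 2 };
        float[] scores = { 0.9f, 0.8f, 0.7f, 0.5f, 0.4f };
        System.out.println(joinToParents(parentIds, scores, 2));
    }
}
```

Since children are collapsed per parent before the result is truncated to `size`, one parent with many high-scoring children cannot crowd out another parent.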

Signed-off-by: Heemin Kim <heemin@amazon.com>

opensearch-trigger-bot · 2024-12-11T17:36:08Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2283-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 88792e42f121b050f2fc9cf32b039052aab62128
# Push it to GitHub
git push --set-upstream origin backport/backport-2283-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2283-to-2.x.

heemin32 requested review from navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan, luyuncheng and shatejas as code owners November 20, 2024 17:59

heemin32 changed the title ~~Multiple innerHit in nested fields~~ Add Support for Multi Values in innerHit for Nested k-NN Fields in Lucene and FAISS Nov 20, 2024

heemin32 force-pushed the innerhit branch 2 times, most recently from 36cab2b to 80998eb Compare November 20, 2024 18:12

navneet1v reviewed Nov 22, 2024

src/main/java/org/opensearch/knn/index/query/ExactSearcher.java

navneet1v reviewed Nov 22, 2024

heemin32 force-pushed the innerhit branch from 80998eb to 55d09dc Compare November 22, 2024 05:32

shatejas reviewed Nov 25, 2024

heemin32 force-pushed the innerhit branch from 55d09dc to 142ae3c Compare November 25, 2024 22:27

navneet1v reviewed Nov 25, 2024

src/main/java/org/opensearch/knn/index/query/ExactSearcher.java

navneet1v reviewed Nov 25, 2024

src/main/java/org/opensearch/knn/index/query/KNNWeight.java

navneet1v reviewed Nov 25, 2024

src/main/java/org/opensearch/knn/index/query/iterators/GroupedNestedDocIdSetIterator.java

navneet1v reviewed Nov 25, 2024

src/main/java/org/opensearch/knn/index/query/iterators/GroupedNestedDocIdSetIterator.java

heemin32 force-pushed the innerhit branch 6 times, most recently from 661cff7 to 2b1c552 Compare November 26, 2024 06:58

heemin32 requested review from shatejas and navneet1v November 26, 2024 07:03

heemin32 force-pushed the innerhit branch from eb50dc1 to d1ded83 Compare December 9, 2024 18:35

heemin32 requested review from navneet1v and jmazanec15 December 9, 2024 18:37

navneet1v previously approved these changes Dec 9, 2024

jmazanec15 previously approved these changes Dec 10, 2024

Multiple innerHit in nested fields

991f0c8

Signed-off-by: Heemin Kim <heemin@amazon.com>

heemin32 dismissed stale reviews from jmazanec15 and navneet1v via 991f0c8 December 10, 2024 21:19

heemin32 force-pushed the innerhit branch from d1ded83 to 991f0c8 Compare December 10, 2024 21:19

heemin32 requested review from jmazanec15 and navneet1v December 10, 2024 21:20

heemin32 mentioned this pull request Dec 10, 2024

[FEATURE]Support of new k-NN query parameter, expand_nested_docs opensearch-project/neural-search#1008

Closed

navneet1v approved these changes Dec 10, 2024

jmazanec15 approved these changes Dec 11, 2024

heemin32 added backport 2.x v2.19.0 labels Dec 11, 2024

heemin32 merged commit 88792e4 into opensearch-project:main Dec 11, 2024
37 of 39 checks passed

heemin32 added backport 2.x and removed backport 2.x labels Dec 11, 2024

opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 11, 2024

Multiple innerHit in nested fields (#2283)

cb93c9b

Signed-off-by: Heemin Kim <heemin@amazon.com> (cherry picked from commit 88792e4)

opensearch-trigger-bot bot mentioned this pull request Dec 11, 2024

[Backport 2.x] Add Support for Multi Values in innerHit for Nested k-NN Fields in Lucene and FAISS #2323

Merged

heemin32 added a commit that referenced this pull request Dec 11, 2024

Multiple innerHit in nested fields (#2283)

b289ca6

Signed-off-by: Heemin Kim <heemin@amazon.com> (cherry picked from commit 88792e4)

navneet1v pushed a commit that referenced this pull request Dec 11, 2024

Multiple innerHit in nested fields (#2283) (#2323)

739affe

Signed-off-by: Heemin Kim <heemin@amazon.com> (cherry picked from commit 88792e4) Co-authored-by: Heemin Kim <heemin@amazon.com>

heemin32 mentioned this pull request Dec 11, 2024

[FEATURE] Score mode support other than max with KNN nested field #1743

Closed

martin-gaievski mentioned this pull request Dec 17, 2024

Fixed failed test after knn added multiple inner hits feature opensearch-project/neural-search#1026

Merged

owenhalpert pushed a commit to owenhalpert/k-NN that referenced this pull request Dec 19, 2024

Multiple innerHit in nested fields (opensearch-project#2283)

32a8cca

Signed-off-by: Heemin Kim <heemin@amazon.com>
