
[FEATURE] support default model id in neural_sparse query #614

Merged

Conversation

zhichao-aws
Member

@zhichao-aws zhichao-aws commented Feb 26, 2024

Description

See #610.
In this PR we support a default model_id for the neural_sparse query. The existing code checks the class of the queryBuilder in the visitor (ref) and uses the modelId method to set the default model id for all NeuralQueryBuilder instances. This PR creates a new interface, ModelInferenceQueryBuilder, with the common method for setting a default model id. We let NeuralQueryBuilder and NeuralSparseQueryBuilder implement ModelInferenceQueryBuilder, and in the visitor we check for and set the default model id on any ModelInferenceQueryBuilder. The unit tests, integration tests, and bwc tests cover the changes in this PR and make sure there is no impact on existing functionality.
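The interface and visitor behavior described above can be sketched roughly as follows. This is a simplified, hypothetical sketch: `DemoQueryBuilder` and `VisitorSketch` are illustrative stand-ins, not the merged NeuralQueryBuilder/NeuralSparseQueryBuilder or NeuralSearchQueryVisitor classes.

```java
import java.util.List;

// Common contract for query builders that run ml-commons model inference.
interface ModelInferenceQueryBuilder {
    String modelId();                                   // null when no model id was set
    ModelInferenceQueryBuilder modelId(String modelId); // fluent setter, lombok-style
    String fieldName();
}

// Hypothetical stand-in for NeuralQueryBuilder / NeuralSparseQueryBuilder.
class DemoQueryBuilder implements ModelInferenceQueryBuilder {
    private String modelId;
    private final String fieldName;

    DemoQueryBuilder(String fieldName) { this.fieldName = fieldName; }
    public String modelId() { return modelId; }
    public DemoQueryBuilder modelId(String modelId) { this.modelId = modelId; return this; }
    public String fieldName() { return fieldName; }
}

public class VisitorSketch {
    // Simplified version of what the visitor does: set the default model id on
    // any ModelInferenceQueryBuilder that does not have one yet.
    public static void applyDefault(List<?> queryBuilders, String defaultModelId) {
        for (Object qb : queryBuilders) {
            if (qb instanceof ModelInferenceQueryBuilder) {
                ModelInferenceQueryBuilder m = (ModelInferenceQueryBuilder) qb;
                if (m.modelId() == null) {
                    m.modelId(defaultModelId);
                }
            }
        }
    }

    public static void main(String[] args) {
        DemoQueryBuilder dense = new DemoQueryBuilder("passage_vector");
        DemoQueryBuilder sparse = new DemoQueryBuilder("passage_embedding").modelId("explicit-id");
        applyDefault(List.of(dense, sparse), "default-id");
        // An explicitly set id is never overwritten; only the missing one is filled in.
        System.out.println(dense.modelId() + " " + sparse.modelId()); // default-id explicit-id
    }
}
```

Because both builders share the interface, the visitor no longer needs per-class checks for each query type.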

Issues Resolved

#610

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@zhichao-aws zhichao-aws changed the title FEATURE] support default model id in neural_sparse query [FEATURE] support default model id in neural_sparse query Feb 26, 2024
@zhichao-aws zhichao-aws force-pushed the neural_sparse_default_model_id branch from 8257486 to 7176ebb Compare February 26, 2024 10:18

codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 85.18519% with 4 lines in your changes missing coverage. Please review.

Project coverage is 82.65%. Comparing base (759a971) to head (6946d3a).
Report is 1 commit behind head on main.

Files Patch % Lines
...search/query/visitor/NeuralSearchQueryVisitor.java 81.81% 0 Missing and 2 partials ⚠️
...nsearch/neuralsearch/query/NeuralQueryBuilder.java 50.00% 0 Missing and 1 partial ⚠️
...h/neuralsearch/query/NeuralSparseQueryBuilder.java 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #614      +/-   ##
============================================
+ Coverage     82.57%   82.65%   +0.07%     
- Complexity      656      663       +7     
============================================
  Files            52       52              
  Lines          2055     2064       +9     
  Branches        328      330       +2     
============================================
+ Hits           1697     1706       +9     
  Misses          212      212              
  Partials        146      146              


@zhichao-aws
Member Author

This PR also contains the content of #615. To keep a clean commit history, please merge #615 first and I'll rebase on main again.

@vibrantvarun
Member

@zhichao-aws We have fixed the build on main. Can you rebase and try running failing tests again? Thanks

@zhichao-aws zhichao-aws force-pushed the neural_sparse_default_model_id branch from acce8d2 to 7f3d52f Compare February 28, 2024 02:02
@zhichao-aws
Member Author

rebase on #615

@zhichao-aws zhichao-aws force-pushed the neural_sparse_default_model_id branch from 7f3d52f to 1189b4e Compare February 29, 2024 04:00
Signed-off-by: zhichao-aws <zhichaog@amazon.com>
}
}

public void testNeuralQueryEnricherProcessor_NeuralSearch_E2EFlow() throws Exception {
Member

The same comments I made for the tests above apply here as well.

qa/restart-upgrade/build.gradle
@zhichao-aws
Member Author

I am thinking more on this.

I think this functionality will break the scenarios where a user wants a different modelId for the neural search query and the neural sparse search query.

Also, if a user doesn't want default model id support on either the neural query or the neural sparse query, they cannot opt out, because a processor we create is by default applied to both query builders.

Therefore, there should be a different solution.

Here is my thinking: this work can be done in possibly 2 ways.

  1. Create a new processor NeuralSparseQueryEnricherProcessor and a new visitor.

or

  2. Create new fields in the existing neural_query_enricher processor, like default_sparse_modelId, and set it there, along with the field map. But there would be a lot of ifs and elses in the NeuralQueryEnricherProcessor. We would also need a new visitor, because the current visitor would then need an if check for the neural sparse query as well, which would break some scenarios in the NeuralQuery model id support.

Therefore my suggestion is to go with Option 1. Maintenance with option 1 is easy, as it gives flexibility to extend the functionality for multiple use cases.

cc: @navneet1v @martin-gaievski

This scenario is similar to users having 2 knn fields and needing 2 default model_ids for the neural query. In that case, users can set neural_field_default_id to set the default model_id for each field. For the scenario you mentioned, it works the same way.

/**
* Set a new model id for the query builder.
*/
public ModelInferenceQueryBuilder modelId(String modelId);
Collaborator

Is this returned value used anywhere?

Member Author

Yes, it's used by the NeuralSearchQueryVisitor to check whether the model id field is null

Collaborator

Can you point the line using this return value?

Collaborator

The line you're pointing at seems to be the getter method without a parameter, am I missing anything?

Member Author

@zhichao-aws zhichao-aws Mar 11, 2024

We just set the modelId and don't use the return value. But this setter method is generated automatically by the lombok @Setter annotation, and the return value of the setter is the class itself by default.

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
@vibrantvarun
Member

vibrantvarun commented Mar 10, 2024

I am thinking more on this.
I think this functionality will break the scenarios where a user wants a different modelId for the neural search query and the neural sparse search query.
Also, if a user doesn't want default model id support on either the neural query or the neural sparse query, they cannot opt out, because a processor we create is by default applied to both query builders.
Therefore, there should be a different solution.
Here is my thinking: this work can be done in possibly 2 ways.

  1. Create a new processor NeuralSparseQueryEnricherProcessor and a new visitor.

or

  2. Create new fields in the existing neural_query_enricher processor, like default_sparse_modelId, and set it there, along with the field map. But there would be a lot of ifs and elses in the NeuralQueryEnricherProcessor. We would also need a new visitor, because the current visitor would then need an if check for the neural sparse query as well, which would break some scenarios in the NeuralQuery model id support.

Therefore my suggestion is to go with Option 1. Maintenance with option 1 is easy, as it gives flexibility to extend the functionality for multiple use cases.
cc: @navneet1v @martin-gaievski

This scenario is similar to users having 2 knn fields and needing 2 default model_ids for the neural query. In that case, users can set neural_field_default_id to set the default model_id for each field. For the scenario you mentioned, it works the same way.

Hey @zhichao-aws

Consider that you created a processor:

PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher" : {
        "default_model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "neural_field_default_id": {
           "my_field_1": "uZj0qYoBMtvQlfhaYeud",
           "my_field_2": "upj0qYoBMtvQlfhaZOuM"
        }
      }
    }
  ]
}

By the code definition, first priority is given to neural_field_default_id when setting the model id in the query builder. If neural_field_default_id is not present, it falls back to default_model_id.
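The priority order described above can be sketched as a small resolution function. This is a hypothetical helper for illustration (`ModelIdResolver` is not a real class in the plugin; the equivalent logic lives inside NeuralSearchQueryVisitor):

```java
import java.util.Map;

public class ModelIdResolver {
    // Resolve the model id for one query builder:
    // 1. an id already set on the query itself always wins,
    // 2. then a field-specific id from neural_field_default_id,
    // 3. then the processor-level default_model_id,
    // 4. otherwise it is an error.
    public static String resolve(String queryModelId, String fieldName,
                                 Map<String, ?> neuralFieldMap, String defaultModelId) {
        if (queryModelId != null) {
            return queryModelId;
        }
        if (neuralFieldMap != null && fieldName != null && neuralFieldMap.get(fieldName) != null) {
            return (String) neuralFieldMap.get(fieldName); // field-specific default
        }
        if (defaultModelId != null) {
            return defaultModelId; // processor-level default
        }
        throw new IllegalArgumentException(
            "model id must be provided in neural query or a default model id must be set in search request processor");
    }

    public static void main(String[] args) {
        Map<String, String> fieldMap = Map.of("passage_text", "uZj0qYoBMtvQlfhaYeud");
        // field-specific id wins over the default for passage_text
        System.out.println(resolve(null, "passage_text", fieldMap, "bQ1J8ooBpBj3wT4HVUsb"));
        // other fields fall back to default_model_id
        System.out.println(resolve(null, "passage_vector", fieldMap, "bQ1J8ooBpBj3wT4HVUsb"));
    }
}
```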

Q) Consider an example where a cx wants to do a neural search with a default_model_id that applies to all fields, regardless of any field-specific model id.

A) The cx can set a default_model_id, which applies to all fields:

PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher" : {
        "default_model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "neural_field_default_id": {
        }
      }
    }
  ]
}

Q) Now, if a cx wants a field-specific model id for the neural sparse search (let's say the field name is passage_text) and, for the neural search, a default_model_id that applies to all fields, how will the work you did behave?

A) The cx creates a processor that has a default_model_id for neural search and a field-specific model id for neural sparse search:

PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher" : {
        "default_model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "neural_field_default_id": {
             "passage_text": "uZj0qYoBMtvQlfhaYeud"
        }
      }
    }
  ]
}

Now, as per your code, it will always select the field-specific model id, because the field map will never be empty, and the code you added in NeuralSearchQueryVisitor will always apply the same logic to neural search and neural sparse search.

Moreover, the [reply](https://github.com//pull/614#issuecomment-1986586819) you gave here says to create a different model id for each field in the neural_field_default_id map.

PUT /_search/pipeline/default_model_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher" : {
        "default_model_id": null,
        "neural_field_default_id": {
           "passage_text": "uZj0qYoBMtvQlfhaYeud",
           "passage_text_1": " LuZj0qYoBMtvQlfhaYeud"
        }
      }
    }
  ]
}
  1. The problem with the above scenarios is that the code will fetch the model id for whatever field is passed in the map. But most cx rely heavily on default_model_id, as they prefer to apply 1 model id to all fields rather than giving them separately in the map.

  2. One more problem: what if, on the same field name, I want to use 2 model ids, one in the neural search query and one in the neural sparse query?

There might also be more scenarios that we are missing. I have already discussed this with @navneet1v and would advise creating a new visitor. Also, for more readability, add one more map in the neural_query_enricher processor, i.e. neural_sparse_field_default_id, and use it in the new visitor.

Also consider 1 more scenario:

"query": {
    "hybrid": {
      "queries": [
        {
          "neural_sparse": {
            "passage_embedding": {
              "query_text": "Hi world"
            }
          }
        },
        {
          "neural": {
            "passage_embedding": {
              "query_text": "Hi world",
              "k": 5
            }
          }
        }
      ]
    }
  }

How can someone give a different model id to a specific field in each of the two clauses above, when both use the same field name, with the code implementation in this PR?

@zhichao-aws
Member Author

Hi @vibrantvarun

How can someone give a different model id for a specific field in both the clauses above when having same field name by using your code implementation in the PR?

This is an impossible scenario, because the neural query works only on knn fields, and the neural_sparse query works only on rank_features fields. One field cannot be both a knn field and a rank_features field, so there must be 2 field names, one for each.

Now, as per your code, it will always select the field-specific model id, because the field map will never be empty, and the code you added in NeuralSearchQueryVisitor will always apply the same logic to neural search and neural sparse search.

Now the code implementation is as follows:

        if (modelInferenceQueryBuilder.modelId() == null) {
            if (neuralFieldMap != null
                && modelInferenceQueryBuilder.fieldName() != null
                && neuralFieldMap.get(modelInferenceQueryBuilder.fieldName()) != null) {
                String fieldDefaultModelId = (String) neuralFieldMap.get(modelInferenceQueryBuilder.fieldName());
                modelInferenceQueryBuilder.modelId(fieldDefaultModelId);
            } else if (modelId != null) {
                modelInferenceQueryBuilder.modelId(modelId);
            } else {
                throw new IllegalArgumentException(
                    "model id must be provided in neural query or a default model id must be set in search request processor"
                );
            }
        }

Consider that we have a default model id for neural and a field-specific model id for neural_sparse, with field name passage_embedding. For a neural_sparse query on passage_embedding, it works fine. For a neural query on other fields, the first if statement's check neuralFieldMap.get(modelInferenceQueryBuilder.fieldName()) != null will fail, because we only have passage_embedding in the map and the field name for the neural query cannot be passage_embedding. So it will fall through to the second if statement and assign the default model_id.

@zane-neo
Collaborator

LGTM, approving.

@vibrantvarun
Member

Looks much better now. LGTM, but before I approve this pull request, can you add a documentation-website issue to add documentation that helps cx understand the feature?

Sample issue: opensearch-project/documentation-website#5060

Reference: https://opensearch.org/docs/latest/search-plugins/semantic-search/#setting-a-default-model-on-an-index-or-field

@zhichao-aws
Member Author

Looks much better now. LGTM, but before I approve this pull request, can you add a documentation-website issue to add documentation that helps cx understand the feature?

Sample issue: opensearch-project/documentation-website#5060

Reference: https://opensearch.org/docs/latest/search-plugins/semantic-search/#setting-a-default-model-on-an-index-or-field

Good point. The document issue link: opensearch-project/documentation-website#6652

Member

@vibrantvarun vibrantvarun left a comment

LGTM

CHANGELOG.md
@@ -108,7 +119,9 @@ protected void doXContent(XContentBuilder xContentBuilder, Params params) throws
xContentBuilder.startObject(NAME);
xContentBuilder.startObject(fieldName);
xContentBuilder.field(QUERY_TEXT_FIELD.getPreferredName(), queryText);
xContentBuilder.field(MODEL_ID_FIELD.getPreferredName(), modelId);
if (modelId != null) {
Member

we can use Objects.nonNull here

Member Author

Ack

/**
* Set a new model id for the query builder.
*/
public ModelInferenceQueryBuilder modelId(String modelId);
Member

Not sure I got the design idea for this method. It looks like a setter, but it returns a value, and in the example that you've provided that value is not even used.
If my understanding is correct, I suggest making the method's return type void.

Member Author

In NeuralSparseQueryBuilder and NeuralQueryBuilder we have a modelId setter method generated automatically by the lombok @Setter annotation. The auto-generated modelId setter has the class itself as its return type.

Do you prefer that we add a setModelId method with a void return type in NeuralSparseQueryBuilder and NeuralQueryBuilder?
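For reference, a lombok-generated fluent setter (assuming @Accessors(fluent = true, chain = true) is in effect, which gives the modelId() naming and the chained return value) is roughly equivalent to this hand-written version; SparseBuilder is an illustrative stand-in, not the plugin's class:

```java
public class SparseBuilder {
    private String modelId;

    // Chained setter: returns the builder itself so calls can be fluent.
    public SparseBuilder modelId(String modelId) {
        this.modelId = modelId;
        return this;
    }

    // Matching fluent getter.
    public String modelId() {
        return modelId;
    }

    public static void main(String[] args) {
        SparseBuilder b = new SparseBuilder().modelId("abc"); // chaining works
        System.out.println(b.modelId()); // abc
    }
}
```

The returned value exists to enable chaining; callers are free to ignore it, which is why the visitor can call the setter as if it returned void.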

/**
* Get the model id used by ml-commons model inference. Return null if the model id is absent.
*/
public String modelId();
Member

If we expect the model id to be null, why not use Optional as the return type? That gives a clear indication to the caller that a null object is a designed, normal case rather than an exception.

Member Author

This getter method is also generated automatically by lombok.

Do you prefer we define a getModelId with Optional return type in NeuralSparseQueryBuilder and NeuralQueryBuilder?
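For comparison, the Optional-returning variant discussed above could look roughly like this (a sketch with illustrative names, not the plugin's actual API):

```java
import java.util.Optional;

public class OptionalGetterSketch {
    private String modelId; // may legitimately be absent until a default is applied

    // Fluent setter, mirroring the lombok-style one discussed above.
    public OptionalGetterSketch modelId(String modelId) {
        this.modelId = modelId;
        return this;
    }

    // Optional makes the "absent" case explicit at the call site.
    public Optional<String> getModelId() {
        return Optional.ofNullable(modelId);
    }

    public static void main(String[] args) {
        OptionalGetterSketch q = new OptionalGetterSketch();
        System.out.println(q.getModelId().isPresent()); // false: no id set yet
        q.modelId("bQ1J8ooBpBj3wT4HVUsb");
        System.out.println(q.getModelId().orElse("none"));
    }
}
```

The trade-off is that lombok cannot generate this form, so the getter would have to be hand-written alongside the generated accessors.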

if (neuralQueryBuilder.modelId() == null) {
if (queryBuilder instanceof ModelInferenceQueryBuilder) {
ModelInferenceQueryBuilder modelInferenceQueryBuilder = (ModelInferenceQueryBuilder) queryBuilder;
if (modelInferenceQueryBuilder.modelId() == null) {
Member

Objects.nonNull, or Optional.isPresent if you'd like to take Optional route

Member Author

ack

String fieldDefaultModelId = (String) neuralFieldMap.get(neuralQueryBuilder.fieldName());
neuralQueryBuilder.modelId(fieldDefaultModelId);
&& modelInferenceQueryBuilder.fieldName() != null
&& neuralFieldMap.get(modelInferenceQueryBuilder.fieldName()) != null) {
Member

These conditions are hard to read. Can you encapsulate them into a method and give it a meaningful name, for instance isDefaultModelIdDefined?

Member Author

ack
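Extracting those conditions behind a named helper, as suggested above, could look roughly like this (a hypothetical sketch; the merged method name and class layout may differ):

```java
import java.util.Map;

public class VisitorHelperSketch {
    private final Map<String, ?> neuralFieldMap;

    public VisitorHelperSketch(Map<String, ?> neuralFieldMap) {
        this.neuralFieldMap = neuralFieldMap;
    }

    // Encapsulates the three null checks behind one meaningful name.
    public boolean isDefaultModelIdDefined(String fieldName) {
        return neuralFieldMap != null
            && fieldName != null
            && neuralFieldMap.get(fieldName) != null;
    }

    public static void main(String[] args) {
        VisitorHelperSketch v = new VisitorHelperSketch(Map.of("passage_text", "model-a"));
        System.out.println(v.isDefaultModelIdDefined("passage_text")); // true
        System.out.println(v.isDefaultModelIdDefined("other_field"));  // false
    }
}
```

The calling if statement then reads as a single intention-revealing check instead of a chain of null comparisons.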

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
Signed-off-by: zhichao-aws <zhichaog@amazon.com>
@vibrantvarun
Member

Please check the code coverage and fix the GH action.

@vibrantvarun vibrantvarun added v2.13.0 backport 2.x Label will add auto workflow to backport PR to 2.x branch labels Mar 14, 2024
@vibrantvarun vibrantvarun merged commit e41fba7 into opensearch-project:main Mar 14, 2024
64 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-614-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 e41fba7aa82401bd9dce5dec4e08ea7191512fa9
# Push it to GitHub
git push --set-upstream origin backport/backport-614-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-614-to-2.x.

zhichao-aws added a commit to zhichao-aws/neural-search that referenced this pull request Mar 14, 2024
…-project#614)

* feature: implement default model id for neural sparse

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* feature: implement default model id for neural sparse

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add ut

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add ut it

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add changelog

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix ingest pipeline in it

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix undeploy with retry

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* optimize it code structure

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc rolling-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* tidy

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* update index mapping in it

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* move version check to build script

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* resolve modelId

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* update init model id

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* modify versions check logic in bwc test

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add comments

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* updates for comments

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

---------

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
(cherry picked from commit e41fba7)
zane-neo pushed a commit that referenced this pull request Mar 15, 2024
* [FEATURE] support default model id in neural_sparse query (#614)

* feature: implement default model id for neural sparse

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* feature: implement default model id for neural sparse

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add ut

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add ut it

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add changelog

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix ingest pipeline in it

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* fix undeploy with retry

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc restart-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* optimize it code structure

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add it for bwc rolling-upgrade

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* tidy

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* update index mapping in it

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* move version check to build script

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* resolve modelId

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* update init model id

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* modify versions check logic in bwc test

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add comments

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* nit

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* updates for comments

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

---------

Signed-off-by: zhichao-aws <zhichaog@amazon.com>
(cherry picked from commit e41fba7)

* resolve conflicts

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* spotless Apply

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* add dependency

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

* update build.gradle

Signed-off-by: zhichao-aws <zhichaog@amazon.com>

---------

Signed-off-by: zhichao-aws <zhichaog@amazon.com>