Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for knn_vector property type #524

Merged
merged 14 commits into from
Jun 20, 2023

Conversation

maltehedderich
Copy link
Contributor

Description

Adds support for the knn_vector property type and it's corresponding index settings.

Issues Resolved

#437

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Malte Hedderich <github@hedderich.pro>
Signed-off-by: Malte Hedderich <github@hedderich.pro>
…etting

Signed-off-by: Malte Hedderich <github@hedderich.pro>
Signed-off-by: Malte Hedderich <github@hedderich.pro>
Signed-off-by: Malte Hedderich <github@hedderich.pro>
Signed-off-by: Malte Hedderich <github@hedderich.pro>
@@ -124,6 +124,7 @@ public enum FieldType implements JsonEnum {

DenseVector("dense_vector"),

KnnVector("knn_vector"),
Copy link
Collaborator

@reta reta Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is not OpenSearch core field type, this is provided by the plugin (https://github.com/opensearch-project/k-NN), afaik we do not (and probably should not) make it a part of the standard Java client but part of the k-NN plugin APIs, @VachaShah @dblock what do you think?

Copy link
Member

@dblock dblock Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. This effectively creates a chicken/egg problem, which is why plugins are "awesome" :)

@maltehedderich Is there a mechanism that lets us add a k-nn high level REST client JAR to the k-nn plugin with this code? or maybe it's a waste of time and it makes more sense to add support for k-nn to the new transport that doesn't have a dependency on core? See #326 and #262

On the other hand, in other clients we are adding plugin code into the client. We just namespace it clearly, so is there a way to namespace all k-nn code here under something.plugins.k-nn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick replies.

While namespaces probably work well for most plugins, this is difficult in the case of the KNN plugin due to its tight integration with core components.

The naming for both cluster and index-specific settings makes it difficult to identify them as plugin settings. As a user, if KNN settings remain under index settings, I would expect to be able to modify and read them through the Index Settings functionality in the Java client.

Moreover, I'm not aware of other plugins introducing custom field types. I think many users who use the OpenSearch KNN functionalities also use other field types (those who want a pure vector database have many alternatives). So it would be important to be able to define the mappings together with your other mappings.

I believe that in order to solve this cleanly we would either have to incorporate the functionality of the KNN plugin into the OpenSearch core or fundamentally consider how to deal with extensions to field types and index settings.
Personally, I think integrating vector search capabilities into the OpenSearch core is the best solution. However, if this is not in line with your product vision I am also willing to spend time on alternative solutions, we should just clarify in advance where we really want to go.

As for the namespacing solution, it's achievable but would necessitate several class extensions and potentially affect usability. Regarding incorporating this into the new Transport Client, I'm unsure how this would look like and would need more information about your vision for that integration.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While namespaces probably work well for most plugins, this is difficult in the case of the KNN plugin due to its tight integration with core components.

Most if not all plugins have tight integration with core components

Moreover, I'm not aware of other plugins introducing custom field types.

There are many, even within OpenSearch core plugins, fe mapper-murmur3 introduces murmur3 custom field type.

I believe that in order to solve this cleanly we would either have to incorporate the functionality of the KNN plugin into the OpenSearch core or fundamentally consider how to deal with extensions to field types and index settings.

I think more broadly we need an extensible mechanism to support plugin specific APIs in the Java client, we have talked about that in the past in scope of different issues (#377, #262) but have not conclude anything on the subject.

As for the namespacing solution, it's achievable but would necessitate several class extensions and potentially affect usability. Regarding incorporating this into the new Transport Client, I'm unsure how this would look like and would need more information about your vision for that integration.

We need to get back to the extensibility part of OpenSearch Java client. The major drawback I see is that by adding k-NN plugin to the core client we put all other plugins in unfair and disadvantageous position, I think we need to work towards long term goal instead of taking the simple path of baking everything in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many, even within OpenSearch core plugins, fe mapper-murmur3 introduces murmur3 custom field type.

How is "core plugin" defined? I am only aware of the distinction between "Bundled Plugins" and "Additional Plugins" (according to installing plugins). Unlike murmur3, k-NN belongs to the Bundled Plugins and should therefore be present in the vast majority of OpenSearch deployments.

I think more broadly we need an extensible mechanism to support plugin specific APIs in the Java client, we have talked about that in the past in scope of different issues (#377, #262) but have not conclude anything on the subject.

The discussions sound to me less like extensibility and more like the creation of a workaround through a kind of low level client. This could then of course be used as a basis for plugin specific libraries, but would itself not make the features of the Java client (typing, builders etc.) usable for e.g. users of the k-NN plugin.
According to my understanding of extensibility, there would need to be a standardized way to extend the components of the Java client with plugin specific types and functionalities. So that mappings, index settings and co can all be defined together via the builder without the need for custom HTTP requests.

We need to get back to the extensibility part of OpenSearch Java client. The major drawback I see is that by adding k-NN plugin to the core client we put all other plugins in unfair and disadvantageous position, I think we need to work towards long term goal instead of taking the simple path of baking everything in.

The distinction between "bundled plugins" and "additional plugins" already puts all "additional plugins" at a disadvantage. In addition, I think it is questionable how much of the "plugin character" is left if they are shipped with all but the minimal distributions.

As you said, there are already several ideas and discussions about this topic, not only here in the Java Client repo. In my opinion it is important that we come to a holistic decision how to deal with "bundled" and "additional" plugins within the clients. How do you think we can reach such a decision most effectively?

The rough concepts I have seen for this so far are:

  1. Integration into core client libraries with appropriate namespacing
  2. Generic low level clients provided by the core client libraries
  3. Development of plugin-specific clients which may build up on the core clients

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maltehedderich

How is "core plugin" defined? I am only aware of the distinction between "Bundled Plugins" and "Additional Plugins" (according to installing plugins).

Sorry, I should have been more clear, your distinction is very correct, by core plugins I meant the plugins included in OpenSearch repository and bundled with it.

The distinction between "bundled plugins" and "additional plugins" already puts all "additional plugins" at a disadvantage.

There should be no distinction, I just provided an example of field type added in the plugin (since you mentioned you were not aware of), no emphasize on core - it just happened to be a plugin we bundle with OpenSearch.

The rough concepts I have seen for this so far are:

  1. Integration into core client libraries with appropriate namespacing
  2. Generic low level clients provided by the core client libraries
  3. Development of plugin-specific clients which may build up on the core clients

The solution (as I see) it is the combination of plugin specific APIs that could be plugged into OpenSearch Java client. The very rough idea of how it could look like (it is not formalized anywhere):

public class OpenSearchClient extends ApiClient<OpenSearchTransport, OpenSearchClient> {
	public <PluginClient> PluginClient plugin(Class<PluginClient> plugin) {
		....
	}
}
class OpenSearchKnnPluginClient {
  ....
}

But it does not include plugin specific mapping types, index settings, query builders, etc. , I think we could work it through by taking the k-NN plugin as the example and formalize the approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my current perspective, I think we definitely want something like that, but (as you already said) it only solves the problem with plugin specific APIs. However, in the case of the k-NN plugin, the standard APIs are used almost exclusively (except for training custom models?).

Nevertheless, I can understand your concerns about including plugin-specific code in the core library and will be happy to participate in possible solutions.

@dblock
Copy link
Member

dblock commented Jun 12, 2023

Let's go back to our original problem @maltehedderich and @reta. I think we want support for knn_vector in opensearch-java. So what's the shortest path forward that we can all agree to, and merge (and open issues for all the should/nice to have work)?

I'm with ya'll on moving k-nn to core, but that's beyond the point. I would say we don't want to do it mostly because working in core is a doozy.

@reta
Copy link
Collaborator

reta commented Jun 12, 2023

So what's the shortest path forward that we can all agree to, and merge (and open issues for all the should/nice to have work)?

This pull request I think is the shortest path to support knn_vector field type

@dblock
Copy link
Member

dblock commented Jun 13, 2023

So what's the shortest path forward that we can all agree to, and merge (and open issues for all the should/nice to have work)?

This pull request I think is the shortest path to support knn_vector field type

Shall we just merge it and open issues for the rest, or do you think it's too much debt?

@reta
Copy link
Collaborator

reta commented Jun 13, 2023

Shall we just merge it and open issues for the rest, or do you think it's too much debt?

Taking all pros and cons, I would incline to merge it - it won't make things much worth then they are now, but kicking off the work to design the mechanism to plugin/extension APIs would be great.

@maltehedderich
Copy link
Contributor Author

I forgot the --signoff flag when I committed the removal of the license header, but all my commits are signed with my SSH key. Is that enough or how should we solve that?

All code is written by myself, I just used one of the other Property Classes as template and kept the header because it was present in all other files.

@reta
Copy link
Collaborator

reta commented Jun 13, 2023

I forgot the --signoff flag when I committed the removal of the license header, but all my commits are signed with my SSH key. Is that enough or how should we solve that?

I forgot the --signoff flag when I committed the removal of the license header, but all my commits are signed with my SSH key. Is that enough or how should we solve that?

As far as DCO checks are not failing, we should be good, thanks @maltehedderich !

@dblock
Copy link
Member

dblock commented Jun 13, 2023

@reta But also, where's the DCO check?!

@reta
Copy link
Collaborator

reta commented Jun 13, 2023

@reta But also, where's the DCO check?!

@dblock It is no there! And also gone for OpenSearch, opensearch-project/OpenSearch#8016

…nse headers

Signed-off-by: Malte Hedderich <github@hedderich.pro>
@dblock
Copy link
Member

dblock commented Jun 13, 2023

It's back!

Signed-off-by: Malte Hedderich <github@hedderich.pro>
@maltehedderich
Copy link
Contributor Author

I removed the dense_vector property and force pushed the signing to my previous commit. It's getting late here in Germany so I'm out, will fix it tomorrow if anything is still wrong.

Signed-off-by: Malte Hedderich <github@hedderich.pro>
dblock
dblock previously approved these changes Jun 14, 2023
Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We will need updates to USER_GUIDE and such.

As a followup, let's talk about namespacing this with the word "plugin" better?

reta
reta previously approved these changes Jun 14, 2023
@maltehedderich
Copy link
Contributor Author

Looks good. We will need updates to USER_GUIDE and such.

Regarding the USER_GUIDES, I would prepare some small examples, but probably won't get to it until the next few days. What is still unclear to me is the distinction to the Java Client Docs, do they pursue the same goal? Should we extend those as well?

As a followup, let's talk about namespacing this with the word "plugin" better?

Yes. In general, we wanted to take another look at how we handle plugin-specific APIs and should also discuss there how we can separate plugin-specific extensions of the standard APIs better.

…settings and mappings

Signed-off-by: Malte Hedderich <github@hedderich.pro>
…ipting_score with painless extension

Signed-off-by: Malte Hedderich <github@hedderich.pro>
@maltehedderich maltehedderich dismissed stale reviews from reta and dblock via 84cd2ee June 18, 2023 15:40
@maltehedderich
Copy link
Contributor Author

I have now added some examples for k-NN search to the USERGUIDE.

Examples for the approximative k-NN search are still absent, because for this a KnnQuery class has to be added within our query_dsl first. I'm pretty busy the next two weeks, so I'll only be able to add this in early/mid July. But since exact k-NN search is already fully functional with the modifications in this PR, I would suggest to merge it and add the approximate k-NN search later.

I also changed the parameter order for the Apache HttpHost in the USERGUIDE, as it didn't seem to match the referenced org.apache.hc.core5.http library before. However, the whole USERGUIDE seems still to be quite outdated and many examples do not work with the current library.

Signed-off-by: Malte Hedderich <github@hedderich.pro>
Signed-off-by: Malte Hedderich <github@hedderich.pro>
Signed-off-by: Malte Hedderich <github@hedderich.pro>
@@ -84,7 +89,7 @@ There are multiple low level transports which `OpenSearchClient` could be config
import org.apache.hc.core5.http.HttpHost;

final HttpHost[] hosts = new HttpHost[] {
new HttpHost("localhost", "http", 9200)
new HttpHost("http", "localhost", 9200)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this!

@dblock dblock merged commit 7d22a7d into opensearch-project:main Jun 20, 2023
@dblock dblock added the backport 2.x Backport to 2.x branch label Jun 20, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-524-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 7d22a7dab4278e3f6adb8b4d2125f40a624395c0
# Push it to GitHub
git push --set-upstream origin backport/backport-524-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-524-to-2.x.

@dblock
Copy link
Member

dblock commented Jun 20, 2023

I think we do want to backport this to 2.x, do we? @maltehedderich would need to be done manually as the auto-backport failed.

Regarding organizing the code for plugins, all yours and sincerely appreciate your contributions!

@maltehedderich
Copy link
Contributor Author

maltehedderich commented Jun 20, 2023

I think we do want to backport this to 2.x, do we? @maltehedderich would need to be done manually as the auto-backport failed.

Yes, that should be possible with reasonable effort. Shall we just create an issue for it and if it's not already done in the next few weeks, I'll take care of it in July?

Additionally, Mend seems to be unhappy with a dependency in the main branch, does dependabot take care of that automatically or do we have to adjust it manually?

Edit: Is the automatic backport only failing because of the merge conflict in the USERGUIDE 🤔 If that's the case, I think I will be able to fix that quickly tonight.

@dblock
Copy link
Member

dblock commented Jun 20, 2023

I think we do want to backport this to 2.x, do we? @maltehedderich would need to be done manually as the auto-backport failed.

Yes, that should be possible with reasonable effort. Shall we just create an issue for it and if it's not already done in the next few weeks, I'll take care of it in July?

I see you did this. Generally that's the way to go because otherwise we forget 😅

Additionally, Mend seems to be unhappy with a dependency in the main branch, does dependabot take care of that automatically or do we have to adjust it manually?

We let dependabot do its thing. If you see something it doesn't do, then yes.

Edit: Is the automatic backport only failing because of the merge conflict in the USERGUIDE 🤔 If that's the case, I think I will be able to fix that quickly tonight.

Right on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants