
[RFC] Lucene based kNN search support in core OpenSearch #3545

Closed
martin-gaievski opened this issue Jun 8, 2022 · 18 comments
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes

Comments

@martin-gaievski
Member

martin-gaievski commented Jun 8, 2022

This document proposes integrating Lucene 9.0’s recent addition of Approximate Nearest Neighbor search into OpenSearch.

Problem Statement

k-NN Search has gained popularity over recent years with the growing use of ML. A few applications include recommendation systems, NLP services, and computer vision.

A brute force nearest neighbor search consists of calculating the distance from a query point to every point in the index and returning the “k” index points with the smallest distance. A point can be a dense vector, a sparse vector, or a binary code; in this document, we are only interested in dense vectors. The major drawback of this algorithm is that it cannot efficiently search indices with millions or billions of vectors. To address this, a new problem was formulated: Approximate Nearest Neighbor (ANN) search. The general idea is to sacrifice search quality and indexing latency to drastically improve search latency. Several algorithms have been created, including Hierarchical Navigable Small World (HNSW) graphs, Inverted File System (IVF), and Locality Sensitive Hashing (LSH).
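To make the baseline concrete, here is a minimal, illustrative brute-force k-NN sketch in Python. It is not taken from any plugin or library; the function name and data are ours, and it only shows the exact O(n·d) search that ANN algorithms such as HNSW approximate:

```python
import math

def brute_force_knn(index, query, k):
    """Exact k-NN: compute the distance from the query to every indexed
    vector and return the indices of the k closest ones."""
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(range(len(index)), key=lambda i: l2(index[i], query))
    return ranked[:k]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.1]]
print(brute_force_knn(vectors, [1.0, 1.0], k=2))  # → [1, 3]
```

Every query touches every vector, which is exactly why this approach does not scale and why approximate methods trade accuracy for speed.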

Several different libraries exist that implement these algorithms, including nmslib and faiss. The libraries are similar to Lucene in that they build indices that can be searched and serialized — however, they are built specifically for vector search. The k-NN plugin integrates these libraries into OpenSearch so that OpenSearch users can perform ANN search workloads.

The approach the plugin takes has a few drawbacks:

  1. The libraries mentioned above are implemented in C++. In a Java project like OpenSearch, this is a bit unnatural — it complicates the build process and prevents us from supporting different systems, such as Windows.
  2. The library indices must be loaded in totality into native memory in order to search them. The indices contain both the vectors and the underlying data structures. This causes issues with memory pressure when memory is tight.
  3. Developers that want to use vectors in their applications have to take a dependency on the k-NN plugin to get access to the knn_vector data type the plugin introduces.

Proposed Solution

TLDR: Build a new field type and query type in OpenSearch to support ANN search through Lucene.

Lucene 9’s introduction of Approximate Nearest Neighbor Support

In the 9.0 release, Lucene added support for building indices with dense vectors and running ANN search on them. In order to create indices with vectors, a new Codec format was introduced, KnnVectorsFormat. This format is used to encode and decode numeric vector fields and execute Nearest Neighbor search. In Lucene 9, the default format used is Lucene90HnswVectorsFormat. This format implements the HNSW algorithm for ANN search. During indexing, this format will create 2 separate segment files: one for the vectors and one for the HNSW graph structure, allowing the vectors to exist off heap.

Lucene’s implementation of HNSW takes two parameters at index time: max_connections and beam_width. max_connections sets a ceiling on the number of connections a node in the graph can have. beam_width controls the size of the candidate list used when selecting neighbors for a node newly added to the graph. These map to the m and ef_construction parameters from the original HNSW paper, and they are set in Lucene91HnswVectorsFormat’s constructor. The Codec supports a per-field KnnVectorsFormat, which allows these parameters to be set at the field level.

For search, a new query was introduced, the KnnVectorQuery. This query contains the query vector as well as k, the number of results a query on the HNSW graph should return. For search, the HNSW algorithm has one parameter, ef_search. In Lucene’s implementation this value is hardcoded to k.

Integration with OpenSearch

In order to integrate this new functionality into OpenSearch, we need to introduce a new field type, dense_vector. This field type would allow users to index floating point arrays of uniform length into OpenSearch and build the index structures needed to support ANN search with Lucene.

{
    "type": "dense_vector",
    "dtype": float,
    "dimension": d,
    "knn": {
        "metric": "l2",
        "algorithm": {
            "name": "hnsw",
            "params": {
                "max_connections": X,
                "beam_width": X
            }
        }
   }
}

Because we will initially support only the Lucene HNSW implementation, the format above can be simplified to the following form:

{
    "type": "dense_vector",
    "dimension": d, 
    "knn": {
    }
}

For the example request above, the system will use Lucene HNSW with the L2 metric, max_connections of 16, and beam_width of 100.

The minimal default mapping for dense_vector does not include the knn structure and is shown below. With such a definition, the proposed Lucene HNSW-based k-NN is not enabled and the related graph data structures are not created; only the vector values are stored.

{
    "type": "dense_vector",
    "dimension": d
}

In addition to the field type, we are adding support for a new query type named knn_query.

The Lucene implementation of HNSW and its knn vector type come with certain limitations that we have to accept in this integration: the maximum number of dimensions for a knn vector is 1024, which implies the same limit for the new dense_vector field type.
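As an illustration, any mapping parser for the new field type would need to enforce this cap. The sketch below is hypothetical — the names MAX_DIMENSION and validate_dense_vector_mapping are ours, not OpenSearch's — and only shows the shape of the check:

```python
MAX_DIMENSION = 1024  # cap inherited from Lucene's knn vector type

def validate_dense_vector_mapping(mapping):
    """Reject dense_vector mappings whose dimension Lucene cannot index."""
    dim = mapping.get("dimension")
    if not isinstance(dim, int) or dim < 1:
        raise ValueError("dimension must be a positive integer")
    if dim > MAX_DIMENSION:
        raise ValueError(
            f"dimension {dim} exceeds the Lucene limit of {MAX_DIMENSION}")

validate_dense_vector_mapping({"type": "dense_vector", "dimension": 768})  # ok
```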

Performance

The team has benchmarked a prototype based on the Lucene 9.1 integration with OpenSearch 2.0.

For the benchmark, the algorithm parameters ef_construction and beam_width were set to 512, while max_connections and m were varied from 4 to 96.

The results showed that the Lucene 9.1 solution is comparable to the existing k-NN HNSW implementation based on nmslib, with certain tradeoffs: the Lucene 9.1 solution cannot reach recall values close to 1.0, but at comparable recall values it consumes fewer resources and has better query latencies than the existing k-NN plugin.

We observed some consistent trends: for the same high recall values, the k-NN plugin's index takes more memory than the Lucene implementation requires, and both query latencies and memory allocated per query are higher for the k-NN implementation.

The solution remains stable under the constant query load that is typical of k-NN plugin workloads.

The metrics obtained are rough, as they are based on POC code for the Lucene 9.1 integration with OpenSearch 2.0, without any optimizations. We plan to gather more accurate metrics for the production-ready version.

Pre-filtering feature

The Lucene implementation of HNSW provides the additional benefit of pre-filtering on queries. The current k-NN implementation provides post-filtering: documents are first selected by the HNSW algorithm, and the filter is then applied on top of the HNSW results. In general, the results are not exactly correct because of the sequence in which the processing steps are applied; for example, fewer than k documents may survive the filter.

Lucene 9.1 instead applies the filter to the live docs first and then selects the k closest results with the HNSW algorithm.
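The difference between the two orderings can be sketched with exact search. This is illustrative Python only — the helper names are ours, and real HNSW traversal is more involved — but it shows why post-filtering can return fewer than k results while pre-filtering always fills the result set when enough documents match:

```python
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def post_filter_knn(vectors, labels, query, k, allowed):
    # Post-filtering: take the k nearest first, then drop disallowed docs.
    # May return fewer than k results.
    top_k = sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]
    return [i for i in top_k if labels[i] in allowed]

def pre_filter_knn(vectors, labels, query, k, allowed):
    # Pre-filtering: restrict to allowed docs first, then take the k nearest.
    # Returns min(k, number of allowed docs) results.
    candidates = [i for i in range(len(vectors)) if labels[i] in allowed]
    return sorted(candidates, key=lambda i: l2(vectors[i], query))[:k]

vectors = [[0.0], [1.0], [2.0], [3.0]]
labels  = ["a",  "a",  "b",  "b"]
query, k, allowed = [0.0], 2, {"b"}
print(post_filter_knn(vectors, labels, query, k, allowed))  # → [] (both nearest are "a")
print(pre_filter_knn(vectors, labels, query, k, allowed))   # → [2, 3]
```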

At the moment, we plan to deliver the pre-filtering feature “as is”, with no changes or additions to what is implemented in Lucene.

Impact on k-NN plugin

The existing knn_vector field supported by the k-NN plugin will be independent of the dense_vector field introduced in core OpenSearch. This change will not break any plugin functionality, and customers will be able to upgrade indices created with earlier k-NN versions, taking into account the limitation on maximum vector dimension.

From a migration standpoint, users would be able to reindex their knn_vector indices into indices that use the new dense_vector field type and vice versa.

In order to share the knn query type with OpenSearch, the plugin’s Codec would be required to implement the new KnnVectorsFormat. More details on this will be available in the plugins repository.

Revised Approach

Overview

After some offline discussions, we have decided to propose a pivot to our plan to support Lucene k-NN.

Introducing a new data type, dense_vector, and query type, knn_query, when the plugin already has the data type knn_vector and the query type knn would be a source of confusion for users. Having two data types with very similar end-user functionality would raise questions such as “Why aren’t they the same?” and “When should I use which?”. Instead of introducing dense_vector, we propose capturing the Lucene functionality inside the plugin’s existing knn_vector field type. We would then use the existing knn query type as well, keeping the interface consistent with the current knn_vector.

As a consequence of this change, because only one field mapping parser can be registered to a field name (as well as query parser to query name), we would need to move the implementation to the k-NN plugin. We would still target the 2.2 release.

Using existing knn_vector field type instead of new dense_vector

Original Proposal Limitations

User Experience

The main limitation to this approach is the confusing user experience that it creates. We would have 2 different data types that do very similar things: knn_vector and dense_vector. Each would support indexing dense vectors and running approximate Nearest Neighbor searches. In addition to this, we would have 2 different query types that would do a similar thing: knn_query and knn.

The decision to introduce a new data type and query type was originally made to separate from the architecture-specific nature of the k-NN plugin. However, on closer examination, a user could use the k-NN plugin without the native libraries on an otherwise unsupported platform, as long as we update the plugin's build infrastructure to decouple platform-dependent components from platform-independent ones.

Updated Proposal

To solve the problem of user confusion, we would build Lucene ANN support into our current knn_vector interface:

{
    "type": "knn_vector",
    "dimension": 100,
    "method": {
        "name":"hnsw",
        "engine":"lucene",
        "space_type": "l2",
        "parameters":{
            "m":2048,
            "ef_construction": 245
        }
    }
}

The interface would stay the same except for adding lucene as a new engine to support. We have evaluated the technical feasibility of this approach and deemed it feasible.

With this, the query would stay the same as our current query for the plugin:

"query": {
  "knn": {
    "target_field": {
      "vector": [<insert vector>],
      "k": 2
    }
  }
}

Note that in the future we could update these interfaces as we see fit. For more information on the current interface, please take a look at the k-NN documentation.

Updated Proposal Limitations

Initial Coupling to Architecture Dependent Nature of Plugin

One limitation of the updated proposal is that it couples the platform-agnostic Lucene ANN functionality to the plugin's platform dependence.

However, a user who wants to run the plugin on a platform the plugin does not support could install the plugin zip without the native libraries and use the Lucene engine. As it stands right now, if they tried to use nmslib or faiss, the process would crash.

Overall, we would solve this problem by decoupling the JNI library from the plugin and providing a graceful failure if a user tries to use the JNI components on an unsupported platform. Additionally, we would decouple the build process as well.

Difficulty integrating with other engine plugins

Another limitation is a more technical nuance. From the user's perspective, the consequence is that they cannot create a Lucene k-NN index together with another engine plugin feature such as CCR.

This is because we have to pass the HNSW parameters to the Lucene KnnVectorsFormat during construction. From a plugin, we have no way to add this to the default Codec that OpenSearch uses, so we need our own custom codec; however, two custom codecs cannot be used at the same time for one index.

In the short term, this means users cannot use Lucene k-NN with another feature like CCR. This limitation is present for the plugin today as well.

Related github issues:

Feedback Requested

Any general comments about the overall direction are welcome. Some specific questions:

  • Do you prefer exposing Lucene implementation via existing knn_vector field or by new dense_vector field?
  • Do you have use cases where recall close to 1.0 is not critical for k-NN search? For such use cases, are you willing to trade lower recall for less memory consumption and lower query latencies?
  • Is L2 Euclidean distance a good default metric for your use cases? Other metrics supported by Lucene are Dot Product and Cosine Similarity.
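For reference, the three metrics Lucene supports can be compared with a standalone Python sketch (the function names are ours; this is not tied to any OpenSearch or Lucene API):

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance: lower means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    # Dot product: higher means more similar (assumes comparable norms).
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the normalized vectors, in [-1, 1].
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot_product(a, b) / (norm(a) * norm(b))

a, b = [1.0, 0.0], [1.0, 1.0]
print(l2_distance(a, b))                   # → 1.0
print(dot_product(a, b))                   # → 1.0
print(round(cosine_similarity(a, b), 4))   # → 0.7071
```

Which metric ranks neighbors "correctly" depends on how the embeddings were trained, which is why a configurable metric matters.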
@martin-gaievski martin-gaievski added enhancement Enhancement or improvement to existing feature or request untriaged labels Jun 8, 2022
@xuezhou25 xuezhou25 added RFC Issues requesting major changes and removed untriaged labels Jun 9, 2022
@martin-gaievski martin-gaievski changed the title [RFC] KNN search support in core OpenSearch [RFC] Lucene based kNN search support in core OpenSearch Jun 9, 2022
@dblock
Member

dblock commented Jun 14, 2022

I like having choices, and appreciate having to think about memory/performance tradeoffs, so +1 to exposing a Lucene implementation.

Regarding the questions, I can speak to (2). We had implemented brute-force k-nn with euclidean distance, then LSH, then nn-descent at Artsy. The distance measure works differently for different types of data (the nn-descent paper suggests using cosine similarity for text, l2 distance of color histograms for images, for example), so I definitely think the ability to choose is important and L2 is definitely not a good default metric in all cases.

@martin-gaievski
Member Author

> I like having choices, and appreciate having to think about memory/performance tradeoffs, so +1 to exposing a Lucene implementation.
>
> Regarding the questions, I can speak to (2). We had implemented brute-force k-nn with euclidean distance, then LSH, then nn-descent at Artsy. The distance measure works differently for different types of data (the nn-descent paper suggests using cosine similarity for text, l2 distance of color histograms for images, for example), so I definitely think the ability to choose is important and L2 is definitely not a good default metric in all cases.

Thank you for raising the question of the default metric, @dblock. Our suggestion of L2 as the default is based mainly on the fact that it is the default in the Lucene HNSW implementation. We may need to run additional benchmarks for the other metrics Lucene supports (in addition to L2, it supports cosine similarity and dot product) and make a more data-driven decision on the default metric.

@dblock
Member

dblock commented Jun 15, 2022

I think as long as users can choose the metric, the default should always match Lucene's default. 💯 .

@jmazanec15
Member

I'm not sure about the default:

{
    "type": "dense_vector",
    "dimension": d, 
    "knn": {
    }
}

I think it should be:

{
    "type": "dense_vector",
    "dimension": d
}

If a user wants to skip building knn-specific index structures (which can be expensive to build) for cases like painless scripting, I think they should be able to set "knn.enabled": false, with it defaulting to true.

@martin-gaievski
Member Author

Yes, the format without a knn structure is the default. In that case the system will skip creating the knn index structures. I was referring to the one with an empty knn as the minimal form that supports knn search. Let me update the doc to highlight this.

@jmazanec15
Member

> Yes, the format without a knn structure is the default. In that case the system will skip creating the knn index structures.

My problem is more around having users submit an empty map for default knn:

{
    "knn": {
    }
}

My feeling is, use knn default when "knn" is not specified:

{
    "type": "dense_vector",
    "dimension": d
}

If a user does not want to build knn structures, do something like:

{
    "type": "dense_vector",
    "dimension": d,
    "knn.enabled": false
}

@martin-gaievski
Member Author

I have a doubt about this approach. We'd be assuming that if the knn structure is not specified then the user does not want to perform knn on that data, but the painless script example shows that's not always the case. Having "knn.enabled": false and a script at the same time looks a bit counterintuitive.

@reta
Collaborator

reta commented Jun 23, 2022

@martin-gaievski we may want to consider a different name for the field type; it seems Elasticsearch has introduced it (as an xpack feature) [1], and I am afraid we are asking for trouble ... :(

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/dense-vector.html

@martin-gaievski
Member Author

> @martin-gaievski we may want to consider a different name for the field type; it seems Elasticsearch has introduced it (as an xpack feature) [1], and I am afraid we are asking for trouble ... :(
>
> [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/dense-vector.html

I see where you're coming from. We actually wanted a name that is recognizable by the community. We reviewed multiple options, and dense_vector is the name we got consensus on from the PM team; I trust them in this territory.

@reta
Collaborator

reta commented Jun 23, 2022

No objection from my side; we had a similar dilemma with flattened [1], to avoid possible legal issues ...

[1] #1018

@martin-gaievski
Member Author

We want to step back a little and further evaluate the possible solutions. In particular, the functionality of dense_vector can be implemented as part of the k-NN plugin, with the main benefit that it can be done under the existing knn_vector type.

@martin-gaievski
Member Author

martin-gaievski commented Jul 11, 2022

A new section, “Revised Approach”, has been added. It mainly suggests exposing the Lucene HNSW implementation through the existing knn_vector field, avoiding the creation of a new dense_vector field.

@br3no
Contributor

br3no commented Aug 3, 2022

I'm happy to see this issue. I think it's one of long-term strategic importance for OpenSearch, given the current momentum of neural search in industry and academia.

I think it's perfectly acceptable to introduce the functionality as part of the k-NN plugin.

@martin-gaievski
Member Author

Big thanks to all participants. We're done collecting feedback, and the release version is practically finished. Please communicate suggestions and bug reports via GitHub issues.

@mgdunn2

mgdunn2 commented Aug 17, 2022

@martin-gaievski It isn't mentioned in the updated proposal, but based on the current documentation for the knn plugin found here, it looks like the plan to support "pre-filtering" from the initial proposal was removed. Is this a documentation issue, or is it still the case that all filtering is done only on the top k documents after they are retrieved?

@martin-gaievski
Member Author

> @martin-gaievski It isn't mentioned in the updated proposal, but based on the current documentation for the knn plugin found here, it looks like the plan to support "pre-filtering" from the initial proposal was removed. Is this a documentation issue, or is it still the case that all filtering is done only on the top k documents after they are retrieved?

We are still planning to add filtering support on top of the Lucene knn engine; it is currently planned for 2.4. The way it is implemented in Lucene makes it more of a hybrid of pre- and post-filtering, but in any case the top k elements are guaranteed: Lucene keeps traversing the HNSW graph and adding vectors to the result set until it finds the top k vectors.

@mulugetam
Contributor

@martin-gaievski this is great! How are you currently benchmarking kNN search? Also, any plans for OpenSearch-Benchmark to add kNN support?

@martin-gaievski
Member Author

Hi,
We're using a tool tailored specifically for k-NN search, https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool, which calculates latency and recall for search queries. We also have some extensions for OSB that can be used for the data ingestion flow; please see https://github.com/opensearch-project/k-NN/tree/main/benchmarks/osb for details. We're currently working on adding more support, such as recall calculation.

frascuchon added a commit to argilla-io/argilla that referenced this issue Dec 15, 2022
- Using Lucene engine implementation for vector search because of
reasons (opensearch-project/OpenSearch#3545).

  => **This option is only available for OpenSearch >= 2.2**

- Remove word cloud computation and indexing with a multilingual
analyzer (reduce the indexing resources). The better way to compute this
word/term cloud will be using a list of provided tokens or tokenizer
instead of our custom one.

Co-authored-by: Francisco Aranda <francisco@recogn.ai>