[FEATURE] A new workload for vector embedding and search #198

Closed
vpehkone opened this issue Feb 17, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@vpehkone
Contributor

Is your feature request related to a problem?

No existing workload tests vector embedding together with vector search, for example by running a benchmark similar to the one in the neural search tutorial (https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/). The current vector search workload does not generate embeddings, and it requires manually downloading the dataset and converting it to the right format.

What solution would you like?

Create a new workload that uses a pretrained model for vector embedding and then executes vector search. This does not require any change to OpenSearch Benchmark or to the official OpenSearch Docker image, because the image already includes the ml-commons, neural-search, and k-NN plugins.
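For illustration, here is a minimal sketch of the two request bodies such a workload would revolve around: an ingest pipeline that embeds text at index time, and a neural query that embeds the query text at search time. The body shapes follow the linked neural search tutorial as I understand it, so check them against your plugin version. The field names `text` and `passage_embedding`, the helper names, and the model ID are all assumptions, not part of any existing workload.

```python
from typing import Any, Dict


def embedding_pipeline(model_id: str,
                       text_field: str = "text",
                       vector_field: str = "passage_embedding") -> Dict[str, Any]:
    """Build an ingest pipeline body with a text_embedding processor.

    Field names are hypothetical defaults chosen for this sketch.
    """
    return {
        "description": "Embed documents at ingest time",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    # Maps the source text field to the target vector field.
                    "field_map": {text_field: vector_field},
                }
            }
        ],
    }


def neural_query(query_text: str,
                 model_id: str,
                 vector_field: str = "passage_embedding",
                 k: int = 10) -> Dict[str, Any]:
    """Build a search body that embeds query_text and runs a k-NN search."""
    return {
        "size": k,
        "query": {
            "neural": {
                vector_field: {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        },
    }


if __name__ == "__main__":
    # "my-model-id" is a placeholder; a real ID comes from deploying a model.
    print(embedding_pipeline("my-model-id"))
    print(neural_query("what is a vector database", "my-model-id", k=5))
```

Keeping both bodies as plain dictionaries means the same definitions can be dumped to JSON for a workload file or sent with any HTTP client.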

Good dataset for this workload: https://microsoft.github.io/msmarco/
Documents: https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
Query texts: https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
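As a rough sketch of the conversion step, the snippet below turns tab-separated lines into newline-delimited JSON documents. It assumes the extracted files hold one `id<TAB>text` record per line, which should be verified against the actual archives. The output field names `id` and `text` and both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def tsv_to_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one document per 'id<TAB>text' line, skipping malformed lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        doc_id, sep, text = line.partition("\t")
        if not sep:
            continue  # no tab found: not an id/text record
        yield {"id": doc_id, "text": text}


def to_ndjson(docs: Iterable[Dict[str, str]]) -> str:
    """Serialize documents as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(doc, ensure_ascii=False) + "\n" for doc in docs)


if __name__ == "__main__":
    # Made-up sample records, not taken from the dataset.
    sample = ["0\tfirst example passage\n", "1\tsecond example passage\n"]
    print(to_ndjson(tsv_to_docs(sample)), end="")
```

Because `tsv_to_docs` is a generator, a file handle can be passed in directly and the collection is streamed rather than loaded into memory at once.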

I can implement this and open a PR.

@vpehkone added the enhancement and untriaged labels on Feb 17, 2024
@VijayanB
Member

VijayanB commented Feb 21, 2024

@vpehkone We recently created a new workload for vector search: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch . Currently you can bring your own dataset in HDF5 format and use it in this workload. However, at the moment we don't ship a dataset out of the box the way the nyc_taxis workload does. Please let us know if you see any gap in using this workload for your use case.

@ylwu-amzn

@vpehkone Thanks. The ml-commons plugin maintainers are busy with other tasks right now and can come back to this later. If you have bandwidth, feel free to contribute a benchmark for vector embedding generation.

@akashsha1

Hi @VijayanB, Vesa and I are on the same team at Intel. We plan to run the vectorsearch workload you pointed to as well.
We're also running the neural search embedding benchmark and would like to add it to the benchmark repo. Our goal is to analyze the pipeline and identify optimization opportunities where we can add value to OpenSearch.

Having a vector search pipeline both with and without embedding generation will let us dive deep into two key benchmarks relevant to OpenSearch. Let us know if there are any other scenarios that would be useful to analyze and optimize.

@VijayanB
Member

@akashsha1 Thanks for the clarification. Having a new workload for neural search is definitely a good idea. As @ylwu-amzn mentioned, feel free to send out a PR. Thanks

@IanHoang
Collaborator

This is complete.
