[FEATURE] A new workload for vector embedding and search #198
Comments
@vpehkone We recently created a new workload for vector search: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch. Currently you can bring your own dataset in HDF5 format and use it with this workload. However, we don't yet support any dataset out of the box, such as NYC taxi. Please let us know if you see any gaps in using this workload for your use case.
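For reference, a minimal sketch of the "bring your own dataset" conversion step, assuming the ann-benchmarks-style HDF5 layout (`train` = vectors to index, `test` = query vectors) commonly used by vector search benchmarks; the dataset names, shapes, and file path here are illustrative assumptions, not a documented contract of the workload:

```python
# Hypothetical converter: write corpus and query vectors into an HDF5
# file, assuming an ann-benchmarks-style layout ("train" = vectors to
# be indexed, "test" = query vectors). Names/shapes are illustrative.
import h5py
import numpy as np

def to_hdf5(corpus_vectors, query_vectors, path):
    corpus = np.asarray(corpus_vectors, dtype=np.float32)
    queries = np.asarray(query_vectors, dtype=np.float32)
    with h5py.File(path, "w") as f:
        f.create_dataset("train", data=corpus)   # vectors to index
        f.create_dataset("test", data=queries)   # query vectors
    return path

# Example: 1000 corpus vectors and 10 queries, 128 dimensions each.
rng = np.random.default_rng(0)
path = to_hdf5(rng.random((1000, 128)), rng.random((10, 128)),
               "my-dataset.hdf5")

# Read the shapes back to confirm the file round-trips.
with h5py.File(path, "r") as f:
    train_shape = f["train"].shape
    test_shape = f["test"].shape
```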
@vpehkone Thanks. For the ml-commons plugin, the maintainer team is busy with some other tasks right now; they can come back to this later. Or, if you have bandwidth, feel free to contribute vector embedding generation benchmarking.
Hi @VijayanB - Vesa and I are on the same team at Intel. We plan to run the vectorsearch workload you pointed to as well. Having a vector search pipeline, both with and without embedding generation, will allow us to dive deep into two key benchmarks relevant to OpenSearch. Let us know if there are any other scenarios that would be useful to analyze and optimize.
@akashsha1 Thanks for the clarification. Having a new workload for neural search is definitely a good idea. As @ylwu-amzn mentioned, feel free to send out a PR. Thanks!
This is complete. |
Is your feature request related to a problem?
There is no workload that tests vector search together with vector embedding, e.g. one that runs a benchmark similar to the neural search tutorial (https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/). The current vector search workload does not perform vector embedding, and it requires manually downloading the dataset and converting it to the right format.
What solution would you like?
Create a new workload that uses a pretrained model for vector embedding and executes vector search. This does not require any changes to OpenSearch Benchmark or to the official OpenSearch Docker image, since the ml-commons, neural-search, and k-NN plugins are already included.
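As a sketch of what such a workload would drive, here are the three request bodies the neural search tutorial linked above walks through: an ingest pipeline that embeds text at index time, a k-NN index, and a neural query at search time. The pipeline name, index field names, model id placeholder, and embedding dimension below are illustrative assumptions, not part of this proposal:

```python
# Hypothetical request bodies the workload would send to OpenSearch,
# mirroring the neural search tutorial. All names (pipeline, fields,
# model id) are illustrative placeholders.

# 1. Ingest pipeline with a text_embedding processor. "MODEL_ID" is a
#    placeholder for a model deployed through ml-commons.
ingest_pipeline = {
    "description": "Embed passage text at ingest time",
    "processors": [
        {"text_embedding": {
            "model_id": "MODEL_ID",
            "field_map": {"passage_text": "passage_embedding"},
        }}
    ],
}

# 2. k-NN index whose embedding field dimension matches the model
#    (384 here, as with small sentence-transformer models).
index_body = {
    "settings": {"index.knn": True, "default_pipeline": "nlp-pipeline"},
    "mappings": {"properties": {
        "passage_text": {"type": "text"},
        "passage_embedding": {"type": "knn_vector", "dimension": 384},
    }},
}

# 3. Neural query: OpenSearch embeds the query text with the same model.
search_body = {
    "query": {"neural": {"passage_embedding": {
        "query_text": "example query text",
        "model_id": "MODEL_ID",
        "k": 10,
    }}},
}
```

The workload would time the bulk-ingest phase (where embedding happens in the pipeline) separately from the query phase, which is exactly the with/without-embedding split discussed in the comments above.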
A good dataset for this workload: https://microsoft.github.io/msmarco/
Documents: https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
Query texts: https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
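The MS MARCO collection archive above unpacks to a TSV of (passage id, passage text) pairs; a minimal sketch (file layout assumed from the dataset description, index name illustrative) of turning it into newline-delimited actions for the OpenSearch `_bulk` API:

```python
# Hypothetical converter from the MS MARCO collection TSV
# (tab-separated: passage id, passage text) into _bulk action lines.
import csv
import io
import json

def tsv_to_bulk(tsv_file, index_name="msmarco-passages"):
    """Yield newline-delimited JSON lines for the OpenSearch _bulk API."""
    reader = csv.reader(tsv_file, delimiter="\t")
    for pid, text in reader:
        yield json.dumps({"index": {"_index": index_name, "_id": pid}})
        yield json.dumps({"passage_text": text})

# Example with two in-memory rows instead of the full collection file.
sample = io.StringIO("0\tExample passage text one.\n"
                     "1\tExample passage text two.\n")
bulk_lines = list(tsv_to_bulk(sample))
```

Each document carries only `passage_text`; the embedding field would be filled in server-side by the ingest pipeline, so no client-side model inference is needed.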
I can implement this and submit a PR.