[FEATURE] A new workload for vector embedding and search #198

vpehkone · 2024-02-17T00:52:32Z

Is your feature request related to a problem?

There is no workload that tests both vector embedding and vector search, e.g. one that runs a benchmark similar to the neural search tutorial (https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/). The current vector search workload does not do vector embedding and requires manually downloading the dataset and converting it to the right format.

What solution would you like?

Create a new workload that uses a pretrained model for vector embedding and executes vector search. This does not require any change to OpenSearch-Benchmark or to the official OpenSearch Docker image, which already includes the ml-commons, neural-search, and k-NN plugins.
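To make the idea concrete, here is a rough sketch of the request bodies such a workload would send, following the neural search tutorial linked above: an ingest pipeline with a `text_embedding` processor, a k-NN index that uses it as the default pipeline, and a `neural` query. The field names (`text`, `passage_embedding`), the pipeline name, and the helper function names are placeholders I made up for illustration. The model ID and vector dimension depend on whichever pretrained model gets registered, so they are left as parameters.

```python
def embedding_pipeline(model_id, text_field="text", vector_field="passage_embedding"):
    """Ingest pipeline body: embed `text_field` into `vector_field` at index time."""
    return {
        "description": "Generate embeddings during ingestion",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": {text_field: vector_field},
                }
            }
        ],
    }


def index_body(pipeline_name, dimension, text_field="text", vector_field="passage_embedding"):
    """Index body: k-NN enabled, documents routed through the embedding pipeline."""
    return {
        "settings": {"index.knn": True, "default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                text_field: {"type": "text"},
                # The dimension must match the output size of the chosen model.
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def neural_query(model_id, query_text, k=10, vector_field="passage_embedding"):
    """Search body: the query text is embedded server-side, then matched by k-NN."""
    return {
        "size": k,
        "query": {
            "neural": {
                vector_field: {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        },
    }
```

The workload would PUT the first body to `_ingest/pipeline/<name>`, create the index with the second, bulk-index plain text documents, and then run the third as the search operation. Since embedding happens inside the cluster in both directions, the benchmark measures embedding generation together with vector search.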

Good dataset for this workload: https://microsoft.github.io/msmarco/
Documents: https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
Query texts: https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
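Getting this data into a workload corpus should be simple. A minimal sketch, assuming the archives unpack to tab-separated files with one `<id>\t<text>` row per line, and that the workload corpus is one JSON document per line (the output field names `id` and `text` and the function name are placeholders):

```python
import io
import json


def tsv_to_json_lines(tsv_file, out_file, id_field="id", text_field="text"):
    """Convert `<id>\\t<text>` rows to one JSON document per line; return the row count."""
    count = 0
    for line in tsv_file:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs or quotes inside the text survive.
        doc_id, _, text = line.partition("\t")
        out_file.write(json.dumps({id_field: doc_id, text_field: text}) + "\n")
        count += 1
    return count


# Usage with made-up rows standing in for the real collection file.
sample = io.StringIO("0\tfirst example passage\n1\tsecond example passage\n")
converted = io.StringIO()
tsv_to_json_lines(sample, converted)
```

For the real files, open them with `encoding="utf-8"` and pass the file objects in. The input is streamed line by line, so the full collection does not need to fit in memory.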

I can implement this and open a PR.

VijayanB · 2024-02-21T22:37:06Z

@vpehkone We recently created a new workload for vector search https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch . Currently you can bring your own dataset in hdf5 format and use it in this workload. However, we don't support any dataset out of the box like nyc taxi at this moment. Please let us know if you see any gap in using this workload for your use case.

ylwu-amzn · 2024-02-22T01:57:55Z

@vpehkone Thanks. The ml-commons maintainer team is busy with other tasks right now and can come back to this later. If you have bandwidth, feel free to contribute a benchmark for vector embedding generation.

akashsha1 · 2024-02-27T23:59:40Z

> @vpehkone We recently created a new workload for vector search https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch . Currently you can bring your own dataset in hdf5 format and use it in this workload. However, we don't support any dataset out of the box like nyc taxi at this moment. Please let us know if you see any gap in using this workload for your use case.

Hi @VijayanB - Vesa and I are on the same team at Intel. We plan to run the vectorsearch workload you pointed to as well.
We're also running the neural search embedding benchmark and would like to add it to the benchmark repo. Our goal is to analyze the pipeline and identify optimization opportunities where we can add value to OpenSearch.

Having a vector search pipeline both with and without embedding generation will allow us to dive deep into two key benchmarks relevant to OpenSearch. Let us know if there are any other scenarios that would be useful to analyze and optimize.

VijayanB · 2024-02-28T19:39:33Z

@akashsha1 Thanks for the clarification. Having a new workload for neural search is definitely a good idea. As @ylwu-amzn mentioned, feel free to send a PR. Thanks.

IanHoang · 2024-11-12T22:01:19Z

This is complete.

vpehkone added the enhancement and untriaged labels Feb 17, 2024

IanHoang removed the untriaged label Feb 22, 2024

This was referenced Mar 11, 2024

Add vector search with embedding generation workload #231

Closed

Add vector search with embedding generation workload #232

Merged

vpehkone mentioned this issue Mar 30, 2024

Added runners to register and deploy ml-model opensearch-project/opensearch-benchmark#497

Merged

IanHoang closed this as completed Nov 12, 2024
