Add vector search with embedding generation workload #232
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
What is the licensing of this content?
@vpehkone thanks for raising the PR. Can we move this benchmark to a folder named semantic_search rather than vectorSearch_embedding?
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@navneet1v Addressed all comments: changed workload name to semantic_search, moved common code to OSB, and changed dataset to trec-covid. Can you please review?
Since we have two semantic search workloads that are looking to be added, let's rename the workload to be more specific (such as |
…rch. - Added the sample output for treccovid_semantic_search. - Added description of test procedure. - Simplified treccovid_semantics_search workload configuration. Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@IanHoang If the only difference is which dataset is used, do we need two different workloads? Why not merge them and let individual procedures use their own corpus?
@VijayanB These workloads are very different. The trec-covid semantic search workload generates embeddings and performs vector search. The noaa semantic search workload runs range, aggregation, term, and similar searches. It does not make sense to merge them.
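To make the "generates embeddings and performs vector search" part concrete, here is a minimal sketch of the two request bodies such a workload typically needs: an ingest pipeline with a `text_embedding` processor and a k-NN index that routes documents through it. The field names (`text`, `passage_embedding`), the pipeline id, and the vector dimension are assumptions for illustration and are not taken from this PR.

```python
import json


def build_text_embedding_pipeline(model_id, source_field="text",
                                  vector_field="passage_embedding"):
    """Body for an ingest pipeline that embeds `source_field` into `vector_field`.

    Field names are hypothetical; the PR's actual pipeline is not shown here.
    """
    return {
        "description": "Generate embeddings at ingest time",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": {source_field: vector_field},
                }
            }
        ],
    }


def build_index_body(pipeline_id="nlp-ingest-pipeline",
                     vector_field="passage_embedding", dimension=768):
    """Body for a k-NN index whose documents pass through the pipeline.

    `dimension` must match the deployed model's output size; 768 is only
    a placeholder.
    """
    return {
        "settings": {"index": {"knn": True, "default_pipeline": pipeline_id}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_text_embedding_pipeline("<model-id>"), indent=2))
    print(json.dumps(build_index_body(), indent=2))
```

None of this applies to a workload that only issues range, aggregation, or term queries, which is the distinction the comment above draws.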
@vpehkone @VijayanB what would you say about rebasing that workload on a single dataset? We're not using this workload and trec-covid to check correctness. We can use a field of text type to generate embeddings. Here is an example of a doc from noaa:
we can use field |
@martin-gaievski what is the reason for merging the noaa dataset with Semantic Search? I think it's better to keep semantic search as a separate use case and workload. Having an umbrella name like semantic search is really good, and we can add more datasets under it later. For now, a simple Semantic Search workload with trec-covid as the dataset is pretty neat, I would say.
def register(registry):
    registry.register_param_source("semantic-search-source", QueryParamSource)
    registry.register_param_source("create-ingest-pipeline", ingest_pipeline_param_source)
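For readers unfamiliar with the hook, the registration above can be sketched as follows. OpenSearch Benchmark calls `register(registry)` from the workload's `workload.py`; a parameter source can be a class with `partition` and `params` methods or a plain function. The bodies below are hypothetical stand-ins, not the PR's implementation: the parameter keys, default values, and the `FakeRegistry` helper exist only so the sketch runs on its own.

```python
import itertools


class QueryParamSource:
    """Class-based parameter source (sketch; parameter keys are assumptions)."""

    def __init__(self, workload, params, **kwargs):
        self._index = params.get("index", "treccovid")
        self._field = params.get("vector_field", "passage_embedding")
        self._model_id = params.get("model_id", "<model-id>")
        self._k = params.get("k", 10)
        # Cycle through the query texts so params() can be called indefinitely.
        self._queries = itertools.cycle(params.get("queries", ["origin of COVID-19"]))

    def partition(self, partition_index, total_partitions):
        # Every client shares the same query list in this sketch.
        return self

    def params(self):
        text = next(self._queries)
        neural = {"query_text": text, "model_id": self._model_id, "k": self._k}
        return {
            "index": self._index,
            "body": {"size": self._k, "query": {"neural": {self._field: neural}}},
        }


def ingest_pipeline_param_source(workload, params, **kwargs):
    """Function-based parameter source: passes the pipeline id and body through."""
    return {"id": params.get("id", "nlp-ingest-pipeline"), "body": params.get("body", {})}


def register(registry):
    registry.register_param_source("semantic-search-source", QueryParamSource)
    registry.register_param_source("create-ingest-pipeline", ingest_pipeline_param_source)


class FakeRegistry:
    """Stand-in for the benchmark's registry, used only to exercise register()."""

    def __init__(self):
        self.param_sources = {}

    def register_param_source(self, name, source):
        self.param_sources[name] = source


if __name__ == "__main__":
    reg = FakeRegistry()
    register(reg)
    source = reg.param_sources["semantic-search-source"](None, {"model_id": "m1"})
    print(source.params())
```

Because the names are bound inside the workload's own `register` function, the benchmark itself does not need to know about these classes ahead of time.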
@IanHoang Can this be added directly to opensearch-benchmark?
@VijayanB Sorry for the late response. Yes, this can technically be contributed directly to the OpenSearch Benchmark repository. If it is likely to be reused by other workloads, we should include it there. @vpehkone Please feel free to make this quick change in the OSB repository and we can review it quickly and get it shipped.
@IanHoang I don't see how the registration of a parameter source function or class could be added to OSB. How would OSB know where to find it, or the name of the source function or class? Please let me know if you have an idea of how this would work, and I will try to implement it.
@vpehkone We plan to add some documentation to the official OSB documentation regarding this. In the interest of time, we can have this merged in
We'll need to download the corpora manually and add them to our cloud repository. I can coordinate this with you offline in the Slack community.
@vpehkone Please add this to the workload.json as the base-url: https://dbyiw3u3rf9yr.cloudfront.net/corpora/treccovid
"corpora": [
  {
    "name": "treccovid",
    "base-url": "https://vesa-oswl.s3.us-west-2.amazonaws.com/treccovid",
@vpehkone This can now be updated to https://dbyiw3u3rf9yr.cloudfront.net/corpora/treccovid. Please confirm that the document count is also correct. Is 129192 taken from the uncompressed version of documents.json.bz2?
$ wc -l documents.json.bz2
253809 documents.json.bz2
Updated the new URL for documents and queries. Yes, the document count 129192 is right and it is for uncompressed documents.
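Note that `wc -l` on the `.bz2` file counts newline bytes in the compressed stream, so it says nothing about the number of documents. A minimal sketch for checking the count against the decompressed content, assuming one JSON document per line (the file path and the sample data below are illustrative only):

```python
import bz2
import os
import tempfile


def count_documents(path):
    """Count lines in a bz2-compressed file by streaming the decompressed data."""
    with bz2.open(path, "rb") as handle:
        return sum(1 for _ in handle)


def demo_count():
    """Write a tiny three-document sample corpus and count it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "documents.json.bz2")
        with bz2.open(path, "wt", encoding="utf-8") as handle:
            for doc_id in range(3):
                handle.write('{"id": %d, "text": "sample"}\n' % doc_id)
        return count_documents(path)


if __name__ == "__main__":
    print(demo_count())
```

Running `count_documents` on the real corpus file should report the value used as the document count in workload.json if that count refers to the uncompressed documents.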
* Add vector search with embedding generation workload
* Add vector search with embedding generation workload
* Updated README.md with the license text.
* - Changed the workload from vectorsearch_embedding to semantic_search.
  - Changed dataset from ms marco to trec-covid.
  - Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.
* - Changed the workload name semantic_search to treccovid_semantic_search.
  - Added the sample output for treccovid_semantic_search.
  - Added description of test procedure.
  - Simplified treccovid_semantics_search workload configuration.
* Updated parameters of treccovid workload.
* Added files.txt to treccovid workload.
* Updated the documents url for treccovid_semantic_search.
---------
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
(cherry picked from commit 417170f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Vesa Pehkonen vesa.pehkonen@intel.com
Description
Add vector search with embedding generation workload
Issues Resolved
#198
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.