Add vector search with embedding generation workload #232

vpehkone · 2024-03-11T17:08:42Z

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

Description

Add vector search with embedding generation workload

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

akashsha1 · 2024-03-11T17:52:41Z

@VijayanB / @wbeckler - This PR adds a benchmark for vector search with embedding generation, i.e., neural search. Please help add the necessary reviewers for this benchmark.

Feedback is appreciated.

wbeckler · 2024-03-14T21:06:19Z

What is the licensing of this content?

vectorsearch_embedding/README.md

navneet1v · 2024-03-25T19:16:18Z

@vpehkone Thanks for raising the PR. Can we move this benchmark to a folder named semantic_search rather than vectorsearch_embedding?

vectorsearch_embedding/workload.json

vectorsearch_embedding/workload.py

- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

vpehkone · 2024-04-24T21:13:12Z

@navneet1v Addressed all comments: changed workload name to semantic_search, moved common code to OSB, and changed dataset to trec-covid. Can you please review?

semantic_search/test_procedures/default.json

IanHoang · 2024-05-20T20:03:00Z

Since we have two semantic search workloads that are looking to be added, let's rename the workload to be more specific (such as treccovid_semantic_search).

semantic_search/README.md

…rch.
- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantic_search workload configuration.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

VijayanB · 2024-05-21T18:00:15Z

@IanHoang If the only difference is the dataset used, do we need two different workloads? Why not merge them and let individual procedures use their own corpus?

treccovid_semantic_search/test_procedures/default.json

treccovid_semantic_search/README.md

vpehkone · 2024-05-21T19:27:34Z

@VijayanB These workloads are very different. Trec-covid semantic search generates embeddings and does vector search. Noaa semantic search runs range, aggregation, term, and similar searches. It does not make sense to merge them.

martin-gaievski · 2024-05-21T20:05:50Z

@vpehkone @VijayanB What would you say about rebasing that workload onto a single dataset? We're not using this workload or trec-covid to check correctness, so any field of text type can be used to generate embeddings. Here is an example doc from noaa:

{
  "TMIN": 8.3,
  "SNOW": "0",
  "WSF5": 11.2,
  "SNWD": "0",
  "PRCP": 0,
  "station": {
    "wmo_id": "72398",
    "state": "MARYLAND",
    "name": "SALISBURY WICOMICO RGNL AP",
    "location": {
      "lon": -75.5103,
      "lat": 38.3406
    },
    "elevation": 14.6,
    "country_code": "US",
    "id": "USW00093720",
    "country": "United",
    "state_code": "MD"
  },
  "WDF2": "20",
  "TMAX": 18.3,
  "AWND": 3.1,
  "TRANGE": {
    "lte": 18.3,
    "gte": 8.3
  },
  "date": "2015-04-15T00:00:00",
  "TAVG": 12.9,
  "WSF2": 8.1,
  "WDF5": "20"
}

We can use the field station.name instead of text to generate embeddings.
The two workloads can then be merged into one, since the rest of the operations are independent; some operations could perhaps be reused.
I'm not sure the other way around is possible, as the trec-covid mapping is simple and lacks the integer and keyword fields needed for a hybrid query.
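
The suggestion to embed `station.name` could look roughly like the sketch below. It is only an illustration, not code from this PR: `get_nested` and `embedding_input` are hypothetical helpers, and the document is a trimmed copy of the noaa example. The sketch only selects the text that would be sent to an embedding model; it does not call one.

```python
from typing import Any, Optional


def get_nested(doc: dict, dotted_path: str) -> Optional[Any]:
    """Walk a dotted path such as 'station.name' through nested dicts."""
    current: Any = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def embedding_input(doc: dict, text_field: str = "station.name") -> Optional[str]:
    """Return the string to embed, or None if the field is missing or not text."""
    value = get_nested(doc, text_field)
    return value if isinstance(value, str) else None


# Trimmed copy of the noaa document quoted in the comment.
noaa_doc = {
    "TMIN": 8.3,
    "station": {"state": "MARYLAND", "name": "SALISBURY WICOMICO RGNL AP"},
    "date": "2015-04-15T00:00:00",
}

print(embedding_input(noaa_doc))
```

Numeric fields such as `TMIN` are rejected by the `isinstance` check, which reflects the point that only text-typed fields are candidates for embedding generation.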

navneet1v · 2024-05-21T20:37:14Z

@martin-gaievski What is the reason for merging the noaa dataset with semantic search? I think it's better to keep semantic search as a separate use case and workload. Having an uber name such as semantic search is really good; we can add more datasets to semantic search later. But for now, a simple semantic search workload with trec-covid as the dataset is pretty neat, I would say.

VijayanB · 2024-07-02T22:36:38Z

treccovid_semantic_search/workload.py

+def register(registry):
+    registry.register_param_source("semantic-search-source", QueryParamSource)
+    registry.register_param_source("create-ingest-pipeline", ingest_pipeline_param_source)


@IanHoang Can this be added directly to opensearch-benchmark?

@VijayanB Sorry for the late response. Yes, technically this can be contributed directly to the OpenSearch Benchmark repository. If it is likely to be reused by other workloads, we should include it there. @vpehkone Please feel free to make this quick change in the OSB repository and we can review it quickly and get this shipped.

@IanHoang I don't see how the registration of a parameter source function or class could be added to OSB itself. How would OSB know where to find it, or what the source function or class is called? Please let me know if you have an idea of how this would work, and I will try to implement it.

@vpehkone We plan to add some documentation to the official OSB documentation regarding this. In the interest of time, we can have this merged in
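
For readers following the registration question, here is a minimal sketch of how a workload's `workload.py` can expose parameter sources through `register(registry)`. It assumes the usual OSB custom parameter source shape (a class with `partition` and `params`, or a plain function). The bodies of `QueryParamSource` and `ingest_pipeline_param_source`, the field name `passage_embedding`, the model id placeholder, and `FakeRegistry` are all hypothetical and are not the code under review.

```python
import random


class QueryParamSource:
    """Hypothetical class-based parameter source: one neural query per call."""

    def __init__(self, workload, params, **kwargs):
        self._index = params.get("index", "treccovid")  # assumed index name
        self._field = params.get("field", "passage_embedding")  # assumed field
        self._model_id = params.get("model_id", "<model-id>")  # placeholder
        self._queries = params.get("queries", ["origin of COVID-19"])
        self._rng = random.Random(params.get("seed", 0))

    def partition(self, partition_index, total_partitions):
        # Every client shares the same source in this sketch.
        return self

    def params(self):
        text = self._rng.choice(self._queries)
        neural = {"query_text": text, "model_id": self._model_id, "k": 10}
        return {"index": self._index,
                "body": {"query": {"neural": {self._field: neural}}}}


def ingest_pipeline_param_source(workload, params, **kwargs):
    """Hypothetical function-based source: passes pipeline settings through."""
    return {"id": params.get("id", "nlp-ingest-pipeline"),
            "body": params.get("body", {})}


def register(registry):
    registry.register_param_source("semantic-search-source", QueryParamSource)
    registry.register_param_source("create-ingest-pipeline",
                                   ingest_pipeline_param_source)


class FakeRegistry:
    """Stand-in for the registry object that is passed to register()."""

    def __init__(self):
        self.param_sources = {}

    def register_param_source(self, name, source):
        self.param_sources[name] = source


registry = FakeRegistry()
register(registry)
source = registry.param_sources["semantic-search-source"](
    None, {"queries": ["covid transmission"]})
print(source.params())
```

Because the names are bound inside the workload's own `register` function, the benchmark tool only needs to import `workload.py` and call `register`; it does not have to know the source names in advance.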

treccovid_semantic_search/README.md

treccovid_semantic_search/index.json

treccovid_semantic_search/test_procedures/default.json

treccovid_semantic_search/workload.py

treccovid_semantic_search/workload.json

IanHoang · 2024-07-18T19:41:43Z

@vpehkone Does the data corpus need to be added to a cloud repository (similar to this workload.json)? If so, we might need to add a files.txt, as other workloads do.

@IanHoang Yes, I added files.txt. Can you let me know how to add the data corpus to the cloud repository? Then I can update the corpus URL in workload.json.

We'll need to download the corpus manually and add it to our cloud repository. I can coordinate this with you offline in the Slack community.

IanHoang · 2024-07-22T17:30:17Z

@vpehkone Please add this to the workload.json as the base-url: https://dbyiw3u3rf9yr.cloudfront.net/corpora/treccovid

IanHoang · 2024-07-22T18:03:55Z

treccovid_semantic_search/workload.json

+  "corpora": [
+    {
+      "name": "treccovid",
+      "base-url": "https://vesa-oswl.s3.us-west-2.amazonaws.com/treccovid",


@vpehkone This can now be updated to https://dbyiw3u3rf9yr.cloudfront.net/corpora/treccovid. Please confirm whether the document count is also correct. Is 129192 the count from the uncompressed version of documents.json.bz2?

$ wc -l documents.json.bz2
253809 documents.json.bz2

Updated to the new URL for documents and queries. Yes, the document count of 129192 is correct, and it refers to the uncompressed documents.
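
Running `wc -l` on the `.bz2` file counts newline bytes in the compressed stream, so 253809 is not a document count. A minimal sketch of counting documents after decompression follows. It assumes one JSON document per line, which is not confirmed in this thread; `count_documents` is a hypothetical helper, and the sample file is generated locally rather than taken from the real corpus.

```python
import bz2
import os
import tempfile


def count_documents(path: str) -> int:
    """Count non-empty lines in a bz2-compressed, newline-delimited file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Small self-made sample; the real input would be documents.json.bz2.
sample_lines = ['{"id": 1}', '{"id": 2}', '{"id": 3}']
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json.bz2")
    with bz2.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write("\n".join(sample_lines) + "\n")
    sample_count = count_documents(sample_path)

print(sample_count)
```

Reading through `bz2.open` streams the file, so the whole corpus never has to be decompressed to disk just to check the count.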

* Add vector search with embedding generation workload
* Add vector search with embedding generation workload
* Updated README.md with the license text.
* - Changed the workload from vectorsearch_embedding to semantic_search.
  - Changed dataset from ms marco to trec-covid.
  - Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.
* - Changed the workload name semantic_search to treccovid_semantic_search.
  - Added the sample output for treccovid_semantic_search.
  - Added description of test procedure.
  - Simplified treccovid_semantic_search workload configuration.
* Updated parameters of treccovid workload.
* Added files.txt to treccovid workload.
* Updated the documents url for treccovid_semantic_search.

(cherry picked from commit 417170f)

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

navneet1v · 2024-07-24T20:06:37Z

Excited to see this workload getting merged. :) Thanks @vpehkone for making this happen and @IanHoang for reviewing the code.

vpehkone added 3 commits March 11, 2024 08:43

Add vector search with embedding generation workload

23b079f

Merge remote-tracking branch 'origin/main' into vesa

dcd2a0e

Add vector search with embedding generation workload

786936c

vpehkone requested review from IanHoang, gkamat, beaioun and cgchinmay as code owners March 11, 2024 17:08

Updated README.md with the license text.

8ad6128

wbeckler reviewed Mar 15, 2024

vectorsearch_embedding/README.md

Merge remote-tracking branch 'origin/neural_search' into vesa

1fcca1c

navneet1v reviewed Mar 25, 2024

vectorsearch_embedding/workload.json

navneet1v reviewed Mar 25, 2024

vectorsearch_embedding/workload.py

vpehkone requested a review from rishabh6788 as a code owner March 30, 2024 06:00

IanHoang reviewed May 20, 2024

semantic_search/test_procedures/default.json

IanHoang reviewed May 20, 2024

semantic_search/README.md

IanHoang reviewed May 20, 2024

semantic_search/README.md

vpehkone added 2 commits May 21, 2024 09:05

Merge remote-tracking branch 'origin/main' into neural_search

c16e1a7

vpehkone requested a review from VijayanB as a code owner May 21, 2024 16:30

martin-gaievski reviewed May 21, 2024

treccovid_semantic_search/test_procedures/default.json

treccovid_semantic_search/README.md