Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vector search with embedding generation workload #232

Merged
merged 13 commits into from
Jul 24, 2024

Conversation

vpehkone
Copy link
Contributor

Signed-off-by: Vesa Pehkonen vesa.pehkonen@intel.com

Description

Add vector search with embedding generation workload

Issues Resolved

#198

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@akashsha1
Copy link

akashsha1 commented Mar 11, 2024

@VijayanB / @wbeckler - PR for benchmarking vectorsearch with embedding i.e. neural search. Please help add necessary reviewers for this benchmark.

Feedback is appreciated.

@wbeckler
Copy link

What is the licensing of this content?

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@navneet1v
Copy link

@vpehkone thanks for raising the PR. Can we move this benchmarks to folder named: semantic_search rather than vectorSearch_embedding

vectorsearch_embedding/workload.py Outdated Show resolved Hide resolved
vectorsearch_embedding/workload.py Outdated Show resolved Hide resolved
vectorsearch_embedding/workload.py Outdated Show resolved Hide resolved
vectorsearch_embedding/workload.py Outdated Show resolved Hide resolved
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@vpehkone vpehkone requested a review from rishabh6788 as a code owner March 30, 2024 06:00
@vpehkone
Copy link
Contributor Author

@navneet1v Addressed all comments: changed workload name to semantic_search, moved common code to OSB, and changed dataset to trec-covid. Can you please review?

@IanHoang
Copy link
Collaborator

IanHoang commented May 20, 2024

Since we have two semantic search workloads that are looking to be added, let's rename the workload to be more specific (such as treccovid_semantic_search).

semantic_search/README.md Outdated Show resolved Hide resolved
semantic_search/README.md Outdated Show resolved Hide resolved
vpehkone added 2 commits May 21, 2024 09:05
…rch.

- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantics_search workload configuration.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@vpehkone vpehkone requested a review from VijayanB as a code owner May 21, 2024 16:30
@VijayanB
Copy link
Member

Since we have two semantic search workloads that are looking to be added, let's rename the workload to be more specific (such as treccovid_semantic_search).

@IanHoang If difference is just usage of datasets, do we need two different workloads? Why not merge and let individual procedures use their own corpus?

@vpehkone
Copy link
Contributor Author

Since we have two semantic search workloads that are looking to be added, let's rename the workload to be more specific (such as treccovid_semantic_search).

@IanHoang If difference is just usage of datasets, do we need two different workloads? Why not merge and let individual procedures use their own corpus?

@VijayanB These workload are very different. Trec-covid semantic search generates embeddings and does vector search. Noaa semantic search does range, aggregate, term, etc... searches. It does not make sense to merge these.

@martin-gaievski
Copy link
Member

Since we have two semantic search workloads that are looking to be added, let's rename the workload to be more specific (such as treccovid_semantic_search).

@IanHoang If difference is just usage of datasets, do we need two different workloads? Why not merge and let individual procedures use their own corpus?

@VijayanB These workload are very different. Trec-covid semantic search generates embeddings and does vector search. Noaa semantic search does range, aggregate, term, etc... searches. It does not make sense to merge these.

@vpehkone @VijayanB what would you say about rebasing that workload on a single dataset? We're not using this workload and trec-covid for checking correctness. We can use some field of text type to generate embeddings. Here is the example of doc from noaa:

{
  "TMIN": 8.3,
  "SNOW": "0",
  "WSF5": 11.2,
  "SNWD": "0",
  "PRCP": 0,
  "station": {
    "wmo_id": "72398",
    "state": "MARYLAND",
    "name": "SALISBURY WICOMICO RGNL AP",
    "location": {
      "lon": -75.5103,
      "lat": 38.3406
    },
    "elevation": 14.6,
    "country_code": "US",
    "id": "USW00093720",
    "country": "United",
    "state_code": "MD"
  },
  "WDF2": "20",
  "TMAX": 18.3,
  "AWND": 3.1,
  "TRANGE": {
    "lte": 18.3,
    "gte": 8.3
  },
  "date": "2015-04-15T00:00:00",
  "TAVG": 12.9,
  "WSF2": 8.1,
  "WDF5": "20"
}

we can use field station.name instead of text to generate embeddings.
Two workloads then can be merged into one as rest of operations are independent, maybe some operations can be reused.
I'm not sure the other way around is possible as trec-covid mapping is simple and is lacking integer and keyword fields that are needed for hybrid query.

@navneet1v
Copy link

@martin-gaievski what is the reason for merging the noaa dataset with Semantic Search? I think its better to keep semantic search as a separate use case and workload. Having a uber name as semantic search is really good. We can have more dataset in semantic search later. But for now having a simple Semantic Search dataset with trec-covid as dataset is pretty neat I would say.

Comment on lines +73 to +75
def register(registry):
registry.register_param_source("semantic-search-source", QueryParamSource)
registry.register_param_source("create-ingest-pipeline", ingest_pipeline_param_source)
Copy link
Member

@VijayanB VijayanB Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IanHoang Can this be added directly to opensearch-benchmark?

Copy link
Collaborator

@IanHoang IanHoang Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VijayanB Sorry for the lates response. Yes, this can be technically contributed directly into the OpenSearch Benchmark repository . If this is going to be reused by other workloads potentially, we should include it there. @vpehkone Please feel free to make this quick change in OSB repository and we can review it quickly and get this shipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IanHoang I don't know how it could be possible to add a registration of parameter source function/class to OSB. OSB cannot know where to look at it or the name of the source function/class? Please let me know if you have an idea how it works, and I will try to implement it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpehkone We plan to add some documentation to the official OSB documentation regarding this. In the interest of time, we can have this merged in

vpehkone added 3 commits July 2, 2024 16:20
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
@IanHoang
Copy link
Collaborator

@vpehkone does the data corpora need to be added to a cloud repository (similar to this workload.json) If so, we might need to add a files.txt, similar to other workloads.

@IanHoang Yes, I added files.txt. Can you let me know how add the data corpora to the cloud repository? Then I can update the corpus url to workload.json.

We'll need to download the corpora manually and add it to our cloud repository. I can coordinate this with you offline in the slack community.

@IanHoang
Copy link
Collaborator

@vpehkone Please add this to the workload.json as the base-url: https://dbyiw3u3rf9yr.cloudfront.net/corpora/treccovid

"corpora": [
{
"name": "treccovid",
"base-url": "https://vesa-oswl.s3.us-west-2.amazonaws.com/treccovid",
Copy link
Collaborator

@IanHoang IanHoang Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpehkone This can now be updated to https://dbyiw3u3rf9yr.cloudfront.net/corpora/treccovid. Please confirm if document count is also correct. Is 129192 from the uncompressed version of documents.json.bz2?

$ wc -l documents.json.bz2
253809 documents.json.bz2

Copy link
Contributor Author

@vpehkone vpehkone Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the new URL for documents and queries. Yes, the document count 129192 is right and it is for uncompressed documents.

@IanHoang IanHoang added backport 2 Backport to the "2" branch backport 3 Backport to the "3" branch labels Jul 22, 2024
@IanHoang IanHoang merged commit 417170f into opensearch-project:main Jul 24, 2024
4 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 24, 2024
* Add vector search with embedding generation workload
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Add vector search with embedding generation workload
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated README.md with the license text.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* - Changed the workload form vectorsearch_embedding to semantic_search.
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* - Changed the workload name semantic_search to treccovid_semantic_search.
- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantics_search workload configuration.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated parameters of treccovid workload.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Added files.txt to treccovid workload.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated the documents url for treccovid_semantic_search.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

---------

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
(cherry picked from commit 417170f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 24, 2024
* Add vector search with embedding generation workload
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Add vector search with embedding generation workload
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated README.md with the license text.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* - Changed the workload form vectorsearch_embedding to semantic_search.
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* - Changed the workload name semantic_search to treccovid_semantic_search.
- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantics_search workload configuration.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated parameters of treccovid workload.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Added files.txt to treccovid workload.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated the documents url for treccovid_semantic_search.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

---------

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
(cherry picked from commit 417170f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
IanHoang pushed a commit that referenced this pull request Jul 24, 2024
* Add vector search with embedding generation workload


* Add vector search with embedding generation workload


* Updated README.md with the license text.



* - Changed the workload form vectorsearch_embedding to semantic_search.
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.



* - Changed the workload name semantic_search to treccovid_semantic_search.
- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantics_search workload configuration.



* Updated parameters of treccovid workload.



* Added files.txt to treccovid workload.



* Updated the documents url for treccovid_semantic_search.



---------


(cherry picked from commit 417170f)

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
IanHoang pushed a commit that referenced this pull request Jul 24, 2024
* Add vector search with embedding generation workload


* Add vector search with embedding generation workload


* Updated README.md with the license text.



* - Changed the workload form vectorsearch_embedding to semantic_search.
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.



* - Changed the workload name semantic_search to treccovid_semantic_search.
- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantics_search workload configuration.



* Updated parameters of treccovid workload.



* Added files.txt to treccovid workload.



* Updated the documents url for treccovid_semantic_search.



---------


(cherry picked from commit 417170f)

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@navneet1v
Copy link

excited to see this workload getting merged. :) thanks @vpehkone making this happen and @IanHoang for reviewing the code.

finnroblin pushed a commit to finnroblin/opensearch-benchmark-workloads that referenced this pull request Aug 5, 2024
…ect#232)

* Add vector search with embedding generation workload
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Add vector search with embedding generation workload
Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated README.md with the license text.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* - Changed the workload form vectorsearch_embedding to semantic_search.
- Changed dataset from ms marco to trec-covid.
- Moved benchmark task runners DeletePipeline, DeleteMlModel, RegisterMlModel and DeployMlModel to OS-benchmark repo.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* - Changed the workload name semantic_search to treccovid_semantic_search.
- Added the sample output for treccovid_semantic_search.
- Added description of test procedure.
- Simplified treccovid_semantics_search workload configuration.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated parameters of treccovid workload.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Added files.txt to treccovid workload.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

* Updated the documents url for treccovid_semantic_search.

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>

---------

Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2 Backport to the "2" branch backport 3 Backport to the "3" branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants