feat: Add batching for querying in ElasticsearchDocumentStore and OpenSearchDocumentStore #5063

Merged: 9 commits, Jun 1, 2023

bogdankostic
Contributor

Related Issues

  • n/a

Proposed Changes:

This PR introduces an instance parameter batch_size to ElasticsearchDocumentStore and OpenSearchDocumentStore. It sets the number of queries or documents that are passed to Elasticsearch / OpenSearch in a single msearch or bulk request, respectively.
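
The batching idea can be illustrated with a minimal sketch. It is not the code from this PR: `get_batches`, `run_in_batches`, and `fake_msearch` are hypothetical names, and `fake_msearch` only stands in for a real multi-search request so that the example runs without a cluster.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def get_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def run_in_batches(
    queries: Sequence[T],
    send_request: Callable[[Sequence[T]], List[str]],
    batch_size: int = 10_000,
) -> List[str]:
    """Send `queries` in several smaller requests and concatenate the results in order."""
    results: List[str] = []
    for batch in get_batches(queries, batch_size):
        # One request per batch keeps the size of each request body bounded.
        results.extend(send_request(batch))
    return results


# Hypothetical stand-in for a multi-search call; it records the size of each request.
request_sizes: List[int] = []


def fake_msearch(batch: Sequence[str]) -> List[str]:
    request_sizes.append(len(batch))
    return [f"result for {query}" for query in batch]


queries = [f"query {i}" for i in range(25)]
results = run_in_batches(queries, fake_msearch, batch_size=10)
```

With 25 queries and a batch size of 10, the stand-in is called three times, and the results come back in the original query order.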

How did you test it?

I added some unit tests.

Notes for the reviewer

This is needed for cases where EmbeddingRetriever's run_batch method is called with several thousand queries. Without this change, the single oversized request fails with HTTP error 413 (Transport Error: Payload too large).

Member

@julian-risch julian-risch left a comment

Looks quite good to me already. One change request: use 10_000 as the default instead of 1_000, if possible. The write_documents method of the OpenSearchDocumentStore should be updated too (see the comment). Ideally, we would also extend the test cases to check that write_documents picks up the batch_size specified in the init of OpenSearchDocumentStore/ElasticsearchDocumentStore.

@@ -165,6 +166,8 @@ def __init__(
index type and knn parameters). If `0`, training doesn't happen automatically but needs
to be triggered manually via the `train_index` method.
Default: `None`
:param batch_size: Number of Documents to index at once / Number of queries to execute at once. If you face
Member

The write_documents method of the OpenSearchDocumentStore also has a batch_size parameter with a default of 10_000. If we introduce a batch_size param in the init of the document store, write_documents should use it as well: change the parameter to batch_size: Optional[int] = None in the signature of write_documents and fall back to self.batch_size when no value is passed.
Can we then make the default 10_000 instead of 1_000 to prevent a breaking change?
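
A rough sketch of the requested fallback, under stated assumptions: `ToyDocumentStore`, `_bulk`, and `bulk_call_sizes` are hypothetical and only mimic the shape of the change; this is not the actual OpenSearchDocumentStore code.

```python
from typing import List, Optional


class ToyDocumentStore:
    """Hypothetical stand-in for a document store; not the Haystack implementation."""

    def __init__(self, batch_size: int = 10_000):
        # Instance-level default, as proposed for the document store init.
        self.batch_size = batch_size
        self.bulk_call_sizes: List[int] = []

    def _bulk(self, documents: List[dict]) -> None:
        # Placeholder for a real bulk request; it only records the batch size.
        self.bulk_call_sizes.append(len(documents))

    def write_documents(self, documents: List[dict], batch_size: Optional[int] = None) -> None:
        # Use the per-call value if given, otherwise the instance default.
        # Note that `or` also treats 0 as "not given".
        batch_size = batch_size or self.batch_size
        for start in range(0, len(documents), batch_size):
            self._bulk(documents[start : start + batch_size])


store = ToyDocumentStore(batch_size=4)
docs = [{"content": f"doc {i}"} for i in range(10)]
store.write_documents(docs)                # falls back to the instance default of 4
store.write_documents(docs, batch_size=6)  # per-call override
```

Keeping the instance default at 10_000 means that callers who never pass batch_size see the same behavior as before, which is the breaking-change concern raised here.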

Contributor Author

I fixed write_documents and changed the default back to 10_000. I initially changed it to 1_000 because I found that value in the Elasticsearch documentation, but the information is probably outdated, as it refers to an older version.

Collaborator

coveralls commented Jun 1, 2023

Pull Request Test Coverage Report for Build 5146692208

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 291 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.4%) to 40.083%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| document_stores/elasticsearch.py | 51 | 43.36% |
| document_stores/opensearch.py | 110 | 66.26% |
| document_stores/search_engine.py | 130 | 62.31% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 5144018275: | 0.4% |
| Covered Lines: | 8981 |
| Relevant Lines: | 22406 |

💛 - Coveralls

Member

@julian-risch julian-risch left a comment

LGTM! 👍

@bogdankostic bogdankostic merged commit a9a49e2 into main Jun 1, 2023
@bogdankostic bogdankostic deleted the doc_store_batching branch June 1, 2023 16:47
bogdankostic added a commit that referenced this pull request Jun 6, 2023
bogdankostic added a commit that referenced this pull request Jun 6, 2023
… for `WeaviateDocumentStore` (#5079)

* Add batch_size parameter and cast timeout_config to tuple

* Add unit test

* Remove debug tqdm

* Remove debug tqdm introduced in #5063