Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix faiss index batch_size bug on python3.7 and update es config for … #2965

Merged
merged 5 commits into from
Aug 9, 2022

Conversation

w5688414
Copy link
Contributor

@w5688414 w5688414 commented Aug 4, 2022

…pipelines

PR types

  • Bug fixes

PR changes

  • Docs
  • APIs

Description

  • faiss index batch_size bug on python3.7
  • update es config
  • Fix the nltk download bug
  • Optimize Readme for inner dataset pipline

Copy link

@tianxin1860 tianxin1860 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave a comment

@@ -15,6 +15,7 @@
parser.add_argument("--max_seq_len_query", default=64, type=int, help="The maximum total length of query after tokenization.")
parser.add_argument("--max_seq_len_passage", default=256, type=int, help="The maximum total length of passage after tokenization.")
parser.add_argument("--retriever_batch_size", default=16, type=int, help="The batch size of retriever to extract passage embedding for building ANN index.")
parser.add_argument("--update_batch_size", default=100, type=int, help="The batch size of document_store to update passage embedding for building ANN index.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_batch_size 这个变量是要控制什么?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_batch_size 和 retriever_batch_size 之间的关系是什么?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这是faiss的update_embeddings的参数,它给的解释是:

:param batch_size: When working with large number of documents, batching can help reduce memory footprint.

https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/experimental/pipelines/pipelines/document_stores/faiss.py

retriever_batch_size是DensePassageRetriever的参数,它给的解释是

:param batch_size: Number of questions or passages to encode at once. In case of multiple gpus, this will be the total batch size.
https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/experimental/pipelines/pipelines/nodes/retriever/dense.py

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如线下讨论,不适合暴露 update_batch_size 参数,直接修改 pipeline 默认值即可。

Copy link

@tianxin1860 tianxin1860 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@w5688414 w5688414 merged commit bb1729b into PaddlePaddle:develop Aug 9, 2022
@w5688414 w5688414 deleted the pip8 branch June 7, 2023 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants