Expose index statistics #128
pyserini/index/pyutils.py
Outdated
@@ -336,3 +336,22 @@ def convert_collection_docid_to_internal_docid(self, docid: str) -> int:
            The Lucene internal ``docid`` corresponding to the external collection ``docid``.
        """
        return self.object.convertDocidToLuceneDocid(self.reader, docid)

    def stats(self) -> dict:
Should we add type hints here? And below, do some explicit type checking?
All keys have type `str`, so that would be possible, but the values vary between `int` and `str`. I'm not sure how to hint/type check without the code getting messy.
def stats(self) -> Dict[str, int or str]:
Looks kinda strange to me, thoughts?
Can the values be string too?
Yes, the original `printIndexStats` in Anserini's `IndexUtils` also showed the stored fields as strings, e.g. for the 'title' field:
(indexOption: DOCS, hasVectors: false)
Hrm, I see. Yeah, `def stats(self) -> dict:` is fine, I think.
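For reference, if a typed signature is wanted later: `int or str` is an ordinary boolean expression that evaluates to `int`, so `Dict[str, int or str]` would silently mean `Dict[str, int]`. A mixed value type is normally spelled with `typing.Union`. The following is only a rough sketch of that option with the explicit type check asked about above; the `StatValue` alias, the `check_stats` helper, and the `'documents'` key are hypothetical and not part of this PR.

```python
from typing import Dict, Union

# Hypothetical alias (not in the PR): a statistic is either an int or a str.
StatValue = Union[int, str]


def stats() -> Dict[str, StatValue]:
    """Stand-in for the real method; the values below are placeholders."""
    return {
        'documents': 3,  # hypothetical integer-valued statistic
        # String-valued entry, as in the 'title' example quoted in this thread.
        'title': '(indexOption: DOCS, hasVectors: false)',
    }


def check_stats(result: Dict[str, StatValue]) -> None:
    """Raise TypeError if a key is not a str or a value is not an int or str."""
    for name, value in result.items():
        if not isinstance(name, str):
            raise TypeError(f'statistic name {name!r} is not a str')
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            raise TypeError(f'statistic {name!r} has unsupported type {type(value).__name__}')


check_stats(stats())
```

Whether the extra check is worth the noise is a judgment call; plain `-> dict` avoids all of it.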
        Returns
        -------
        dict
            Index statistics as a dictionary of statistic's name to statistic.
It would be nice to just list the statistics we provide, so the user doesn't need to go hunting in the Java code.
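One possible shape for that, keeping the numpydoc style already used in this file: list each key under the `Returns` entry. This is a sketch only. The key names and descriptions below are placeholders, since the real names come from the Java side and are not shown in this thread, and `DOCUMENTED_KEYS` is a hypothetical helper used just to keep the docstring and the list in sync.

```python
# Placeholder key names; the actual keys are defined by the Java code.
DOCUMENTED_KEYS = ('documents', 'unique_terms', 'total_terms')


def stats() -> dict:
    """Return dictionary with index statistics.

    Returns
    -------
    dict
        Index statistics as a dictionary of statistic's name to statistic.

        - ``documents``: number of documents (placeholder description)
        - ``unique_terms``: number of unique terms (placeholder description)
        - ``total_terms``: total number of terms (placeholder description)
    """
    # Stand-in body so the sketch runs; the real method would call into Java.
    return {key: 0 for key in DOCUMENTED_KEYS}


# Cheap guard that every listed key is actually mentioned in the docstring.
missing = [key for key in DOCUMENTED_KEYS if f'``{key}``' not in stats.__doc__]
```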
Pyserini end of castorini/anserini#1218