Add Client (external API) Module For Enhanced Metadata #306
…ne method, regenerate all cassettes with new fields
@pytest.mark.asyncio()
async def test_bulk_doi_search():
    dois = [
        "10.1063/1.4938384",
Bonus points for comments documenting why each of these were chosen
def create_bibtex_key(author: list[str], year: str, title: str) -> str:
    FORBIDDEN_KEY_CHARACTERS = {"_", " ", "-", "/", "'", "`", ":"}
Wdyt of moving this to module level?
I think this is specific to bibtex, no?
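For reference, a minimal sketch of what the module-level variant suggested in this thread could look like. Only the function signature and the character set come from the diff; the key format used here (first author's surname, then year, then first title word) is an assumption for illustration, not the PR's actual logic.

```python
# Module-level constant, as suggested in the review thread. A frozenset makes
# it clear that the set is not meant to be mutated at runtime.
FORBIDDEN_KEY_CHARACTERS = frozenset({"_", " ", "-", "/", "'", "`", ":"})


def create_bibtex_key(author: list[str], year: str, title: str) -> str:
    """Build a BibTeX key and strip characters that are not allowed in it.

    ASSUMPTION: the surname + year + first-title-word layout is illustrative.
    """
    # Last whitespace-separated token of the first author, if there is one.
    surname = author[0].split()[-1] if author and author[0].split() else "unknown"
    # First whitespace-separated token of the title, if there is one.
    first_word = title.split()[0] if title.split() else ""
    raw = f"{surname}{year}{first_word}".lower()
    return "".join(ch for ch in raw if ch not in FORBIDDEN_KEY_CHARACTERS)
```

Keeping the constant next to the function that uses it, rather than in a shared utilities module, would still respect the point that the character set is specific to BibTeX.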
# see if we can upgrade to DocDetails
# if not, we can progress with a normal Doc
# if "overwrite_fields_from_metadata" is used:
# will map "docname" to "key", and "dockey" to "doc_id"
Should we just change these names? Now would be the time
I don't know if we want to break backwards compatibility quite yet. I'm for keeping the current names and having extra duplicative fields.
…n, replace bibtex extract w pattern, lower timeout threshold in test
So fucking cool - great work
This adds a new module, `paperqa.clients`, which provides base classes for both providing (`MetadataProvider`) and processing (`MetadataPostProcessor`) `Doc` metadata. Two metadata providers are included, `CrossrefProvider` and `SemanticScholarProvider`, along with one processor, `JournalQualityPostProcessor` (which adds a quality score for each extracted journal). `DocMetadataClient` then orchestrates these providers, so a user doesn't need to worry about where metadata comes from; they just need to request it. Usage examples make this simpler:

The `doc_details` result is a `DocDetails` object, a subclass of `Doc` with extra metadata that can be merged using the `__add__` operator. In this way, users can use whichever `MetadataProvider`s they want and get a resultant `DocDetails` object that fills in whatever fields are available. `DocDetails` have rich metadata like:

The `DocMetadataClient.query` method accepts any valid `ClientQuery` supported by the `MetadataProviders` you have selected (all of them by default), although only `DOIQuery` and `TitleAuthorQuery` are implemented for now. It returns all fields by default, but you can also filter for whatever you'd like:

The above example will only query SemanticScholar (not Crossref), and it will only return the title, DOI, and authors for that paper. This is useful if you need to do a small lookup for a DOI or something along those lines. The processors run by default, but if you don't specify them in the `clients` argument they won't run, as in the above example. So to run with `JournalQualityPostProcessor` enabled:

Note we had to add `"journal"` as a field; otherwise the `JournalQualityPostProcessor` would have had no journal to work from.

The metadata client has been added to the `Docs.aadd` method, so if you set the `use_doc_details` flag to true, your other inputs will be used to get rich metadata using a `DocDetails` object rather than the standard `Doc` object.

Tests are included which use the `pytest-vcr` module and cassettes to avoid needing network access. Secrets are filtered from the cassettes by default.
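The behaviour described above (merging provider results with `__add__`, restricting the returned fields, and opting in to a post-processor) can be sketched without network access. Everything in this block is a hypothetical stand-in: `Details`, `provider_a`, `provider_b`, `journal_quality`, `query`, the placeholder DOI, and the journal score are invented for illustration and are not the `paperqa.clients` API.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace
from typing import Callable, Iterable, Optional, Sequence


@dataclass(frozen=True)
class Details:
    """Hypothetical stand-in for a DocDetails-like record; all fields optional."""

    title: Optional[str] = None
    doi: Optional[str] = None
    authors: Optional[tuple[str, ...]] = None
    journal: Optional[str] = None
    source_quality: Optional[int] = None

    def __add__(self, other: Details) -> Details:
        # Keep our own value where we have one; otherwise take the other's.
        merged = {}
        for f in fields(self):
            mine = getattr(self, f.name)
            merged[f.name] = mine if mine is not None else getattr(other, f.name)
        return Details(**merged)


# Two made-up providers, each knowing only part of the metadata.
def provider_a(doi: str) -> Details:
    return Details(title="An Example Paper", doi=doi)


def provider_b(doi: str) -> Details:
    return Details(doi=doi, authors=("A. Author", "B. Author"),
                   journal="Journal of Examples")


JOURNAL_SCORES = {"Journal of Examples": 2}  # made-up score table


def journal_quality(details: Details) -> Details:
    """Made-up post-processor: needs a journal to have anything to score."""
    if details.journal is None:
        return details
    return replace(details, source_quality=JOURNAL_SCORES.get(details.journal, -1))


def query(
    doi: str,
    providers: Iterable[Callable[[str], Details]],
    processors: Iterable[Callable[[Details], Details]] = (),
    wanted: Optional[Sequence[str]] = None,
) -> Details:
    """Merge provider results, optionally filter fields, then post-process."""
    result = Details()
    for provider in providers:
        result = result + provider(doi)
    if wanted is not None:
        # Keep only the requested fields; everything else falls back to None.
        result = Details(**{name: getattr(result, name) for name in wanted})
    for process in processors:
        result = process(result)
    return result


full = query("10.1000/example", [provider_a, provider_b], [journal_quality])
narrow = query("10.1000/example", [provider_b], wanted=["doi", "authors"])
scored = query("10.1000/example", [provider_b], [journal_quality],
               wanted=["doi", "journal"])
```

In this sketch `full` carries fields from both providers, `narrow` carries only the requested ones, and `scored` receives a quality score only because `"journal"` was included in the requested fields, which mirrors the note about `JournalQualityPostProcessor` above.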