Add Client (external API) Module For Enhanced Metadata #306

Merged: mskarlin merged 18 commits into main from add-paperdetails-type on Aug 14, 2024

mskarlin
Collaborator

This adds a new module, paperqa.clients, which provides base classes for both providing (MetadataProvider) and processing (MetadataPostProcessor) Doc metadata. Two metadata providers are included, CrossrefProvider and SemanticScholarProvider, along with one processor, JournalQualityPostProcessor (which adds a quality score for each extracted journal). DocMetadataClient then orchestrates these providers, so a user doesn't need to worry about where metadata comes from; they just need to request it. A usage example makes this clearer:

async with aiohttp.ClientSession() as session:
    client = DocMetadataClient(session)
    doc_details = await client.query(title="PaperQA: Retrieval-Augmented Generative Agent for Scientific Research")

The doc_details (a DocDetails object) is a subclass of Doc with extra metadata and the ability to be merged with the __add__ operator. In this way, users can use whichever MetadataProviders they want and get a resulting DocDetails object that fills in whatever fields those providers can supply. DocDetails have rich metadata like:

print(doc_details.formatted_citation) 
>>>Jakub L'ala, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. Paperqa: retrieval-augmented generative agent for scientific research. ArXiv, Dec 2023. doi:10.48550/arxiv.2312.07559. This article has 21 citations.
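
To make the merge idea concrete, here is a minimal sketch of combining two partial records with `__add__`. The `PartialDetails` class and its merge rule (prefer the right-hand value when it is set) are assumptions for illustration only, not the actual DocDetails implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class PartialDetails:
    """Hypothetical stand-in for DocDetails; not the paperqa class."""

    title: str | None = None
    doi: str | None = None
    citation_count: int | None = None

    def __add__(self, other: PartialDetails) -> PartialDetails:
        # Assumed rule: take the right-hand value when it is set,
        # otherwise keep the left-hand value.
        merged = {}
        for f in fields(self):
            theirs = getattr(other, f.name)
            merged[f.name] = theirs if theirs is not None else getattr(self, f.name)
        return PartialDetails(**merged)


# Two sources each know part of the picture.
crossref_like = PartialDetails(title="PaperQA", doi="10.48550/arxiv.2312.07559")
s2_like = PartialDetails(title="PaperQA", citation_count=21)
combined = crossref_like + s2_like
print(combined)
```

Each source contributes the fields it knows, and the sum carries both the DOI and the citation count.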

The DocMetadataClient.query method accepts any valid ClientQuery supported by the MetadataProviders you have selected (all of them by default), though we've only implemented a DOIQuery and a TitleAuthorQuery for now. It returns all fields by default, but you can also filter for whatever you'd like:

async with aiohttp.ClientSession() as session:
    # query only Semantic Scholar and request just a few fields
    s2_client = DocMetadataClient(session, clients=[SemanticScholarProvider])
    s2_details = await s2_client.query(
        title="Augmenting large language models with chemistry tools",
        fields=["title", "doi", "authors"],
    )

The above example will only query SemanticScholar (not Crossref), and it'll only return the title, doi, and authors for that paper. This is useful if you need to do a small lookup for a DOI or something along those lines. The processors run by default, but if you pass a clients argument and don't include them, they won't run, as in the above example. So to run with JournalQualityPostProcessor enabled:

async with aiohttp.ClientSession() as session:
    # same single-source query, now with the journal quality post-processor
    s2_client = DocMetadataClient(session, clients=[SemanticScholarProvider, JournalQualityPostProcessor])
    s2_details = await s2_client.query(
        title="Augmenting large language models with chemistry tools",
        fields=["title", "doi", "authors", "journal"],
    )

Note we had to add "journal" as a field, otherwise the JournalQualityPostProcessor would have had no journal to work from.
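
The interaction between field filtering and post-processing can be sketched without any network access. Everything below (`fake_provider`, `journal_quality`, the `source_quality` key, the journal name, the score table, and the placeholder DOI) is hypothetical and only mimics the behaviour described above; it is not the paperqa.clients code.

```python
import asyncio

# Hypothetical journal scores, invented purely for illustration.
JOURNAL_SCORES = {"journal of examples": 3}


async def fake_provider(title: str) -> dict:
    """Stand-in for a metadata provider; returns canned data, no network."""
    return {
        "title": title,
        "doi": "10.0000/example",  # placeholder, not a real DOI
        "authors": ["A. Author"],
        "journal": "Journal of Examples",
    }


def journal_quality(details: dict) -> dict:
    """Stand-in post-processor: adds a score only when a journal is present."""
    journal = details.get("journal")
    if journal is None:
        return details
    return {**details, "source_quality": JOURNAL_SCORES.get(journal.lower(), -1)}


async def query(title: str, fields=None, processors=()) -> dict:
    details = await fake_provider(title)
    if fields is not None:
        # Keep only the requested fields before post-processing runs.
        details = {k: v for k, v in details.items() if k in fields}
    for process in processors:
        details = process(details)
    return details


without_journal = asyncio.run(
    query("Some paper", fields=["title", "doi"], processors=[journal_quality])
)
with_journal = asyncio.run(
    query("Some paper", fields=["title", "journal"], processors=[journal_quality])
)
print(without_journal)
print(with_journal)
```

In this sketch the processor silently does nothing when "journal" was filtered out, which is the situation the note above warns about.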

The metadata client has been added to the Docs.aadd method, so if you set the use_doc_details flag to true, your other inputs will be used to get rich metadata, producing a DocDetails object rather than the standard Doc object.
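
As a rough illustration of the opt-in behaviour, the sketch below upgrades to a richer record only when the flag is set and a lookup succeeds, and otherwise falls back to the plain one. `PlainDoc`, `RichDoc`, `lookup_metadata`, and `add_doc` are hypothetical names, and the fallback rule is an assumption; the real logic lives in Docs.aadd.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlainDoc:
    """Hypothetical stand-in for the standard Doc."""

    docname: str
    citation: str


@dataclass
class RichDoc(PlainDoc):
    """Hypothetical stand-in for DocDetails, with one extra field."""

    doi: Optional[str] = None


async def lookup_metadata(title: str) -> Optional[str]:
    """Stand-in for a metadata query; returns None when nothing is found."""
    known = {"known paper": "10.0000/example"}  # placeholder DOI
    return known.get(title.lower())


async def add_doc(docname, citation, title=None, use_doc_details=False):
    if use_doc_details and title is not None:
        doi = await lookup_metadata(title)
        if doi is not None:
            return RichDoc(docname=docname, citation=citation, doi=doi)
    # Fall back to the plain record when the flag is off or the lookup fails.
    return PlainDoc(docname=docname, citation=citation)


rich = asyncio.run(add_doc("doc1", "A citation", title="Known Paper", use_doc_details=True))
plain = asyncio.run(add_doc("doc2", "A citation", title="Missing Paper", use_doc_details=True))
print(type(rich).__name__, type(plain).__name__)
```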

Tests are included which use the pytest-vcr module and cassettes to avoid needing network access. Secrets are filtered from the cassettes by default.
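
For readers unfamiliar with cassette filtering, a sketch of the kind of configuration pytest-vcr reads from a `vcr_config` fixture is shown below. It is written as a plain function so it runs without pytest installed, and the header names listed are assumptions about where API keys travel, not a copy of this PR's conftest.py.

```python
# In a real conftest.py this function would be decorated with
# @pytest.fixture(scope="module") so pytest-vcr picks it up as `vcr_config`.
def vcr_config() -> dict:
    return {
        # Headers listed here are stripped from recorded requests.
        # The names are assumptions; list whichever headers carry your keys.
        "filter_headers": ["x-api-key", "Crossref-Plus-API-Token", "authorization"],
        # Record a cassette the first time, then replay it on later runs.
        "record_mode": "once",
    }


print(vcr_config())
```

With a fixture like this in place, a test marked with `@pytest.mark.vcr` replays its recorded HTTP traffic from the cassette instead of hitting the network.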

@mskarlin mskarlin added the enhancement New feature or request label Aug 12, 2024
@pytest.mark.asyncio()
async def test_bulk_doi_search():
    dois = [
        "10.1063/1.4938384",
Collaborator

Bonus points for comments documenting why each of these was chosen



def create_bibtex_key(author: list[str], year: str, title: str) -> str:
    FORBIDDEN_KEY_CHARACTERS = {"_", " ", "-", "/", "'", "`", ":"}
Collaborator

Wdyt of moving this to module level?

Collaborator Author

I think this is specific to bibtex, no?
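
For context on the trade-off in this thread, here is a sketch of the module-level variant. Only the signature and the character set come from the snippet above; the key format (first author's last name, year, first title word) is an assumption and may differ from the function in the PR.

```python
# Module-level variant: the set is built once at import time.
FORBIDDEN_KEY_CHARACTERS = {"_", " ", "-", "/", "'", "`", ":"}


def create_bibtex_key(author: list[str], year: str, title: str) -> str:
    # Assumed key shape: first author's last name + year + first title word.
    last_name = author[0].split()[-1] if author else "unknown"
    words = title.split()
    first_word = words[0] if words else ""
    key = f"{last_name}{year}{first_word}".lower()
    # Drop every character that the set marks as forbidden in a key.
    return "".join(ch for ch in key if ch not in FORBIDDEN_KEY_CHARACTERS)


print(create_bibtex_key(["Odhran O'Donoghue"], "2023", "PaperQA: Retrieval-Augmented Generative Agent"))
# prints "odonoghue2023paperqa"
```

Keeping the constant inside the function, as in the PR, scopes it to the bibtex logic; hoisting it avoids rebuilding the set on every call. Either choice gives the same keys.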

# see if we can upgrade to DocDetails
# if not, we can progress with a normal Doc
# if "overwrite_fields_from_metadata" is used:
# will map "docname" to "key", and "dockey" to "doc_id"
Collaborator

Should we just change these names? Now would be the time

Collaborator Author

@mskarlin mskarlin Aug 14, 2024

I don't know if we want to break backwards compatibility quite yet. I'm for keeping the old names and having extra duplicative fields.
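
One way to picture the "keep both names" option is a record that exposes the legacy names as read-only aliases of the new ones. `DetailsRecord` is hypothetical; only the four field names come from the code comment in this thread, and the PR may instead store genuinely duplicated fields.

```python
from dataclasses import dataclass


@dataclass
class DetailsRecord:
    """Hypothetical record; only the four field names come from the thread."""

    key: str
    doc_id: str

    @property
    def docname(self) -> str:
        # Legacy name kept as a read-only alias of `key`.
        return self.key

    @property
    def dockey(self) -> str:
        # Legacy name kept as a read-only alias of `doc_id`.
        return self.doc_id


record = DetailsRecord(key="examplekey2023", doc_id="abc123")
print(record.docname, record.dockey)
```

Existing callers that read `docname` or `dockey` keep working, while new code can use `key` and `doc_id`.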

…n, replace bibtex extract w pattern, lower timeout threshold in test
@whitead whitead self-requested a review August 14, 2024 20:02
Collaborator

@whitead whitead left a comment

So fucking cool - great work

@mskarlin mskarlin merged commit eced4f3 into main Aug 14, 2024
1 check passed
@mskarlin mskarlin deleted the add-paperdetails-type branch August 14, 2024 21:02
@whitead whitead mentioned this pull request Sep 11, 2024