Add Client (external API) Module For Enhanced Metadata #306

mskarlin · 2024-08-12T23:55:53Z

This adds a new module, paperqa.clients, which provides base classes for both supplying (MetadataProvider) and post-processing (MetadataPostProcessor) Doc metadata. Two metadata providers are included, CrossrefProvider and SemanticScholarProvider, along with one post-processor, JournalQualityPostProcessor, which adds a quality score for each extracted journal. DocMetadataClient orchestrates these providers, so a user doesn't need to worry about where metadata comes from; they just request it. A usage example:

async with aiohttp.ClientSession() as session:
    client = DocMetadataClient(session)
    doc_details = await client.query(title="PaperQA: Retrieval-Augmented Generative Agent for Scientific Research")

doc_details is a DocDetails object. DocDetails is a subclass of Doc with extra metadata, and instances can be merged with the + operator (__add__). This way, users can use whichever MetadataProviders they want and get back a single DocDetails object that fills in whatever fields those providers can supply. DocDetails carries rich metadata, for example:

>>> print(doc_details.formatted_citation)
Jakub L'ala, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. Paperqa: retrieval-augmented generative agent for scientific research. ArXiv, Dec 2023. doi:10.48550/arxiv.2312.07559. This article has 21 citations.
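
The merge behavior described above can be illustrated with a small sketch. This is not the paperqa implementation: DetailsSketch and its fields are hypothetical stand-ins for DocDetails, and the rule that the right-hand operand wins whenever it has a value is an assumption made for the example.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class DetailsSketch:
    """Hypothetical stand-in for DocDetails; not the real class."""

    title: Optional[str] = None
    doi: Optional[str] = None
    authors: Optional[Tuple[str, ...]] = None
    citation_count: Optional[int] = None

    def __add__(self, other: "DetailsSketch") -> "DetailsSketch":
        # Assumed rule: take the other side's value when it has one,
        # otherwise keep ours, so gaps are filled from either source.
        merged = {}
        for field in fields(self):
            ours = getattr(self, field.name)
            theirs = getattr(other, field.name)
            merged[field.name] = theirs if theirs is not None else ours
        return DetailsSketch(**merged)


# Two partial records, as if they came from two different providers.
from_first = DetailsSketch(title="Example title", doi="10.0000/example")
from_second = DetailsSketch(title="Example title", citation_count=21)
combined = from_first + from_second
```

Here combined carries the DOI from the first record and the citation count from the second, while authors stays unset because neither source supplied it.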

The DocMetadataClient.query method accepts any valid ClientQuery supported by the MetadataProviders you have selected (all of them by default), though only DOIQuery and TitleAuthorQuery are implemented for now. It returns all fields by default, but you can also request just the fields you want:

async with aiohttp.ClientSession() as session:
    # now get with authors just from one source
    s2_client = DocMetadataClient(session, clients=[SemanticScholarProvider])
    s2_details = await s2_client.query(
        title="Augmenting large language models with chemistry tools",
        fields=["title", "doi", "authors"],
    )

The above example queries only Semantic Scholar (not Crossref) and returns only the title, DOI, and authors for that paper. This is useful if you just need a small lookup, such as finding a DOI. Post-processors run by default, but when you pass an explicit clients argument, any processor you leave out of it won't run, as in the example above. To run with JournalQualityPostProcessor enabled:

async with aiohttp.ClientSession() as session:
    # same single-source query, now with the journal quality post-processor enabled
    s2_client = DocMetadataClient(session, clients=[SemanticScholarProvider, JournalQualityPostProcessor])
    s2_details = await s2_client.query(
        title="Augmenting large language models with chemistry tools",
        fields=["title", "doi", "authors", "journal"],
    )

Note we had to add "journal" as a field, otherwise the JournalQualityPostProcessor would have had no journal to work from.
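
To make the dependency on the "journal" field concrete, here is a minimal, network-free sketch. Everything in it is hypothetical: fake_provider, journal_quality_sketch, query_sketch, the canned record, and the score table are invented for illustration, and the ordering (filter to the requested fields, then post-process) is an assumption rather than a description of how DocMetadataClient is written.

```python
import asyncio
from typing import Any, Callable, Dict, Iterable, Sequence

Record = Dict[str, Any]


async def fake_provider(title: str) -> Record:
    """Stand-in for a MetadataProvider; returns a canned record, no network."""
    return {
        "title": title,
        "doi": "10.0000/example",
        "authors": ["A. Author"],
        "journal": "Example Journal",
    }


def journal_quality_sketch(details: Record) -> Record:
    """Stand-in for a post-processor; needs a 'journal' field to do anything."""
    made_up_scores = {"example journal": 2}  # invented values, illustration only
    journal = details.get("journal")
    if journal is None:
        return details  # nothing to score
    return {**details, "journal_quality": made_up_scores.get(journal.lower(), -1)}


async def query_sketch(
    title: str,
    fields: Sequence[str],
    processors: Iterable[Callable[[Record], Record]] = (),
) -> Record:
    record = await fake_provider(title)
    # Assumed order: restrict to the requested fields, then post-process.
    details = {key: value for key, value in record.items() if key in fields}
    for process in processors:
        details = process(details)
    return details


without_journal = asyncio.run(
    query_sketch("Example title", ["title", "doi", "authors"], [journal_quality_sketch])
)
with_journal = asyncio.run(
    query_sketch(
        "Example title", ["title", "doi", "authors", "journal"], [journal_quality_sketch]
    )
)
```

In this sketch only with_journal ends up with a journal_quality entry; the processor runs in both calls but has nothing to score when "journal" was not requested.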

The metadata client has been wired into the Docs.aadd method: if you set the use_doc_details flag to True, your other inputs are used to fetch rich metadata, and a DocDetails object is used rather than the standard Doc object.

Tests are included which use the pytest-vcr module and cassettes to avoid needing network access. Secrets are filtered from the cassettes by default.
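
As a rough illustration of how secrets can be kept out of cassettes with pytest-vcr, a conftest.py might define a vcr_config fixture like the one below. The header names and the REDACTED placeholder are assumptions for the example, not necessarily what this PR's tests/conftest.py filters.

```python
import pytest

# Assumed header names; adjust to whatever credentials the tests actually send.
SENSITIVE_HEADERS = ["authorization", "x-api-key", "Crossref-Plus-API-Token"]

VCR_CONFIG = {
    # Replace each sensitive header value before it is written to a cassette.
    "filter_headers": [(name, "REDACTED") for name in SENSITIVE_HEADERS],
    # Replay existing cassettes; record only when no cassette file exists yet.
    "record_mode": "once",
}


@pytest.fixture(scope="module")
def vcr_config():
    """Picked up by pytest-vcr and passed to VCR for every cassette."""
    return VCR_CONFIG
```

A test decorated with @pytest.mark.vcr would then record or replay its HTTP traffic using these settings.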

jamesbraza · 2024-08-13T23:04:32Z

tests/test_clients.py

+@pytest.mark.asyncio()
+async def test_bulk_doi_search():
+    dois = [
+        "10.1063/1.4938384",


Bonus points for comments documenting why each of these were chosen

jamesbraza · 2024-08-13T23:05:38Z

paperqa/utils.py

+
+
+def create_bibtex_key(author: list[str], year: str, title: str) -> str:
+    FORBIDDEN_KEY_CHARACTERS = {"_", " ", "-", "/", "'", "`", ":"}


Wdyt of moving this to module level?

I think this is specific to bibtex, no?

whitead · 2024-08-13T01:58:30Z

paperqa/docs.py

+        # see if we can upgrade to DocDetails
+        # if not, we can progress with a normal Doc
+        # if "overwrite_fields_from_metadata" is used:
+        # will map "docname" to "key", and "dockey" to "doc_id"


Should we just change these names? Now would be the time

I don't know if we want to break backwards compatibility quite yet. I'm for keeping the names and having extra duplicative fields.

whitead

So fucking cool - great work

Michael Skarlinski added 6 commits August 7, 2024 07:35

first pass at adding clients module

9a5c03d

remove settings.py

1b64abd

add client module and holder docdetails type

356b21a

remove some TODOs and add weird request test

87defb9

stop pulling bibtex if not requested

9918389

better comment on bibtex usage

62df7ce

mskarlin added the enhancement New feature or request label Aug 12, 2024

mskarlin requested review from whitead and jamesbraza August 12, 2024 23:55

fix eof error

6020500

jamesbraza reviewed Aug 13, 2024

mskarlin added 2 commits August 13, 2024 09:15

revert to default email for test cassettes

c3012d5

s2 and crossref module-level api headers

e9a447b

jamesbraza reviewed Aug 13, 2024

mskarlin added 3 commits August 13, 2024 14:07

move doi_url into its own field, refactor all model validators into one method, regenerate all cassettes with new fields

b9e5a1c

add robustness for timeout errors

8d34719

move exception handling into parent method

f154a56

jamesbraza reviewed Aug 13, 2024

mskarlin added 4 commits August 13, 2024 16:51

add explicit "prefer other" to the __add__ method and use crossref in live tests for stability

eaa59be

move clients/utils into utils

826757c

rename text in prompt to citation

32d720a

use loop in tests

ce61a1b

whitead reviewed Aug 14, 2024

adjust citation prompt, add docstring for populate_bibtex_key_citation, replace bibtex extract w pattern, lower timeout threshold in test

bfa119e

whitead self-requested a review August 14, 2024 20:02

whitead approved these changes Aug 14, 2024

add topological run-order via nested sequence

e3d93d0

mskarlin merged commit eced4f3 into main Aug 14, 2024
1 check passed

mskarlin deleted the add-paperdetails-type branch August 14, 2024 21:02

whitead mentioned this pull request Sep 11, 2024

September 2024 release #362

Merged
