Propagate section information from new Reach implementation #1399

bgyori · 2022-11-24T20:05:48Z

This PR adapts INDRA to recent changes in how Reach extracts section information when reading NXML files. An older implementation existed, but Reach stopped producing section names at some point, and the reinstated implementation differs from it, so the code on the INDRA side also had to be adapted. I collected empirical statistics on the (unnormalized) section names that occur and used them to improve their normalization.
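
To illustrate the kind of normalization described above, here is a minimal sketch. It is not the code in this PR: the keyword table, the normalized labels, and the function name `normalize_section` are all assumptions made for illustration, and the variants INDRA actually handles may differ.

```python
import re
from typing import Optional

# Hypothetical keyword-to-label table; the labels and variants that
# INDRA actually uses may differ from these.
SECTION_KEYWORDS = {
    "abstract": "abstract",
    "introduction": "introduction",
    "background": "introduction",
    "method": "methods",
    "material": "methods",
    "result": "results",
    "discussion": "discussion",
    "conclusion": "conclusion",
}


def normalize_section(raw: Optional[str]) -> Optional[str]:
    """Map a raw section title to a normalized label, or None if unknown."""
    if not raw:
        return None
    # Lowercase and drop leading numbering such as "2." or "3.1"
    text = re.sub(r"^[\d.\s]+", "", raw.strip().lower())
    # Collapse internal runs of whitespace
    text = re.sub(r"\s+", " ", text)
    # The first matching keyword (in table order) wins
    for keyword, label in SECTION_KEYWORDS.items():
        if keyword in text:
            return label
    return None
```

Because the first matching keyword wins, a combined title such as "Results and Discussion" resolves to a single label. Whether that is the right choice depends on how the section label is used downstream.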

Independently, it looks like PubMed changed its search API to return at most 10k IDs per search instead of 100k, which required updating the tests. I also improved how we get MeSH IDs from the non-standard MeSH URNs produced by MedScan.
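
As a rough sketch of what respecting the lower limit could look like on the client side, the snippet below caps the requested number of IDs when building search parameters. The helper `build_esearch_params` and the constant `PUBMED_MAX_RETMAX` are hypothetical names, not part of INDRA, and the 10,000 cap is taken from the observation above rather than from a verified specification.

```python
from typing import Dict

# Assumed cap, based on the behavior observed in this PR (10k IDs per search)
PUBMED_MAX_RETMAX = 10_000


def build_esearch_params(term: str, retmax: int = PUBMED_MAX_RETMAX) -> Dict[str, str]:
    """Build E-utilities esearch query parameters with retmax capped."""
    if retmax < 0:
        raise ValueError("retmax must be non-negative")
    return {
        "db": "pubmed",
        "term": term,
        # Never request more IDs than the assumed server-side maximum
        "retmax": str(min(retmax, PUBMED_MAX_RETMAX)),
        "retmode": "json",
    }
```

Tests that previously asserted on more than 10k returned IDs would need their expectations lowered to match this cap.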

bgyori added 6 commits November 24, 2022 12:17

Reimplement section finding for new format

9bee779

Add API function to process FRIES files

c5272b7

Move testing to 3.7+

dd0cb6d

Handle more section variants

f3650f9

Deal with updated PubMed retmax

ec23724

Implement Gilda grounding for Medscan

4de0741

bgyori mentioned this pull request Nov 24, 2022

Question about section type in REACH processor #1388

Closed

Fix API function

67fed37

bgyori merged commit 1e0eda4 into sorgerlab:master Nov 25, 2022

bgyori deleted the reach_sections branch November 25, 2022 03:50
