Identify articles with multiple versions and only use the most recent version #62

Open
gaurav opened this issue Apr 6, 2020 · 2 comments

gaurav commented Apr 6, 2020

Some PubMed articles have multiple versions: for example, PMID 31431825 has four versions in PubMed. Given the stream-based parallel processing system Omnicorp currently uses, I don't think there's any way to identify groups of articles with the same PMID, and there don't appear to be any attributes in the XML to indicate which one is the "current" version (see documentation, example).

Currently, we process each version as a separate article, and so produce multiple copies of the triples for each article. To give a sense of the scale of this problem, it appears to affect 474 PMIDs, each of which has two or more versions.

I propose adding a script that runs before we start parallel processing of the entire corpus. This script will scan the whole corpus and write a list of the versions found for each PMID to a text file. The parallel processors can then skip every version of a PMID except the most recent one, ensuring that we don't include information from earlier versions in our output.
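
A minimal sketch of what such a pre-pass could look like, written in Python for brevity and not taken from Omnicorp's code. It assumes each article's `MedlineCitation` has a direct `PMID` child carrying a `Version` attribute (e.g. `<PMID Version="4">31431825</PMID>`) and that the most recent version is the one with the highest number. The function names, the tab-separated output format, and the second PMID in the sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def latest_versions(xml_sources):
    """Return {pmid: highest version number seen} across all XML sources."""
    latest = {}
    for source in xml_sources:
        # iterparse streams the input instead of loading the whole file at once.
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag != "MedlineCitation":
                continue
            # Only look at the direct PMID child, not PMIDs nested deeper
            # in the record (e.g. in cross-references to other articles).
            pmid_elem = elem.find("PMID")
            if pmid_elem is not None and pmid_elem.text:
                pmid = pmid_elem.text.strip()
                # Assumption: a missing Version attribute means version 1.
                version = int(pmid_elem.get("Version", "1"))
                if version > latest.get(pmid, 0):
                    latest[pmid] = version
            # Drop the children of the record we just finished with.
            elem.clear()
    return latest


def write_version_list(latest, out):
    """Write one 'PMID<TAB>latest version' line per PMID, sorted by PMID."""
    for pmid in sorted(latest, key=int):
        out.write(f"{pmid}\t{latest[pmid]}\n")


# Tiny illustrative input: 31431825 is the PMID from this issue;
# 12345 is a made-up placeholder for a single-version article.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID Version="1">31431825</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="4">31431825</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">12345</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

versions = latest_versions([io.BytesIO(SAMPLE)])
listing = io.StringIO()
write_version_list(versions, listing)
```

In a real run the sources would be the corpus files rather than an in-memory sample, and the list could be restricted to PMIDs seen with more than one version to keep the text file small.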

@cbizon Do you think this is the right approach for ROBOKOP?
@balhoff Is there a cleverer way of figuring out which articles are the latest version that I'm missing?


balhoff commented Apr 7, 2020

@gaurav I think that seems reasonable if the script is relatively quick.

gaurav self-assigned this Apr 7, 2020

gaurav commented Apr 8, 2020

It does seem pretty quick: it took around 2.5 hours on the cluster without parallelization and gave me a list of 1,358 PMIDs with multiple versions. As per our conversation in #63 (comment), I'll try using akka-stream/Monix/ZIO to parallelize it before I turn it into a pull request, which should speed it up by as much as 32x (assuming 32 cores and near-linear scaling). For now, I'll focus on modifying Main so that it ignores all but the last version of each of these PMIDs and so avoids producing duplicates.
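
A minimal sketch of the skipping step, in Python for brevity and not the actual change to Main. It assumes the version list is a text file with one hypothetical `PMID<TAB>latest version` line per multi-version PMID, and that each article in the stream arrives with its PMID and version number already extracted. All function names here are illustrative.

```python
def load_version_list(lines):
    """Parse 'PMID<TAB>version' lines into {pmid: latest version}."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pmid, version = line.split("\t")
        latest[pmid] = int(version)
    return latest


def is_latest(pmid, version, latest):
    """True if this version should be processed.

    Assumption: PMIDs that are absent from the list have only one version,
    so they are always kept.
    """
    return version == latest.get(pmid, version)


def keep_latest(articles, latest):
    """Yield only the (pmid, version, payload) tuples that are the latest version."""
    for pmid, version, payload in articles:
        if is_latest(pmid, version, latest):
            yield pmid, version, payload


# Illustrative data: 31431825 is the PMID from this issue; 12345 is made up.
latest = load_version_list(["31431825\t4\n"])
articles = [
    ("31431825", 1, "first version"),
    ("31431825", 4, "fourth version"),
    ("12345", 1, "only version"),
]
kept = list(keep_latest(articles, latest))
```

Because the check needs only the article itself and a read-only lookup table, each parallel worker can apply it independently without coordinating with the others.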
