Build ContentMine-based workflow for "main subject" of papers in Wikidata #51

Open
Daniel-Mietchen opened this issue Mar 5, 2017 · 6 comments

Comments

@Daniel-Mietchen
Collaborator

ContentMine can analyze papers in various ways, including as to what the most salient terms are, e.g. via https://en.wikipedia.org/wiki/Tf%E2%80%93idf .

It would be nice to harvest that to annotate Wikidata items about papers with the property P921 "main subject".
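To make the tf-idf idea concrete, here is a minimal sketch in plain Python. It is not ContentMine's implementation: the tokenizer, the function names `tokenize`, `tf_idf` and `top_terms`, and the choice of tf as relative frequency with idf as log(N / df) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a crude assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(score, key=lambda t: (-score[t], t))[:k]
```

Calling `top_terms(tf_idf(docs)[i])` gives candidate salient terms for document `i`; those terms would still need to be mapped to Wikidata items before they could be used as P921 values.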

@Daniel-Mietchen
Collaborator Author

As a starting point, it would make sense to go for papers that are already on Wikidata and have a P932 (PMCID) statement.

The query for that is

SELECT ?item ?pmcid WHERE {
  ?item wdt:P31 wd:Q13442814;
        wdt:P932 ?pmcid.  
}
#LIMIT 100

Without the LIMIT clause, this took just 6 s and returned 334628 results, which sounds like a good maximum size for a test set.
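For anyone who wants to pull that list programmatically, here is a rough sketch using only the Python standard library. It assumes the public Wikidata Query Service endpoint at `https://query.wikidata.org/sparql` and its standard SPARQL JSON result layout; the helper names and the User-Agent string are made up for illustration, and the LIMIT is enabled here to keep a first test small.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, with the LIMIT enabled for a small test run.
QUERY = """
SELECT ?item ?pmcid WHERE {
  ?item wdt:P31 wd:Q13442814;
        wdt:P932 ?pmcid.
}
LIMIT 100
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request asking for JSON results."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def parse_bindings(payload):
    """Turn a SPARQL JSON result into (QID, PMCID) pairs."""
    pairs = []
    for row in payload["results"]["bindings"]:
        # The item comes back as an entity URI; keep only the trailing Q-number.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        pairs.append((qid, row["pmcid"]["value"]))
    return pairs


def fetch_pairs(query=QUERY):
    """Run the query over the network and return the parsed pairs."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "main-subject-sketch/0.1 (example)"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

`fetch_pairs()` needs network access; `parse_bindings` can be tried offline on a saved result file.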

@petermr

petermr commented Mar 5, 2017

Daniel-Mietchen and I discussed this with the possible outcomes of:

High-level strategy

Collect a corpus of Open articles and carry out supervised term analysis of the content, supported by #wikidata-enhanced dictionaries. Articles with a "main topic" which maps onto #Wikidata items (Q\d+) are likely to have many mentions of the main topic. For example article http://europepmc.org/articles/PMC2491585 mentions

  • DENV (Q476209) x 88 // Dengue virus
  • Dengue (Q30953) x 8 // Dengue fever
  • Yellow fever x 3

and the most common terms (Bag of words) are:

  • HLA (Q911125) x 39 // Human leukocyte antigen
  • peptide (Q172847) x 33

We can infer that the main topic of the article is Dengue Virus and antigenicity. This is consistent with the title:

Conservation and variability of dengue virus proteins: implications for vaccine design.

The term "vaccine" occurs 16 times in the main text, whereas "HLA" and "peptide" occur 39 and 33 times respectively, so the mechanism of vaccination is emphasised.
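The counting step in the example above can be sketched roughly as follows. This is only an illustration, not the ContentMine tooling: the mini-dictionary is hypothetical (the QIDs are the ones quoted above, and treating "dengue virus" as a synonym of DENV is an assumption), and matching is a naive case-insensitive whole-word search that also accepts a trailing "s".

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: surface form -> Wikidata QID.
# QIDs are those quoted in the example; the synonym entry is an assumption.
DICTIONARY = {
    "DENV": "Q476209",
    "dengue virus": "Q476209",
    "HLA": "Q911125",
    "peptide": "Q172847",
}


def count_mentions(text, dictionary=DICTIONARY):
    """Count whole-word, case-insensitive mentions per QID (simple plural 's' allowed)."""
    counts = Counter()
    for term, qid in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"s?\b"
        counts[qid] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def main_subject_candidates(text, k=3):
    """Return up to k (QID, count) pairs, most frequently mentioned first."""
    return [(qid, n) for qid, n in count_mentions(text).most_common(k) if n > 0]
```

Several surface forms feed the same QID, so synonyms add up to one count per Wikidata item rather than competing with each other.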

Corpus of articles:

@Daniel-Mietchen
Collaborator Author

OK, I've added these to https://www.wikidata.org/wiki/Q24288762#P921 .

How can we scale that up? Can you provide a list of the following kind?

  • Wikidata item for scientific article that has a PMCID (see my comment above)
  • Wikidata items for the top 3 - 5 topics identified as "main subject" (in the sense of P921) as per your comment above
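One possible shape for such a list, sketched in Python: one tab-separated line per (article, P921, topic) triple, which resembles the layout QuickStatements V1 accepts as far as I know. The function name `statement_rows` and the row layout are assumptions for illustration; nothing in this thread fixes the format.

```python
def statement_rows(article_qid, ranked_topics, max_topics=5):
    """Build tab-separated 'article<TAB>P921<TAB>topic' lines for the top topics.

    ranked_topics is a list of (topic QID, mention count) pairs, best first.
    Duplicate topic QIDs are skipped so each statement appears once.
    """
    rows = []
    seen = set()
    for topic_qid, _count in ranked_topics:
        if topic_qid in seen:
            continue
        seen.add(topic_qid)
        rows.append("\t".join([article_qid, "P921", topic_qid]))
        if len(rows) == max_topics:
            break
    return rows
```

For the article discussed above, `statement_rows("Q24288762", [("Q476209", 88), ("Q911125", 39), ("Q172847", 33)])` would yield three lines, one per candidate topic, which could then be reviewed by hand before anything is written to Wikidata.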

@Daniel-Mietchen
Collaborator Author

Reopening this, as we are still working on it.

@Daniel-Mietchen
Collaborator Author

A side project could be to identify the main subject(s) for journals: currently, ca. 40k instances of scientific journal do not have any main subject set in Wikidata.
Query:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5633421.
  MINUS {?item wdt:P921 ?mainsubject.}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  
}

@Daniel-Mietchen
Collaborator Author

Daniel-Mietchen commented Mar 19, 2017 via email
