The identical paper is not recognized as duplicate #2708

Closed
bernhard-kleine opened this issue Apr 3, 2017 · 17 comments

@bernhard-kleine

JabRef 4.0.0-dev--snapshot--2017-04-03--master--b45575649
Windows 7 6.1 amd64
Java 1.8.0_121

Steps to reproduce:
0. Open PMID 21497511 on PubMed.

  1. Use JabFox/Zotero to import it.
  2. In JabRef, click on the DOI symbol to open the paper on the journal's home page.
  3. Import the article again via JabFox/Zotero.
  4. Although authors, journal, volume, pages, and DOI are all identical between the two entries, the article is not flagged as a duplicate in JabRef's import dialog, nor is it found with Quality->Find Duplicates.

This is very strange indeed. I wonder what JabRef now considers a duplicate of an entry.

@Siedlerchr
Member

Maybe related to #2687. @lynyus, maybe you have an idea?

@bernhard-kleine
Author

bernhard-kleine commented Apr 3, 2017

Maybe, but the difference is that the papers in that issue are not identical, while here JabRef treated the very same paper as not identical. The question here is what makes them non-duplicates; there it is why the PhysRev entries are not distinguished. In both cases it may come down to what makes papers similar or dissimilar.

@stefan-kolb
Member

@bernhard-kleine Can you please post the BibTeX code for both entries?

@stefan-kolb stefan-kolb added the status: waiting-for-feedback The submitter or other users need to provide more information about the issue label Apr 4, 2017
@matthiasgeiger
Member

matthiasgeiger commented Apr 4, 2017

Crossref:

@Article{Schwartz_2011,
  author    = {Daniel R. Schwartz and Mitchell A. Lazar},
  title     = {Human resistin: found in translation from mouse to man},
  journal   = {Trends in Endocrinology {\&} Metabolism},
  year      = {2011},
  month     = {apr},
  doi       = {10.1016/j.tem.2011.03.005},
  publisher = {Elsevier {BV}},
}

vs. Pubmed:

@Article{,
  author          = {Schwartz, Daniel R and Lazar, Mitchell A},
  title           = {Human resistin: found in translation from mouse to man.},
  journal         = {Trends in endocrinology and metabolism: TEM},
  year            = {2011},
  volume          = {22},
  pages           = {259--265},
  month           = jul,
  issn            = {1879-3061},
  abstract        = {The discovery of resistin 10 years ago as a fat cell-secreted factor that modulates insulin resistance suggested a link to the current obesity-associated epidemics of diabetes and cardiovascular disease, which are major human health concerns. Although adipocyte-derived resistin is indisputably linked to insulin resistance in rodent models, the relevance of human resistin is complicated because human resistin is secreted by macrophages rather than adipocytes, and because of the descriptive nature of human epidemiology. In this review, we examine the recent and growing evidence that human resistin is an inflammatory biomarker and a potential mediator of diabetes and cardiovascular disease.},
  chemicals       = {Inflammation Mediators, RETN protein, human, Resistin, Retn protein, mouse},
  citation-subset = {IM},
  completed       = {2011-11-08},
  country         = {United States},
  created         = {2011-07-05},
  doi             = {10.1016/j.tem.2011.03.005},
  issn-linking    = {1043-2760},
  issue           = {7},
  keywords        = {Adipocytes, White, immunology, metabolism, secretion; Animals; Cardiovascular Diseases, metabolism; Diabetes Mellitus, Type 2, metabolism; Humans; Inflammation Mediators, blood, metabolism; Insulin Resistance; Macrophages, immunology, metabolism, secretion; Mice; Obesity, metabolism; Resistin, blood, metabolism, secretion; Species Specificity},
  mid             = {NIHMS289392},
  nlm             = {PMC3130099},
  nlm-id          = {9001516},
  owner           = {NLM},
  pii             = {S1043-2760(11)00049-X},
  pmc             = {PMC3130099},
  pmid            = {21497511},
  pubmodel        = {Print-Electronic},
  pubstatus       = {ppublish},
  revised         = {2017-02-20},
}

The problem seems to be that title, author, and various other fields are slightly different. However, since both entries have the same DOI, shouldn't that generally be sufficient to indicate a duplicate?

@stefan-kolb stefan-kolb self-assigned this Apr 4, 2017
@bernhard-kleine
Author

Sorry for being late; I thought the answer from Matthias was sufficient.

testj.bib.txt

By the way, you should add .bib to the list of uploadable file types.

@stefan-kolb
Member

Ok, so here is an evaluation of the scenario.

The two entries were not considered equal because:

  • first, all required fields for the article type are compared; each one contributes to the equality score if its values match with a word-based correlation of at least 0.8
  • the journal field is not considered equal because its correlation is just below the 0.8 threshold
  • the overall score is 0.76
  • because that is within 0.05 of the equality threshold of 0.75, the optional fields are evaluated as well
  • naturally, the optional fields make the two entries look different, as one entry has many optional fields that the other either lacks or has with different values
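
The evaluation above can be sketched roughly as follows. This is a Python illustration only, not JabRef's Java code: the thresholds 0.8, 0.75 and 0.05 come from the list above, while the position-wise word measure and all function names are assumptions, so it will not reproduce JabRef's exact scores.

```python
FIELD_MATCH_THRESHOLD = 0.8   # per-field word correlation needed to count as a match
DUPLICATE_THRESHOLD = 0.75    # overall score needed to call two entries duplicates
DOUBT_RANGE = 0.05            # scores this close to the threshold trigger the optional-field check


def word_similarity(a: str, b: str) -> float:
    """Share of word positions that agree, ignoring case (assumed, simplified measure)."""
    words_a, words_b = a.lower().split(), b.lower().split()
    if not words_a and not words_b:
        return 1.0
    matches = sum(1 for x, y in zip(words_a, words_b) if x == y)
    return matches / max(len(words_a), len(words_b))


def field_matches(a: str, b: str) -> bool:
    """A field only contributes to the score if its values are similar enough."""
    return word_similarity(a, b) >= FIELD_MATCH_THRESHOLD


def needs_optional_fields(required_score: float) -> bool:
    """Borderline scores near the duplicate threshold bring in the optional fields."""
    return abs(required_score - DUPLICATE_THRESHOLD) <= DOUBT_RANGE


# Journal values taken from the two entries posted earlier in this thread.
crossref_journal = r"Trends in Endocrinology {\&} Metabolism"
pubmed_journal = "Trends in endocrinology and metabolism: TEM"

print(field_matches(crossref_journal, pubmed_journal))  # False: the journal field does not count
print(needs_optional_fields(0.76))                      # True: 0.76 is within 0.05 of 0.75
```

Even with this simplified measure, the two spellings of the journal name fall below the per-field threshold, which is the kind of near miss that pushes an otherwise identical pair into the optional-field comparison.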

To fix this properly we need a better duplicate detection algorithm, which is not easy and would be a lot of work.

For this case I have added a step that checks whether an identifier such as the DOI is equal and, if so, considers the entries equal immediately.
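
A minimal sketch of such an identifier shortcut, in Python rather than JabRef's Java and with hypothetical names (`normalize_doi`, `same_doi`, `is_duplicate`, `fuzzy_check`), might look like this. Requiring a non-empty DOI on both sides is an assumption of the sketch, added so that two entries without a DOI are not matched by accident.

```python
def normalize_doi(value: str) -> str:
    """Trim whitespace and lowercase the DOI; DOIs are case-insensitive."""
    return value.strip().lower()


def same_doi(entry_a: dict, entry_b: dict) -> bool:
    """True only if both entries carry a DOI and the DOIs agree."""
    doi_a = normalize_doi(entry_a.get("doi", ""))
    doi_b = normalize_doi(entry_b.get("doi", ""))
    return bool(doi_a) and doi_a == doi_b


def is_duplicate(entry_a: dict, entry_b: dict, fuzzy_check) -> bool:
    """Short-circuit on a shared DOI, otherwise fall back to the field comparison."""
    if same_doi(entry_a, entry_b):
        return True
    return fuzzy_check(entry_a, entry_b)


# The two entries from this thread, reduced to the fields that matter here.
crossref = {"journal": r"Trends in Endocrinology {\&} Metabolism",
            "doi": "10.1016/j.tem.2011.03.005"}
pubmed = {"journal": "Trends in endocrinology and metabolism: TEM",
          "doi": "10.1016/j.tem.2011.03.005"}

# Stand-in for the existing field-based comparison, which rejected this pair.
def never_equal(a, b):
    return False

print(is_duplicate(crossref, pubmed, never_equal))  # True, because the DOIs agree
```

The shortcut only ever adds matches; pairs without a shared DOI still go through the original comparison unchanged.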

@stefan-kolb
Member

Thank you for your report 👍
This should be fixed in current master. Please try the latest build from http://builds.jabref.org/master.

stefan-kolb added a commit that referenced this issue Apr 4, 2017
stefan-kolb added a commit that referenced this issue Apr 4, 2017
@bernhard-kleine
Author

Unfortunately, there is now an issue with unrelated entries being identified as duplicates.
I installed today's snapshot and imported the two entries again. They were recognized as duplicates, which is fine. However, I then ran Quality->Find duplicates on a bib file with 760+ entries in which no duplicates had been found so far. Totally unrelated entries were identified as duplicates. The two attachments show an example. As you will see, DOI, authors, volume, pages, etc. differ between the two entries; only the journal title and ISSN are similar. That is definitely wrong. Please try again; this is obviously not the correct solution.
PMID 27294923 and PMID 25569080 are not the same.

screenjabref_20170404_1
screenjabref_20170404_2

@matthiasgeiger matthiasgeiger reopened this Apr 4, 2017
@matthiasgeiger
Member

@stefan-kolb Can you please check again?

@stefan-kolb
Member

@bernhard-kleine Are you running the latest dev version? There was one version in which the ISSN was also classified as an identifier, but now only DOI, PUBMED and EPRINT remain. 766e555
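
To illustrate why the ISSN is a poor choice here: it identifies the journal, not the paper, so any two articles from the same journal would share it. The Python sketch below uses made-up placeholder entries and a hypothetical helper `shared_identifier`; the lowercase field names are assumptions and this is not the code from the commit.

```python
def shared_identifier(entry_a: dict, entry_b: dict, identifier_fields):
    """Return the first field whose non-empty values agree in both entries, else None."""
    for field in identifier_fields:
        value_a = entry_a.get(field, "").strip().lower()
        value_b = entry_b.get(field, "").strip().lower()
        if value_a and value_a == value_b:
            return field
    return None


# Placeholder data: two different papers that appeared in the same journal.
paper_a = {"doi": "10.1000/example.a", "pmid": "111", "issn": "1234-5678"}
paper_b = {"doi": "10.1000/example.b", "pmid": "222", "issn": "1234-5678"}

# With the ISSN in the list, the unrelated papers collide on it.
print(shared_identifier(paper_a, paper_b, ("doi", "pmid", "eprint", "issn")))  # issn

# Restricted to per-paper identifiers, nothing matches.
print(shared_identifier(paper_a, paper_b, ("doi", "pmid", "eprint")))  # None
```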

@bernhard-kleine
Author

bernhard-kleine commented Apr 4, 2017 via email

@stefan-kolb
Member

Well, there is no single master build for today, as every commit results in a new build.
Unfortunately, I cannot reproduce this for the latest build.

@bernhard-kleine
Author

bernhard-kleine commented Apr 4, 2017 via email

@bernhard-kleine
Author

JabRef 4.0.0-dev--snapshot--2017-04-04--master--b71420628 (from today) and master--766e555c0 (from 16:11 yesterday)
Both show this.

@matthiasgeiger
Member

@stefan-kolb Since I thought your changes might already have solved it, I checked with JabRef_4_0_0-dev--snapshot--2017-04-04--master--b71420628 before re-opening the issue.
Using this build, I could reproduce the behavior reported by @bernhard-kleine with the PMIDs mentioned above.

@matthiasgeiger
Member

@bernhard-kleine This should now be fixed in the latest dev builds.

@bernhard-kleine
Author

With the latest update, the only duplicates found were real duplicates. Thanks for your efforts; they are much appreciated.

Siedlerchr added a commit that referenced this issue Apr 5, 2017
* upstream/master:
  fix ID consideration in DuplicateCheck
  Add ArXiv identifier batch lookup (#2710)
  Update mockito from 2.7.19 to 2.7.21
  More defensive identifier list #2708
  Revert "Add more identifier field names #2708"
  Add more identifier field names #2708
  Consider entries as equal if their DOI matches #2708
  Imports
  Imports
  Move duplicate detection to logic
  Reuse edit distance class
  Refactoring
  EntryTypeDialog Fetching Autogenerates BibTeX Key (#2709)
  Add changelog entry
  Increase permitted size of StringUtil
  Make sure that JavaFx shuts down in case another JabRef instance is already open
Siedlerchr added a commit that referenced this issue Apr 6, 2017
* upstream/master: (35 commits)
  Update antlr from 4.6 to 4.7
  Fix build
  fix ID consideration in DuplicateCheck
  Add ArXiv identifier batch lookup (#2710)
  Update mockito from 2.7.19 to 2.7.21
  More defensive identifier list #2708
  Revert "Add more identifier field names #2708"
  Add more identifier field names #2708
  Consider entries as equal if their DOI matches #2708
  Imports
  Imports
  Move duplicate detection to logic
  Reuse edit distance class
  Refactoring
  EntryTypeDialog Fetching Autogenerates BibTeX Key (#2709)
  Add changelog entry
  Increase permitted size of StringUtil
  Make sure that JavaFx shuts down in case another JabRef instance is already open
  Remove obsolete localization strings
  Hide context menu before group edit/add (probably a JavaFX vs Swing problem)
  ...

# Conflicts:
#	src/main/java/org/jabref/gui/groups/GroupTreeController.java
#	src/main/java/org/jabref/gui/groups/GroupTreeViewModel.java