RefExtract: no hyphen between first and last page #3555

Open
ksachs opened this issue Jul 23, 2018 · 2 comments


@ksachs
Contributor

ksachs commented Jul 23, 2018

Just for the record. I don't know if this is an issue for labs.

Expected Behavior

If there is a line break in a reference between the first page and the last page (/\d+\-\n\d+/), keep the hyphen; don't delete it because of the line break.

Current Behavior on legacy

If there is a line-break in the page-range, legacy swallows the hyphen and concatenates first- and last-page.
E.g.

F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale in-
variance explains everything, Astronomy and Astrophysics 282,262-268 (1994);
II.Build your own law from disk models, Astronomy and Astrophysics 282,269-
276 (1994).

-> Astron.Astrophys., 282,269276

@michamos
Contributor

It's a bug still present in refextract:

In [1]: from refextract import extract_references_from_string

In [2]: extract_references_from_string('''F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale in-
   ...: variance explains everything, Astronomy and Astrophysics 282,262-268 (1994);
   ...: II.Build your own law from disk models, Astronomy and Astrophysics 282,269-
   ...: 276 (1994).''')
* could not find references section
* references separator ^[^\s]
* tags u'<cds.AUTHstnd>F.Graner and B.Dubrulle</cds.AUTHstnd>, Titius-Bode laws in the solar system : I.scale invariance explains everything, <cds.JOURNAL>Astron. Astrophys.</cds.JOURNAL> <cds.VOL>282</cds.VOL> <cds.YR>(1994)</cds.YR> <cds.PG>262-268</cds.PG>;'
* splitted_citations
  * line marker  
  * elements
    * AUTH {'auth_type': 'stnd', 'auth_txt': u'F.Graner and B.Dubrulle', 'type': 'AUTH', 'misc_txt': u''}
    * JOURNAL {'volume': u'282', 'is_ibid': False, 'title': u'Astron. Astrophys.', 'extra_ibids': [], 'year': u'1994', 'type': 'JOURNAL', 'misc_txt': u', Titius-Bode laws in the solar system : I.scale invariance explains everything, ', 'page': u'262-268'}
    * YEAR {'type': 'YEAR', 'misc_txt': '', 'year': u'1994'}
* tags u'II.Build your own law from disk models, <cds.JOURNAL>Astron. Astrophys.</cds.JOURNAL> <cds.VOL>282</cds.VOL> <cds.YR>(1994)</cds.YR> <cds.PG>269276</cds.PG>.'
* splitted_citations
  * line marker  
  * elements
    * JOURNAL {'volume': u'282', 'is_ibid': False, 'title': u'Astron. Astrophys.', 'extra_ibids': [], 'year': u'1994', 'type': 'JOURNAL', 'misc_txt': u'II.Build your own law from disk models, ', 'page': u'269276'}
    * YEAR {'type': 'YEAR', 'misc_txt': '', 'year': u'1994'}
Out[2]: 
[{'author': [u'F.Graner and B.Dubrulle'],
  'journal_page': [u'262-268'],
  'journal_reference': ['Astron. Astrophys. 282 (1994) 262-268'],
  'journal_title': [u'Astron. Astrophys.'],
  'journal_volume': [u'282'],
  'journal_year': [u'1994'],
  'misc': [u'Titius-Bode laws in the solar system : I.scale invariance explains everything'],
  'raw_ref': ['F.Graner and B.Dubrulle, Titius-Bode laws in the solar system : I.scale invariance explains everything, Astronomy and Astrophysics 282,262-268 (1994);'],
  'year': [u'1994']},
 {'journal_page': [u'269276'],
  'journal_reference': ['Astron. Astrophys. 282 (1994) 269276'],
  'journal_title': [u'Astron. Astrophys.'],
  'journal_volume': [u'282'],
  'journal_year': [u'1994'],
  'misc': [u'II.Build your own law from disk models'],
  'raw_ref': ['II.Build your own law from disk models, Astronomy and Astrophysics 282,269276 (1994).'],
  'year': [u'1994']}]

As you can see, the missing hyphen is already present in the raw_ref, so the bug must be somewhere in the text pre-processing in https://github.com/inspirehep/refextract/blob/master/refextract/references/text.py. I don't think we have the resources here to fix refextract bugs: it's very complex, nobody here knows how it works, and in the future we will probably switch, at least partially, to something like GROBID. But you're free to have a look if you want to.

@ksachs
Contributor Author

ksachs commented Feb 7, 2019

It should be possible to prevent that bug by deleting line breaks in page ranges
before refextract handles line breaks:
fulltext = re.sub(r'(\d)-\n(\d)', r'\1-\2', fulltext)
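As a minimal sketch of that workaround, the substitution can be wrapped in a small helper and applied to the text before it is handed to refextract. The helper name `join_broken_page_ranges` is hypothetical and not part of refextract; only the regular expression comes from the comment above.

```python
import re

# Regex taken from the comment above: digit, hyphen, line break, digit.
PAGE_RANGE_BREAK = re.compile(r'(\d)-\n(\d)')


def join_broken_page_ranges(fulltext):
    """Drop a line break that directly follows the hyphen of a numeric range.

    Hypothetical helper (not part of refextract): the hyphen is kept and only
    the newline is removed, so a later step that strips "-\n" no longer sees
    the page range as a hyphenated word.
    """
    return PAGE_RANGE_BREAK.sub(r'\1-\2', fulltext)


reference = (
    "II.Build your own law from disk models, "
    "Astronomy and Astrophysics 282,269-\n"
    "276 (1994)."
)
print(join_broken_page_ranges(reference))
```

Ordinary hyphenated words such as "in-\nvariance" are left untouched, because the pattern requires a digit on both sides of the break. The trade-off is that any digit-hyphen-newline-digit sequence is treated as a range, which may not always be what the original text meant.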
