Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly import LCCNs #2865

Merged
merged 24 commits into from
Jan 21, 2020
Merged

Correctly import LCCNs #2865

merged 24 commits into from
Jan 21, 2020

Conversation

hornc
Copy link
Collaborator

@hornc hornc commented Jan 15, 2020

e.g. http://lccn.loc.gov/64-11739
is the same as http://lccn.loc.gov/64011739
but the canonical LCCN listed on each is 64011739. OL will now store 64011739 to aid comparisons

It does not strip spaces,
http://lccn.loc.gov/ca34001802
shows the LCCN with a space: ca 34001802

The input for this example is ca 34-1802
After this PR, we will store the LCCN as ca 34001802
prior to this PR, we were reading it as 38 -- which is very broken.

Technical

There are far too many duplicate functions, I am putting clear deprecation notices in the code, and only deleting code when I am sure it is not being used.

This has turned into a pretty big refactor of the MARC code -- the first problem I had was figuring out which of the multiple lccn methods was being used and needed to be fixed. The refactor helps label the deprecated MARC code, and moves a lot of the used code into the correct place, and ensures the code that is used is the code under test. The refactoring is not complete, but the work needed to fix the LCCN issue is complete.

Testing

I'll do some live import tests locally to confirm none of the deprecated modules are being used by the MARC import paths.

Evidence

Stakeholders

@hornc hornc requested review from cdrini and tfmorris January 16, 2020 00:56
openlibrary/catalog/get_ia.py Outdated Show resolved Hide resolved
Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we've already invested all the work that we could have saved, but would it make sense to switch to PyMARC rather than continuing to maintain our own parser?

For the deprecation warnings, I think we should use the Deprecated module and its @deprecated decorator rather than rolling our own.

I presume the large blocks of "new" code are actually just copied from somewhere else, so didn't review them closely. I didn't run tests locally since they passed on CI.

openlibrary/catalog/get_ia.py Outdated Show resolved Hide resolved
openlibrary/catalog/marc/build_record.py Outdated Show resolved Hide resolved
openlibrary/catalog/marc/build_record.py Outdated Show resolved Hide resolved
openlibrary/catalog/marc/fast_parse.py Outdated Show resolved Hide resolved
openlibrary/catalog/marc/fast_parse.py Show resolved Hide resolved
openlibrary/catalog/marc/marc_binary.py Show resolved Hide resolved
Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, thought I was just approving the new commit, not the entire branch.

I still think it would be an improvement to use the @deprecated decorator.

@hornc hornc force-pushed the hotfix/2851-LCCNs branch from 1fd0246 to 2a0ea28 Compare January 21, 2020 00:22
@hornc
Copy link
Collaborator Author

hornc commented Jan 21, 2020

@tfmorris regarding PyMARC, I have tried to simply refactor the code that is currently in use, identify where it is by removing the most glaring duplication, and deprecate duplicate code paths that probably shouldn't be used. Using replacing the current code with PyMARC should be another effort, if we require it.

From what I have seen from this refactor, various imports, and from working with PyMARC separately, I think OL handles more cases than PyMARC, which is a problem in itself, and the cause of #2877

This code does use PyMARC for decoding MARC8, which is good.

from pymarc import MARC8ToUnicode
from openlibrary.catalog.marc import mnemonics
import six
marc8 = MARC8ToUnicode(quiet=True)

I'm not sure how widely these mnemonics are used: https://github.com/internetarchive/openlibrary/blob/e8d48561a90c3da5ab9a13d09c4117a10a2187c5/openlibrary/catalog/marc/mnemonics.py , but if PyMARC can parse those, we should replace it.

I also don't know where 'wrapped line' MARC records are described -- PyMARC does not support it. I'm not sure if it is an official field wrapping convention, but OL does support field continuations that end in ++

def handle_wrapped_lines(_iter):
"""
Handles wrapped MARC fields, which appear to be multiple
fields with the same field number ending with ++
Have not found an official spec which describe this.
"""
cur_lines = []
cur_tag = None
maybe_wrap = False
for t, l in _iter:
if len(l) > 500 and l.endswith('++\x1e'):
assert not cur_tag or cur_tag == t
cur_tag = t
cur_lines.append(l)
continue
if cur_lines:
yield cur_tag, cur_lines[0][:-3] + ''.join(i[2:-3] for i in cur_lines[1:]) + l[2:]
cur_tag = None
cur_lines = []
continue
yield t, l
assert not cur_lines

@hornc
Copy link
Collaborator Author

hornc commented Jan 21, 2020

@mekarpeles This PR will require a new pip install on production servers:

Deprecated==1.2.7
https://github.com/internetarchive/openlibrary/pull/2865/files#diff-16ea6285ffdbe776a9b61dbdb6c501a9R4

@hornc hornc requested a review from tfmorris January 21, 2020 01:23
Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't run tests, but I reviewed all the code and didn't spot any problems by inspection.

Thanks for making the suggested changes!

Copy link
Member

@mekarpeles mekarpeles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, should be blocked on @mekarpeles for installing deprecated module on test + prod

@mekarpeles mekarpeles self-assigned this Jan 21, 2020
@mekarpeles mekarpeles added the State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] label Jan 21, 2020
@mekarpeles
Copy link
Member

source /opt/openlibrary/venv/bin/activate;sudo pip install Deprecated==1.2.7 installed on ol-dev0, ol-web3, ol-web4, and ol-home.

@mekarpeles mekarpeles removed the State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] label Jan 21, 2020
@mekarpeles mekarpeles merged commit 5007d19 into master Jan 21, 2020
@cdrini cdrini deleted the hotfix/2851-LCCNs branch May 6, 2021 19:23
@mekarpeles mekarpeles added python Pull requests that update Python code and removed Module: Python labels Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Pull requests that update Python code Theme: Identifiers Issues related to ISBN's or other identifiers in metadata. [managed] Theme: MARC records
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LCCN incorrectly imported when containing alpha characters
4 participants