Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internet Archive import often not setting language #10141

Open
cdrini opened this issue Dec 10, 2024 · 5 comments
Open

Internet Archive import often not setting language #10141

cdrini opened this issue Dec 10, 2024 · 5 comments
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]

Comments

@cdrini
Copy link
Collaborator

cdrini commented Dec 10, 2024

Problem

Eg:

Reproducing the bug

  1. Trigger an import of an IA book
  • Expected behavior: Language on IA is copied over to Open Library
  • Actual behavior: Language is just missing

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
@cdrini cdrini added Type: Bug Something isn't working. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Dec 10, 2024
@NickDevi
Copy link

hello @cdrini I would like to work on this issue. Thank you!

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Dec 12, 2024
@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Dec 16, 2024
@mekarpeles
Copy link
Member

@NickDevi thank you for offering! I think first we'll have @scottbarnes investigate and provide next steps if necessary

@mekarpeles mekarpeles added Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] Needs: Response Issues which require feedback from lead and removed Needs: Response Issues which require feedback from lead labels Dec 16, 2024
@scottbarnes
Copy link
Collaborator

scottbarnes commented Dec 24, 2024

@NickDevi, sorry for the inexcusable delay. If you are still interested in this, I think this is what's going on.

  1. Languages work for NEW imports via the /api/import/ia endpoint (see Direct Import by OCAID.
  2. Languages ARE NOT added to EXISTING items when using that endpoint to supplement an existing record.
  3. This is because 'languages' is not included in fields which are added to an existing record if the field is empty/absent in the existing record. See
    edition_list_fields = [
    'local_id',
    'lccn',
    'lc_classifications',
    'oclc_numbers',
    'source_records',
    ]
  4. HOWEVER, simply adding 'languages' there will cause an import error. Note: we here enter the realm of belief, where I may be mistaken. This seems to be because the language field is in the wrong format in rec at this point. We need it to look like [{"key": "/languages/eng"}], for example, but it seems to look like "languages": ["eng"]. That won't work.
  5. The reason for this seems to be related to why this would work as an initial import (when there is no matching record) but not when matched to an existing record. That is I think because build_query() is not called when re-importing, but is called upon an initial import, and I think that's the only place the language field gets formatted. See
    if k in ('languages', 'translated_from'):
    for language in v:
    if web.ctx.site.get('/languages/' + language.lower()) is None:
    raise InvalidLanguage(language.lower())
    book[k] = [{'key': '/languages/' + language.lower()} for language in v]
    continue

What this probably means is that:

  1. 'languages' needs to be added to add_book/__init__.py as mentioned in (3) above, and;
  2. the language formatting code in build_query() mentioned in (4) above needs to be broken out into an a helper function and called from both build_query() and update_edition_with_rec_data() (found in add_book/__init__.py).

I don't think any of this is necessarily hard in and of itself, but imports are a bit of a quagmire.

If you are interested in this, and I'm more than happy to do this if you think this is will involve too much wandering in the wilderness, you'd want to read:

  1. https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#api-authentication-for-imports (for learning how to set a cookie when using imports locally) and
  2. https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#direct-import-by-ocaid (for importing by OCAID/identifier, such as isbn_9781849353946.

And then test this all locally. This would get you started testing your code using curl:

$ curl -c cookies.txt -X POST "http://localhost:8080/account/login" -d "username=openlibrary@example.com&password=admin123"
$ curl -L -v -X POST "http://localhost:8080/api/import/ia" \
     -b ./cookies.txt \
     -d "identifier=isbn_9781849353946&require_marc=false&force_import=true"
...
{"authors": [{"key": "/authors/OL5A", "name": "Leslie Kaplan", "status": "created"}], "success": true, "edition": {"key": "/books/OL11M", "status": "created"}, "work": {"key": "/works/OL1W", "status": "created"}}%

That should import. If you remove the language (possibly using http://localhost:8080/books/OL11M.yml?m=edit, where OL11M is the edition identifier you got when importing), and then attempt to re-import the same work, it should match, but not add English back as a language, which is the problem and the reason this issue exists.

From there, the fun begins editing the Python code.

Again, please let me know if you're still interested. I would say because of the various parts here, this isn't a great first issue.

Edit: fix file names.

@scottbarnes scottbarnes removed Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] Needs: Response Issues which require feedback from lead labels Dec 24, 2024
@qotkdrn
Copy link

qotkdrn commented Dec 24, 2024

Hello, my name is Alex Bae. I was wondering if I could contribute to this issue?

@scottbarnes
Copy link
Collaborator

@NickDevi, is this still of interest to you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]
Projects
None yet
Development

No branches or pull requests

5 participants