Authors getting incorrect alternate names after merge #498

Open
tfmorris opened this issue May 24, 2017 · 14 comments
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] hacktoberfest Issues appropriate for Hacktoberfest participants Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Authors Module: Identifier Resolution For resolving records based on identifiers or patterns Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]

Comments

@tfmorris
Contributor

tfmorris commented May 24, 2017

In looking at the Charles Dickens record, we see the following bad name forms, many of which could be detected and eliminated automatically:

Non-author contributors conflated:

"Flo Gibson (Narrator)",
"illustrated by Arthur Rackham Charles Dickens",
"Introduction-John Carey",
"Mike; Spencer, John (editors) (Charles Dickens; Lord Halifax; Edgar Allan Poe; Bram Stoker; O. Henry; William Mudford; Frederick Marryat; Matthew Lewis; William Makepeace Thackeray; W. W. Jacobs; Saki) Jarvis"

Bad capitalization, spelling, etc:

"Dickens",
"DICKENS",
"Dickens, Charles",
"Charles",
"CHARLES DICKENS",
"CHARLES. DICKENS",
"Charles dickens",
"Dickens Charles.",
"C DICKENS",
"Charled Dickens",
"Dickens Charles",
"harles Dickens",

Dates or transliterations embedded:

"Dickens, Charles, 1812-1870.",
"DICKENS, CHARLES, 1812-1870",
"Charles Dickens Charles Dickens",
"Charles 1812-1870 Dickens",
"Dickens, Charles,$d1812-1870",
"Dickens,Charles ディケンズ,チャールズ (1812-1870)",
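A rough sketch of how some of these categories might be flagged automatically. The function name, the flag labels, and the regular expressions are all hypothetical and chosen only to cover the examples listed above; this is not existing Open Library code, and real data would need broader patterns.

```python
import re

# Hypothetical heuristics covering the examples in this comment only.
ROLE_RE = re.compile(
    r"\b(?:narrator|illustrated by|introduction|editors?)\b", re.IGNORECASE
)
# A four-digit year range such as 1812-1870, not glued to other digits.
DATES_RE = re.compile(r"(?<!\d)\d{4}\s*-\s*\d{4}(?!\d)")
# Leftover MARC subfield delimiter such as "$d".
MARC_SUBFIELD_RE = re.compile(r"\$[a-z0-9]")


def flag_alternate_name(name: str) -> list[str]:
    """Return a list of reasons why an alternate name looks suspicious."""
    flags = []
    if ROLE_RE.search(name):
        flags.append("role")
    if DATES_RE.search(name):
        flags.append("dates")
    if MARC_SUBFIELD_RE.search(name):
        flags.append("marc_subfield")
    if name != name.strip():
        flags.append("whitespace")
    return flags


for candidate in ["Flo Gibson (Narrator)", "Dickens, Charles,$d1812-1870"]:
    print(candidate, "->", flag_alternate_name(candidate))
```

Flagging rather than deleting keeps a human in the loop for forms that may still be useful for search or matching.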

@LeadSongDog

Looking at the history https://openlibrary.org/authors/OL24638A/Charles_Dickens?m=history shows that most of that happened on author merges. See the diffs.

@tfmorris
Contributor Author

I assumed as much, but I don't think that makes a difference when it comes to cleanup -- or am I missing something?

@LeadSongDog

The entry "Mike; Spencer, John (editors) (Charles Dickens; Lord Halifax; Edgar Allan Poe; Bram Stoker; O. Henry; William Mudford; Frederick Marryat; Matthew Lewis; William Makepeace Thackeray; W. W. Jacobs; Saki) Jarvis" came from some work (W) record. Rather than just removing it from the merged author (A) record, that W should be corrected too. The old A will now be a redirect; I think it should be deleted after the Ws are corrected.

@LeadSongDog

A simpler case: OL6034980A is "Joseph Barrell " with a trailing space, apparently created (with other similar cases) by ImportBot on 27 Oct 2008 while importing ia:evolutioneartha02huntgoog into OL20493274M. The other authors on that edition were similarly afflicted.
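For the trailing-space case, a minimal sketch of whitespace normalization that an importer could apply before saving a name. The function name `clean_name` is hypothetical and is not taken from the ImportBot code.

```python
def clean_name(name: str) -> str:
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return " ".join(name.split())


print(repr(clean_name("Joseph Barrell ")))
```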

@hornc hornc added the Module: Merging Record merging label Sep 4, 2017
@hornc hornc added the metadata label Sep 18, 2017
@mekarpeles mekarpeles changed the title from "Remove invalid/inappropriate/bad alternate names for authors" to "Authors getting incorrect alternate names after merge" Mar 13, 2018
@mekarpeles
Member

related: #117

@xayhewalo xayhewalo added Affects: Data Issues that affect book/author metadata or user/account data. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] State: Backlogged Type: Bug Something isn't working. [managed] labels Oct 27, 2019
@xayhewalo
Collaborator

@tfmorris Would you recommend fixing this issue at the Infobase level or at the Solr level? Also, are you or @hornc willing to be the assignee for this issue? Note that being the assignee doesn't necessarily mean you are responsible for doing the work, just for gathering and providing the information needed to address the issue. From the Wiki:

The assigned owner is not necessarily the person who will fix the issue (it is not necessarily even established, at that point, if or when the issue will be fixed at all), but rather they are the person who will do as much or as little as needed to handle the issue (asking questions, soliciting input, establishing and updating the priority, checking if it is a duplicate, etc).

Once an issue is labeled State: Work In Progress, the owner is the individual doing the work, or leading/coordinating the group that is doing the work.

I've added labels per context: let me know your thoughts

@tfmorris
Contributor Author

We could attempt a short term patch to the Solr updater to mitigate some of the most egregious issues, but long term we need to fix:

  • importer(s) - so they don't contribute to the problem
  • the author merge code - so that it normalizes and eliminates near duplicates
  • the database (bulk update to clean up existing issues)

Most of the problem is cosmetic, so I consider it lower priority.

@xayhewalo xayhewalo added the Priority: 3 Issues that we can consider at our leisure. [managed] label Nov 26, 2019
@cdrini cdrini added Needs: Lead Good First Issue Easy issue. Good for newcomers. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Module: Merging Record merging metadata labels Apr 20, 2020
@mekarpeles mekarpeles added the hacktoberfest Issues appropriate for Hacktoberfest participants label Oct 5, 2020
@pranjii

pranjii commented Sep 3, 2021

Can I get this?

@mekarpeles mekarpeles added Module: Authors and removed Good First Issue Easy issue. Good for newcomers. [managed] labels Dec 19, 2022
@LeadSongDog

Hmm, I had the impression that the trailing spaces had been cleaned up, but I just came upon and fixed OL6027001A, which had been untouched since 2008:
[Screenshots: IMG_1638, IMG_1639]

@cdrini
Collaborator

cdrini commented Jun 12, 2023

Basic approach for the alternate names issue:

Use uniq with a custom key fn. The key fn should:

  1. Make case insensitive
  2. Remove punctuation
  3. Optional: Remove any string that is a substring of the main string.

eg given:

| Original Name | Converted Name (uniq key) |
| --- | --- |
| Charles Dickens | charles dickens |
| CHARLES DICKENS. | charles dickens |
| CHARLES. DICKENS | charles dickens |

Then we dedupe and choose just one instance of Charles Dickens, because they all have the same uniq key.
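A minimal sketch of the dedupe described above, assuming plain Python with no helper library. `name_key` and `dedupe_names` are hypothetical names, only ASCII punctuation is handled, and the optional substring step is left out.

```python
import string

# Map each ASCII punctuation character to a space so that forms like
# "Dickens,Charles" do not fuse into a single word.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def name_key(name: str) -> str:
    """Build the uniq key: punctuation removed, case folded, spaces collapsed."""
    without_punct = name.translate(_PUNCT_TO_SPACE)
    return " ".join(without_punct.casefold().split())


def dedupe_names(names):
    """Keep the first name seen for each uniq key, preserving input order."""
    seen = set()
    kept = []
    for name in names:
        key = name_key(name)
        if key and key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


print(dedupe_names(["Charles Dickens", "CHARLES DICKENS.", "CHARLES. DICKENS"]))
```

Which spelling survives depends on input order here; a merge would probably want to put the preferred form first.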

@tfmorris
Contributor Author

@cdrini Sounds like you're basically talking about creating the equivalent of an ICU primary strength sorting key (perhaps with some pre-processing cleanup). I suggest you just use PyICU directly to avoid having to reimplement everything, particularly for non-English names. https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels
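To illustrate the idea of a key that ignores case and accent differences, here is a standard-library stand-in. It is explicitly not ICU: it will not produce ICU sort keys and knows nothing about locale-specific rules, which is the reason PyICU is suggested above. The function name `approx_primary_key` is hypothetical.

```python
import unicodedata


def approx_primary_key(name: str) -> str:
    """Approximate a case- and accent-insensitive comparison key (not ICU)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base characters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(base.casefold().split())


print(approx_primary_key("Émile Zola") == approx_primary_key("EMILE ZOLA"))
```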

@LeadSongDog

Low hanging fruit that should be simpler to automate.
Here’s a trivial case where the correction is simply to extract the dates from the name and put them in the born and died fields:
https://openlibrary.org/authors/OL12376689A/William_Grimshaw?b=2&a=1&_compare=Compare&m=diff
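A sketch of that extraction, assuming the dates appear as a trailing year range as in "Dickens, Charles, 1812-1870." from the top of this issue. The function name and the returned tuple layout are hypothetical, and the actual record field names are not assumed here.

```python
import re

# Name, optional separators, then a trailing "YYYY-YYYY" (death year optional),
# optionally wrapped in parentheses and followed by a period.
TRAILING_DATES_RE = re.compile(
    r"^(?P<name>.*?)[\s,]*\(?(?P<born>\d{4})\s*-\s*(?P<died>\d{4})?\)?\.?\s*$"
)


def split_name_and_dates(raw: str):
    """Return (name, born, died); born/died are None when no trailing dates."""
    match = TRAILING_DATES_RE.match(raw)
    if not match:
        return raw.strip(), None, None
    return match["name"].strip(), match["born"], match["died"]


print(split_name_and_dates("Dickens, Charles, 1812-1870."))
```

Names with dates in the middle, such as "Charles 1812-1870 Dickens", are deliberately left untouched by this pattern and would need separate handling.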

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jun 3, 2024
@cdrini cdrini removed the Needs: Response Issues which require feedback from lead label Jun 7, 2024
@seabelis
Collaborator

Common misspellings are okay to keep as someone might legitimately search for those. As for the various formats, would keeping them present help new imports match up correctly? For example, if dates are included?

@mekarpeles mekarpeles added the Module: Identifier Resolution For resolving records based on identifiers or patterns label Dec 17, 2024

8 participants