Authors getting incorrect alternate names after merge #498
Comments
Looking at the history https://openlibrary.org/authors/OL24638A/Charles_Dickens?m=history shows that most of that happened on author merges. See the diffs. |
I assumed as much, but I don't think that makes a difference when it comes to cleanup -- or am I missing something? |
The entry "Mike; Spencer, John (editors) (Charles Dickens; Lord Halifax; Edgar Allan Poe; Bram Stoker; O. Henry; William Mudford; Frederick Marryat; Matthew Lewis; William Makepeace Thackeray; W. W. Jacobs; Saki) Jarvis" came from some W record. Rather than just removing it from the merged A record, that W should be corrected too. The old A record is now a redirect; I think it should be deleted once the W records are corrected. |
That seems to have happened at https://openlibrary.org/authors/OL2895898A?m=diff&b=2 and https://openlibrary.org/authors/OL24638A/Charles_Dickens?m=diff&b=16. That leaves the question of which work linked to https://openlibrary.org/authors/OL2895898A?v=1 before the merge at https://openlibrary.org/recentchanges/2012/03/04/merge-authors/45842283 was done. |
A simpler case: OL6034980A is "Joseph Barrell " with a trailing space, apparently created (with other similar cases) by ImportBot on 27 Oct 2008 while importing ia:evolutioneartha02huntgoog into OL20493274M. The other authors on that edition were similarly afflicted. |
related: #117 |
@tfmorris Would you recommend fixing this issue at the Infobase level or at the Solr level? Also, are you or @hornc willing to be assignee for this issue? Note, being the assignee doesn't necessarily mean you are responsible for doing the work, just responsible for gathering/providing information to address the issue. From the Wiki.
I've added labels per context: let me know your thoughts |
We could attempt a short term patch to the Solr updater to mitigate some of the most egregious issues, but long term we need to fix:
Most of the problem is cosmetic, so I consider it lower priority. |
Can I get this? |
Basic approach for the alternate names issue: Use
eg given:
Then we dedupe and choose just one instance of Charles Dickens, because they all have the same uniq key. |
@cdrini Sounds like you're basically talking about creating the equivalent of an ICU primary strength sorting key (perhaps with some pre-processing cleanup). I suggest you just use PyICU directly to avoid having to reimplement everything, particularly for non-English names. https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels |
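To make the key-based dedupe idea concrete, here is a minimal sketch using only the Python standard library. The exact key proposed above is not shown in this thread, so `name_key` and `dedupe_names` are hypothetical names, and the normalization steps (accent stripping, case folding, punctuation removal, token sorting) are assumptions. It is only a rough approximation of a primary-strength comparison, not ICU collation; PyICU would likely handle non-English names better.

```python
import re
import unicodedata


def name_key(name: str) -> str:
    """Hypothetical dedupe key: ignores accents, case, punctuation and token order."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, case-fold, and sort the tokens so that
    # "Dickens, Charles" and "Charles Dickens" produce the same key.
    tokens = re.sub(r"[^\w\s]", " ", stripped.casefold()).split()
    return " ".join(sorted(tokens))


def dedupe_names(names):
    """Keep the first name seen for each key, preserving input order."""
    seen = {}
    for name in names:
        seen.setdefault(name_key(name), name)
    return list(seen.values())


variants = [
    "Charles Dickens",
    "CHARLES DICKENS",
    "Dickens, Charles",
    "Charles dickens",
    "Dickens Charles.",
    "CHARLES. DICKENS",
    "Charled Dickens",  # misspelling: gets its own key, so it is kept
]
print(dedupe_names(variants))
```

Under these assumptions the capitalization and ordering variants collapse into one entry, while a misspelling such as "Charled Dickens" survives as a separate alternate name.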
Low hanging fruit that should be simpler to automate. |
Common misspellings are okay to keep as someone might legitimately search for those. As for the various formats, would keeping them present help new imports match up correctly? For example, if dates are included? |
In looking at the Charles Dickens record, we see the following bad name forms, many of which could be detected and eliminated automatically:
Non-author contributors conflated:
"Flo Gibson (Narrator)",
"illustrated by Arthur Rackham Charles Dickens",
"Introduction-John Carey",
"Mike; Spencer, John (editors) (Charles Dickens; Lord Halifax; Edgar Allan Poe; Bram Stoker; O. Henry; William Mudford; Frederick Marryat; Matthew Lewis; William Makepeace Thackeray; W. W. Jacobs; Saki) Jarvis"
Bad capitalization, spelling, etc:
"Dickens",
"DICKENS",
"Dickens, Charles",
"Charles",
"CHARLES DICKENS",
"CHARLES. DICKENS",
"Charles dickens",
"Dickens Charles.",
"C DICKENS",
"Charled Dickens",
"Dickens Charles",
"harles Dickens",
Dates or transliterations embedded:
"Dickens, Charles, 1812-1870.",
"DICKENS, CHARLES, 1812-1870",
"Charles Dickens Charles Dickens",
"Charles 1812-1870 Dickens",
"Dickens, Charles,$d1812-1870",
"Dickens,Charles ディケンズ,チャールズ (1812-1870)",
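Several of the bad forms listed above follow patterns that could be flagged mechanically. The sketch below is one possible heuristic pass, not Open Library's actual cleanup code: the function name `flag_name`, the flag labels, and the regular expressions are all assumptions chosen to match the examples in this issue.

```python
import re

# Assumed patterns, derived only from the examples in this issue.
ROLE_PATTERN = re.compile(
    r"\b(narrator|illustrated by|introduction|editors?)\b", re.IGNORECASE
)
DATE_PATTERN = re.compile(r"\d{4}\s*-\s*\d{4}")
MARC_SUBFIELD_PATTERN = re.compile(r"\$[a-z0-9]")


def flag_name(name: str) -> list:
    """Return a list of hypothetical flags describing why a name looks suspect."""
    flags = []
    if ROLE_PATTERN.search(name):
        flags.append("role")  # contributor role mixed into the name
    if DATE_PATTERN.search(name):
        flags.append("dates")  # birth and death years embedded
    if MARC_SUBFIELD_PATTERN.search(name):
        flags.append("marc_subfield")  # leftover MARC subfield code such as $d
    if name != name.strip():
        flags.append("whitespace")  # leading or trailing whitespace
    tokens = name.split()
    half = len(tokens) // 2
    if half and len(tokens) % 2 == 0 and tokens[:half] == tokens[half:]:
        flags.append("repeated")  # the whole name appears twice
    return flags


for candidate in [
    "Flo Gibson (Narrator)",
    "Dickens, Charles,$d1812-1870",
    "Charles Dickens Charles Dickens",
    "Joseph Barrell ",
    "Charles Dickens",
]:
    print(repr(candidate), flag_name(candidate))
```

A name that returns an empty list is not necessarily good; it simply matched none of these assumed patterns. Misspellings and single-token forms such as "Dickens" would need a different check.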