is_person may produce misleading results for strings containing CJK characters #14

Closed
mhucka opened this issue May 22, 2023 · 1 comment


mhucka commented May 22, 2023

is_person() in name_utils.py returns False if a name string consists entirely of CJK characters. I wrote it this way because name checkers like ProbablePeople can't handle CJK. However, the result is obviously wrong if the string really is a human name.
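
For illustration, here is a minimal sketch of how an "all CJK" test could be written. This is not the code in name_utils.py; the function name `is_all_cjk` and the choice of Unicode character-name prefixes are assumptions made for this example only.

```python
import unicodedata

# Hypothetical helper, not taken from name_utils.py.
# Unicode character names that begin with one of these prefixes are
# treated as CJK for the purposes of this sketch.
CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_all_cjk(text: str) -> bool:
    """Return True if every non-whitespace character looks like CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    # unicodedata.name() returns the default "" for unnamed code points,
    # which then fails the prefix test.
    return all(
        unicodedata.name(c, "").startswith(CJK_NAME_PREFIXES) for c in chars
    )
```

A string such as "王小明" passes this test, and that is the kind of input for which is_person() currently returns False without looking any further.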

@mhucka mhucka added Bug 🐛 Something isn't working Priority ★★★ High priority labels May 22, 2023
@mhucka mhucka self-assigned this May 22, 2023
@mhucka mhucka added this to the 1.3.0 milestone Jul 13, 2023

mhucka commented May 15, 2024

A partial fix is now in the dev branch and will be in the upcoming 1.3.0 release. The new implementation of is_person() is not very accurate when it comes to names in CJK scripts, but it is still better than the current situation (which is that it always returns False for CJK names).

Solving this problem properly turns out to be very difficult. I wish I could do something better than the current weak, home-grown heuristics. Unfortunately, this appears to be a research-grade problem that no one has solved. Even the best AI systems today can't reliably tell you if, say, a given 1-3 character sequence in Chinese is the name of a person.
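
To show why simple heuristics fall short, here is a toy sketch of one such rule: accept a short string whose first character is a common Chinese surname. This is not the heuristic used in the dev branch; the function name, the length limits, and the tiny surname set are all assumptions made for this example.

```python
# Hypothetical, deliberately tiny surname set; a real list would be far longer.
COMMON_SURNAMES = frozenset("王李张刘陈高")


def looks_like_chinese_name(text: str) -> bool:
    """Guess whether a short string is a Chinese personal name.

    Toy rule: 2-3 characters long, starting with a known surname.
    """
    text = text.strip()
    return 2 <= len(text) <= 3 and text[0] in COMMON_SURNAMES


print(looks_like_chinese_name("王小明"))  # plausible personal name
print(looks_like_chinese_name("高山"))    # also an ordinary phrase ("high mountain")
```

Both calls return True, even though the second string is also an ordinary phrase. The rule has no way to tell the two cases apart, and it rejects any name whose surname is missing from the list.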

The current solution may be as good as we can get for now. I'm going to close this issue because it is unlikely that I can devote more time to this matter.

@mhucka mhucka closed this as completed May 15, 2024
mhucka added a commit that referenced this issue May 15, 2024
This now attempts to make is_person() handle names written in Chinese,
Japanese and Korean scripts. It uses multiple heuristics to do this,
and it is not very accurate right now. However, it's still better than
it was before, because before, is_person() would simply return False
for names written in CJK scripts.