`is_person` may produce misleading results for strings containing CJK characters #14

mhucka · 2023-05-22T17:33:09Z

is_person() in name_utils.py returns False if a name string consists entirely of CJK characters. I wrote it this way because name checkers such as ProbablePeople can't handle CJK. However, that result is obviously wrong if the string really is a human name.
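To make the "all-CJK" condition concrete, here is a minimal sketch of how such a check could be written. It is not the code in name_utils.py: the helper names `is_cjk_char` and `is_all_cjk` are hypothetical, and the set of Unicode blocks is an assumption that covers only the most common ranges.

```python
# Hypothetical helpers for illustration; not taken from name_utils.py.
# The ranges cover common blocks only (an assumption, not an exhaustive list).
_CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk_char(ch):
    """Return True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)


def is_all_cjk(text):
    """Return True if text has at least one non-space character and all are CJK."""
    chars = [ch for ch in text if not ch.isspace()]
    return bool(chars) and all(is_cjk_char(ch) for ch in chars)
```

With a check like this, a string such as "山田太郎" counts as all-CJK, while a mixed string such as "Wang 伟" does not.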


mhucka · 2024-05-15T03:38:07Z

A partial fix is now in the dev branch and will be in the upcoming 1.3.0 release. The new implementation of is_person() is not very accurate when it comes to names in CJK scripts, but it is still better than the current situation (which is that it always returns False for CJK names).

Solving this problem properly turns out to be very difficult. I wish I could do something better than the current weak, home-grown heuristics. Unfortunately, this appears to be a research-grade problem that no one has solved. Even the best AI systems today can't reliably tell you if, say, a given 1-3 character sequence in Chinese is the name of a person.

The current solution may be as good as we can get for now. I'm going to close this issue because it is unlikely that I can devote more time to this matter.

This now attempts to make is_person() handle names written in Chinese, Japanese and Korean scripts. It uses multiple heuristics to do this, and it is not very accurate right now. However, it's still better than it was before, because before, is_person() would simply return False for names written in CJK scripts.
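For readers wondering what a heuristic of this kind might look like, the following is a purely illustrative sketch. It does not reproduce the heuristics actually used in is_person(): the function name `looks_like_cjk_person`, the tiny surname sample, and the length window are all assumptions made for the example.

```python
# Illustrative only; NOT the heuristics used by is_person().
# A tiny hypothetical sample of common surnames; a real list would be far larger.
COMMON_SURNAMES = {
    "王", "李", "张", "張", "陈", "陳",   # Chinese
    "김", "이", "박",                     # Korean
    "佐藤", "鈴木", "田中",               # Japanese
}


def looks_like_cjk_person(name):
    """Guess whether a CJK string is a personal name (weak heuristic)."""
    # Drop any whitespace between family and given name.
    name = "".join(name.split())
    # Assumed length window: a surname plus at least one more character,
    # and no more than six characters in total.
    if not 2 <= len(name) <= 6:
        return False
    # Family name comes first in these scripts, so test the prefix.
    return any(name.startswith(surname) for surname in COMMON_SURNAMES)
```

A check like this accepts "王伟" and "田中太郎" but rejects "東京大学". It also shows why the problem is hard: many surnames double as ordinary words, so a prefix match produces false positives, and any name whose surname is missing from the list is rejected.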

mhucka added the Bug 🐛 (Something isn't working) and Priority ★★★ (High priority) labels May 22, 2023

mhucka self-assigned this May 22, 2023

mhucka added this to the 1.3.0 milestone Jul 13, 2023

mhucka closed this as completed May 15, 2024
