auto convert certain Unicode Latin characters to the ASCII equivalents #78
Comments
I have a feeling ChatGPT could be useful for getting a list of character maps
should this also be used for replacing
ooh, interesting. The new emoji cleaning probably broke a lot of
I think we should exclude letter emojis until this issue is complete
We might have to add exclusions for B, A and AB since they are in the pictograph range iirc
Added
A partial solution would be NFKC normalization (String#normalize in JS). List of character replacements: out.txt

import sys
import unicodedata
from pprint import pprint

# List every code point whose NFKC form differs from the character itself.
rows = []
for c in map(chr, range(sys.maxunicode + 1)):
    nfkc = unicodedata.normalize("NFKC", c)
    if nfkc != c:
        rows.append((c, nfkc, unicodedata.name(c, None), unicodedata.name(nfkc[0], None)))
pprint(rows)

Note that this fixes neither letter emojis nor small caps.
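To make the coverage of NFKC concrete, here is a small sketch using only Python's standard `unicodedata` module. The helper name `nfkc` and the sample strings are illustrative and not taken from the project. It shows that mathematical bold and fullwidth letters fold to ASCII, while small caps and a letter emoji (regional indicator) pass through unchanged.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text (illustrative helper)."""
    return unicodedata.normalize("NFKC", text)


# Sample inputs, written as escapes so the code points are unambiguous.
bold = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428"  # MATHEMATICAL BOLD "Hello"
fullwidth = "\uFF21\uFF22\uFF23"  # FULLWIDTH LATIN CAPITAL LETTERS A, B, C
small_caps = "\u029C\u026A"  # LATIN LETTER SMALL CAPITAL H and I
letter_emoji = "\U0001F1E6"  # REGIONAL INDICATOR SYMBOL LETTER A

for label, sample in [
    ("bold", bold),
    ("fullwidth", fullwidth),
    ("small caps", small_caps),
    ("letter emoji", letter_emoji),
]:
    changed = nfkc(sample) != sample
    print(f"{label}: {sample!r} -> {nfkc(sample)!r} (changed: {changed})")
```

Small caps and regional indicators have no compatibility decomposition in the Unicode data, so they would need a separate, explicit mapping.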
I've used NFKD and NFKC normalization in other projects before, and it makes a mess of emojis. I prefer https://github.com/gc/confusables for normalization using a map
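As a rough illustration of the trade-off, the sketch below first shows NFKC rewriting symbols that can also render as emoji, then builds an explicit map restricted to the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). This is not the API of the confusables library. The names `build_ascii_map`, `ASCII_MAP` and `to_ascii` are hypothetical, and deriving the map from `unicodedata.decomposition` is just one assumed way to generate it.

```python
import unicodedata

# NFKC also rewrites symbols that have emoji presentations.
print(unicodedata.normalize("NFKC", "\u2122"))  # TRADE MARK SIGN becomes "TM"
print(unicodedata.normalize("NFKC", "\u203C"))  # DOUBLE EXCLAMATION MARK becomes "!!"

BLOCK_START, BLOCK_END = 0x1D400, 0x1D7FF  # Mathematical Alphanumeric Symbols


def build_ascii_map() -> dict[int, str]:
    """Map styled letters and digits in the block to their ASCII targets."""
    table: dict[int, str] = {}
    for cp in range(BLOCK_START, BLOCK_END + 1):
        # Entries look like "<font> 0041"; unassigned code points give "".
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] == "<font>":
            target = chr(int(parts[1], 16))
            # Skip Greek and other non-ASCII targets.
            if target.isascii() and target.isalnum():
                table[cp] = target
    return table


ASCII_MAP = build_ascii_map()


def to_ascii(text: str) -> str:
    """Replace only the mapped characters; everything else is left alone."""
    return text.translate(ASCII_MAP)


styled = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428 \u2122"
print(to_ascii(styled))  # bold letters folded, trade mark sign preserved
```

Because the map only touches code points it lists, emoji and other symbols are untouched. One known gap in this sketch: a few styled letters live outside the block (for example italic h is U+210E in Letterlike Symbols), so they would need to be added to the map by hand.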
For example, all of the bold, italic, etc. letters at https://www.compart.com/en/unicode/block/U+1D400, which are sometimes used instead of ASCII letters in YT titles to make them look fancy or something. I have personally never seen any of them used for any legitimate purpose in a title.
Current list of applicable characters: