auto convert certain Unicode Latin characters to the ASCII equivalents #78

Closed
SuperStormer opened this issue Jul 4, 2023 · 10 comments

Comments

@SuperStormer
Contributor

SuperStormer commented Jul 4, 2023

For example, all of the bold, italic, etc. letters here: https://www.compart.com/en/unicode/block/U+1D400 These are sometimes used instead of ASCII letters in YT titles to make them look fancy. I have personally never seen any of them used for any legitimate purpose in a title.

Current list of applicable characters:

@ajayyy
Owner

ajayyy commented Jul 4, 2023

I have a feeling ChatGPT could be useful for getting a list of character maps

@mchangrh
Contributor

Should this also be used for replacing 🅱️, α/Α (alpha), and similar glyphs that aren't transformations of a letter but are commonly used as replacements?

@ajayyy
Owner

ajayyy commented Jul 20, 2023

ooh, interesting. The new emoji cleaning probably broke a lot of 🅱️ titles

@ajayyy
Owner

ajayyy commented Jul 20, 2023

[Screenshot: Screenshot_20230720-125800_Firefox_Nightly]

Yea...

@ajayyy
Owner

ajayyy commented Jul 20, 2023

I think we should exclude letter emojis until this issue is complete

@mchangrh
Contributor

We might have to add exclusions for B, A and AB since they are in the pictograph range iirc
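
An exclusion like this could look roughly like the following Python sketch. This is not the extension's actual code: the code point ranges are a rough, hypothetical stand-in for a real emoji matcher, and `EMOJI_RANGES`, `LETTER_EMOJIS`, and `strip_emojis` are illustrative names.

```python
# Hypothetical, deliberately rough emoji ranges; a real cleaner would use a
# proper emoji property or library instead of hand-picked ranges.
EMOJI_RANGES = [
    (0x2600, 0x27BF),    # Miscellaneous Symbols, Dingbats
    (0x1F100, 0x1F1FF),  # Enclosed Alphanumeric Supplement (contains 🅰 🅱 🆎)
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, etc.
]

# Letter emojis to keep until they can be mapped to plain letters.
LETTER_EMOJIS = {"\U0001F170", "\U0001F171", "\U0001F18E"}  # 🅰 🅱 🆎


def is_emoji(ch: str) -> bool:
    """Return True if the code point falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def strip_emojis(title: str) -> str:
    """Drop emoji code points, except the excluded letter emojis."""
    return "".join(
        ch for ch in title if ch in LETTER_EMOJIS or not is_emoji(ch)
    )


# The letter emoji survives while the fire emoji is removed.
print(strip_emojis("\U0001F171ig \U0001F525news"))
```

Without the `LETTER_EMOJIS` check, the 🅱 would be stripped along with the other emojis, which is the breakage described above.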

ajayyy added a commit that referenced this issue Jul 20, 2023
Resolves discussion in #78
@ajayyy
Owner

ajayyy commented Jul 20, 2023

Added

@SuperStormer
Contributor Author

SuperStormer commented Jan 17, 2024

A partial solution would be NFKC normalization (String#normalize in JS).

list of character replacements: out.txt

import sys
import unicodedata
from pprint import pprint

# Rows: (char, NFKC form, char name, name of first NFKC char), one per char that NFKC changes.
rows = []
for cp in range(sys.maxunicode + 1):
    c = chr(cp)
    nfkc = unicodedata.normalize("NFKC", c)
    if nfkc != c:
        rows.append((c, nfkc, unicodedata.name(c, None), unicodedata.name(nfkc[0], None)))
pprint(rows)

Note that this doesn't fix letter emojis, nor small caps.
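
To illustrate that limitation, here is a small Python check on a few hand-picked characters. The sample labels are just for illustration; the behavior comes from Python's `unicodedata.normalize`, which should match `String#normalize` in JS for these characters.

```python
import unicodedata

samples = {
    "math bold A": "\U0001D400",     # 𝐀 MATHEMATICAL BOLD CAPITAL A
    "fullwidth A": "\uFF21",         # Ａ FULLWIDTH LATIN CAPITAL LETTER A
    "small cap A": "\u1D00",         # ᴀ LATIN LETTER SMALL CAPITAL A
    "letter emoji B": "\U0001F171",  # 🅱 NEGATIVE SQUARED LATIN CAPITAL LETTER B
}

# NFKC only rewrites characters that have a compatibility decomposition.
results = {
    label: unicodedata.normalize("NFKC", ch) for label, ch in samples.items()
}

for label, ch in samples.items():
    status = "changed" if results[label] != ch else "unchanged"
    print(f"{label}: {ch!r} -> {results[label]!r} ({status})")
```

The first two samples come out as a plain `A`, while the small cap and the letter emoji pass through unchanged, so those would still need a separate mapping.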

@mchangrh
Contributor

I've used NFKD and NFKC normalization in other projects before; it makes a mess of emojis. I prefer https://github.com/gc/confusables for normalization using a map.
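
For a rough idea of what a map-based approach means, here is a minimal Python sketch. It does not use or mirror the API of the linked library; `CONFUSABLES` is a tiny hand-written map and `replace_confusables` is a hypothetical helper, both assumed purely for illustration.

```python
import re

# Tiny, hand-written map for illustration only; a real map would be far larger.
CONFUSABLES = {
    "\U0001F170": "A",  # 🅰
    "\U0001F171": "B",  # 🅱
    "\u03B1": "a",      # α
    "\u0391": "A",      # Α (Greek capital alpha)
}

# Match a mapped character plus an optional emoji variation selector (U+FE0F),
# so that "🅱️" (letter emoji + selector) collapses to a single "B".
_PATTERN = re.compile(
    "(" + "|".join(re.escape(ch) for ch in CONFUSABLES) + ")\uFE0F?"
)


def replace_confusables(text: str) -> str:
    """Replace only the mapped characters; everything else is left untouched."""
    return _PATTERN.sub(lambda m: CONFUSABLES[m.group(1)], text)


# The grinning-face emoji at the end is not in the map, so it is preserved.
print(replace_confusables("\U0001F171\uFE0Fig \u03B1lph\u03B1 \U0001F600"))
```

Because only listed characters are touched, other emojis are left alone. The trade-off is that the map has to be maintained by hand, and entries such as α would also rewrite legitimate Greek text.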

@ajayyy
Owner

ajayyy commented Aug 11, 2024

There's a JS function to normalize strings and one of the modes should work well for this ("𝗠𝘆 𝗖𝗼𝗼𝗹 𝗧𝗶𝘁𝗹𝗲".normalize("NFKD") results in "My Cool Title")
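
The same behavior can be reproduced in Python with `unicodedata.normalize`, shown here as a sketch. The `Café` sample is an added assumption to show one side effect of the "D" forms: they also split accented letters into a base letter plus a combining mark, whereas NFKC recomposes them.

```python
import unicodedata

title = "𝗠𝘆 𝗖𝗼𝗼𝗹 𝗧𝗶𝘁𝗹𝗲"
plain = unicodedata.normalize("NFKD", title)
print(plain)

accented = "Caf\u00e9"  # "Café" with a precomposed é (U+00E9)
nfkd = unicodedata.normalize("NFKD", accented)  # "e" followed by U+0301
nfkc = unicodedata.normalize("NFKC", accented)  # stays precomposed

# NFKD yields one more code point than NFKC for the accented sample.
print(len(nfkd), len(nfkc))
```

If the decomposed accents are unwanted in the final title, NFKC gives the same ASCII result for the styled letters while keeping accented characters as single code points.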

#283

@ajayyy ajayyy closed this as completed in 0712c85 Sep 9, 2024