not consuming full EGC for ZWJ #2016

Closed
dankamongmen opened this issue Aug 2, 2021 · 3 comments

@dankamongmen
Owner

I've added a new unit test loading

🤽🏼‍♀️ Woman Playing Water Polo: Medium-Light Skin Tone

into an nccell. We're only loading 11 bytes, but we ought to be loading 17:

U+1F93D WATER POLO
UTF-8: f0 9f a4 bd UTF-16BE: d83edd3d Decimal: &#129341; Octal: \0374475
🤽
Category: So (Symbol, Other); East Asian width: W (wide)
Unicode block: 1F900..1F9FF; Supplemental Symbols and Pictographs
Bidi: ON (Other Neutrals)

U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
UTF-8: f0 9f 8f bc UTF-16BE: d83cdffc Decimal: &#127996; Octal: \0371774

Category: Sk (Symbol, Modifier); East Asian width: W (wide)
Unicode block: 1F300..1F5FF; Miscellaneous Symbols and Pictographs
Bidi: ON (Other Neutrals)

U+200D ZERO WIDTH JOINER
UTF-8: e2 80 8d UTF-16BE: 200d Decimal: &#8205; Octal: \020015

Category: Cf (Other, Format); East Asian width: N (neutral)
Unicode block: 2000..206F; General Punctuation
Bidi: BN (Boundary Neutral)

U+2640 FEMALE SIGN
UTF-8: e2 99 80 UTF-16BE: 2640 Decimal: &#9792; Octal: \023100

Category: So (Symbol, Other); East Asian width: A (ambiguous)
Unicode block: 2600..26FF; Miscellaneous Symbols
Bidi: ON (Other Neutrals)

U+FE0F VARIATION SELECTOR-16
UTF-8: ef b8 8f UTF-16BE: fe0f Decimal: &#65039; Octal: \0177017

Category: Mn (Mark, Non-Spacing); East Asian width: A (ambiguous)
Unicode block: FE00..FE0F; Variation Selectors
Bidi: NSM (Non-Spacing Mark)
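
As a quick check of the byte counts quoted above, here is a small Python sketch (not part of notcurses or its test suite) that encodes each code point from the dump and sums the UTF-8 lengths. The 11 bytes actually loaded equal the encoded length up to and including the ZWJ, which suggests the cluster is being cut immediately after U+200D.

```python
# Code points of the sequence, in the order listed in the dump above.
SEQUENCE = [0x1F93D, 0x1F3FC, 0x200D, 0x2640, 0xFE0F]


def utf8_lengths(codepoints):
    """Return the UTF-8 encoded length, in bytes, of each code point."""
    return [len(chr(cp).encode("utf-8")) for cp in codepoints]


lengths = utf8_lengths(SEQUENCE)
total = sum(lengths)             # whole sequence
through_zwj = sum(lengths[:3])   # U+1F93D, U+1F3FC, U+200D only

print(lengths, through_zwj, total)
```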

dankamongmen added the bug (Something isn't working) label Aug 2, 2021
dankamongmen added this to the 3.0.0 milestone Aug 2, 2021
dankamongmen self-assigned this Aug 2, 2021
@dankamongmen
Owner Author

uc_is_grapheme_break() definitely returns true for U+2640 FEMALE SIGN. i think we're misinterpreting the meaning of a grapheme break?

@dankamongmen
Owner Author

it doesn't appear to be a context-sensitive thing, either, since u8_grapheme_next() also returns 11 bytes here
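
For reference, UAX #29 (since Unicode 11) has a rule, GB11, that forbids a break after a ZWJ when the ZWJ is preceded by an Extended_Pictographic character followed by zero or more Extend characters, and is followed by another Extended_Pictographic character. The Python sketch below illustrates that rule only; it is not libunistring's code, and whether the library version in use implements GB11 is not established in this thread. The property tables are hypothetical stand-ins covering just the five code points from this issue.

```python
# Hypothetical, hand-written property tables: only the code points in this issue.
EXT_PICT = {0x1F93D, 0x2640}   # Extended_Pictographic
EXTEND = {0x1F3FC, 0xFE0F}     # Grapheme_Cluster_Break=Extend
ZWJ = 0x200D


def gb11_forbids_break(cps, i):
    """True if GB11 forbids a break between cps[i-1] and cps[i].

    GB11: Extended_Pictographic Extend* ZWJ x Extended_Pictographic
    """
    if i < 2 or cps[i - 1] != ZWJ or cps[i] not in EXT_PICT:
        return False
    # Walk left over any Extend characters that precede the ZWJ.
    j = i - 2
    while j >= 0 and cps[j] in EXTEND:
        j -= 1
    return j >= 0 and cps[j] in EXT_PICT


polo = [0x1F93D, 0x1F3FC, 0x200D, 0x2640, 0xFE0F]
print(gb11_forbids_break(polo, 3))                      # ZWJ | FEMALE SIGN
print(gb11_forbids_break([0x61, 0x200D, 0x2640], 2))    # 'a' ZWJ | FEMALE SIGN
```

The lookback over Extend characters is why this rule cannot be decided from the two adjacent characters alone: the answer depends on what came before the ZWJ.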

@dankamongmen
Owner Author

if we check for \u200d and force a join following it, we get the expected 17 bytes. i'm not sure we want this...not all sequences are valid, and what if there are two such characters? if nothing else, this does seem to improve the appearance of mojibake...
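
A rough Python sketch of the workaround described above: wrap whatever break test is in use so that it never reports a break immediately after U+200D. This is not the commit that landed in notcurses. `stand_in_break` is a hypothetical predicate written only to reproduce the 11-byte result reported in this thread, and `with_forced_zwj_join` and `first_cluster_bytes` are illustrative names.

```python
ZWJ = 0x200D


def stand_in_break(prev, cur):
    """Hypothetical stand-in for the library's break test (NOT its real logic).

    Treats skin-tone modifiers, ZWJ, and VS-16 as extending the previous
    character, and reports a break before anything else.
    """
    extenders = {0x200D, 0xFE0F} | set(range(0x1F3FB, 0x1F400))
    return cur not in extenders


def with_forced_zwj_join(is_break):
    """Wrap a break test so that nothing ever breaks right after a ZWJ."""
    def wrapped(prev, cur):
        if prev == ZWJ:
            return False
        return is_break(prev, cur)
    return wrapped


def first_cluster_bytes(text, is_break):
    """UTF-8 byte length of the first cluster of text under the given test."""
    if not text:
        return 0
    end = 1
    while end < len(text) and not is_break(ord(text[end - 1]), ord(text[end])):
        end += 1
    return len(text[:end].encode("utf-8"))


EMOJI = "\U0001F93D\U0001F3FC\u200D\u2640\uFE0F"
forced = with_forced_zwj_join(stand_in_break)
print(first_cluster_bytes(EMOJI, stand_in_break))  # cut after the ZWJ
print(first_cluster_bytes(EMOJI, forced))          # whole sequence
```

The concern about invalid sequences shows up in the sketch too: with the forced join, `"a\u200db"` is treated as a single 5-byte cluster even though it is not a meaningful emoji sequence.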

dankamongmen added a commit that referenced this issue Aug 2, 2021
dankamongmen modified the milestones: 3.0.0, 2.4.0 Aug 24, 2021