[css-text-3] Segment Break Transformation Rules for East Asian Width property of A #337
My concerns here are:
I'm happy to make the change if i18n recommends it and implementors agree, but I am hesitant to do so for these reasons. |
@fantasai The current segment break rule in the draft already changes the traditional behavior, and so far no browser implements it. Do you mean you want to drop the rule entirely? And I don't think my proposal is much more complex than the current one. Current rule: My proposal: |
@r12a waiting on i18n feedback before we get this on the CSSWG agenda again |
Just to clarify, the proposal is that if lang=zh|ja|yi then A->W otherwise A->N for the purpose of line-break transformations?
I think that should probably be okay. I would be against making A->W the general case.
|
@fantasai In fact there are two proposals:
|
I think the motivation is reasonable, and some A should be treated as W in a CJ context, especially quotation marks in Chinese. But I am concerned about the A+A case, especially given that there are lots of letters in A. The safest thing to do is probably this: if the context is Chinese or Japanese, and one side of the line break is a punctuation character in A, and the other side is F/W/H, then the segment break is removed. |
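For readers following along, here is a minimal sketch of the conservative rule described in the comment above. It is not spec text or any browser's implementation. The function names, the language-tag prefix match, and the use of Unicode general category "P*" as the definition of "punctuation" are all assumptions made for illustration. It covers only the A-punctuation case, not the existing rule for two F/W/H characters.

```python
import unicodedata

# Assumption: "Chinese or Japanese context" is approximated by the primary
# subtag of a BCP 47 language tag.
CJ_LANGS = {"zh", "ja"}


def is_cj_context(lang: str) -> bool:
    """Return True if the language tag's primary subtag is zh or ja."""
    return lang.lower().split("-")[0] in CJ_LANGS


def is_wide(ch: str) -> bool:
    """East Asian Width is Fullwidth, Wide, or Halfwidth."""
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")


def is_ambiguous_punctuation(ch: str) -> bool:
    """East Asian Width is Ambiguous and the general category is punctuation.

    Assumption: "punctuation" means general category P*; the thread does not
    pin down an exact definition at this point.
    """
    return (
        unicodedata.east_asian_width(ch) == "A"
        and unicodedata.category(ch).startswith("P")
    )


def removes_segment_break(before: str, after: str, lang: str) -> bool:
    """Sketch of: A-punctuation on one side, F/W/H on the other, CJ context."""
    if not is_cj_context(lang):
        return False
    return (is_ambiguous_punctuation(before) and is_wide(after)) or (
        is_wide(before) and is_ambiguous_punctuation(after)
    )
```

Under these assumptions, a break between 中 and U+201C LEFT DOUBLE QUOTATION MARK is removed when the language is `zh`, while a break between two Ambiguous characters, or next to an Ambiguous letter such as Greek α, is kept.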
…segment break if language context is CJY. #337 <https://lists.w3.org/Archives/Public/www-style/2016Oct/0068.html>
@upsuper @kojiishi @hax Checked in a fix, based on Xidorn's suggestion. A+A still keeps the space, but A+F or F+A will delete the space if the A's language context is Chinese/Japanese/Yi. This is more conservative than the original request, because we don't want to break existing pages and A+A is reasonably common on non-CJK pages. An interesting question is, should we be checking the language on the segment break instead of on the A? |
@fantasai
It's basically the same as my "If the context is an East Asian language, A should be treated as W", but I believe checking the language on the segment break is much more precise and clear. |
@fantasai I think our discussion concluded that we do that only for punctuation in A in that language context? It doesn't seem to me that other A characters should have that behavior. |
OK, switched to checking the language of the segment break (rather than the A character), and restricted that rule to punctuation only. More fun: Unicode decided to categorize emoji as Wide for some reason. >:[ |
…, and restrict Ambiguous characters we care about to punctuation/symbols. #337
Fixed to treat Emoji the same as an Ambiguous character: a6aa4d8 |
I think Emoji is too much; it's sometimes surprising and unexpected. The data here: U+0023, U+002A, U+0030-0039 are probably not desired. |
…, and restrict Ambiguous characters we care about to punctuation/symbols. w3c#337
@kojiishi Can you explain what you think the spec should say about this? Definitely we can't rely on EAW for emoji, they are totally inconsistent. E.g. U+1F600 Grinning Face is EAW=Wide while U+263A Smiling Face is EAW=Neutral. Our rules need to treat them the same somehow, and definitely we can't treat emoji as Wide here. |
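The inconsistency mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `eaw` is made up here, and the values reported depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata


def eaw(ch: str) -> str:
    """Return the East Asian Width property value of a single character."""
    return unicodedata.east_asian_width(ch)


# The two characters cited in the comment above.
for ch in ("\U0001F600", "\u263A"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: EAW={eaw(ch)}")
```

On a Python whose Unicode data is version 9.0 or later, the two characters report different width classes, which is why a rule keyed purely on East Asian Width treats them differently.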
I'd prefer not to mention it. The inconsistency has historical reasons, afaiu. Emoji is hard because a character is sometimes rendered as emoji and sometimes not, depending on fonts. In this case, it only has an effect when the author has inserted a segment break before or after. Also, even though there might be cases where it looks strange, it's interoperable, right? |
To avoid the problem mentioned by @kojiishi (U+0023, U+002A, U+0030-0039), we could for this purpose treat W emoji and N emoji as A. The only ones in all that that don't seem to me to really be "emoji" as commonly understood by people are:
But even then, U+00AE REGISTERED SIGN and U+2122 TRADE MARK SIGN are A as well, so lumping COPYRIGHT SIGN with them doesn't bother me. As for U+203C, U+2049, they both are sentence ending punctuation. In Chinese and Japanese typesetting, spaces are generally not inserted around sentence ending punctuation, so treating them as A and discarding the spaces seems OK too. |
We need to add CJK Unified Ideographs Extension G to that list, which is in the upcoming Unicode 13.0. I'm also told that 99.9% of Korean text that you will find on the web uses Western (aka ASCII) punctuation, so perhaps not as much of an issue as first thought, at least in practical terms. |
The "ghost of christmas past", who whispered in my ear about CJK Unified Ideographs Extension G, also suggests that anything Lisu and Khitan Small Script, the latter of which is new in Unicode Version 13.0, should be on the list. 👻 |
Thank you Mike for the investigation, this is really helpful. This code snippet in WebKit may help in developing the list too. Maybe we should agree on the expected accuracy first. My basic idea is:
Do these look reasonable? Any opinions, additions, or change suggestions? |
The VerticalOrientation property could also be good data for developing the list. |
A wild idea came up, maybe the VerticalOrientation property is better than the Unicode Block for this purpose, rather than just using it as a reference to develop a list of blocks? |
I presume the reason we're discussing heuristics at all, and not simply adding another value to And the intention is, roughly, if the segment break is between two CJK ideographs, or between a CJK ideograph and punctuation, collapse it to nothing? I ask because - if this behaviour must remain context-dependent - I'm wondering if it would be easier to add a value to Alternatively: it looks like Gecko is currently collapsing segment breaks between two ideographs, but not between ideographs and punctuation, and Blink/Webkit are doing neither. Perhaps segment breaks could always collapse between two ideographs (an easy and unambiguous test for UAX#14 class ID), but only collapse according to the more complex heuristics if the appropriate property was set? In other words, "always collapse segment breaks between ID characters. Only collapse other (some? all?) segment breaks if (both of these are attempts to reduce both the processing cost of evaluating the heuristic, and the cost of getting the heuristics wrong). |
I'm ok to add a new value, but what are the benefits of the new value? Is it to prevent regressing non-CJK content? How is it different from choosing conservative heuristics?
I take this back. I remember VerticalOrientation is still too aggressive. |
Yes, exactly.
To my non-expert eyes, this particular heuristic appears to be quite hard to get right. That's purely based on the discussion in La Coruña, and re-reading all the comments on this issue (from the last four years!). So I figured it's worth exploring if there's a way to remove the heuristic, or at least drastically reduce its scope. |
This is because Unicode changed the EAW of a lot of characters in an effectively random and backwards-incompatible way when it introduced Emoji. The results based on e.g. Unicode 6, when these rules were written, would have been quite sensible. :/ Trying to compensate for this change is one of the reasons the rules became too complicated... I've committed an initial draft of the Unicode block-based approach. I think the interesting questions remaining are:
I'm leaning towards yes on enclosed ideographics, no on the symbols, and I don't know enough about Bopomofo when it is used as a stand-alone script to say. Lisu and Khitan both use spaces; they should not therefore discard them during collapsing. Small forms etc. are primarily used with Chinese and Japanese, not Korean, so I think it's reasonable to include them here. (Keep in mind also that both sides of the break need to belong to the set in order to discard, and Hangul is excluded.) |
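As a rough sketch of the block-based approach under discussion: the segment break is discarded only when the characters on both sides fall in a set of Chinese/Japanese blocks. The block list below is a small illustrative subset chosen for this example, not the list in the spec draft, and the names `CJ_BLOCKS`, `in_cj_set`, and `discards_segment_break` are hypothetical.

```python
# Illustrative subset only (assumption); the real list is whatever the
# spec draft defines. Hangul blocks are deliberately absent.
CJ_BLOCKS = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def in_cj_set(ch: str) -> bool:
    """Return True if the character's code point falls in one of the blocks."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in CJ_BLOCKS)


def discards_segment_break(before: str, after: str) -> bool:
    """Both sides of the break must belong to the set for it to be discarded."""
    return in_cj_set(before) and in_cj_set(after)
```

With this subset, a break between 中 and 文 is discarded, while a break between a Hangul syllable and an ideograph, or between an ideograph and a Latin letter, is kept, because one side falls outside the set.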
Are we talking about one or both of these Khitan scripts (presumably the former, as I don't think the latter is in Unicode): https://en.wikipedia.org/wiki/Khitan_small_script If yes, do they really use spaces? Where can I learn more about that? If not, what are we talking about? |
…they're used in bopomofo-only texts. #337
Refiled the OP as #5017 |
https://drafts.csswg.org/css-text-3/#line-break-transform
Under this rule, common uses of quotation marks in Chinese will have unexpected spaces, because quotation marks are A.
Ideally, we should consider the language information of the context. If the context is an East Asian language, A should be treated as W. Even in an unknown language context, if either side of the line feed is A and the other side is F, W or H, the segment break should also be removed.
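The original proposal can be sketched as follows. This is an illustration only, under several assumptions: the set of "East Asian" language tags is a guess, the function names are hypothetical, and the baseline rule (remove the break when both sides are F, W or H) is this sketch's reading of the draft at the time, with the Hangul exclusion left out for brevity.

```python
import unicodedata
from typing import Optional

WIDE = {"F", "W", "H"}
# Assumption: which primary language subtags count as "East Asian" here.
EAST_ASIAN_LANGS = {"zh", "ja"}


def effective_width(ch: str, lang: Optional[str]) -> str:
    """Treat Ambiguous as Wide when the language context is East Asian."""
    width = unicodedata.east_asian_width(ch)
    primary = (lang or "").lower().split("-")[0]
    if width == "A" and primary in EAST_ASIAN_LANGS:
        return "W"
    return width


def proposal_removes_break(before: str, after: str, lang: Optional[str] = None) -> bool:
    """Sketch of the proposal: A counts as W in East Asian context; otherwise
    A next to F/W/H still removes the break."""
    b = effective_width(before, lang)
    a = effective_width(after, lang)
    if b in WIDE and a in WIDE:
        return True  # assumed baseline rule from the draft
    return (b == "A" and a in WIDE) or (b in WIDE and a == "A")
```

Note that under this version a break between two Ambiguous characters is removed in an East Asian context, which is the A+A case that later comments in the thread were concerned about.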