[css-text-3] Should enclosed ideographic blocks be space-discarding? #4992
What's the rationale for this? If we only discard when both sides are part of the space-discarding character set, sometimes unintended behavior will appear. For example:
will become:
( |
Interesting indeed. I'm fine with not including them if usage outside CJ is expected. Would you want a space when Latin text follows enclosed numerals? That is, for "㉑Text", should there be a space?
Please consider the other case:
then you would want:
|
I'm not sure. I will discuss with the clreq editors in w3c/clreq#293 |
This can happen indeed. We need to answer at least two questions (maybe not in this issue):
|
@kidayasuo might want to jump in; he is brainstorming the same question in jlreq, namely which is the more suitable way to define the typographic rules in a future jlreq. I'm fine either way with how the spec describes the typographic behavior, but that won't affect what authors actually do: both types of authors are most likely to keep their preferred behaviors. As for the segment transformation rules, whichever option we take, the other group will need to learn the rules, so I think the simpler option is easier to remember and adopt. |
I'm inclined to think that we need to expect authors to make adjustments sometimes to resolve ambiguities. So given:
a small adjustment to the source text could fix the problem, such as:
If the line breaking is done manually, this shouldn't be a common problem, once the author is aware of how things work. If line-breaking is done automatically, the author ought to expect some problems that will need to be rectified, although the application doing the line-breaking could also look out for situations where applying a line-break between certain characters would introduce ambiguity (eg. not splitting immediately after a space in CJ). |
Just FYI - I'm not sure about other WYSIWYG editors, but BlueGriffon enables line wrapping (of the HTML source code) by default, and I often see unwanted spaces appear in Chinese (or between Chinese and English) when the web page is created with BlueGriffon. |
Sounds like you should raise a bug report against that app. |
I didn't find the bug tracker for it, so I'll friendly ping @therealglazou here. |
I agree with @r12a that ambiguous cases should be handled by not inserting a line break around such places. This means that for these "ambiguous" cases it does not really matter which way we decide. It is probably better to opt for whichever is easier to remember or more intuitive to end users. One possible caveat is that what is truly ambiguous can sometimes be unclear, because the ambiguity in this case is based on human expectations. |
I agree, too, with @r12a that ambiguous cases should be handled by not inserting a line break around such places, and I also agree that authors should not insert a line break between CJK and non-CJK letters when they do not want a space between them (e.g., However, I can't agree that authors should not insert a line break between a CJK punctuation mark and a non-CJK letter when they do not want extra space between them (e.g., |
So. While this was a very interesting discussion on @xfq's side track, nobody seems to have commented on the actual issue, which is whether the enclosed ideographic blocks should be space-discarding. :) |
Well, we discussed this issue in a clreq meeting, and the general feeling was that this issue was not important enough for us to discuss, because: 1) the enclosed ideographic blocks are rarely used and even less often appear at the beginning/end of a (hard-wrapped) line; 2) we have lots of high-priority issues like (soft-wrapped) line breaking and text-spacing rules :) They might be used more often in Japanese (since the ARIB STD B24 character set contains many such characters). Perhaps @himorin knows more? |
Also for Japanese, the general feeling is that the enclosed ideographic blocks are not that important. To provide a bit more analysis, characters in these blocks are:

① enclosed number or kana ① ㈠ ㊂ ㋑

I see category ① more often than the others. They are used as list headings. As list headings they typically appear after explicit line breaks, and therefore the transformation rules are not that relevant. Of course they can also be used as headers of inline lists, or in other places in a line. In either case enclosed ideographic numbers should be treated the same way as enclosed Arabic numbers. Otherwise many ordinary people would be puzzled why a space is inserted around ① but not around ㊀.

Category ② is legacy combining characters. They typically come before or after a noun, as in “12日㈪” (12th Monday) and ㈱アップル (Apple incorporated). As they form one noun block, people would probably not insert a line break in between. I saw them more often in the past, and I feel their use is decreasing in favour of fully spelling them out, like 月曜/月曜日 (Monday) or 株式会社 (corporation). Please refer to the usage counts below.

Category ③ is special-purpose characters used by TV. They are not for general use. I googled these characters, but most of the pages found were about the Unicode characters themselves.

Here is a non-scientific use count obtained by Google-searching each character within quotation marks. The second circled number denotes the category I used above.

In sum, they are not frequent characters, and if they are used, it is often in a context where the transformation rule is not that important. I believe it is more important that all enclosed numbers are treated the same way, regardless of whether the number is in Arabic style or in ideographic style. Actually, all enclosed letters and numbers should probably be treated in a consistent manner. |
I agree with @kidayasuo that "all enclosed letters and numbers should be treated in a consistent manner", and I want to use ①②③… in non-CJK text such as "There are three options: ①foo ②bar and ③baz". In such inline list cases, space between items should not be discarded. So I think they should not be space-discarding. |
In non-CJK text, the spaces will not be discarded because (currently) we only discard if both sides of the line break are part of the space-discarding character set. |
@xfq ok, my mistake |
Based on the comments from @kidayasuo, @MurakamiShinyu, and @xfq, I'm closing this as "no change": treat these blocks the same as Enclosed Alphanumerics, that is, do not include them among the space-discarding characters. |
In #337 we decided to key line-break transformation behavior by Unicode Block. Most of the blocks are pretty straightforward: Han, Kana, Yi, and CJK punctuation blocks discard, and everything else converts to a space. But there are a few interesting cases...
One interesting case is the enclosed ideographic blocks:
https://en.wikipedia.org/wiki/Enclosed_CJK_Letters_and_Months
https://en.wikipedia.org/wiki/Enclosed_Ideographic_Supplement
The numerics in the Enclosed CJK Letters and Months block seem likely to be used outside of a CJK context. There are also quite a few Hangul characters in it, and I wouldn't be surprised if at least some of the other characters are also sometimes used in Korean.
Note, however, that we only discard if both sides (before and after) the line break are part of the space-discarding character set.
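To make the "both sides" rule concrete, here is a rough sketch in Python. It is only an illustration, not the spec's algorithm: the function names and the list of ranges are assumptions made for this example, the ranges cover only a handful of the Han, Kana, Yi, and CJK punctuation blocks, and other parts of the transformation (such as zero-width space handling) are ignored. The enclosed blocks are deliberately left out of the set, so they convert to a space.

```python
# Illustrative, partial list of space-discarding Unicode blocks (an assumption
# for this sketch; the real set is whatever the spec defines).
SPACE_DISCARDING_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xA000, 0xA48F),  # Yi Syllables
]


def is_space_discarding(ch: str) -> bool:
    """Return True if the character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SPACE_DISCARDING_RANGES)


def transform_segment_break(before: str, after: str) -> str:
    """Return what replaces a line break between two characters.

    The break is removed only when BOTH neighbours are space-discarding;
    in every other case it becomes a single space.
    """
    if is_space_discarding(before) and is_space_discarding(after):
        return ""
    return " "


def join_lines(lines: list[str]) -> str:
    """Join hard-wrapped source lines (assumed non-empty) into one string."""
    out = lines[0]
    for nxt in lines[1:]:
        out += transform_segment_break(out[-1], nxt[0]) + nxt
    return out


if __name__ == "__main__":
    print(join_lines(["漢字", "仮名"]))      # both sides discard: no space
    print(join_lines(["漢字", "abc"]))       # only one side discards: space
    print(join_lines(["options:", "①foo"]))  # enclosed numeral: space
```

With this sketch, a break between two Han characters disappears, while a break next to a Latin letter or an enclosed character such as ① or ㈪ becomes a space, because only one side (or neither) is in the set.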