
Emphasis and East Asian text #208

Open · wants to merge 4 commits into master
Conversation

ikedas

@ikedas ikedas commented Jun 25, 2017

Discussions:

There are three commits:

  • A change to the code,
  • A proposed change to the spec,
  • Additional test cases (maybe insufficient).

I realised that the change will introduce some ambiguity, but I think it is not actually a problem.

Rule 6:

__foo、__bar__、baz__
.
<p><strong>foo、</strong>bar<strong>、baz</strong></p>

is not

<p><strong>foo、<strong>bar</strong>、baz</strong></p>

Rule 7:

**〔**foo〕
.
<p><strong>〔</strong>foo〕</p>

is not

<p>**〔**foo〕</p>

@jgm
Member

jgm commented Jun 27, 2017

Thanks for doing this!

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian).

It would also make the code slightly more efficient (one test rather than two -- though perhaps the compiler is smart enough to optimize away this difference).

What do you think?

@kivikakk

@ikedas
Author

ikedas commented Jun 27, 2017

Thanks for the comment.

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian).

I thought the same at first, but such a modification could not handle many cases using underscores (_). Anyway, for EA writers it is a real need that EA punctuation be handled in a different way from Western punctuation.

Another point is that some punctuation characters are shared between EA and Western use, e.g. “ and ”. They cannot be excluded.

@jgm
Member

jgm commented Jun 27, 2017 via email

@ikedas
Author

ikedas commented Jun 28, 2017

Can you give a specific example of a case where you think
what I suggest wouldn't work? I think I can do it in a way
that is logically equivalent to yours, but simpler both
in the spec and the program.

OK, here. In the following texts, 「, 」, and 。 are EA punctuations.

Example 1

猫は*「のどか」*という。

猫は_「のどか」_という。

Current master:

<p>猫は*「のどか」*という。</p>
<p>猫は_「のどか」_という。</p>

Excluding EA punctuations:

<p>猫は<em>「のどか」</em>という。</p>
<p>猫は_「のどか」_という。</p>

Expected (with this PR):

<p>猫は<em>「のどか」</em>という。</p>
<p>猫は<em>「のどか」</em>という。</p>

Example 2

猫は*「のどか」*という。犬は*名がない*。

猫は_「のどか」_という。犬は_名がない_

Current master:

<p>猫は*「のどか」<em>という。犬は</em>名がない*。</p>
<p>猫は_「のどか」<em>という。犬は_名がない</em></p>

Excluding EA punctuations:

<p>猫は<em>「のどか」</em>という。犬は<em>名がない</em></p>
<p>猫は_「のどか」_という。犬は_名がない_。</p>

Expected (with this PR):

<p>猫は<em>「のどか」</em>という。犬は<em>名がない</em></p>
<p>猫は<em>「のどか」</em>という。犬は_名がない_。</p>

Another point is that some punctuation characters are shared between EA and
Western use, e.g. “ and ”. They cannot be excluded.

Yes, the idea would be to define 'punctuation character'
to include these but exclude East-Asian-only punctuation.

Excluding these from Western punctuation will not affect Western text, because space before/after punctuation is ordinary in Western texts (␣ means space).

The␣cat␣is␣named␣*“Nodoka”*.

On the other hand, including them in EA punctuation will help in formatting EA text, because spaces before/after punctuation are unnatural in EA texts.

猫は*“のどか”*という。

猫は␣*“のどか”*␣という。            --- unnatural

So I think these characters would be better classified as EA punctuation.

@kivikakk
Contributor

kivikakk commented Sep 5, 2018

Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?

@ikedas
Author

ikedas commented Sep 5, 2018

Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?

Of course I agree. Please let me know if there is anything I should do.
I'll re-push my commits.

@tamlok

tamlok commented Sep 30, 2018

Hi,

Any updates on this PR?

I think lots of projects are waiting for the update in upstream. :)

Thanks!

@jgm
Member

jgm commented Sep 30, 2018

Excluding these from Western punctuations will not affect Western text, because space before/after punctuation is ordinary in Western texts (␣ means space).

Not always. Examples:

  • the Marines’ slogan—“semper fi”—is well known.
  • he uttered his usual greeting (“hello”).
  • ‘“hello” is longer than “hi”,’ she noted.

@ikedas
Author

ikedas commented Sep 30, 2018

@jgm, in this PR it is the interaction between punctuation and emphasis that matters. Are your examples affected (I haven't confirmed)?

@jgm
Member

jgm commented Sep 30, 2018

My point was just that there might be unexpected consequences to treating these characters like non-punctuation, and that it isn't the case that they're never flanked by punctuation characters. It's hard to survey ahead of time all the cases that might arise, but here's one for concreteness:

He stammered, “*hello, I was...*”

If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final * is not right flanking and we don't get emphasis.

@ikedas
Author

ikedas commented Sep 30, 2018

If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final * is not right flanking and we don't get emphasis.

My PR does not treat LEFT/RIGHT DOUBLE QUOTATION MARK as non-punctuation; it treats them as EA punctuation. In fact, even with my modification applied:

$ build/src/cmark 
He stammered, “*hello, I was...*”
<p>He stammered, “<em>hello, I was...</em>”</p>
$ 

@jgm
Member

jgm commented Oct 1, 2018

Sorry for the misunderstanding.

left_flanking = numdelims > 0 && !cmark_utf8proc_is_space(after_char) &&
                   (!cmark_utf8proc_is_punctuation(after_char) ||
                    cmark_utf8proc_is_eastasian_punctuation(after_char) ||
                    cmark_utf8proc_is_space(before_char) ||
                    cmark_utf8proc_is_punctuation(before_char));
right_flanking = numdelims > 0 && !cmark_utf8proc_is_space(before_char) &&
                  (!cmark_utf8proc_is_punctuation(before_char) ||
                   cmark_utf8proc_is_eastasian_punctuation(before_char) ||
                   cmark_utf8proc_is_space(after_char) ||
                   cmark_utf8proc_is_punctuation(after_char));

Simplifying a bit (EDIT: sorry, first version was completely wrong):

Left flanking:

  • after char is non-space, AND
  • one of the following:
    • after char is EA punctuation or non-punctuation
    • before char is space or punctuation

Right flanking:

  • before char is non-space, AND
  • one of the following:
    • before char is EA punctuation or non-punctuation
    • after char is space or punctuation

The effect of this part of the rule is to make it strictly easier to count as left-flanking and right-flanking, in the cases where a left-flanking run is followed by EA punctuation or a right-flanking run is preceded by EA punctuation. So there won't be examples of the sort I was trying to give, where your rule fails to count something as left- or right-flanking that the original rule does.

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

But, just to make a general comment, one thing I dislike about the proposed change is that it makes an already fairly complicated rule, which I could (barely) keep in my head, even more complicated and hard to think about. That is the reason I've found it difficult to get convinced that this change should be made. It's not by itself a reason to reject the change, but I haven't yet been convinced that the change won't have unanticipated consequences.

@jgm
Member

jgm commented Oct 1, 2018

Here's an (admittedly artificial) example where we'd see a difference, if I'm not mistaken:

*“*there*”*

With the proposed rule, the second * can close emphasis and so we'd get

<em>“</em>there<em>”</em>

whereas currently we get

<em>“<em>there</em>”</em>

Unless I've made a mistake in thinking about it...

@jgm
Member

jgm commented Oct 1, 2018

Another case:

*He said, **“*hello*”**.*

@ikedas
Author

ikedas commented Oct 2, 2018

I'll investigate your simplified rule afterwards (but I want to confirm: it is equivalent to my rule, isn't it?).

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.

What is the reason for the "unique flankingness" requirement? To me, flankingness looks like it was introduced only to describe the behavior of the parser (without consideration of the EA context).

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

It is natural that modifying the rules will change behavior. We have to modify the rules if they can't handle texts as we expect.

I can't decide whether the changes brought to existing texts will be acceptable or not. There seem to be these options:

(Below, an "ambiguous punctuation" is a punctuation character whose east_asian_width property is "A"; such characters can be used in both East Asian and Western contexts, including ¡, ¿, and others.)

  1. Reject the entire change in this PR. --- Obviously uncomfortable for East Asian writers.
  2. Treat ambiguous punctuations as non-East Asian punctuations. --- A bit uncomfortable for East Asian writers.
  3. Add an option (at compile time or at runtime) to treat ambiguous punctuations either as East Asian or as non-East Asian punctuations, according to the user's choice.
  4. Treat ambiguous punctuations as East Asian punctuations. --- More or less uncomfortable for Western writers.

@ikedas
Author

ikedas commented Oct 2, 2018

Another case:

*He said, **“*hello*”**.*

I'll add corresponding examples with East Asian context.

For example,

*他說,**“*你好*”**。*

will be handled properly by current master (note: , is not comma + space but an EA punctuation), as:

<p><em>他說,<strong>“<em>你好</em>”</strong>。</em></p>

However, the example above is a lucky case. Perhaps this sentence is understandable without the ,. Removing it,

*他說**“*你好*”**。*

will be rendered with current master as:

<p><em>他說</em>*“<em>你好</em>”**。*</p>

I think it is hard for writers to accept this result.


As a workaround, for example, we might recommend that writers mark it up as:

*他說 **“*你好*”**。*

This will be rendered as:

<p><em>他說 <strong>“<em>你好</em>”</strong>。</em></p>

The result is readable, if readers ignore the ugly space. However, it may not be easy to justify forcing writers to insert unusual spaces that would not appear in plain text without markup.


Note: My PR will not solve all problems with current master: it cannot handle markup in an East Asian context that is as complex as in a Western context. In fact, since the example above is slightly complex, it will be rendered with my PR as:

<p><em>他說**“</em>你好<em>”**。</em></p>

However, from the viewpoint of East Asian writers, it will improve the current behavior a lot.

@jgm
Member

jgm commented Oct 2, 2018

Yes, my simplified rephrasing was meant to be equivalent to your proposal. (Just to help me think about it more clearly.)

Thinking outside of the box a bit: instead of having two distinct classes of punctuation characters, would it work to treat East Asian characters in general (including both EA punctuation and EA non-punctuation characters) as equivalent to punctuation for determining flankingness and can-open/can-close?

That is: the rules would all be the same as they are, except that "punctuation" would be interpreted as including Western punctuation characters plus ALL EA characters. (Obviously, one might want a better name for this broad class than "punctuation," but that's a detail.)

This would keep the simpler logic of the current rules, and it would guarantee that nothing changes in the interpretation of Western texts.

@cangyuyao

Just wondering, is there any progress on this?

All CJK projects based on CommonMark have been stuck on it for years.

@mity
Contributor

mity commented May 30, 2019

Maybe this issue can be seen better from a different perspective. At least I have always found the left-flanking and right-flanking terms confusing, and I always easily got lost in them when thinking about some particularly complicated input example.

Eventually I started to use in my head an alternative wording which (I believe) is 100% equivalent to the current spec's wording. It may be spelled out as follows:

Left score and right score of the delimiter run determine whether the run may or may not open/close an emphasis. The scores are computed as follows:

  1. If the preceding character is Unicode whitespace, set the left score to 0.
    If the preceding character is Unicode punctuation, set the left score to 1.
    If the preceding character is anything else, set the left score to 2.

  2. If the subsequent character is Unicode whitespace, set the right score to 0.
    If the subsequent character is Unicode punctuation, set the right score to 1.
    If the subsequent character is anything else, set the right score to 2.

  3. If left score == 2 and right score == 2, and the delimiter run is _-based, then reset both scores to zero.

The delimiter run can open an emphasis iff left score <= right score and right score > 0.
The delimiter run can close an emphasis iff left score >= right score and left score > 0.

(If you prefer code, MD4C uses this alternative wording internally.)

I post this because it might be easier to come up with a solution in this wording, if we just add more rules to the score calculations above. IMHO, it could perhaps even solve the issue with the ambiguous punctuation noted in earlier comments, e.g. something like:

  • If the preceding character is EA punctuation and the subsequent character is any EA character, then reset the right score to zero. (I.e., this makes the run be treated as if there were punctuation before it and whitespace after it in the current implementation.)
  • If the subsequent character is EA punctuation and the preceding character is any EA character, then reset the left score to zero. (I.e., this makes the run be treated as if there were punctuation after it and whitespace before it in the current implementation.)

At least, it can easily be seen that this wouldn't change anything for Western text, and the people who (unlike me) understand EA languages and their needs can experiment more safely, as long as they propose rules which require EA characters on both sides of the run. Divide et impera.

@spencer246

spencer246 commented Aug 21, 2020

Although this PR works for Japanese and Chinese text (please note that Korean text uses "Western" punctuation marks), it does not solve a related but slightly different issue in Korean text reported here (github/javascript-tutorial, #2040).

Koreans expect *스크립트(script)*라고 to be rendered as <em>스크립트(script)</em>라고. Since Korean text uses "Western" punctuation marks, neither the current CommonMark spec nor this PR renders the above Korean text "correctly."

This Korean-text issue may be resolved by adding one more condition to @jgm's simple rule in this comment:

Right flanking:

  • before char is non-space, AND
  • one of the following:
    • before char is EA punctuation or non-punctuation
    • after char is space or punctuation or any EA character,

although it will break nested emphases more severely.


By the way, I think a better way to solve CJK-related emphasis issues is to introduce new syntax ~_, _~, ~*, and *~, originally suggested by Prof. John MacFarlane for intra-word emphasis. His suggestion is equally applicable to any CJK-related emphasis issue arising from the lack of whitespace.

@spencer246

It seems that the issue of emphasizing Korean text has not been reported before.

I posted this issue as a comment at https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491.

@ikedas
Author

ikedas commented Jul 17, 2022

Sorry I haven't had time to think over this issue.
I have one idea I would like to try and will post it later (perhaps in months).

faultyserver added a commit to discord/discord-intl that referenced this pull request Dec 19, 2024
…mmar (#16)

Chinese and Japanese content usually does _not_ include spaces between
formatted and unformatted segments of a single phrase, such as
`**{value}**件の投稿`. But this is technically not valid `strong` formatting
according to the CommonMark spec, since the right flank of the ending
delimiter is a non-space Unicode character.

See more information in the CommonMark discussion here:
https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491/5
commonmark/cmark#208

Because this library is explicitly intended to support many languages
including most Asian languages, we are adding an extension to the
Markdown rules to accommodate these situations. The following tests
assert that the special cases for East Asian languages function in a
logically-similar way to Western languages.

The tests for this change are pretty small, as I'm not fluent in
anything near CJK and have purely gone off of suggestions and forums to
enumerate these. Most importantly, `**{value}**件の投稿`, is now treated as
a **bold** `value` followed by plain text, rather than being completely
ignored.