Character references in autolinks #727

Open
xiaq opened this issue Nov 5, 2022 · 12 comments

Comments

@xiaq

xiaq commented Nov 5, 2022

The spec doesn't specify whether character references are supported inside autolinks. The following Markdown:

<aa:&#65;>

is rendered as the following by cmark:

<p><a href="aa:A">aa:A</a></p>

but as the following by commonmark.js:

<p><a href="aa:&amp;#65;">aa:&amp;#65;</a></p>
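The two outputs differ only in whether the link target is decoded before it is HTML-escaped. A minimal sketch of that difference, assuming a simplified autolink pattern and a hypothetical `render_autolink` helper (neither is taken from cmark or commonmark.js, and the percent-encoding of the href that real renderers perform is omitted):

```python
import html
import re

# Simplified absolute-URI autolink: a 2-32 character scheme, a colon, then
# anything except whitespace, "<" and ">". This is an approximation.
AUTOLINK = re.compile(r"<([A-Za-z][A-Za-z0-9+.-]{1,31}:[^\s<>]*)>")


def render_autolink(source, decode_references):
    """Render a lone autolink as HTML, or return None if it is not one.

    decode_references=True mimics the cmark output quoted above;
    decode_references=False mimics the commonmark.js output.
    """
    match = AUTOLINK.fullmatch(source)
    if match is None:
        return None
    target = match.group(1)
    if decode_references:
        # Resolve character references such as &#65; before escaping.
        target = html.unescape(target)
    escaped = html.escape(target, quote=True)
    return f'<p><a href="{escaped}">{escaped}</a></p>'


print(render_autolink("<aa:&#65;>", decode_references=True))
print(render_autolink("<aa:&#65;>", decode_references=False))
```

With decoding enabled the sketch prints the first output shown above, and with decoding disabled it prints the second.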
@xiaq
Author

xiaq commented Nov 5, 2022

Ah, I filed an issue about exactly the same problem in commonmark/commonmark.js#263. So it seems that the intention is to support character references inside autolinks.

Maybe we can add an example to the spec with a character reference in an autolink?

@wooorm
Contributor

wooorm commented Nov 5, 2022

I’m pretty strongly in the camp that character references should not work in autolinks.
Apart from autolinks, they work in the same places where (backslash) character escapes work.
Character escapes are described in the same (preliminaries) section of the spec, and it has an example: https://spec.commonmark.org/0.30/#example-20.

I don’t think there should be one edge case where backslashes don’t work but character references do?

@jgm
Member

jgm commented Nov 6, 2022

I think the motivation was that autolinks can be URLs that you just copy from some other source, and these might contain character references.

@wooorm
Contributor

wooorm commented Nov 6, 2022

I’m not sure about that reasoning: they might just as well be plain Unicode, particularly when coming from an address bar. I could see problems with double decoding.
But, most important for me: it has to be consistent with character escapes.
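To illustrate the double-decoding concern, here is a small sketch using Python's standard `html.unescape`. The URL and the `decode_times` helper are hypothetical, chosen only to show how a second decoding pass changes the text again:

```python
import html


def decode_times(text, times):
    """Apply character-reference decoding `times` times in a row."""
    for _ in range(times):
        text = html.unescape(text)
    return text


# Hypothetical URL as it might appear in raw HTML source, where a literal
# "&copy;" in the query string has been written as "&amp;copy;".
url_from_html_source = "https://example.com/?q=&amp;copy;"

once = decode_times(url_from_html_source, 1)   # intended value
twice = decode_times(url_from_html_source, 2)  # decoded one time too many

print(once)
print(twice)
```

One pass recovers the intended `?q=&copy;`. A second pass, for instance by a Markdown parser that decodes text which was already decoded when it was copied, turns it into `?q=©`.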

@wooorm
Contributor

wooorm commented Nov 6, 2022

On motivation: do you mean cmark is more in line with your motivation? That the absence in cmjs was because it was forgotten? That no test for it in the spec was intended? What do you think about the test on character escapes but no test of character references?

@jgm
Member

jgm commented Nov 6, 2022

Yes, in the linked issue, I said I thought that cmark was getting it right.
It could be worth adding a spec example for this.

@jgm
Member

jgm commented Nov 6, 2022

I see why it would be nice if entities got resolved in exactly the places backslash escapes do -- but again, this is motivated by a desire to support URL copy-pasting.

@wooorm
Contributor

wooorm commented Nov 6, 2022

Consistency with character escapes is most important to me.
If character escapes are allowed too, I am open to it. I still see a lot of inconsistency for character references in Babelmark (so it would be good to specify whatever the choice is).
Here’s a test case of several normal cases and edge cases:

a <https://example&period;com>

b <https:&sol;&sol;example.com>

c <https&colon;//example.com>

d <&#104;ttps://example.com>

e <some&period;user@example.com>

f <some.user@example&period;com>

Note that c and d are not allowed per CommonMark, as the protocol (the part before and including :) does not allow &, ;, or #.
And e and f are not allowed per CommonMark because neither the part before @ (ASCII atext) nor the part after it (the domain) allows ;.
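The classification above can be checked with a rough sketch that matches each case against approximations of the CommonMark autolink grammar. The regular expressions are transcribed from memory of the spec's absolute-URI and email-address definitions and should be treated as assumptions; `classify` and `CASES` are hypothetical names:

```python
import re

# Approximate absolute URI: scheme of 2-32 characters, a colon, then any
# characters other than ASCII control characters, space, "<" and ">".
URI_AUTOLINK = re.compile(r"[A-Za-z][A-Za-z0-9+.-]{1,31}:[^\x00-\x20\x7f<>]*")

# Approximate email address: atext-style local part, then dot-separated labels.
EMAIL_AUTOLINK = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def classify(text):
    """Return "uri", "email", or "none" for a candidate such as "<...>"."""
    if not (text.startswith("<") and text.endswith(">")):
        return "none"
    inner = text[1:-1]
    if URI_AUTOLINK.fullmatch(inner):
        return "uri"
    if EMAIL_AUTOLINK.fullmatch(inner):
        return "email"
    return "none"


CASES = {
    "a": "<https://example&period;com>",
    "b": "<https:&sol;&sol;example.com>",
    "c": "<https&colon;//example.com>",
    "d": "<&#104;ttps://example.com>",
    "e": "<some&period;user@example.com>",
    "f": "<some.user@example&period;com>",
}

for label, source in CASES.items():
    print(label, classify(source))
```

Under these assumptions only a and b are recognised as autolinks at all, so they are the only cases where the question of decoding character references actually arises.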

@xiaq
Author

xiaq commented Nov 6, 2022

@jgm IMO there is an equally valid argument against character reference if we are talking about copy-pasting: one could also copy-paste from a place that doesn't interpret character references, like the browser's URL bar, or a displayed webpage (as opposed to the HTML source).

@jgm
Member

jgm commented Nov 6, 2022

@xiaq - granted.

@jgm
Member

jgm commented Nov 6, 2022

Granting that there are these two possible sources for copy/paste, I think my reasoning was that if a valid character reference occurs in a copied URL, it's by far likeliest that its source is raw HTML rather than the browser's URL bar or a displayed web page. How often does one want to display something like &amp; in a URL?

@wooorm
Contributor

wooorm commented Dec 26, 2022

I mostly care about consistency, so then I’d also ask: how often does one want to display something like \?, where ? is any ASCII punctuation. If it’s consistent: I’m fine with it.

But thinking some more about this: while the motivation of “allow copy/paste” is a good one, to get there I believe we should then also allow Unicode letters/punctuation in email atext, and Unicode letters plus, most likely, & and \ in email domains?
