'Url' or 'href' should be also encoded when meeting non-normative characters #1450

Closed
ghost opened this issue Mar 14, 2019 · 13 comments

Comments


ghost commented Mar 14, 2019

Take this demo:

Mixed with Non-Latin Characters in URL.

For 'mailto:……' links, it seems right that no characters other than Latin ones (the 26 English letters) are allowed. For a URL or href, however, the behavior is wrong.

Ref: nodejs/nodejs.org#1612

ghost changed the title from "Url or mailto: should only allow Latian Characters" to "Url or href should only allow Latian Characters" on Mar 14, 2019
Member

UziTech commented Mar 14, 2019

looks like GitHub renders it the same way marked does:

http://www.baidu.com.

http://www.baidu.com。

http://www.baidu.com我。

mailto:abc@abc.com

mailto:abc@abc.com我。

Member

UziTech commented Mar 14, 2019

Where does it say that links should only include latin characters?

Author

ghost commented Mar 14, 2019

@UziTech: What puzzles me is that for 'mailto', “。” and “我” are separated from the English characters, whereas in the URL or href link (such as http……) “我” and “。” become part of the link. It should behave the same as what we see with 'mailto'.

So I guess URL, href, and mailto links should all allow only English (Latin) characters?

Member

UziTech commented Mar 15, 2019

according to the commonmark spec:

An absolute URI, for these purposes, consists of a scheme followed by a colon (:) followed by zero or more characters other than ASCII whitespace and control characters, <, and >. If the URI includes these characters, they must be percent-encoded (e.g. %20 for a space).

and the spec for email:

An email address, for these purposes, is anything that matches the non-normative regex from the HTML5 spec:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?
(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
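To see why “我” and “。” end up outside a mailto link, here is a minimal sketch that applies the regex quoted above (joined onto one line) to the two email examples from this thread. The helper name `isEmailAddress` is made up for illustration and is not part of marked; a real autolinker would presumably scan for the longest matching run rather than test a whole string.

```javascript
// The HTML5 email regex quoted above, joined onto a single line.
const EMAIL_RE = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical helper: true when the whole string is an email address
// according to the regex above.
function isEmailAddress(text) {
  return EMAIL_RE.test(text);
}

const plainAddress = isEmailAddress("abc@abc.com");
// Non-ASCII characters are not in any of the regex's character classes,
// so they cannot be part of the matched address.
const addressWithChinese = isEmailAddress("abc@abc.com我。");

console.log(plainAddress, addressWithChinese);
```

Because every character class in the regex is ASCII-only, the address has to stop before the first non-ASCII character, which matches the rendering shown earlier.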

Member

UziTech commented Mar 15, 2019

also according to gfm spec:

Trailing punctuation (specifically, ?, !, ., ,, :, *, _, and ~) will not be considered part of the autolink
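As a rough illustration of that rule, the sketch below strips only the punctuation characters listed in the quote. The function name `stripTrailingPunctuation` is hypothetical, and the full GFM rule has further cases (for example, parentheses) that are not handled here.

```javascript
// The trailing punctuation characters listed in the quoted GFM rule.
const TRAILING_PUNCTUATION = new Set(["?", "!", ".", ",", ":", "*", "_", "~"]);

// Hypothetical helper: split a candidate autolink into the link itself
// and the trailing punctuation that is left outside of it.
function stripTrailingPunctuation(candidate) {
  let end = candidate.length;
  while (end > 0 && TRAILING_PUNCTUATION.has(candidate[end - 1])) {
    end -= 1;
  }
  return { link: candidate.slice(0, end), rest: candidate.slice(end) };
}

// The ASCII full stop is in the list, so it is excluded from the link.
const asciiDot = stripTrailingPunctuation("http://www.baidu.com.");
// The ideographic full stop "。" is not in the list, so it stays in the link.
const ideographicDot = stripTrailingPunctuation("http://www.baidu.com。");

console.log(asciiDot, ideographicDot);
```

Under this reading, "." is dropped from the link while "。" is kept, which is consistent with the rendering shown at the top of the thread.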

Author

ghost commented Mar 15, 2019

Should we also apply encodeURI or encodeURIComponent to non-normative characters such as Chinese ones, even when they appear in the URL or href? I see that query parameters need to be encoded and decoded this way in JavaScript. If that's OK, I'll change the issue title as well.

I mean:

http://www.baidu.com/我

should be converted to: "http://www.baidu.com/%E6%88%91"

And take this:
http://www.baidu.com?a=我

should be converted to:
"http://www.baidu.com?a=%E6%88%91"

What do you think of this?
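For reference, the two conversions proposed above are what the built-in `encodeURI` produces. The sketch below uses only standard JavaScript functions; the variable names are illustrative.

```javascript
// The two example URLs from this comment.
const pathUrl = "http://www.baidu.com/我";
const queryUrl = "http://www.baidu.com?a=我";

// encodeURI keeps URL syntax characters such as ":", "/", "?" and "="
// and percent-encodes the UTF-8 bytes of "我".
const encodedPath = encodeURI(pathUrl);   // "http://www.baidu.com/%E6%88%91"
const encodedQuery = encodeURI(queryUrl); // "http://www.baidu.com?a=%E6%88%91"

// encodeURIComponent is meant for a single component, so it also
// encodes "=" and would break a full URL if applied to one.
const encodedComponent = encodeURIComponent("a=我"); // "a%3D%E6%88%91"

console.log(encodedPath, encodedQuery, encodedComponent);
```

`encodeURI` is the one suited to a whole URL; `encodeURIComponent` fits individual parameter values.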

PS: You can compare with the demo example, Unencoded Characters.

If you open this page directly and hover over the link (starting with 'http'), you'll find that the Chinese characters aren't encoded yet. But if you switch to 'Html Rendering Preview', the href or URL appears fully encoded.

【Right】
image
【Wrong: if you hover over the link, you'll find the characters aren't encoded yet】
image

ghost changed the title from "Url or href should only allow Latian Characters" to "'Url' or 'href' should be also encoded when meeting non-normative characters" on Mar 15, 2019
Author

ghost commented Mar 15, 2019

My mistake, I didn't make myself clear. I've now changed the title to: 'Url' or 'href' should be also encoded when meeting non-normative characters. :)

Member

UziTech commented Mar 15, 2019

According to the spec, only whitespace, control characters, <, and > need to be percent-encoded.

Author

ghost commented Mar 15, 2019

Interesting question, I'm curious....

I'm looking at the RFC, and it says:

Octets MUST be encoded if they have no corresponding graphic
character within the US-ASCII coded character set, if the use of the
corresponding character is unsafe, or if the corresponding character
is reserved for some other interpretation within the particular URL
scheme.

No corresponding graphic US-ASCII:

URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The octets 80-FF hexadecimal are not
used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
control characters; these must be encoded.

That is from RFC 1738. So which document is the standard for a URL or href?
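To make the quoted octet ranges concrete, here is a small sketch that checks which UTF-8 octets of a string fall into them. It covers only the "no corresponding graphic US-ASCII" clause, not the "unsafe" or "reserved" ones, and the function names are made up for illustration.

```javascript
// True for octets the quoted passage says must be encoded:
// control characters 00-1F and 7F, plus 80-FF (not used in US-ASCII).
function mustEncodeOctet(octet) {
  return octet <= 0x1f || octet === 0x7f || octet >= 0x80;
}

// Hypothetical helper: list the UTF-8 octets of `text` that fall
// into the ranges above.
function octetsNeedingEncoding(text) {
  const octets = Array.from(new TextEncoder().encode(text));
  return octets.filter(mustEncodeOctet);
}

// "我" is three UTF-8 octets (E6 88 91), all in the 80-FF range.
const chineseOctets = octetsNeedingEncoding("我");
// Plain ASCII letters are graphic characters and need no encoding here.
const asciiOctets = octetsNeedingEncoding("abc");

console.log(chineseOctets.map((o) => o.toString(16)), asciiOctets);
```

The three octets reported for "我" are the ones that appear as %E6%88%91 in the encoded examples earlier in the thread.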

Member

UziTech commented Mar 15, 2019

We do run URLs through encodeURI. It looks like Chrome automatically decodes the URL when displaying it, even though it is encoded.
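A minimal sketch of that difference between the stored and the displayed form is below. It assumes `decodeURI` is a fair stand-in for what the browser does when showing the link; the browser's actual display logic is not taken from this thread, and the variable names are illustrative.

```javascript
// What the author typed in the Markdown source.
const typedUrl = "http://www.baidu.com/我";

// What ends up in the href attribute after encodeURI.
const hrefValue = encodeURI(typedUrl);

// Assumption: the browser shows a decoded form on hover, approximated
// here with decodeURI. The attribute itself stays percent-encoded.
const shownOnHover = decodeURI(hrefValue);

console.log(hrefValue);    // http://www.baidu.com/%E6%88%91
console.log(shownOnHover); // http://www.baidu.com/我
```

So seeing “我” on hover does not mean the href is unencoded.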

Author

ghost commented Mar 15, 2019

Really? I'll check it with other browsers such as IE or Edge…… This is strange....

Member

UziTech commented Mar 15, 2019

You can inspect the link to see it encoded

image

Author

ghost commented Mar 15, 2019

Ah ha……Thanks!
I'll close this, thanks for your patience!

ghost closed this as completed on Mar 15, 2019
This issue was closed.