Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address several IDNA issues #309

Merged
merged 3 commits into from
Jun 1, 2017
Merged

Address several IDNA issues #309

merged 3 commits into from
Jun 1, 2017

Conversation

annevk
Copy link
Member

@annevk annevk commented May 12, 2017

As reported at
#53 (comment) this is
causing issues in non-browser implementations.


Preview | Diff

@domenic
Copy link
Member

domenic commented May 12, 2017

So what should we do in Node.js/whatwg-url? Should we take this as license to violate the spec, so that our users can use our libraries on YouTube?

I'd kind of prefer the spec just match browsers here (and in most places); we've never had success getting implementers to comment on URL parsing issues before, and I doubt it's going to start here.

@annevk
Copy link
Member Author

annevk commented May 12, 2017

Sigh. I guess you're right. It's just such dumb Latin-alphabet-centric behavior that I'm having a hard time accepting this is what we want long term. And I'd rather fix issues once than multiple times.

@annevk
Copy link
Member Author

annevk commented May 12, 2017

zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
zimbabao added a commit to zimbabao/node that referenced this pull request May 15, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@rmisev
Copy link
Member

rmisev commented May 15, 2017

Following tests show, what not all browsers use fast path for ASCII only input:

Input URL Standard Chrome Edge Firefox
ab--c (failure) ab--c ab--c ab--c
xn--a (failure) xn--a (failure) xn--a
xn--nxa.xn--nxa xn--nxa.xn--nxa xn--nxa.xn--nxa xn--nxa.xn--nxa
β.β
β.β
xn--nxa.β xn--nxa.xn--nxa xn--nxa.xn--nxa xn--nxa.xn--nxa
β.β
β.β
ab--c.xn--nxa (failure) ab--c.xn--nxa ab--c.xn--nxa
ab--c.β
ab--c.β
ab--c.β (failure) (failure) ab--c.xn--nxa
ab--c.β
ab--c.β
xn--a.xn--nxa (failure) xn--a.xn--nxa (failure) xn--a.xn--nxa
xn--a.β (failure) (failure) (failure) xn--xn--a-pm43a.β

As you see Edge checks the xn-- labels for ASCII only input, while Chrome doesn't. The Chrome's behavior doesn't look consistent for me. I think Edge's is better.

Firefox doesn't fail on invalid labels at all and the last test result shows browser's bug, which must be fixed anyway.

Note: I tested with URL object, prefixing the input with http://. If Edge's result has two lines, then first is host returned in href, second - hostname.

@annevk
Copy link
Member Author

annevk commented May 15, 2017

Thanks for those tests. Chrome's behavior looks consistent? If all of the input is ASCII, don't run ToASCII. Label splitting happens as part of ToASCII so is not a consideration until then.

@rmisev
Copy link
Member

rmisev commented May 15, 2017

If follow Chrome's behavior, then it's unclear xn--a.xn--nxa is valid or not. It is accepted when I enter it in the address bar. But then Chrome converts it to xn--a.β. So valid becomes invalid... 😕

@annevk
Copy link
Member Author

annevk commented May 15, 2017

I see, the problem is that they use ToUnicode (which never fails) after which it no longer roundtrips. That does indeed seem bad.

zimbabao added a commit to zimbabao/node that referenced this pull request May 16, 2017
To match browser behavior fast path ascii only domains and
do not run ToASCII on them.

Fixes: nodejs#12965
Refs: nodejs#12966
Refs: whatwg/url#309
@annevk
Copy link
Member Author

annevk commented May 17, 2017

@rmisev sorry for leaving this. So what is the processing model we want? ASCII lowercase input; split input on "."; if a label starts with "xn--" or contains non-ASCII, run input through ToASCII? That's getting rather complicated. Always invoking ToASCII and setting the ignore hyphen flag might end up being better, no?

@rmisev
Copy link
Member

rmisev commented May 17, 2017

I think simpler is better, so always ToASCII.

zimbabao added a commit to zimbabao/node that referenced this pull request May 19, 2017
There are valid domain names with hyphens at 3 and 4th position, new
node WHATWG URL parser was failing for it assume its an invalid IDN.
Also filters IDN errors when domain label start or end with hyphen.

Also fix error in ToUnicode

Fixes: nodejs#12965
Refs: https://www.icann.org/news/announcement-2000-01-07-en
Refs: whatwg/url#309 (comment)
@annevk
Copy link
Member Author

annevk commented May 19, 2017

I updated the PR to point to the proposed update of UTS46 and set CheckHyphens to false.

@annevk annevk requested a review from domenic May 19, 2017 13:02
@domenic
Copy link
Member

domenic commented May 31, 2017

We're working on validating the tests via jsdom/whatwg-url in jsdom/whatwg-url#90, but kind of stuck since we don't implement CheckJoiners and CheckBidi. We'll probably go ahead and merge anyway, but would Node.js be able to create an implementation that passes the tests instead? That'd be even better.

@TimothyGu
Copy link
Member

would Node.js be able to create an implementation that passes the tests instead?

Yes I think so. In fact, cases 3 and 4 here (the two corresponding to CheckBidi) are already handled correctly in Node.js 8.0.0. CheckJoiners was not included in the release, since at the time the PR was merged there was not yet a resolution on CheckJoiners.

@domenic
Copy link
Member

domenic commented May 31, 2017

Great. Would it make sense to put off merging this PR until there is a corresponding verification from Node.js that the tests and spec are implementable? That's been our general preference so far, using either whatwg-url or Node.js.

@TimothyGu
Copy link
Member

until there is a corresponding verification from Node.js that the tests and spec are implementable?

And indeed there is: nodejs/node#13362 I'll look into implementing the changes in tr46.js and whatwg-url as well.

Marginally off-topic, but just wanted to point out that there seems to be a circular dependency between the URL Standard, Unicode IDNA spec, whatwg-url, Node.js, and WPT; specifically:

  • This PR depends on the proposed draft of Unicode IDNA
  • This PR depends on independent verification of implementability (Node.js or whatwg-url)
  • Node.js and whatwg-url PRs depend on this PR and WPT updates before they can be applied
  • WPT update depends on this PR

@domenic
Copy link
Member

domenic commented Jun 1, 2017

OK, looks like Node.js has a passing implementation, so let's merge this! BTW the dependency is not cyclic, as we don't need a merged PR, just an existing PR :).

@domenic domenic merged commit dc9d831 into master Jun 1, 2017
@domenic domenic deleted the annevk/ToASCII-warning branch June 1, 2017 21:24
TimothyGu added a commit to TimothyGu/node that referenced this pull request Jun 7, 2017
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
TimothyGu added a commit to TimothyGu/node that referenced this pull request Jun 7, 2017
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

PR-URL: nodejs#13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
jasnell pushed a commit to nodejs/node that referenced this pull request Jun 7, 2017
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

PR-URL: #13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
TimothyGu added a commit to TimothyGu/node that referenced this pull request Nov 28, 2017
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

PR-URL: nodejs#13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit to nodejs/node that referenced this pull request Jan 18, 2018
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

Backport-PR-URL: #17365
PR-URL: #13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit to nodejs/node that referenced this pull request Feb 11, 2018
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

Backport-PR-URL: #17365
PR-URL: #13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit to nodejs/node that referenced this pull request Feb 12, 2018
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

Backport-PR-URL: #17365
PR-URL: #13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit to nodejs/node that referenced this pull request Feb 13, 2018
Remove custom tests for invalid IDNA domains in url-idna.js in favor of
the more comprehensive official set.

Backport-PR-URL: #17365
PR-URL: #13362
Refs: whatwg/url#309
Refs: web-platform-tests/wpt#5976
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
rmisev added a commit to upa-url/upa that referenced this pull request May 24, 2020
TimothyGu added a commit to TimothyGu/url that referenced this pull request May 13, 2021
Many implementations currently skip ToASCII if domain is ASCII-only, but
as discovered in [1] and [2], this can result in some undesirable
behavior. Adding a note prevents implementors from making the mistake of
thinking ToASCII is a no-op if the input is ASCII, and also provides a
recommendation on how to properly optimize the ToASCII step.

[1]: whatwg#267
[2]: whatwg#309 (comment)
TimothyGu added a commit to TimothyGu/url that referenced this pull request May 13, 2021
Many implementations currently skip ToASCII if domain is ASCII-only, but
as discovered in [1] and [2], this can result in some undesirable
behavior. Adding a note prevents implementors from making the mistake of
thinking ToASCII is a no-op if the input is ASCII, and also provides a
recommendation on how to properly optimize the ToASCII step.

[1]: whatwg#267
[2]: whatwg#309 (comment)
TimothyGu added a commit to TimothyGu/url that referenced this pull request May 19, 2021
Many implementations currently skip ToASCII if domain is ASCII-only, but
as discovered in [1] and [2], this can result in some undesirable
behavior. Adding a note prevents implementors from making the mistake of
thinking ToASCII is a no-op if the input is ASCII, and also provides a
recommendation on how to properly optimize the ToASCII step.

[1]: whatwg#267
[2]: whatwg#309 (comment)
annevk pushed a commit that referenced this pull request May 19, 2021
Many implementations currently skip ToASCII if domain is ASCII-only, but as discovered in #267 and #309 (comment), this can result in some undesirable behavior. Adding a note prevents implementers from making the mistake of thinking ToASCII is a no-op if the input is ASCII, and also provides a recommendation on how to properly optimize the ToASCII step.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants