Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: allow percent encoded code points in the non-special URLs host #218

Merged
merged 7 commits into from
Feb 8, 2017

Conversation

rmisev
Copy link
Member

@rmisev rmisev commented Jan 26, 2017

After 3036255 introduced the URL-host parser rejects percent encoded code points in the non-special URLs host (because "%" is forbidden host code point).

The situation is paradoxical, because parser himself percent encodes some code points, actually producing invalid serialized URLs according to the current standard.

URL's from #148 to test issue:
unknown://%E2%80%A0/
notspecial://H%4fSt/path

Solution
I suggest that forbidden host code points be checked on percent decoded code points. The decoded data will be used only for this purpose, so the output will not be affected by this change.

Also this is a closer to special URLs host validity check.

…al URL

URL-host parser fix: decode percent encoded code points before the
forbidden host code points check
@annevk
Copy link
Member

annevk commented Jan 27, 2017

@achristensen07, this is a good reason to allow "%" for opaque hosts. Though we could also go this route and attempt to decode first.

@annevk
Copy link
Member

annevk commented Jan 30, 2017

I think my preferred fix here would be to simply not forbid "%" by special casing it in the URL-host parser.

@rmisev
Copy link
Member Author

rmisev commented Jan 30, 2017

I agree, it makes sense not to decode. Then more code points can be removed from forbidden list:

  • We don't needed to check these because:
    U+0009, U+000A and U+000D are removed by basic URL parser in step 3
    "#", "/" and "?" are stop code points in host state

  • Maybe enable IPv6:
    "[", ":", "]" (host state guarantees that ":" appears only between "[" and "]")

Now forbidden list for opaque hosts can be as short as: U+0000, U+0020, "\", "@"

@annevk
Copy link
Member

annevk commented Jan 30, 2017

See also #214 for the previous discussion regarding this.

@annevk
Copy link
Member

annevk commented Feb 1, 2017

In particular I think we should continue to include the code points you cannot reach to avoid issues down the road. Although potentially we could point out that you cannot reach them and that they are therefore excluded. That would be equally clear.

@achristensen07 any thoughts here?

@rmisev
Copy link
Member Author

rmisev commented Feb 1, 2017

I think must be excluded the "%" and "[", ":", "]" to allow IPv6 (for example ssh: scheme can use IPv6).
And yes, there no need to exclude not reachable code points.

@annevk
Copy link
Member

annevk commented Feb 1, 2017

Yeah, that makes sense. It's unfortunate those schemes have to write their own IPv4/6 parser/validation though. Perhaps we need to expose host parsing separately as well to JavaScript at some point to make that easier.

@jasnell
Copy link
Collaborator

jasnell commented Feb 1, 2017

A host parsing JS API would be welcome and would definitely be something worth exploring.

@annevk
Copy link
Member

annevk commented Feb 7, 2017

@achristensen07 ping, could you please have a look at the proposed change? Safari is the only browser that supports hosts for non-special URLs so your review would be most welcome.

@annevk
Copy link
Member

annevk commented Feb 7, 2017

@achristensen07 to clarify, the change basically makes it okay to use "%", "[", "]", and ":" (within "[" and "]") in opaque hosts. That way input such as non-special://[%30:x]/ would work end up with a host of [%30:x].

@rmisev
Copy link
Member Author

rmisev commented Feb 7, 2017

I further investigate this problem and I think a very basic IPv6 check can help here (change URL-host parser 2 step to):

  1. If input starts with "[" and ends with "]", then if input between "[" and "]" contais a forbidden host code point excluding ":", syntax violation, return failure.
  2. Otherwise, if input contains a forbidden host code point excluding "%", syntax violation, return failure.

This disallows such weird URLs: non-special://[:80 or non-special://]

And term "forbidden opaque host code point" (304f0ae) will not be required anymore.

If you agree, I can prepare a new commit.

@annevk
Copy link
Member

annevk commented Feb 7, 2017

That alternative seems reasonable too, but I think it's the best use of our time to wait for @achristensen07 to give input with regards to WebKit's preferences.

@achristensen07
Copy link
Collaborator

We definitely need to allow percent-encoded values in non-special URLs to remain, but we need to do something about invalid percent-encoded values. notspecial://H%4fSt/path is fine but notspecial://H%4gSt/path is not. In such a case, should we percent-encode the percent?

@annevk
Copy link
Member

annevk commented Feb 7, 2017

To me it seems fine to leave those be. We could at some point define an API to decode percent-encoded sequences and that could deal with those in various ways, but I'm not sure we need to make life difficult here necessarily. We don't do it elsewhere (e.g., paths, query, fragment).

What's your opinion on the IPv6 question and various solutions?

@achristensen07
Copy link
Collaborator

My implementation tries to parse and canonicalize a host as an ipv6 address before going to the different behavior for special/non-special. That's also what the current spec says. We could disallow [ and ] if it's not a valid ipv6 address. No problem.

@annevk
Copy link
Member

annevk commented Feb 7, 2017

The current specification branches before that point, actually: https://url.spec.whatwg.org/#concept-url-host-parser.

@achristensen07
Copy link
Collaborator

Sure enough. My implementation doesn't. We could make the spec match my implementation :)

@annevk
Copy link
Member

annevk commented Feb 7, 2017

Okay, so your implementation branches right after step 1 of https://url.spec.whatwg.org/#concept-host-parser if I understand you correctly. That would be okay with me as well.

@achristensen07
Copy link
Collaborator

Yep, that's exactly what I did. I think it makes sense to have it there, and that makes all the host parsing in the same place in the spec.

@annevk
Copy link
Member

annevk commented Feb 8, 2017

I adjusted the formatting and placement a bit. Let me know what you think. Are you interested in looking at test changes/new tests as well?

url.bs Outdated

<li><p>Return <var>output</var>.
</ol>

The <dfn>IPv4 number parser</dfn> takes a string <var>input</var> and a
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the <hr> is missing before "IPv4 number parser".

@rmisev
Copy link
Member Author

rmisev commented Feb 8, 2017

It looks like you missed <hr> before "IPv4 number parser".

I can add tests, I think to add these URLs:
non-special://%E2%80%A0/
non-special://H%4fSt/path
non-special://[1:2::3]/
non-special://[1:2::3]:80/
non-special://[:80/

I found one test in the current test suite, which will fail:
sc://%/ (becomes valid after this patch)
So maybe remove it?

@annevk
Copy link
Member

annevk commented Feb 8, 2017

I intentionally grouped the IPv4 parsers together.

Those tests look pretty good. Verifying that IPv6 is normalized would be good as well. And instead of removing the test I would change the pass condition.

@domenic
Copy link
Member

domenic commented Feb 8, 2017

Verified that updating jsdom/whatwg-url according to this patchset passes all the tests in web-platform-tests/wpt#4761 (and doesn't cause any failures).

domenic added a commit to jsdom/whatwg-url that referenced this pull request Feb 8, 2017
annevk pushed a commit to web-platform-tests/wpt that referenced this pull request Feb 8, 2017
@annevk annevk merged commit cdbcce6 into whatwg:master Feb 8, 2017
@annevk
Copy link
Member

annevk commented Feb 8, 2017

Thanks @rmisev!

domenic added a commit to jsdom/whatwg-url that referenced this pull request Feb 8, 2017
domenic added a commit to jsdom/whatwg-url that referenced this pull request Feb 8, 2017
annevk added a commit that referenced this pull request Feb 9, 2017
annevk added a commit that referenced this pull request Feb 10, 2017
annevk added a commit that referenced this pull request Feb 10, 2017
@rmisev rmisev deleted the non-special-url-host-parsing branch February 15, 2017 21:27
@rmisev rmisev mentioned this pull request Mar 15, 2017
TimothyGu added a commit to TimothyGu/node that referenced this pull request Apr 23, 2017
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: nodejs#12523
Fixes: nodejs#10608
Fixes: nodejs#10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
TimothyGu added a commit to nodejs/node that referenced this pull request Apr 24, 2017
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: #12523
Fixes: #10608
Fixes: #10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
evanlucas pushed a commit to nodejs/node that referenced this pull request Apr 25, 2017
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: #12523
Fixes: #10608
Fixes: #10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
evanlucas pushed a commit to nodejs/node that referenced this pull request May 1, 2017
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: #12523
Fixes: #10608
Fixes: #10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
evanlucas pushed a commit to nodejs/node that referenced this pull request May 2, 2017
- Update to spec
  - Add opaque hosts
  - File state did not correctly deal with lack of base URL
  - Cleanup API for file and non-special URLs
  - Allow % and IPv6 addresses in non-special URL hosts
  - Use specific names for percent-encode sets
  - Add empty host concept for file and non-special URLs
  - Clarify IPv6 serializer
- Fix existing mistakes
  - Add missing ':' to forbidden host code point list.
  - Correct IPv4 parser empty label behavior
- Maintain type equivalence in URLContext with spec
  - scheme, username, and password should always be strings
  - host, port, query, and fragment may be strings or null
  - Align scheme state more closely with the spec
  - Make sure the `special` variable is always synced with
    URL_FLAG_SPECIAL.

PR-URL: #12523
Fixes: #10608
Fixes: #10634
Refs: whatwg/url#185
Refs: whatwg/url#225
Refs: whatwg/url#224
Refs: whatwg/url#218
Refs: whatwg/url#243
Refs: whatwg/url#260
Refs: whatwg/url#268
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
rmisev added a commit to upa-url/upa that referenced this pull request May 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants