-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: allow percent encoded code points in the non-special URLs host #218
Conversation
…al URL URL-host parser fix: decode percent encoded code points before the forbidden host code points check
@achristensen07, this is a good reason to allow "%" for opaque hosts. Though we could also go this route and attempt to decode first. |
I think my preferred fix here would be to simply not forbid "%" by special casing it in the URL-host parser. |
I agree, it makes sense not to decode. Then more code points can be removed from forbidden list:
Now forbidden list for opaque hosts can be as short as: U+0000, U+0020, " |
See also #214 for the previous discussion regarding this. |
In particular I think we should continue to include the code points you cannot reach to avoid issues down the road. Although potentially we could point out that you cannot reach them and that they are therefore excluded. That would be equally clear. @achristensen07 any thoughts here? |
I think must be excluded the " |
Yeah, that makes sense. It's unfortunate those schemes have to write their own IPv4/6 parser/validation though. Perhaps we need to expose host parsing separately as well to JavaScript at some point to make that easier. |
A host parsing JS API would be welcome and would definitely be something worth exploring. |
@achristensen07 ping, could you please have a look at the proposed change? Safari is the only browser that supports hosts for non-special URLs so your review would be most welcome. |
@achristensen07 to clarify, the change basically makes it okay to use "%", "[", "]", and ":" (within "[" and "]") in opaque hosts. That way input such as |
I further investigate this problem and I think a very basic IPv6 check can help here (change URL-host parser 2 step to):
This disallows such weird URLs: And term "forbidden opaque host code point" (304f0ae) will not be required anymore. If you agree, I can prepare a new commit. |
That alternative seems reasonable too, but I think it's the best use of our time to wait for @achristensen07 to give input with regards to WebKit's preferences. |
We definitely need to allow percent-encoded values in non-special URLs to remain, but we need to do something about invalid percent-encoded values. notspecial://H%4fSt/path is fine but notspecial://H%4gSt/path is not. In such a case, should we percent-encode the percent? |
To me it seems fine to leave those be. We could at some point define an API to decode percent-encoded sequences and that could deal with those in various ways, but I'm not sure we need to make life difficult here necessarily. We don't do it elsewhere (e.g., paths, query, fragment). What's your opinion on the IPv6 question and various solutions? |
My implementation tries to parse and canonicalize a host as an ipv6 address before going to the different behavior for special/non-special. That's also what the current spec says. We could disallow [ and ] if it's not a valid ipv6 address. No problem. |
The current specification branches before that point, actually: https://url.spec.whatwg.org/#concept-url-host-parser. |
Sure enough. My implementation doesn't. We could make the spec match my implementation :) |
Okay, so your implementation branches right after step 1 of https://url.spec.whatwg.org/#concept-host-parser if I understand you correctly. That would be okay with me as well. |
Yep, that's exactly what I did. I think it makes sense to have it there, and that makes all the host parsing in the same place in the spec. |
I adjusted the formatting and placement a bit. Let me know what you think. Are you interested in looking at test changes/new tests as well? |
url.bs
Outdated
|
||
<li><p>Return <var>output</var>. | ||
</ol> | ||
|
||
The <dfn>IPv4 number parser</dfn> takes a string <var>input</var> and a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the <hr>
is missing before "IPv4 number parser".
It looks like you missed I can add tests, I think to add these URLs: I found one test in the current test suite, which will fail: |
I intentionally grouped the IPv4 parsers together. Those tests look pretty good. Verifying that IPv6 is normalized would be good as well. And instead of removing the test I would change the pass condition. |
Verified that updating jsdom/whatwg-url according to this patchset passes all the tests in web-platform-tests/wpt#4761 (and doesn't cause any failures). |
Thanks @rmisev! |
- Update to spec - Add opaque hosts - File state did not correctly deal with lack of base URL - Cleanup API for file and non-special URLs - Allow % and IPv6 addresses in non-special URL hosts - Use specific names for percent-encode sets - Add empty host concept for file and non-special URLs - Clarify IPv6 serializer - Fix existing mistakes - Add missing ':' to forbidden host code point list. - Correct IPv4 parser empty label behavior - Maintain type equivalence in URLContext with spec - scheme, username, and password should always be strings - host, port, query, and fragment may be strings or null - Align scheme state more closely with the spec - Make sure the `special` variable is always synced with URL_FLAG_SPECIAL. PR-URL: nodejs#12523 Fixes: nodejs#10608 Fixes: nodejs#10634 Refs: whatwg/url#185 Refs: whatwg/url#225 Refs: whatwg/url#224 Refs: whatwg/url#218 Refs: whatwg/url#243 Refs: whatwg/url#260 Refs: whatwg/url#268 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
- Update to spec - Add opaque hosts - File state did not correctly deal with lack of base URL - Cleanup API for file and non-special URLs - Allow % and IPv6 addresses in non-special URL hosts - Use specific names for percent-encode sets - Add empty host concept for file and non-special URLs - Clarify IPv6 serializer - Fix existing mistakes - Add missing ':' to forbidden host code point list. - Correct IPv4 parser empty label behavior - Maintain type equivalence in URLContext with spec - scheme, username, and password should always be strings - host, port, query, and fragment may be strings or null - Align scheme state more closely with the spec - Make sure the `special` variable is always synced with URL_FLAG_SPECIAL. PR-URL: #12523 Fixes: #10608 Fixes: #10634 Refs: whatwg/url#185 Refs: whatwg/url#225 Refs: whatwg/url#224 Refs: whatwg/url#218 Refs: whatwg/url#243 Refs: whatwg/url#260 Refs: whatwg/url#268 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
- Update to spec - Add opaque hosts - File state did not correctly deal with lack of base URL - Cleanup API for file and non-special URLs - Allow % and IPv6 addresses in non-special URL hosts - Use specific names for percent-encode sets - Add empty host concept for file and non-special URLs - Clarify IPv6 serializer - Fix existing mistakes - Add missing ':' to forbidden host code point list. - Correct IPv4 parser empty label behavior - Maintain type equivalence in URLContext with spec - scheme, username, and password should always be strings - host, port, query, and fragment may be strings or null - Align scheme state more closely with the spec - Make sure the `special` variable is always synced with URL_FLAG_SPECIAL. PR-URL: #12523 Fixes: #10608 Fixes: #10634 Refs: whatwg/url#185 Refs: whatwg/url#225 Refs: whatwg/url#224 Refs: whatwg/url#218 Refs: whatwg/url#243 Refs: whatwg/url#260 Refs: whatwg/url#268 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
- Update to spec - Add opaque hosts - File state did not correctly deal with lack of base URL - Cleanup API for file and non-special URLs - Allow % and IPv6 addresses in non-special URL hosts - Use specific names for percent-encode sets - Add empty host concept for file and non-special URLs - Clarify IPv6 serializer - Fix existing mistakes - Add missing ':' to forbidden host code point list. - Correct IPv4 parser empty label behavior - Maintain type equivalence in URLContext with spec - scheme, username, and password should always be strings - host, port, query, and fragment may be strings or null - Align scheme state more closely with the spec - Make sure the `special` variable is always synced with URL_FLAG_SPECIAL. PR-URL: #12523 Fixes: #10608 Fixes: #10634 Refs: whatwg/url#185 Refs: whatwg/url#225 Refs: whatwg/url#224 Refs: whatwg/url#218 Refs: whatwg/url#243 Refs: whatwg/url#260 Refs: whatwg/url#268 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
- Update to spec - Add opaque hosts - File state did not correctly deal with lack of base URL - Cleanup API for file and non-special URLs - Allow % and IPv6 addresses in non-special URL hosts - Use specific names for percent-encode sets - Add empty host concept for file and non-special URLs - Clarify IPv6 serializer - Fix existing mistakes - Add missing ':' to forbidden host code point list. - Correct IPv4 parser empty label behavior - Maintain type equivalence in URLContext with spec - scheme, username, and password should always be strings - host, port, query, and fragment may be strings or null - Align scheme state more closely with the spec - Make sure the `special` variable is always synced with URL_FLAG_SPECIAL. PR-URL: #12523 Fixes: #10608 Fixes: #10634 Refs: whatwg/url#185 Refs: whatwg/url#225 Refs: whatwg/url#224 Refs: whatwg/url#218 Refs: whatwg/url#243 Refs: whatwg/url#260 Refs: whatwg/url#268 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com> Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
After 3036255 introduced the URL-host parser rejects percent encoded code points in the non-special URLs host (because "%" is forbidden host code point).
The situation is paradoxical, because parser himself percent encodes some code points, actually producing invalid serialized URLs according to the current standard.
URL's from #148 to test issue:
unknown://%E2%80%A0/
notspecial://H%4fSt/path
Solution
I suggest that forbidden host code points be checked on percent decoded code points. The decoded data will be used only for this purpose, so the output will not be affected by this change.
Also this is a closer to special URLs host validity check.