-
Notifications
You must be signed in to change notification settings - Fork 330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define data: URLs #579
Define data: URLs #579
Conversation
fetch.bs
Outdated
@@ -5630,6 +5621,81 @@ if the script checks that the URL has the right hostname. | |||
|
|||
|
|||
|
|||
<h2 id=data-urls><code>data:</code> URLs</h2> | |||
|
|||
<p>For an informative description of <code>data:</code> URLs see RFC 2397. This section replaces its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>For an informative description of <code>data:</code> URLs see RFC 2397
Probably better to have a comma after URLs:
For an informative description of data:
URLs, see RFC 2397
fetch.bs
Outdated
<h2 id=data-urls><code>data:</code> URLs</h2> | ||
|
||
<p>For an informative description of <code>data:</code> URLs see RFC 2397. This section replaces its | ||
normative processing requirements. [[!RFC2397]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Informative link instead of normative?
fetch.bs
Outdated
<li><p>If <var>body</var> is failure, then return failure. | ||
|
||
<li><p>Remove the code point sequence that is an <a>ASCII case-insensitive</a> match for | ||
"<code>;base64</code>" from <var>mimeType</var>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or ;base64;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that would remove a ";" too many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right. What if there are multiple occurrences of ;base64
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't just remove all of them because then you get into trouble with something like ;base64x;base64
. And it's really hard to figure out what Chrome and Firefox do because their MIME type parser throws away unknown parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess I should block on figuring out MIME types after all, which seems like a massive pain, but oh well.
fetch.bs
Outdated
</ol> | ||
|
||
<li><p>If <var>mimeType</var> starts with an <a>ASCII case-insensitive</a> match for | ||
"<code>charset;</code>", then prepend "<code>text/plain</code>" to <var>mimeType</var>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
;charset
?
The "starts with" part would also break URLs like data:;myownprop=abc;charset=UTF-8,%E4%B8%AD%E6%96%87
and data:;myownprop=abc,stuff
. The latter is technically invalid according to the RFC but works in Chrome and Firefox.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter doesn't preserve anything in Chrome and Firefox though. The former does seem to preserve the charset value somehow, but doesn't work at all in Safari. I'm not sure we need to support it.
fetch.bs
Outdated
equal value. | ||
<!-- See above about a dedicated function. --> | ||
|
||
<li><p>Return <var>mimeTypeBytes</var> and <var>body</var>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return mimeType instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A header value is a byte sequence.
Tests for whatwg/fetch#579.
For use by window.btoa()/window.atob() and soon data: URLs, see whatwg/fetch#579. Much of this text originated in the HTML Standard, see whatwg/html#2920 about removing that.
fetch.bs
Outdated
@@ -5685,7 +5686,14 @@ that RFC's normative processing requirements. [[RFC2397]] | |||
|
|||
<li><p>Set <var>mimeTypeBytes</var> to the <a>isomorphic encode</a> of <var>mimeType</var>. | |||
|
|||
<li><p>Return <var>mimeTypeBytes</var> and <var>body</var>. | |||
<li><p>Let <var>dataURLStruct</var> be a <a>struct</a> consisting of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the struct type should be defined outside the algorithm (like "URL record" currently is) as well as its consituents. Then you can have a single step that says "Return a data URL struct whose MIME type is mimeTypeBytes and whose body is body."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I guess, it seems rather pointless though. Maybe I should just use response directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno, I think people might want to reuse this parser more generally, and I think it's good to have it be standalone from the rest of the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also use a tuple to be super-lightweight on the spec side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need to add indexing syntax to Infra to be able to address tuple entries then. Currently you have to name them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anywya, I guess it's not too much trouble to define a struct upfront.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking "Let (mimeType, body) be the result of running the data URL processor". I guess that's not explicitly defined, but it should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this because it came up in https://docs.google.com/document/d/1FIUk5Y5_VmZ8rqHEsFEoXaysJUCRzf3RbvcyuYpim9c/edit?usp=sharing
Overall makes a lot of sense, just some missing links and questions about whitespace.
fetch.bs
Outdated
<var>dataURL</var> and then runs these steps: | ||
|
||
<ol> | ||
<li><p>Assert: <var>dataURL</var>'s <a for=URL>scheme</a> is "<code>data</code>". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scheme ends up not being linked to anything, is the concept not exported in URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably lowercase URL.
fetch.bs
Outdated
|
||
<li><p>Let <var>encodedBody</var> be the remainder of <var>input</var>. | ||
|
||
<li><p>Let <var>mimeTypeBytes</var> be the <a>string percent decoding</a> of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string percent decoding doesn't get linked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because the corresponding PR in URL hasn't landed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's whatwg/url#340
fetch.bs
Outdated
|
||
<li><p>Let <var>mimeTypeBytes</var> be the <a>string percent decoding</a> of | ||
<var>encodedMimeType</var>. | ||
<!-- Note: implementations leave the percent-encoded bits around. That strikes me as rather broken, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, implementations don't percent decode the mime type at all? Are any of the tests failing everywhere because of this? If so, going with the broken reality seems preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really, nobody wants to put effort in this? I guess I should check again to make sure no implementation does it anywhere, but it seems rather sad and it means you cannot have commas anywhere (although the case for them is rather theoretical I suppose).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having no context on this, I guess I don't see why percent decoding is the default reasonable thing to do. Is it just generally true of "parts of URLs"? This part is constrained by the syntax of mime types, does that even allow for anything that needs to be percent encoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's why you'd percent-decode before you try to parse the MIME type. And yeah, generally in URLs "a" and "%61" would be equivalent and consumers of URLs are expected to treat those equivalently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, a cast that isn't entirely far fetched is data:video/ogg;codecs=theora,vorbis,...
. Even though I don't know of a case where the codecs are needed, it is valid to have a comma in a media type.
No objections to try to spec the more reasonable behavior, but it'd be good if it's isolated to a few tests, so that if they're all still failing in a year or so, reversing course will be a small effort.
fetch.bs
Outdated
|
||
<li><p><a>Strip leading and trailing ASCII whitespace</a> from <var>mimeType</var>. | ||
|
||
<li><p>Let <var>body</var> be the <a>string percent decoding</a> of <var>encodedBody</var>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not linked
fetch.bs
Outdated
|
||
<li> | ||
<p>If <var>mimeType</var> contains an <a>ASCII case-insensitive</a> match for | ||
"<code>;base64;</code>" or ends with an <a>ASCII case-insensitive</a> match for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is no whitespace around "base64" allowed? The URL data:text/html ; base64 ;,YW5uZXZr
works for me on ChromeOS, so tests will be needed for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The silly thing is that data:text/html%20;%20base64%20;,YW5uZXZr
doesn't work, but I also cannot get that to work in other browsers (tested Firefox and Safari).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that not just because of the above issue about percent decoding the media type?
When it comes to whitespace, is there an ideal behavior that we should try to do, or should we just look for what's most web compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideal might relate to how we parse MIME types in general. Ignoring whitespace is what we typically do though so I suppose that makes sense, but will require more of an actual parser.
fetch.bs
Outdated
|
||
<li><p>If <var>body</var> is failure, then return failure. | ||
|
||
<li><p>Remove the code point sequence that is an <a>ASCII case-insensitive</a> match for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it say to remove the first such sequence, given that it could appear again later?
fetch.bs
Outdated
</ol> | ||
|
||
<li><p>If <var>mimeType</var> starts with an <a>ASCII case-insensitive</a> match for | ||
"<code>;charset=</code>", then prepend "<code>text/plain</code>" to <var>mimeType</var>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question of whitespace here as well.
@foolip the bigger problem here is how MIME types in general are parsed. Chrome and Firefox end up normalizing them to some extent, but not in a super logical way. Basically everything unknown gets dropped, |
whatwg/mimesniff#30 is interesting, commented. In order to ultimately resolve https://bugzilla.mozilla.org/show_bug.cgi?id=1389251, can this PR leave the question of MIME type parsing unanswered, and just extract a string that's handed off to something else, or left as a TODO? The most urgent bit to reach interop on is spaces within the body. I'm the babysitter for this issue in https://docs.google.com/spreadsheets/d/11OdLcQs6nr3WQgkzu3kSKOPGnneMbZ8Unung8gE6jm8/edit?usp=sharing and will come back to it again eventually, right now is there anything I can help with to move it along? Tests look like a good start to me already. |
Most of this is about the MIME type. The only spaces that are dropped are within base64-bodies. And determining it's a base64 body depends on how we parse MIME types. MIME types is step 9-16 basically, give or take. I can't just drop that. I've reached out to @bzbarsky to see if he can help set a direction. My personal take is that we should parse MIME types here (and serialize them on output). But rather than dropping unknown parameters as Chrome and Firefox do, we preserve them (except for base64, assuming we can treat that as a parameter of sorts). Not preserving them doesn't seem good enough for a generic utility like |
So observably, the MIME type bit just affects what getResponseHeader("Content-Type") returns for data: URIs with parameters in the MIME type, right? In practice, I expect that including parameters (but not the ;base64 parameter) in there is fairly compatible, in the sense that no one uses those to a first approximation. Looking at the actual spec proposal, I have the following questions:
|
Oh, and the general direction here seems ok to me. |
Tests for whatwg/fetch#579.
The main reason was because otherwise you could not have a MIME type with a comma in it. I don't think Chrome and Safari have proper support for it. They don't normalize
This does not work in Safari, but does work in Chrome. I would prefer it not working.
Something like I think we should either require
Safari never expands (just uses the given value as MIME type) and Chrome and Firefox prepend
Safari does not, Chrome does. I'll go ahead and update the proposed standard and tests per the above. |
That's not allowed anyway, per the syntax for MIME types defined in https://tools.ietf.org/html/rfc2045#section-5.1. At least for the type itself; MIME parameters are a separate issue....
Hmm. In my testing via the URL bar they did. And
I think I would too, if we can figure out a simple way to spec that...
I think the idea behind the Firefox impl is that people might add random garbage to the end of the RFC-specified bits. Might be worth checking the history of why the parsing there is so loose. That said, does Safari basically enforce ";base64" being the very last thing? Just trying to feel out the compat constraints here. |
For MIME type with a comma I was including MIME type parameters. E.g., the Per the web-platform-tests I wrote Safari seems to enforce |
Safari treats |
Having required E.g., I think that's a reasonable model. The only thing I'm unsure about is:
|
Checking back in on this, seeing if I can be of service :)
Per #579 (comment) Firefox already preserves them, so that seems likely to be web compatible. What does Edge do? I was trying things like
That or behaving as if they're the empty string seems fine? What do implementations do? |
@foolip I don't think any browser preserves them at the moment. The way to test it is to fetch the data URL and look at the resulting Implementations throw away "unknown" parameters. However, for charset:
|
OK, so I was testing at the wrong level, simply parsing and serializing data: URLs doesn't do it, but passing throw fetch() would?
That seems weird.
I like that better. Ultimately, though, I'm not at all attached to the details here, as long as we get interop with low compat risk. |
Right. |
This is now blocked on whatwg/mimesniff#30, right? |
Yeah, which is blocked on whatwg/mimesniff#39. In particular, whether we can still turn MIME types into a sound system consisting of parse/model/serialize or whether it needs to be as ad-hoc as it is in implementations. |
aea5763
to
cedd3f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % nit, thanks Anne!
fetch.bs
Outdated
@@ -6095,7 +6094,13 @@ that RFC's normative processing requirements to be compatible with deployed cont | |||
|
|||
<li><p>If <var>body</var> is failure, then return failure. | |||
|
|||
<li><p>Remove the last 7 code points from <var>mimeType</var>. | |||
<li><p>Remove trailing U+0020 SPACE <a>code points</a> from <var>mimeType</var>, if any. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it'd be worth pointing out that just not implementing these steps would have no observable effect, since the MIME parser will ignore it? Or if it can make a difference (didn't check deeply) pointing that out, to avoid someone making that optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds like a good idea. I mainly clean up here in case we want to add some kind of error reporting to the MIME type parser later.
The tests don't seem to be updated, but I implemented this in my JS implementation and only got a couple test failures with the old tests, so at least it didn't totally break everything :). I'll comment over there with the details. |
Thanks! The main reason I had not updated the tests yet is because I was not quite sure this was all correct, and indeed we missed something. We do a starts with ; check that would fail if mimeType starts with spaces. Fixed now. |
@dholbert kindly clarified this in webcompat/web-bugs#8733 (comment), apparently the behavior is different for base64 and non-base64. Curious, but I don't think it changes the conclusion for this issue. |
Tests and spec-as-reified-in-JS match, so LGTM! |
fetch.bs
Outdated
<p><a>Strip leading and trailing ASCII whitespace</a> from <var>mimeType</var>. | ||
|
||
<p class="note">This will only remove U+0020 <a>code points</a>, if any. | ||
|
||
<li><p>If <var>position</var> is past the end of <var>input</var>, then return failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no tests that hit this condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I added some.
Sigh.
is not actually redundant due to the next step, checking if mimeType starts with ";". I'm going to remove that note. |
Commit message:
|
I guess I'll wait with landing another day since the above change was not anticipated by, e.g., @foolip. And I'd rather land together with the tests and Travis isn't green yet for those. The only thing that remains to be done is filing bugs against non-Firefox browsers. |
Oh, I see. As long as some test exercises that distinction, this is fine with me then. |
Yeah, looking through it quite a few tests cover this and I think originally those tests went the other way even, and when @domenic pointed out they were wrong I didn't realize it was because of this. So we're all good then. |
Finally! Celebration! Exclamation! @annevk, do you know of existing browser bugs we should hang these changes onto, or file new bugs? |
@foolip see #234 (comment). |
Unfortunately RFC 2397 has some ambiguities and implementations never really followed it in detail. So here's an attempt to define the processing model more clearly and get implementations aligned.
Tests: web-platform-tests/wpt#6890.
Fixes #234.
TODO:
window.atob()
to make use of the new base64 algorithm wherever we decide to locate it.Preview | Diff