Content-Type parsing (MIME type parsing) #30
Comments
What are all of the contexts that MIME type-like things are parsed? Ones that I know of:
Are there others? It's probably not the same parser used in all contexts. |
It's all rather messy indeed.
|
html/semantics/embedded-content/the-img-element/update-the-source-set.html has evidence (and tests!) that "broken" MIME types such as |
If it helps, I put this wiki page together back in the day: |
I also tried to assemble a list of source code locations related to MIME sniffing: |
Note that Content-Type header parsing also depends. See https://bugzilla.mozilla.org/show_bug.cgi?id=1210302 where we ended up needing to use different parsers for the request and response Content-Type headers. |
Firefox:
It would be good to verify those algorithms against other browsers somehow. @mikewest do you have insight as to what Chrome does? |
@sleevi is a better resource for the network stack's charset sniffing mechanisms. If he's too busy at BlinkOn this week, I'll dig around for a link. :) |
Firefox appears to use http://searchfox.org/mozilla-central/source/netwerk/base/nsURLHelper.cpp#1033 for request |
No, it would not. @annevk, why do you think it would? |
My bad, it does indeed skip quoted strings. |
@annevk It sounds like Chrome is equally in a weird place :) We have We have a separate parser for response For the Separately, for things like I'm sure we could add another dozen or so MIME type parsers before we start refactoring ;) For MIME sniffing in particular, the determination of whether or not to sniff is using the @mikewest Is that what you were hoping for? :) |
Rendered as
Rendered as text/plain: Chrome, Firefox That means there's already inconsistency between multiple headers and combined headers. I was led to believe that was only the case for cookies. Hopefully we can still make it so somehow. |
Both rendered as text/html: Chrome, Firefox, Safari |
If I reverse the order and list
If I list (It's not entirely clear to me why Chrome special cases |
Given the lack of interoperability it's not clear to me that the complicated |
Unfortunately Edge is not as consistent in its download story as I hoped.
Chrome, Safari: GBK
Chrome: GBK |
So far it sounds to me like Chrome and Firefox consistently treat:
identically to:
though they may not always agree on how the latter is treated. For Firefox this is expected, because the HTTP response header type in Firefox has no representation for "repeated headers" last I checked; they get turned into a comma-separated list. So the second form is all that the rest of the system sees. I can't speak for how Chrome handles this. Safari's behavior for the It's possible that Firefox treats |
On the topic of header parsing, I think the way to go would be to always combine (except for Set-Cookie) and then define all the parsers for the combined value. Then how you deal with quotes and such becomes a question on a per-header-parser basis. I.e., the As for |
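A minimal sketch of the "always combine, except for Set-Cookie" idea, assuming headers arrive as a list of (name, value) pairs in wire order. The function name `combine_headers` and the return shape are hypothetical and not taken from any browser:

```python
def combine_headers(headers):
    """Combine repeated header fields into a single value per name.

    `headers` is a list of (name, value) pairs in wire order.
    Set-Cookie values are returned separately and never combined.
    """
    combined = {}
    cookies = []
    for name, value in headers:
        key = name.lower()  # header names are case-insensitive
        if key == "set-cookie":
            cookies.append(value)
        elif key in combined:
            combined[key] = combined[key] + ", " + value
        else:
            combined[key] = value
    return combined, cookies


combined, cookies = combine_headers([
    ("Content-Type", "text/html"),
    ("Set-Cookie", "a=1"),
    ("content-type", "text/plain"),
    ("Set-Cookie", "b=2"),
])
```

Each per-header parser would then only ever see the combined string, so quote handling becomes that parser's problem.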
Edge: from the tests I wrote, when the headers are separate it picks the first one. When the headers are combined, it tries to parse their entire value as a MIME type, not giving the comma any special consideration. Safari: when the headers are separate it picks the last one (matching Chrome/Firefox), but does not carry over charset when the essence matches (not matching Chrome/Firefox). Downloads / (even without parameters). (I think the earlier comment got misled by using |
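To make the "last one wins, but charset carries over when the essence matches" behavior described for Chrome/Firefox concrete, here is a rough sketch. `parse_simple` is a deliberately tiny stand-in parser, and skipping values that fail to parse is an assumption of this sketch, not something established above:

```python
def parse_simple(value):
    """Tiny stand-in parser: returns (essence, charset) or None."""
    parts = [p.strip() for p in value.split(";")]
    essence = parts[0].lower()
    type_, slash, subtype = essence.partition("/")
    if not (type_ and slash and subtype):
        return None
    charset = None
    for param in parts[1:]:
        name, eq, val = param.partition("=")
        if eq and name.strip().lower() == "charset" and charset is None:
            charset = val.strip().strip('"')
    return essence, charset


def pick_mime_type(values):
    """Return (essence, charset) after walking all values in order."""
    essence = None
    charset = None
    for value in values:
        parsed = parse_simple(value)
        if parsed is None:
            continue  # assumption: unparsable values are ignored
        new_essence, new_charset = parsed
        if new_essence != essence:
            # Different essence: the later value replaces everything.
            essence, charset = new_essence, new_charset
        elif new_charset is not None:
            charset = new_charset
        # Same essence without a charset: the earlier charset is kept.
    return essence, charset
```

With this sketch, `pick_mime_type(["text/html;charset=gbk", "text/html"])` keeps the `gbk` charset, while a later `text/plain` would drop it. The Safari behavior described above would correspond to never carrying the charset over.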
Merging and handling commas on a per-header basis works, though I'd be worried about more breakages. And, at least in Chrome, it would require rewriting basically every handler, possibly significantly, just for the comma case - something that I'm worried will not end up happening. |
@MattMenke2 - header folding can be done the same way for all fields, as required by the HTTP spec. The only exception is Set-Cookie, as described in https://greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.2.2. That said, parsing of these values needs to be specific to the header field, as it requires knowledge of the syntax of individual list elements. You can't split on "," due to commas appearing in the literal value. |
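A small illustration of why a blind split on "," goes wrong. The header value is made up for the example (a quoted parameter value containing a comma), and `naive_split` is a hypothetical helper:

```python
def naive_split(value):
    """Split on every comma, ignoring quoted strings (deliberately wrong)."""
    return [part.strip() for part in value.split(",")]


# Hypothetical combined value: two list elements, the first of which
# carries a quoted parameter value with a comma inside it.
value = 'text/html;charset="a,b", text/plain'
parts = naive_split(value)
# parts == ['text/html;charset="a', 'b"', 'text/plain']
```

The quoted string is cut in half, so the result has three elements instead of two.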
For stylesheet loading (initial tests in web-platform-tests/wpt#13144) it seems that Firefox treats a |
Chrome treats " |
My preferred way of dealing with If you wanted a more efficient approach to this you would not even have to combine first, but you do have to parse all of them at once. So as you reach the end of the first header value you'd simply continue with the second header value while retaining all the current state of the parser (e.g., the do not split flag). The "controversial" case here is that this would mean that Aside from web-platform-tests/wpt#10525 I'll write some more tests for (Note that the request cc @asankah |
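A rough sketch of the "parse all of them at once" variant, where parser state survives the boundary between one header value and the next. The names are hypothetical, and treating the boundary exactly like the ", " a combining step would insert is an assumption of this sketch:

```python
def split_values(header_values):
    """Split a sequence of header values into list items.

    Commas inside quoted strings do not split. The `in_quotes` state is
    retained across header values, so joining them first is unnecessary.
    """
    items = []
    current = []
    in_quotes = False
    escaped = False
    for index, value in enumerate(header_values):
        if index > 0:
            if in_quotes:
                # Still inside a quoted string: keep going, as if ", "
                # had been part of the quoted text.
                current.extend(", ")
                escaped = False
            else:
                items.append("".join(current).strip(" \t"))
                current = []
        for ch in value:
            if in_quotes:
                current.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_quotes = False
            elif ch == '"':
                in_quotes = True
                current.append(ch)
            elif ch == ",":
                items.append("".join(current).strip(" \t"))
                current = []
            else:
                current.append(ch)
    items.append("".join(current).strip(" \t"))
    return items
```

In this sketch, feeding the values one by one gives the same items as joining them with ", " and parsing the result as a single value, including when a quoted string is left open at the end of one value.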
I added tests for the |
Sadly, not too surprising - Chrome's network stack has a lot of behaviors modeled after Firefox, while it uses a forked WebKit as its renderer, resulting in inheriting different behavior from different browsers, depending on where the code is running. |
Also known as "extract a MIME type" done right. Tests: web-platform-tests/wpt#10525. Helps with #814. Fixes #529. Closes whatwg/mimesniff#30.
I put up an initial patch for this at whatwg/fetch#831. Relative to #30 (comment) it defines how splitting works (in a way that can be reused across different headers) and it deals with parsing a MIME type being able to return failure (I also added tests for this). I still need to add tests and possibly adjust the prose for these values: the empty string, " There is some potential for simplification here I suppose given that browsers are different, but unless we can simply use the first value always in all contexts it's unclear how much that'll buy us (the weird |
Empty string is treated the same as failure as far as I can tell. (I added tests.) I'm not sure about |
Firefox has had no need for these. It seems less weird behavior is better. See #30 for context.
Oops. Sorry I missed this thread. One problem with existing grammars is that splitting on E.g.: From RFC 7235 § 4.1:
None of the commas in this header delineate values. Instead they all delineate parameters. A contrived but valid example is below where the first and last
*Edited for accuracy. |
@asankah thanks for chiming in. It seems to me you could still split on |
@annevk Meaning the second example would be treated as being equivalent to
? Thus the parser for authentication challenges would need to associate all immediately following authentication headers that match It is a possibility :-) , as is adding lookaheads. Though it has the property that the meaning of a header depends on headers that follow. (Apologies if I'm misinterpreting your suggestion). |
Yes, that's exactly what I mean, and that would also not make eagerly combining intermediaries yield different results in the end. |
Also known as "extract a MIME type" done right. Tests: web-platform-tests/wpt#10525. Helps with #814. Fixes #529. Closes whatwg/mimesniff#30.
XMLHttpRequest is expected to default to text/xml when the Content-Type fails to parse. But some web platform tests covering this expectation were failing (xhr/responsexml-media-type.htm).

When the server returns a response with, for instance, 'Content-Type: bogus' or 'Content-Type: application', Chromium wasn't setting the correct default 'text/xml'. The issue is that XHR should have considered 'bogus' and 'application' invalid MIME types. According to https://mimesniff.spec.whatwg.org/#parse-a-mime-type they are invalid. In fact, both Firefox and Safari are passing these WPT test cases.

The Chromium codebase has multiple MIME type parsers with varying behaviors about what is and what isn't a valid MIME type. This is an interesting discussion about the subject: whatwg/mimesniff#30 (comment).

The idea of this CL is to update the Content-Type parsing logic that happens in XHR to use one of the other existing parsers. I believe we should avoid creating a new one, since there are multiple already. Looking at the options, there's HttpUtil::ParseContentType, which seems to be the one that best implements the spec. But it is often used in the context of request header parsing. In //net/base/mime_util there are two parsers that are used in the parsing of response Content-Type. The proposal here is to move //services/network/public/cpp/cors ExtractMIMETypeFromMediaType (which uses //net/base/mime_util) to //net/base/mime_util and use it from Blink's XHR.

One caveat is that XHR is currently parsing multiple Content-Type values in the same string. It simply reads the value before the first comma. So I'm adding a flag to the moved net::ExtractMIMETypeFromMediaType to toggle this behavior on, but still leaving the other places without it, since it's not part of the spec.

Bug: 1053973
Change-Id: I8b27712aea30e2365e84886ffe2f7d4b251a4acf
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4241657
Reviewed-by: Kenichi Ishibashi <bashi@chromium.org>
Reviewed-by: Yoav Weiss <yoavweiss@chromium.org>
Commit-Queue: Yoav Weiss <yoavweiss@chromium.org>
Commit-Queue: Kenichi Ishibashi <bashi@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1112098}
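As a rough illustration of why 'bogus' and 'application' count as invalid, here is a simplified sketch of the type/subtype part of "parse a MIME type", plus the XHR fallback to text/xml described in the commit message. It ignores parameters entirely, and the helper names are hypothetical and unrelated to the Chromium functions named above:

```python
import re

# HTTP token code points (the `tchar` set from RFC 7230).
_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")
_HTTP_WHITESPACE = " \t\r\n"


def parse_essence(value):
    """Return the lowercased "type/subtype", or None on failure."""
    value = value.strip(_HTTP_WHITESPACE)
    type_, slash, rest = value.partition("/")
    if not slash or not _TOKEN.fullmatch(type_):
        return None  # no "/" at all, or an invalid type
    subtype = rest.split(";", 1)[0].rstrip(_HTTP_WHITESPACE)
    if not _TOKEN.fullmatch(subtype):
        return None  # empty or invalid subtype
    return f"{type_}/{subtype}".lower()


def xhr_response_mime_type(content_type):
    """Hypothetical helper: fall back to text/xml when parsing fails."""
    if content_type is None:
        return "text/xml"
    return parse_essence(content_type) or "text/xml"
```

Both `xhr_response_mime_type("bogus")` and `xhr_response_mime_type("application")` fall back to text/xml because neither contains a "/", while a value with a trailing semicolon such as "text/html;" still parses.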
I looked into MIME type parsing to figure out how to make progress with whatwg/fetch#579 and httpwg/http-core#33. However, it doesn't seem like there's much interoperability or good places to start.
For instance the following decodes as UTF-8 in Chrome and Firefox, but windows-1252 in Edge and Safari (inspired by http://searchfox.org/mozilla-central/rev/4b79f3b23aebb4080ea85e94351fd4046116a957/netwerk/base/nsURLHelper.cpp#957):
Only Chrome and Firefox have a modicum of MIME type validation happening for data: URLs, but even that's rather limited and broken (e.g., unknown parameters get dropped, but image/gif;charset=x is fine).

It seems that anything here would have to be quite forgiving to maintain the status quo of not bailing if a MIME type is invalid (i.e., treat "text/html;" as "text/html" and not as an error), but there's also quite some flexibility. And then there's the question of whether strings need to be simply passed through to Blob and such or if there should be some validation step to normalize input (Chrome and Firefox appear to lowercase all input there).