Revamp MIME type section #36
This contains just parsing MIME types. I haven't actually changed everything yet since I figured I'd ask for some feedback first. My idea was to align this with URL. So we have MIME types, which are also known as MIME type records. And MIME type strings, which serve as input (maybe I'll define that part later, I'd like to focus on implementation requirements initially). We should probably also have some byte sequence entry points for most of these, but since going from bytes to strings is easy with isomorphic decode I thought making the main parser string-based is best. Thoughts?
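For readers unfamiliar with the term, isomorphic decode maps each byte to the code point with the same numeric value, so a byte-based entry point can be a thin wrapper around a string-based parser. A minimal Python sketch, assuming the `latin-1` codec as a stand-in for that mapping; the function names are illustrative and not taken from the pull request:

```python
def isomorphic_decode(data: bytes) -> str:
    """Map each byte to the code point with the same value (U+0000..U+00FF)."""
    # Python's latin-1 codec is exactly this one-to-one byte/code point mapping.
    return data.decode("latin-1")


def isomorphic_encode(text: str) -> bytes:
    """Inverse mapping; only defined when every code point is <= U+00FF."""
    if any(ord(ch) > 0xFF for ch in text):
        raise ValueError("input contains a code point above U+00FF")
    return text.encode("latin-1")


header_bytes = b"text/html;charset=\xe9"
header_string = isomorphic_decode(header_bytes)
# The round trip is lossless for any byte sequence.
assert isomorphic_encode(header_string) == header_bytes
```

The encode direction needs the range check, which corresponds to the "necessary asserts on the input" a byte serializer would require.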
I like the general plan of aligning with URL. Although the spec is already like that, just a bit dated, right? (E.g. using multiple return values instead of a struct.) I'd like to hear more justification for using strings as input, instead of bytes. What call sites use which? I'm not aware of how many call sites would use this parsing algorithm, so it's hard to judge. Bytes seems more correct though at first impression. If you do go with strings, it should be stated to be a JS string, I think. The current spec has XXX boxes that are, as far as I can tell, designed to limit certain things to 127 bytes. Those have disappeared, but it seems important to do some testing there, or at least preserve some kind of XXX box.
Indeed, I believe some spec somewhere (an RFC, maybe?) had a limit of 127 bytes, but I questioned whether it was actually enforced by any implementation.
I also don't think we should state a specific string type, this works with all of them after all. Why require casts?
You're right, I forgot that our only two string types were JS string and SV string. I'm still not convinced this should be strings instead of bytes though. The lack of utilities for parsing/manipulating byte sequences is fixable. I'd like to get an accounting of the call sites before we decide one way or another. The XHR overrideMimeType() example is interesting precisely because I'm unsure how it should behave given that it operates on a DOMString (instead of a ByteString). What bytes do we actually transmit over HTTP? It seems like the spec right now sometimes stores a string in the "override MIME type" field, e.g. overrideMimeType() step 3, but sometimes stores a byte sequence, e.g. overrideMimeType() step 2. Your version not only accepts strings as inputs, but also creates them as outputs in the MIME type struct, which seems quite bad if we're eventually sending these over the wire?
And to be clear, I do intend to provide a "parse a MIME type from bytes" and "serialize a MIME type to bytes" for those cases, with the necessary asserts on the input for the latter of the two.
The (supposed) 127-byte limit was on the individual portions (type, subtype, parameter name, parameter value), not the overall MIME type.
From some testing in #39 I think that we do need to preserve whitespace around separators. For charset normalization, bogus charsets should turn into UTF-8. For charset=utf-8, it's debatable. It's plausible that stuff could depend on both the encoding being untouched and it being normalized. I think I have a slight preference to normalize it, since that is what
We should test what happens with non-ASCII characters in the various segments of the MIME type; the spec seems to just ASCII lowercase the strings and pass them through. I wonder if that's how browsers treat them. And I wonder if browsers treat them differently when given HTTP header bytes or other byte-accepting entry points vs. XHR overrideMimeType or other string-accepting APIs.
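To make the "ASCII lowercase and pass through" point concrete: ASCII lowercasing only touches A to Z, so non-ASCII code points survive unchanged, unlike a Unicode-aware lowercase. A small Python sketch; the helper name is illustrative, and this says nothing about what browsers actually do:

```python
def ascii_lowercase(text: str) -> str:
    """Lowercase only U+0041..U+005A; every other code point passes through."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
        for ch in text
    )


subtype = "HT\u00c9ML"  # contains U+00C9 (É), a non-ASCII code point

# ASCII lowercase leaves É untouched.
assert ascii_lowercase(subtype) == "ht\u00c9ml"
# Python's Unicode-aware str.lower() also maps É to é.
assert subtype.lower() == "ht\u00e9ml"
```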
As far as I can tell the proposed spec skips whitespace after the = but collects it before the = sign. http://httpwg.org/specs/rfc7231.html says no whitespace is allowed. We should test what browsers do in such scenarios if you haven't already. If you have, adding a note about this potentially-confusing mismatch would be good. Both that it mismatches the RFC, and that it treats whitespace before the = sign differently than whitespace after it.
It doesn't mismatch the RFC any more than anything else that the RFC would say is invalid and is simply consumed as part of the token here, no? (There are tests for this already and browser bugs have been filed, see #34.)
Well, I don't see anything about consuming as part of a token in the RFC. But as far as I can tell the RFC has I see you tested spaces before the = sign, but did you test spaces after?
How is it ignored before the = sign? I strip it after currently, but that should be dropped as the Encoding Standard handles that already for encodings.
Sorry, I got confused between my two posts. You ignore it after the = sign, but don't ignore it before the = sign. #34 has tests for before the = sign, which I guess is what led you to the behavior of not skipping whitespace there. I was wondering if there are any tests for after the = sign, which would help decide on spaces or no there. It'd be ideal if there were a way of testing this independent of encoding handling, but I guess there is not in browsers today.
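The asymmetry being discussed can be sketched as follows. This is only an illustration of the behavior as described in this thread (whitespace kept before the = sign, skipped after it), not the algorithm text from the pull request; `split_parameter` is a hypothetical helper that assumes the parameter segment has already been isolated between semicolons:

```python
HTTP_WHITESPACE = " \t\n\r"


def split_parameter(segment: str) -> tuple:
    """Split one 'name=value' segment the way the thread describes.

    Whitespace before '=' stays part of the name; whitespace after '='
    is skipped before the value is collected.
    """
    name, _, value = segment.partition("=")
    return name, value.lstrip(HTTP_WHITESPACE)


# Space before '=' is kept, so the name is "charset " and would not match "charset".
assert split_parameter("charset =utf-8") == ("charset ", "utf-8")
# Space after '=' is skipped, so the value is plain "utf-8".
assert split_parameter("charset= utf-8") == ("charset", "utf-8")
```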
If browsers agree with all the proposed changes we could test it through data URLs and XMLHttpRequest, but that would first require browsers to actually start storing unknown parameters and such. So yeah, not in today's browsers.
Well, something to keep in mind at least; it'd be nice to write such tests and ask browsers to follow that model. But you're more in touch with whether that's realistic than I am.
I already wrote such a test: web-platform-tests/wpt#6890 (review).
mimesniff.bs
Outdated
<li>
Enter loop <var>M</var>:
<p>If the current <a>code point</a> in <var>input</var> is U+0022 ("), then advance
This part accepts (and ignores) unterminated quoted strings, such as 'text/plain; charset="utf-8' or 'text/plain; charset="utf-8; param2=value2'. I think it's too lax, what do you think? Is it OK to fail parsing in such cases?
Hmm, it seems your design principle is to let the parser parse type and subtype whenever they are sane (i.e., regardless of the "parameter" section). Is that right? Then this behavior may be OK...
When testing, I found that only Safari would not treat text/html;charset="gbk as GBK. Therefore I went with the majority. But yes, nobody returned failure and started downloading such a resource, so I don't think we can start with that now.
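A rough sketch of the lenient behavior under discussion, where a quoted parameter value that never closes is still accepted and runs to the end of the input. The function name and the backslash-escape handling are assumptions made for illustration, not the steps from the diff:

```python
def collect_quoted_value(text: str, position: int) -> tuple:
    """Collect a quoted value starting at text[position] == '"'.

    Returns (value, new_position). A missing closing quote is not an
    error: the value simply extends to the end of the input.
    """
    assert text[position] == '"'
    position += 1
    value = []
    while position < len(text):
        ch = text[position]
        position += 1
        if ch == '"':
            break  # closing quote found
        if ch == "\\":
            if position >= len(text):
                value.append("\\")  # trailing backslash is kept literally
                break
            value.append(text[position])  # escaped code point
            position += 1
        else:
            value.append(ch)
    return "".join(value), position


# Unterminated quote: the value is still "gbk".
assert collect_quoted_value('"gbk', 0) == ("gbk", 4)
# Everything after the opening quote, including "; param2=value2", is swallowed.
assert collect_quoted_value('"utf-8; param2=value2', 0)[0] == "utf-8; param2=value2"
```

The second case shows the cost of leniency raised in the review: a later parameter is silently absorbed into the unterminated value, while type and subtype still parse.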
I think "exclude parameters" on the serializer is useless because we already have "essence". Instead of saying e.g. "Let result be mimeType serialized without parameters", you'd just say "Let result be mimeType's essence".
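To illustrate why a separate "exclude parameters" flag is redundant, here is a hedged Python sketch of a MIME type record. The field names, the quoting rule, and the `serialize` method are assumptions for illustration rather than the spec's definitions:

```python
import re
from dataclasses import dataclass, field

# Characters allowed in an unquoted token (illustrative pattern).
_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


@dataclass
class MimeType:
    type: str
    subtype: str
    parameters: dict = field(default_factory=dict)

    @property
    def essence(self) -> str:
        """Type and subtype only; no parameters."""
        return f"{self.type}/{self.subtype}"

    def serialize(self) -> str:
        """Essence followed by every parameter, quoting values when needed."""
        out = self.essence
        for name, value in self.parameters.items():
            if not _TOKEN.fullmatch(value):
                # Escape quotes and backslashes, then wrap in quotes.
                value = '"' + re.sub(r'(["\\])', r"\\\1", value) + '"'
            out += f";{name}={value}"
        return out


mime = MimeType("text", "html", {"charset": "gbk"})
assert mime.essence == "text/html"
assert mime.serialize() == "text/html;charset=gbk"
```

Callers that want the parameter-free form read `essence` directly, so the serializer needs no extra option.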
This follows whatwg/mimesniff#58 by referencing the definitions for JavaScript and JSON MIME type that now live in MIME Sniffing. It also follows whatwg/mimesniff#36 by using the terms "valid MIME type string" and "valid MIME type string without parameters" instead of their non-string counterparts that previously appeared.
This follows whatwg/mimesniff#58 by referencing the definitions for JavaScript and JSON MIME type that now live in MIME Sniffing. It also follows whatwg/mimesniff#36 by using the terms "valid MIME type string" and "valid MIME type string without parameters" instead of their non-string counterparts that previously appeared. Finally, it updates the terms "explicitly supported XML/JSON type" to include the word "MIME", like other MIME type group definitions now do.
…ng, a=testonly Automatic update from web-platform-tests. Adjust XMLHttpRequest Content-Type handling. See whatwg/xhr#176 and whatwg/mimesniff#36. -- wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4 wpt-pr: 8422
…ng, a=testonly Automatic update from web-platform-tests. Adjust XMLHttpRequest Content-Type handling. See whatwg/xhr#176 and whatwg/mimesniff#36. -- wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4 wpt-pr: 8422 UltraBlame original commit: be93580d93e3b6e94946124f06c188bc8daac745
The term "parsable MIME type" used to be part of the MIME Sniffing standard, but it was removed in whatwg/mimesniff#36. This change replaces its uses with equivalent phrasing that references the "parse a MIME type" algorithm. It also replaces mentions of "ASCII-encoded strings" with the Infra standard's definition of "ASCII string". Closes w3c#170.