Require UTF-8 #1039
Comments
* Update spec to insist on UTF-8 * Fixes #1039
Fix #1039 * Update spec to insist on UTF-8 for all new content
I've been reviewing the changes as part of the i18n WG review of HTML 5.3, and while we wholeheartedly support the idea of moving the Web fully to UTF-8, we have some concerns with the spec text. As i understand it, the spec is trying to do three things:
4.2.5.4. Specifying the document’s character encoding
The first sentence has a qualifier, 'for the modern Web', which helps, but that last sentence is missing what i think is an important rider that appears in the WHATWG version of the spec, which says:
That, to me sounds better (though not to Addison, who thinks it's untestable). Otherwise, one wonders:
The danger is that the current text implies that legacy documents no longer need to be supported by a browser, even if they use the encodings listed in the Encoding spec. That will cut off the use of many pages on the Web. We're looking for some kind of wording that clearly says that from now on only utf-8 is acceptable, but that also makes it clear that existing content will still work (as long as it doesn't use encodings that aren't interoperably implemented or encodings that carry significant security risks - those being indicated in the Encoding spec). Another suggestion from i18n WG participants was to say SHOULD but immediately follow it by text that explains why content authors REALLY SHOULD avoid anything but utf-8. There was some text to that effect previously in section 4.2.5.4. Specifying the document’s character encoding, but it was removed in this update. Similar comments are relevant to section 12.1 text/html
Again, to my mind, adding 'for newly-created documents' would be more accurate here.
Good points both. Let me answer.
Madame Puff - Ada Lovelace's beloved cat - will forever hide dead birds in your chimney. Which is to say, that will depend on the browser. If you use
I agree with your second question:
As we say in Rendering, "User agents are not required to present HTML documents in any particular way." I think it would be helpful to add a section to either Rendering or Obsolete Features to say how browsers should treat legacy content.
Except that that isn't true, iiuc. The parsing of the input byte stream should detect that the intended encoding is ISO 8859-5 per section 8.2.2, and the Encoding spec then defines exactly how ISO 8859-5 should be supported by a browser. So no strange things should happen for that particular legacy encoding. It's only for encodings that aren't defined in the Encoding spec that strange things could happen, and in that case what happens depends on browser support (modulo, of course, browsers that don't support the standards, but that's a recipe for strange things happening whatever the topic). If the upshot of the spec is now that strange things may happen to legacy encoded content that is covered by the provisions of the Encoding spec, then i think there's a problem, since we are putting in danger the future viability of that legacy content. Note that i agree that conversion to UTF-8 is an optimal solution, but it's not always possible, and unless there's a serious interop or security issue, i don't think we should abandon pages created in good faith in the past to future obscurity.
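For illustration only, here is a rough sketch of why that matters for legacy content. It assumes Python's built-in `iso8859_5` codec is a reasonable stand-in for the ISO 8859-5 decoder described in the Encoding spec (it is not the spec's algorithm), and the sample string and the `decode_as` helper are made up for the example. The same bytes decode cleanly when the declared legacy encoding is honoured and turn into replacement characters when UTF-8 is forced on them.

```python
# Hypothetical sample text, encoded the way a legacy Cyrillic page might be.
# Assumption: Python's "iso8859_5" codec approximates the Encoding spec's
# ISO 8859-5 decoder closely enough for this illustration.
original_text = "Привет"
legacy_bytes = original_text.encode("iso8859_5")


def decode_as(data: bytes, label: str) -> str:
    """Decode bytes with the given encoding label, substituting U+FFFD
    for any byte sequence that is invalid in that encoding."""
    return data.decode(label, errors="replace")


# Honouring the detected legacy encoding recovers the text exactly.
honoured = decode_as(legacy_bytes, "iso8859_5")

# Treating the same bytes as UTF-8 yields replacement characters instead.
forced_utf8 = decode_as(legacy_bytes, "utf-8")

print(honoured)
print(forced_utf8)
```

The second result is the kind of "strange thing" that should not happen to content whose encoding is covered by the Encoding spec.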
Catching up with this. I agree with everything you've written. See incoming PR. One minor thing:
We don't need the "for newly-created documents" as the restriction applies to legacy documents as well. The restrictions mentioned are about the serialisation of the declaration. They've always applied. However, I've added back in the comment about the older character encodings. Thanks for pointing that out.
See whatwg/html#3091