Consider restricting <form accept-charset> to utf-8 #3097
Comments
This change adds a “must” requirement for UTF-8 in most places in the spec that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>`. To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-spec IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. One place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.
I think we basically have to do this. You get data loss and interoperability issues with other encodings. And
So then we can (should) just add a change for it to #3091, right?
I think so, but @zcorpan wants to do research first as I understand it.
Ah, OK. If so, then I think we should go ahead and do what we seem to have already been planning: merge #3091 first, without this (but with a note about it in the commit message), and then handle this in a separate change later.
Yeah, I think that's fine.
This change adds a “must” requirement for UTF-8 in all but one of the places in the spec that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document). To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-spec IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.
This change adds a “must” requirement for UTF-8 in all but one of the places in the standard that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document). To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-document IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097. Closes #3004.
Is this still something you need compat analysis help with? If so, are you just interested in the frequency of accept-charset values?

SELECT
COUNT(0) AS frequency,
value
FROM
`httparchive.har.2017_12_01_chrome_requests_bodies`,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value
GROUP BY
value
ORDER BY
frequency DESC
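
As a rough illustration of what the frequency query computes, here is a minimal Python sketch that applies the same pattern to a few response bodies. The sample bodies and the function name `accept_charset_frequencies` are hypothetical and are not taken from the HTTP Archive; like the query, the pattern only sees quoted attribute values.

```python
import re
from collections import Counter

# Same idea as the query's pattern: the value of a quoted accept-charset attribute.
ACCEPT_CHARSET = re.compile(r"accept-charset=['\"]([^'\"]+)")


def accept_charset_frequencies(bodies):
    """Count lowercased accept-charset values across response bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(ACCEPT_CHARSET.findall(body.lower()))
    return counts


# Hypothetical sample bodies (assumption: invented for illustration).
sample_bodies = [
    '<form accept-charset="UTF-8" action="/search">',
    "<form accept-charset='utf-8'><form accept-charset=\"ISO-8859-1\">",
    "<form action='/login'>",
]

frequencies = accept_charset_frequencies(sample_bodies)
# Most frequent values first, mirroring ORDER BY frequency DESC.
for value, frequency in frequencies.most_common():
    print(frequency, value)
```

Unquoted attribute values (`accept-charset=utf-8`) are missed by this pattern, so the counts are a lower bound on actual usage.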
Thank you @rviscomi, that's very useful. Those results suggest to me that restricting this further might affect some folks, but should overall be largely uncontroversial. (Note that we're not talking about implementations here. Just about what the validator says.)
Thank you @rviscomi! Now we know how many resources in the httparchive data set would be affected, but not necessarily the percentage of sites (since one site might use several resources that match). I think it would be useful to know how many of those that use a value other than utf-8 that have an
Another option could be to experiment with emitting a warning in the validator and listen to feedback/complaints on SO. If there are many complaints, then we back off; if there are few and the count declines over time, we can update the standard and upgrade to an error. Thoughts?
Modified the query to count the number of websites that include at least one charset that is not `utf-8`:

SELECT
COUNT(DISTINCT page) AS affected_sites
FROM
`httparchive.har.2017_12_01_chrome_requests_bodies`,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value
WHERE
value != 'utf-8'

The result is 624. So out of the 431,851 sites analyzed, this is merely 0.14%.
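
The distinct-page count and the percentage can be checked with a short Python sketch. The sample data and the names `affected_pages` and `percent` are hypothetical; only the figures 624 and 431,851 come from the comment above.

```python
import re

# Quoted accept-charset values, as in the query's pattern.
ACCEPT_CHARSET = re.compile(r"accept-charset=['\"]([^'\"]+)")


def affected_pages(responses):
    """Return the set of pages with at least one non-utf-8 accept-charset value."""
    pages = set()
    for page, body in responses:
        values = ACCEPT_CHARSET.findall(body.lower())
        if any(value != "utf-8" for value in values):
            pages.add(page)
    return pages


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Hypothetical (page, response body) pairs; one page can have several responses.
sample = [
    ("https://a.example/", '<form accept-charset="iso-8859-1">'),
    ("https://a.example/", '<form accept-charset="windows-1252">'),
    ("https://b.example/", '<form accept-charset="utf-8">'),
]

# Two matching resources, but only one distinct page.
print(len(affected_pages(sample)))

# Figures reported in the thread: 624 affected out of 431,851 analyzed.
print(f"{percent(624, 431_851):.2f}%")
```

Counting distinct pages rather than matching resources is what separates this figure from the earlier frequency table.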
Yeah, that's definitely a tricky one :)

SELECT
COUNT(DISTINCT page)
FROM
`httparchive.har.2017_12_01_chrome_requests_bodies`,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'action=[\'"]([^\'"]+)')) AS action
WHERE
value != 'utf-8' AND
NET.REG_DOMAIN(action) != NET.REG_DOMAIN(page)

I'm cutting a corner here for the sake of simplicity; I'm assuming that the action attribute corresponds to the same form element as the accept-charset attribute. The result is 165 websites (~26% of the 624).
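
To show what the "same form element" shortcut can affect, here is a minimal Python sketch that pairs `accept-charset` and `action` per `form` start tag instead of matching them independently across the whole body. It is not the method used in the query. The names `FormCollector` and `cross_host_non_utf8_forms` and the sample markup are hypothetical, and plain hostname comparison stands in for `NET.REG_DOMAIN`, which compares registrable domains instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FormCollector(HTMLParser):
    """Collect (accept-charset, action) pairs, one per form start tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attributes = dict(attrs)
            self.forms.append(
                (attributes.get("accept-charset"), attributes.get("action"))
            )


def cross_host_non_utf8_forms(page_url, html):
    """Return forms with a non-utf-8 accept-charset that submit to another host."""
    collector = FormCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    flagged = []
    for charset, action in collector.forms:
        if charset is None or action is None:
            continue
        if charset.strip().lower() == "utf-8":
            continue
        # Assumption: hostname equality approximates the registrable-domain check.
        action_host = urlsplit(urljoin(page_url, action)).hostname
        if action_host != page_host:
            flagged.append((charset, action))
    return flagged


# Hypothetical page: the legacy charset and the cross-host action sit on different forms.
mixed = (
    '<form accept-charset="iso-8859-1" action="/local"></form>'
    '<form accept-charset="utf-8" action="https://other.example/submit"></form>'
)
print(cross_host_non_utf8_forms("https://example.com/", mixed))

# Hypothetical page where one form has both properties.
single = '<form accept-charset="shift_jis" action="https://other.example/post"></form>'
print(cross_host_non_utf8_forms("https://example.com/", single))
```

Pairing every `accept-charset` match with every `action` match on a page, as the query does, would count the first sample page even though no single form there combines a legacy charset with a cross-host action.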
Are you also considering treating as non-conforming the cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part? And would this change be separate from or concurrent with #3091?
Also, I could see separately disallowing (or at least treating as non-conforming) attempts to use non-UTF-8 character encoding in cross-site form submissions and HREF constructions and in form submissions and navigations originating from pages fetched over unsecured connections (non-
No. We can change the query to look for attribute names ending in "charset", which should cover both inherited charsets and explicit form accept-charsets. Again, this is just an approximation, so take it with a grain of salt.

SELECT
COUNT(DISTINCT page)
FROM
`httparchive.har.2017_12_01_chrome_requests_bodies`,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'charset=[\'"]([^\'"]+)')) AS value,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'action=[\'"]([^\'"]+)')) AS action
WHERE
value IS NOT NULL AND
action IS NOT NULL AND
value != 'utf-8' AND
NET.REG_DOMAIN(action) != NET.REG_DOMAIN(page)

Result: 1545 (there are ~430,000 distinct pages in the dataset).
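
The inheritance case asked about above comes from how a submission encoding is chosen when a form has no `accept-charset`. The following is a minimal Python sketch of that selection as I understand the HTML Standard's "pick an encoding for the form" steps. The function name `pick_form_encoding` is hypothetical, and Python's `codecs.lookup` is only a stand-in for the Encoding Standard's label lookup, so the accepted labels and returned names differ from what a browser uses.

```python
import codecs


def pick_form_encoding(accept_charset, document_encoding):
    """Choose a submission encoding for a form (simplified sketch).

    accept_charset: the attribute value, or None if the attribute is absent.
    document_encoding: the encoding label of the containing document.
    """
    if accept_charset is not None:
        # Labels are separated by whitespace; the first recognized one wins.
        for label in accept_charset.split():
            try:
                # Assumption: codecs.lookup approximates "get an encoding".
                return codecs.lookup(label).name
            except LookupError:
                continue
        # No recognized label at all: fall back to UTF-8.
        return "utf-8"
    # No attribute: the form inherits the document's encoding.
    return codecs.lookup(document_encoding).name


# An explicit utf-8 value overrides a legacy document encoding.
print(pick_form_encoding("utf-8", "windows-1252"))

# Without the attribute, a legacy document encoding is inherited.
print(pick_form_encoding(None, "windows-1252"))
```

Under this reading, a form without `accept-charset` on a non-UTF-8 page submits in the page's encoding, which is why counting explicit attribute values alone understates non-UTF-8 submissions.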
What I had in mind when reporting this issue was that the form itself wouldn't be non-conforming (but the page already is, since using and specifying utf-8 for the page is required as of #3091). But, I don't know, would it be a good idea to require an explicit accept-charset="utf-8" if the page itself isn't utf-8? I hadn't thought about that. That would also affect a much higher percentage of pages (though these pages are already non-conforming). Would the end goal be to make it possible for browser engines to eventually force utf-8 for all form submissions, once the number of non-utf-8 form submissions gets closer to 0? @rviscomi, the query in the previous comment doesn't look for this case, I believe; it would need to look for any HTML pages that are in utf-8 (ideally looking at specified Content-Type or
This is a separate change.
From #3091 (comment):
Should look into how `accept-charset` is used, and ponder about what the consequences would be if we only allowed `<form accept-charset="utf-8">` as conforming.
cc @sideshowbarker @hsivonen