Consider restricting <form accept-charset> to utf-8 #3097

zcorpan · 2017-10-04T17:48:25Z

Should look into how accept-charset is used, and ponder about what the consequences would be if we only allowed <form accept-charset="utf-8"> as conforming.

cc @sideshowbarker @hsivonen

The text was updated successfully, but these errors were encountered:

This change adds a “must” requirement for UTF-8 in most places in the spec that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>`. To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-spec IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. One place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.

annevk · 2017-10-05T06:38:33Z

I think we basically have to do this. You get data loss and interoperability issues with other encodings. And fetch() and XMLHttpRequest already force it.

sideshowbarker · 2017-10-05T07:31:42Z

I think we basically have to do this. You get data loss and interoperability issues with other encodings. And fetch() and XMLHttpRequest already force it.

So then we can (should) just add a change for it to #3091, right?

annevk · 2017-10-05T07:32:25Z

I think so, but @zcorpan wants to do research first as I understand it.

sideshowbarker · 2017-10-05T07:37:10Z

I think so, but @zcorpan wants to do research first as I understand it.

Ah, OK — if so then I think we should go ahead and do what we seem to have already been planning, which is to merge #3091 first — without this (but with a note about it in the commit message) — and then handle this a separate change later

annevk · 2017-10-05T07:41:21Z

Yeah, I think that's fine.

This change adds a “must” requirement for UTF-8 in all but one of the places in the spec that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document). To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-spec IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.

This change adds a “must” requirement for UTF-8 in all but one of the places in the standard that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document). To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-document IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097. Closes #3004.

rviscomi · 2017-12-15T19:17:42Z

Is this still something you need compat analysis help with? If so, are you just interested in the frequency of accept-charset values?

SELECT
  COUNT(0) AS frequency,
  value
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value
GROUP BY
  value
ORDER BY
  frequency DESC

frequency	value
64185	utf-8
459	iso-8859-1
113	windows-1251
61	gbk
38	euc-jp
35	gb2312
32	utf8
32	euc-kr
26	shift_jis
24	shift-jis
21	iso-8859-9
18	iso-8859-2
17	unknown
16	tis-620
12	urf-8
11	utf-8,*
9	x-euc-jp
8	iso-8859-15
6	big5
5	cp1251
5	{{config.busca.acceptcharset}}
5	iso-8859-1;utf-8
5	windows-1250
4	windows-1256
3	{{vm.charset}}
3	,
2	iso-8859-5
2	windows-1253
2	windows-1252
2	sjis
2	us-ascii
2	index.php
2	cb_form_accept_charset
2	iso-8859-1, utf-8
1	iso 8859-9
1	utf-10
1	swedish
1	+ chat.cfg.charset +
1	###form.accept-charset###
1	u
1	latin-1
1	cp-1251
1	gb18030
1	utf-8, utf-8
1	utf-8
1	windows-31j
1	iso-8859-1, iso-8859-2, utf-8
1	character_set
1	false
1	utf-8,us-ascii

annevk · 2017-12-15T20:58:00Z

Thank you @rviscomi, that's very useful. Those results suggest to me that restricting this further might affect some folks, but should overall be largely uncontroversial. (Note that we're not talking about implementations here. Just about what the validator says.)

zcorpan · 2017-12-17T19:11:51Z

Thank you @rviscomi! Now we know how many resources in httparchive data set would be affected, but not necessarily the percentage of sites (since one site might use several resouces that match).

I think it would be useful to know how many of those that use a value other than utf-8 that have an action that points to a third party (and so the affected page's author might not be able to fix the action resource). That's probably a bit convulted to write a query for though! 😁

Another option could be to experiment with emitting a warning in the validator and listen to feedback/complaints on SO. If there are many complaints then we back off, if it's not much and the count declines over time we can update the standard and upgrade to an error.

Thoughts?

rviscomi · 2017-12-18T19:22:44Z

Now we know how many resources in httparchive data set would be affected, but not necessarily the percentage of sites (since one site might use several resouces that match).

Modified the query to count the number of websites that include at least one charset that is not utf-8:

SELECT
  COUNT(DISTINCT page) AS affected_sites
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value
WHERE
  value != 'utf-8'

The result is 624. So out of the 431,851 sites analyzed this is merely 0.14%.

I think it would be useful to know how many of those that use a value other than utf-8 that have an action that points to a third party

Yeah that's definitely a tricky one :)

SELECT
  COUNT(DISTINCT page)
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'accept-charset=[\'"]([^\'"]+)')) AS value,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'action=[\'"]([^\'"]+)')) AS action
WHERE
  value != 'utf-8' AND
  NET.REG_DOMAIN(action) != NET.REG_DOMAIN(page)

I'm cutting a corner here for the sake of simplicity; I'm assuming that the action attribute corresponds to the same form element as the accept-charset attribute. The result is 165 websites (~26% of the 624).

bsittler · 2017-12-18T19:39:28Z

Are you also considering cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part edit: to be non-conforming? And would this change be separate from or concurrent with #3091 ?

bsittler · 2017-12-18T19:45:14Z

Also, I could see separately disallowing (or at least treating as non-conforming) attempts to use non-UTF-8 character encoding in cross-site form submissions and HREF constructions and in form submissions and navigations originating from pages fetched over unsecured connections (non-https: page containing the form or link) as being inherently more risky/more prone to injection attacks than same-registered-domain, https:-only navigations and form submissions due to additional attack risk/risk of interference by bad actors in those cases. Is it worthwhile to consider nonconformance and deprecation for those cases on a more advanced schedule than for same-site https: cases?

rviscomi · 2017-12-18T20:44:58Z

Are you also considering cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part?

No. We can change the query to look for attribute names ending in "charset", which should cover both inherited charsets and explicit form accept-charsets. Again, this is just an approximation so take it with a grain of salt.

SELECT
  COUNT(DISTINCT page)
FROM
  `httparchive.har.2017_12_01_chrome_requests_bodies`,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'charset=[\'"]([^\'"]+)')) AS value,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(body), r'action=[\'"]([^\'"]+)')) AS action
WHERE
  value IS NOT NULL AND
  action IS NOT NULL AND
  value != 'utf-8' AND
  NET.REG_DOMAIN(action) != NET.REG_DOMAIN(page)

Result: 1545. (there are ~430,000 distinct pages in the dataset)

zcorpan · 2017-12-19T08:38:14Z

@bsittler

Are you also considering cases where a form does not explicitly declare accept-charset, but inherits a non-UTF-8 charset from the page of which it is part edit: to be non-conforming?

What I had in mind when reporting this issue, the form itself wouldn't be non-conforming (but the page itself already is, since using and specifying utf-8 for the page is required as of #3091).

But, I don't know, would it be a good idea to require explicit accept-charset="utf-8" if the page itself isn't utf-8? I hadn't thought about that. But that will also be a much higher percentage (though these pages are already non-conforming). Would the end goal be to make it possible for browser engines to force utf-8 for all form submissions eventually when the number of non-utf-8 form submissions get closer to 0?

@rviscomi query in the previous comment doesn't look for this case I believe; it would need to look for any HTML pages that are in utf-8 (ideally looking at specified Content-Type or meta) that also have a form element without an accept-charset attribute.

And would this change be separate from or concurrent with #3091 ?

This is a separate change.

Fixes #3097.

This change adds a “must” requirement for UTF-8 in all but one of the places in the standard that define a means for specifying a character encoding. Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document). To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-document IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”. The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue whatwg#3097. Closes whatwg#3004.

Fixes whatwg#3097.

zcorpan added document conformance needs compat analysis removal/deprecation Removing or deprecating a feature topic: forms labels Oct 4, 2017

sideshowbarker added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Dec 18, 2017

annevk added a commit that referenced this issue Nov 23, 2018

Require UTF-8 for accept-charset

4552160

Fixes #3097.

annevk mentioned this issue Nov 23, 2018

Require UTF-8 for accept-charset #4195

Merged

annevk closed this as completed in #4195 Nov 26, 2018

annevk added a commit that referenced this issue Nov 26, 2018

Require UTF-8 for accept-charset

840e22f

Fixes #3097.

himorin mentioned this issue Dec 3, 2018

Consider restricting <form accept-charset> to utf-8 w3c/i18n-activity#616

Closed

mustaqahmed pushed a commit to mustaqahmed/html that referenced this issue Feb 15, 2019

Require UTF-8 for accept-charset

0e5d9c7

Fixes whatwg#3097.

mustaqahmed pushed a commit to mustaqahmed/html that referenced this issue Feb 15, 2019

Require UTF-8 for accept-charset

e8cf9e5

Fixes whatwg#3097.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider restricting <form accept-charset> to utf-8 #3097

Consider restricting <form accept-charset> to utf-8 #3097

zcorpan commented Oct 4, 2017

annevk commented Oct 5, 2017

sideshowbarker commented Oct 5, 2017

annevk commented Oct 5, 2017

sideshowbarker commented Oct 5, 2017

annevk commented Oct 5, 2017

rviscomi commented Dec 15, 2017 •

edited

Loading

annevk commented Dec 15, 2017 •

edited

Loading

zcorpan commented Dec 17, 2017

rviscomi commented Dec 18, 2017

bsittler commented Dec 18, 2017 •

edited

Loading

bsittler commented Dec 18, 2017 •

edited

Loading

rviscomi commented Dec 18, 2017 •

edited

Loading

zcorpan commented Dec 19, 2017

Consider restricting <form accept-charset> to utf-8 #3097

Consider restricting <form accept-charset> to utf-8 #3097

Comments

zcorpan commented Oct 4, 2017

annevk commented Oct 5, 2017

sideshowbarker commented Oct 5, 2017

annevk commented Oct 5, 2017

sideshowbarker commented Oct 5, 2017

annevk commented Oct 5, 2017

rviscomi commented Dec 15, 2017 • edited Loading

annevk commented Dec 15, 2017 • edited Loading

zcorpan commented Dec 17, 2017

rviscomi commented Dec 18, 2017

bsittler commented Dec 18, 2017 • edited Loading

bsittler commented Dec 18, 2017 • edited Loading

rviscomi commented Dec 18, 2017 • edited Loading

zcorpan commented Dec 19, 2017

rviscomi commented Dec 15, 2017 •

edited

Loading

annevk commented Dec 15, 2017 •

edited

Loading

bsittler commented Dec 18, 2017 •

edited

Loading

bsittler commented Dec 18, 2017 •

edited

Loading

rviscomi commented Dec 18, 2017 •

edited

Loading