
UTF-8 decoding of headers (revise #207) #3279

Closed
tomako opened this issue Sep 19, 2018 · 7 comments

Comments

@tomako

tomako commented Sep 19, 2018

Long story short

Header fields are decoded using UTF-8 with the surrogateescape error handler. This produces surrogate code points when the headers are Latin-1 (ISO-8859-1) encoded. The resulting string needs extra care: you can't re-encode it without the same surrogateescape error handler, but surprisingly it is serializable with JSON, which causes headaches later.
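The behaviour can be reproduced in plain Python, without aiohttp; the header value here is an illustrative assumption matching the example later in this issue:

```python
import json

# A Latin-1 (ISO-8859-1) encoded header value, as a real client might send it
raw = "Versión".encode("latin-1")             # b'Versi\xf3n'

# aiohttp decodes header bytes as UTF-8 with surrogateescape
value = raw.decode("utf-8", "surrogateescape")
print(repr(value))                            # 'Versi\udcf3n' (lone surrogate)

# Plain re-encoding fails...
try:
    value.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)

# ...but json.dumps serialises it without complaint (ensure_ascii escapes the
# surrogate as \udcf3), deferring the failure to whoever consumes the JSON
print(json.dumps(value))
```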

Expected behaviour

According to RFC7230:

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

Header fields should be decoded using ISO-8859-1, without the surrogateescape error handler.

Actual behaviour

The current behaviour was introduced back in 2014 (see the old ticket #207) to resolve an issue with a UTF-8 encoded header. As the author mentioned, it was against the spec. Indeed, he was right. It doesn't cause any issue with ASCII or UTF-8 headers, but it does with Latin-1 headers, which follow the standard. So the good guys are punished :/

Steps to reproduce

Just start up any aiohttp server and send a message whose header contains a non-US-ASCII character. I could not send it properly via curl but managed to do it with requests:
>>> requests.post("http://localhost:8000/", json={"requests": "data"}, headers={"User-Agent": "Versión"})
You will see the value is b"Versi\xf3n" in request.raw_headers.
In request.headers the value is "Versi\udcf3n".

Your environment

aiohttp 2.3.10 (server) - not the latest, but as far as I can see this part is unchanged
python 3.5.2
Linux Ubuntu 16.04 LTS

@asvetlov
Member

GitMate.io thinks possibly related issues are #3270 (UnicodeDecodeError: 'utf-8' codec can't decode byte ...), #1750 (Encoding is always UTF-8 in POST data), #1652 (Trailer headers), #1731 (UnicodeEncodeError: 'utf-8' codec can't encode character '\udca9'), and #18 (Auto-decoding doesn't recognize content-type: application/json; charset=utf-8).

@asvetlov
Member

It is a deliberate choice.
Latin-1 is effectively a lossless bytes <-> str codec (but the resulting string can easily be broken for non-ASCII data). Basically, headers are binary octet strings without any encoding.

Good guys should use ASCII only :)
aiohttp supports UTF-8 because we currently see more UTF-8 than Latin-1 headers, and UTF-8 is often used to encode non-Latin alphabets.

If you still need to receive and process non-UTF-8 strings, use raw_headers with explicit decoding by any desired codec.
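A minimal sketch of that approach (`decode_headers` is a hypothetical helper; in aiohttp, `request.raw_headers` is a tuple of `(bytes, bytes)` name/value pairs):

```python
def decode_headers(raw_headers, encoding="latin-1"):
    """Decode (name, value) byte pairs with an explicit codec.

    Header names are ASCII in practice, so Latin-1 is always safe for them;
    pick the value codec you expect your clients to use.
    """
    return {name.decode("latin-1"): value.decode(encoding)
            for name, value in raw_headers}

# Simulated raw_headers, as aiohttp would expose them
raw = ((b"User-Agent", b"Versi\xf3n"),)
print(decode_headers(raw))   # {'User-Agent': 'Versión'}
```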

@tomako
Author

tomako commented Sep 20, 2018

I couldn't agree more regarding the good guys. But what should we do with the rest? :)
I have to admit I haven't checked whether we are receiving user agent strings encoded as UTF-8. I do know that we are getting Latin-1 ones, because they are causing issues :(
In short, there are two ways:

  1. Follow the standard and use Latin-1. The result can be a mix of nonsense characters if the header is UTF-8 encoded. If required, the original binary form is recoverable and can be decoded as UTF-8.
  2. Follow the trend and use UTF-8. The result can contain surrogates if the header is Latin-1 encoded. If required, the original binary form is recoverable and can be decoded as Latin-1.
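Both recovery paths described above can be sketched in plain Python:

```python
# Case 1: a UTF-8 header decoded as Latin-1 (the RFC 7230 route) gives
# mojibake, but the original bytes round-trip losslessly
utf8_bytes = "Versión".encode("utf-8")            # b'Versi\xc3\xb3n'
mojibake = utf8_bytes.decode("latin-1")           # 'VersiÃ³n'
assert mojibake.encode("latin-1").decode("utf-8") == "Versión"

# Case 2: a Latin-1 header decoded as UTF-8 + surrogateescape (aiohttp's
# route) gives surrogates, also recoverable with the same error handler
latin1_bytes = "Versión".encode("latin-1")        # b'Versi\xf3n'
escaped = latin1_bytes.decode("utf-8", "surrogateescape")
assert escaped.encode("utf-8", "surrogateescape").decode("latin-1") == "Versión"
```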

Almost the same: small pros and cons on both sides. You have chosen the second one.
I would vote for the standard way, of course. Moreover, the handling of surrogates is inconsistent and causes headaches.

Well, I see your point and can accept your decision. Thank you.

@webknjaz
Member

N.B. According to my research last summer (cherrypy/cheroot#27 (comment)), none of the mainstream HTTP clients (browsers) actually try to decode unicode headers.

@tomako
Author

tomako commented Sep 21, 2018

If it matters, WSGI follows the standard way.
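For reference, PEP 3333 specifies that WSGI servers decode header bytes into native strings as Latin-1, so the wire bytes always round-trip:

```python
raw = b"Versi\xf3n"                          # bytes as received on the wire
wsgi_value = raw.decode("latin-1")           # what a WSGI app sees: 'Versión'
assert wsgi_value.encode("latin-1") == raw   # losslessly recoverable
```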

@asvetlov
Member

The ship sailed in 2014.
Sorry, changing the encoding fixes your case but breaks other code.

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
If you feel there are important points made in this discussion, please include those excerpts in the new issue.

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019