
UTF-8 decoding of headers (revise #207) #3279

Closed
tomako opened this issue Sep 19, 2018 · 7 comments

Comments

@tomako

tomako commented Sep 19, 2018

Long story short

Header fields are decoded using UTF-8 with the surrogateescape error handler. This produces surrogate code points when the headers are Latin-1 (ISO-8859-1) encoded. The resulting string needs extra care: you can't re-encode it without the same surrogateescape error handler, but surprisingly it is serializable with JSON, which causes headaches later.
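The behaviour can be reproduced in plain Python, without aiohttp; the header value here is an illustrative assumption matching the example later in this issue:

```python
import json

# A Latin-1 (ISO-8859-1) encoded header value, as a real client might send it
raw = "Versión".encode("latin-1")             # b'Versi\xf3n'

# aiohttp decodes header bytes as UTF-8 with surrogateescape
value = raw.decode("utf-8", "surrogateescape")
print(repr(value))                            # 'Versi\udcf3n' (lone surrogate)

# Plain re-encoding fails...
try:
    value.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)

# ...but json.dumps serialises it without complaint (ensure_ascii escapes the
# surrogate as \udcf3), deferring the failure to whoever consumes the JSON
print(json.dumps(value))
```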

Expected behaviour

According to RFC7230:

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

Header fields should be decoded using ISO-8859-1, without the surrogateescape error handler.

Actual behaviour

The current behaviour was introduced back in 2014 (see the old ticket #207) to resolve an issue with a UTF-8 encoded header. As the author mentioned, it was against the spec. Indeed, he was right. It doesn't cause any issue with ASCII or UTF-8 headers, but it does with Latin-1 headers, which follow the standard. So the good guys are punished :/

Steps to reproduce

Just start up any aiohttp server and send a message whose header contains a non-US-ASCII character. I could not send it properly via curl but managed to do it with requests:
>>> requests.post("http://localhost:8000/", json={"requests": "data"}, headers={"User-Agent": "Versión"})
You will see the value is b"Versi\xf3n" in request.raw_headers.
In request.headers the value is "Versi\udcf3n".

Your environment

aiohttp 2.3.10 (server) - not the latest, but as far as I can see this part is unchanged
python 3.5.2
Linux Ubuntu 16.04 LTS

@asvetlov
Member

GitMate.io thinks possibly related issues are #3270 (UnicodeDecodeError: 'utf-8' codec can't decode byte ...), #1750 (Encoding is always UTF-8 in POST data), #1652 (Trailer headers), #1731 (UnicodeEncodeError: 'utf-8' codec can't encode character '\udca9'), and #18 (Auto-decoding doesn't recognize content-type: application/json; charset=utf-8).

@asvetlov
Member

It is a deliberate choice.
Latin-1 is effectively a lossless bytes <-> str codec (but the resulting string can easily be broken for non-ASCII data). Basically, headers are binary octet strings without any encoding.

Good guys should use ASCII only :)
aiohttp supports UTF-8 because we currently see more UTF-8 than Latin-1 headers, and UTF-8 is often used to encode non-Latin alphabets.

If you still need to receive and process non-UTF-8 strings, use raw_headers with explicit decoding by any desired codec.
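A minimal sketch of that approach (`decode_headers` is a hypothetical helper; in aiohttp, `request.raw_headers` is a tuple of `(bytes, bytes)` name/value pairs):

```python
def decode_headers(raw_headers, encoding="latin-1"):
    """Decode (name, value) byte pairs with an explicit codec.

    Header names are ASCII in practice, so Latin-1 is always safe for them;
    pick the value codec you expect your clients to use.
    """
    return {name.decode("latin-1"): value.decode(encoding)
            for name, value in raw_headers}

# Simulated raw_headers, as aiohttp would expose them
raw = ((b"User-Agent", b"Versi\xf3n"),)
print(decode_headers(raw))   # {'User-Agent': 'Versión'}
```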

@tomako
Author

tomako commented Sep 20, 2018

I couldn't agree more regarding the good guys. But what should we do with the rest? :)
I have to admit I haven't checked whether we are receiving user agent strings encoded as UTF-8. I do know that we are getting Latin-1 ones, because they are causing issues :(
In short, there are two ways:

  1. Follow the standard and use Latin-1. The result can be a mix of nonsense characters if the header is UTF-8 encoded. If required, the original binary form is recoverable and can be decoded as UTF-8.
  2. Follow the trend and use UTF-8. The result can contain surrogates if the header is Latin-1 encoded. If required, the original binary form is recoverable and can be decoded as Latin-1.
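Both recovery paths described above can be sketched in plain Python:

```python
# Case 1: a UTF-8 header decoded as Latin-1 (the RFC 7230 route) gives
# mojibake, but the original bytes round-trip losslessly
utf8_bytes = "Versión".encode("utf-8")            # b'Versi\xc3\xb3n'
mojibake = utf8_bytes.decode("latin-1")           # 'VersiÃ³n'
assert mojibake.encode("latin-1").decode("utf-8") == "Versión"

# Case 2: a Latin-1 header decoded as UTF-8 + surrogateescape (aiohttp's
# route) gives surrogates, also recoverable with the same error handler
latin1_bytes = "Versión".encode("latin-1")        # b'Versi\xf3n'
escaped = latin1_bytes.decode("utf-8", "surrogateescape")
assert escaped.encode("utf-8", "surrogateescape").decode("latin-1") == "Versión"
```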

Almost the same: small pros and cons on both sides. You have chosen the second one.
I would vote for the standard way, of course. Moreover, the handling of surrogates is inconsistent and causes headaches.

Well, I see your point and can accept your decision. Thank you.

@webknjaz
Member

N.B. According to my research last summer (cherrypy/cheroot#27 (comment)), none of the mainstream HTTP clients (browsers) actually try to decode unicode headers.

@tomako
Author

tomako commented Sep 21, 2018

If it matters, WSGI follows the standard way.
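For reference, PEP 3333 specifies that WSGI servers decode header bytes into native strings as Latin-1, so the wire bytes always round-trip:

```python
raw = b"Versi\xf3n"                          # bytes as received on the wire
wsgi_value = raw.decode("latin-1")           # what a WSGI app sees: 'Versión'
assert wsgi_value.encode("latin-1") == raw   # losslessly recoverable
```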

@asvetlov
Member

The ship sailed in 2014.
Sorry, changing the encoding fixes your case but breaks other code.

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
If you feel there are important points made in this discussion, please include those excerpts in the new issue.

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019