UTF-8 decoding of headers (revise #207) #3279
Comments
GitMate.io thinks possibly related issues are #3270 (UnicodeDecodeError: 'utf-8' codec can't decode byte ...), #1750 (Encoding is always UTF-8 in POST data), #1652 (Trailer headers), #1731 (UnicodeEncodeError: 'utf-8' codec can't encode character '\udca9'), and #18 (Auto-decoding doesn't recognize content-type: application/json; charset=utf-8).
It is a deliberate choice. Good guys should use ASCII only :) If you still need to receive and process non-UTF-8 strings -- use raw_headers and decode them explicitly with whatever codec you need.
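A minimal sketch of that suggestion, assuming raw_headers yields (name, value) pairs of bytes, as the value shown in the report further down suggests. The helper name decode_raw_headers and its default codec are illustrative assumptions, not part of aiohttp:

```python
from typing import Iterable, List, Tuple


def decode_raw_headers(
    raw_headers: Iterable[Tuple[bytes, bytes]],
    codec: str = "iso-8859-1",
) -> List[Tuple[str, str]]:
    """Decode (name, value) byte pairs with an explicitly chosen codec.

    Hypothetical helper: field names are assumed to be plain ASCII tokens,
    while values are decoded with `codec` (ISO-8859-1 by default).
    """
    return [
        (name.decode("ascii"), value.decode(codec))
        for name, value in raw_headers
    ]


# Stand-in for request.raw_headers, using the bytes from the report.
sample = ((b"User-Agent", b"Versi\xf3n"),)
decoded = decode_raw_headers(sample)
```

In a handler this would be called as decode_raw_headers(request.raw_headers); ISO-8859-1 can decode any byte sequence, so the default never raises for header values.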
I couldn't agree more regarding good guys. But what should we do with the rest? :)
Almost the same. Small pros and cons on both sides. You have chosen the 2nd one. Well, I see your point and can accept your decision. Thank you.
N.B. According to my research last summer (cherrypy/cheroot#27 (comment)), none of the mainstream HTTP clients (browsers) actually try to decode Unicode headers.
If it matters, WSGI follows the standard way.
That ship sailed in 2014.
Long story short
Header fields are decoded using UTF-8 with the surrogateescape error handler. This produces lone surrogates when the headers are Latin-1 (ISO-8859-1) encoded. The resulting string needs extra care: you can't encode it without the same surrogateescape error handler, yet, surprisingly, it can be serialized to JSON, which causes headaches later.
Expected behaviour
According to RFC 7230, header fields should be decoded using ISO-8859-1, without the surrogateescape error handler.
Actual behaviour
The current behaviour dates back to 2014 (see the old ticket #207), when it was introduced to resolve an issue with a UTF-8 encoded header. As the author mentioned at the time, it was against the spec, and he was right. It causes no issues with ASCII or UTF-8 headers, but it breaks Latin-1 headers, which are the ones that follow the standard. So the good guys are punished :/
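The difference between the two decodings can be shown with the standard library alone, without a server. This is only an illustration of the codecs involved, not aiohttp's actual parsing code; the byte string is the Latin-1 encoding of "Versión":

```python
import json

raw = b"Versi\xf3n"  # "Versión" encoded as ISO-8859-1

# Expected behaviour: every byte maps directly to one code point.
latin1_value = raw.decode("iso-8859-1")

# Actual behaviour: 0xF3 is not valid UTF-8 here, so it is smuggled
# through as the lone surrogate U+DCF3.
escaped_value = raw.decode("utf-8", "surrogateescape")

# A plain encode of the escaped string fails...
try:
    escaped_value.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# ...unless the same error handler is used again.
round_tripped = escaped_value.encode("utf-8", "surrogateescape")

# json.dumps accepts the lone surrogate and emits it as an escape sequence.
as_json = json.dumps(escaped_value)
```

The JSON step is where the problem tends to hide: the surrogate survives serialization and only fails later, when some consumer tries to encode the string strictly.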
Steps to reproduce
Just start up any aiohttp server and send a request whose header contains a non-US-ASCII character. I could not send it properly via curl, but managed to do it with requests:
>>> requests.post("http://localhost:8000/", json={"requests": "data"}, headers={"User-Agent": "Versión"})
You will see that the value is b"Versi\xf3n" in request.raw_headers. In request.headers the value is "Versi\udcf3n".
Your environment
aiohttp 2.3.10 (server) - not the latest, but as far as I can see this part is still the same
Python 3.5.2
Linux Ubuntu 16.04 LTS
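For code that only has the already-decoded string at hand, one possible workaround is to undo the surrogate escaping and fall back to ISO-8859-1. The sketch below is an assumption-laden illustration: repair_header_value is a hypothetical helper, not part of aiohttp, and the "try UTF-8 first, then Latin-1" policy is a guess that can misread Latin-1 bytes which happen to form valid UTF-8.

```python
def repair_header_value(value: str) -> str:
    """Recover readable text from a UTF-8 + surrogateescape decoded header.

    Hypothetical helper: re-creates the original bytes, then decodes them
    as strict UTF-8 if possible and as ISO-8859-1 otherwise.
    """
    # surrogateescape is reversible, so this yields the bytes on the wire.
    raw = value.encode("utf-8", "surrogateescape")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


repaired = repair_header_value("Versi\udcf3n")
```

Values that were plain ASCII or valid UTF-8 to begin with pass through unchanged.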