
Prefer defaulting to utf-8, rather than using chardet autodetect #1078

Closed
wants to merge 2 commits

Conversation

tomchristie
Member

@tomchristie tomchristie commented Jul 21, 2020

Closes #1018

Not marking this as a "draft" since it is ready for review, but we shouldn't merge it without prior discussion.
Drops chardet autodetection in favour of utf-8 as the default.

This changes our behaviour from:

  • Use any explicitly set .encoding.
  • Then use any charset included on the Content-Type.
  • Then use iso-8859-1 for any text/* Content-Type without a charset.
  • Then use chardet autodetection.

To...

  • Use any explicitly set .encoding.
  • Then use any charset included on the Content-Type.
  • Then use iso-8859-1 for any text/* Content-Type without a charset.
  • Then use utf-8.
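
As a rough sketch, the new lookup would reduce to something like this (a standalone helper for illustration, not the actual implementation):

from typing import Optional

def resolve_encoding(
    explicit_encoding: Optional[str],  # any explicitly set .encoding
    charset_encoding: Optional[str],   # charset from the Content-Type header
    is_text_content_type: bool,        # Content-Type is text/* without a charset
) -> str:
    if explicit_encoding is not None:
        return explicit_encoding
    if charset_encoding is not None:
        return charset_encoding
    if is_text_content_type:
        return "iso-8859-1"
    return "utf-8"  # New fallback, replacing chardet autodetection.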

We would probably want to document how to use chardet for autodetection, at least for the simpler non-streaming case.
Users can replicate the previous behaviour with...

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = chardet.detect(self.content)["encoding"]
response.text

We might also consider a TextDecoder API at some point, in order to allow users to customize text decoding, including the streaming case, which would allow us to support e.g. chardet or HTML meta-tag detection options.
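
One possible shape for that API, purely as an illustration (the class and method names here are assumptions, not a concrete proposal):

class TextDecoder:
    """Incrementally decodes response bytes into text."""

    def decode(self, chunk: bytes) -> str:
        raise NotImplementedError()

    def flush(self) -> str:
        raise NotImplementedError()

class ChardetTextDecoder(TextDecoder):
    """Buffers the body, since chardet needs to see the content
    before it can guess a charset."""

    def __init__(self) -> None:
        import chardet
        self._chardet = chardet
        self._buffer = b""

    def decode(self, chunk: bytes) -> str:
        self._buffer += chunk
        return ""  # Nothing can be emitted until the charset is known.

    def flush(self) -> str:
        encoding = self._chardet.detect(self._buffer)["encoding"] or "utf-8"
        return self._buffer.decode(encoding)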

What are our thoughts on switching to utf-8 as the fallback case, rather than using chardet-autodetection?

@tomchristie tomchristie added api change PRs that contain breaking public API changes enhancement New feature or request do not merge PRs that should not be merged labels Jul 21, 2020
@StephenBrown2
Contributor

StephenBrown2 commented Jul 21, 2020

Would your thoughts on a TextDecoder API make the encoding detection pluggable (as suggested in #1018 (comment)), so that one could add it in the Client settings, something like encoding=, decoder=, or charset_detector=:

import charset_normalizer

normalize_encoding = lambda content: charset_normalizer.detect(content)["encoding"]

with Client(encoding=normalize_encoding) as client:
    response = client.get(...)
print(response.text)

This could short-circuit the logic above, or just be added as a fallback if the rest fail, depending on how much trust/speed trade-off there should be.

Admittedly, this could be done as you stated above, like:

import charset_normalizer

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = charset_normalizer.detect(response.content)["encoding"]
print(response.text)

but having it on the Client would be a nice-to-have to avoid repeating the same if block everywhere.

Apologies if I missed anything or am jumping the gun. Seeing the last bit of your post makes me think it would be an abstract class to implement, but I think it could also be done with a simple function.
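
For example, the simple-function flavour might look like this (charset_detector= is hypothetical, not an existing httpx parameter):

from typing import Callable, Optional

import charset_normalizer

# A detector is just a function from raw bytes to an encoding name (or None).
CharsetDetector = Callable[[bytes], Optional[str]]

def detect_encoding(content: bytes) -> Optional[str]:
    return charset_normalizer.detect(content)["encoding"]

# Imagined usage; this keyword argument does not exist today:
# with Client(charset_detector=detect_encoding) as client:
#     response = client.get(...)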

Member

@florimondmanca florimondmanca left a comment


Need to ponder on this, but I have an initial question/comment. :-)

Comment on lines +56 to +59
content = "おはようございます。".encode("utf-8")
response = httpx.Response(200, content=content, headers=headers, request=REQUEST)
assert response.text == "おはようございます。"
assert response.encoding == "EUC-JP"
assert response.encoding == "utf-8"
Member


So, what if we do get a "おはようございます。".encode("EUC-JP"), and the server fails to include ; charset=EUC-JP?

I assume we'll try to .decode("utf-8") — and that it will fail?

>>> text = "おはようございます。".encode("EUC-JP")
>>> text.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 0: invalid start byte

In that case I assume we want to allow the user to provide some kind of manual encoding (e.g. if they know it's some exotic encoding).

Most likely it would just be a recommendation to use something like this...

text = response.content.decode("EUC-JP")

This is mostly fine I suppose.

But then, we probably also want to provide more helpful error messaging than the UnicodeDecodeError above, correct? For example, a DecodingError with an error message saying "consider decoding .content manually" (we may have this already, but not sure?).
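
Something along these lines, say (just a sketch, assuming the existing httpx.DecodingError is a reasonable home for it):

import httpx

def decode_or_raise(content: bytes, encoding: str = "utf-8") -> str:
    try:
        return content.decode(encoding)
    except UnicodeDecodeError as exc:
        raise httpx.DecodingError(
            f"Response body could not be decoded as {encoding!r}; "
            "consider setting response.encoding, or decoding "
            "response.content manually."
        ) from exc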

Member Author


Rather than...

text = response.content.decode("EUC-JP")
print(text)

You'd use...

response.encoding = "EUC-JP"
print(response.text)

@tomchristie
Member Author

tomchristie commented Jul 23, 2020

So after having a dig into this I think we clearly want to update our strategy here.

For one thing, the text/* -> iso-8859-1 rule that we currently have in place is there because of an obsolete part of the HTTP spec.

For example, HTML5 now recommends cp1252 ("Windows 1252") as the fallback encoding. It's a good choice because, like iso-8859-1, it also encompasses the full byte range, but its printable characters are a superset of the printable characters in iso-8859-1.

I asked for a bit of dev feedback on Twitter, mostly to try to get an idea of which non-utf-8 charsets we might expect. There's also some useful statistical info, here.

[Screenshot: 2020-07-23 at 12:27:37]

And here

[Screenshot: 2020-09-09 at 09:11:03]

Note that Windows 1251 is more prevalent than Windows 1252, but it's not a good fallback, because the latter is aligned with ISO-8859-1 (the printable characters in cp1252 are a superset of the printable characters in iso-8859-1).

Useful info from the Twitter question...

  • Windows defaults to cp1252, rather than utf-8. Not surprising, but wanted to confirm that.
  • ISO-8859-1 was the only ISO-8859-* option that was mentioned.
  • There was a Russian locale example of cp1251.

So, based on this, I can think of three possible strategies that we might want to adopt here...

  1. charset from "Content-Type" header.
  2. UTF-8.
  3. Windows 1252.

This would be nice and simple. Users can always set response.encoding = ... explicitly if still required.

  1. charset from "Content-Type" header.
  2. UTF-8.
  3. Autodetect using chardet, if installed.
  4. Windows 1252.

A bit fancy. Potentially awkward in streaming text cases, since chardet will buffer content up before returning a charset.

  1. charset from "Content-Type" header.
  2. UTF-8.
  3. User locale if not UTF-8 or Windows 1252.
  4. Windows 1252.

Could be a bit more resilient than the first case, since the machine locale may sometimes be a decent proxy for the likely text encodings of the URLs being accessed by the client. But it might introduce a non-obvious behaviour difference between devs testing locally vs. code deployed to a server, which may well not have the same locale.
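
For reference, the first strategy is small enough to sketch in a few lines (illustrative helper, not actual httpx code):

from typing import Optional

def decode_body(content: bytes, charset: Optional[str]) -> str:
    # 1. charset from "Content-Type" header.
    if charset is not None:
        return content.decode(charset)
    # 2. UTF-8.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Windows 1252. Python's cp1252 codec leaves five bytes
        # (0x81, 0x8D, 0x8F, 0x90, 0x9D) unmapped, so replace rather
        # than raise on those.
        return content.decode("cp1252", errors="replace")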

Any thoughts?

@tomchristie
Member Author

tomchristie commented Jul 23, 2020

Here's a bit of provisional documentation for how things might look if we adopted the first of those three options, but also allowed the text encodings to be configurable...

Text Encodings

When accessing response.text, the byte content in the response needs to be decoded into a unicode string, using some particular encoding. The precedence rules that httpx uses here are...

  • If the response includes a Content-Type header with a "charset=" attribute, then use that.
  • Try to decode as UTF-8.
  • Decode as Windows 1252.

UTF-8 is now very widely adopted, so if the character set has not been specified then we always attempt that as our first choice.

Windows 1252 is a superset of ISO-8859-1, and includes encodings for the full byte range (0-255), so it will always successfully return a unicode text string for any input. This makes it a good fallback option. The HTML5 spec recommends decoding as Windows 1252 when the encoding is not known.
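
For illustration, bytes in the 0x80-0x9F range decode to printable characters under Windows 1252, where ISO-8859-1 only yields invisible control characters:

>>> b"caf\xe9 \x99".decode("cp1252")
'café ™'
>>> b"caf\xe9 \x99".decode("iso-8859-1")
'café \x99'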

In some cases you might need to override the default behaviour, to deal with resources that do not include character set information and are not utf-8 encoded. Here are some scenarios for how you might override this behaviour:

Explicitly override the encoding...

response = client.get(...)
response.encoding = ...
print(response.text)

Explicitly override the encoding, if no charset info is included...

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = ...
print(response.text)

Auto-determine the encoding using chardet, if no charset info is included...

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = chardet.detect(response.content)["encoding"]
print(response.text)

Use a different fallback encoding...

# Configure client to expect Russian language resources.
# Try UTF-8 first, then fall back to Windows 1251.
client = Client(encodings=['utf-8', 'cp1251'])
response = client.get(...)
print(response.text)

Use a fallback encoding based on the user-locale...

# Configure client to try the user locale encoding,
# if it is not already one of the two default options.
import locale

locale_lang, locale_encoding = locale.getdefaultlocale()
encodings = ['utf-8', 'cp1252']
if locale_encoding and locale_encoding.lower() not in encodings:
    encodings = ['utf-8', locale_encoding, 'cp1252']

client = Client(encodings=encodings)
response = client.get(...)
print(response.text)

@florimondmanca
Member

I think first option + detailed docs for more exotic cases (exactly like the above) should get us covered, right?

In particular I like how the default behavior would be really simple while accommodating most situations, and then how the "manual mode" docs seem to allow dealing with any special cases.

@tomchristie
Member Author

Also useful context... psf/requests#2086

@StephenBrown2
Contributor

I'm in favor of Option 2, but in the name of speed + simplicity, Option 1 + detailed docs is the best.

Unfortunately the docs you wrote aren't quite copy-and-pastable, as there's a self.content where I think it should be response.content:

-    response.encoding = chardet.detect(self.content)["encoding"]
+    response.encoding = chardet.detect(response.content)["encoding"]

You may also want to list a couple of the charset-detection modules, to give users more choice (those especially interested in speed may prefer an alternative to chardet if they know one exists).
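
For instance, both of these mirror chardet's detect() API at the time of writing, so they slot straight into the example above:

import cchardet            # fast C implementation of the chardet API
import charset_normalizer  # pure-Python alternative

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = cchardet.detect(response.content)["encoding"]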

@tomchristie tomchristie marked this pull request as draft July 27, 2020 09:30
@tomchristie
Member Author

Switching this to a draft since I don't think it's quite the right implementation.

@tomchristie tomchristie mentioned this pull request Jul 27, 2020
@tomchristie tomchristie closed this Sep 2, 2020
@tomchristie tomchristie deleted the prefer-utf8-over-chardet branch September 2, 2020 12:18