
Prefer defaulting to utf-8, rather than using chardet autodetect #1078

Closed
wants to merge 2 commits

Conversation

tomchristie
Member

@tomchristie tomchristie commented Jul 21, 2020

Closes #1018

Not marking this as a "draft" since it is ready for review, but we shouldn't merge it without prior discussion.
Drops chardet autodetection in favour of utf-8 as the default.

This changes our behaviour from:

  • Use any explicitly set .encoding.
  • Then use any charset included on the Content-Type.
  • Then use iso-8859-1 for any text/* Content-Type without a charset.
  • Then use chardet autodetection.

To...

  • Use any explicitly set .encoding.
  • Then use any charset included on the Content-Type.
  • Then use iso-8859-1 for any text/* Content-Type without a charset.
  • Then use utf-8.
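
As a rough sketch, the new lookup would reduce to something like this (a standalone helper for illustration, not the actual implementation):

from typing import Optional

def resolve_encoding(
    explicit_encoding: Optional[str],  # any explicitly set .encoding
    charset_encoding: Optional[str],   # charset from the Content-Type header
    is_text_content_type: bool,        # Content-Type is text/* without a charset
) -> str:
    if explicit_encoding is not None:
        return explicit_encoding
    if charset_encoding is not None:
        return charset_encoding
    if is_text_content_type:
        return "iso-8859-1"
    return "utf-8"  # New fallback, replacing chardet autodetection.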

We would probably want to document how to use chardet for autodetection, at least for the simpler non-streaming case.
Users can replicate the previous behaviour with...

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = chardet.detect(self.content)["encoding"]
response.text

We might also consider a TextDecoder API at some point, in order to allow users to customize text decoding, including the streaming case, which would allow us to support e.g. chardet or HTML meta-tag detection options.
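
One possible shape for that API, purely as an illustration (the class and method names here are assumptions, not a concrete proposal):

class TextDecoder:
    """Incrementally decodes response bytes into text."""

    def decode(self, chunk: bytes) -> str:
        raise NotImplementedError()

    def flush(self) -> str:
        raise NotImplementedError()

class ChardetTextDecoder(TextDecoder):
    """Buffers the body, since chardet needs to see the content
    before it can guess a charset."""

    def __init__(self) -> None:
        import chardet
        self._chardet = chardet
        self._buffer = b""

    def decode(self, chunk: bytes) -> str:
        self._buffer += chunk
        return ""  # Nothing can be emitted until the charset is known.

    def flush(self) -> str:
        encoding = self._chardet.detect(self._buffer)["encoding"] or "utf-8"
        return self._buffer.decode(encoding)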

What are our thoughts on switching to utf-8 as the fallback case, rather than using chardet-autodetection?

@tomchristie tomchristie added api change PRs that contain breaking public API changes enhancement New feature or request do not merge PRs that should not be merged labels Jul 21, 2020
@StephenBrown2
Contributor

StephenBrown2 commented Jul 21, 2020

Would your thoughts on a TextDecoder API make the encoding detection pluggable (as suggested in #1018 (comment)), so that one could add it in the Client settings, something like encoding=, decoder=, or charset_detector=:

import charset_normalizer

normalize_encoding = lambda content: charset_normalizer.detect(content)["encoding"]

with Client(encoding=normalize_encoding) as client:
    response = client.get(...)
print(response.text)

This could short-circuit the logic above, or just be added as a fallback if the rest fail, depending on how much trust/speed trade-off there should be.

Admittedly, this could be done as you stated above, like:

import charset_normalizer

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = charset_normalizer.detect(response.content)["encoding"]
print(response.text)

but having it on the Client would be a nice-to-have to avoid repeating the same if block everywhere.

Apologies if I missed anything or am jumping the gun. Seeing the last bit of your post makes me think it would be an abstract class to implement, but I think it could also be done with a simple function.
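
For example, the simple-function flavour might look like this (charset_detector= is hypothetical, not an existing httpx parameter):

from typing import Callable, Optional

import charset_normalizer

# A detector is just a function from raw bytes to an encoding name (or None).
CharsetDetector = Callable[[bytes], Optional[str]]

def detect_encoding(content: bytes) -> Optional[str]:
    return charset_normalizer.detect(content)["encoding"]

# Imagined usage; this keyword argument does not exist today:
# with Client(charset_detector=detect_encoding) as client:
#     response = client.get(...)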

Member

@florimondmanca florimondmanca left a comment


Need to ponder on this, but I have an initial question/comment. :-)

Comment on lines +56 to +59
content = "おはようございます。".encode("utf-8")
response = httpx.Response(200, content=content, headers=headers, request=REQUEST)
assert response.text == "おはようございます。"
assert response.encoding == "EUC-JP"
assert response.encoding == "utf-8"
Member


So, what if we do get a "おはようございます。".encode("EUC-JP"), and the server fails to include ; charset=EUC-JP?

I assume we'll try to .decode("utf-8") — and that it will fail?

>>> text = "おはようございます。".encode("EUC-JP")
>>> text.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 0: invalid start byte

In that case I assume we want to allow the user to provide some kind of manual encoding (e.g. if they know it's some exotic encoding).

Most likely it would just be a recommendation to use something like this...

text = response.content.decode("EUC-JP")

This is mostly fine I suppose.

But then, we probably also want to provide more helpful error messaging than the UnicodeDecodeError above, correct? For example, a DecodingError with an error message saying "consider decoding .content manually" (we may have this already, but not sure?).
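
Something along these lines, say (just a sketch, assuming the existing httpx.DecodingError is a reasonable home for it):

import httpx

def decode_or_raise(content: bytes, encoding: str = "utf-8") -> str:
    try:
        return content.decode(encoding)
    except UnicodeDecodeError as exc:
        raise httpx.DecodingError(
            f"Response body could not be decoded as {encoding!r}; "
            "consider setting response.encoding, or decoding "
            "response.content manually."
        ) from exc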

Member Author


Rather than...

text = response.content.decode("EUC-JP")
print(text)

You'd use...

response.encoding = "EUC-JP"
print(response.text)

@tomchristie
Member Author

tomchristie commented Jul 23, 2020

So after having a dig into this I think we clearly want to update our strategy here.

For one thing, the text/* -> iso-8859-1 rule that we currently have in place is there because of an obsolete part of the HTTP spec.

For example, HTML5 now recommends cp1252 ("Windows 1252") as the fallback encoding. It's a good choice because, like iso-8859-1, it also encompasses the full byte range, but its printable characters are a superset of the printable characters in iso-8859-1.

I asked for a bit of dev feedback on Twitter, mostly to try to get an idea of which non-utf-8 charsets we might expect. There's also some useful statistical info, here.

[Screenshot: 2020-07-23 at 12:27:37]

And here

[Screenshot: 2020-09-09 at 09:11:03]

Note that Windows 1251 is more prevalent than Windows 1252, but it's not a good fallback, because the latter is aligned with ISO-8859-1 (the printable characters in cp1252 are a superset of the printable characters in iso-8859-1).

Useful info from the Twitter question...

  • Windows defaults to cp1252, rather than utf-8. Not surprising, but wanted to confirm that.
  • ISO-8859-1 was the only ISO-8859-* option that was mentioned.
  • There was a Russian locale example of cp1251.

So, based on this, I can think of three possible strategies that we might want to adopt here...

  1. charset from "Content-Type" header.
  2. UTF-8.
  3. Windows 1252.

This would be nice and simple. Users can always set response.encoding = ... explicitly if still required.

  1. charset from "Content-Type" header.
  2. UTF-8.
  3. Autodetect using chardet, if installed.
  4. Windows 1252.

A bit fancy. Potentially awkward in streaming text cases, since chardet will buffer content up before returning a charset.

  1. charset from "Content-Type" header.
  2. UTF-8.
  3. User locale if not UTF-8 or Windows 1252.
  4. Windows 1252.

Could be a bit more resilient than the first case, since the machine locale may sometimes be a decent proxy for the likely text encodings of the URLs being accessed by the client. But it might introduce a non-obvious behaviour difference between devs testing locally vs. code deployed to a server, which may well not have the same locale.
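
For reference, the first strategy is small enough to sketch in a few lines (illustrative helper, not actual httpx code):

from typing import Optional

def decode_body(content: bytes, charset: Optional[str]) -> str:
    # 1. charset from "Content-Type" header.
    if charset is not None:
        return content.decode(charset)
    # 2. UTF-8.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Windows 1252. Python's cp1252 codec leaves five bytes
        # (0x81, 0x8D, 0x8F, 0x90, 0x9D) unmapped, so replace rather
        # than raise on those.
        return content.decode("cp1252", errors="replace")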

Any thoughts?

@tomchristie
Member Author

tomchristie commented Jul 23, 2020

Here's a bit of provisional documentation for how things might look if we adopted the first of those three options, but also allowed the text encodings to be configurable...

Text Encodings

When accessing response.text, the byte content in the response needs to be decoded into a unicode string, using some particular encoding. The precedence rules that httpx uses here are...

  • If the response includes a Content-Type header with a "charset=" attribute, then use that.
  • Try to decode as UTF-8.
  • Decode as Windows 1252.

UTF-8 is now very widely adopted, so if the character set has not been specified then we always attempt that as our first choice.

Windows 1252 is a superset of ISO-8859-1, and includes encodings for the full byte range (0-255), so it will always successfully return a unicode text string for any input. This makes it a good fallback option. The HTML5 spec recommends decoding as Windows 1252 when the encoding is not known.
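
For illustration, bytes in the 0x80-0x9F range decode to printable characters under Windows 1252, where ISO-8859-1 only yields invisible control characters:

>>> b"caf\xe9 \x99".decode("cp1252")
'café ™'
>>> b"caf\xe9 \x99".decode("iso-8859-1")
'café \x99'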

In some cases you might need to override the default behaviour, to deal with resources that do not include character set information and are not utf-8 encoded. Here are some scenarios for how you might override this behaviour:

Explicitly override the encoding...

response = client.get(...)
response.encoding = ...
print(response.text)

Explicitly override the encoding, if no charset info is included...

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = ...
print(response.text)

Auto-determine the encoding using chardet, if no charset info is included...

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = chardet.detect(response.content)["encoding"]
print(response.text)

Use a different fallback encoding...

# Configure client to expect Russian language resources.
# Try UTF-8 first, then fall back to Windows 1251.
client = Client(encodings=['utf-8', 'cp1251'])
response = client.get(...)
print(response.text)

Use a fallback encoding based on the user-locale...

# Configure client to try the user locale encoding,
# if it is not already one of the two default options.
import locale

locale_lang, locale_encoding = locale.getdefaultlocale()
encodings = ['utf-8', 'cp1252']
if locale_encoding and locale_encoding.lower() not in encodings:
    encodings = ['utf-8', locale_encoding, 'cp1252']

client = Client(encodings=encodings)
response = client.get(...)
print(response.text)

@florimondmanca
Member

I think first option + detailed docs for more exotic cases (exactly like the above) should get us covered, right?

In particular I like how the default behavior would be really simple while accommodating most situations, and then how the "manual mode" docs seem to allow dealing with any special cases.

@tomchristie
Member Author

Also useful context... psf/requests#2086

@StephenBrown2
Contributor

I'm in favor of Option 2, but in the name of speed + simplicity, Option 1 + detailed docs is the best.

Unfortunately the docs you wrote aren't quite copy-and-pastable, as there's a self.content where I think it should be response.content:

-    response.encoding = chardet.detect(self.content)["encoding"]
+    response.encoding = chardet.detect(response.content)["encoding"]

You may also want to list a couple of the charset-detection modules, to give users more choice (those especially interested in speed may prefer an alternative to chardet if they know one exists).
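
For instance, both of these mirror chardet's detect() API at the time of writing, so they slot straight into the example above:

import cchardet            # fast C implementation of the chardet API
import charset_normalizer  # pure-Python alternative

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = cchardet.detect(response.content)["encoding"]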

@tomchristie tomchristie marked this pull request as draft July 27, 2020 09:30
@tomchristie
Member Author

Switching this to a draft since I don't think it's quite the right implementation.

@tomchristie tomchristie mentioned this pull request Jul 27, 2020
@tomchristie tomchristie closed this Sep 2, 2020
@tomchristie tomchristie deleted the prefer-utf8-over-chardet branch September 2, 2020 12:18