Replace chardet with plain UTF-8 detection #1018

Closed · Tronic opened this issue Jun 9, 2020 · 14 comments · Fixed by #1269
Labels: enhancement (New feature or request)
Milestone: v1.0

@Tronic commented Jun 9, 2020

I'd suggest replacing chardet with dead-simple decoding logic:

encoding = "UTF-8"
try:
    text = databytes.decode(encoding)
except UnicodeDecodeError:
    encoding = "ISO-8859-1"
    text = databytes.decode(encoding)

It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such.

Otherwise ISO-8859-1 (as specified in the HTTP RFC) should be used instead of trying to detect languages, which is always quite unreliable. Further, ISO-8859-1 maps every possible byte value directly to Unicode code points 0-255, so the final decoding step in the above code can never raise UnicodeDecodeError.
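
A quick way to check that last claim (a standalone sketch, not part of the proposal itself): every possible byte value decodes cleanly under ISO-8859-1.

# ISO-8859-1 maps bytes 0x00..0xFF directly to code points U+0000..U+00FF,
# so the fallback branch above can never raise UnicodeDecodeError.
all_bytes = bytes(range(256))
text = all_bytes.decode("ISO-8859-1")
assert len(text) == 256
assert all(ord(ch) == i for i, ch in enumerate(text))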

Since the chardet module has been around a long time and is known as the standard go-to solution for charset detection, many modules depend on it, and this is a frequent source of dependency conflicts such as the one reported here: sanic-org/sanic#1858

A more serious problem is that chardet is notoriously unreliable. Its tendency to misdetect almost anything as Turkish is especially well known:

>>> chardet.detect("Testiä".encode("UTF-8"))   # "Testing" in Finnish
{'encoding': 'ISO-8859-9',
 'confidence': 0.5521177026603239,
 'language': 'Turkish'}

>>> "Testiä".encode("UTF-8").decode("ISO-8859-9")  # Yields gibberish, not Turkish
'TestiÃ¤'
@tomchristie (Member)

Hiya. Okay, this is well pitched, though it's quite hard to get a handle on whether switching away would be a net positive or a net negative for us.

We use chardet in cases where the charset hasn't been included in the Content-Type, because one of the initial design goals for httpx was requests compatibility and that's historically been their behaviour. However, I'm open to reassessing this. I expect the landscape has changed somewhat, with the prevalence of UTF-8 having risen dramatically in that time.

  • It'd be interesting to collate some issues from requests to see what their prior experiences are here.
  • It'd be interesting to get an insight into browser behaviours wrt. fuzzily determining content types where it's unset in the headers.
  • We might well do a better job by using some fuzzy pattern matching for HTML meta tags as a fallback when headers are unset, with utf-8, and then iso-8859-1 as absolute fallbacks.

I guess the upshot of this is likely to be...

  • Start by allowing the "apparent encoding" mechanism to be configurable.
  • Switch chardet out from being the default, and instead fall back on (1) a fuzzy pattern match for meta tags, (2) utf-8, (3) iso-8859-1, in that order (roughly as sketched below).
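
A rough sketch of what that fallback chain could look like (illustrative only; sniff_html_charset and the regex are placeholders, not a proposed httpx API):

import re

# Hypothetical helper: look for a declared charset in the first chunk of an
# HTML body, then fall back on utf-8, then iso-8859-1.
_META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w.:-]+)', re.IGNORECASE)

def sniff_html_charset(content: bytes) -> str:
    match = _META_CHARSET_RE.search(content[:4096])
    if match:
        return match.group(1).decode("ascii", errors="replace")
    try:
        content.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"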

tomchristie added this to the v1.0 milestone Jun 9, 2020
@tomchristie (Member) commented Jun 9, 2020

Hrm, so it looks to me like chardet is what Mozilla use if both Content-Type and Meta tags fail... https://www-archive.mozilla.org/projects/intl/chardet.html

As such I think we ought to be a bit careful about the assumption in "It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such." - it's not necessarily a given that'd match up with actual real-world behaviours, or with current browser behaviours.

One thing that'd be helpful would be some examples of URLs that resolve with undesired content types. Having a dig into real-world cases would help give a better overall picture of where HTML pages omit encoding information.

@tomchristie (Member)

Useful context in psf/requests#1737

@tomchristie (Member)

Also useful context aio-libs/aiohttp#1811 (aiohttp diverges from requests here)

@Tronic (Author) commented Jul 15, 2020

Benchmark of the OP code against chardet.detect, using 100 kB of ASCII and 100 kB of ISO-8859-1 (just 'ä' repeated):

>>> %timeit auto_decode(ascii_bytes)
7.66 µs ± 84.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit auto_decode(latin1_bytes)
12.1 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit chardet.detect(ascii_bytes)
1.32 ms ± 8.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit chardet.detect(latin1_bytes)
2.89 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is a huge difference even with ASCII, and non-ASCII input completely devastates chardet's performance (I tried UTF-8 too, with timings similar to latin-1 for both implementations).
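
For anyone wanting to reproduce the comparison, it can be run roughly like this (a sketch; auto_decode is just the snippet from the opening comment wrapped in a function, and the data sizes match the description above):

import timeit
import chardet

def auto_decode(databytes: bytes) -> str:
    # The decoding logic from the opening comment.
    try:
        return databytes.decode("UTF-8")
    except UnicodeDecodeError:
        return databytes.decode("ISO-8859-1")

ascii_bytes = b"a" * 100_000                        # 100 kB of ASCII
latin1_bytes = "ä".encode("ISO-8859-1") * 100_000   # 100 kB of non-ASCII Latin-1

for label, data in [("ascii", ascii_bytes), ("latin1", latin1_bytes)]:
    t = timeit.timeit(lambda: auto_decode(data), number=1000)
    print(f"auto_decode    {label}: {t / 1000 * 1e6:8.1f} µs per call")
    # Keep the iteration count low: chardet.detect on non-ASCII input
    # can take seconds per call.
    t = timeit.timeit(lambda: chardet.detect(data), number=3)
    print(f"chardet.detect {label}: {t / 3 * 1e3:8.1f} ms per call")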

@Tronic (Author) commented Jul 15, 2020

It might be tempting to optimise by using only the first few kilobytes for detection, but doing so is likely to catch only HTML markup and to miss the native-language characters that appear somewhere in the middle of the document. Fortunately, Python's built-in decoder is extremely fast and stops at the first error, so such optimisations are not needed, especially when the decoding result is actually used as well.

@tomchristie (Member)

Okay, so given that the landscape has moved on since requests opted for chardet to determine unknown character encodings, and that UTF-8 is so much more prevalent now, I'm feeling pretty open to us considering a change in behaviour here.

There are two practical things we'd need to consider to make that change...

  1. A pull request updating TextDecoder to instead use UTF-8 with ISO-8859-1 fallback. It needs to support incremental decoding, so probably using IncrementalDecoder, and initially waiting until at least (say) 4096 bytes have been buffered before the initial decode pass? (See the sketch after this list.)
  2. We might(?) want to consider an API for customising the text decoder, to allow users to switch to alternative behaviours such as a ChardetDecoder or an HTMLMetaTagDecoder?
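
A minimal sketch of what point 1 could look like, using codecs.getincrementaldecoder and the 4096-byte threshold suggested above (the class and method names are placeholders, not the actual httpx TextDecoder):

import codecs

class UTF8FallbackDecoder:
    """Buffer the first few kB, pick utf-8 or iso-8859-1, then decode incrementally."""

    def __init__(self, threshold: int = 4096) -> None:
        self.threshold = threshold
        self.buffer = b""
        self.decoder = None  # Chosen after the initial decode pass.

    def _choose_decoder(self, data: bytes, final: bool) -> None:
        probe = codecs.getincrementaldecoder("utf-8")()
        try:
            # final=False tolerates a multi-byte sequence cut off at the buffer edge.
            probe.decode(data, final)
            encoding = "utf-8"
        except UnicodeDecodeError:
            encoding = "iso-8859-1"
        # errors="replace" keeps later invalid bytes from blowing up the stream.
        self.decoder = codecs.getincrementaldecoder(encoding)(errors="replace")

    def decode(self, data: bytes) -> str:
        if self.decoder is None:
            self.buffer += data
            if len(self.buffer) < self.threshold:
                return ""  # Keep buffering until the initial decode pass.
            self._choose_decoder(self.buffer, final=False)
            data, self.buffer = self.buffer, b""
        return self.decoder.decode(data)

    def flush(self) -> str:
        if self.decoder is None:
            # Entire body was shorter than the threshold.
            self._choose_decoder(self.buffer, final=True)
            data, self.buffer = self.buffer, b""
            return self.decoder.decode(data, final=True)
        return self.decoder.decode(b"", final=True)

The encoding is chosen once, from the first buffered chunk; any invalid bytes that show up later are replaced rather than re-decoded, which keeps the decoder incremental.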

@voegtlel

Another note on this issue: psf/requests#4848 mentions that chardet is problematic when bundling an application for production deployments. I also like the suggestion to make it pluggable with https://github.com/Ousret/charset_normalizer (MIT licence) or https://github.com/PyYoshi/cChardet (MPL, which is still easier to comply with than LGPL). It'd be great if you could consider making the dependency optional :)

@Tronic (Author) commented Jul 17, 2020

Buffering causes issues for applications that expect partial messages to come through without delay (long polling, logging and other such hacks). Thus, maybe it would be best to just always use ISO-8859-1 when nothing else is explicitly specified?

Section 3.7.1 of https://tools.ietf.org/html/rfc2616 says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.

I wonder where the autodetection is actually used, because I just tried serving a UTF-8 document as text/html with no charset or meta tags, and Chrome, Firefox, httpx and requests all decoded it as 8859-1 or similar, not as UTF-8, even though chardet detected the content as UTF-8.

@tomchristie (Member)

The autodetection is only used for media types that are not text/*, and that don't have a charset set... https://github.com/encode/httpx/blob/master/httpx/_models.py#L777

And yes, I'm more inclined to diverge from requests' behaviour here, since it only applies to a pretty narrow subset of cases anyway, and the landscape wrt. utf-8 has moved on since their initial design decision.

@alexchamberlain

I found this issue as I was reporting an issue with response.text taking 10 seconds, which I tracked down to chardet.detect. For what it's worth, in my case, I'm parsing a sitemap, which is XML, so the algorithm for guessing the encoding should be something like: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch12s11.html.

I've implemented this outside of httpx via response.content. Do you see supporting the XML encoding guessing or a custom guessing function fitting in here?
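
For readers with the same sitemap problem, the XML-specific guessing described above boils down to roughly this (a sketch based on the XML spec's autodetection appendix; guess_xml_encoding is not an httpx feature):

import re

_XML_DECL_RE = re.compile(rb'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')

def guess_xml_encoding(content: bytes) -> str:
    # 1. A byte order mark is decisive.
    if content.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if content.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # 2. Otherwise look for an explicit encoding in the XML declaration.
    match = _XML_DECL_RE.search(content[:1024])
    if match:
        return match.group(1).decode("ascii")
    # 3. The XML default when nothing is declared.
    return "utf-8"

The result can then be used to decode response.content manually, which is essentially the workaround described above.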

@tomchristie (Member)

Not really, no, although if we can provide a good enough API to allow it as a third-party option, that'd be a good route.

@alexchamberlain

There is a setter for encoding exposed on Response, which I think is sufficient now that I've thought about it a little more.
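
For example, with the existing setter (reusing the guess_xml_encoding sketch above; the URL is a placeholder):

import httpx

with httpx.Client() as client:
    response = client.get("https://example.com/sitemap.xml")
    # Set the encoding before the first access to response.text.
    response.encoding = guess_xml_encoding(response.content)
    sitemap_xml = response.text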

Would you be open to being able to set the encoding in __init__? I'd use it when reconstructing Response from an on-disk cache.

@tomchristie (Member)

@alexchamberlain Possibly, but it'd need some careful thinking about how to make the Response API nicely coherent in that area (e.g. see #1091).

I probably wouldn't want us to accept any API changes on it at the moment.
