Replace chardet with plain UTF-8 detection #1018
Hiya. Okay, this is well pitched, though it's quite hard to get a handle on whether it'd be a net positive or net negative for us to switch away from. We use…
I guess the upshot of this is likely to be...
Hrm, so it looks to me like… As such, I think we ought to be a bit careful about the assumption that "It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such." It's not necessarily a given that this would match up with actual real-world behaviours, or with current browser behaviours. One thing that'd be helpful would be some examples of URLs which resolve with undesired content-types. Digging into real-world cases would help give a better overall picture of where HTML pages omit encoding information.
Useful context in psf/requests#1737
Also useful context in aio-libs/aiohttp#1811 (…)
Benchmark of the OP code against chardet detect, using 100 kB of ASCII and 100 kB of ISO-8859-1 (only 'ä' repeated):
There is a huge difference even with ASCII, and non-ASCII characters totally devastate chardet performance (tried with UTF-8 too, with timings similar to latin1 on both implementations).
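The exact benchmark script isn't reproduced in this comment, so the following is only a rough reproduction sketch under the same assumptions (100 kB payloads, the simple UTF-8/ISO-8859-1 fallback versus chardet.detect); the helper name simple_decode is made up for illustration:

```python
# Rough benchmark sketch (assumptions: 100 kB payloads, 10 iterations each).
import timeit

import chardet  # the existing dependency under discussion

def simple_decode(content: bytes) -> str:
    # The simple fallback logic proposed in this issue.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        return content.decode("iso-8859-1")

ascii_body = b"a" * 100_000                        # 100 kB of ASCII
latin1_body = "ä".encode("iso-8859-1") * 100_000   # 100 kB of ISO-8859-1 ('ä' repeated)

for label, body in [("ascii", ascii_body), ("latin1", latin1_body)]:
    t_simple = timeit.timeit(lambda: simple_decode(body), number=10)
    t_chardet = timeit.timeit(lambda: chardet.detect(body), number=10)
    print(f"{label}: simple={t_simple:.4f}s  chardet={t_chardet:.4f}s")
```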
It might be tempting to optimise by using only the initial few kilobytes for detection, but doing so is likely to catch only HTML markup and miss the actual native-language characters that appear somewhere in the middle of the document. Fortunately Python's built-in decoding is extremely fast and stops at the first error, so such optimisations are not needed, especially when the decoding result is also actually used.
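As a small illustration of that fail-fast behaviour (my own example, not from the thread): an invalid byte at the start of a large payload is reported immediately, with the error position available on the exception.

```python
# Illustration: strict UTF-8 decoding raises as soon as it hits an invalid byte.
data = b"\xff" + b"a" * 100_000  # invalid UTF-8 right at the first byte

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason, "at position", exc.start)  # 'invalid start byte' at position 0
```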
Okay, so given that the landscape has moved on since… there are two practical things we'd need to consider to make that change...
Another note on this issue: psf/requests#4848 mentions that chardet is problematic when bundled with an application in production deployments. I also like the suggestion to make it pluggable with https://github.com/Ousret/charset_normalizer (MIT license) or https://github.com/PyYoshi/cChardet (which is MPL, still easier to comply with than LGPL). It'd be great if you could consider making the dependency optional :)
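A pluggable hook could look roughly like the sketch below. This is only an illustration of the idea, not an existing API in any library; set_charset_detector and decode_text are hypothetical names, and the fallback mirrors the simple logic proposed in this issue.

```python
# Sketch of an optional, pluggable charset detector (hypothetical API;
# set_charset_detector/decode_text are illustrative names only).
from typing import Callable, Optional

# A detector takes raw bytes and returns an encoding name, or None if unsure.
Detector = Callable[[bytes], Optional[str]]

_detector: Optional[Detector] = None

def set_charset_detector(detector: Detector) -> None:
    """Register a third-party detector, e.g. charset_normalizer or cChardet."""
    global _detector
    _detector = detector

def decode_text(content: bytes) -> str:
    if _detector is not None:
        encoding = _detector(content)
        if encoding:
            return content.decode(encoding, errors="replace")
    # Default behaviour when no detector is installed: try UTF-8, then latin-1.
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        return content.decode("iso-8859-1")
```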
Buffering causes issues for applications that expect partial messages to come through without delay (long polling, logging and other such hacks). Thus, maybe it would be best to just always use ISO-8859-1 when nothing else is explicitly specified? Section 3.7.1 of https://tools.ietf.org/html/rfc2616 says: "When no explicit charset parameter is provided by the sender, media subtypes of the 'text' type are defined to have a default charset value of 'ISO-8859-1' when received via HTTP."
I wonder where the autodetection is actually used, because I just tried serving a UTF-8 document as…
The autodetection is only used for media types that are not… And yes, I'm more inclined to diverge from…
I found this issue as I was reporting an issue with… I've implemented this outside of…
Not really, no, although if we could provide a good enough API for allowing it as a third-party option then that'd be a good route.
There is a setter for… Would you be open to being able to set the…?
@alexchamberlain Possibly, but it'd need some careful thinking around how to make… I probably wouldn't want us to accept any API changes on it at the moment.
I'd suggest replacing chardet with dead-simple decoding logic (see the sketch below). It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such.
Otherwise ISO-8859-1 (as specified in the HTTP RFC) should be used instead of trying to detect languages, which is always quite unreliable. Further, ISO-8859-1 maps all possible byte values directly to Unicode code points 0-255, so the fallback decoding step will never raise UnicodeDecodeError.
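The original code block from this post isn't preserved here, so the following is a minimal sketch of the proposed logic under those assumptions; the function name decode_body is illustrative, not an existing API.

```python
# Minimal sketch of the proposed replacement for chardet-based detection.
# The name `decode_body` is hypothetical, not an existing library function.
def decode_body(content: bytes) -> str:
    try:
        # Any byte sequence that decodes cleanly as UTF-8 is almost certainly UTF-8.
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to ISO-8859-1: every byte 0-255 maps to a code point,
        # so this step can never raise UnicodeDecodeError.
        return content.decode("iso-8859-1")
```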
Since the chardet module has been around for a long time and is known as the standard go-to solution for charset detection, many modules depend on it, and this is a frequent source of dependency conflicts, such as the one reported here: sanic-org/sanic#1858
A more serious problem is that chardet is notoriously unreliable; in particular, its tendency to misdetect anything as Turkish is well known: