Encoding detection results in performance difference with other clients #1811

Closed
serathius opened this issue Apr 12, 2017 · 7 comments

@serathius

Long story short

When using aiohttp to fetch pages I ran into strange performance problems. We compared timings with other clients such as curl and requests, and the difference was significant: fetching a page was 6 times faster with the other clients than with aiohttp. After some digging we found that the cause was the "text" method, which runs encoding detection from chardet.

Requests also uses chardet, but it skips detection when the Content-Type contains the word "text" and falls back to 'ISO-8859-1'. https://github.com/kennethreitz/requests/blob/master/requests/utils.py#L362
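
For reference, the header-based fallback described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual requests source; the function name `encoding_from_content_type` is hypothetical and the parameter parsing is deliberately simplified.

```python
def encoding_from_content_type(content_type):
    """Return the charset named in a Content-Type value, an ISO-8859-1
    fallback for text types, or None when detection would be needed."""
    if not content_type:
        return None
    media_type, *params = content_type.split(";")
    for param in params:
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset":
            # An explicit charset always wins.
            return value.strip().strip("'\"")
    if "text" in media_type.lower():
        # No charset given: assume Latin-1 instead of running detection.
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html"))                 # ISO-8859-1
print(encoding_from_content_type("text/html; charset=utf-8"))  # utf-8
print(encoding_from_content_type("application/json"))          # None
```

Only the last case would leave the body to a detector such as chardet.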

Expected behaviour

Matching behavior to other popular clients.

Actual behaviour

Huge performance hit for pages without an explicit encoding, for example "Content-Type: text/html".
For a 300 kB page, using the "text" method (with encoding detection) took 8 s, versus 2 s using "read" (without it). For a 1.3 MB page the difference was 33 s versus 4.5 s.
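
Since the original page cannot be shared, a comparison like this can be approximated on a synthetic body. The sketch below is an assumption-laden stand-in: `time_call` and `compare` are hypothetical helpers, the body is generated text rather than a real page, and chardet is only timed if it happens to be installed. Absolute numbers will differ from the ones reported above.

```python
import time


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def compare(body):
    """Time a direct decode against chardet detection, if chardet is installed."""
    results = {"decode_utf8": time_call(body.decode, "utf-8")}
    try:
        import chardet  # optional dependency, assumed but not required
    except ImportError:
        return results
    results["chardet_detect"] = time_call(chardet.detect, body, repeat=1)
    return results


# Synthetic non-ASCII body; increase the multiplier to approach page-sized inputs.
body = ("<p>Zażółć gęślą jaźń</p>\n" * 1000).encode("utf-8")
for name, seconds in compare(body).items():
    print(f"{name}: {seconds:.4f} s")
```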

Steps to reproduce

I'm sorry I cannot disclose the page that I used for testing.

Your environment

I tested it in two environments:
Linux 4.8 Ubuntu 16.10 Python 3.6
aiohttp==1.0.5
chardet==2.3.0

Linux 4.8 Ubuntu 16.10 Python 3.5.2
aiohttp==2.0.6
chardet==3.0.1
^ In this environment the problem was about 30% smaller.

@asvetlov
Member

Not sure if Content-Type: text/html should be treated as ISO-8859-1.
Nowadays the default encoding is most likely UTF-8.
AFAIK there is no standard default encoding for an HTTP body in general or for text/html in particular.

If you need better performance and you know the page encoding, you could use (await resp.read()).decode('ISO-8859-1').

@serathius
Author

I easily fixed the problem by using resp.text('ISO-8859-1').
I just wanted to provide some feedback on how requests handles this.
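
The two workarounds mentioned so far can be sketched as small helpers. The names `fetch_text` and `fetch_and_decode` are hypothetical, and `session` is assumed to be an aiohttp `ClientSession` (or anything with the same `get()` / `text()` / `read()` shape) created by the caller.

```python
async def fetch_text(session, url, encoding=None):
    """Fetch `url` and decode the body.

    With encoding=None the client falls back to detection; passing a
    name such as "utf-8" or "ISO-8859-1" skips that step.
    """
    async with session.get(url) as resp:
        return await resp.text(encoding=encoding)


async def fetch_and_decode(session, url, encoding):
    """Same result via read(): raw bytes first, then an explicit decode."""
    async with session.get(url) as resp:
        return (await resp.read()).decode(encoding)
```

Either helper only helps when the encoding is actually known; passing the wrong one trades the slowdown for garbled text.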

@fafhrd91
Member

@serathius could you check whether cchardet is installed and whether its C extension is being used?
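
One quick way to check which detector is importable in a given environment is sketched below. `detector_backend` is a hypothetical helper; it only reports what can be imported, and whether the installed aiohttp version actually prefers cchardet is an assumption to verify against that version's source.

```python
import importlib.util


def detector_backend():
    """Report which charset detector is importable: the C-based cchardet,
    the pure-Python chardet, or neither (None)."""
    for name in ("cchardet", "chardet"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print("charset detector available:", detector_backend())
```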

@asvetlov
Member

The problem is that if the charset is not provided in the Content-Type header, applying ISO-8859-1 might corrupt non-English text.
As I said, nowadays most non-English HTML is encoded in UTF-8 (or at least should be). ISO-8859-1 decodes it without an error, but the resulting text is obviously broken.
That's why I think the current behavior should not be changed.

But a documentation update mentioning the possible performance issue and resp.text('ISO-8859-1') would be fine.

@serathius would you make a PR for the doc update?
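
The silent corruption is easy to reproduce with the standard library alone. The sample string below is arbitrary Polish text chosen for illustration:

```python
original = "Zażółć gęślą jaźń"
raw = original.encode("utf-8")

# ISO-8859-1 maps every byte to a character, so this never raises...
broken = raw.decode("iso-8859-1")
print(broken == original)           # False: the text is silently corrupted
print(len(broken) > len(original))  # True: each multi-byte character became several

# ...and the damage is only reversible if you know what happened.
print(broken.encode("iso-8859-1").decode("utf-8") == original)  # True
```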

@serathius
Author

@asvetlov Yes, after some more research and experiments with chardet

@fafhrd91
Member

fafhrd91 commented May 8, 2017

Do we have any actionable items for this ticket?

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there are important points made in this discussion,
please include those excerpts in that new issue.

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019