Encoding detection results in performance difference with other clients #1811

serathius · 2017-04-12T10:10:26Z

Long story short

When using aiohttp to fetch pages, I found strange performance problems. We compared timings with other clients such as curl and requests, and the difference was significant: the other clients fetched the page about 6 times faster than aiohttp. After some digging we found that the cause was the "text" method, which runs encoding detection from chardet.

Requests also uses chardet, but it skips detection when the Content-Type contains the word "text" and falls back to 'ISO-8859-1'. https://github.com/kennethreitz/requests/blob/master/requests/utils.py#L362
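
The header-based shortcut described above can be sketched roughly as follows. This is a simplified illustration, not the code from requests; the function name guess_encoding_from_content_type is hypothetical, and returning None is an assumed convention meaning "fall back to detection".

```python
def guess_encoding_from_content_type(content_type):
    """Pick an encoding from the Content-Type header alone.

    Hypothetical helper for illustration. Returns None when the header
    gives no hint, meaning the caller would have to detect the encoding
    from the body (the slow path discussed in this issue).
    """
    if not content_type:
        return None

    # Split "text/html; charset=utf-8" into the media type and its parameters.
    media_type, *params = [part.strip() for part in content_type.split(";")]

    # An explicit charset parameter always wins.
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip("'\"")

    # No charset given: text types default to ISO-8859-1 without detection.
    if "text" in media_type.lower():
        return "ISO-8859-1"

    return None


print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=utf-8"))
print(guess_encoding_from_content_type("application/json"))
```

With this rule, a bare "Content-Type: text/html" never reaches the detector, which is why no detection cost shows up for such pages.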

Expected behaviour

Matching behavior to other popular clients.

Actual behaviour

Huge performance hit for pages without an explicit encoding, for example "Content-Type: text/html".
For a 300kB page, fetching takes 8s with encoding detection (method "text") versus 2s without it (method "read"). For a 1.3MB page the difference is 33s versus 4.5s.

Steps to reproduce

I'm sorry I cannot disclose the page that I used for testing.

Your environment

I tested it on two environments
Linux 4.8 Ubuntu 16.10 Python 3.6
aiohttp==1.0.5
chardet==2.3.0

Linux 4.8 Ubuntu 16.10 Python 3.5.2
aiohttp==2.0.6
chardet==3.0.1
^ in that environment the problem was about 30% smaller

asvetlov · 2017-04-12T12:57:22Z

Not sure if Content-Type: text/html should be treated as ISO-8859-1.
Nowadays the default encoding is most likely UTF-8.
AFAIK there is no standard default encoding for an HTTP body in general or for text/html in particular.

If you need better performance and you know the page encoding you could use (await resp.read()).decode('ISO-8859-1')
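
A minimal sketch of that workaround, written so it runs offline: read_text and FakeResponse are hypothetical names, and FakeResponse only stands in for an aiohttp response object that exposes an awaitable read().

```python
import asyncio


async def read_text(resp, encoding="ISO-8859-1"):
    """Decode the body with a known encoding, skipping charset detection."""
    body = await resp.read()      # raw bytes, no detection involved
    return body.decode(encoding)


class FakeResponse:
    """Hypothetical stand-in for an aiohttp response, for offline use."""

    def __init__(self, body):
        self._body = body

    async def read(self):
        return self._body


# Body encoded as ISO-8859-1, decoded with the same explicit encoding.
text = asyncio.run(read_text(FakeResponse("café".encode("iso-8859-1"))))
print(text)
```

This is only correct when the encoding really is known; passing the wrong one silently produces garbled text rather than an error.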

serathius · 2017-04-12T13:00:38Z

I easily fixed the problem by using resp.text('ISO-8859-1').
I just wanted to provide some feedback on how requests handles this case.

fafhrd91 · 2017-04-12T15:31:46Z

@serathius could you check whether cchardet is installed and its C extension is being used?

asvetlov · 2017-04-12T16:24:40Z

The problem is that if the charset is not provided by the Content-Type header, applying ISO-8859-1 might corrupt non-English text.
As I said, most non-English HTML pages are now encoded in UTF-8 (or at least should be). ISO-8859-1 decodes them without an error, but the resulting text is obviously broken.
That's why I think the current behavior should not be changed.
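
The silent corruption can be reproduced with the standard library alone. The sample string below is an arbitrary choice for illustration.

```python
original = "Привет"                 # six Cyrillic characters
raw = original.encode("utf-8")      # what a UTF-8 page would send on the wire

# ISO-8859-1 maps every byte to a character, so this never raises,
# but each two-byte UTF-8 sequence turns into two unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the real encoding restores the text.
as_utf8 = raw.decode("utf-8")

print(len(original), len(as_latin1))   # 6 characters become 12
print(as_latin1 == original)           # False
print(as_utf8 == original)             # True
```

Because no exception is raised, the caller has no signal that the wrong encoding was applied.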

But a documentation update mentioning the possible performance issue and resp.text('ISO-8859-1') would be fine.

@serathius would you make a PR for the doc update?

serathius · 2017-04-13T07:36:05Z

@asvetlov Yes, after some more research and experiments with chardet

fafhrd91 · 2017-05-08T20:37:47Z

Do we have any action items for this ticket?

lock · 2019-10-28T19:01:55Z

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there are important points made in this discussion,
please include those excerpts in that new issue.

asvetlov closed this as completed in 9dc8423 Jun 26, 2017

lock bot added the outdated label Oct 28, 2019

lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019
