Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect line splitting in HttpRequestParser #97

Closed
tumb1er opened this issue Jul 3, 2014 · 18 comments
Closed

incorrect line splitting in HttpRequestParser #97

tumb1er opened this issue Jul 3, 2014 · 18 comments
Labels

Comments

@tumb1er
Copy link
Contributor

tumb1er commented Jul 3, 2014

We are receiving some HTTP requests that cause an Invalid Header error in aiohttp.
That's because of an issue in splitlines in
aiohttp.procotol.HttpRequestParser:

lines = raw_data.decode(
            'ascii', 'surrogateescape').splitlines(True)

For example, this code produces 4 lines instead of 3:

>>> rd = bytearray(b'line1\r\nline2_start_\x1c_line2_end\r\nline3\r\n')
>>> rd.decode('ascii', 'surrogateescape').splitlines(True)
['line1\r\n', 'line2_start_\x1c', '_line2_end\r\n', 'line3\r\n']

In my case it was invalid user-agent header for UCBrowser, but any thoughts how to fix it?

@popravich
Copy link
Member

we had the same issue with splitlines when parsing csv.
In our case calling splitlines(True) before decode() solved it. But this means lines must be decoded after splitlines (loop through whole list) or each line must be decoded in place where it is used.

@fafhrd91
Copy link
Member

fafhrd91 commented Jul 6, 2014

problem with split lines and then decode is parser performance degradation. i tested that in early days of aiohttp development.

@asvetlov @popravich @kxepal ideas?

@popravich
Copy link
Member

I've done few simple performance tests and here are the results:

>>> raw = b'\r\n'.join([b'part1\x1c_part2\r\nline2\r\nline3'] * 50000)
>>> raw2 = b'\r\n'.join([b'some-not-very-short-header: and_its_verryyyyyy_looooooooong_value'+b'e'*100] * 10000)
>>> len(raw), len(raw2)
... (1399998, 1669998)

# short lines
>>> %timeit raw.decode('ascii', 'surrogateescape').splitlines(True)
100 loops, best of 3: 14.5 ms per loop
>>> %timeit list(map(lambda b: b.decode('ascii', 'surogateescape'), raw.splitlines(1)))
10 loops, best of 3: 81.2 ms per loop
>>> %timeit next(map(lambda b: b.decode('ascii', 'surogateescape'), raw.splitlines(1)))
100 loops, best of 3: 7.68 ms per loop
>>> %timeit raw.decode('ascii', 'surogateescape').split('\r\n')
100 loops, best of 3: 11.3 ms per loop

# longer lines
>>> %timeit raw2.decode('ascii', 'surrogateescape').splitlines(True)
100 loops, best of 3: 2.68 ms per loop
>>> %timeit list(map(lambda b: b.decode('ascii', 'surogateescape'), raw2.splitlines(1)))
100 loops, best of 3: 7.97 ms per loop
>>> %timeit next(map(lambda b: b.decode('ascii', 'surogateescape'), raw2.splitlines(1)))
100 loops, best of 3: 2.22 ms per loop
>>> %timeit raw2.decode('ascii', 'surogateescape').split('\r\n')
100 loops, best of 3: 3.25 ms per loop

So maybe it makes sense to use next(map... pair?
What your thoughts?

@tumb1er
Copy link
Contributor Author

tumb1er commented Jul 7, 2014

next(map(...)) processes only first line, and all other variants process all headers data. Correct test is

def test_next():
    try:
        while True:
            next(map(...))
    except StopIteration:
        pass

@asvetlov
Copy link
Member

asvetlov commented Jul 7, 2014

@fafhrd91
The benchmark from @popravich is incorrect but I like the idea for splitting byte-string and decoding after that.

@popravich
Copy link
Member

Sorry for confusion.
Yes, the test isn't correct, morning coffee didn't work)

I meant using iterator that map returns instead of lines list.
I will do some tests on HttpRequestParser with different variants of split lines and come back later.

But I think that splitting bytes and then decoding lines in place where its used might not hit performance too much. Any way I will do some tests.

@tumb1er
Copy link
Contributor Author

tumb1er commented Jul 7, 2014

@popravich what about regex splitting? May be it'll both fix split logic and preserve line endings. Without performance penalty.

@fafhrd91 fafhrd91 mentioned this issue Jul 7, 2014
@fafhrd91
Copy link
Member

fafhrd91 commented Jul 7, 2014

how often headers encoding is broken?
can we do something like (optimistic approach):

   try:
       lines = raw.decode('ascii').splitlines(True)
   except UnicodeDecodeError:
       lines = split and decode with surrogateescape 

@tumb1er
Copy link
Contributor Author

tumb1er commented Jul 8, 2014

It doesn't raise decode error, it just splits extra lines by \x1c symbol. Real exception is in parse_headers:

            try:
                name, value = line.split(':', 1)
            except ValueError:
                raise ValueError('Invalid header: {}'.format(line)) from None

Good idea but hard to implement it.

PS. Exception happens for less than 0.1% of requests for me.

@fafhrd91
Copy link
Member

fafhrd91 commented Jul 8, 2014

ah! i think this is bug in .splitlines()
.split('\r\n') works right it just removes '\r\n' from strings.

@fafhrd91
Copy link
Member

fafhrd91 commented Jul 8, 2014

@tumb1er could you tes fix @a6a179c5ad1e011a73610588de3046487244bed1

@fafhrd91
Copy link
Member

fafhrd91 commented Jul 8, 2014

@asvetlov could you fill python bug report for .splitlines()

@tumb1er
Copy link
Contributor Author

tumb1er commented Jul 8, 2014

@fafhrd91, I've tested, it works now.

BTW, \x1c in ASCII is named "File separator, Information separator four" here http://donsnotes.com/tech/charsets/ascii.html
So it may be not a bug in splitlines :)

@fafhrd91
Copy link
Member

fafhrd91 commented Jul 8, 2014

maybe:)
documentation should say that then.

@fafhrd91 fafhrd91 closed this as completed Jul 8, 2014
@asvetlov
Copy link
Member

asvetlov commented Jul 8, 2014

@fafhrd91 I don't understand clean what exactly is wrong with .splitlines() ?

@fafhrd91
Copy link
Member

fafhrd91 commented Jul 8, 2014

.splitlines() treats this chars \x1c, \x1d,\x1e` as line endings.

@popravich
Copy link
Member

well, str.splitlines() treats those chars as line endings and bytes.splitlines() does not. that may a problem.

@lock
Copy link

lock bot commented Oct 29, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there's important points made in this discussion,
please include those exceprts into that new issue.

@lock lock bot added the outdated label Oct 29, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants