aiohttp client throws http errors for the following redirect #2624

Closed
wumpus opened this issue Dec 26, 2017 · 23 comments

Comments

@wumpus

wumpus commented Dec 26, 2017

Long story short

I have been fetching the front pages of millions of websites using aiohttp, and have collected a large number of cases where the aiohttp client's http parser throws errors for responses that browsers appear to accept. Some of these are real bugs in aiohttp's parser; others might be places where browsers don't obey the standard, and aiohttp might want to be more forgiving.

Here's an initial bug to see if you'd like me to do more triage on these.

Here is a 302 redirect that seems to work fine in curl and Firefox but aiohttp's http parser pukes on it.

Expected behaviour

$ curl http://lund.se/robots.txt -D /dev/tty
HTTP/1.1 302 Object Moved
Date: Tue, 26 Dec 2017 05:47:50 GMT
Connection: Keep-Alive
Content-Length: 0
Location: https://lund.se/robots.txt

Note Content-Length: 0.

If I tell curl to follow the redirect:

$ curl -L http://lund.se/robots.txt -D /dev/tty

that works and I see the actual https robots.txt file. My browsers also follow this redirect.

Actual behaviour

bug.py throws:

aiohttp.client_exceptions.ClientResponseError: 400, message='invalid constant string'

Steps to reproduce

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://lund.se/robots.txt')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

$ python bug.py

More examples

Since I'm crawling a lot of terrible websites, I have an easy ability to find more examples.

Other messages

OK in FireFox -- 301 to their frontpage. curl also follows this redir:
ClientPayloadError("400, message='Can not decode content-encoding: gzip'",) http://www.fusioncashsurveys.com/robots.txt

OK in FireFox -- 200 and a 5 line robots.txt:
(notice also that the message appears truncated?)
ClientResponseError("400, message='deflate'",) http://www.raundaz.com/robots.txt

301 OK in FireFox and curl: (at least the message isn't truncated this time)
ClientPayloadError("400, message='Can not decode content-encoding: deflate'",) http://www.labfortraining.it/

OK in FireFox, indeed it has a huge Content-Security-Policy header:
ClientResponseError("400, message='Got more than 8190 bytes when reading Header value is too long.'",) http://www.dakotabox.es/robots.txt

OK in FireFox, I suppose 'N/A:' is an invalid response header name
ClientResponseError("400, message='invalid character in header'",) http://www.charteroak.edu/robots.txt

Bad in FireFox, too, not a bug
ClientResponseError("400, message='invalid HTTP version'",) http://www.dgchangan.com/robots.txt

OK in FireFox, I see 2 Content-Length headers:
ClientResponseError("400, message='unexpected content-length header'",) http://www.ao30free.com/robots.txt

200, OK in FireFox, content-length looks OK to me
ClientResponseError("400, message='unexpected content-length header'",) http://www.bookfeeder.com/
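For anyone who wants to run a similar triage, here is a minimal sketch (not from the original report; `probe` and `triage` are invented names) that fetches a list of URLs concurrently and records which ones trip the aiohttp client:

```python
import asyncio
import aiohttp

async def probe(session, url):
    """Fetch one URL; return (url, None) on success or (url, error repr) on failure."""
    try:
        async with session.get(url) as resp:
            await resp.read()
            return url, None
    except aiohttp.ClientError as exc:
        return url, repr(exc)

async def triage(urls):
    """Probe every URL concurrently and collect client/parser errors."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(probe(session, u) for u in urls))

# e.g. results = asyncio.run(triage(['http://example.com/robots.txt']))
```

Each result pairs the URL with either `None` or the exception text, so the failures can be bucketed by parser message like the list above.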

Your environment

aiohttp 2.3.6 CLIENT
Python 3.6.4
Linux (CentOS 7.4.1708)

@asvetlov
Member

Thanks for the report

@koulVipin

koulVipin commented Jan 24, 2018

I am also getting this error: aiohttp.client_exceptions.ClientResponseError: 400, message='unexpected content-length header'.
Snippet of code:

import aiohttp
import asyncio
import async_timeout

url = 'http://example.com'  # placeholder; the failing URL was not included in the comment

async def fetch(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def main():
    headers = {}
    headers['Authorization'] = 'Basic xxxxxxx==='
    headers['Content-Type'] = 'application/x-www-form-urlencoded'
    headers['header1'] = 'somevalue'
    async with aiohttp.ClientSession(headers=headers) as session:
        for i in range(100):
            html = await fetch(session, url)
            print("\n", html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

@asvetlov
Member

Most likely the server responds with at least two Content-Length headers, the response is invalid.
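The rejection can be reproduced offline with a sketch like the following (not from this thread; `handle`, `RAW_RESPONSE`, and the local-server setup are all invented for illustration). A throwaway server replies with a duplicate Content-Length header, which the aiohttp client is expected to refuse to parse:

```python
import asyncio
import aiohttp

# A canned, invalid response: two Content-Length headers.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 2\r\n"
    b"Content-Length: 2\r\n"
    b"\r\n"
    b"ok"
)

async def handle(reader, writer):
    # Consume the request headers, then send the canned response.
    while True:
        line = await reader.readline()
        if line in (b"\r\n", b"\n", b""):
            break
    writer.write(RAW_RESPONSE)
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(f"http://127.0.0.1:{port}/") as resp:
                return f"ok: {await resp.text()}"
    except aiohttp.ClientError as exc:
        return f"rejected: {exc!r}"
    finally:
        server.close()
        await server.wait_closed()

print(asyncio.run(main()))
```

With recent aiohttp versions this should print a `rejected:` line with an "unexpected content-length"-style message, matching the error reported above.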

@wumpus
Author

wumpus commented Jan 28, 2018

Thanks to Postel's Law, many webservers emit invalid http. This is similar to many webpages being invalid html. Yet browsers display these pages. The html5 standard now standardizes how everyone is supposed to treat broken html; no such standard exists for broken http.

I'd like to know (1) are you interested in fixing this to work like browsers or (2) will you take patches that fix it to work like browsers or (3) aiohttp is a thing of beauty which perfectly implements the standard :-)

For (1) I can provide a large number of test cases, and help triage them. For (2) I can write patches for the things which are most common in my web crawls. For (3) I will admire your idealism.

@asvetlov
Member

I definitely prefer option (2), but let's discuss fixes case by case.
Sorry, I'm not very motivated to fix weird cases myself (at least while they don't hurt me at my job, for example), but I'm open to reviewing and accepting patches to improve the situation.

@iho

iho commented Feb 12, 2018

ClientResponseError("400, message='invalid character in header'",) http://www.charteroak.edu/robots.txt
Do you have a workaround for this error? I don't need to read the headers, only the body.

@asvetlov
Member

@iho just install aiohttp 3.0

@iho

iho commented Feb 12, 2018

pip install aiohttp==3.0
Requirement already satisfied: aiohttp==3.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages
Requirement already satisfied: idna-ssl>=1.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: multidict<5.0,>=4.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: chardet<4.0,>=2.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: async-timeout<2.0,>=1.2 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: attrs>=17.4.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: yarl<2.0,>=1.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: idna>=2.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from idna-ssl>=1.0->aiohttp==3.0)

Example of code

import aiohttp
import asyncio

async def main():
    url = 'https://flyp.me/api/v1/order/create'

    data = {
      "order": {
      "from_currency": "LTC",
      "to_currency": "ZEC",
      "ordered_amount": "0.01",
      "destination": "t1SBTywpsDMKndjogkXhZZSKdVbhadt3rVt"
      }
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as response:
            print(await response.text())

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Traceback

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 678, in start
    (message, payload) = await self._protocol.read()
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/streams.py", line 533, in read
    await self._waiter
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client_proto.py", line 161, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
  File "aiohttp\_http_parser.pyx", line 295, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadHttpMessage: 400, message='invalid character in header'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "example.py", line 20, in <module>
    loop.run_until_complete(main())
  File "/home/user/.pyenv/versions/3.6.4/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "example.py", line 16, in main
    async with session.post(url, json=data) as response:
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client.py", line 779, in __aenter__
    self._resp = await self._coro
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client.py", line 331, in _request
    await resp.start(conn, read_until_eof)
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 683, in start
    message=exc.message, headers=exc.headers) from exc
aiohttp.client_exceptions.ClientResponseError: 400, message='invalid character in header'

@asvetlov
Member

The problem is in the upstream Node.js HTTP parser used by the C extension.
The AIOHTTP_NO_EXTENSIONS environment variable disables the fast C parser; the pure-Python fallback processes the response correctly.
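For anyone trying this workaround: the variable has to be in the environment before aiohttp is imported, so set it on the command line rather than inside the script. A minimal sketch (the `python -c` one-liner is just a smoke test that the import works):

```shell
# Run with the pure-Python parser; the variable must be set before
# the Python process imports aiohttp.
AIOHTTP_NO_EXTENSIONS=1 python -c "import aiohttp; print(aiohttp.__version__)"
```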

@webknjaz
Member

Worth upgrading the vendored lib.

@iho

iho commented Feb 12, 2018

@asvetlov thank you!

@asvetlov
Member

@webknjaz upstream didn't fix the problem; it only added support for the SOURCE HTTP verb.

@lopuhin

lopuhin commented Mar 26, 2018

Another example of getting ClientResponseError: 400, message='invalid constant string' from a well-behaving (I think) web service, using aiohttp==3.1.0. Code to reproduce:

import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.delete('http://proxy.crawlera.com:8010/session/foo') as response:
            print(repr(response))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

gives:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 695, in start
    (message, payload) = await self._protocol.read()
  File "venv/lib/python3.6/site-packages/aiohttp/streams.py", line 533, in read
    await self._waiter
  File "venv/lib/python3.6/site-packages/aiohttp/client_proto.py", line 161, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
  File "aiohttp/_http_parser.pyx", line 297, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadHttpMessage: 400, message='invalid constant string'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "t.py", line 12, in <module>
    loop.run_until_complete(main())
  File "/Users/kostia/.pyenv/versions/3.6.4/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "t.py", line 7, in main
    async with session.delete('http://proxy.crawlera.com:8010/session/foo') as response:
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 783, in __aenter__
    self._resp = await self._coro
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 333, in _request
    await resp.start(conn, read_until_eof)
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 700, in start
    message=exc.message, headers=exc.headers) from exc
aiohttp.client_exceptions.ClientResponseError: 400, message='invalid constant string'

The response that gives the error (repr(data) in client_proto.py, line 161) is

b'HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nDate: Mon, 26 Mar 2018 11:15:21 GMT\r\nProxy-Connection: close\r\nTransfer-Encoding: chunked\r\nWWW-Authenticate: Basic realm="Crawlera"\r\nX-Crawlera-Error: bad_auth\r\n\r\n0\r\n\r\n0\r\n\r\n'

and the actual response that I'd like to parse (I can't provide a public repro for it, but it gives the same error) is

b'HTTP/1.1 204 No Content\r\nConnection: close\r\nDate: Mon, 26 Mar 2018 11:08:45 GMT\r\nProxy-Connection: close\r\nTransfer-Encoding: chunked\r\n\r\n0\r\n\r\n0\r\n\r\n'

AIOHTTP_NO_EXTENSIONS=1 also does not help, although the error is different:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 695, in start
    (message, payload) = await self._protocol.read()
  File "venv/lib/python3.6/site-packages/aiohttp/streams.py", line 533, in read
    await self._waiter
  File "venv/lib/python3.6/site-packages/aiohttp/client_proto.py", line 162, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
  File "venv/lib/python3.6/site-packages/aiohttp/http_parser.py", line 142, in feed_data
    msg = self.parse_message(self._lines)
  File "venv/lib/python3.6/site-packages/aiohttp/http_parser.py", line 408, in parse_message
    raise BadStatusLine(line) from None
aiohttp.http_exceptions.BadStatusLine: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "t.py", line 12, in <module>
    loop.run_until_complete(main())
  File "/Users/kostia/.pyenv/versions/3.6.4/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "t.py", line 7, in main
    async with session.delete('http://proxy.crawlera.com:8010/session/foo') as response:
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 783, in __aenter__
    self._resp = await self._coro
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 333, in _request
    await resp.start(conn, read_until_eof)
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 700, in start
    message=exc.message, headers=exc.headers) from exc
aiohttp.client_exceptions.ClientResponseError: 400, message='Bad Request'

@pl77

pl77 commented Oct 24, 2018

I found that aiohttp behaved much better with Crawlera (at least with some sites) if I avoided the proxy_auth argument and explicitly entered my API key in the url. For example:

import asyncio
import aiohttp
from yarl import URL

urllist = ['https://google.com', 'https://bing.com']
proxy_api = "54fuj567a43see7uedhd9498c45_APIstringfromCrawlera"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"

proxy = "http://{}@{}:{}/".format(proxy_api, proxy_host, proxy_port)

async def main():
    async with aiohttp.ClientSession() as session:
        jobs = [asyncio.ensure_future(session.get(URL(url), ssl=False, proxy=proxy))
                for url in urllist]
        done_jobs = await asyncio.gather(*jobs)
        for response in done_jobs:
            print(response.status, "status code for", response.url)

asyncio.get_event_loop().run_until_complete(main())

@unl1k3ly

unl1k3ly commented Nov 29, 2018

Hi all,
Have any of you been able to fix this issue?
I'm still getting invalid character in header when requesting an endpoint with GET.
When I set the AIOHTTP_NO_EXTENSIONS=1 variable, I get Invalid HTTP Header: X-XSS-Protection=1;.

I'm using python 3.7, on aiohttp-3.4.4.

Cheers

@asvetlov
Member

@unl1k3ly but X-XSS-Protection=1; header is really invalid, isn't it?

@unl1k3ly

@asvetlov thanks for the prompt reply, mate. I'm not sure what you mean. The request works with curl and the Python requests module; with aiohttp I get that error as output. In fact, my endpoint does return that HTTP header... would there be a way to bypass this exception and finally print the content?

Cheers

@unl1k3ly

All I'm getting now is aiohttp.client_exceptions.ClientResponseError: 400, message='invalid character in header'. I'm running aiohttp-3.4.4.

Cheers

@unl1k3ly

So, more updates on this... I've just tested with requests-futures and grequests, and both seem to return the right content rather than raising an exception on a malformed response header.
Is there a way to bypass this exception so aiohttp finishes the request, @asvetlov?

Thank you for all support.

@asvetlov
Member

If you want to modify the parser code to recover after an invalid header string -- a PR is welcome.
I have no time or motivation to work on handling malformed headers myself, but I will review any improvement suggestion.

@autogestion

So this makes it impossible for an aiohttp server to process requests with HTTP signatures.

@ctg3

ctg3 commented Mar 1, 2022

I'm stuck on aiohttp 3.6.3 because with aiohttp 3.7 and 3.8 I get an invalid character in header exception. The AIOHTTP_NO_EXTENSIONS workaround did not solve the issue for me. I would really appreciate a way to recover from the error and still receive the response body.

@Dreamsorcerer
Member

The response that gives the error (repr(data) in client_proto.py, line 161) is

b'HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nDate: Mon, 26 Mar 2018 11:15:21 GMT\r\nProxy-Connection: close\r\nTransfer-Encoding: chunked\r\nWWW-Authenticate: Basic realm="Crawlera"\r\nX-Crawlera-Error: bad_auth\r\n\r\n0\r\n\r\n0\r\n\r\n'

This is invalid, because a response should be finished after receiving a 0 length chunk:
https://www.rfc-editor.org/rfc/rfc9112.html#section-7.1-3

i.e. \r\n0\r\n\r\n should be removed from the end of that response.
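For reference, the termination rule can be checked with the standard library's parser. This sketch (FakeSocket is an ad-hoc helper invented here, not aiohttp code) parses a chunked body that ends with the single zero-length chunk the RFC requires:

```python
import http.client
import io

class FakeSocket:
    """Ad-hoc stand-in so http.client can parse canned bytes."""
    def __init__(self, data: bytes):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file

# A valid chunked response: the body ends at the first zero-length chunk.
VALID = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"2\r\nok\r\n"   # one 2-byte chunk
    b"0\r\n\r\n"     # zero-length chunk terminates the message
)

resp = http.client.HTTPResponse(FakeSocket(VALID))
resp.begin()
print(resp.status, resp.read())  # -> 200 b'ok'
```

Any bytes sent after that terminator, like the extra `0\r\n\r\n` in the Crawlera responses, would be read as the start of the *next* response on a keep-alive connection, which is why a strict parser rejects them.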

The sample in the original issue seems to be fine now.

@Dreamsorcerer closed this as not planned on Aug 5, 2023