
Inconsistent Handling of Invalid URLs #1832

Open
cancan101 opened this issue Sep 1, 2021 · 6 comments
Labels
user-experience Ensuring that users have a good experience using the library

Comments


cancan101 commented Sep 1, 2021

If I call httpx.get('https://😇'), an InvalidURL exception is raised. However, if I call httpx.get('https:/google.com'), I instead get an UnsupportedProtocol exception, which seems inconsistent. I would expect the latter to raise an InvalidURL as well.

For reference, requests raises an InvalidURL in both cases.

This issue also exists for requests.get('https:///google.com') vs httpx.get('https:///google.com').
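
For quick reference, a minimal reproduction of the two cases described above (the comments note the exceptions observed; exact messages may vary between versions):

>>> import httpx
>>> httpx.get('https://😇')          # raises httpx.InvalidURL
>>> httpx.get('https:/google.com')   # raises httpx.UnsupportedProtocol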

@tomchristie tomchristie added the user-experience Ensuring that users have a good experience using the library label Sep 1, 2021

tomchristie commented Sep 1, 2021

Thanks @cancan101 - Appreciate the neatly summarised issue.


tomchristie commented Oct 20, 2021

So...

A good place to start with something like this is to unpick it into the smallest possible components, in order to figure out exactly what behaviour we want, and to make sure it's consistent throughout the API.

Here's where I got to after some first steps...

>>> u = httpx.URL('https://😇')  # Raises `InvalidURL`.

And in contrast...

>>> u = httpx.URL('https:/google.com')
>>> u.scheme
'https'
>>> u.host
''
>>> u.path
'/google.com'

Now, that's not necessarily undesired behaviour at this point. We might consider the first to be a strictly invalid URL, and the second to be a valid URL that just happens to be missing a hostname.

We can also confirm that URLs instantiated with explicit portions end up the same way here...

>>> u = httpx.URL(scheme='https', path='/google.com')
>>> u
URL('https:/google.com')

So at this level of the API we are at least consistent.

I'm somewhat surprised, though, that this results in an UnsupportedProtocol when used to send a request. (Since it has a protocol, but is missing a host.)

Let's take a closer look...

>>> c = httpx.Client()
>>> r = c.build_request('GET', 'https:/google.com')
>>> r.url
URL('/google.com')

Hrm. Something's changed here once we've gone through build_request. What's that about exactly?
Haven't dug into it yet, not sure.
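
For reference, sending that request does surface the originally reported exception - presumably because the scheme has gone missing by this point (exact message may differ between versions):

>>> c.send(r)   # raises httpx.UnsupportedProtocol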


stale bot commented Feb 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 20, 2022
@tomchristie

Still valid thx, @Stale bot.

@stale stale bot removed the wontfix label Feb 21, 2022

stale bot commented Mar 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 25, 2022
@shelbylsmith

@tomchristie
I think the issue may stem from checking whether the URL is absolute:

httpx/httpx/_urls.py

Lines 341 to 350 in 5b06aea

    def is_absolute_url(self) -> bool:
        """
        Return `True` for absolute URLs such as 'http://example.com/path',
        and `False` for relative URLs such as '/path'.
        """
        # We don't use `.is_absolute` from `rfc3986` because it treats
        # URLs with a fragment portion as not absolute.
        # What we actually care about is if the URL provides
        # a scheme and hostname to which connections should be made.
        return bool(self._uri_reference.scheme and self._uri_reference.host)

In this case, since u.host is '', this check returns False and the URL is treated as a relative URL.
This means that _merge_url tries to prepend self.base_url to the 'relative' URL; but since the URL was intended to be absolute, no base_url was provided, and so the scheme has now been dropped. Since there is no longer a scheme, we receive UnsupportedProtocol (a short trace after the excerpt below illustrates this):

httpx/httpx/_client.py

Lines 369 to 389 in 5b06aea

    def _merge_url(self, url: URLTypes) -> URL:
        """
        Merge a URL argument together with any 'base_url' on the client,
        to create the URL used for the outgoing request.
        """
        merge_url = URL(url)
        if merge_url.is_relative_url:
            # To merge URLs we always append to the base URL. To get this
            # behaviour correct we always ensure the base URL ends in a '/'
            # separator, and strip any leading '/' from the merge URL.
            #
            # So, eg...
            #
            # >>> client = Client(base_url="https://www.example.com/subpath")
            # >>> client.base_url
            # URL('https://www.example.com/subpath/')
            # >>> client.build_request("GET", "/path").url
            # URL('https://www.example.com/subpath/path')
            merge_raw_path = self.base_url.raw_path + merge_url.raw_path.lstrip(b"/")
            return self.base_url.copy_with(raw_path=merge_raw_path)
        return merge_url
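
A rough trace of that path (just a sketch; reprs may differ slightly between httpx versions):

>>> u = httpx.URL('https:/google.com')
>>> u.is_relative_url               # host is '', so is_absolute_url returns False
True
>>> client = httpx.Client()         # no base_url given, so it defaults to an empty URL
>>> client._merge_url(u)            # appended to the empty base URL: the scheme is dropped
URL('/google.com')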

Is it sufficient for is_absolute_url to just check whether there is a scheme, since for a relative URL the scheme would be provided by base_url?
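
Purely as a sketch of that suggestion (untested, and the docstring/comment wording is mine), the check would become something like:

    def is_absolute_url(self) -> bool:
        """
        Return `True` for URLs that carry a scheme, such as
        'http://example.com/path', and `False` for relative URLs such as '/path'.
        """
        # Only require a scheme here; for a genuinely relative URL the scheme
        # (and host) would come from the client's `base_url` instead.
        return bool(self._uri_reference.scheme)

That would also make URL('https:/google.com') count as absolute, so _merge_url would leave it untouched rather than stripping the scheme.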
