
urllib.parse wrongly strips empty #fragment, ?query, //netloc #67041

Open

soilandreyes mannequin opened this issue Nov 12, 2014 · 19 comments
Labels
stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments


soilandreyes mannequin commented Nov 12, 2014

BPO 22852
Nosy @orsenthil, @rbtcollins, @merwok, @cjerdonek, @berkerpeksag, @vadmium, @demianbrecht, @openandclose
Files
  • has_netloc.patch
  • has_netloc.v2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2014-11-12.10:23:30.997>
    labels = ['type-feature', 'library', '3.11']
    title = 'urllib.parse wrongly strips empty #fragment, ?query, //netloc'
    updated_at = <Date 2022-02-12.11:48:41.130>
    user = 'https://bugs.python.org/soilandreyes'

    bugs.python.org fields:

    activity = <Date 2022-02-12.11:48:41.130>
    actor = 'eric.araujo'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2014-11-12.10:23:30.997>
    creator = 'soilandreyes'
    dependencies = []
    files = ['38465', '38633']
    hgrepos = ['279']
    issue_num = 22852
    keywords = ['patch', 'needs review']
    message_count = 17.0
    messages = ['231070', '231099', '235582', '237999', '238220', '238246', '238251', '238255', '238258', '238264', '238897', '244516', '247905', '322778', '323123', '371181', '413124']
    nosy_count = 10.0
    nosy_names = ['orsenthil', 'rbcollins', 'eric.araujo', 'chris.jerdonek', 'berker.peksag', 'martin.panter', 'piotr.dobrogost', 'demian.brecht', 'soilandreyes', 'op368']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22852'
    versions = ['Python 3.11']

    Linked PRs


    soilandreyes mannequin commented Nov 12, 2014

urllib.parse can't handle URIs with empty #fragments. The fragment is removed and not reconstituted.

    http://tools.ietf.org/html/rfc3986#section-3.5 permits empty fragment strings:

      URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
      fragment    = *( pchar / "/" / "?" )
    

It even specifies component recomposition so as to distinguish between a component that is not defined and one that is an empty string:

    http://tools.ietf.org/html/rfc3986#section-5.3

    Note that we are careful to preserve the distinction between a
    component that is undefined, meaning that its separator was not
    present in the reference, and a component that is empty, meaning that
    the separator was present and was immediately followed by the next
    component separator or the end of the reference.

    This seems to be caused by missing components being represented as '' instead of None.

    >>> import urllib.parse
    >>> urllib.parse.urlparse("http://example.com/file#")
    ParseResult(scheme='http', netloc='example.com', path='/file', params='', query='', fragment='')
    >>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/file#"))
    'http://example.com/file'
    
    >>> urllib.parse.urlparse("http://example.com/file#").geturl()
    'http://example.com/file'
    
    >>> urllib.parse.urlparse("http://example.com/file# ").geturl()
    'http://example.com/file# '
    
    >>> urllib.parse.urlparse("http://example.com/file#nonempty").geturl()
    'http://example.com/file#nonempty'
    
    >>> urllib.parse.urlparse("http://example.com/file#").fragment
    ''

    The suggested fix is to use None instead of '' to represent missing components, and to check with "if fragment is not None" instead of "if not fragment".

    The same issue applies to query and authority. E.g.

    http://example.com/file? != http://example.com/file

    ... but be careful about the implications of

    file:///file != file:/file
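
    A minimal sketch of what the suggested fix implies for recomposition (the recompose() helper below is hypothetical, not part of urllib.parse): with None meaning "absent", a present-but-empty component keeps its separator.

    def recompose(scheme, netloc, path, query, fragment):
        # Rough sketch of RFC 3986 section 5.3 recomposition, assuming
        # missing components are represented as None rather than ''.
        url = ""
        if scheme is not None:
            url += scheme + ":"
        if netloc is not None:        # present even if empty -> keep "//"
            url += "//" + netloc
        url += path
        if query is not None:         # present even if empty -> keep "?"
            url += "?" + query
        if fragment is not None:      # present even if empty -> keep "#"
            url += "#" + fragment
        return url

    >>> recompose("http", "example.com", "/file", None, "")
    'http://example.com/file#'
    >>> recompose("http", "example.com", "/file", None, None)
    'http://example.com/file'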

soilandreyes (mannequin) added the stdlib (Python modules in the Lib dir) label Nov 12, 2014

    soilandreyes mannequin commented Nov 13, 2014

I tried to make a patch for this, but I found it quite hard, as urllib/parse.py is fairly low-level, e.g. it is constantly encoding/decoding bytes and strings within each URI component. Basically the code assumes tuples of strings, with support for both bytes and strings baked in later.

    As you see in

    https://github.com/stain/cpython/compare/issue-2285-urllib-empty-fragment?expand=1

the patch in parse.py is small, but the effect of it in test_urlparse.py is a bit bigger, as lots of tests check that the result of urlsplit has '' instead of None. It is uncertain how much real-life client code also checks for '' directly. ("if not p.fragment" would of course still work, but "if p.fragment == ''" would not work anymore.)

I therefore suggest an alternative to my patch above: add some boolean fields like has_fragment, so that the existing component fields can keep their backwards-compatible '' and b'' values even when a component is actually missing, while still allowing geturl() to reconstitute the URI according to the RFC.


    vadmium commented Feb 9, 2015

I also liked the idea of returning None to distinguish a missing URL component from an empty-but-present component, and it would make them more consistent with the “username” and “password” fields. But I agree it would break backwards compatibility too much. The idea of “has_fragment” etc flags might work. The ParseResult etc class signatures could be expanded like this:

    SplitResult(scheme, netloc, path, query, fragment, *, has_netloc=None, has_query=None, has_fragment=None)
    >>> url1 = SplitResult("http", "localhost", "/path", query="", fragment="")
    >>> url1.has_netloc
    True
    >>> url1.has_fragment
    False
    >>> url2 = SplitResult("http", "localhost", "/path", query="", fragment="", has_fragment=True)
    >>> url2.has_fragment
    True
    >>> url2.has_query
    False
    >>> url2.geturl()
    "http://localhost/path#"

    Is it also worth adding “has_params” for urlparse()?


    vadmium commented Mar 13, 2015

    There have been a few recent bug reports (bpo-23505, bpo-23636) that may be solved by the has_netloc proposal. So I am posting a patch implementing it. The changes were a bit more involved than I anticipated, but should still be usable.

I reused some of Stian’s tests; however, the results are slightly different in my patch, matching the existing behaviour:

    • Never sets netloc, query, fragment to None
    • Always leaves hostname as None rather than ""
    • Retains username, password and port components in netloc
    • Converts hostname to lowercase

    Unfortunately I discovered that you cannot add __slots__ to namedtuple() subclasses; see bpo-17295 and bpo-1173475. Therefore in my patch I have removed __slots__ from the SplitResult etc classes, so that those classes can gain the has_netloc etc attributes.

    I chose to make the default has_netloc value based on existing urlunsplit() behaviour:

    >>> empty_netloc = ""
    >>> SplitResult("mailto", empty_netloc, "chris@example.com", "", "").has_netloc
    False
    >>> SplitResult("file", empty_netloc, "/path", "", "").has_netloc
    True

I found out that the “urllib.robotparser” module uses a urlunparse(urlparse()) combination to normalize URLs, so it had to be changed. This is a backwards incompatibility of this proposal.
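
    For context, that normalization pattern looks roughly like the following (a simplified sketch, not the exact robotparser code); today the round trip silently drops empty “//”, “?” and “#” markers, which is exactly what the has_netloc proposal changes:

    >>> from urllib.parse import urlparse, urlunparse
    >>> def normalize(url):
    ...     return urlunparse(urlparse(url))
    ...
    >>> normalize("http://example.com/file#")
    'http://example.com/file'
    >>> normalize("http://example.com/file?")
    'http://example.com/file'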

vadmium added the type-feature (A feature request or enhancement) label Mar 13, 2015

    demianbrecht mannequin commented Mar 16, 2015

    I've done an initial pass in Rietveld and left some comments, mostly around docs. Here are some additional questions though:

Given that the has_* flags can be inferred during instantiation of the *Result classes, is there a reason to have them writable, meaning is there a reason to add them to the __init__ methods?

I'd also like to see this logic moved out of _NetlocResultMixinBase. I'm assuming it was put there for backwards compatibility, which is understandable, but I don't think it makes sense to add such logic to a mixin whose purpose is additional functionality around the netloc attribute. This one's a little more subjective though, but here's a rough idea of what I'm thinking:

    SplitResult = namedtuple('SplitResult', 'scheme netloc path query fragment')
    class _SplitResultBase(SplitResult):
        def __new__(cls, scheme, netloc, path, query, fragment):
            inst = super().__new__(cls, scheme, netloc, path, query, fragment)
            inst.has_netloc = bool(netloc)
            return inst
    >>> s = urlsplit('http://example.com/foo/bar/')
    >>> s.has_netloc
    True

    This keeps backwards compatibility, but also adds the additional logic to the bases rather than in the mixins. I might also split out the logic into helper functions in order to avoid duplication between _SplitResultBase and _ParseResultBase.

    This method also avoids the dependency on ordering of base classes as well as the addition of a variadic signature to (Split|Parse)Result.__init__.

    Thoughts?


    vadmium commented Mar 16, 2015

    ## Inferring flags ##

    The whole reason for the has_netloc etc flags is that I don’t think we can always infer their values, so we have to explicitly remember them. Consider the following two URLs, which I think should both have empty “netloc” strings for backwards compatibility, but should be handled differently by urlunsplit():

    >>> urlsplit("////evil.com").netloc
    ''
    >>> urlsplit("////evil.com").has_netloc
    True
    >>> urlunsplit(urlsplit("////evil.com"))  # Adds “//” back
    '////evil.com'
    >>> urlsplit("/normal/path").netloc
    ''
    >>> urlsplit("/normal/path").has_netloc
    False
    >>> urlunsplit(urlsplit("/normal/path"))  # Does not add “//”
    '/normal/path'

    ## _NetlocResultMixinBase abuse ##

    The _NetlocResultMixinBase class is a common class used by the four result classes I’m interested in. I probably should rename it to something like _SplitParseMixinBase, since it is the common base to both urlsplit() and urlparse() results.


    demianbrecht mannequin commented Mar 17, 2015

    >>> urlsplit("////evil.com").netloc
    ''
    >>> urlsplit("////evil.com").has_netloc
    True
    >>> urlunsplit(urlsplit("////evil.com")) # Adds “//” back
    '////evil.com'

    RFC 3986, section 3.3:

    If a URI contains an authority component, then the path component
    must either be empty or begin with a slash ("/") character. If a URI
    does not contain an authority component, then the path cannot begin
    with two slash characters ("//").

    Because this is a backwards incompatible behavioural change and is just as invalid as far as the RFC goes, I think that the current behaviour should be preserved. Even though it's still incorrect, it won't break existing code if left unchanged.

    _NetlocResultMixinBase abuse

    The _NetlocResultMixinBase class is a common class used by the four result classes I’m interested in. I probably should rename it to something like _SplitParseMixinBase, since it is the common base to both urlsplit() and urlparse() results.

    I think I'm coming around to this and realizing that it's actually quite close to my proposal, the major difference being the additional level of hierarchy in mine. My issue was mostly due to the addition of the variadic signature in the docs (i.e. line 407 here: http://bugs.python.org/review/22852/diff/14176/Doc/library/urllib.parse.rst) which led me to believe a nonsensical signature would be valid. After looking at it again, __new__ is still bound to the tuple's signature, so you still get the following:

    >>> SplitResult('scheme','authority','path','query','fragment','foo','bar','baz')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 137, in __new__
        self = super().__new__(type, *pos, **kw)
    TypeError: __new__() takes 6 positional arguments but 9 were given

    So I'm less opposed to this as-is. I would like to see the "*" removed from the docs though as it's misleading in the context of each of (Split|Parse)Result. I do agree that renaming _NetlocResultMixinBase would be helpful, but it might also be nice (from a pedant's point of view) to remove "mixin" altogether if the __new__ implementation stays as-is.


    vadmium commented Mar 17, 2015

    Regarding unparsing of "////evil.com", see bpo-23505, where the invalid behaviour is pointed out as a security issue. This was one of the bugs that motivated me to make this patch. I cannot imagine some existing code (other than an exploit) that would be broken by restoring the empty “//” component; do you have an example?

    Why do you think the asterisks (*) in the Split/ParseResult signatures are misleading? I am trying to document that the has_ flags are keyword-only parameters. I avoided making them positional parameters, as they are not part of the underlying tuple object.


    demianbrecht mannequin commented Mar 17, 2015

    I avoided making them positional parameters, as they are not part of the underlying tuple object.

    Ignore me, I was off my face and you're absolutely correct.


    demianbrecht mannequin commented Mar 17, 2015

    I cannot imagine some existing code (other than an exploit) that would be broken by restoring the empty “//” component; do you have an example?

    You're likely right about the usage (I can't think of a plausible use case at any rate).

    At first read of bpo-23505, I think I agree with the changes you've made to allow for successful round-tripping, but I'd think it'll have to be vetted by at least more than one core dev.


    vadmium commented Mar 22, 2015

    Posting patch v2 with these changes:

    • Split out scheme documentation fixes to bpo-23684.
    • Renamed _NetlocResultMixinBase → _SplitParseBase
    • Explained the default values of the flags better, and what None means
    • Changed to Demian’s forward-looking “version changed” notices. I decided it is okay because the same information can be inferred.
    • Tweaked urlunsplit() and urlunparse() descriptions


    vadmium commented May 31, 2015

    Anyone want to review my new patch? This is a perennial issue; see all the duplicates I just linked.

vadmium changed the title from “urllib.parse wrongly strips empty #fragment” to “urllib.parse wrongly strips empty #fragment, ?query, //netloc” May 31, 2015
rbtcollins commented:

    See also bpo-6631

cjerdonek commented:

    I just learned of this issue. Rather than adding has_netloc, etc. attributes, why not use None to distinguish missing values as is preferred above, but add a new boolean keyword argument to urlparse(), etc. to get the new behavior (e.g. "allow_none" to parallel "allow_fragments")?

    It seems like this would be more elegant, IMO, because it would lead to the API we really want. For example, the ParseResult(), etc. signatures and repr() values would be simpler. Changing the default value of the new keyword arguments would also provide a clean and simple deprecation pathway in the future, if desired.
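
    A rough emulation of how the proposed allow_none behaviour could look for the query and fragment (urlparse() has no such parameter today; parse_allow_none() below is purely illustrative):

    >>> from urllib.parse import urlparse
    >>> def parse_allow_none(url):
    ...     # A component whose separator is absent from the URL is reported
    ...     # as None rather than ''.  Rough sketch; netloc is not handled.
    ...     p = urlparse(url)
    ...     fragment = p.fragment if "#" in url else None
    ...     query = p.query if "?" in url.split("#", 1)[0] else None
    ...     return p._replace(query=query, fragment=fragment)
    ...
    >>> parse_allow_none("http://example.com/file#").fragment
    ''
    >>> print(parse_allow_none("http://example.com/file").fragment)
    None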


    vadmium commented Aug 4, 2018

I like this option. I suppose choosing which option to take is a compromise between compatibility and simplicity. In the short term, the “allow_none” option requires user code to be updated. In the long term it may break compatibility. But the “has_netloc” etc option is more complex.


    openandclose mannequin commented Jun 10, 2020

    I found another related issue (bpo-37969).

    I also filed one myself (bpo-40938).

    ---

One thing against the 'has_netloc' etc. solution is that while it guarantees round-trips (urlunsplit(urlsplit('...')) etc.), it is conditional on 'urlunsplit' receiving a 'SplitResult' object. When 'urlunsplit' gets a plain tuple, it is helpless.
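
    To illustrate the limitation (has_fragment and friends are only a proposal here, not stdlib attributes): any flag carried on the result object is lost as soon as the result is flattened to a plain tuple, so urlunsplit() has nothing to consult:

    >>> from urllib.parse import urlsplit, urlunsplit
    >>> parts = urlsplit("http://example.com/file#")
    >>> urlunsplit(tuple(parts))   # a plain 5-tuple cannot carry a has_fragment flag
    'http://example.com/file'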


    merwok commented Feb 12, 2022

    See also bpo-46337

merwok added the 3.11 (only security fixes) label Feb 12, 2022
ezio-melotti transferred this issue from another repository Apr 10, 2022

    geryogam commented Dec 3, 2022

    I opened issue #99962 discussing alternative solutions based on a new representation (instead of the empty string) for undefined components.

But after learning about this issue, I think that @cjerdonek’s approach of using a None representation for undefined components, together with a new flag parameter (allow_none) in the parsing functions (urldefrag(), urlparse(), and urlsplit()), is the best solution: it is the simplest representation for undefined components, it is backward compatible since the flag defaults to the legacy behaviour, and it does not create yet another parsing function.

    @cjerdonek’s solution is not unprecedented in the standard library. For instance @erlend-aasland has recently chosen a similar approach in PR #93823 for implementing the new PEP-249 manual commit mode in the sqlite3 module, by introducing a new autocommit parameter in the connect() function which changes the behaviour of the Connection.__init__(), Connection.execute(), Connection.executemany(), Connection.executescript(), Connection.commit(), Connection.rollback(), and Connection.close() methods.

serhiy-storchaka commented:

I tried to implement @cjerdonek's idea.

• Add a boolean option allow_none in urlparse(), urlsplit() and urldefrag() -- if it is true, then undefined components are represented as None instead of an empty string.
    • Add a boolean option keep_empty in urlunparse(), urlunsplit() and the geturl() method in all six result classes -- if it is true, keep empty non-None components in the resulting string.

Now you can call a parsing function with allow_none=True to get an unambiguous result, and then call an unparsing function with keep_empty=True to preserve the original empty components, or with keep_empty=False (default) to strip empty components. Calling with keep_empty=True after allow_none=False is an error -- it produces a result with ";?#" added if the corresponding components were absent in the original.

But preserving the original representation is more common than stripping empty components, so in most cases we want to use keep_empty=True after allow_none=True. It makes sense to make this the default in the geturl() method, since we know whether a named tuple is the result of a call with allow_none=True.
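
    A sketch of the intended round trip with the proposed options (allow_none and keep_empty exist only in the patch described here, not in any released urllib.parse, so this is shown as comments rather than runnable code):

    # parts = urlsplit("http://example.com/file?#", allow_none=True)
    # parts.query == '' and parts.fragment == ''   (present but empty);
    # for "http://example.com/file" they would instead be None.
    #
    # urlunsplit(parts, keep_empty=True)    # -> 'http://example.com/file?#'
    # urlunsplit(parts, keep_empty=False)   # -> 'http://example.com/file'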

So we need to encode the value of allow_none in the namedtuple results. There are two ways:

• Add an attribute. It should not be part of the tuple, for backward compatibility. (A keyword-only parameter can be added in the constructor, although this complicates things.) We need to override _replace() and __replace__() to preserve that attribute in copies; see the sketch after this list.
• Add different classes for allow_none=True. There are currently six result classes (DefragResult, SplitResult, ParseResult and their bytes counterparts), not counting base classes and mixins; this number would simply be doubled.
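
    A minimal sketch of the first option (illustrative only; the real result classes carry more machinery, and the attribute name _allow_none is hypothetical):

    from collections import namedtuple

    _SplitBase = namedtuple("SplitResult", "scheme netloc path query fragment")

    class SplitResult(_SplitBase):
        # The flag lives outside the tuple, so existing tuple unpacking still works.
        def __new__(cls, scheme, netloc, path, query, fragment, *, allow_none=False):
            self = super().__new__(cls, scheme, netloc, path, query, fragment)
            self._allow_none = allow_none
            return self

        def _replace(self, **kwargs):
            # Rebuild through __new__ so copies keep the out-of-band flag.
            fields = dict(zip(self._fields, self))
            fields.update(kwargs)
            return type(self)(**fields, allow_none=self._allow_none)

    >>> r = SplitResult("http", "example.com", "/file", None, "", allow_none=True)
    >>> r._replace(path="/other")._allow_none
    True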

I am hesitating over which way to choose.

An alternative to boolean options is adding a new set of parsing and unparsing functions; newurlparse(string) is shorter than urlparse(string, allow_none=True). The old functions could be deprecated and removed in the future.

    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 25, 2024
    …I components
    
    Changes in the urllib.parse module:
    
    * Add option allow_none in urlparse(), urlsplit() and urldefrag(). If
      it is true, represent not defined components as None instead of an
      empty string.
    * Add option keep_empty in urlunparse() and urlunsplit(). If it is
      true, keep empty non-None components in the resulting string.
    * Add option keep_empty in the geturl() method of DefragResult,
      SplitResult, ParseResult and the corresponding bytes counterparts.
picnixz removed the 3.11 (only security fixes) label Aug 25, 2024
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 31, 2024
…I components (commit message identical to the Aug 25 commit above)