Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError on path remove_dot_segments when there's extra dot-dot (/../) segments #536

Closed
besfahbod opened this issue Oct 14, 2020 · 6 comments · Fixed by #741
Closed

ValueError on path remove_dot_segments when there's extra dot-dot (/../) segments #536

besfahbod opened this issue Oct 14, 2020 · 6 comments · Fixed by #741
Labels

Comments

@besfahbod
Copy link
Contributor

besfahbod commented Oct 14, 2020

🐞 Describe the bug
As described in RFC 3986 § 5.2.4. Remove Dot Segments, the remove_dot_segments algorithm removes any extra /../ parts of the URL, ignoring errors when the stack is empty.

However, yarl.URL() behavior at the moment is to raise an exception, ValueError, when that happens.

💡 To Reproduce
Instantiate URL class with such a URL string:

yarl.URL('https://example.com/alice/../../eve')

💡 Expected behavior
Follow the RFC, and ignore the error, to get it working like this:

In [16]: yarl.URL('https://example.com/alice/../eve')
Out[16]: URL('https://example.com/eve')

In [17]: yarl.URL('https://example.com/alice/../../eve')
Out[17]: URL('https://example.com/eve')

In [18]: yarl.URL('https://example.com/alice/../../../eve')
Out[18]: URL('https://example.com/eve')

We probably want to also test (and fix, if needed), any other parts of the API that would involve path resolution (including remove_dot_segments) steps.

📋 Logs/tracebacks

In [15]: yarl.URL('https://example.com/alice/../../eve')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e72c86c9e132> in <module>()
----> 1 yarl.URL('https://example.com/alice/../../eve')

~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in __new__(cls, val, encoded, strict)
    181                 path = cls._normalize_path(path)
    182
--> 183             cls._validate_authority_uri_abs_path(host=host, path=path)
    184             query = cls._QUERY_REQUOTER(val[3])
    185             fragment = cls._FRAGMENT_REQUOTER(val[4])

~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in _validate_authority_uri_abs_path(host, path)
    675         if len(host) > 0 and len(path) > 0 and not path.startswith("/"):
    676             raise ValueError(
--> 677                 "Path in a URL with authority should start with a slash ('/') if set"
    678             )
    679

ValueError: Path in a URL with authority should start with a slash ('/') if set

📋 Your version of the Python

$ python --version
Python 3.6.10

📋 Your version of the aiohttp/yarl/multidict distributions

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.6.2
$ python -m pip show multidict
Name: multidict
Version: 4.7.6
$ python -m pip show yarl
Name: yarl
Version: 1.6.2

📋 Additional context
The proposed behavior would put yarl.URL on par with DOM's URL object, which already follows the RFC on this:

> String(new URL("https://example.com/alice/../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../../eve"))
< "https://example.com/eve"

As well as rust-url library (from the unit tests):

  {
    "input": "http://example.com/foo/bar/../ton/../../a",
    // ...
    "pathname": "/a",
    // ...
  },
  {
    "input": "http://example.com/foo/../../..",
    // ...
    "pathname": "/",
    // ...
  },
  {
    "input": "http://example.com/foo/../../../ton",
    // ...
    "pathname": "/ton",
    // ...
  },

which has tests based on (warning: broken links):

  "# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
  "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",
@besfahbod besfahbod added the bug label Oct 14, 2020
@asvetlov
Copy link
Member

Interesting.
I (maybe wrong) read the RFC as "remove_dot_segments is applicable to path joining procedure only".
I agree that the same behavior as mentioned JavaScript and Rust libraries is desirable.

A pull request is welcome!

@besfahbod
Copy link
Contributor Author

So, here's the relevant pieces from the RFC:

image

And:

image

And the algorithm under section 5.2.2. Transform References.

I think what we need to pay attention to is the difference between a "Reference URL" and a "Target URL". A Target URL is expected to be resolved and normalized, meaning that it's expected to not have any . or .. sequences.

From that perspective, I believe the input to yarl.URL() is expected to be a Reference URL. Does that sound about right?


We could also consider comparing the rules with https://url.spec.whatwg.org/ , which seem to be more close to the common "URL Library" implementations.


Anyways, since we do desire the compatibility, as suggested above, I'm going to get it to work as suggestion.

@mjpieters
Copy link
Contributor

mjpieters commented Jun 14, 2022

which has tests based on (warning: broken links):

  "# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
  "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",

The last revision in the Webkit LayoutTests repository with such a file is 149680; the next rev folded the segments.js file into segments.html. It doesn't actually contain any tests for .. and . handling.

If you were looking at urltestdata.json (on which the Rust tests are actually based), you may have missed that there are multiple # Based on comments, the relevant section is based on paths.js in the same Webkit test suite, nowadays at LayoutTests/fast/url/paths.html

@mjpieters
Copy link
Contributor

mjpieters commented Jun 14, 2022

I've converted those tests from JSON to a small Python script:

from yarl import URL
from urllib.parse import unquote, quote

cases = [
    ("/././foo", "/foo"),
    ("/./.foo", "/.foo"),
    ("/foo/.", "/foo/"),
    ("/foo/./", "/foo/"),
    ("/foo/bar/..", "/foo/"),
    ("/foo/bar/../", "/foo/"),
    ("/foo/..bar", "/foo/..bar"),
    ("/foo/bar/../ton", "/foo/ton"),
    ("/foo/bar/../ton/../../a", "/a"),
    ("/foo/../../..", "/"),
    ("/foo/../../../ton", "/ton"),
    ("/foo/%2e", "/foo/"),
    ("/foo/%2e%2", "/foo/%2e%2"),
    ("/foo/%2e./%2e%2e/.%2e/%2e.bar", "/%2e.bar"),
    ("////../..", "//"),
    ("/foo/bar//../..", "/foo/"),
    ("/foo/bar//..", "/foo/bar/"),
    ("/foo", "/foo"),
    ("/%20foo", "/%20foo"),
    ("/foo%", "/foo%"),
    ("/foo%2", "/foo%2"),
    ("/foo%2zbar", "/foo%2zbar"),
    ("/foo%2©zbar", "/foo%2%C3%82%C2%A9zbar"),
    ("/foo%41%7a", "/foo%41%7a"),
    ("/foo\t\u0091%91", "/foo%C2%91%91"),
    ("/foo%00%51", "/foo%00%51"),
    ("/(%28:%3A%29)", "/(%28:%3A%29)"),
    ("/%3A%3a%3C%3c", "/%3A%3a%3C%3c"),
    ("/foo\tbar", "/foobar"),
    ("\\\\foo\\\\bar", "//foo//bar"),
    ("/%7Ffp3%3Eju%3Dduvgw%3Dd", "/%7Ffp3%3Eju%3Dduvgw%3Dd"),
    ("/@asdf%40", "/@asdf%40"),
    ("/你好你好", "/%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD"),
    ("/‥/foo", "/%E2%80%A5/foo"),
    ("//foo", "/%EF%BB%BF/foo"),
    ("/\u202e/foo/\u202d/bar", "/%E2%80%AE/foo/%E2%80%AD/bar"),
]
requote = False
for i, (test, expected) in enumerate(cases, 1):
    if requote:
        expected = quote(
            unquote(expected, encoding="latin1"), safe="/@:()=", encoding="latin1"
        )
    try:
        actual = URL(f"http://example.com{test}").raw_path
    except ValueError as e:
        actual = f"exception: {e}"

    if actual != expected:
        print(f"({i}) \x1b[31mFAIL\x1b[0m: {test}")
        print(f"         {actual}\n      != {expected}")

and this outputs:

(11) FAIL: /foo/../../../ton
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /ton
(13) FAIL: /foo/%2e%2
         /foo/.%252
      != /foo/%2e%2
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /%2e.bar
(20) FAIL: /foo%
         /foo%25
      != /foo%
(21) FAIL: /foo%2
         /foo%252
      != /foo%2
(22) FAIL: /foo%2zbar
         /foo%252zbar
      != /foo%2zbar
(23) FAIL: /foo%2©zbar
         /foo%252%C3%82%C2%A9zbar
      != /foo%2%C3%82%C2%A9zbar
(24) FAIL: /foo%41%7a
         /fooAz
      != /foo%41%7a
(26) FAIL: /foo%00%51
         /foo%00Q
      != /foo%00%51
(27) FAIL: /(%28:%3A%29)
         /((::))
      != /(%28:%3A%29)
(28) FAIL: /%3A%3a%3C%3c
         /::%3C%3C
      != /%3A%3a%3C%3c
(30) FAIL: \\foo\\bar
         /
      != //foo//bar
(31) FAIL: /%7Ffp3%3Eju%3Dduvgw%3Dd
         /%7Ffp3%3Eju=duvgw=d
      != /%7Ffp3%3Eju%3Dduvgw%3Dd
(32) FAIL: /@asdf%40
         /@asdf@
      != /@asdf%40

Several of these are not actual failures; yarl.URL() gives us equivalent paths with valid %hh substitutions for 13, 20-24, 26-28, 31 and 32, and in addition 28 also replaced %3c with %3C, so uppercase hex instead of lowercase. Those are fine, and those disappear when you set requote to True.

That leaves 11, 14 and 30:

(11) FAIL: /foo/../../../ton
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /ton
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /%2e.bar
(30) FAIL: \\foo\\bar
         /
      != //foo//bar

I'm not so sure about turning backslashes into forward slashes nor am I inclined to research that specific issue right now.

The following alteration to the _normalize_path() loop makes 11 and 14 pass:

        for seg in segments:
            if seg not in (".", ".."):
                resolved_path.append(seg)
            elif seg == ".." and (len(resolved_path) > 1 or resolved_path[-1]):
                resolved_path.pop()
            # seg == "."

@besfahbod
Copy link
Contributor Author

Thanks for working on this, @mjpieters! I generally agree with the direction here.

Specifically, I think not-raising-an-exception is the most-important problem to fix here, as it causes processing of user-provided data to fail in a place where no such errors are expected.

@mjpieters
Copy link
Contributor

Here's an alternative version, one avoids the (slower) if check for each path segment like the original did:

        prefix, resolved_path = "", []
        if path.startswith("/"):
            # preserve the "/" root element of absolute paths.
            prefix = "/"
            path = path[1:]

        segments = path.split("/")
        for seg in segments:
            if seg == "..":
                # ignore any .. segments that would otherwise cause an
                # IndexError when popped from resolved_path if
                # resolving for rfc3986
                with suppress(IndexError):
                    resolved_path.pop()
            elif seg != ".":
                resolved_path.append(seg)

        if segments and segments[-1] in (".", ".."):
            # do some post-processing here.
            # if the last segment was a relative dir,
            # then we need to append the trailing '/'
            resolved_path.append("")

        return prefix + "/".join(resolved_path)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants