Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. #88048

Closed
orsenthil opened this issue Apr 18, 2021 · 52 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-security A security issue

Comments

@orsenthil
Copy link
Member

BPO 43882
Nosy @gpshead, @orsenthil, @vstinner, @ned-deily, @OddBloke, @ambv, @mgorny, @apollo13, @mlissner, @pablogsal, @miss-islington, @tirkarthi, @felixxm, @sethmlarson
PRs
  • bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. #25595
  • [3.9] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) #25725
  • [3.8] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) #25726
  • [3.7] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) #25727
  • [3.6] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) #25728
  • [3.9] bpo-43882 Remove the newline, and tab early. From query and fragments. #25853
  • bpo-43882 Remove the newline, and tab early. From query and fragments. #25921
  • [3.7] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. #25923
  • [3.6] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs #25924
  • [3.10] bpo-43882 Remove the newline, and tab early. From query and fragments. (GH-25921) #25936
  • [3.7] bpo-43882 - Mention urllib.parse changes in Whats New section for 3.7.11 #26267
  • [3.6] bpo-43882 - Mention urllib.parse changes in Whats New section for 3.6.14 #26268
  • [3.10] bpo-43882 - Mention urllib.parse changes in Whats new section. #26275
  • [3.9] bpo-43882 - Mention urllib.parse changes in Whats new section. #26276
  • [3.8] bpo-43882 - Mention urllib.parse changes in Whats new section. #26277
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2021-06-02.01:26:09.260>
    created_at = <Date 2021-04-18.19:37:00.259>
    labels = ['type-security', '3.8', '3.9', '3.10', '3.11', '3.7', 'library']
    title = '[security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs.'
    updated_at = <Date 2022-02-09.11:40:17.243>
    user = 'https://github.com/orsenthil'

    bugs.python.org fields:

    activity = <Date 2022-02-09.11:40:17.243>
    actor = 'felixxm'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2021-06-02.01:26:09.260>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2021-04-18.19:37:00.259>
    creator = 'orsenthil'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43882
    keywords = ['patch']
    message_count = 47.0
    messages = ['391343', '391352', '391426', '391859', '392334', '392338', '392611', '392781', '392808', '392835', '392873', '392926', '392944', '392971', '392995', '393009', '393025', '393030', '393033', '393034', '393039', '393049', '393107', '393108', '393136', '393139', '393142', '393144', '393146', '393149', '393150', '393198', '393203', '393205', '393207', '393211', '393997', '394056', '394057', '394058', '394062', '394112', '394113', '396628', '412688', '412705', '412821']
    nosy_count = 14.0
    nosy_names = ['gregory.p.smith', 'orsenthil', 'vstinner', 'ned.deily', 'odd_bloke', 'lukasz.langa', 'mgorny', 'apollo13', 'Mike.Lissner', 'pablogsal', 'miss-islington', 'xtreak', 'felixxm', 'sethmlarson']
    pr_nums = ['25595', '25725', '25726', '25727', '25728', '25853', '25921', '25923', '25924', '25936', '26267', '26268', '26275', '26276', '26277']
    priority = 'high'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'security'
    url = 'https://bugs.python.org/issue43882'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']

    @orsenthil
    Copy link
    Member Author

    A security issue was reported by Mike Lissner wherein an attacker was able to use \r\n in the url path, the urlparse method didn't sanitize and allowed those characters be present in the request.

    In [9]: from urllib.parse import urlsplit
    In [10]: urlsplit("java\nscript:alert('bad')")
    Out[10]: SplitResult(scheme='', netloc='', path="java\nscript:alert('bad')", query='', fragment='')

    Firefox and other browsers ignore newlines in the scheme. From
    the browser console:

    > new URL("java\nscript:alert(bad)")
    << URL { href: "javascript:alert(bad)", origin: "null", protocol:
    "javascript:", username: "", password: "", host: "", hostname: "", port: "", pathname: "alert(bad)", search: ""

    Mozilla Developers informed about the controlling specification for URLs is in fact defined by the "URL Spec"
    from WHATWG which updates RFC 3986 and specifies that tabs and newlines
    should be stripped from the scheme.

    See: https://url.spec.whatwg.org/#concept-basic-url-parser

    That link defines an automaton for URL parsing. From that link, steps 2 and 3 of scheme parsing read:

    If input contains any ASCII tab or newline, validation error.
    3. Remove all ASCII tab or newline from input.


    urlparse module behavior should be updated, and an ASCII tab or newline should be removed from the url (sanitized) before it is sent to the request, as WHATWG spec.

    @orsenthil orsenthil added 3.10 only security fixes 3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes labels Apr 18, 2021
    @orsenthil orsenthil self-assigned this Apr 18, 2021
    @orsenthil orsenthil added 3.10 only security fixes type-security A security issue 3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes labels Apr 18, 2021
    @orsenthil orsenthil self-assigned this Apr 18, 2021
    @orsenthil orsenthil added the type-security A security issue label Apr 18, 2021
    @tirkarthi
    Copy link
    Member

    See also a related issue to sanitise newline on other helper functions https://bugs.python.org/issue30713

    See also discussion and compatibility on disallowing control characters : https://bugs.python.org/issue30458

    @vstinner vstinner changed the title urllib.parse should sanitize urls containing ASCII newline and tabs. [security] urllib.parse should sanitize urls containing ASCII newline and tabs. Apr 20, 2021
    @vstinner vstinner changed the title urllib.parse should sanitize urls containing ASCII newline and tabs. [security] urllib.parse should sanitize urls containing ASCII newline and tabs. Apr 20, 2021
    @vstinner vstinner added stdlib Python modules in the Lib dir labels Apr 20, 2021
    @vstinner
    Copy link
    Member

    See also bpo-43883.

    @orsenthil
    Copy link
    Member Author

    I have added a PR to remove ascii newlines and tabs from URL input. It is as per the WHATWG spec.

    However, I still like to research more and find out if this isn't introducing behavior that will break existing systems. It should also be aligned the decisions we have made with previous related bug reports.

    Please review.

    @orsenthil
    Copy link
    Member Author

    New changeset 76cd81d by Senthil Kumaran in branch 'master':
    bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595)
    76cd81d

    @orsenthil
    Copy link
    Member Author

    New changeset 491fde0 by Miss Islington (bot) in branch '3.9':
    [3.9] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) (GH-25725)
    491fde0

    @gpshead
    Copy link
    Member

    gpshead commented May 1, 2021

    I think there's still a flaw in the fixes implemented in 3.10 and 3.9 so far. We're closer, but probably not quite good enough yet.

    why? We aren't stripping the newlines+tab early enough.

    I think we need to do the stripping *right after* the _coerce_args(url, ...) call at the start of the function.

    Otherwise we
    (1) are storing url variants with the bad characters in _parse_cache [a mere slowdown in the worst case as it'd just overflow the cache sooner]
    (2) are splitting the scheme off the URL prior to stripping. in 3.9+ there is a check for valid scheme characters, which will defer to the default scheme when found. The WHATWG basic url parsing has these characters stripped before any parts are split off though, so 'ht\rtps' - for example - would wind up as 'https' rather than our behavior so far of deferring to the default scheme.

    I noticed this when reviewing the pending 3.8 PR as it made it more obvious due to the structure of the code and would've allowed characters through into query and fragment in some cases. #25726 (review)

    @ambv
    Copy link
    Contributor

    ambv commented May 3, 2021

    Good catch, Greg. Since it's not merged already, this change will miss 3.8.10 but as a security fix will be included in 3.8.11 later in the year.

    The partial fix already landed in 3.9 will be released in 3.9.5 later today unless it's amended or reverted in a few hours.

    @orsenthil
    Copy link
    Member Author

    Based on Greg's review comment, I have pushed the fix for 3.9, and 3.8

    There is no need to hold off releases for these alone. If we get it merged before the release cut today, fine, otherwise, they will be in the next security fix.

    @orsenthil
    Copy link
    Member Author

    New changeset 8a59574 by Senthil Kumaran in branch '3.9':
    [3.9] bpo-43882 Remove the newline, and tab early. From query and fragments. (bpo-25853)
    8a59574

    @mgorny
    Copy link
    Mannequin

    mgorny mannequin commented May 4, 2021

    I hate to be the bearer of bad news but I've already found this change to be breaking tests of botocore and django. In both cases, the test failure is apparently because upstream used to reject URLs after finding newlines in the split components, and now they're silently stripped away.

    Filed bugs:
    boto/botocore#2377
    https://code.djangoproject.com/ticket/32713

    Note that I'm not saying the change should be reverted.

    @sethmlarson
    Copy link
    Mannequin

    sethmlarson mannequin commented May 4, 2021

    Leaving a thought here, I'm highlighting that we're now implementing two different standards, RFC 3986 with hints of WHATWG-URL. There are pitfalls to doing so as now a strict URL parser for RFC 3986 (like the one used by urllib3/requests) will give different results compared to Python and thus opens up the door for SSRF vulnerabilities 1.

    @mlissner
    Copy link
    Mannequin

    mlissner mannequin commented May 4, 2021

    I haven't watched that Blackhat presentation yet, but from the slides, it seems like the fix is to get all languages parsing URLs the same as the browsers. That's what @orsenthil has been doing here and plans to do in https://bugs.python.org/issue43883.

    Should we get a bug filed with requests/urllib3 too? Seems like a good idea if it suffers from the same problems.

    @gpshead
    Copy link
    Member

    gpshead commented May 5, 2021

    Both Django and Botocore issues appear to be in the category of: "depending on invalid data being passed through our urlsplit API so that they could look for it later" Not much sympathy. We never guaranteed we'd pass invalid data through. They're depending on an implementation detail (Hyrum's law). Invalid data causes other people who don't check for it problems. There is no valid solution on our end within the stdlib that won't frustrate somebody.

    We chose to move towards safer (undoubtedly not perfect) by default.

    Instead of the patches as you see them, we could've raised an exception. I'm sure that would also also have tripped up existing code depending on the undesirable behavior.

    If one wants to reject invalid data as an application/library/framework, they need a validator. The Python stdlib does not provide a URL validation API. I'm not convinced we would even want to (though that could be something bpo-43883 winds up providing) given how perilous that is to get right: Who's version of right? which set of standards? when and why? Conclusion: The web... such a mess.

    @ned-deily
    Copy link
    Member

    My reading of the previous message was, even if we raised exception
    or gave as a parameter, it wont be any better for certain downstream
    users, as we let the security problem open, and have it only as opt-in fix.

    Senthil, I am not sure which previous message you are referring to but, with regards to my comment about revert the recent fixes for 3.7 and 3.6 until the reported problems are resolved, I should add that, given the recent input from downstream users about the side effects, the only way we *should* proceed with the current changes is by including more information in a What's New entry and the NEWS blurb about that the implications to users are of these changes.

    @gpshead
    Copy link
    Member

    gpshead commented May 7, 2021

    There is no less intrusive fix as far as I can see. I believe we're down to either stick with what we've done, or do nothing. It doesn't have to be the same choice in all release branches, being more conservative with changes the older the stable branch is okay. (ie: removing this from 3.6 and 3.7 seems fine even if more recent ones do otherwise)

    Based on my testing, raising an exception is more intrusive to existing tests (which we can only ever hope is representative of code) than stripping. At least as exposed by running the changes through many tens of thousands of unittest suites at work.

    ie: if we raise an exception, pandas.read_json() starts failing because that winds up using urlsplit in hopes of extracting the scheme and comparing that to known values as their method of deciding if something should be treated as a URL to data rather than data. Pandas would need to be fixed.

    That urlsplit() API use pattern is repeated in various other pieces of code: urlsplit is not expected to raise an exception. The caller then has a conditional or two testing some parts of the urlsplit result to make a guess as to if something should be considered a URL or not. Doing code inspection, pandas included, this code pretty much always then goes on to pass the original url value off to some other library, be it urllib, or requests, or ...).

    Consequences of that code inspection finding? With our existing character stripping change, new data is then allowed to pass through these urlsplit uses and be considered a URL. Which leads to some code sending the url with embedded \r\n\t chars on to other APIs - a concern expressed a couple of times above.

    Even though urlsplit isn't itself a validation API, it gets used as an early step in peoples custom identification and validation attempts. So *any* change we make to it at all in any way breaks someones expectations, even if they probably shouldn't have had those expectations and aren't doing wise validation.

    Treat this analysis as a sign that we should provide an explicit url validator because almost everyone is doing it some form of wrong. (bpo-43883)

    I did wonder if Mike's suggestion of removing the characters during processing, but leaving them in the final result in https://bugs.python.org/issue43882#msg393033 is feasible as remediation for this? My gut feeling is that it isn't. It doesn't solve the problem of preventing the bad data from going where it shouldn't. Even if we happen to parse that example differently, the unwanted characters are still retained in other places they don't belong. Fundamantelly: We require people to make a different series of API call and choices in the end user code to **explicitly not use unvalidated inputs**. Our stdlib API surface can't satisfy that today and use of unvalidated data in wrong places is a broad software security antipattern theme.

    @orsenthil
    Copy link
    Member Author

    Ned wrote:

    Senthil, I am not sure which previous message you are referring to but.

    I meant, the messages from other developers who raised that change broke certain test cases.

    Ned, but I got little concerned, if we planned to revert the change.

    the only way we *should* proceed with the current changes is by including more information in a What's New entry and the NEWS blurb about that the implications to users are of these changes.

    I agree with completely. I will include an additional blurb for this change for security fix versions.

    Greg wrote:

    There is no less intrusive fix as far as I can see. I believe we're down to either stick with what we've done, or do nothing.

    Exactly my feeling too.

    It doesn't have to be the same choice in all release branches, being more conservative with changes the older the stable branch is okay. (ie: removing this from 3.6 and 3.7 seems fine even if more recent ones do otherwise)

    I hadn't considered that. But it wont save much will be my opinion. The users will have to upgrade to supported versions anyway and it will break then. The problem is only pushed a little.

    So, keeping it consistent seems alright to me. It is a little additional for everyone, but we seem to be doing it.

    @ned-deily
    Copy link
    Member

    I will include an additional blurb for this change for security fix versions.

    Ping. This issue is still blocking 3.7 and 3.6 security releases.

    @ned-deily
    Copy link
    Member

    New changeset c723d51 by Senthil Kumaran in branch '3.7':
    [3.7] bpo-43882 - Mention urllib.parse changes in Whats New section for 3.7.11 (GH-26267)
    c723d51

    @ned-deily
    Copy link
    Member

    New changeset 6f743e7 by Senthil Kumaran in branch '3.6':
    [3.6] bpo-43882 - Mention urllib.parse changes in Whats New section for 3.6.14 (GH-26268)
    6f743e7

    @ned-deily
    Copy link
    Member

    Thanks, Senthil and Greg! The updates for 3.7 and 3.6 are now merged. Is there anything else that needs to be done for this issue or can it now be closed?

    @gpshead
    Copy link
    Member

    gpshead commented May 20, 2021

    Lets get equivalent whatsnew text into the 3.8 and 3.9 and 3.10 branches before closing it.

    @orsenthil
    Copy link
    Member Author

    New changeset f14015a by Senthil Kumaran in branch '3.10':
    [3.10] bpo-43882 - Mention urllib.parse changes in Whats new section. (GH-26275)
    f14015a

    @orsenthil
    Copy link
    Member Author

    New changeset 0593ae8 by Senthil Kumaran in branch '3.9':
    [3.9] bpo-43882 - Mention urllib.parse changes in Whats new section. (GH-26276)
    0593ae8

    @gpshead gpshead closed this as completed Jun 2, 2021
    @gpshead gpshead closed this as completed Jun 2, 2021
    @ambv
    Copy link
    Contributor

    ambv commented Jun 28, 2021

    New changeset 634da2d by Senthil Kumaran in branch '3.8':
    [3.8] bpo-43882 - Mention urllib.parse changes in Whats new section. (bpo-26277)
    634da2d

    @vstinner
    Copy link
    Member

    vstinner commented Feb 6, 2022

    CVE-2022-0391 has been assigned to this vulnerability.

    @vstinner vstinner changed the title [security] urllib.parse should sanitize urls containing ASCII newline and tabs. [security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. Feb 6, 2022
    @vstinner vstinner changed the title [security] urllib.parse should sanitize urls containing ASCII newline and tabs. [security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. Feb 6, 2022
    @mlissner
    Copy link
    Mannequin

    mlissner mannequin commented Feb 7, 2022

    Looks like that CVE isn't public yet.

    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0391

    Any chance I can get access (I originally reported this vuln.). My email is mike@free.law, if it's possible and my email is needed.

    Thanks!

    @vstinner
    Copy link
    Member

    vstinner commented Feb 8, 2022

    Looks like that CVE isn't public yet.
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0391
    Any chance I can get access (I originally reported this vuln.).

    Message from Gaurav Kamathe who requested the CVE:

    "We've sent a request to MITRE to get this published and it'll be available on MITRE shortly."

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @timmc
    Copy link

    timmc commented Aug 25, 2022

    @gpshead urlsplit already raises an exception for some malformed inputs: https://github.com/python/cpython/blob/v3.9.13/Lib/urllib/parse.py#L484 -- you can try it out with urlsplit('http://]/').

    This kind of eclectic approach -- taking part of one spec, part of another, ignoring invalid values, turning invalid outputs into answers without a warning -- is just going to cause more vulnerabilities. A URL with newlines in it was never valid, for browsers or anything else. Now there's just another parser that will give answers that don't match other parsers (whether it's browsers, or curl, or anything else). Better to bite the bullet and say "hey, that's not parseable".

    @gpshead
    Copy link
    Member

    gpshead commented Aug 25, 2022

    If you have a concrete proposal to do something different or have found other bugs, please open a new Issue. Comments added to an old merged PR are likely to be ignored and unseen.

    @timmc
    Copy link

    timmc commented Aug 25, 2022

    Understood. I mostly wanted to correct the record on urlsplit's existing behavior, for people looking back at the git blame to figure out why urlsplit behaves the way it does.

    (And of course I couldn't resist tacking on a warning about parser mismatch, which I don't think got enough air time in the discussion.)

    @parth-gr
    Copy link

    parth-gr commented Sep 5, 2022

    @orsenthil Why isn't this backported to python2?

    Should I open a manual backport of this? As it will be helpful for python2 users also

    @gpshead
    Copy link
    Member

    gpshead commented Sep 5, 2022

    python2 is EOL and receives zero support here.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-security A security issue
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants