[security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs. #88048
A security issue was reported by Mike Lissner wherein an attacker was able to abuse ASCII newline characters embedded in the URL scheme, because Firefox and other browsers ignore newlines in the scheme. Mozilla developers informed us that the controlling specification for URLs is in fact the WHATWG "URL Spec". See: https://url.spec.whatwg.org/#concept-basic-url-parser That link defines an automaton for URL parsing. Steps 2 and 3 of the basic URL parser read: "If input contains any ASCII tab or newline, validation error. Remove all ASCII tab or newline from input." The urlparse module's behavior should be updated so that any ASCII tab or newline is removed from the URL (sanitized) before it is sent in a request, per the WHATWG spec.
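On Python releases containing this fix, `urlsplit` applies that WHATWG step by removing ASCII tabs and newlines before parsing. A quick illustration (the `javascript:` payload is only an example input; older, unpatched interpreters parse it differently):

```python
from urllib.parse import urlsplit

# A newline embedded in the scheme no longer hides it from the parser:
# tab/newline characters are stripped before splitting, per WHATWG.
parts = urlsplit("java\nscript:alert(1)")
print(parts.scheme)  # on patched Pythons: javascript
print(parts.path)    # on patched Pythons: alert(1)
```

On an unpatched interpreter the same input yields an empty scheme, which is exactly the parser divergence the report is about.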
See also a related issue about sanitising newlines in other helper functions: https://bugs.python.org/issue30713 See also discussion and compatibility concerns around disallowing control characters: https://bugs.python.org/issue30458
See also bpo-43883.
I have added a PR to remove ASCII newlines and tabs from URL input, as per the WHATWG spec. However, I would still like to research more and find out whether this introduces behavior that will break existing systems. It should also be aligned with the decisions we have made in previous related bug reports. Please review.
I think there's still a flaw in the fixes implemented in 3.10 and 3.9 so far. We're closer, but probably not quite good enough yet. Why? We aren't stripping the newlines and tabs early enough. I think we need to do the stripping *right after* the `_coerce_args(url, ...)` call at the start of the function. Otherwise we would've allowed characters through into the query and fragment in some cases. I noticed this when reviewing the pending 3.8 PR, as the structure of that code made it more obvious. #25726 (review)
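The stripping step being discussed is, roughly, a small helper applied to the whole URL early in `urlsplit`. A simplified sketch of the idea (names approximate what landed in CPython; details vary by version):

```python
# Bytes the WHATWG basic URL parser says to remove from input entirely.
_UNSAFE_URL_BYTES_TO_REMOVE = ["\t", "\r", "\n"]

def _remove_unsafe_bytes_from_url(url: str) -> str:
    # Strip ASCII tab, CR, and LF anywhere in the URL, before any
    # splitting happens, so none of them can leak into scheme, netloc,
    # query, or fragment components.
    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
        url = url.replace(b, "")
    return url
```

Doing this at the top of the function, rather than per-component, is what closes the query/fragment leak Greg describes.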
Good catch, Greg. Since it's not merged already, this change will miss 3.8.10 but as a security fix will be included in 3.8.11 later in the year. The partial fix already landed in 3.9 will be released in 3.9.5 later today unless it's amended or reverted in a few hours. |
Based on Greg's review comment, I have pushed the fix for 3.9 and 3.8.
There is no need to hold off releases for these alone. If we get it merged before the release cut today, fine, otherwise, they will be in the next security fix. |
I hate to be the bearer of bad news, but I've already found this change to be breaking tests of botocore and django. In both cases, the test failure is apparently because upstream used to reject URLs after finding newlines in the split components, and now they're silently stripped away. Filed bugs: Note that I'm not saying the change should be reverted.
Leaving a thought here: I'm highlighting that we're now implementing two different standards, RFC 3986 with hints of WHATWG-URL. There are pitfalls to doing so, as a strict RFC 3986 URL parser (like the one used by urllib3/requests) will now give different results compared to Python, which opens the door to SSRF vulnerabilities.
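A sketch of the parser-differential concern: after stripping, Python produces a hostname from input that a strict RFC 3986 parser would reject outright, so two components of one system can disagree about where a request goes. The hostnames here are purely illustrative:

```python
from urllib.parse import urlsplit

# Invalid per RFC 3986 (raw newline in the authority), but on patched
# Pythons the newline is silently removed and parsing succeeds:
url = "http://allowed.example\n.attacker.example/path"
print(urlsplit(url).hostname)  # allowed.example.attacker.example
```

An allowlist check done with a strict parser would never see that hostname, while code trusting `urlsplit`'s answer would happily connect to it.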
I haven't watched that Blackhat presentation yet, but from the slides, it seems like the fix is to get all languages parsing URLs the same as the browsers. That's what @orsenthil has been doing here and plans to do in https://bugs.python.org/issue43883. Should we get a bug filed with requests/urllib3 too? Seems like a good idea if it suffers from the same problems. |
Both the Django and Botocore issues appear to be in the category of "depending on invalid data being passed through our urlsplit API so that they could look for it later". Not much sympathy. We never guaranteed we'd pass invalid data through; they're depending on an implementation detail (Hyrum's law). Invalid data causes problems for other people who don't check for it. There is no valid solution on our end within the stdlib that won't frustrate somebody. We chose to move towards safer (undoubtedly not perfect) by default. Instead of the patches as you see them, we could've raised an exception. I'm sure that would also have tripped up existing code depending on the undesirable behavior. If one wants to reject invalid data as an application/library/framework, they need a validator. The Python stdlib does not provide a URL validation API. I'm not convinced we would even want to (though that could be something bpo-43883 winds up providing) given how perilous that is to get right: whose version of right? Which set of standards? When and why? Conclusion: the web... such a mess.
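For callers who do want rejection semantics, a pre-parse check is easy to write in application code. A minimal illustrative sketch; the `reject_control_chars` helper is hypothetical, not a stdlib API:

```python
from urllib.parse import urlsplit, SplitResult

def reject_control_chars(url: str) -> SplitResult:
    # Raise instead of silently stripping: refuse any URL containing
    # ASCII control characters (including tab, CR, LF) before parsing.
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url):
        raise ValueError("URL contains ASCII control characters")
    return urlsplit(url)
```

This is the kind of opt-in strictness a hypothetical bpo-43883 validator could provide without changing `urlsplit` itself.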
Senthil, I am not sure which previous message you are referring to, but with regard to my comment about reverting the recent fixes for 3.7 and 3.6 until the reported problems are resolved, I should add that, given the recent input from downstream users about the side effects, the only way we *should* proceed with the current changes is by including more information in a What's New entry and the NEWS blurb about what the implications of these changes are for users.
There is no less intrusive fix as far as I can see. I believe we're down to either sticking with what we've done, or doing nothing. It doesn't have to be the same choice in all release branches; being more conservative with changes the older the stable branch is okay (i.e., removing this from 3.6 and 3.7 seems fine even if more recent branches do otherwise).

Based on my testing, raising an exception is more intrusive to existing tests (which we can only ever hope are representative of code) than stripping, at least as exposed by running the changes through many tens of thousands of unittest suites at work. For example: if we raise an exception, pandas.read_json() starts failing, because it uses urlsplit in hopes of extracting the scheme and comparing that to known values as its method of deciding whether something should be treated as a URL to data rather than as data itself. Pandas would need to be fixed.

That urlsplit() API use pattern is repeated in various other pieces of code: urlsplit is not expected to raise an exception, and the caller then has a conditional or two testing some parts of the urlsplit result to guess whether something should be considered a URL. Doing code inspection (pandas included), this code pretty much always goes on to pass the original url value off to some other library, be it urllib, or requests, or so on.

Consequences of that code inspection finding? With our existing character-stripping change, new data is allowed to pass through these urlsplit uses and be considered a URL, which leads to some code sending the url with embedded \r\n\t chars on to other APIs, a concern expressed a couple of times above.

Even though urlsplit isn't itself a validation API, it gets used as an early step in people's custom identification and validation attempts. So *any* change we make to it at all, in any way, breaks someone's expectations, even if they probably shouldn't have had those expectations and aren't doing wise validation.
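The urlsplit-as-guess pattern described above looks roughly like this. A simplified sketch, not pandas' actual code; the scheme allowlist is illustrative:

```python
from urllib.parse import urlsplit

def _looks_like_url(maybe_url: str) -> bool:
    # Common in-the-wild pattern: urlsplit is assumed never to raise,
    # and the scheme is compared against a known allowlist to decide
    # whether the string should be fetched as a URL.
    try:
        parts = urlsplit(maybe_url)
    except ValueError:
        return False
    return parts.scheme in {"http", "https", "ftp", "file"}
```

With stripping, a string like `"http\n://host/"` now passes a check like this and flows onward as a "URL"; with an exception instead, the same check would start raising through callers that never expected it.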
Treat this analysis as a sign that we should provide an explicit URL validator, because almost everyone is doing it some form of wrong (bpo-43883). I did wonder whether Mike's suggestion of removing the characters during processing, but leaving them in the final result (https://bugs.python.org/issue43882#msg393033), is feasible as remediation for this. My gut feeling is that it isn't: it doesn't solve the problem of preventing the bad data from going where it shouldn't. Even if we happen to parse that example differently, the unwanted characters are still retained in other places they don't belong. Fundamentally, we require people to make a different series of API calls and choices in end-user code to **explicitly not use unvalidated inputs**. Our stdlib API surface can't satisfy that today, and use of unvalidated data in the wrong places is a broad software-security antipattern.
Ned wrote:
I meant the messages from other developers who reported that this change broke certain test cases. Ned, I got a little concerned that we might be planning to revert the change.
I agree completely. I will include an additional blurb for this change in the security-fix versions. Greg wrote:
Exactly my feeling too.
I hadn't considered that. But my opinion is that it won't save much: users will have to upgrade to supported versions anyway, and it will break then, so the problem is only pushed back a little. Keeping it consistent across branches seems alright to me. It is a little additional work for everyone, but we seem to be doing it.
Ping. This issue is still blocking 3.7 and 3.6 security releases. |
Thanks, Senthil and Greg! The updates for 3.7 and 3.6 are now merged. Is there anything else that needs to be done for this issue or can it now be closed? |
Let's get equivalent What's New text into the 3.8, 3.9, and 3.10 branches before closing it.
CVE-2022-0391 has been assigned to this vulnerability. |
Looks like that CVE isn't public yet. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0391 Any chance I can get access (I originally reported this vuln.). My email is mike@free.law, if it's possible and my email is needed. Thanks! |
Message from Gaurav Kamathe who requested the CVE: "We've sent a request to MITRE to get this published and it'll be available on MITRE shortly." |
@gpshead urlsplit already raises an exception for some malformed inputs: https://github.com/python/cpython/blob/v3.9.13/Lib/urllib/parse.py#L484 -- you can try it out with This kind of eclectic approach -- taking part of one spec, part of another, ignoring invalid values, turning invalid outputs into answers without a warning -- is just going to cause more vulnerabilities. A URL with newlines in it was never valid, for browsers or anything else. Now there's just another parser that will give answers that don't match other parsers (whether it's browsers, or curl, or anything else). Better to bite the bullet and say "hey, that's not parseable". |
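For the record, one input that already triggers that existing `ValueError` path is an unmatched IPv6 bracket (this example input is mine):

```python
from urllib.parse import urlsplit

try:
    urlsplit("http://[::1/")  # '[' without a matching ']' in the netloc
except ValueError as exc:
    print(exc)  # Invalid IPv6 URL
```

So "urlsplit never raises" was already untrue before the newline change; callers relying on it were depending on which malformed inputs happened to slip through.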
If you have a concrete proposal to do something different or have found other bugs, please open a new Issue. Comments added to an old merged PR are likely to be ignored and unseen. |
Understood. I mostly wanted to correct the record on urlsplit's existing behavior, for people looking back at the git blame to figure out why urlsplit behaves the way it does. (And of course I couldn't resist tacking on a warning about parser mismatch, which I don't think got enough air time in the discussion.) |
@orsenthil Why isn't this backported to Python 2? Should I open a manual backport of this? It would be helpful for Python 2 users as well.
python2 is EOL and receives zero support here. |