Reduce memory usage of urllib.unquote and unquote_to_bytes #88500
'eng' claimed in the original title that "urllib.parse.parse_qsl cannot parse large data". On the original PR, the problem was said to occur with 6-7 million bytes. That claim should be backed up by a generated example that fails with the original code and succeeds with the new code. Claims of 'faster' also need some examples. Original PRs must nearly all propose merging a branch created from main into main. Performance enhancements are often not backported.
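For context, an input in that size range is easy to generate synthetically. The sketch below is illustrative only; the key/value shape of the data is an assumption, and it is not the reproduction the reviewer asked the OP to provide:

```python
from urllib.parse import parse_qsl

# Build a query string of roughly 6-7 million characters whose values are
# heavily percent-encoded, the kind of input the original report described.
pair = "key=%C5%81%20value%FE&"
query = pair * 300_000          # ~6.6 million characters of text
pairs = parse_qsl(query)
print(len(query), len(pairs))
```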
fwiw this sort of thing may be reasonable to backport to 3.9 as it is more than just a performance enhancement but also a resource consumption bug and should result in no behavior change.
The PR was closed due to technicalities (pointing to the wrong branch, CLA) and the OP didn't follow up. Unless someone objects I will close this issue as well.
I created a new PR and included fixing a similar legacy design issue in … If someone wanted to consider this a security issue it could be backported. It is at most a fixed constant factor (roughly …).
`urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM. This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or lack much Unicode or % escaping. But the functions are already quite fast anyway, so this is not a big deal. The slowdown scales consistently linearly with input size, as expected.

Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile. Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
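For readers unfamiliar with the approach, here is a minimal sketch of the expanding-`bytearray`-plus-generator idea described above. It is illustrative only, assuming the usual "keep a literal `%` on malformed escapes" behavior; the names are made up and it is not the actual CPython patch:

```python
from urllib.parse import unquote_to_bytes  # stdlib version, for comparison

# Precomputed map from two hex digits (as bytes) to the decoded byte.
_hexdig = "0123456789ABCDEFabcdef"
_hextobyte = {(a + b).encode(): bytes.fromhex(a + b)
              for a in _hexdig for b in _hexdig}

def _unquote_chunks(data: bytes):
    """Yield decoded pieces of `data` lazily instead of materializing a
    list of split() pieces up front."""
    start = 0
    while True:
        pct = data.find(b"%", start)
        if pct == -1:
            yield data[start:]
            return
        yield data[start:pct]
        byte = _hextobyte.get(data[pct + 1:pct + 3])
        if byte is not None:
            yield byte
            start = pct + 3
        else:
            yield b"%"              # malformed escape: keep the literal '%'
            start = pct + 1

def unquote_to_bytes_sketch(data) -> bytes:
    if isinstance(data, str):
        data = data.encode("utf-8")
    # Accumulate into a single expanding bytearray so that at most one
    # small chunk plus the output buffer are alive at any time.
    out = bytearray()
    for chunk in _unquote_chunks(data):
        out += chunk
    return bytes(out)

# Quick sanity check against the stdlib on one of the inputs mentioned above.
mess = "\u0141%%%20a%fe" * 1000
assert unquote_to_bytes_sketch(mess) == unquote_to_bytes(mess)
```

The point is that only one growing output buffer and one small chunk are alive at a time, instead of a whole list of `split(b'%')` pieces.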
#96763