zlib: OverflowError while trying to compress 2^32 bytes or more #71317
zlib fails to compress files larger than 4 GB due to some 32-bit issues. I've tested this in Python 3.4.3 and 3.5.1:

```
> python3 -c "import zlib; zlib.compress(b'a' * (2**32 - 1))"
> python3 -c "import zlib; zlib.compress(b'a' * (2**32))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
OverflowError: Size does not fit in an unsigned int
```

For Python 2.7, the issue starts at 2^31 bytes (due to signed 32-bit integers):

```
> python2 -c "import zlib; zlib.compress(b'a' * (2**31 - 1))"
> python2 -c "import zlib; zlib.compress(b'a' * (2**31))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
OverflowError: size does not fit in an int
```

Decompressing files larger than 4 GB works just fine. |
This behaviour seems to have been fixed in bpo-25626, but you can only get that fix with Python 3.5+. |
Well, I have Python 3.5.1 installed and the problem still persists. I'm not sure that bpo-25626 is the same problem: in the comments there, they say it was not an issue in Python 3.4 or 2.x, but that is clearly the case here. Another thing I've noticed: contrary to my previous statement, zlib.decompress() doesn't work on archives that are larger than 4 GB (I was misled by the fact that my 1 GB archive contained a 6 GB file). When I use gzip.compress() on more than 2^32 bytes, the same OverflowError occurs as with zlib.compress(). But when I use gzip.decompress(), I can extract archives that are larger than 4 GB. |
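(One way to sidestep the one-shot gzip.compress() limit is to stream writes through gzip.GzipFile, so no single call hands the underlying zlib layer a buffer near 2^32 bytes. This is only a sketch of the pattern, demonstrated here with small 1 MiB chunks rather than multi-gigabyte data:)

```python
# Hedged workaround sketch: write in chunks via gzip.GzipFile instead of
# one gzip.compress() call on a >4 GiB buffer. Chunk sizes here are tiny
# for demonstration; the pattern is what matters.
import gzip
import io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    for _ in range(4):                 # 4 x 1 MiB; scale up for real data
        f.write(b"a" * (1 << 20))      # each write stays far below 2**32

restored = gzip.decompress(buf.getvalue())
print(len(restored))  # 4194304
```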
This is similar to, but different from, the other bug. The other bug was only about output limits for incrementally decompressed data; Klamann's bug is about the actual size of input (and possibly also output) buffers. The gzip.compress() implementation uses zlib.compressobj.compress(), which does not accept 2 or 4 GiB input either.

The underlying zlib library uses "unsigned int" for the size of input and output chunks, so it has to be called multiple times to handle 4 GiB. In both Python 2 and 3, the one-shot compress() function only makes a single call into zlib, which explains why Python 3 cannot take 4 GiB. Python 2 uses an "int" for the input buffer size, hence the 2 GiB limit.

I tend to think of these cases as bugs, which could be fixed in 3.5 and 2.7. Sometimes others also treat adding 64-bit support as a bug fix, e.g. file.read() on Python 2 (bpo-21932). But other times it is handled as a new feature for the next Python version, e.g. os.read() was fixed in 3.5 but not 2.7 (bpo-21932), and random.getrandbits() was proposed for 3.6 only (bpo-27072). This kind of bug is apparently already fixed for crc32() and adler32() in Python 2 and 3; see bpo-10276.

This line from zlib.compress() also worries me:

```
zst.avail_out = length + length/1000 + 12 + 1; /* unsigned ints */
```

I suspect it may overflow, but I don't have enough memory to verify. You would need to compress just under 4 GiB of data that requires 5 MB or more when compressed (i.e. not all the same bytes, or maybe try level=0).

Also, the logic for expanding the output buffer in each of zlib.decompress(), compressobj.compress(), decompressobj.decompress(), compressobj.flush(), and decompressobj.flush() looks faulty when it hits UINT_MAX. I suspect it may overwrite unallocated memory or do other funny stuff, but again I don't have enough memory to verify. What happens when you decompress more than 4 GiB when the compressed input is less than 4 GiB?

Code fixes that I think could be made:
|
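(The suspected overflow in the `avail_out` computation can be checked arithmetically without 4 GiB of RAM. A minimal sketch, simulating the C expression under the assumption that `unsigned int` is 32 bits on the platform:)

```python
# Simulate `length + length/1000 + 12 + 1` in 32-bit unsigned arithmetic
# (assumes UINT_MAX == 2**32 - 1, i.e. a 32-bit unsigned int).
UINT_MAX = 2**32 - 1

def avail_out_c(length):
    # C unsigned ints wrap modulo 2**32
    return (length + length // 1000 + 12 + 1) % (UINT_MAX + 1)

length = UINT_MAX              # the largest input the binding accepts
print(avail_out_c(length))     # 4294979: only ~4 MiB of output space
                               # allocated for ~4 GiB of input
```

So with nearly 4 GiB of input, the wrapped result leaves roughly 4 MiB of output buffer, which matches the "requires 5 MB or more when compressed" trigger condition described above.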
Quick and careless scanning at night led me to a wrong result. Sorry.
With enough memory, compressing with level 0 does raise an error, while the default level doesn't. Apart from the overflow fix, does zlib have to support large data in a single operation? For example, it might be acceptable that zlib.compress does not support data beyond 4 GB, since we can split the data in the application, call zlib.compress on each part, and finally concatenate the results. |
You should be able to use a compression (or decompression) object as a workaround. But calling zlib.compress() multiple times would generate multiple separate deflate streams, which is different. I think it is reasonable for Python to handle larger data sizes for zlib. (In theory, the 4 GiB UINT_MAX limit is platform-dependent.) IMO it is a matter of writing the patch(es) and, perhaps depending on the complexity, deciding whether to apply them to 2.7 etc. or just the next version of Python 3 (risk vs reward). Alternatively (or in the meantime), I guess we could document the limitation. |
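(The compression-object workaround described above can be sketched as follows: feed the input through a single compressobj in chunks below the unsigned-int limit, so the output is still one continuous deflate stream. The helper name and chunk size are illustrative, not from the patch:)

```python
# Sketch: one compressobj, many sub-limit chunks, one deflate stream.
import zlib

def compress_large(data, chunk_size=1 << 30):
    # chunk_size must stay below 2**32 (below 2**31 on Python 2)
    zc = zlib.compressobj()
    parts = [zc.compress(data[i:i + chunk_size])
             for i in range(0, len(data), chunk_size)]
    parts.append(zc.flush())   # without flush(), trailing data is lost
    return b"".join(parts)

# small demo; real use would keep the default ~1 GiB chunk size
blob = compress_large(b"a" * 100, chunk_size=16)
print(zlib.decompress(blob) == b"a" * 100)  # True
```

Unlike concatenating multiple zlib.compress() results, the output here decompresses with a single zlib.decompress() call.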
Yes, it's the compression object, not compress(). I found out more: the overflow checking was introduced to solve the problem in bpo-8650. It seems the explicit overflow checking was added to keep compatibility with Python 2 (Python 2 raises the OverflowError during argument parsing). I support lifting the limitation, but now it can only go into the next version of Python 3. |
OK, let's see:

```
>>> import zlib
>>> zc = zlib.compressobj()
>>> c1 = zc.compress(b'a' * 2**31)
>>> c2 = zc.compress(b'a' * 2**31)
>>> c3 = zc.flush()
>>> c = c1 + c2 + c3
>>> zd = zlib.decompressobj()
>>> d1 = zd.decompress(c)
Segmentation fault (core dumped)
```

Seriously? What is wrong with this library? I've tested this using Python 3.5.0 on Linux and Python 3.5.1 on Windows. So, splitting the input into chunks of less than 2^32 bytes (less than 2^31 for Python 2.x) seems to work (except for this segfault in Python 3), but it's still annoying that you have to split and concatenate data each time and remember to call flush() or you lose data... IMHO, it would be best to fix the underlying issue. There is no reason why we should keep the 32-bit limitation.
+1 |
Klamann, thanks for the crash report. I think your decompress crash is explained by the bug in expanding past UINT_MAX that I identified above. The key is that length = 0 in zlib_Decompress_decompress_impl(), as if it wrapped around, and the return value will have been resized to zero. My suggested fix (step 7) would address this. The workaround here would be either to pass compressed data in smaller chunks (4 MB or less), so that no chunk can expand to 4 GiB, or to make use of the max_length parameter. Either way, it will make the code more complicated. If anyone wants to write a patch (or do testing) to solve any or all of the problems, I am happy to help. But it is not a high priority for me to do all the work, because I am not set up to test it easily. |
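(The max_length workaround mentioned above can be sketched like this: cap the output of each decompress() call and feed back `unconsumed_tail`, so no single call tries to produce a buffer near UINT_MAX bytes. The helper name and chunk size are illustrative only:)

```python
# Sketch: bounded-output decompression using the max_length parameter.
import zlib

def decompress_large(data, max_chunk=1 << 20):
    zd = zlib.decompressobj()
    parts = []
    while data:
        # limit each call's output; unconsumed input is saved by the object
        parts.append(zd.decompress(data, max_chunk))
        data = zd.unconsumed_tail
    parts.append(zd.flush())
    return b"".join(parts)

print(decompress_large(zlib.compress(b"a" * 500000)) == b"a" * 500000)  # True
```

The loop makes the code noticeably more complicated than a one-shot zlib.decompress() call, which is the trade-off noted above.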
I'd like to help, but it'll need some time, and I'd like to start after bpo-27164 is solved. zdict now also checks for the 4 GB limit. |
Don't know how useful this will be, but here's a crash report from Mac OS X 10.11 with Klamann's example (Python 3.5). |
Hello Martin. I've finished a patch to add 64-bit support to zlib. I think it solves the 9 problems you mentioned and lets the input and output both be larger than UINT_MAX. I hope you are willing to review it so we can move this forward. :) |
Thanks for working on this. I did a pass over your patch and left a bunch of comments. |
I'm willing to, and thanks for your work too. :) I have replied to your comments and adjusted my patch accordingly, but there are still several points I am confused about or want to discuss further. I am now uploading the adjusted patch. |
Sorry, I suddenly realized that I had misunderstood one of your comments. I have changed the patch accordingly; please use the new version. |
Uploading a new patch to fix bugs and make improvements. I'll add tests later. |
Attaching a patch to restore argument parsing support for __int__. |
Uploading the near-final patch. This one adds large buffer tests on 64-bit platforms; I have tested them on a server with enough memory. I didn't add @support.requires_resource('cpu'), since large memory tests are inherently slow. And actually, when I experimented, they were not that slow, at least not noticeably slower than the other large memory tests in test_zlib, so I left it out. |
Adding the newest version, which applies Martin's comments. |
This is the v8 patch. It does two things: [1] it applies Martin's comment about decompressobj.decompress, so that when the user passes in PY_SSIZE_T_MAX and there is enough memory, no error will be raised; [2] decompressobj.flush can now still work even if decompressobj.unconsumed_tail is larger than 4 GiB. This needs two changes: first, we don't always use Z_FINISH; second, we need to change save_unconsumed_input to support 64-bit sizes. We didn't realize this before. Corresponding tests are added. |
Make v8 patch consistent with the latest changeset. |
ping |
This is on my list of things to look at, just that I have been away and am a bit backlogged atm. |
Uploading the v9 version. It applies your last comment and catches up with the hg tip. |
It may be hard to test this change without enough memory, so I'm uploading the test results with the latest change. |
Here is a possible patch for 2.7. To fix everything on 2.7, I changed the module to accept input buffers over 2 GiB by enabling PY_SSIZE_T_CLEAN. As a consequence, the patch also includes ports of Nadeem Vawda's adler32() and crc32() changes from bpo-10276, and my tests from bpo-25626. I have only tested this on computers with less than 4 GiB of memory. I can test compression and checksums with more input by using a sparse memory map, but not decompression:

```python
from mmap import mmap
from zlib import compress

fm = open("5GiB.sparse", "w+b")
fm.truncate(5 * 2**30)
m = mmap(fm.fileno(), 0)
z = compress(m)
```
|
Hi Martin. I replied to your last comment, and your patch looks good to me, except that one test fails:

```
[1/1] test_zlib
======================================================================
Traceback (most recent call last):
  File "/usr/home/zhangxiang3/Python-2.7.12/Lib/test/test_support.py", line 1348, in wrapper
    return f(self, maxsize)
  File "/usr/home/zhangxiang3/Python-2.7.12/Lib/test/test_zlib.py", line 90, in test_big_buffer
    ChecksumTestCase.assertEqual32(self, zlib.crc32(data), 1044521549)
TypeError: unbound method assertEqual32() must be called with ChecksumTestCase instance as first argument (got ChecksumBigBufferTestCase instance instead)

Ran 54 tests in 317.008s

FAILED (errors=1)
test test_zlib failed -- Traceback (most recent call last):
  File "/usr/home/zhangxiang3/Python-2.7.12/Lib/test/test_support.py", line 1348, in wrapper
    return f(self, maxsize)
  File "/usr/home/zhangxiang3/Python-2.7.12/Lib/test/test_zlib.py", line 90, in test_big_buffer
    ChecksumTestCase.assertEqual32(self, zlib.crc32(data), 1044521549)
TypeError: unbound method assertEqual32() must be called with ChecksumTestCase instance as first argument (got ChecksumBigBufferTestCase instance instead)

1 test failed:
```
|
Here is a new patch for 2.7 that should avoid the problem with ChecksumBigBufferTestCase. I tested a hacked version of the test, so I am more confident in it now :) |
All tests pass now. :) I think it's OK. I'm also uploading the v10 version, which restores the statement to avoid the crash you mentioned in the comments.

```
[1/1] test_zlib
----------------------------------------------------------------------
OK
```
|
New changeset bd61bcd9bee8 by Martin Panter in branch '3.5': New changeset bd556f748cf8 by Martin Panter in branch 'default': |
New changeset 2192edcfea02 by Martin Panter in branch '2.7': |
Thanks Xiang for your work on this, and Klamann for the report. |
Thanks Xiang and Martin for solving this, you guys are awesome :) |
Thanks Martin too. Nobody knows better than I do how much work you have put into this. :) I could have done better at the start. :( Sorry to everyone for the noise. |
…dule Patch by Xiang Zhang. GitHub-Issue-Link: python/cpython#71317