Archive encrypted with Zip Crypto algorithm (weak encrypt) is extremely slow under stream unzip #91

Closed
auto-Dog opened this issue Jul 30, 2024 · 8 comments

Comments

@auto-Dog

I use this tool to stream-unzip ZIP archives. Some of them are encrypted with the ZipCrypto algorithm, which I believe triggers the weak decrypter in stream_unzip.py:

def decrypt_weak_decompress(chunks, decompress, is_done, num_unused):
    key_0 = 305419896
    key_1 = 591751049
    key_2 = 878082192
    crc32 = zlib.crc32
    bytes_c = bytes
    def update_keys(byte):
        nonlocal key_0, key_1, key_2
        key_0 = ~crc32(bytes_c((byte,)), ~key_0) & 0xFFFFFFFF
        key_1 = (key_1 + (key_0 & 0xFF)) & 0xFFFFFFFF

However, this kind of per-byte Python loop runs extremely slowly (approximately 5MB/minute). Is there any way to speed it up?

@michalc
Member

michalc commented Aug 24, 2024

Yes, there is probably a way to speed it up.

But 5MB/minute is slower than I would expect even for the code as it is right now. Do you have a short snippet of code that I could run to show it is that slow?

@michalc
Member

michalc commented Aug 24, 2024

Ah, here's an example zipping a 100MB file of pseudo-random data, so pretty much the worst case in terms of compression:

import datetime
import subprocess
import random

from stream_unzip import stream_unzip

# Always deal with 64 KiB (65536-byte) chunks
max_chunk = 65536

# Create 100MB file of pseudo-random data
print('Creating uncompressed file...')
total = 100_000_000
remaining = total
random.seed(0)
with open('random.txt', 'wb') as f:
    while remaining:
        chunk_size = min(max_chunk, remaining)
        f.write(random.randbytes(chunk_size))
        remaining -= chunk_size
print('Done')

# ZIP the file
print('Creating password-protected ZIP...')
subprocess.check_output(['zip', '-P', 'mypassword', 'random.zip', 'random.txt'])
print('Done')

# UnZIP
print('Unzipping with stream_unzip')
start = datetime.datetime.now()
with open('random.zip', 'rb') as f:
    zipped_chunks = iter(lambda: f.read(max_chunk), b'')
    for file_name, size, chunks in stream_unzip(zipped_chunks, password=b'mypassword'):
        for _ in chunks:
            pass
end = datetime.datetime.now()
taken = end - start
print('Done:', taken)

For me, the unzipping takes just under a minute, so it's more like 100MB/min. Not the speediest thing in the world, but more than an order of magnitude faster than 5MB/min. (And I'm just on a fairly regular laptop I think?)

So it would be good to see an example where it's 5MB/min.
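To compare numbers like these on equal terms, a small timing helper can turn a byte count and a duration into MB/min. This is only a sketch: `drain_and_time` and `mb_per_minute` are hypothetical helper names, not part of stream-unzip.

```python
import time


def drain_and_time(chunks):
    """Consume an iterable of bytes chunks and return (total_bytes, seconds)."""
    start = time.perf_counter()
    total = 0
    for chunk in chunks:
        total += len(chunk)
    return total, time.perf_counter() - start


def mb_per_minute(total_bytes, seconds):
    """Throughput in decimal megabytes (1 MB = 1,000,000 bytes) per minute."""
    return (total_bytes / 1_000_000) / (seconds / 60)


# Demonstration on in-memory chunks
total, seconds = drain_and_time(iter([b'a' * 1000, b'b' * 500]))
print(total)

# 100MB in 55 seconds is roughly 109MB/min
print(round(mb_per_minute(100_000_000, 55)))
```

In the benchmark above, the `chunks` iterable of each unzipped member could be passed to `drain_and_time` instead of the bare `for _ in chunks: pass` loop.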

@michalc
Member

michalc commented Aug 24, 2024

Comparing with Python's zipfile, zipfile is about 10% faster than stream_unzip for me:

import zipfile

print('Unzipping with zipfile')
start = datetime.datetime.now()
with zipfile.ZipFile('random.zip') as myzip:
    myzip.setpassword(b'mypassword')
    with myzip.open('random.txt') as f:
        # Read in max_chunk-sized pieces, matching the stream_unzip run above
        unzipped_chunks = iter(lambda: f.read(max_chunk), b'')
        for _ in unzipped_chunks:
            pass
end = datetime.datetime.now()
taken = end - start
print('Done:', taken)

So while stream_unzip could probably be made faster (if zipfile can do it, why not stream_unzip?), I suspect the 5MB/min pain comes from something else.

michalc added a commit that referenced this issue Aug 24, 2024
Inspired by the report at #91,
found some performance improvements for the ZipCrypto function,
decrypt_weak_decompress. From some light testing of a password protected 100MB
file of pseudo-random data on my local filesystem, it seems to reduce
decryption+decompression time from ~55 seconds to ~46 seconds, which also makes
it a bit faster than Python's zipfile, at least in this circumstance.
@michalc
Member

michalc commented Aug 24, 2024

Found a few ways to improve stream_unzip's ZipCrypto decrypting: #92, changing it from ~10% slower than Python's zipfile, to ~10% faster, at least for my tests

michalc added a commit that referenced this issue Aug 24, 2024
Inspired by the report at #91,
found some performance improvements for the ZipCrypto function,
decrypt_weak_decompress. From some light testing of a password protected 100MB
file of pseudo-random data on my local filesystem, it seems to reduce
decryption+decompression time from ~55 seconds to ~45 seconds, which also makes
it a bit faster than Python's zipfile, at least in this circumstance.
@michalc
Member

michalc commented Aug 24, 2024

#92 is now released in v0.0.92

@michalc
Member

michalc commented Aug 25, 2024

One thing crosses my mind... could the ZipCrypto thing be a red herring? Could the 5MB/min in fact be due to the file using Deflate64, which is known to be incredibly slow in stream-unzip (#82)?
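One way to check this for a given archive is to read its central directory with Python's zipfile module and list each member's compression method and encryption flag. The sketch below assumes the archive is available as a seekable file; `describe_entries` and `METHOD_NAMES` are hypothetical names. Method 9 is Deflate64 in the ZIP specification, and bit 0 of the general purpose flags marks an encrypted member.

```python
import io
import zipfile

# Compression method IDs from the ZIP specification (subset)
METHOD_NAMES = {
    0: 'stored',
    8: 'deflate',
    9: 'deflate64',
    12: 'bzip2',
    14: 'lzma',
    99: 'aes (WinZip AE-x)',
}


def describe_entries(zip_file):
    """Return (name, compression method, is_encrypted) for each member."""
    with zipfile.ZipFile(zip_file) as archive:
        return [
            (
                info.filename,
                METHOD_NAMES.get(info.compress_type, f'method {info.compress_type}'),
                bool(info.flag_bits & 0x1),
            )
            for info in archive.infolist()
        ]


# Demonstration on a small in-memory archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('stored.txt', b'hello', compress_type=zipfile.ZIP_STORED)
    archive.writestr('deflated.txt', b'hello' * 100, compress_type=zipfile.ZIP_DEFLATED)

buffer.seek(0)
for row in describe_entries(buffer):
    print(row)
```

Listing the members only reads metadata, so it should work even for members that zipfile itself cannot decompress.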

@auto-Dog
Author

Thanks for the debugging and the improvement! Your change does make everything faster.

Back to my question: I found that the slow unzipping is caused by a combination of factors:

  • ZIP archive with many sub-folders and sub-files
  • ZIP archive using ZIP Crypto algorithm
  • A wrongly set chunk_size parameter in stream_unzip. I thought it could match the chunk size used for file_chunks, so I set it to >1M as my default value, which made it very, very slow:
for file_path_name, file_size, unzipped_chunks in stream_unzip(file_chunks, password=password, chunk_size=my_chunk_size):
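The chunk_size pitfall in the last bullet can be sketched as follows. This assumes, as the report above suggests, that the size used to read the archive from disk and stream_unzip's own chunk_size are independent settings, so the read size can be large while chunk_size is simply left at the library default. `read_chunks` and `unzip_all` are hypothetical helper names.

```python
def read_chunks(path, read_size=1024 * 1024):
    """Yield raw bytes of the ZIP file on disk, read_size bytes at a time."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(read_size)
            if not chunk:
                return
            yield chunk


def unzip_all(path, password):
    """Stream-unzip every member and return the total uncompressed byte count."""
    # Imported lazily so read_chunks is usable without stream-unzip installed
    from stream_unzip import stream_unzip

    total = 0
    # chunk_size is deliberately not passed, so the library default applies
    for file_name, file_size, unzipped_chunks in stream_unzip(
        read_chunks(path), password=password
    ):
        for chunk in unzipped_chunks:
            total += len(chunk)
    return total
```

A call such as `unzip_all('random.zip', b'mypassword')` would then read the archive in 1 MiB pieces without touching chunk_size.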

@auto-Dog auto-Dog changed the title Archive encrypted with Zip Crypto algorithm (weak encrypt) is extreamly slow under stream unzip Archive encrypted with Zip Crypto algorithm (weak encrypt) is extremely slow under stream unzip Oct 7, 2024
@michalc
Member

michalc commented Oct 29, 2024

If anyone stumbles on this: decryption of ZipCrypto should now (as of v0.0.97) be much faster in stream-unzip (via Rust-based decrypting). From some light testing, it should be about 10 times as fast as Python’s zipfile module.
