
ZIPs in ZIP #89

Closed
Torcsi opened this issue Jul 19, 2018 · 5 comments

@Torcsi

Torcsi commented Jul 19, 2018

Question:

Assume large (internal) ZIP files stored in a huge (main) ZIP file, possibly nested more than two levels deep.

How can one process all files within the internal ZIPs without heavy, repeated buffering of the inner files?

Tried:

  1. Using jszip-sync (10x slower)
    The main ZIP is read into buffers, and the buffers are processed synchronously with jszip-sync, which is known to be very slow.

Is there really no way to process buffers with yauzl synchronously?

  2. Local buffering (too many buffers; stable in my case at 50 files × 10 MB)

When reading the main ZIP file, the large inner ZIP files have to be read into a buffer (preferred) or a temp file (to be avoided), since yauzl cannot read from streams.

When the inner ZIP files are then processed from buffers with yauzl, the event mechanism causes the inner files' events (the inner zipFile.on("entry") events and, likewise, the readStream delivered by openReadStream's callback) to arrive interleaved with the outer zipFile.on("entry") events.

Can zipFile not be told somehow NOT to produce the next entry until explicitly allowed, i.e. pause/resume?

  3. Entry by entry (even more jams)

Then I thought I would read the entries of the main ZIP file into an array (since there is no other way to access the central directory), then, without closing the ZIP, call openReadStream for each entry in the entries array and read the streams as needed. Then, not surprisingly, all the openReadStreams of the main ZIP file fired nearly at once.

Is there no way to access the central directory that does not require an extra step and extra events to enumerate the entries?

Can zipfile.openReadStream be considered async, so that one could await completion of all the inner reads?
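For reference, openReadStream can indeed be wrapped in a Promise and awaited one entry at a time; a minimal sketch (entryToBuffer is an illustrative name, not part of yauzl):

```javascript
// Wrap openReadStream's callback in a Promise that resolves with the
// entry's full contents as a Buffer.
function entryToBuffer(zipfile, entry) {
  return new Promise((resolve, reject) => {
    zipfile.openReadStream(entry, (err, readStream) => {
      if (err) return reject(err);
      const chunks = [];
      readStream.on('data', (c) => chunks.push(c));
      readStream.on('error', reject);
      readStream.on('end', () => resolve(Buffer.concat(chunks)));
    });
  });
}

// Usage sketch (assuming the entries were first collected into an array):
// for (const entry of entries) {
//   const buf = await entryToBuffer(zipfile, entry); // strictly one at a time
// }
```

Awaiting each call before issuing the next avoids all read streams firing at once, though as the thread notes below, lazyEntries is the cleaner fix.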

Not tried:

  1. Promisifying, async/await (failed)
    It seemed nearly impossible; the processing of the inner ZIP files consisted of sync calls anyway, but everything becomes async immediately on readStream.on("readable").

  2. Using e.g. bottleneck (last chance)

Opening with autoClose: false
Loading the full central directory of the main ZIP into an array (CDClone)
Creating a bottleneck limiter, starting processing from the 0th element of CDClone

  • each workstep receives the zipfile, CDClone, and an index of the entry to process
  • it processes the actual entry and, when that is "surely finished", schedules a new workstep for the next entry OR closes the ZIP file

Here, processing the actual entry of the main ZIP file means:

  • loading the data of an internal ZIP file into a buffer
  • creating another workstep with the buffer
  • that workstep loads the CDClone of the internal ZIP and processes it element by element, just like above
  • after closing the internal ZIP file, the actual entry of the main ZIP file is considered surely finished.

This remains asynchronous, but at what a price...
It seems too vulnerable to any kind of error.
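Stripped of the bottleneck dependency, the workstep chain above boils down to: process one entry, and schedule the next step only when the current one reports completion. A stdlib-only sketch (runWorksteps and processEntry are illustrative names, not from yauzl):

```javascript
// Each workstep gets the zipfile, the CDClone array, and an index; the
// next workstep runs only after the current entry is "surely finished",
// and the zipfile is closed after the last one.
function runWorksteps(zipfile, cdClone, processEntry, done) {
  function step(index) {
    if (index >= cdClone.length) {
      zipfile.close();
      return done(null);
    }
    processEntry(zipfile, cdClone[index], (err) => {
      if (err) return done(err);
      step(index + 1); // only now move on to the next entry
    });
  }
  step(0);
}
```

For recursive use, processEntry would buffer the inner ZIP and call runWorksteps again on the inner CDClone before invoking its callback.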

@thejoshwolfe
Owner

In order for this to work efficiently, the inner zip files must not be compressed. If they are compressed, then you have to buffer the inner zipfiles somewhere; I would recommend on disk if that were the case. A zipfile must be read starting from the end, and a compressed file must be read starting from the beginning. There is no solution to this problem that doesn't involve buffering the entire file somewhere.

Assuming the inner zipfiles are not compressed (instead they are "stored"), then you can implement a RandomAccessReader for each entry in the greater zipfile and call yauzl.fromRandomAccessReader() with the RandomAccessReader. To implement the RandomAccessReader's method _readStreamForRange(), use the start and end options to openReadStream().

Can zipFile not be told somehow NOT to produce the next entry until explicitly allowed, i.e. pause/resume?

Use lazyEntries: true. This has been the recommended usage anyway ever since the option was added in yauzl 2.4.

That should work for you. I've never tried doing this myself, but all the tools exist to make this work. Let me know if you get it working or if you run into another problem.

@thejoshwolfe
Owner

Also, it's worth noting that recursive unzipping is a vector for zip bomb attacks, so be sure to limit the system resources devoted to reading zip files provided by untrusted users with this strategy.
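One concrete form of that limit (sketch; the 100 MB cap is an arbitrary choice): check entry.uncompressedSize, which yauzl reads from the central directory, before allocating any buffer.

```javascript
// Reject suspiciously large entries before buffering them. Note that
// uncompressedSize comes from attacker-controlled zip metadata, so for
// untrusted input also cap the total bytes actually read from streams.
const MAX_ENTRY_SIZE = 100 * 1024 * 1024; // arbitrary 100 MB cap

function entrySizeOk(entry) {
  return entry.uncompressedSize <= MAX_ENTRY_SIZE;
}
```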

@Torcsi
Author

Torcsi commented Jul 23, 2018

Thanks. I have come to a solution using lazyEntries, calling readEntry in the callback after processing the internal ZIP.

I use buffers -- my files come from a controlled source anyway, and ZIP bomb attacks can be limited by filtering on uncompressedSize before allocating buffers.

Careful reuse of buffers also improved things a bit.
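The lazyEntries pattern described here, sketched with stdlib only (processSequentially and handleEntry are illustrative names): with lazyEntries: true, yauzl emits no "entry" event until readEntry() is called, so calling readEntry() from the completion callback gives strict one-at-a-time processing.

```javascript
function processSequentially(zipfile, handleEntry, done) {
  zipfile.on('entry', (entry) => {
    // handleEntry must call back only when this entry -- including any
    // recursion into an inner zip -- is completely finished.
    handleEntry(entry, (err) => {
      if (err) return done(err);
      zipfile.readEntry(); // only now request the next entry
    });
  });
  zipfile.on('end', () => done(null));
  zipfile.on('error', done);
  zipfile.readEntry(); // request the first entry
}

// Usage sketch (file name assumed):
// yauzl.open('outer.zip', { lazyEntries: true }, (err, zipfile) => {
//   if (err) throw err;
//   processSequentially(zipfile, handleInnerZip, (err) => { /* ... */ });
// });
```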

FYI:
yauzl without processing inner ZIPs: 1.72 sec
unzipper from streams without processing inner ZIPs: 57 sec (!)
yauzl from file is 33x faster than unzipper from stream

Whereas inside the ZIP:
yauzl with recursive yauzl on the buffer: 24.34 sec
yauzl with a call to jszip-sync on the buffer: 26.64 sec
yauzl from buffer is 10% faster than jszip-sync from buffer

@Torcsi Torcsi closed this as completed Jul 23, 2018
@brucehappy

@thejoshwolfe I'm interested in doing exactly what you describe here: creating a RandomAccessReader to represent a stored (not compressed) entry in one ZipFile and using it to create another ZipFile instance. You say:

To implement the RandomAccessReader's method _readStreamForRange(), use the start and end options to openReadStream().

However, openReadStream() uses a callback as the means of returning the stream, whereas _readStreamForRange() expects the stream to be returned from the function directly. I need to support Node 0.10, so bringing in Promise/await (and thereby requiring Node >= 7.6) is out. It looks feasible to refactor the callback to readAndAssertNoEof in ZipFile.openReadStream, and the default implementation of RandomAccessReader.read, so that they use callbacks via a new callback-enabled createReadStream variant. Thoughts on this?

brucehappy added a commit to brucehappy/yauzl that referenced this issue Dec 12, 2018
…pport create a reader implementation for stored entries within a zip file, thereby supporting stored zips within zips. Add a test for the new classes based on the existing range-test.

Related to thejoshwolfe#89
@thejoshwolfe
Owner

@brucehappy yikes! You found a nasty problem with this API. Looks like I got burned by not keeping my function colors straight. 🤦‍♂️

I'll take a look at your PR.
