add byte range options to openReadStream #38

Closed
max-mapper opened this issue Aug 14, 2016 · 3 comments

@max-mapper

Sort of thinking out loud here. I have a use case where I want to 'mount' a compressed archive and access bytes randomly without decompressing the whole archive up front. Basically I want to:

  1. Efficiently get the entry that matches some filename
  2. Read a byte range from that entry
  3. Repeat this many times, potentially reading the same entry multiple times

The yauzl API seems to be geared toward single-pass unzipping, which makes sense. One approach I considered is to collect all on('entry') entries up front and keep them in memory, then use the stored entry to retrieve the byte range when a request comes in. I ran into problems with that, though, and it would be much nicer to consult the central directory lazily instead of having to read it all up front.

The other issue is related to Deflate, which requires decompressing from the beginning of the entry. I guess an alternative compression type like BGZF would make arbitrary byte range lookups much faster, but it wouldn't be compatible with many implementations. However! I found another technique where you do a single pass over the entry and build an index (https://github.com/madler/zlib/blob/master/examples/zran.c). I think this would be acceptable for my use case.

Being able to implement the zran style indexing on top of yauzl would mean some API changes I think, e.g. a way to get a single entry from the CDR by name, and a lower level way to control the decompression state to support zran. Before I got too deep I wanted to sanity check this use case, does it seem doable?

@thejoshwolfe
Owner

These are great ideas. Let me try to organize them into proposals

proposal: getEntryByName():

Unfortunately, the structure of a zip file does not lend itself to random-access entry reads. You have to go through the central directory sequentially, entry by entry, to see what's in it. It's definitely possible to build a name-based lookup table on top of that by reading each entry and adding it to a hashtable (i.e. a JavaScript object), and I think that's what you want to happen. The question is whether yauzl should do that and provide an API, or whether you should do it in your own application. Performance is equivalent either way; yauzl doing it is more convenient for clients, and clients doing it is more convenient for yauzl.

yauzl's design is that it does only what's necessary to expose information in a zipfile without adding any overhead. yauzl is designed so that abstractions can be built on top of it without sacrificing memory or CPU performance. With this design in mind, it makes more sense to me for a name-indexed lookup table for entries to be outside the library, not inside the library.
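
As a rough illustration of the client-side lookup table described above, here is a minimal sketch. The helper name buildEntryIndex is hypothetical and not part of yauzl; it assumes a zipfile opened with the lazyEntries option, so that each call to readEntry() produces one 'entry' event (or 'end' when the central directory is exhausted).

```javascript
// Sketch only: build a fileName -> entry lookup table on top of yauzl.
// `zipfile` is assumed to be a yauzl ZipFile opened with { lazyEntries: true }.
// buildEntryIndex is a hypothetical helper, not a yauzl API.
function buildEntryIndex(zipfile, callback) {
  const index = new Map();
  let done = false;

  function finish(err) {
    if (done) return; // report the outcome only once
    done = true;
    callback(err, err ? undefined : index);
  }

  zipfile.on("entry", (entry) => {
    index.set(entry.fileName, entry);
    zipfile.readEntry(); // ask for the next central directory entry
  });
  zipfile.on("end", () => finish(null));
  zipfile.on("error", (err) => finish(err));

  zipfile.readEntry(); // start the walk
}

// Possible usage (the archive path is a placeholder):
// const yauzl = require("yauzl");
// yauzl.open("archive.zip", { lazyEntries: true, autoClose: false }, (err, zipfile) => {
//   if (err) throw err;
//   buildEntryIndex(zipfile, (err2, index) => {
//     if (err2) throw err2;
//     const entry = index.get("some/file.txt"); // undefined if not present
//   });
// });
```

Note that this still reads the whole central directory once; it only avoids repeating that walk for later lookups.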

proposal: add start, end, and decompress options to openReadStream:

This is a good idea, and I don't know why I didn't think of this before. Here's how the rules would work:

  • start and end can only be specified if decompress is false. end is exclusive, meaning the number of bytes read is end - start.
  • decompress defaults to true for compressed files and false for stored files. It is illegal to pass decompress: true for a stored file.
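
The two rules above can be sketched as a small option check. This is only an illustration of the proposal as worded here, with hypothetical helper names (resolveReadOptions, rangeLength); it is not yauzl's own validation code, and the published behavior may differ in details.

```javascript
// Illustrative check of the proposed option rules; not yauzl's own code.
// resolveReadOptions and rangeLength are hypothetical names.
function resolveReadOptions(entryIsCompressed, options = {}) {
  // decompress defaults to true for compressed entries, false for stored ones.
  const decompress =
    options.decompress != null ? options.decompress : entryIsCompressed;

  if (decompress && !entryIsCompressed) {
    throw new Error("decompress: true is not allowed for a stored entry");
  }

  // start and end are only allowed when no decompression takes place.
  const hasRange = options.start != null || options.end != null;
  if (hasRange && decompress) {
    throw new Error("start and end require decompress: false");
  }

  return { decompress, start: options.start, end: options.end };
}

// end is exclusive, so a range covers end - start bytes.
function rangeLength(start, end) {
  return end - start;
}

// With a real yauzl ZipFile and a compressed entry, a raw byte range
// request under these rules would look something like:
// zipfile.openReadStream(entry, { decompress: false, start: 0, end: 16 }, callback);
```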

Regarding #11, adding a decrypt or password option would fit nicely into the above rules, which is a good sign.

You mention building an index to implement random access byte range reads for compressed entries. Sounds like a cool idea! However, I think that kind of indexing should also happen in the client application, not in yauzl. I haven't looked into zran.c, but I bet that a client could implement whatever they need for that kind of thing using the above API.

proposal: read the same entry multiple times

This should already be possible. You probably want autoClose: false for this use case (see the open API). Come to think of it, I don't think any of the automated tests try to read an entry multiple times, but given the way yauzl is implemented, it should definitely work.
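
A minimal sketch of what repeated reads could look like, assuming a zipfile opened with autoClose: false so it stays usable after a stream ends. The helper readEntryBytes is hypothetical, and as noted above this pattern was not covered by the automated tests at the time, so treat it as untested.

```javascript
// Sketch only: read one entry's bytes into a Buffer.
// `zipfile` is assumed to be a yauzl ZipFile opened with { autoClose: false }.
// readEntryBytes is a hypothetical helper, not a yauzl API.
function readEntryBytes(zipfile, entry, options, callback) {
  zipfile.openReadStream(entry, options, (err, stream) => {
    if (err) return callback(err);
    const chunks = [];
    stream.on("data", (chunk) => chunks.push(chunk));
    stream.on("error", callback);
    stream.on("end", () => callback(null, Buffer.concat(chunks)));
  });
}

// Possible usage: read the same entry twice, then close explicitly,
// since autoClose: false leaves that to the caller.
// readEntryBytes(zipfile, entry, {}, (err, first) => {
//   if (err) throw err;
//   readEntryBytes(zipfile, entry, {}, (err2, second) => {
//     if (err2) throw err2;
//     zipfile.close();
//   });
// });
```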

@thejoshwolfe changed the title from "random access entry reads?" to "add byte range options to openReadStream" on Aug 16, 2016
@max-mapper
Author

@thejoshwolfe thanks for coming up with the proposals, I totally agree about all this stuff happening on top of yauzl, not inside it. I'll keep that in mind as I hack on this

@thejoshwolfe
Owner

The start and end options to openReadStream are now published in yauzl version 2.8.
