add byte range options to openReadStream #38

max-mapper · 2016-08-14T00:51:58Z

Sort of thinking out loud here. I have a use case where I want to 'mount' a compressed archive and access bytes randomly without decompressing the whole archive up front. Basically I want to:

1. Efficiently get the entry that matches some filename
2. Read a byte range from that entry
3. Repeat this many times, potentially reading the same entry multiple times

The yauzl API seems to be geared toward single-pass unzipping, which makes sense. One approach I was considering is to get all on('entry') entries up front and keep them in memory; then, when a byte range request comes in, I can use the entry to retrieve the byte range. But I ran into problems, and it would be much nicer to be able to lazily consult the central directory as opposed to having to read it all up front.

The other issue is related to Deflate, which requires decompression from the beginning of the entry. I guess an alternative compression type like BGZF would make arbitrary byte range lookups much faster, but it wouldn't be compatible with many implementations. However! I found another technique where you do a single pass over the entry and build an index (https://github.com/madler/zlib/blob/master/examples/zran.c). I think this would be acceptable for my use case.

Being able to implement zran-style indexing on top of yauzl would mean some API changes, I think, e.g. a way to get a single entry from the CDR by name, and a lower-level way to control the decompression state to support zran. Before I got too deep I wanted to sanity check this use case: does it seem doable?

thejoshwolfe · 2016-08-16T21:20:09Z

These are great ideas. Let me try to organize them into proposals.

proposal: getEntryByName():

Unfortunately, the structure of a zip file does not lend itself to random access entry reads. You actually have to go through each entry sequentially to see what's in the central directory. It's definitely possible to build a name-based lookup table on top of that by reading each entry and adding it to a hashtable (i.e. a JavaScript object), and I think that's what you want to happen. Now there's the question of whether yauzl should do that and provide an API, or whether you should do that in your own application. Performance is equivalent either way; yauzl doing it is more convenient for clients, and clients doing it is more convenient for yauzl.

yauzl's design is that it does only what's necessary to expose information in a zipfile without adding any overhead. yauzl is designed so that abstractions can be built on top of it without sacrificing memory or CPU performance. With this design in mind, it makes more sense to me for a name-indexed lookup table for entries to be outside the library, not inside the library.
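A minimal sketch of such a client-side lookup table, assuming `zipfile` is a yauzl `ZipFile` that was opened with `lazyEntries: true`. The helper name `buildEntryIndex` is hypothetical and not part of yauzl; the only yauzl pieces it relies on are `readEntry()` and the `entry`, `end`, and `error` events.

```javascript
// Hypothetical helper (not part of yauzl): walk the central directory once
// and collect every entry into a Map keyed by file name.
// Assumes `zipfile` was opened with {lazyEntries: true}.
function buildEntryIndex(zipfile, callback) {
  const index = new Map();

  zipfile.on("entry", (entry) => {
    index.set(entry.fileName, entry);
    zipfile.readEntry(); // request the next central directory record
  });
  zipfile.once("end", () => callback(null, index));
  zipfile.once("error", callback);

  zipfile.readEntry(); // start the iteration
}
```

After the callback fires, a name lookup is just `index.get("some/file.txt")`. If two entries share a file name, the later one overwrites the earlier one in this sketch. To open read streams from the indexed entries afterwards, the zipfile would also need to be opened with `autoClose: false`.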

proposal: add start, end, and decompress options to openReadStream:

This is a good idea, and I don't know why I didn't think of this before. Here's how the rules would work:

- start and end can only be specified if decompress is false. end is exclusive, meaning the number of bytes read is end - start.
- decompress defaults to true for compressed files and false for stored files. It is illegal to pass decompress: true for a stored file.
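As a rough illustration of how a client might use these options, the sketch below reads a raw byte range from an entry's file data. `readRawRange` is a hypothetical helper, not a yauzl function; it assumes `openReadStream(entry, options, callback)` accepts `start`, `end`, and `decompress` as proposed here, and it sets `decompress: false` only for compressed entries (a compression method of 0 means stored).

```javascript
// Hypothetical helper: read bytes [start, end) of an entry's raw file data.
// Assumes openReadStream supports the start/end/decompress options as proposed.
function readRawRange(zipfile, entry, start, end, callback) {
  const options = { start, end }; // end is exclusive: end - start bytes are read

  // Per the rules above, a range requires decompress to be false.
  // Stored entries (method 0) already default to false, so leave it unset there.
  if (entry.compressionMethod !== 0) options.decompress = false;

  zipfile.openReadStream(entry, options, (err, stream) => {
    if (err) return callback(err);
    const chunks = [];
    stream.on("data", (chunk) => chunks.push(chunk));
    stream.on("error", callback);
    stream.on("end", () => callback(null, Buffer.concat(chunks)));
  });
}
```

For a compressed entry the offsets refer to the compressed data, so the returned buffer holds raw Deflate bytes that the caller would have to decode itself.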

Regarding #11, adding a decrypt or password option would fit nicely into the above rules, which is a good sign.

You mention building an index to implement random access byte range reads for compressed entries. Sounds like a cool idea! However, I think that kind of indexing should also happen in the client application, not in yauzl. I haven't looked into zran.c, but I bet that a client could implement whatever they need for that kind of thing using the above API.

proposal: read the same entry multiple times:

This should already be possible. You probably want autoClose: false for this use case (see the open API). Come to think of it, I don't think any of the automated tests try to read an entry multiple times, but according to the way yauzl is implemented, it should definitely work.
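A small sketch of what reading the same entry twice could look like, assuming `zipfile` was opened with `autoClose: false` so it stays open between reads. `readEntryData` and `readTwice` are hypothetical names used only for illustration; the only yauzl call involved is `openReadStream(entry, callback)`.

```javascript
// Hypothetical helper: read one entry fully into a Buffer.
// It can be called repeatedly for the same entry as long as the zipfile
// has not been closed (hence autoClose: false when opening).
function readEntryData(zipfile, entry) {
  return new Promise((resolve, reject) => {
    zipfile.openReadStream(entry, (err, stream) => {
      if (err) return reject(err);
      const chunks = [];
      stream.on("data", (chunk) => chunks.push(chunk));
      stream.on("error", reject);
      stream.on("end", () => resolve(Buffer.concat(chunks)));
    });
  });
}

// Read the same entry two times and report whether both reads match.
async function readTwice(zipfile, entry) {
  const first = await readEntryData(zipfile, entry);
  const second = await readEntryData(zipfile, entry);
  return first.equals(second);
}
```

Since the client controls the lifetime here, it is also responsible for calling `zipfile.close()` once no more reads are needed.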

max-mapper · 2016-08-16T23:37:35Z

@thejoshwolfe thanks for coming up with the proposals. I totally agree about all this stuff happening on top of yauzl, not inside it. I'll keep that in mind as I hack on this.

thejoshwolfe · 2017-04-22T18:53:08Z

the start and end options to openReadStream are now published in yauzl version 2.8

thejoshwolfe added the enhancement label Aug 16, 2016

thejoshwolfe changed the title ~~random access entry reads?~~ add byte range options to openReadStream Aug 16, 2016

thejoshwolfe mentioned this issue Aug 17, 2016

encrypted zip files should not have undefined behavior #11

Closed

thejoshwolfe mentioned this issue Oct 18, 2016

Feature/encrypted files #39

Closed

thejoshwolfe added a commit that referenced this issue Apr 19, 2017

add decompress and decrypt options

3fdd6e4

closes #11 closes #39 see also #38

thejoshwolfe closed this as completed in 04e77fe Apr 22, 2017

thejoshwolfe mentioned this issue Apr 27, 2018

[wip] decodeFileData and canDecodeFileData #82

Open
