Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zlib: support concatenated archives #5120

Closed

Conversation

kthelgason
Copy link
Contributor

This PR adds support for gunzipping archives that consists of multiple concatenated members.
This has not been supported by node's implementation so far but is a part of the spec rfc1952.

The problem was raised in #4306, and there is more detailed discussion on this feature in the issue.

I also refactored some of lib/zlib.js but I will submit that in a separate PR later.

@kthelgason kthelgason force-pushed the zlib-concatenated-archives branch 2 times, most recently from 267b15e to 017e172 Compare February 6, 2016 15:26
@Fishrock123 Fishrock123 added zlib Issues and PRs related to the zlib subsystem. semver-minor PRs that contain new features and should be released in the next minor version. labels Feb 6, 2016
@Fishrock123
Copy link
Contributor

cc @indutny or @bnoordhuis

@@ -254,11 +253,22 @@ class ZCtx : public AsyncWrap {
ctx->err_ = Z_NEED_DICT;
}
}
if (ctx->strm_.avail_in >= 2 && ctx->mode_ == GUNZIP) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic constants need an explaining comment and should preferably be true, well-named constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought this was a little silly. It should probably check to make sure that avail_in is at least equal to the gzip header size. I'll fix this.

@kthelgason
Copy link
Contributor Author

Thanks for the review @bnoordhuis. I've addressed some individual comments and will be updating this PR to fix all the issues mentioned.

@bnoordhuis
Copy link
Member

Can you post a comment when you're done? GH doesn't send notifications for new or updated commits.

@kthelgason
Copy link
Contributor Author

I've updated this in line with the comments from @bnoordhuis.

// member in the same archive, or just trailing garbage.
// Check the header to find out.
if (*ctx->strm_.next_in != GZIP_HEADER_ID1 ||
*(ctx->strm_.next_in + 1) != GZIP_HEADER_ID2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little more idiomatic to write it as ctx->strm_.next_in[1] != GZIP_HEADER_ID2. Suggestion: use indexed notation for the first byte too, for consistency; that's what we do elsewhere.

@bnoordhuis
Copy link
Member

LGTM with a style nit. Perhaps it would be good to add a test for an archive that looks like an concatenated archive but isn't, i.e., with trailing junk that just happens to start with the right magic bytes.

CI: https://ci.nodejs.org/job/node-test-pull-request/1580/ (currently private)

@kthelgason
Copy link
Contributor Author

I thought about including such a test, but it's pretty unclear to me what the expected behaviour is. Technically, trailing garbage in a gzipped file is forbidden by the spec. It seems however that in practice some files are padded with '\0', and I've accounted for this in my implementation. Currently, if a file happens to contain trailing garbage that happens to begin with the gzip header id bytes it will not decompress correctly and an error will be thrown.
I felt this was ok as the archive is technically invalid.

EDIT: I just checked if gzip(1) supports this and it throws an error in this situation. Seeing as it's cited by the RFC as a reference implementation I think we should adopt that behaviour as well.

@kthelgason
Copy link
Contributor Author

Addressed the style nit and force pushed.

@Fishrock123
Copy link
Contributor

@kthelgason About the trailing \0 thing -- does this change the behavior of that from the current one?

@kthelgason
Copy link
Contributor Author

@Fishrock123 no, both implementations simply ignore the junk bytes at the end.

@Fishrock123
Copy link
Contributor

@kthelgason Seems fine to me then; let's discuss the trailing bytes thing in a different issue if it is important.


assert.equal(zlib.gunzipSync(data).toString(), 'abcdef');

zlib.gunzip(data, (_, result) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't ignore the error. You can do assert.ifError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to wrap this kind of callback in common.mustCall to make sure the assertions are really executed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why common.mustCall exists! :)

@kthelgason
Copy link
Contributor Author

@thefourtheye @targos @Fishrock123 fixed.

Regarding the trailing garbage bytes, I think opening another issue to discuss that (if necessary) would be a good idea.

@jasnell
Copy link
Member

jasnell commented Feb 7, 2016

LGTM

@bnoordhuis
Copy link
Member

LGTM. Is there much to discuss w.r.t. trailing garbage? Emitting an error is the obvious thing to do, right?

@kthelgason
Copy link
Contributor Author

@bnoordhuis to me it seems that way yes. Esp. given that gzip(1) handles it that way.

@bnoordhuis
Copy link
Member

Exactly. I'd add the test case to this PR. Might as well land it with full coverage from the start.

@kthelgason
Copy link
Contributor Author

Thanks for the comment @rvagg and for the review @bnoordhuis. I've squashed the commits as recommended, and CI looks good (apart from the ARM failure but as pointed out by @jbergstroem I can't see that it's related to this at all).

@bnoordhuis
Copy link
Member

Thanks Kári, landed in f380db2.

@bnoordhuis bnoordhuis closed this Mar 15, 2016
bnoordhuis pushed a commit that referenced this pull request Mar 15, 2016
According to the spec gzipped archives can contain more than one
compressed member. Previously Node's gzip implementation would only
unzip the first member and throw away the rest of the compressed data.
Issue #4306 is an example of this occurring in daily use.

Fixes: #4306
PR-URL: #5120
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
@kthelgason kthelgason deleted the zlib-concatenated-archives branch March 15, 2016 17:12
evanlucas pushed a commit that referenced this pull request Mar 15, 2016
According to the spec gzipped archives can contain more than one
compressed member. Previously Node's gzip implementation would only
unzip the first member and throw away the rest of the compressed data.
Issue #4306 is an example of this occurring in daily use.

Fixes: #4306
PR-URL: #5120
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
evanlucas added a commit that referenced this pull request Mar 15, 2016
Notable changes:

* **governance**: The following members have been added as collaborators:
  - Andreas Madsen (@AndreasMadsen)
  - Benjamin Gruenbaum (@benjamingr)
  - Claudio Rodriguez (@claudiorodriguez)
  - Glen Keane (@thekemkid)
  - Jeremy Whitlock (@whitlockjc)
  - Matt Loring (@matthewloring)
  - Phillip Johnsen (@phillipj)
* **lib**: copy arguments object instead of leaking it (Nathan Woltman) #4361
* **zlib**: add support for concatenated members (Kári Tristan Helgason) #5120

PR-URL: #5702
@evanlucas evanlucas mentioned this pull request Mar 15, 2016
rvagg pushed a commit that referenced this pull request Mar 16, 2016
According to the spec gzipped archives can contain more than one
compressed member. Previously Node's gzip implementation would only
unzip the first member and throw away the rest of the compressed data.
Issue #4306 is an example of this occurring in daily use.

Fixes: #4306
PR-URL: #5120
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
evanlucas added a commit that referenced this pull request Mar 16, 2016
Notable changes:

* **governance**: The following members have been added as
collaborators:
  - Andreas Madsen (@AndreasMadsen)
  - Benjamin Gruenbaum (@benjamingr)
  - Claudio Rodriguez (@claudiorodriguez)
  - Glen Keane (@thekemkid)
  - Jeremy Whitlock (@whitlockjc)
  - Matt Loring (@matthewloring)
  - Phillip Johnsen (@phillipj)
* **lib**: copy arguments object instead of leaking it (Nathan Woltman)
#4361
* **zlib**: add support for concatenated members (Kári Tristan
Helgason) #5120

PR-URL: #5702
evanlucas added a commit that referenced this pull request Mar 16, 2016
Notable changes:

* **governance**: The following members have been added as
collaborators:
  - Andreas Madsen (@AndreasMadsen)
  - Benjamin Gruenbaum (@benjamingr)
  - Claudio Rodriguez (@claudiorodriguez)
  - Glen Keane (@thekemkid)
  - Jeremy Whitlock (@whitlockjc)
  - Matt Loring (@matthewloring)
  - Phillip Johnsen (@phillipj)
* **lib**: copy arguments object instead of leaking it (Nathan Woltman)
#4361
* **src**: allow combination of -i and -e cli flags (Rich Trott)
#5655
* **zlib**: add support for concatenated members (Kári Tristan
Helgason) #5120

PR-URL: #5702
evanlucas added a commit that referenced this pull request Mar 16, 2016
Notable changes:

* **contextify**: Fixed a memory consumption issue related to heavy use
of `vm.createContext` and `vm.runInNewContext`. (Ali Ijaz Sheikh)
#5392
* **governance**: The following members have been added as
collaborators:
  - Andreas Madsen (@AndreasMadsen)
  - Benjamin Gruenbaum (@benjamingr)
  - Claudio Rodriguez (@claudiorodriguez)
  - Glen Keane (@thekemkid)
  - Jeremy Whitlock (@whitlockjc)
  - Matt Loring (@matthewloring)
  - Phillip Johnsen (@phillipj)
* **lib**: copy arguments object instead of leaking it (Nathan Woltman)
#4361
* **src**: allow combination of -i and -e cli flags (Rich Trott)
#5655
* **v8**: backport fb4ccae from v8 upstream (Vladimir Krivosheev) #4231
  -  breakout events from v8 to offer better support for external
     debuggers
* **zlib**: add support for concatenated members (Kári Tristan
Helgason) #5120

PR-URL: #5702
evanlucas added a commit that referenced this pull request Mar 16, 2016
Notable changes:

* **contextify**: Fixed a memory consumption issue related to heavy use
of `vm.createContext` and `vm.runInNewContext`. (Ali Ijaz Sheikh)
#5392
* **governance**: The following members have been added as
collaborators:
  - Andreas Madsen (@AndreasMadsen)
  - Benjamin Gruenbaum (@benjamingr)
  - Claudio Rodriguez (@claudiorodriguez)
  - Glen Keane (@thekemkid)
  - Jeremy Whitlock (@whitlockjc)
  - Matt Loring (@matthewloring)
  - Phillip Johnsen (@phillipj)
* **lib**: copy arguments object instead of leaking it (Nathan Woltman)
#4361
* **src**: allow combination of -i and -e cli flags (Rich Trott)
#5655
* **v8**: backport fb4ccae from v8 upstream (Vladimir Krivosheev) #4231
  -  breakout events from v8 to offer better support for external
     debuggers
* **zlib**: add support for concatenated members (Kári Tristan
Helgason) #5120

PR-URL: #5702
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor PRs that contain new features and should be released in the next minor version. zlib Issues and PRs related to the zlib subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants