Compatibility with bzip2 when decompressing multiple bzip2 streams/chunks in a single file #1

Closed
reidmorrison opened this issue Oct 1, 2020 · 6 comments

@reidmorrison

pbzip2 uses multiple threads to create compressed bzip2 files in parallel. To do so, it breaks the input up into chunks and compresses each chunk separately, so the output file contains several bzip2 streams back to back. The regular bzip2 and bunzip2 command line tools handle this scenario: when one stream ends, they automatically decompress the next chunk as the next bzip2 stream.

For more details on the data format, please see item 4. PBZIP2 DATA FORMAT at https://github.com/ruanhuabin/pbzip2
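To check whether a given file really contains more than one stream, a rough heuristic is to scan for bzip2 stream headers. The sketch below assumes that every stream starts on a byte boundary with `BZh`, a block-size digit `1` to `9`, and then either the block magic `0x314159265359` (ASCII `1AY&SY`) or, for an empty stream, the end-of-stream magic `0x177245385090`. The helper name `bzip2_stream_header_offsets` is hypothetical, and compressed data could in principle contain the same bytes by chance, so treat the result as a hint rather than proof.

```ruby
# Matches a candidate bzip2 stream header: "BZh", a level digit, then
# either the block magic or the end-of-stream magic (empty stream).
STREAM_HEADER = /BZh[1-9](?:1AY&SY|\x17\x72\x45\x38\x50\x90)/n

# Hypothetical helper: returns the byte offsets of every candidate
# stream header found in +data+. Heuristic only, see the note above.
def bzip2_stream_header_offsets(data)
  bytes = data.b # binary copy, so character offsets are byte offsets
  offsets = []
  pos = 0
  while (pos = bytes.index(STREAM_HEADER, pos))
    offsets << pos
    pos += 1
  end
  offsets
end
```

For example, `bzip2_stream_header_offsets(File.binread("multiple.bz2")).size` greater than 1 suggests a multi-stream file.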

Thank you for publishing your FFI interface into bzip2, it is now the required implementation for dealing with bzip2 in IOStreams.

@andrew-aladev

andrew-aladev commented Oct 12, 2020

The pbzip2 source code used for the Debian package is here, and the newer source code is here. lbzip2 can also be found here.

All of these repositories are abandoned and none of them includes a library with a clear API. I wouldn't recommend that anyone use pbzip2 as it is; the project was frozen in a proof-of-concept state. If you have free time, please fork lbzip2 (it includes good tests) and provide a convenient library. Unfortunately the lbzip2 license is GPL v3.0, which may not be compatible with everyone's requirements. So if you have free time, please consider writing a new library from scratch without using GPL code.

Personally I will use regular bzip2; for me, reliability is much more important than performance.

@reidmorrison
Author

reidmorrison commented Oct 16, 2020

We are not actually using pbzip2. However, it shows how to perform parallel compression of bzip2 files.

bzip2 supports decompressing multiple bzip2 streams embedded in a single file. Using this technique, we use bzip2-ffi across multiple machines to compress individual chunks, then join them together into a single file. bzip2 is able to decompress that file, but bzip2-ffi is only able to read the first chunk/stream.

This is because bzip2-ffi does not expect a single file to contain multiple bzip2 streams, whereas bzip2 does.
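A minimal sketch of the chunk-and-join approach described above, assuming the chunks are compressed in threads within one process rather than on separate machines. The helper `compress_in_chunks` and the lambda `bzip2_chunk` are hypothetical names; the only bzip2-ffi calls used are the `Bzip2::FFI::Writer.open` and `write` calls that appear elsewhere in this thread.

```ruby
require "stringio"

# Hypothetical helper: split +data+ into chunks of +chunk_size+ bytes,
# run the given block on each chunk in its own thread, and join the
# results in the original chunk order.
def compress_in_chunks(data, chunk_size, &compress_chunk)
  chunks = (0...data.bytesize).step(chunk_size).map do |offset|
    data.byteslice(offset, chunk_size)
  end
  threads = chunks.map { |chunk| Thread.new { compress_chunk.call(chunk) } }
  threads.map(&:value).join
end

# Compresses one chunk into its own complete bzip2 stream.
# The gem is required lazily so the helper above loads without it.
bzip2_chunk = lambda do |chunk|
  require "bzip2/ffi"
  io = StringIO.new(String.new) # binary buffer
  Bzip2::FFI::Writer.open(io) { |writer| writer.write(chunk) }
  io.string
end
```

With the gem installed, `File.binwrite("multiple.bz2", compress_in_chunks(data, 900_000, &bzip2_chunk))` would write one stream per chunk into a single file. The chunk size of 900,000 bytes is an arbitrary illustrative value.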

@reidmorrison reidmorrison changed the title pbzip2 (parallel bzip2) compatibility Support multiple embedded bzip2 streams in a single file Oct 16, 2020
@reidmorrison
Author

reidmorrison commented Oct 16, 2020

# Very simple example with small data to show the concept:
require "stringio"
require "bzip2/ffi"
io = StringIO.new
# Write first stream
Bzip2::FFI::Writer.open(io) { |writer| writer.write("Hello World\n") }
# Write second stream (usually in a separate thread or on another machine)
Bzip2::FFI::Writer.open(io) { |writer| writer.write("Hello World2\n") }
# Write combined chunks into a single file
File.open("multiple.bz2", "wb") { |file| file.write(io.string) }

Reading the file produced above using bzip2:

% bunzip2 -c multiple.bz2
Hello World
Hello World2

Using bzip2-ffi to read the same file only returns the first chunk:

Bzip2::FFI::Reader.open("multiple.bz2") { |reader| reader.read }
"Hello World\n"

@reidmorrison reidmorrison changed the title Support multiple embedded bzip2 streams in a single file Compatibility with bzip2 when decompressing multiple bzip2 streams/chunks in a single file Oct 16, 2020
@philr philr closed this as completed in 5ad3b73 Jan 16, 2021
@philr
Owner

philr commented Jan 16, 2021

Thanks for the report. I've committed a fix and will put a new release together soon.

@philr
Owner

philr commented Feb 27, 2021

The fix has now been released in v1.1.0.

@reidmorrison
Author

@philr I forget how thankless it is to make changes for random use-cases.
Thank you so much for spending your valuable time to work on and publish this change.
Thank you 🎉 👍
