LZW is an archive format that utilizes the power of the LZW compression algorithm. LZW is a dictionary-based lossless compression algorithm. It's an old algorithm, well suited for beginners to practice on.

The internal algorithm processes byte data, so it's applicable to any file type, not just text files. However, it may not achieve a substantial compression ratio for file types that are already efficiently compressed, such as PDF and MP4 files. It treats data as a byte stream, unaware of text-level patterns, which makes it less compression-efficient than more advanced compression algorithms.

The LZW compression algorithm is dynamic. It doesn't collect data statistics beforehand. Instead, it learns the data pattern while performing the compression, building a code table on the fly. The compression ratio approaches its maximum as more data is processed. The algorithmic complexity is strictly linear in the size of the text. A more in-depth algorithmic analysis can be found in the following sections.
An alternative implementation that uses more efficient self-made, customized bitmap, dict, and set data structures to replace the C++ builtin general-purpose `std::bitset`, `std::unordered_map`, and `std::set` can be found in the `assignment` branch. A future enhancement is a customized hash function to replace the builtin general-purpose `std::hash`.
Pre-built binaries (Windows only) are available on the Releases page. Release versions conform to the semantic versioning convention.
Prerequisites: Git, Python 3.6+, `pip`, and a modern C++ compiler: `g++` or `cl`.
$ git clone https://github.com/MapleCCC/liblzw.git
$ cd liblzw
# You can optionally create a virtual environment for isolation purpose
$ python -m virtualenv .venv
$ source .venv/Scripts/activate
# Install basic build requirements
$ pip install -r requirements.txt
# if g++ is available:
$ make
# Compiled executable is in build/lzw
# if speed is desirable:
$ make fast
# if on Windows platform and cl.exe is available:
$ mkdir build && cl /Fe"build/lzw.exe" lzw.cpp
# Compiled executable is in build/lzw.exe
# if speed is desirable:
# compile in release mode
$ mkdir build && cl /D "NDEBUG" /O2 /Fe"build/lzw.exe" lzw.cpp
Alternatively, you can build as a static library for embedding in other applications.
# Build static library
$ make build-lib
# Compiled library is in build/liblzw.a
# Build dynamic library
[...]
# Get Help Message
$ lzw --help
'''
Usage:
# Compression
$ lzw compress [-o|--output <ARCHIVE>] <FILES>...
# Decompression
$ lzw decompress <ARCHIVE>
'''
Contributions are welcome.

- When committing new code, make sure to apply the format specified in the `.clang-format` config file.
- Don't directly modify `README.md`. It's automatically generated from the template `README.raw.md`. If you want to add something to the README, modify the template `README.raw.md` instead.
- Also remember to add `scripts/pre-commit.py` to `.git/hooks/pre-commit` as the pre-commit hook script.
Prerequisites: Git, Python 3.6+, `pip`, `npm`, and a modern C++ compiler: `g++` or `cl`.
# Clone the repository to local environment
$ git clone https://github.com/MapleCCC/liblzw.git
$ cd liblzw
# You can optionally create a virtual environment for isolation purpose
$ python -m virtualenv .venv
$ source .venv/Scripts/activate
# Install basic build requirements
$ pip install -r requirements.txt
# Install dev requirements
$ pip install -r requirements-dev.txt
$ npm install
# Install pre-commit hook script
$ make install-pre-commit-hook-script
The pre-commit hook script basically does five things:

- Format staged C/C++ code
- Format staged Python code
- Transform LaTeX math equations in `README.raw.md` to image URLs in `README.md`
- Append the content of `TODO.md` and `CHANGELOG.md` to `README.md`
- Generate a TOC for `README.md`
Besides relying on the pre-commit hook script, you can manually format code and transform the math equations in `README.md`.
# Format C/C++ and Python code under the working directory
$ make reformat
# Transform LaTeX math equations to image urls in README.md
$ make transform-eqn
The advantage of the pre-commit hook script, compared to manually triggering the scripts, is that it's convenient and non-disruptive, as it only introduces changes to staged files instead of all the files in the repo.
- Unit test

  $ make unit-test

- Integration test

  Python 3.6+, pytest, and Hypothesis are needed for the integration tests.

  # You can optionally create a virtual environment for isolation purpose
  $ python -m virtualenv .venv
  $ source .venv/Scripts/activate
  $ python -m pip install -U pytest hypothesis
  $ make integrate-test
  # The test suite is very thorough. It currently takes about ten to twenty seconds.
Before talking about performance, we first conduct some complexity analysis. This helps us understand the internal structure of the problem, and how the various parts of the task contribute to the performance hotspots.
The LZW compression algorithm treats data as a binary stream, regardless of its text encoding. During the process, a running word is maintained. Before the algorithm starts, the running word is empty. Every new byte consumed by the algorithm is appended to the running word, and the running word is reset under certain circumstances.

For each byte consumed, the algorithm checks whether the running word extended by the new byte is present in the code dict, and takes one of two branches.

If the new running word formed by appending the new byte at the end is in the code dict, the algorithm proceeds to branch 1: it keeps the extended word as the running word and moves on to the next byte.

In the other case, the algorithm takes branch 2: it emits the code associated with the current running word, adds the extended word to the code dict under a fresh code, and resets the running word to the single new byte.

The cost models for these two branches are respectively:

$$C_1 = c_{\text{check}}, \qquad C_2 = c_{\text{check}} + c_{\text{emit}} + c_{\text{add}}$$

where $c_{\text{check}}$, $c_{\text{emit}}$, and $c_{\text{add}}$ denote the cost of a dict membership check, emitting one code, and a dict insertion.

Suppose the source text byte length is $N$. For simplicity, we assume that the cost of each dict operation does not grow with the dict size. The total cost model of the compression process can then be summarized as:

$$C_{\text{compress}} = \sum_{i=1}^{N} \big( p_i \, C_1 + (1 - p_i) \, C_2 \big)$$

where $p_i$ is the probability of taking branch 1 at the $i$-th byte. For input data that doesn't have many repeated byte patterns, branch 2 is taken most of the time, so the per-byte cost is dominated by $C_2$. If the underlying data structure implementation of the code dict is a hash table, then the membership check and insertion are amortized constant time, so $C_2$ is bounded by a constant. For input data that has many repeated byte patterns, branch 1 is taken more often, and the per-byte cost is dominated by the even cheaper $C_1$.

We see that the time complexity of the compression algorithm is not affected by the byte repetition pattern: it's always linear, $O(N)$. This nice property holds true as long as the underlying implementation of the code dict scales sublinearly.
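To make the two branches concrete, here is a minimal sketch of the compression loop, assuming a `std::unordered_map`-based code dict. It is illustrative only: the function name, code width, and output type are chosen for the example and are not the actual liblzw implementation.

```cpp
#include <cstdint>
#include <istream>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch of the LZW compression loop, not the actual liblzw code.
std::vector<uint32_t> lzw_compress(std::istream& input) {
    std::unordered_map<std::string, uint32_t> code_dict;
    code_dict.reserve(1 << 16);  // reserve space up front to avoid rehashing

    // Initialize the code dict with all single-byte strings.
    for (int b = 0; b < 256; ++b)
        code_dict.emplace(std::string(1, static_cast<char>(b)), b);
    uint32_t next_code = 256;

    std::vector<uint32_t> output;
    std::string running_word;  // empty before the algorithm starts

    char byte;
    while (input.get(byte)) {
        std::string extended = running_word + byte;
        if (code_dict.count(extended)) {
            // Branch 1: the extended word is already in the code dict.
            running_word = std::move(extended);
        } else {
            // Branch 2: emit the code of the running word, add the extended
            // word to the code dict, and reset the running word to the new byte.
            output.push_back(code_dict[running_word]);
            code_dict.emplace(std::move(extended), next_code++);
            running_word.assign(1, byte);
        }
    }
    if (!running_word.empty())
        output.push_back(code_dict[running_word]);  // flush the last word
    return output;
}
```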
Contrary to the compression algorithm, the LZW decompression algorithm consumes a stream of codes, decodes them on the fly, and emits byte data. A string dict is built and maintained along the way. Similar to the compression process, a running word is maintained during the process; it is empty at the beginning.

Depending on whether the consumed code can be found in the string dict, the algorithm has two branches to go.

If the code currently consumed can be found in the string dict, the algorithm goes to branch 1: the corresponding string is looked up and emitted, and a new entry, the running word extended by the first byte of the emitted string, is added to the string dict.

On the other hand, if the code currently consumed is not present in the string dict, a special situation happens and needs special care. The algorithm goes to branch 2: the only code that can be missing is the one about to be added next, so the emitted string is reconstructed as the running word extended by its own first byte before being added to the string dict.

The cost models for these two branches are then:

$$D_1 = c_{\text{check}} + c_{\text{lookup}} + c_{\text{add}} + c_{\text{emit}}, \qquad D_2 = c_{\text{check}} + c_{\text{add}} + c_{\text{emit}}$$

Suppose the code stream length is $M$. For simplicity, we assume again that the cost of each dict operation does not grow with the dict size. The probability of going to branch 2 depends on the data, but either way each code costs a bounded amount of work, so the total cost model is:

$$C_{\text{decompress}} = \sum_{i=1}^{M} \big( q_i \, D_1 + (1 - q_i) \, D_2 \big)$$

where $q_i$ is the probability of taking branch 1 at the $i$-th code. It's the same as that of the compression algorithm! The total cost model for the decompression algorithm turns out to be identical in form to that of the compression algorithm: they are both linear in the input length.
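For symmetry, here is a matching sketch of the decompression loop, again only an illustration under the same assumptions rather than the actual liblzw source; branch 2 is the special case where the code refers to the dict entry that is just about to be added.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch of the LZW decompression loop, not the actual liblzw code.
std::string lzw_decompress(const std::vector<uint32_t>& codes) {
    std::unordered_map<uint32_t, std::string> string_dict;
    string_dict.reserve(1 << 16);
    for (int b = 0; b < 256; ++b)
        string_dict.emplace(b, std::string(1, static_cast<char>(b)));
    uint32_t next_code = 256;

    std::string output;
    std::string running_word;  // empty at the beginning

    for (uint32_t code : codes) {
        std::string entry;
        if (string_dict.count(code)) {
            // Branch 1: the code is present in the string dict.
            entry = string_dict[code];
        } else if (code == next_code && !running_word.empty()) {
            // Branch 2: the code refers to the entry currently being built,
            // so it must be the running word followed by its own first byte.
            entry = running_word + running_word[0];
        } else {
            throw std::runtime_error("corrupted code stream");
        }
        output += entry;
        if (!running_word.empty())
            string_dict.emplace(next_code++, running_word + entry[0]);
        running_word = entry;
    }
    return output;
}
```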
The algorithmic details of the LZW compression algorithm don't leave a large parameter-tuning space or rich variants to twist. The only performance improvement we can try to make is concentrated on the underlying data structure implementation of the code dict and the string dict. The exported interface consists of `dict.lookup`, `dict.add`, and `dict.membership_check`. Many data structures support these three operations. When deciding which data structure to use, we want the cost to be as small as possible, because these three operations are invoked in nearly every iteration of the algorithm. Three candidates are examined in detail:
- Balanced tree

  A balanced tree has the advantage of being easy to implement while having reasonable average performance bounds for many operations. It's too mediocre for our demanding task.

- Hash table

  A hash table, if implemented carefully, can exhibit excellent performance for various frequent operations. The major drawbacks of the hash table are that it's space demanding and that its performance is sensitive to the input data pattern. The good news is that we can tune the hash function and the collision resolution scheme to work around skewed input data.

- Trie

  A trie is a specialized data structure that is especially good at handling text-related data. Its major drawback is that a correct and efficient implementation needs some attention and care.
Weighing these trade-offs, we decide to use a hash table to implement both the code dict and the string dict.
For simplicity, we do not customize our own hash functions; `std::hash` is used instead. This could serve as a minor optimization candidate in the future.
Ideally, the code dict and string dict implemented as hash tables exhibit amortized constant-time `dict.add`, `dict.lookup`, and `dict.membership_check` operations.
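A thin wrapper along the following lines (hypothetical class and method names chosen to mirror the interface above, not liblzw's actual classes) shows how `std::unordered_map` together with `std::hash` covers all three operations:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of a code dict exposing the three required operations.
// Requires C++17 for std::optional.
class CodeDict {
  public:
    explicit CodeDict(std::size_t expected_size) { table_.reserve(expected_size); }

    // dict.membership_check: amortized O(1) on average
    bool membership_check(const std::string& word) const {
        return table_.find(word) != table_.end();
    }

    // dict.lookup: amortized O(1) on average
    std::optional<uint32_t> lookup(const std::string& word) const {
        auto it = table_.find(word);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

    // dict.add: amortized O(1) on average
    void add(const std::string& word, uint32_t code) { table_.emplace(word, code); }

  private:
    // std::hash<std::string> is the default hasher here.
    std::unordered_map<std::string, uint32_t> table_;
};
```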
Instead of loading the whole file content into memory, which could lead to a large memory footprint and hence resource pressure, our code adopts a more flexible IO method: we load the input data in a streaming style, reading data only when it's needed. This ensures that our code scales well to large inputs and keeps the memory usage low.
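As a rough sketch of this streaming style (the 64 KiB buffer size and the function name are made up for illustration):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read and feed the input in fixed-size chunks instead of loading the whole
// file into memory at once.
void compress_in_chunks(const std::string& path) {
    std::ifstream input(path, std::ios::binary);
    std::vector<char> buffer(64 * 1024);
    while (input.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           input.gcount() > 0) {
        std::streamsize n = input.gcount();
        // feed buffer[0..n) to the compressor here
        (void)n;
    }
}
```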
A little trick we use here is to reserve sufficient space for the code dict and string dict at initialization. This effectively reduces the cost of resizing the hash table, which is expensive, to say the least.
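For example, with an `std::unordered_map`-backed dict (the reserved size here is only an illustrative guess, not the value liblzw actually uses):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, uint32_t> code_dict;
    // Reserve buckets for the expected number of codes up front, so the table
    // is not repeatedly rehashed (and its entries re-inserted) as codes are added.
    code_dict.reserve(1 << 16);
    // ... run the compression using code_dict ...
}
```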
Deprecated.
Deprecated.