improve compression ratio of small alphabets #3391
Merged
fix #3228
In situations where the source's alphabet is very small, the Optimal Parser's initial evaluation of literal costs is incorrect. It takes some time to converge, and compression is less efficient in the meantime.
This matters especially for small files, since most parsing decisions are then based on incorrect metrics.
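To illustrate the idea, here is a minimal, hypothetical sketch (not the actual zstd code; all names are invented) of seeding literal prices from an observed histogram rather than a flat 256-symbol prior. This both makes symbols actually present in a small alphabet cheap from the start and effectively prunes absent symbols:

```c
#include <stddef.h>

#define NB_SYMBOLS 256

/* Coarse log2 via highest bit position; the real optimal parser
 * uses finer-grained fractional-bit prices. */
static unsigned highbit(unsigned v) { unsigned r = 0; while (v >>= 1) r++; return r; }

/* Hypothetical helper: fill costs[s] with an estimated price (in bits)
 * for literal s, derived from observed frequencies in src, instead of
 * assuming a uniform distribution over all 256 symbols. */
static void seedLiteralCosts(unsigned costs[NB_SYMBOLS],
                             const unsigned char* src, size_t srcSize)
{
    unsigned freq[NB_SYMBOLS] = {0};
    size_t n;
    for (n = 0; n < srcSize; n++) freq[src[n]]++;
    for (n = 0; n < NB_SYMBOLS; n++) {
        if (freq[n] == 0) {
            costs[n] = 20;  /* large price: prunes absent symbols immediately */
        } else {
            /* cost ~ log2(total/freq): frequent symbols of a small
             * alphabet get cheap prices from the very first block */
            costs[n] = highbit((unsigned)(srcSize / freq[n])) + 1;
        }
    }
}
```

With a flat prior, every literal would start at 8 bits; with this kind of seeding, a 2-symbol alphabet starts near 1 bit per literal, so early parsing decisions are already based on realistic metrics.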
After this patch, scenario #3228 is fixed, delivering the expected 29-byte compressed size (the smallest known compressed size, down from 54 bytes).
On other "regular" data, this patch tends to be generally positive, though this is an average, and differences remain pretty small.
The patch seems to impact text data more, likely because it prunes non present alphabet symbols much faster.
On binary data with full alphabet, it's more balanced, and results typically vary by merely a few bytes (compared to
dev
), making it essentially a non-event.Since this modification is only for high compression modes, the speed impact is insignificant.