
Optimal huff depth speed improvements #3302

Merged

Conversation

@daniellerozenblit (Contributor) commented Oct 27, 2022

TLDR

This PR is a follow-up to a previous PR that, for high compression levels, brute-force tests all valid Huffman table depths and chooses the one that minimizes encoded size + header size. This PR introduces some speed optimizations that could (potentially) allow this feature to be used at lower compression levels.

Note: this does cost a small amount of compression ratio, but offers a significant speed improvement. That trade-off could argue for more fine-grained depth modes, though that might be overkill.
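For context, a minimal sketch of what that brute-force search looks like, assuming zstd's HUF_* helpers (workspace plumbing and error handling are simplified; the loop shape matches the diff excerpt quoted later in this thread):

```c
/* Brute-force depth search (sketch): try every buildable table log and
 * keep the one minimizing (table header size + estimated encoded size). */
unsigned optLog  = maxTableLog;
size_t   optSize = (size_t)-1;          /* running best total size */
BYTE     header[256];                   /* scratch for the serialized table */
unsigned huffLog;
for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) {
    size_t const maxBits = HUF_buildCTable_wksp(table, count,
                                                maxSymbolValue, huffLog,
                                                workSpace, wkspSize);
    if (ERR_isError(maxBits)) continue;   /* this depth is not buildable: skip it */
    {   size_t const hSize = HUF_writeCTable_wksp(header, sizeof(header), table,
                                                  maxSymbolValue, (unsigned)maxBits,
                                                  workSpace, wkspSize);
        size_t const newSize = hSize
                             + HUF_estimateCompressedSize(table, count, maxSymbolValue);
        if (newSize < optSize) { optSize = newSize; optLog = huffLog; }
    }
}
/* optLog now holds the depth limit with the smallest estimated total size */
```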

Benchmarking

I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212M), compiled with clang 15. I experimented with various combinations of compression level and chunk size, ran each scenario 5 times, and took the maximum speed value.

Speed-Optimized Optimal Log vs. No Optimal Log

In the following tables, ctrl uses the original Huffman log method with no speed or compression optimizations, and test uses the speed-optimized Huffman log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

Speed-Optimized Optimal Log vs. Brute Force Optimal Log

In the following tables, ctrl uses the brute force optimal log method, and test uses the speed-optimized optimal log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

@Cyan4973 (Contributor)

In the comparison above, what is dev?

@daniellerozenblit (Contributor, Author)

> In the comparison above, what is dev?

Apologies, dev refers to the original dev branch, without any sort of log optimizations. I will add additional tables comparing this speed optimization to my original PR.

@Cyan4973 (Contributor)

While the compression losses are small, they are nonetheless present, in large enough quantity to call the initial "speed benefit at no compression loss" objective into question.

My understanding is that this PR starts by "guessing" what is likely a good optLog value, and then tests left and right to see if there are better ones.

My concern is that managing the left/right regression logic may be a bit more complex than it initially seems. Of course, the resulting format is always correct; the concern is about finding the best choice, as the brute-force method does.

A recommendation here would be to simplify the logic by searching in only a single direction: from smallest to largest. See if this method results in any loss of compression (if it does, then there is more to look into), and see if it improves speed measurably.

The intuition is that it will help speed for small data, which is where it matters most because the cost is perceptible. It will probably not be efficient for large data, but also, the relative cost is much lower in this case.

Finally, presuming it works (to be proven), it would be a good step forward, a reference point that could still be improved upon.
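A sketch of that single-direction idea, assuming total size is roughly unimodal in the depth limit (`estimateTotalSize()` is a hypothetical stand-in for header size + estimated encoded size):

```c
/* Single-direction variant (sketch): scan depths from smallest to largest
 * and stop as soon as the estimated total size stops improving. */
size_t   optSize = (size_t)-1;           /* best size so far */
unsigned optLog  = maxTableLog;
unsigned huffLog;
for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) {
    size_t const newSize = estimateTotalSize(huffLog);  /* hypothetical helper */
    if (newSize < optSize) {
        optSize = newSize;               /* still improving: adopt and continue */
        optLog  = huffLog;
    } else {
        break;                           /* first regression: stop searching */
    }
}
```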

@daniellerozenblit (Contributor, Author) commented Dec 20, 2022

After many rounds of experimentation, I unfortunately could not find a solution that fits the initial objective to achieve speed improvements with no loss in ratio. Any potential speed improvements that deviate from brute force seem to at least somewhat regress compression ratio for 1KB block sizes (even if only by a couple hundred bytes).

The solution proposed by @Cyan4973 appears to be the best solution for now, though still not quite optimal.

Speed-Optimized Optimal Log vs. Brute Force Optimal Log

In the following tables, ctrl uses the brute force optimal log method, and test uses the speed-optimized optimal log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

There remains some loss in compression ratio for smaller block sizes, though the loss is fairly negligible.

for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) {
    maxBits = HUF_buildCTable_wksp(table, count,
                                   maxSymbolValue, huffLog,
                                   workSpace, wkspSize);
    if (ERR_isError(maxBits)) continue;

@Cyan4973 (Contributor) commented Dec 20, 2022

Minor optimization:

if (maxBits < optLogGuess) break;

If the tree builder is unable to use all the allowed bits anyway, it means we have already reached the optimal Huffman distribution at the previous attempt. We can immediately stop the loop, as it will bring no further benefit.

Tested on silesia.tar with 1 KB blocks: this seems to improve compression speed by ~1%, with no impact on compression ratio.
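In loop context, the early exit would sit right after the table build; a sketch (the diff excerpt above names the loop counter huffLog, while the suggestion calls it optLogGuess, for the same role):

```c
maxBits = HUF_buildCTable_wksp(table, count,
                               maxSymbolValue, optLogGuess,
                               workSpace, wkspSize);
if (ERR_isError(maxBits)) continue;
/* The builder needed fewer bits than allowed: the unconstrained Huffman
 * tree was already reached at the previous depth, so stop here. */
if (maxBits < optLogGuess) break;
```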

}
optSize = newSize;
@Cyan4973 (Contributor) commented Dec 20, 2022

There is a difference here in how you handle size-equal solutions.

In the previous PR, when 2 distributions result in equal size, you keep the first one, with the smaller maxBits.
In this new PR, when 2 distributions result in equal size, you update the solution and keep the larger maxBits.

It turns out that this difference is where the compression ratio regression happens on the 1KB block test.
If I change the logic to keep the smaller number of bits, I get most of the original size back.

Actually, try this variant:

  • break if newSize > optSize + 1
  • update solution only if newSize < optSize

and it should give you exactly the same result as the brute-force variant, while preserving most of the speed benefits of this PR.
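In code, that variant would look roughly as follows (a sketch: optSize is the running best estimate, seeded with the first candidate's size, and optLogGuess is the depth currently under test):

```c
size_t const newSize = hSize + HUF_estimateCompressedSize(table, count, maxSymbolValue);
if (newSize > optSize + 1) break;   /* clearly worse, beyond 1 byte of estimation noise: stop */
if (newSize < optSize) {            /* strictly better: adopt it */
    optSize = newSize;
    optLog  = optLogGuess;
}
/* on a tie, keep the earlier (smaller) log, matching the previous PR's behavior */
```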

Now, why is there a difference between two maxBits variants if the size estimation tells us that they should lead to the same compressed size? Initially, I thought there might be a side effect with statistics re-use in following blocks, but thinking about it, that can't be the case: the -B1K test cuts the input into independent chunks of 1 KB, so there is no follow-up block, hence no statistics re-use. It can't be the reason for this difference.

Now, what could explain it ? Let's investigate...

@Cyan4973 (Contributor)

I think the explanation for these last differences, between solutions which are expected to be equal, is that HUF_estimateCompressedSize() is just an estimation.
At first sight, it looks exact, but there are actually a number of minor details to be added for a more accurate picture:

  • there is a last-bit marker at the end of the bitstream
  • the estimation doesn't distinguish 1 stream vs 4 streams, which would lead to different size estimates
  • 4 streams cannot be estimated "as is"; it would require knowing the exact histogram of each stream (we only have the global histogram of the block to work with)

At the end of the day, it's not that these shortcomings make a lot of difference: the estimation is still mostly correct. But they do explain potential off-by-1 estimation errors, which then produce a certain level of "noise" below which it's not possible to make progress.

Making HUF_estimateCompressedSize() more accurate might be useful, not least because it's also used in other parts of the compression library, where such differences might matter more. But it's definitely a different effort, not to be confused with this PR.
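For reference, the essence of such an estimate as a simplified, self-contained sketch (not zstd's exact implementation, which reads the code lengths out of the CTable): weight each symbol's count by its code length, then convert bits to bytes.

```c
/* Simplified encoded-size estimate from the block's global histogram and
 * per-symbol code lengths. It exhibits the limitations listed above:
 * no closing-bit marker, no 1-vs-4-streams distinction. */
static size_t estimateEncodedSize(const unsigned* count,        /* symbol histogram */
                                  const unsigned* codeLengths,  /* bits per symbol */
                                  unsigned maxSymbolValue)
{
    size_t nbBits = 0;
    unsigned s;
    for (s = 0; s <= maxSymbolValue; s++)
        nbBits += (size_t)count[s] * codeLengths[s];
    return nbBits >> 3;   /* bytes, rounded down */
}
```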

@daniellerozenblit (Contributor, Author)

Discussed offline: the last-bit marker at the end of the bitstream is actually accounted for in HUF_estimateCompressedSize(). It currently returns nbBits >> 3, which is the correct number of bytes minus 1 once the closing bit is considered (e.g., 17 encoded bits plus the closing bit occupy 3 bytes, and 17 >> 3 = 2). Since we only care about relative sizes, this constant offset is not the cause of the disparity we observe between the final compressed size and the estimated local compressed size.

@daniellerozenblit (Contributor, Author) commented Jan 3, 2023

I implemented the suggestions made by @Cyan4973 and observed favorable results.

Note: the losses we see in decompression speed appear to be random noise. I ran the benchmarks additional times and did not see the speed loss observed here.

Speed-Optimized Optimal Log vs. Brute Force Optimal Log

In the following tables, ctrl uses the brute force optimal log method, and test uses the speed-optimized optimal log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

@daniellerozenblit daniellerozenblit marked this pull request as ready for review January 3, 2023 17:51
@daniellerozenblit daniellerozenblit merged commit 1c818e3 into facebook:dev Jan 3, 2023
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023