
Optimal huff depth speed improvements #3302

Merged

Conversation

@daniellerozenblit (Contributor) commented Oct 27, 2022

TLDR

This PR is a follow-up to a previous PR that, for high compression levels, brute-force tests all valid Huffman table depths and chooses the one that minimizes encoded size + header size. This PR introduces some speed optimizations that could (potentially) allow this feature to be used at lower compression levels.

Note: this does cost a small amount of compression ratio, but offers a significant speed improvement. That trade-off could argue for more fine-grained depth modes, though that might be overkill.
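For context, a minimal sketch of what that brute-force search looks like, assuming zstd's HUF_* helpers (workspace plumbing and error handling are simplified; the loop shape matches the diff excerpt quoted later in this thread):

```c
/* Brute-force depth search (sketch): try every buildable table log and
 * keep the one minimizing (table header size + estimated encoded size). */
unsigned optLog  = maxTableLog;
size_t   optSize = (size_t)-1;          /* running best total size */
BYTE     header[256];                   /* scratch for the serialized table */
unsigned huffLog;
for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) {
    size_t const maxBits = HUF_buildCTable_wksp(table, count,
                                                maxSymbolValue, huffLog,
                                                workSpace, wkspSize);
    if (ERR_isError(maxBits)) continue;   /* this depth is not buildable: skip it */
    {   size_t const hSize = HUF_writeCTable_wksp(header, sizeof(header), table,
                                                  maxSymbolValue, (unsigned)maxBits,
                                                  workSpace, wkspSize);
        size_t const newSize = hSize
                             + HUF_estimateCompressedSize(table, count, maxSymbolValue);
        if (newSize < optSize) { optSize = newSize; optLog = huffLog; }
    }
}
/* optLog now holds the depth limit with the smallest estimated total size */
```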

Benchmarking

I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212M), compiled with clang 15. I experimented with various combinations of compression level and chunk size, ran each scenario 5 times, and took the maximum speed value.

Speed-Optimized Optimal Log vs. No Optimal Log

In the following tables, ctrl uses the original Huffman log method with no speed or compression optimizations, and test uses the speed-optimized Huffman log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

Speed-Optimized Optimal Log vs. Brute Force Optimal Log

In the following tables, ctrl uses the brute force optimal log method, and test uses the speed-optimized optimal log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

@Cyan4973 (Contributor)

In the comparison above, what is dev?

@daniellerozenblit (Contributor, Author)

> In the comparison above, what is dev?

Apologies, dev refers to the original dev branch, without any sort of log optimizations. I will add additional tables comparing this speed optimization to my original PR.

@Cyan4973 (Contributor)

While the compression losses are small, they are nonetheless present, in large enough quantity to call the initial "speed benefit at no compression loss" objective into question.

My understanding is that this PR starts by "guessing" what is likely a good optLog value, and then tests left and right to see if there are better ones.

My concern is that managing the left/right regression logic may be a bit more complex than it initially seems. Of course, the resulting format is always correct; the concern is about finding the best choice, as the brute-force method does.

A recommendation here would be to simplify the logic by searching in only a single direction: from smallest to largest. See if this method results in any loss of compression (if it does, then there is more to look into), and see if it improves speed measurably.

The intuition is that it will help speed for small data, which is where it matters most because the cost is perceptible. It will probably not be efficient for large data, but also, the relative cost is much lower in this case.

Finally, presuming it works (to be proven), it would be a good step forward, a reference point that could still be improved upon.
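A sketch of that single-direction idea, assuming total size is roughly unimodal in the depth limit (`estimateTotalSize()` is a hypothetical stand-in for header size + estimated encoded size):

```c
/* Single-direction variant (sketch): scan depths from smallest to largest
 * and stop as soon as the estimated total size stops improving. */
size_t   optSize = (size_t)-1;           /* best size so far */
unsigned optLog  = maxTableLog;
unsigned huffLog;
for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) {
    size_t const newSize = estimateTotalSize(huffLog);  /* hypothetical helper */
    if (newSize < optSize) {
        optSize = newSize;               /* still improving: adopt and continue */
        optLog  = huffLog;
    } else {
        break;                           /* first regression: stop searching */
    }
}
```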

@daniellerozenblit (Contributor, Author) commented Dec 20, 2022

After many rounds of experimentation, I unfortunately could not find a solution that fits the initial objective to achieve speed improvements with no loss in ratio. Any potential speed improvements that deviate from brute force seem to at least somewhat regress compression ratio for 1KB block sizes (even if only by a couple hundred bytes).

The solution proposed by @Cyan4973 appears to be the best solution for now, though still not quite optimal.

Speed-Optimized Optimal Log vs. Brute Force Optimal Log

In the following tables, ctrl uses the brute force optimal log method, and test uses the speed-optimized optimal log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

There remains some loss in compression ratio for smaller block sizes, though the loss is fairly negligible.

for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) {
    maxBits = HUF_buildCTable_wksp(table, count,
                                   maxSymbolValue, huffLog,
                                   workSpace, wkspSize);
    if (ERR_isError(maxBits)) continue;

@Cyan4973 (Contributor) commented Dec 20, 2022

Minor optimization:

if (maxBits < optLogGuess) break;

If the tree builder is unable to use all the allowed bits anyway, it means we have already reached the optimal Huffman distribution at the previous attempt. We can immediately stop the loop, as it will bring no further benefit.

Tested on silesia.tar with 1 KB blocks: this seems to improve compression speed by ~1%, with no impact on compression ratio.
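In loop context, the early exit would sit right after the table build; a sketch (the diff excerpt above names the loop counter huffLog, while the suggestion calls it optLogGuess, for the same role):

```c
maxBits = HUF_buildCTable_wksp(table, count,
                               maxSymbolValue, optLogGuess,
                               workSpace, wkspSize);
if (ERR_isError(maxBits)) continue;
/* The builder needed fewer bits than allowed: the unconstrained Huffman
 * tree was already reached at the previous depth, so stop here. */
if (maxBits < optLogGuess) break;
```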

}
optSize = newSize;
@Cyan4973 (Contributor) commented Dec 20, 2022

There is a difference here in how you handle size-equal solutions.

In the previous PR, when 2 distributions result in equal size, you keep the first one, with the smaller maxBits.
In this new PR, when 2 distributions result in equal size, you update the solution and keep the larger maxBits.

It turns out that this difference is where the compression ratio regression happens on the 1KB block test.
If I change the logic to keep the smaller number of bits, I get most of the original size back.

Actually, try this variant:

  • break if newSize > optSize + 1
  • update solution only if newSize < optSize

and it should give you exactly the same result as the brute-force variant, while preserving most of the speed benefits of this PR.
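In code, that variant would look roughly as follows (a sketch: optSize is the running best estimate, seeded with the first candidate's size, and optLogGuess is the depth currently under test):

```c
size_t const newSize = hSize + HUF_estimateCompressedSize(table, count, maxSymbolValue);
if (newSize > optSize + 1) break;   /* clearly worse, beyond 1 byte of estimation noise: stop */
if (newSize < optSize) {            /* strictly better: adopt it */
    optSize = newSize;
    optLog  = optLogGuess;
}
/* on a tie, keep the earlier (smaller) log, matching the previous PR's behavior */
```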

Now, why is there a difference between two maxBits variants if the size estimation tells us that they should lead to the same compressed size? Initially, I thought there might be a side effect with statistics re-use in following blocks, but thinking about it, that can't be the case: the -B1K test cuts the input into independent chunks of 1 KB, so there is no follow-up block, hence no statistics re-use. It can't be the reason for this difference.

Now, what could explain it ? Let's investigate...

@Cyan4973 (Contributor)

I think the explanation for these last differences, between solutions which are expected to be equal, is that HUF_estimateCompressedSize() is just an estimation.
At first sight, it looks exact, but there are actually a number of minor details to be added for a more accurate picture:

  • there is a last-bit marker at the end of the bitstream
  • the estimation doesn't distinguish 1 stream vs 4 streams, which would lead to different size estimates
  • 4 streams cannot be estimated "as is"; it would require knowing the exact histogram of each stream (we only have the global histogram of the block to work with)

At the end of the day, it's not that these shortcomings make a lot of difference: the estimation is still mostly correct. But they do explain potential off-by-1 estimation errors, which then produce a certain level of "noise" below which it's not possible to make progress.

Making HUF_estimateCompressedSize() more accurate might be useful, not least because it's also used in other parts of the compression library, where such differences might matter more. But it's definitely a different effort, not to be confused with this PR.
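For reference, the essence of such an estimate as a simplified, self-contained sketch (not zstd's exact implementation, which reads the code lengths out of the CTable): weight each symbol's count by its code length, then convert bits to bytes.

```c
/* Simplified encoded-size estimate from the block's global histogram and
 * per-symbol code lengths. It exhibits the limitations listed above:
 * no closing-bit marker, no 1-vs-4-streams distinction. */
static size_t estimateEncodedSize(const unsigned* count,        /* symbol histogram */
                                  const unsigned* codeLengths,  /* bits per symbol */
                                  unsigned maxSymbolValue)
{
    size_t nbBits = 0;
    unsigned s;
    for (s = 0; s <= maxSymbolValue; s++)
        nbBits += (size_t)count[s] * codeLengths[s];
    return nbBits >> 3;   /* bytes, rounded down */
}
```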

@daniellerozenblit (Contributor, Author)

Discussed offline: the last-bit marker at the end of the bitstream is actually accounted for in HUF_estimateCompressedSize(). It currently returns nbBits >> 3, which is the correct number of bytes minus 1 once the closing bit is considered (e.g., 17 encoded bits plus the closing bit occupy 3 bytes, and 17 >> 3 = 2). Since we only care about relative sizes, this constant offset is not the cause of the disparity we observe between the final compressed size and the estimated local compressed size.

@daniellerozenblit (Contributor, Author) commented Jan 3, 2023

I implemented the suggestions made by @Cyan4973 and observed favorable results.

Note: the losses we see in decompression speed appear to be random noise. I ran the benchmarks additional times and did not see the speed loss observed here.

Speed-Optimized Optimal Log vs. Brute Force Optimal Log

In the following tables, ctrl uses the brute force optimal log method, and test uses the speed-optimized optimal log method.

Default block size

[benchmark table screenshot]

-B1KB

[benchmark table screenshot]

-B16KB

[benchmark table screenshot]

@daniellerozenblit daniellerozenblit marked this pull request as ready for review January 3, 2023 17:51
@daniellerozenblit daniellerozenblit merged commit 1c818e3 into facebook:dev Jan 3, 2023
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023