Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) #3086

Merged: 10 commits merged into facebook:dev on Mar 17, 2022

Conversation

embg (Contributor) commented Mar 4, 2022

Summary

Using techniques similar to #2749, this PR improves compression speed by 5-6% on a dataset of HTML headers between 0 and 8KB (the dataset has ratio 2.32 without a dictionary, 3.41 with a dictionary). This PR doesn't specifically address the cold dictionary scenario, which is something I'd like to explore in the future, but it does improve cold dictionary performance as well as hot dictionary performance (see plots below).
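For a sense of what the software pipeline means here, below is a minimal sketch of the idea (all names, the hash function, and the loop structure are simplified placeholders; this is not the actual zstd_fast.c code). The loop issues the hash-table loads for the current position, then computes the next position's hashes while those loads are still in flight, so the load latency overlaps with useful work instead of stalling the match check.

```c
/* Illustrative sketch of software pipelining in a hash-based matchfinder.
 * Not the actual ZSTD_compressBlock_fast_dictMatchState implementation. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LOG 15

static uint32_t hash4(const uint8_t* p) {
    uint32_t v; memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* Returns the position where a match candidate was first seen (or end of input). */
size_t pipelinedSearch(const uint8_t* src, size_t srcSize,
                       const uint32_t* hashTable,     /* current-segment table */
                       const uint32_t* dictHashTable) /* dictionary (dictMatchState) table */
{
    size_t pos = 0;
    /* Stage 0: compute the hashes for the first position up front. */
    uint32_t h     = hash4(src + pos);
    uint32_t dictH = hash4(src + pos);   /* real code uses the dict table's own hash log */

    while (pos + 4 < srcSize) {
        /* Stage 1: start the (possibly cache-missing) table loads for this position. */
        uint32_t const matchIndex     = hashTable[h];
        uint32_t const dictMatchIndex = dictHashTable[dictH];

        /* Stage 2: while those loads are in flight, compute the next position's
         * hashes, so this work overlaps with the load latency. */
        uint32_t const nextH     = hash4(src + pos + 1);
        uint32_t const nextDictH = hash4(src + pos + 1);

        /* Stage 3: consume the loaded indices (the byte-level match check is elided). */
        if (matchIndex != 0 || dictMatchIndex != 0)
            return pos;   /* a match candidate resets the pipeline in the real code */

        h = nextH; dictH = nextDictH;
        pos += 1;
    }
    return pos;
}
```

In the real matchfinder the loaded indices are then used to compare bytes and emit a match, and a found match (or the accelerated step) drains and restarts the pipeline; the sketch only shows the overlap structure.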

Plan to merge

@Cyan4973, @felixhandte, and I are in agreement regarding which measurements are required to merge this change. Because the full suite of measurements will take some work to produce, I am only including two experiments for now. Once all nits and feedback are resolved and I have a final version of the code ready to merge, I will run the full suite and post the results in this thread.

Ratio regression

This change regresses compression ratio by 0.1% on the html8K dataset (4350234 bytes -> 4354095 bytes). This is because we no longer search the very last position. It may be possible to fix this without killing the perf win -- however I think @felixhandte struggled to do this for noDict, so I didn't attempt it for now. Let me know if this regression is a blocker.

EDIT: 6cba817 fixes the ratio regression (at least on html8K) with no performance cost! Thanks @felixhandte for pointing out this fix.

It is still possible that this PR will cause zstd not to search the last position -- this would occur in a situation where the matchfinder is in accelerated search mode when it hits the end of the input. However, across the entire html8K dataset, this change in the parsing strategy did not regress compressed size by even a single byte. So there is essentially no ratio tradeoff with this PR.
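As a toy illustration of how that can happen (made-up constants, not zstd's actual step schedule), an accelerating stride can jump past the final positions without ever searching them:

```c
#include <stddef.h>
#include <stdio.h>

/* Toy example: the stride grows while no match is found, so the loop can step
 * over the final position without ever searching it. Constants are made up. */
int main(void) {
    const size_t srcSize = 98;
    size_t positionsSinceMatch = 0;
    size_t pos = 0;
    size_t lastSearched = 0;
    while (pos < srcSize) {
        lastSearched = pos;                            /* "search" this position */
        positionsSinceMatch++;
        size_t const step = 1 + (positionsSinceMatch >> 4);  /* accelerating stride */
        pos += step;
    }
    /* With these constants the loop last searches position 95 and never
     * examines 96 or 97 -- the analogue of skipping the final position. */
    printf("last searched position: %zu of %zu\n", lastSearched, srcSize - 1);
    return 0;
}
```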

Preliminary results

Cold dictionary: [plot screenshot]
Hot dictionary: [plot screenshot]

Final results

I benchmarked on an Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (CentOS) machine with core isolation and turbo disabled. I measured all combinations of the following variables:

  • Dataset: html8K vs. github-users
  • Compiler: gcc 11.1.1 vs. clang 12.0.1
  • Dictionary temperature: hot vs. warm vs. cold
    • Hot = 1 CDict, 0.1MB working set size (L2 cache). Warm = 100 CDicts, 8-15MB working set size (roughly L3 cache). Cold = 10000 CDicts, >1GB working set size (main memory).
  • Dictionary size: 4K vs 32K vs 110K

I modified the largeNbDicts benchmark to append the max speed over 10 iterations to a CSV file. I ran each scenario 10 times interleaved (bash script) to eliminate bias from background processes competing for main memory / L3 bandwidth. I then used the bootstrap method to compute confidence intervals for the percent increase in speed, (optimized / dev) - 1 (python script).
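For reference, a minimal sketch of the bootstrap step described above (the original analysis used a Python script; this is an illustrative C version with placeholder measurements and a guessed resampling scheme, not the actual tooling):

```c
/* Input: one max-over-10-iterations speed per interleaved run, for dev and for
 * the optimized branch. Output: a 95% confidence interval on (optimized/dev) - 1. */
#include <stdio.h>
#include <stdlib.h>

#define NB_RUNS 10
#define NB_RESAMPLES 10000

static int cmpDouble(const void* a, const void* b) {
    double const x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* Placeholder speeds in MB/s, one per interleaved run (not real data). */
    const double devSpeed[NB_RUNS] = {100.1, 100.4,  99.8, 100.2, 100.0,
                                      100.3,  99.9, 100.1, 100.2, 100.0};
    const double optSpeed[NB_RUNS] = {105.6, 105.9, 105.2, 105.7, 105.5,
                                      105.8, 105.3, 105.6, 105.7, 105.4};
    static double stat[NB_RESAMPLES];

    srand(12345);
    for (int r = 0; r < NB_RESAMPLES; r++) {
        double devSum = 0.0, optSum = 0.0;
        for (int i = 0; i < NB_RUNS; i++) {
            int const j = rand() % NB_RUNS;   /* resample runs with replacement (paired) */
            devSum += devSpeed[j];
            optSum += optSpeed[j];
        }
        stat[r] = optSum / devSum - 1.0;      /* speed increase for this resample */
    }
    qsort(stat, NB_RESAMPLES, sizeof(double), cmpDouble);
    printf("95%% CI for speed increase: [%+.2f%%, %+.2f%%]\n",
           100.0 * stat[(int)(0.025 * NB_RESAMPLES)],
           100.0 * stat[(int)(0.975 * NB_RESAMPLES)]);
    return 0;
}
```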

Summary of results

Regressions

All scenarios are speed-positive except github-users with a 4K dict. Those scenarios show small regressions, between 0.2% and 1.6% depending on dict temperature. This is due to the extremely high compression ratio on github-users (which reduced the win for all github-users dict sizes): every time a match is found, the pipeline has to reset, so frequent matches erase the win that pipelining provides while no match is being found. Most real-world datasets have a much lower compression ratio, and most real-world dictionaries are much larger than 4K, so the regression only affects an extremely unrealistic use case.

Again: html8K was speed-positive across all scenarios. github-users with 32K and 110K dicts were speed-positive across all scenarios. This regression is only for github-users with 4K dict size.

Wins

All html8K scenarios are 5%+ wins on gcc. For html8K with 32K dictionary size, the gcc wins are 7%+ for all temperatures. The largest win I measured was 9.8% (± 0.2%) on html_8K/cold/110K with gcc.

clang is speed-positive across all scenarios, but the wins are more like 1-3% (this makes sense, since I only measured gcc during my exploration).

github-users wins for 32K dict and 110K dict are also around 1-3%. As discussed above, this is due to the extremely high compression ratio of github-users, which is not representative of most real-world data.

Full data

Rejected optimizations

  • L1 prefetching the dictMatchIndex / matchIndex. Together with pipelining, this was still speed-positive for hot dict, but the win was very small: the prefetch essentially erased the gain from pipelining alone (a sketch of the idea follows this list).
    • Theory: the L1 prefetches tied up Line Fill Buffers, of which there are only 12 on Skylake Server, delaying more urgent reads / writes.
  • Various L2 prefetches. These were speed-positive for cold dict, but huge regression for hot dict. I cannot land them without adding cold-dict detection for compression, which I plan to do in a future PR. L2 prefetches are thus out of scope for this PR.
  • Optimize pipeline initialization after a match is found. This ended up killing about half of the total pipelining win on html8K. Perhaps the extra branches outweighed the win from more extensive pipelining?
  • Replace ZSTD_count_2segments with ZSTD_count. For some reason this was a clear loss. Not sure why, it should have been a small win. Didn't spend time to investigate.
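For concreteness, a rough sketch of what the rejected L1 prefetch looked like in spirit (generic __builtin_prefetch shown here; all names are placeholders, not the PR's actual code):

```c
#include <stdint.h>

/* Once the next position's hashes are known, prefetch the bytes that the table
 * entries point to, hoping they are in L1 by the time the match check runs. */
void prefetchCandidates(const uint8_t* base, const uint8_t* dictBase,
                        const uint32_t* hashTable, const uint32_t* dictHashTable,
                        uint32_t nextHash, uint32_t nextDictHash)
{
    uint32_t const matchIndex     = hashTable[nextHash];
    uint32_t const dictMatchIndex = dictHashTable[nextDictHash];
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(base + matchIndex, 0 /* read */, 3 /* keep in L1 */);
    __builtin_prefetch(dictBase + dictMatchIndex, 0, 3);
#endif
    /* Each prefetch that misses occupies a Line Fill Buffer until the line
     * arrives; with only ~12 LFBs per core on Skylake Server, speculative
     * prefetches can delay the demand loads the pipeline actually needs,
     * which is the suspected reason this was a net loss. */
}
```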

Future work

I did not try deeper pipelines, or combining pipelining with loop unrolling (as we have in noDict). It is entirely possible that either of those directions could lead to bigger wins. Since I matched the noDict wins with a less complex pipeline (and even exceeded them in some important scenarios), I did not pursue these more complex optimizations.

Conclusion

There are big wins for html8K, and every realistic scenario I tested is speed-positive. Even for an unrealistic worst-case scenario (github-hot-4K-gcc), the speed regression was only 1.6%. The ratio regression has been addressed. This PR is ready to merge!

Cyan4973 (Contributor) commented Mar 4, 2022

> This change regresses compression ratio by 0.1% on the html8K dataset

A 0.1% ratio loss for a ~5% speed win is a good enough trade for fast compression modes.

felixhandte (Contributor) left a comment

Overall this looks really solid! Well done.

I left some comments and questions, which we can discuss sometime next week.

[12 review threads on lib/compress/zstd_fast.c, resolved]
Cyan4973 (Contributor) commented Mar 10, 2022

As a minor coding style comment: a lot of variables have shifted from being const to being mutable in this PR.

This may seem a "minor event" (just one keyword difference), but it's not: in most cases, bugs involve mutable variables. When a variable is const, I know it doesn't change after initialization, and during a code review I can mentally "forget" it after initialization. When it's mutable, it's a state, and I need to track its every change to be sure the system behaves as intended. Multiply the number of mutable variables, and the number of states to keep track of becomes unwieldy.

This is a balance though. I'm not saying that mutable variables are forbidden; they are necessary. But because they add a "burden cost" to maintenance, adding one should be weighed with caution, and additional efforts to keep variables const are worthwhile (whenever applicable).

In this PR, I only found 2 cases where variables could be made const, so the vast majority remain mutable. No magic wand here. I'm merely using this box to emphasize the benefits of const for code maintenance.
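For illustration, a minimal hypothetical example (not code from this PR) of the difference for a reviewer:

```c
#include <stddef.h>

int sumEveryOther(const int* table, size_t n) {
    size_t const stride = 2;  /* const: checked once at init, then a fixed fact */
    int total = 0;            /* mutable: state the reviewer must track below */
    for (size_t i = 0; i < n; i += stride)
        total += table[i];
    return total;
}
```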

felixhandte (Contributor) left a comment

LGTM!

embg (Contributor, Author) commented Mar 17, 2022

Updated PR summary with results from full benchmarking suite and discussion of good and bad future directions for this work.

@embg embg merged commit 64efba4 into facebook:dev Mar 17, 2022
@embg embg changed the title Software pipeline for ZSTD_compressBlock_fast_dictMatchState Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) Jun 23, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023