Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) #3086

Merged: 10 commits merged into facebook:dev on Mar 17, 2022

Conversation

embg (Contributor) commented Mar 4, 2022

Summary

Using techniques similar to #2749, this PR improves compression speed by 5-6% on a dataset of HTML headers between 0 and 8KB (the dataset has ratio 2.32 without a dictionary, 3.41 with a dictionary). This PR doesn't specifically address the cold dictionary scenario, which is something I'd like to explore in the future, but it does improve cold dictionary performance as well as hot dictionary performance (see plots below).
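For a sense of what the software pipeline means here, below is a minimal sketch of the idea (all names, the hash function, and the loop structure are simplified placeholders; this is not the actual zstd_fast.c code). The loop issues the hash-table loads for the current position, then computes the next position's hashes while those loads are still in flight, so the load latency overlaps with useful work instead of stalling the match check.

```c
/* Illustrative sketch of software pipelining in a hash-based matchfinder.
 * Not the actual ZSTD_compressBlock_fast_dictMatchState implementation. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LOG 15

static uint32_t hash4(const uint8_t* p) {
    uint32_t v; memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* Returns the position where a match candidate was first seen (or end of input). */
size_t pipelinedSearch(const uint8_t* src, size_t srcSize,
                       const uint32_t* hashTable,     /* current-segment table */
                       const uint32_t* dictHashTable) /* dictionary (dictMatchState) table */
{
    size_t pos = 0;
    /* Stage 0: compute the hashes for the first position up front. */
    uint32_t h     = hash4(src + pos);
    uint32_t dictH = hash4(src + pos);   /* real code uses the dict table's own hash log */

    while (pos + 4 < srcSize) {
        /* Stage 1: start the (possibly cache-missing) table loads for this position. */
        uint32_t const matchIndex     = hashTable[h];
        uint32_t const dictMatchIndex = dictHashTable[dictH];

        /* Stage 2: while those loads are in flight, compute the next position's
         * hashes, so this work overlaps with the load latency. */
        uint32_t const nextH     = hash4(src + pos + 1);
        uint32_t const nextDictH = hash4(src + pos + 1);

        /* Stage 3: consume the loaded indices (the byte-level match check is elided). */
        if (matchIndex != 0 || dictMatchIndex != 0)
            return pos;   /* a match candidate resets the pipeline in the real code */

        h = nextH; dictH = nextDictH;
        pos += 1;
    }
    return pos;
}
```

In the real matchfinder the loaded indices are then used to compare bytes and emit a match, and a found match (or the accelerated step) drains and restarts the pipeline; the sketch only shows the overlap structure.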

Plan to merge

@Cyan4973, @felixhandte, and I are in agreement regarding which measurements are required to merge this change. Because the full suite of measurements will take some work to produce, I am only including two experiments for now. Once all nits and feedback are resolved and I have a final version of the code ready to merge, I will run the full suite and post the results in this thread.

Ratio regression

This change regresses compression ratio by 0.1% on the html8K dataset (4350234 bytes -> 4354095 bytes). This is because we no longer search the very last position. It may be possible to fix this without killing the perf win -- however I think @felixhandte struggled to do this for noDict, so I didn't attempt it for now. Let me know if this regression is a blocker.

EDIT: 6cba817 fixes the ratio regression (at least on html8K) with no performance cost! Thanks @felixhandte for pointing out this fix.

It is still possible that this PR will cause zstd not to search the last position -- this would occur in a situation where the matchfinder is in accelerated search mode when it hits the end of the input. However, across the entire html8K dataset, this change in the parsing strategy did not regress compressed size by even a single byte. So there is essentially no ratio tradeoff with this PR.
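As a toy illustration of how that can happen (made-up constants, not zstd's actual step schedule), an accelerating stride can jump past the final positions without ever searching them:

```c
#include <stddef.h>
#include <stdio.h>

/* Toy example: the stride grows while no match is found, so the loop can step
 * over the final position without ever searching it. Constants are made up. */
int main(void) {
    const size_t srcSize = 98;
    size_t positionsSinceMatch = 0;
    size_t pos = 0;
    size_t lastSearched = 0;
    while (pos < srcSize) {
        lastSearched = pos;                            /* "search" this position */
        positionsSinceMatch++;
        size_t const step = 1 + (positionsSinceMatch >> 4);  /* accelerating stride */
        pos += step;
    }
    /* With these constants the loop last searches position 95 and never
     * examines 96 or 97 -- the analogue of skipping the final position. */
    printf("last searched position: %zu of %zu\n", lastSearched, srcSize - 1);
    return 0;
}
```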

Preliminary results

Cold dictionary: [plot screenshot]
Hot dictionary: [plot screenshot]

Final results

I benchmarked on an Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (CentOS) machine with core isolation and turbo disabled. I measured all combinations of the following variables:

  • Dataset: html8K vs. github-users
  • Compiler: gcc 11.1.1 vs. clang 12.0.1
  • Dictionary temperature: hot vs. warm vs. cold
    • Hot = 1 CDict, 0.1MB working set size (L2 cache). Warm = 100 CDicts, 8-15MB working set size (roughly L3 cache). Cold = 10000 CDicts, >1GB working set size (main memory).
  • Dictionary size: 4K vs 32K vs 110K

I modified the largeNbDicts benchmark to append the max speed over 10 iterations to a CSV file. I ran each scenario 10 times interleaved (bash script) to eliminate bias from background processes competing for main memory / L3 bandwidth. I then used the bootstrap method to compute confidence intervals for the percent increase in speed, (optimized / dev) - 1 (python script).
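For reference, a minimal sketch of the bootstrap step described above (the original analysis used a Python script; this is an illustrative C version with placeholder measurements and a guessed resampling scheme, not the actual tooling):

```c
/* Input: one max-over-10-iterations speed per interleaved run, for dev and for
 * the optimized branch. Output: a 95% confidence interval on (optimized/dev) - 1. */
#include <stdio.h>
#include <stdlib.h>

#define NB_RUNS 10
#define NB_RESAMPLES 10000

static int cmpDouble(const void* a, const void* b) {
    double const x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* Placeholder speeds in MB/s, one per interleaved run (not real data). */
    const double devSpeed[NB_RUNS] = {100.1, 100.4,  99.8, 100.2, 100.0,
                                      100.3,  99.9, 100.1, 100.2, 100.0};
    const double optSpeed[NB_RUNS] = {105.6, 105.9, 105.2, 105.7, 105.5,
                                      105.8, 105.3, 105.6, 105.7, 105.4};
    static double stat[NB_RESAMPLES];

    srand(12345);
    for (int r = 0; r < NB_RESAMPLES; r++) {
        double devSum = 0.0, optSum = 0.0;
        for (int i = 0; i < NB_RUNS; i++) {
            int const j = rand() % NB_RUNS;   /* resample runs with replacement (paired) */
            devSum += devSpeed[j];
            optSum += optSpeed[j];
        }
        stat[r] = optSum / devSum - 1.0;      /* speed increase for this resample */
    }
    qsort(stat, NB_RESAMPLES, sizeof(double), cmpDouble);
    printf("95%% CI for speed increase: [%+.2f%%, %+.2f%%]\n",
           100.0 * stat[(int)(0.025 * NB_RESAMPLES)],
           100.0 * stat[(int)(0.975 * NB_RESAMPLES)]);
    return 0;
}
```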

Summary of results

Regressions

All scenarios are speed-positive except github-users with a 4K dict. Those scenarios show small regressions, between 0.2% and 1.6% depending on dict temperature. This is due to the extremely high compression ratio on github-users (which reduced the win for all github-users dict sizes): every time a match is found, the pipeline has to reset, so frequent matches erase the win that pipelining provides while no match is being found. Most real-world datasets have a much lower compression ratio, and most real-world dictionaries are much larger than 4K, so the regression only affects an extremely unrealistic use case.

Again: html8K was speed-positive across all scenarios. github-users with 32K and 110K dicts were speed-positive across all scenarios. This regression is only for github-users with 4K dict size.

Wins

All html8K scenarios are 5%+ wins on gcc. For html8K with 32K dictionary size, the gcc wins are 7%+ for all temperatures. The largest win I measured was 9.8% (± 0.2%) on html_8K/cold/110K with gcc.

clang is speed-positive across all scenarios, but the wins are more like 1-3% (this makes sense, since I only measured gcc during my exploration).

github-users wins for 32K dict and 110K dict are also around 1-3%. As discussed above, this is due to the extremely high compression ratio of github-users, which is not representative of most real-world data.

Full data

Rejected optimizations

  • L1 prefetching the dictMatchIndex / matchIndex. Together with pipelining, this was still speed-positive for hot dict, but the win was very small: the prefetch essentially erased the gain from pipelining alone (a sketch of the idea follows this list).
    • Theory: the L1 prefetches tied up Line Fill Buffers, of which there are only 12 on Skylake Server, delaying more urgent reads / writes.
  • Various L2 prefetches. These were speed-positive for cold dict, but huge regression for hot dict. I cannot land them without adding cold-dict detection for compression, which I plan to do in a future PR. L2 prefetches are thus out of scope for this PR.
  • Optimize pipeline initialization after a match is found. This ended up killing about half of the total pipelining win on html8K. Perhaps the extra branches outweighed the win from more extensive pipelining?
  • Replace ZSTD_count_2segments with ZSTD_count. For some reason this was a clear loss. Not sure why, it should have been a small win. Didn't spend time to investigate.
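For concreteness, a rough sketch of what the rejected L1 prefetch looked like in spirit (generic __builtin_prefetch shown here; all names are placeholders, not the PR's actual code):

```c
#include <stdint.h>

/* Once the next position's hashes are known, prefetch the bytes that the table
 * entries point to, hoping they are in L1 by the time the match check runs. */
void prefetchCandidates(const uint8_t* base, const uint8_t* dictBase,
                        const uint32_t* hashTable, const uint32_t* dictHashTable,
                        uint32_t nextHash, uint32_t nextDictHash)
{
    uint32_t const matchIndex     = hashTable[nextHash];
    uint32_t const dictMatchIndex = dictHashTable[nextDictHash];
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(base + matchIndex, 0 /* read */, 3 /* keep in L1 */);
    __builtin_prefetch(dictBase + dictMatchIndex, 0, 3);
#endif
    /* Each prefetch that misses occupies a Line Fill Buffer until the line
     * arrives; with only ~12 LFBs per core on Skylake Server, speculative
     * prefetches can delay the demand loads the pipeline actually needs,
     * which is the suspected reason this was a net loss. */
}
```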

Future work

I did not try deeper pipelines, or combining pipelining with loop unrolling (as we have in noDict). It is entirely possible that either of those directions could lead to bigger wins. Since I matched the noDict wins with a less complex pipeline (and even exceeded them in some important scenarios), I did not pursue these more complex optimizations.

Conclusion

There are big wins for html8K, and every realistic scenario I tested is speed-positive. Even for an unrealistic worst-case scenario (github-hot-4K-gcc), the speed regression was only 1.6%. The ratio regression has been addressed. This PR is ready to merge!

Cyan4973 (Contributor) commented Mar 4, 2022

> This change regresses compression ratio by 0.1% on the html8K dataset

A 0.1% ratio loss for a ~5% speed win is a good enough trade for fast compression modes.

felixhandte (Contributor) left a comment

Overall this looks really solid! Well done.

I left some comments and questions, which we can discuss sometime next week.

[12 review threads on lib/compress/zstd_fast.c, resolved]
Cyan4973 (Contributor) commented Mar 10, 2022

As a minor coding style comment: a lot of variables have shifted from being const to being mutable in this PR.

This may seem a "minor event" (just one keyword difference), but it's not: in most cases, bugs involve mutable variables. When a variable is const, I know it doesn't change after initialization, and during a code review I can mentally "forget" it after initialization. When it's mutable, it's a state, and I need to track its every change to be sure the system behaves as intended. Multiply the number of mutable variables, and the number of states to keep track of becomes unwieldy.

This is a balance though. I'm not saying that mutable variables are forbidden; they are necessary. But because they add a "burden cost" to maintenance, adding one should be weighed with caution, and additional efforts to keep variables const are worthwhile (whenever applicable).

In this PR, I only found 2 cases where variables could be made const, so the vast majority remain mutable. No magic wand here. I'm merely using this box to emphasize the benefits of const for code maintenance.
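For illustration, a minimal hypothetical example (not code from this PR) of the difference for a reviewer:

```c
#include <stddef.h>

int sumEveryOther(const int* table, size_t n) {
    size_t const stride = 2;  /* const: checked once at init, then a fixed fact */
    int total = 0;            /* mutable: state the reviewer must track below */
    for (size_t i = 0; i < n; i += stride)
        total += table[i];
    return total;
}
```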

felixhandte (Contributor) left a comment

LGTM!

embg (Contributor, Author) commented Mar 17, 2022

Updated PR summary with results from full benchmarking suite and discussion of good and bad future directions for this work.

@embg embg merged commit 64efba4 into facebook:dev Mar 17, 2022
@embg embg changed the title Software pipeline for ZSTD_compressBlock_fast_dictMatchState Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) Jun 23, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023