zstd: Optimize seqdec amd64 asm #636

Merged
merged 1 commit into from
Jul 4, 2022

greatroar
Contributor

copyMemoryPrecise now generates a loop over 16-byte blocks with a single branchless 16-byte fixup after it. This is a tiny bit faster on the whole and quite a bit faster for some inputs.
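The idea can be sketched in plain Go. This is only an illustration of the technique described above, not the generated assembly: the function name `copyPrecise`, the `blockSize` constant, and the handling of lengths below 16 bytes are assumptions made for the example, and the non-overlapping buffers are assumed as well.

```go
package main

import (
	"bytes"
	"fmt"
)

// blockSize mirrors the 16-byte block width mentioned in the description.
const blockSize = 16

// copyPrecise copies exactly n bytes from src to dst and never writes past
// dst[n-1]. It is a hypothetical stand-in for the assembly routine.
func copyPrecise(dst, src []byte, n int) {
	if n < blockSize {
		// Assumption: short copies take a separate path; the real code
		// may handle them differently.
		copy(dst[:n], src[:n])
		return
	}
	// Main loop: copy every full 16-byte block.
	for i := 0; i+blockSize <= n; i += blockSize {
		copy(dst[i:i+blockSize], src[i:i+blockSize])
	}
	// Fixup: one unconditional 16-byte copy that ends exactly at n.
	// It may overlap bytes the loop already copied, which is harmless
	// because it rewrites them with the same values. No branch on the
	// size of the remainder is needed.
	copy(dst[n-blockSize:n], src[n-blockSize:n])
}

func main() {
	src := []byte("0123456789abcdefghijklmnopqrstuvwxyz!") // 37 bytes
	dst := make([]byte, len(src)+8)
	copyPrecise(dst, src, len(src))
	fmt.Println(bytes.Equal(dst[:len(src)], src))             // true
	fmt.Println(bytes.Equal(dst[len(src):], make([]byte, 8))) // true: nothing written past n
}
```

With 37 bytes, the loop copies bytes 0..31 and the fixup copies bytes 21..36, so the tail is covered without a per-remainder branch.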

Benchmark results on Intel Core i7-3770K:
name                                                         old speed      new speed      delta
Decoder_DecoderSmall/kppkn.gtb.zst-8                          369MB/s ± 0%   374MB/s ± 1%  +1.56%  (p=0.008 n=5+5)
Decoder_DecoderSmall/geo.protodata.zst-8                      977MB/s ± 0%  1056MB/s ± 1%  +8.17%  (p=0.008 n=5+5)
Decoder_DecoderSmall/plrabn12.txt.zst-8                       291MB/s ± 0%   289MB/s ± 0%  -0.74%  (p=0.008 n=5+5)
Decoder_DecoderSmall/lcet10.txt.zst-8                         329MB/s ± 1%   333MB/s ± 0%  +1.23%  (p=0.008 n=5+5)
Decoder_DecoderSmall/asyoulik.txt.zst-8                       310MB/s ± 0%   310MB/s ± 1%    ~     (p=1.000 n=5+5)
Decoder_DecoderSmall/alice29.txt.zst-8                        291MB/s ± 0%   291MB/s ± 1%    ~     (p=0.421 n=5+5)
Decoder_DecoderSmall/html_x_4.zst-8                          2.07GB/s ± 0%  2.15GB/s ± 2%  +4.05%  (p=0.008 n=5+5)
Decoder_DecoderSmall/paper-100k.pdf.zst-8                    3.58GB/s ± 3%  3.74GB/s ± 1%  +4.31%  (p=0.008 n=5+5)
Decoder_DecoderSmall/fireworks.jpeg.zst-8                    8.57GB/s ± 0%  8.60GB/s ± 0%    ~     (p=0.056 n=5+5)
Decoder_DecoderSmall/urls.10K.zst-8                           474MB/s ± 1%   507MB/s ± 1%  +6.80%  (p=0.008 n=5+5)
Decoder_DecoderSmall/html.zst-8                               745MB/s ± 0%   803MB/s ± 0%  +7.68%  (p=0.008 n=5+5)
Decoder_DecoderSmall/comp-data.bin.zst-8                      399MB/s ± 1%   400MB/s ± 0%    ~     (p=0.841 n=5+5)
Decoder_DecodeAll/kppkn.gtb.zst-8                             521MB/s ± 0%   521MB/s ± 0%    ~     (p=0.841 n=5+5)
Decoder_DecodeAll/geo.protodata.zst-8                        1.27GB/s ± 1%  1.29GB/s ± 0%  +1.19%  (p=0.008 n=5+5)
Decoder_DecodeAll/plrabn12.txt.zst-8                          429MB/s ± 0%   427MB/s ± 0%  -0.51%  (p=0.032 n=5+5)
Decoder_DecodeAll/lcet10.txt.zst-8                            435MB/s ± 0%   439MB/s ± 0%  +0.94%  (p=0.008 n=5+5)
Decoder_DecodeAll/asyoulik.txt.zst-8                          438MB/s ± 0%   436MB/s ± 0%  -0.39%  (p=0.008 n=5+5)
Decoder_DecodeAll/alice29.txt.zst-8                           423MB/s ± 0%   420MB/s ± 1%  -0.72%  (p=0.008 n=5+5)
Decoder_DecodeAll/html_x_4.zst-8                             1.59GB/s ± 0%  1.59GB/s ± 1%  +0.54%  (p=0.032 n=5+5)
Decoder_DecodeAll/paper-100k.pdf.zst-8                       4.53GB/s ± 1%  4.54GB/s ± 1%    ~     (p=0.310 n=5+5)
Decoder_DecodeAll/fireworks.jpeg.zst-8                       9.64GB/s ± 1%  9.57GB/s ± 0%    ~     (p=0.151 n=5+5)
Decoder_DecodeAll/urls.10K.zst-8                              683MB/s ± 0%   681MB/s ± 0%    ~     (p=0.056 n=5+5)
Decoder_DecodeAll/html.zst-8                                 1.04GB/s ± 1%  1.06GB/s ± 0%  +1.77%  (p=0.008 n=5+5)
Decoder_DecodeAll/comp-data.bin.zst-8                         398MB/s ± 1%   399MB/s ± 1%    ~     (p=1.000 n=5+5)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-8    439MB/s ± 0%   437MB/s ± 0%  -0.39%  (p=0.016 n=5+5)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-8    448MB/s ± 0%   448MB/s ± 0%    ~     (p=0.841 n=5+5)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-8     478MB/s ± 0%   477MB/s ± 0%    ~     (p=0.151 n=5+5)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-8       463MB/s ± 0%   460MB/s ± 0%  -0.57%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/e.txt/fastest-8                       9.62GB/s ± 3%  9.66GB/s ± 1%    ~     (p=0.841 n=5+5)
Decoder_DecodeAllFiles/e.txt/default-8                        394MB/s ± 0%   395MB/s ± 0%    ~     (p=0.056 n=5+5)
Decoder_DecodeAllFiles/e.txt/better-8                         438MB/s ± 0%   442MB/s ± 0%  +0.82%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/e.txt/best-8                           501MB/s ± 0%   506MB/s ± 0%  +1.07%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/fse-artifact3.bin/fastest-8           1.04GB/s ± 0%  1.05GB/s ± 1%    ~     (p=0.056 n=5+5)
Decoder_DecodeAllFiles/fse-artifact3.bin/default-8           1.20GB/s ± 1%  1.20GB/s ± 1%    ~     (p=0.095 n=5+5)
Decoder_DecodeAllFiles/fse-artifact3.bin/better-8            1.01GB/s ± 0%  1.00GB/s ± 1%  -0.82%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/fse-artifact3.bin/best-8               386MB/s ± 0%   383MB/s ± 0%  -0.57%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/gettysburg.txt/fastest-8               271MB/s ± 1%   275MB/s ± 1%  +1.59%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/gettysburg.txt/default-8               224MB/s ± 1%   223MB/s ± 1%    ~     (p=0.222 n=5+5)
Decoder_DecodeAllFiles/gettysburg.txt/better-8                228MB/s ± 0%   226MB/s ± 0%  -0.89%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/gettysburg.txt/best-8                  223MB/s ± 1%   221MB/s ± 1%  -1.03%  (p=0.016 n=5+5)
Decoder_DecodeAllFiles/html.txt/fastest-8                     592MB/s ± 1%   611MB/s ± 0%  +3.20%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/html.txt/default-8                     597MB/s ± 0%   607MB/s ± 0%  +1.71%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/html.txt/better-8                      623MB/s ± 0%   633MB/s ± 0%  +1.57%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/html.txt/best-8                        603MB/s ± 0%   610MB/s ± 0%  +1.25%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/pi.txt/fastest-8                      9.59GB/s ± 1%  9.70GB/s ± 1%  +1.16%  (p=0.032 n=5+5)
Decoder_DecodeAllFiles/pi.txt/default-8                       391MB/s ± 0%   393MB/s ± 0%  +0.62%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/pi.txt/better-8                        437MB/s ± 1%   441MB/s ± 2%    ~     (p=0.087 n=5+5)
Decoder_DecodeAllFiles/pi.txt/best-8                          501MB/s ± 0%   507MB/s ± 0%  +1.22%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/pngdata.bin/fastest-8                 1.66GB/s ± 1%  1.70GB/s ± 0%  +2.49%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/pngdata.bin/default-8                 1.49GB/s ± 0%  1.51GB/s ± 0%  +1.18%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/pngdata.bin/better-8                  1.87GB/s ± 0%  1.90GB/s ± 1%    ~     (p=0.056 n=5+5)
Decoder_DecodeAllFiles/pngdata.bin/best-8                    1.44GB/s ± 1%  1.46GB/s ± 0%  +1.75%  (p=0.008 n=5+5)
Decoder_DecodeAllFiles/sharnd.out/fastest-8                  9.64GB/s ± 1%  9.66GB/s ± 1%    ~     (p=0.841 n=5+5)
Decoder_DecodeAllFiles/sharnd.out/default-8                  9.70GB/s ± 1%  9.70GB/s ± 2%    ~     (p=1.000 n=5+5)
Decoder_DecodeAllFiles/sharnd.out/better-8                   9.71GB/s ± 1%  9.79GB/s ± 1%    ~     (p=0.151 n=5+5)
Decoder_DecodeAllFiles/sharnd.out/best-8                     9.76GB/s ± 0%  9.80GB/s ± 0%    ~     (p=0.056 n=5+5)
Decoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-8  1.85GB/s ± 0%  1.85GB/s ± 0%  -0.31%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-8  1.86GB/s ± 0%  1.85GB/s ± 0%  -0.47%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-8   2.00GB/s ± 0%  2.00GB/s ± 0%  -0.32%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-8     1.93GB/s ± 0%  1.93GB/s ± 0%  -0.22%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/e.txt/fastest-8                      37.7GB/s ± 0%  37.5GB/s ± 0%  -0.38%  (p=0.016 n=5+5)
Decoder_DecodeAllFilesP/e.txt/default-8                      1.68GB/s ± 0%  1.69GB/s ± 0%  +0.55%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/e.txt/better-8                       1.91GB/s ± 0%  1.92GB/s ± 0%  +0.96%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/e.txt/best-8                         2.22GB/s ± 0%  2.25GB/s ± 0%  +1.50%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/fse-artifact3.bin/fastest-8          5.18GB/s ± 0%  5.05GB/s ± 2%  -2.50%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/fse-artifact3.bin/default-8          5.50GB/s ± 1%  5.34GB/s ± 1%  -2.86%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/fse-artifact3.bin/better-8           5.11GB/s ± 0%  5.14GB/s ± 0%  +0.57%  (p=0.016 n=5+5)
Decoder_DecodeAllFilesP/fse-artifact3.bin/best-8             2.36GB/s ± 0%  2.37GB/s ± 0%  +0.20%  (p=0.032 n=5+5)
Decoder_DecodeAllFilesP/gettysburg.txt/fastest-8             1.16GB/s ± 0%  1.16GB/s ± 0%    ~     (p=0.056 n=5+5)
Decoder_DecodeAllFilesP/gettysburg.txt/default-8             1.09GB/s ± 0%  1.08GB/s ± 0%  -1.19%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/gettysburg.txt/better-8              1.09GB/s ± 0%  1.08GB/s ± 1%  -0.96%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/gettysburg.txt/best-8                1.03GB/s ± 3%  1.02GB/s ± 0%    ~     (p=0.151 n=5+5)
Decoder_DecodeAllFilesP/html.txt/fastest-8                   2.50GB/s ± 1%  2.56GB/s ± 0%  +2.39%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/html.txt/default-8                   2.51GB/s ± 0%  2.55GB/s ± 0%  +1.69%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/html.txt/better-8                    2.61GB/s ± 0%  2.66GB/s ± 0%  +1.93%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/html.txt/best-8                      2.53GB/s ± 0%  2.56GB/s ± 0%  +1.13%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/pi.txt/fastest-8                     37.8GB/s ± 0%  37.6GB/s ± 0%  -0.44%  (p=0.016 n=5+5)
Decoder_DecodeAllFilesP/pi.txt/default-8                     1.67GB/s ± 0%  1.68GB/s ± 0%  +0.61%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/pi.txt/better-8                      1.91GB/s ± 0%  1.93GB/s ± 0%  +0.82%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/pi.txt/best-8                        2.23GB/s ± 0%  2.26GB/s ± 0%  +1.35%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/pngdata.bin/fastest-8                6.99GB/s ± 0%  7.00GB/s ± 0%    ~     (p=0.690 n=5+5)
Decoder_DecodeAllFilesP/pngdata.bin/default-8                6.88GB/s ± 0%  6.87GB/s ± 0%    ~     (p=0.222 n=5+5)
Decoder_DecodeAllFilesP/pngdata.bin/better-8                 8.49GB/s ± 0%  8.44GB/s ± 1%    ~     (p=0.310 n=5+5)
Decoder_DecodeAllFilesP/pngdata.bin/best-8                   6.59GB/s ± 1%  6.53GB/s ± 1%  -0.96%  (p=0.032 n=5+5)
Decoder_DecodeAllFilesP/sharnd.out/fastest-8                 37.8GB/s ± 0%  37.5GB/s ± 0%  -0.86%  (p=0.008 n=5+5)
Decoder_DecodeAllFilesP/sharnd.out/default-8                 37.9GB/s ± 1%  38.0GB/s ± 1%    ~     (p=0.310 n=5+5)
Decoder_DecodeAllFilesP/sharnd.out/better-8                  37.9GB/s ± 0%  37.8GB/s ± 2%    ~     (p=0.841 n=5+5)
Decoder_DecodeAllFilesP/sharnd.out/best-8                    37.8GB/s ± 0%  38.0GB/s ± 1%    ~     (p=0.310 n=5+5)
Decoder_DecodeAllParallel/kppkn.gtb.zst-8                    2.20GB/s ± 0%  2.20GB/s ± 0%    ~     (p=1.000 n=5+5)
Decoder_DecodeAllParallel/geo.protodata.zst-8                5.37GB/s ± 0%  5.39GB/s ± 0%  +0.35%  (p=0.008 n=5+5)
Decoder_DecodeAllParallel/plrabn12.txt.zst-8                 1.77GB/s ± 0%  1.76GB/s ± 0%  -0.19%  (p=0.008 n=5+5)
Decoder_DecodeAllParallel/lcet10.txt.zst-8                   1.90GB/s ± 0%  1.92GB/s ± 0%  +0.80%  (p=0.008 n=5+5)
Decoder_DecodeAllParallel/asyoulik.txt.zst-8                 1.83GB/s ± 0%  1.83GB/s ± 0%    ~     (p=0.841 n=5+5)
Decoder_DecodeAllParallel/alice29.txt.zst-8                  1.74GB/s ± 0%  1.74GB/s ± 0%    ~     (p=0.548 n=5+5)
Decoder_DecodeAllParallel/html_x_4.zst-8                     6.55GB/s ± 0%  6.49GB/s ± 0%  -0.97%  (p=0.008 n=5+5)
Decoder_DecodeAllParallel/paper-100k.pdf.zst-8               18.3GB/s ± 0%  18.3GB/s ± 0%    ~     (p=0.056 n=5+5)
Decoder_DecodeAllParallel/fireworks.jpeg.zst-8               37.4GB/s ± 0%  37.2GB/s ± 1%  -0.57%  (p=0.016 n=4+5)
Decoder_DecodeAllParallel/urls.10K.zst-8                     2.97GB/s ± 0%  2.96GB/s ± 0%    ~     (p=0.310 n=5+5)
Decoder_DecodeAllParallel/html.zst-8                         4.42GB/s ± 1%  4.43GB/s ± 0%    ~     (p=0.556 n=5+4)
Decoder_DecodeAllParallel/comp-data.bin.zst-8                1.69GB/s ± 1%  1.70GB/s ± 0%  +0.84%  (p=0.008 n=5+5)
[Geo mean]                                                   1.77GB/s       1.78GB/s       +0.57%

@greatroar greatroar force-pushed the zstd-copymemoryprecise branch from 5d8f037 to cc3f110 Compare July 3, 2022 18:05
@klauspost
Owner

Great stuff. I will run a few tests tomorrow.

@klauspost
Owner

@greatroar I can confirm your numbers. There are some rather noticeable regressions, but with this in place I can port some of the s2 memcopy for an even bigger speedup and fix some of the regression cases.

Thanks for the contribution!

@klauspost klauspost merged commit bf3f0fd into klauspost:master Jul 4, 2022