BuildZStdDict Testing #950

macneale4 · 2024-04-10T17:09:28Z

Hello!

I'm hoping to use the compress package for packing bytes in our archive format for Dolt (https://github.com/dolthub/dolt). Our data is very amenable to custom dictionaries for small sets (5-50) of binary blobs which average about 4K in size.

We built our POC using the gozstd package which has Dictionary support: https://pkg.go.dev/github.com/valyala/gozstd

Now that I'm are convinced this is a good approach for us, I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase, as it's stated in the dict/README.md. In particular, the README says it doesn't work well with small sets, and indeed I got a panic when using a set of 32 samples.

How far from complete do you think it is? How can I help get this tested and get the bugs ironed out? I have one local test case I can share with you if that's helpful.

FWIW, using the dictionary produced by valyala/gozstd, I can compress/decompress successfully with the compress/zstd package. So I think it's just a matter of generating a correct dict and we'll be able to move forward.

Thanks!

The text was updated successfully, but these errors were encountered:

klauspost · 2024-04-11T14:18:55Z

I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase

TBH I don't see it happening. I don't have the bandwidth for it at the moment.

However if you can make a simple reproducer for any crash I will fix it.

klauspost · 2024-04-11T14:41:42Z

Also sorry for being a bit distracted in my last replies... Let me provide some context:

dict.BuildZstdDict(input [][]byte, o Options) is the higher level API.

What it mainly does is to create a common dictionary for the input samples you provide. Deduplicating and truncating to opts.MaxDictSize. Once that is done it calls zstd.BuildDict to contruct the actual dictionary.

It will attempt to compress all of Contents [][]byte with History []byte as a common history to generate the optimal tables for these. These tables are then all combined with the history and written in the zstd dictionary format.

There are for sure untested corner cases in this process and it is mostly a "released PoC" - as in - it works - some/most of the time and it is "ok" in terms of (compression) efficiency.

macneale4 · 2024-04-11T20:27:16Z

That's good context. I'll try and bypass the higher level API and use the zstd.BuildDict next. In the meantime, here is a test which is failing.

macneale4@956c229

There is a divide by 0 error, which I got around with a hack (see commit). I doubt that is right. But then once the dictionary is created, the loadDict method fails with the error tableLog too large

I'm out of my element when is comes to compression, but will probably be pecking at this for the next few days. We'd really like to avoid using CGO in our product.

klauspost · 2024-04-12T09:06:52Z

Fix in #951

Made a few tweaks to your test.

A) You are writing to the input buffer when decompressing.
B) EncodeAll/DecodeAll appends- adding a [:0] to make that clear.
C) MaxDictSize is for the backreference size, not the total size.

package dict

import (
	"bytes"
	"io"
	"math/rand"
	"os"
	"testing"

	"github.com/klauspost/compress/zstd"
)

func TestZStdDict(t *testing.T) {
	for _, level := range []zstd.EncoderLevel{zstd.SpeedFastest, zstd.SpeedDefault, zstd.SpeedBetterCompression, zstd.SpeedBestCompression} {
		testZStdDict(t, level)
	}
}

func testZStdDict(t *testing.T, level zstd.EncoderLevel) {
	out := io.Discard
	if testing.Verbose() {
		out = os.Stdout
	}
	opts := Options{
		MaxDictSize:    2048,
		HashBytes:      4,
		Output:         out,
		ZstdDictID:     0,
		ZstdDictCompat: false,
		ZstdLevel:      level,
	}

	inBuf := make([]byte, 0, 4096)
	outBuf := make([]byte, 0, 4096)

	// This is 32K worth of data, but it's all very similar. Only fits in 4K if compressed with a dictionary.
	samples := generateSimilarByteSlices(42, 32)

	dict, err := BuildZstdDict(samples, opts)
	if err != nil {
		t.Fatal(err.Error())
	}

	totalSize := 0
	for _, blob := range samples {
		compressed, err := zCompressDict(inBuf, dict, blob)
		if err != nil {
			t.Fatal(err.Error())
		}
		totalSize += len(compressed)

		// Check round trip.
		decompressed, err := zDecompressDict(outBuf, dict, compressed)
		if err != nil {
			t.Fatal(err.Error())
		}
		if !bytes.Equal(decompressed, blob) {
			t.Fatal("Round trip failed")
		}
	}
	if totalSize > 4096 {
		t.Fatal("Total compressed size exceeds 4096 bytes")
	}
	t.Log("Total compressed size:", totalSize)
}

func zCompressDict(dst, dict, data []byte) ([]byte, error) {
	encoder, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		return nil, err
	}
	defer encoder.Close()

	result := encoder.EncodeAll(data, dst[:0])
	return result, nil
}

func zDecompressDict(dst, dict, data []byte) ([]byte, error) {
	decoder, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		return nil, err
	}
	defer decoder.Close()

	result, err := decoder.DecodeAll(data, dst[:0])
	if err != nil {
		return nil, err
	}

	return result, nil
}

// Creates a slice of byte slices, each of which is has the same random seed, so they are very similar. The length
// of each slice is 1024 + the index of the slice.
func generateSimilarByteSlices(seed int64, count int) [][]byte {
	chks := make([][]byte, count)
	for i := 0; i < count; i++ {
		chks[i] = generateRandomByteSlice(seed, 1024+i)
		if false {
			// Generate a small diff.
			chks[i][i] = byte(seed)
		}
	}

	return chks
}

func generateRandomByteSlice(seed int64, len int) []byte {
	r := rand.NewSource(seed)

	data := make([]byte, len)
	for i := range data {
		data[i] = byte(r.Int63())
	}
	return data
}

With your permission I'd like to include that as a test.

macneale4 · 2024-04-12T22:21:04Z

I pulled your changes, and they do result in our unit tests passing. Great!

When exercising against real data, I ran into another edge case. I pushed a new test to my fork branch:

https://github.com/macneale4/compress/tree/zstd-dict-test

There tip commit has a speculative change (allowing v == threshold) will allows my testing to continue. I have no idea if that's correct though.

You are welcome to take any of this and stick into your tests. The data is coming from one of our public data repos.

klauspost · 2024-04-15T09:52:43Z

The threshold is mostly to filter out candidates in large sets. Though everything being discarded could indicate that there is little of actual value. I'll merge your change back into the PR.

For production use you'd probably want to set up a fuzz test to check that border cases are handled.

klauspost mentioned this issue Apr 12, 2024

zstd: BuildDict fails with RLE table #951

Merged

klauspost closed this as completed in #951 Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BuildZStdDict Testing #950

BuildZStdDict Testing #950

macneale4 commented Apr 10, 2024 •

edited

Loading

klauspost commented Apr 11, 2024

klauspost commented Apr 11, 2024

macneale4 commented Apr 11, 2024

klauspost commented Apr 12, 2024 •

edited

Loading

macneale4 commented Apr 12, 2024

klauspost commented Apr 15, 2024

BuildZStdDict Testing #950

BuildZStdDict Testing #950

Comments

macneale4 commented Apr 10, 2024 • edited Loading

klauspost commented Apr 11, 2024

klauspost commented Apr 11, 2024

macneale4 commented Apr 11, 2024

klauspost commented Apr 12, 2024 • edited Loading

macneale4 commented Apr 12, 2024

klauspost commented Apr 15, 2024

macneale4 commented Apr 10, 2024 •

edited

Loading

klauspost commented Apr 12, 2024 •

edited

Loading