Lance File Format v2.1 #2856

westonpace · 2024-09-11T15:57:53Z

Now that 2.0 is the default, we should avoid making changes to it, even non-breaking feature changes, so that we can work out all the kinks. There are still a number of things that we would like to improve, so we will work on a 2.1 release. I'd like to focus on the following things for 2.1:

Compression for strings, integers (and possibly floats)

Let's enable FSST in 2.1 and add some more tests to confirm stability. For integer compression we need bitpacking / delta / frame of reference. For floating point compression we can investigate ALP, although this is a lower priority (it's not clear that ALP can help much with embeddings, as they tend to be rather compressed already).
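To make the integer techniques concrete, here is a minimal sketch of delta, frame of reference and bitpacking in Python. It is illustrative only and is not how Lance implements them; all function names are made up for this example, and the bitpacking assumes the input has already been reduced to small non-negative offsets.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def for_bitpack_encode(values):
    """Frame of reference (subtract the minimum), then pack the offsets
    using only as many bits per value as the largest offset needs."""
    if not values:
        raise ValueError("cannot encode an empty block")
    base = min(values)
    offsets = [v - base for v in values]
    bit_width = max(1, max(offsets).bit_length())
    packed = 0
    for i, off in enumerate(offsets):
        packed |= off << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return base, bit_width, len(values), packed.to_bytes(n_bytes, "little")


def for_bitpack_decode(base, bit_width, count, data):
    """Invert for_bitpack_encode."""
    packed = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [base + ((packed >> (i * bit_width)) & mask) for i in range(count)]


values = [1000, 1003, 1001, 1007]
base, bit_width, count, data = for_bitpack_encode(values)
restored = for_bitpack_decode(base, bit_width, count, data)
```

In this example the four values fit in 3 bits each after subtracting the base, so the payload shrinks to 2 bytes plus the small header (base, bit width, count).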

1/2 IOPs structural encodings & repetition index

Miniblock and zipped structural encodings will give us 1-2 IOPS (1 for fixed-width types and 2 for variable-width types) regardless of how many levels of nesting and repetition are present. This should give us maximum performance for random access.
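As a rough illustration of where the 1 vs. 2 count comes from, the sketch below counts reads against an in-memory "file". It is not the miniblock or zipped layout and does not model nesting or the repetition index; `CountingFile`, `read_fixed` and `read_variable` are hypothetical names, and the offsets-then-data layout is an assumption chosen only to show the access pattern.

```python
import struct


class CountingFile:
    """In-memory stand-in for a file that counts read requests."""

    def __init__(self, data: bytes):
        self.data = data
        self.iops = 0

    def read(self, offset: int, length: int) -> bytes:
        self.iops += 1
        return self.data[offset:offset + length]


def read_fixed(f, row, width):
    """Fixed width: the position is computed from the row number, 1 read."""
    return f.read(row * width, width)


def read_variable(f, row, offsets_start, data_start):
    """Variable width: read the two adjacent u32 offsets, then the bytes."""
    start, end = struct.unpack("<II", f.read(offsets_start + 4 * row, 8))
    return f.read(data_start + start, end - start)


# Fixed-width column of four u32 values.
fixed = CountingFile(struct.pack("<4I", 10, 20, 30, 40))
fixed_value = struct.unpack("<I", read_fixed(fixed, 2, 4))[0]

# Variable-width column: offsets [0, 1, 4, 6] followed by the string bytes.
strings = [b"a", b"bcd", b"ef"]
offsets = struct.pack("<4I", 0, 1, 4, 6)
variable = CountingFile(offsets + b"".join(strings))
variable_value = read_variable(variable, 1, 0, len(offsets))
```

After these calls `fixed.iops` is 1 and `variable.iops` is 2, which is the per-value cost the structural encodings aim to preserve even for nested data.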

Complete row based encodings

We introduced packed struct in 2.0. We should introduce a new encoding which handles variable-width types (we can call it packed row or just extend packed struct). In addition, we should make it possible to create a file that is entirely row-major.
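One way to picture a row-major encoding that tolerates a variable-width field is to store the fixed-width fields first and length-prefix the variable one. The sketch below assumes a made-up three-field schema (`user_id`, `score`, `name`) and a made-up byte layout; it is not the packed struct encoding from 2.0 or a proposal for the actual packed row layout.

```python
import struct

# Fixed-width part of each row: int64 id, float64 score, uint32 name length.
ROW_HEAD = struct.Struct("<qdI")


def pack_row(user_id: int, score: float, name: str) -> bytes:
    """Serialize one row: fixed-width header followed by the UTF-8 name."""
    name_bytes = name.encode("utf-8")
    return ROW_HEAD.pack(user_id, score, len(name_bytes)) + name_bytes


def unpack_rows(buf: bytes):
    """Walk the buffer row by row, using each length prefix to find the next row."""
    rows, pos = [], 0
    while pos < len(buf):
        user_id, score, name_len = ROW_HEAD.unpack_from(buf, pos)
        pos += ROW_HEAD.size
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        rows.append((user_id, score, name))
    return rows


table = [(1, 0.5, "ada"), (2, 1.25, ""), (3, 2.0, "grace hopper")]
packed = b"".join(pack_row(*row) for row in table)
restored = unpack_rows(packed)
```

All fields of a row sit next to each other, so fetching a whole row is a single contiguous read once its start position is known. The trade-off is that rows no longer have a uniform size, so locating row N needs some form of index.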

Simplified priority

The logic for calculating priority (for backpressure) is pretty complicated in 2.0. I believe we have it correct now, but we have to do some expensive calculations (e.g. binary searches into list offsets) to calculate the priority correctly, and there is quite a bit of complexity to handle some corner cases. In 2.1 we will simplify things by always recording the top-level row number on each page. As far as I'm aware, this will be the only change to the file format itself (as opposed to the encodings).
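The difference can be sketched as follows, assuming the priority of a page is simply the top-level row it starts at. The `Page` class and its `first_row` field are hypothetical and only stand in for "the top-level row number recorded on each page"; they are not the actual metadata layout.

```python
from bisect import bisect_right
from dataclasses import dataclass


def row_of_item(list_offsets, item_index):
    """2.0-style lookup: binary search the list offsets to find which
    top-level row contains a given item of a list column."""
    return bisect_right(list_offsets, item_index) - 1


@dataclass
class Page:
    column: str
    first_item: int
    first_row: int  # hypothetical: recorded at write time in 2.1


# Row 0 holds items 0..2, row 1 is empty, row 2 holds 3..6, row 3 holds 7..9.
list_offsets = [0, 3, 3, 7, 10]

# Without a recorded row number, a page of list items starting at item 5
# needs a search to learn that it belongs to top-level row 2.
searched_row = row_of_item(list_offsets, 5)

# With the row number recorded, ordering pages across columns is a plain sort.
pages = [
    Page("tags.items", first_item=5, first_row=2),
    Page("id", first_item=0, first_row=0),
    Page("tags.items", first_item=0, first_row=0),
    Page("id", first_item=3, first_row=3),
]
schedule = sorted(pages, key=lambda p: p.first_row)
```

The search also has to skip empty lists correctly (row 1 above), which is one example of the corner cases that disappear when the row number is stored directly.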

Potential extra features which we may tackle as opportunity permits but are not part of the focus:

Run length encoding (using repetition index)
Enhanced I/O schedulers for NVMe (e.g. io_uring) and RAM (e.g. fully synchronous)

This issue is an umbrella issue that will cover a number of tasks to achieve the above goals.

Tasks:

Better Compression

New Structural Encodings

Row Based Encodings

Allow packing of variable length columns #2862

Simplified Priority

Add "priority" concept to v2 file format #2863
