Lance File Format v2.1 #2856

Open
7 tasks
westonpace opened this issue Sep 11, 2024 · 0 comments

Now that 2.0 is the default, we should avoid making changes to it, even non-breaking feature changes, so that we can work out all the kinks. There are still a number of things we would like to improve, so we will work on a 2.1 release. I'd like to focus on the following things for 2.1:

  • Compression for strings, integers (and possibly floats)

Let's enable FSST in 2.1 and add some more tests to confirm stability. For integer compression we need bitpacking, delta encoding, and frame of reference. For floating point compression we can investigate ALP, although this is a lower priority (it's not clear that ALP can help much with embeddings, as they tend to be rather compressed already).
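To make the integer techniques concrete, here is a minimal Python sketch of frame of reference combined with bitpacking, plus delta encoding. It is an illustration only and not Lance's implementation: the function names, the tuple layout, and the little-endian packing order are all assumptions made for this example.

```python
def for_bitpack_encode(values):
    """Frame of reference + bitpacking (illustrative layout, not Lance's).

    Stores the minimum once, then each (value - minimum) in a fixed bit width.
    """
    if not values:
        raise ValueError("need at least one value")
    reference = min(values)
    residuals = [v - reference for v in values]  # all non-negative
    bit_width = max(max(residuals).bit_length(), 1)
    packed = 0
    for i, r in enumerate(residuals):
        packed |= r << (i * bit_width)  # value i occupies bits [i*w, (i+1)*w)
    n_bytes = (len(values) * bit_width + 7) // 8
    return reference, bit_width, len(values), packed.to_bytes(n_bytes, "little")


def for_bitpack_decode(reference, bit_width, count, payload):
    """Inverse of for_bitpack_encode."""
    packed = int.from_bytes(payload, "little")
    mask = (1 << bit_width) - 1
    return [reference + ((packed >> (i * bit_width)) & mask) for i in range(count)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Delta encoding can produce negative differences, which is one reason to pair it with frame of reference: subtracting the minimum makes every residual non-negative before bitpacking.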

  • 1-2 IOPS structural encodings & repetition index

Miniblock and zipped structural encodings will give us 1-2 IOPS (1 for fixed width types and 2 for variable width types) regardless of how many levels of nesting and repetition are present. This should give us maximum performance for random access.
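As a rough intuition for the 1 versus 2 count, the Python sketch below counts reads against an in-memory buffer. It does not model the miniblock or zipped layouts; it only assumes the generic reason a variable width lookup tends to cost an extra read: the byte range of a value has to be looked up before the value itself can be fetched. `CountingReader`, `build_variable`, and the offsets-then-data layout are hypothetical names and choices for this example.

```python
import struct


class CountingReader:
    """Wraps a byte buffer and counts how many reads are issued."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_at(self, offset: int, length: int) -> bytes:
        self.reads += 1
        return self.data[offset:offset + length]


def read_fixed(reader, row, width=4):
    """Fixed width: the position is row * width, so one read suffices."""
    return struct.unpack("<i", reader.read_at(row * width, width))[0]


def build_variable(values):
    """Hypothetical layout: n + 1 little-endian u64 offsets, then the bytes."""
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    header = struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + b"".join(values), len(header)


def read_variable(reader, row, data_start):
    """Variable width: read the (start, end) offsets, then read the value."""
    start, end = struct.unpack("<2Q", reader.read_at(row * 8, 16))
    return reader.read_at(data_start + start, end - start)
```

In this toy layout neither count depends on how many rows the buffer holds, which is the property the structural encodings aim to preserve under nesting and repetition.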

  • Complete row based encodings

We introduced packed struct in 2.0. We should introduce a new encoding which handles variable width types (we can call it packed row or just extend packed struct). In addition, we should make it possible to create a file that is entirely row-major.
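For readers unfamiliar with the idea, here is a small Python sketch of what row-major packing looks like, first for fixed width fields and then with one variable width field. The schemas, the length-prefix approach, and all function names are assumptions for illustration; the actual packed struct encoding, and whatever packed row ends up being, may be laid out differently.

```python
import struct

# Hypothetical fixed width schema: int32 id, float32 score.
ROW = struct.Struct("<if")


def pack_rows(ids, scores):
    """Store each row's fields next to each other (row-major)."""
    return b"".join(ROW.pack(i, s) for i, s in zip(ids, scores))


def read_row(buf, row):
    """Every row has the same size, so row N starts at N * ROW.size."""
    return ROW.unpack_from(buf, row * ROW.size)


def pack_variable_rows(ids, names):
    """Variable width field: prefix the string bytes with their length.

    Rows no longer share a size, so the start of each row is returned too.
    """
    out, starts = bytearray(), []
    for i, name in zip(ids, names):
        starts.append(len(out))
        encoded = name.encode("utf-8")
        out += struct.pack("<iI", i, len(encoded)) + encoded
    return bytes(out), starts


def read_variable_row(buf, start):
    """Read the id and length prefix, then the string bytes that follow."""
    i, n = struct.unpack_from("<iI", buf, start)
    return i, buf[start + 8:start + 8 + n].decode("utf-8")
```

The second half shows why variable width types need extra work: once rows differ in size, something has to record where each row begins.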

  • Simplified priority

The logic for calculating priority (for backpressure) is pretty complicated in 2.0. I believe we have it correct now, but we have to do some expensive calculations (e.g. binary searches into list offsets) to calculate the priority correctly, and there is quite a bit of complexity to handle some corner cases. In 2.1 we will simplify things by always recording the top-level row number on each page. This will be the only change to the file format itself (i.e. not to the encodings) that I'm aware of.
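The difference between the two approaches can be sketched in Python as follows. This is a simplified model and not the scheduler's actual code: the `Page` class, its field names, and the use of the row number directly as the priority are assumptions for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


def top_level_row_via_offsets(list_offsets, item_index):
    """2.0-style lookup: binary search list offsets for the owning row.

    list_offsets[i] is the index of the first item of row i, and the final
    entry is the total item count. Empty rows repeat an offset.
    """
    return bisect_right(list_offsets, item_index) - 1


@dataclass
class Page:
    first_item: int            # index of the first item stored on the page
    first_top_level_row: int   # hypothetical field recorded at write time


def page_priority(page):
    """2.1-style lookup: the row number is already stored on the page."""
    return page.first_top_level_row


# Rows: 0 -> items 0..2, 1 -> empty, 2 -> items 3..6, 3 -> items 7..9
offsets = [0, 3, 3, 7, 10]
pages = [Page(first_item=7, first_top_level_row=3),
         Page(first_item=0, first_top_level_row=0),
         Page(first_item=3, first_top_level_row=2)]
ordered = sorted(pages, key=page_priority)
```

The empty row in `offsets` is an example of the kind of corner case the search has to get right, and with more levels of nesting the search would have to be repeated per level. Reading a stored row number avoids both.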

Potential extra features which we may tackle as opportunity permits but are not part of the focus:

  • Run length encoding (using repetition index)
  • Enhanced I/O schedulers for NVMe (e.g. io_uring) and RAM (e.g. fully synchronous)

This issue is an umbrella issue that will cover a number of tasks to achieve the above goals.

Tasks:

Better Compression

New Structural Encodings

Row Based Encodings

Simplified Priority
