
feat: add dictionary encoding(draft, for discussion only) #3134

Open
wants to merge 1 commit into
base: main

broccoliSpicy
Contributor

@broccoliSpicy broccoliSpicy commented Nov 18, 2024

This PR tries to support dictionary encoding by integrating it with MiniBlock PageLayout.

The general approach here is:
In a MiniBlock PageLayout, there is an optional dictionary field that stores the dictionary's encoding when the miniblock has a dictionary.

/// A layout used for pages where the data is small
///
/// In this case we can fit many values into a single disk sector and transposing buffers is
/// expensive.  As a result, we do not transpose the buffers but compress the data into small
/// chunks (called mini blocks) which are roughly the size of a disk sector.
message MiniBlockLayout {
  // Description of the compression of repetition levels (e.g. how many bits per rep)
  ArrayEncoding rep_compression = 1;
  // Description of the compression of definition levels (e.g. how many bits per def)
  ArrayEncoding def_compression = 2;
  // Description of the compression of values
  ArrayEncoding value_compression = 3;
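  // Description of the compression of the dictionary (optional, set only if this page has a dictionary)
  ArrayEncoding dictionary = 4;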
}

The rationale is that if we dictionary-encode something, its indices will definitely fall into a MiniBlockLayout.
With this approach we don't need a specific DictionaryEncoding; the dictionary can use any ArrayEncoding.
The dictionary and the indices are each cascaded into another encoding automatically.
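
To make the idea concrete, here is a minimal sketch of dictionary encoding for fixed-width values. It is not the code in this PR: the function names (`dictionary_encode`, `dictionary_decode`, `bits_per_index`) and the choice of `u64` values with `u32` indices are assumptions made for illustration only. It shows why the indices are a good fit for a mini block: with a small dictionary, each index needs only a few bits.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of this PR): dictionary-encode fixed-width values.
/// Returns the distinct values in first-seen order and one index per input value.
fn dictionary_encode(values: &[u64]) -> (Vec<u64>, Vec<u32>) {
    let mut positions: HashMap<u64, u32> = HashMap::new();
    let mut dictionary = Vec::new();
    let mut indices = Vec::with_capacity(values.len());
    for &value in values {
        // Reuse the existing slot, or append the value to the dictionary.
        let index = *positions.entry(value).or_insert_with(|| {
            dictionary.push(value);
            (dictionary.len() - 1) as u32
        });
        indices.push(index);
    }
    (dictionary, indices)
}

/// Hypothetical helper: rebuild the original values from dictionary and indices.
fn dictionary_decode(dictionary: &[u64], indices: &[u32]) -> Vec<u64> {
    indices.iter().map(|&i| dictionary[i as usize]).collect()
}

/// Minimum number of bits needed to address every dictionary slot.
fn bits_per_index(dictionary_len: usize) -> u32 {
    if dictionary_len <= 1 {
        1
    } else {
        usize::BITS - (dictionary_len - 1).leading_zeros()
    }
}
```

In this sketch the dictionary and the indices are two independent arrays, so each one could be handed to whatever further encoding suits it, for example bit-packing the indices down to `bits_per_index` bits each.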

Currently, the dictionary is stored inside the page along with the chunk metadata and chunk data. This is not ideal and is a TODO.

This is a draft for discussing the idea above, so this encoding only supports FixedWidthDataBlock for now; adding support for VariableWidthData should be trivial.

Some performance comparisons with Parquet (no Snappy), using the TPC-H lineitem table at scale factor 10:

| Column Name | Parquet Write Time | Lance Write Time | Parquet Read Time | Lance Read Time | Parquet File Size | Lance File Size | Cardinality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| l_quantity | 1.37s | 1.70s | 2.41s | 0.19s | 43 MiB | 44 MiB | 50 |
| l_extendedprice | 2.93s | 3.61s | 4.90s | 2.55s | 318 MiB | 917 MiB | 1351462 |
| l_discount | 1.49s | 1.65s | 3.15s | 0.16s | 28 MiB | 29 MiB | 11 |
| l_tax | 1.57s | 1.65s | 2.03s | 0.25s | 28 MiB | 29 MiB | 9 |

For l_extendedprice, dictionary encoding is not applied due to its large cardinality.
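
A cardinality check of this kind could look roughly like the sketch below. The function name `should_dictionary_encode` and the `max_cardinality` parameter are hypothetical; the actual cutoff and heuristic used by this PR are not stated here, so treat this only as an illustration of bailing out early once too many distinct values have been seen.

```rust
use std::collections::HashSet;

/// Hypothetical heuristic (not taken from this PR): use dictionary encoding only
/// when the number of distinct values stays at or below `max_cardinality`.
fn should_dictionary_encode(values: &[u64], max_cardinality: usize) -> bool {
    let mut seen = HashSet::new();
    for &value in values {
        seen.insert(value);
        // Stop scanning as soon as the column is clearly too diverse.
        if seen.len() > max_cardinality {
            return false;
        }
    }
    // An empty column has nothing to gain from a dictionary.
    !values.is_empty()
}
```

Under any reasonably small cutoff, columns such as l_quantity (50 distinct values) would pass this check, while l_extendedprice (1351462 distinct values) would be rejected after scanning only a prefix of the column.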

#3123

@github-actions github-actions bot added the enhancement New feature or request label Nov 18, 2024