Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: fsst compression with mini-block #3121

Merged
merged 7 commits into from
Nov 13, 2024

Conversation

broccoliSpicy
Copy link
Contributor

This PR tries to integrate mini-block page layout with FSST compression.

During compression, it first FSST compresses the input data then write out the data use BinaryMiniBlockEncoder.
During decompression, it first uses BinaryMiniBlockDecompressor to decode the raw data read from disk, it then applies FSST decompression.

@github-actions github-actions bot added the enhancement New feature or request label Nov 12, 2024
Copy link

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@codecov-commenter
Copy link

codecov-commenter commented Nov 12, 2024

Codecov Report

Attention: Patch coverage is 87.71930% with 14 lines in your changes missing coverage. Please review.

Project coverage is 77.87%. Comparing base (ec76db4) to head (2397c19).

Files with missing lines Patch % Lines
rust/lance-encoding/src/encodings/physical/fsst.rs 85.22% 9 Missing and 4 partials ⚠️
rust/lance-encoding/src/encodings/physical.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3121      +/-   ##
==========================================
+ Coverage   77.19%   77.87%   +0.68%     
==========================================
  Files         240      240              
  Lines       81517    81630     +113     
  Branches    81517    81630     +113     
==========================================
+ Hits        62927    63572     +645     
+ Misses      15383    14831     -552     
- Partials     3207     3227      +20     
Flag Coverage Δ
unittests 77.87% <87.71%> (+0.68%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

);
let max_len = max_len.as_primitive::<UInt64Type>().value(0);

if max_len > 4 && data_size >= 4 * 1024 * 1024 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do more experiments to tune this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. The original 4MIB threshold was very arbitrary.

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice :) Only a few minor suggestions but this looks good, thank you (also, CI will need to pass :))

Comment on lines 793 to +794
.get_stat(Stat::BitWidth)
.expect("FixedWidthDataBlock should have valid bit width statistics");
.expect("FixedWidthDataBlock should have valid `Stat::BitWidth` statistics");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: if you find yourself repeating the same expect statement again and again then maybe it would be worth it to make an expect_stat method which does the get_stat / expect combination.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, I actually didn't know about this, thanks for the suggestion. I will fill a separate PR for this.

let data_size = variable_width_data.get_stat(Stat::DataSize).expect(
"VariableWidth DataBlock should have valid `Stat::DataSize` statistics",
);
let data_size = data_size.as_primitive::<UInt64Type>().value(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could be helpful to have a expect_single_state / get_single_stat method. Then you can just do:

let data_size = variable_width_data.expect_single_stat::<UInt64Type>(Stat::DataSize);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion! I will create a separate PR for this.

);
let max_len = max_len.as_primitive::<UInt64Type>().value(0);

if max_len > 4 && data_size >= 4 * 1024 * 1024 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. The original 4MIB threshold was very arbitrary.

@broccoliSpicy broccoliSpicy changed the title feat: FSST compression with mini-block feat: fsst compression with mini-block Nov 13, 2024
@broccoliSpicy broccoliSpicy merged commit 3ac6d4a into lancedb:main Nov 13, 2024
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants