feat: fsst compression with mini-block #3121
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3121      +/-   ##
==========================================
+ Coverage   77.19%   77.87%   +0.68%
==========================================
  Files         240      240
  Lines       81517    81630     +113
  Branches    81517    81630     +113
==========================================
+ Hits        62927    63572     +645
+ Misses      15383    14831     -552
- Partials     3207     3227      +20
rust/lance-encoding/src/encoder.rs
Outdated
);
let max_len = max_len.as_primitive::<UInt64Type>().value(0);

if max_len > 4 && data_size >= 4 * 1024 * 1024 {
I will do more experiments to tune this
Sounds good. The original 4 MiB threshold was very arbitrary.
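For reference, the condition under discussion can be pulled out into a small predicate so the two thresholds are easy to tune in one place. This is only a sketch: the function and constant names are hypothetical and not taken from the PR, and the values simply mirror the `max_len > 4 && data_size >= 4 * 1024 * 1024` check in the diff, which the thread itself calls arbitrary and subject to further experiments.

```rust
// Hypothetical names; the values mirror the check in the reviewed diff.
const MIN_MAX_LEN: u64 = 4; // longest value must exceed this many bytes
const MIN_DATA_SIZE: u64 = 4 * 1024 * 1024; // 4 MiB of total data

/// Returns true when the (assumed) heuristic would pick FSST:
/// values are not all tiny and there is enough data to amortize the cost.
fn should_try_fsst(max_len: u64, data_size: u64) -> bool {
    max_len > MIN_MAX_LEN && data_size >= MIN_DATA_SIZE
}
```

With these values, `should_try_fsst(16, 8 * 1024 * 1024)` is true, while a block whose longest value is 4 bytes, or whose total size is just under 4 MiB, falls back to whatever the encoder would otherwise choose.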
Very nice :) Only a few minor suggestions but this looks good, thank you (also, CI will need to pass :))
  .get_stat(Stat::BitWidth)
- .expect("FixedWidthDataBlock should have valid bit width statistics");
+ .expect("FixedWidthDataBlock should have valid `Stat::BitWidth` statistics");
Minor nit: if you find yourself repeating the same `expect` statement again and again then maybe it would be worth it to make an `expect_stat` method which does the `get_stat` / `expect` combination.
ha, I actually didn't know about this, thanks for the suggestion. I will file a separate PR for this.
let data_size = variable_width_data.get_stat(Stat::DataSize).expect(
    "VariableWidth DataBlock should have valid `Stat::DataSize` statistics",
);
let data_size = data_size.as_primitive::<UInt64Type>().value(0);
Also could be helpful to have an `expect_single_stat` / `get_single_stat` method. Then you can just do:
let data_size = variable_width_data.expect_single_stat::<UInt64Type>(Stat::DataSize);
thanks for the suggestion! I will create a separate PR for this.
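A rough sketch of what the two suggested helpers could look like, using a toy stand-in for the data block. Everything here is an assumption for illustration: `ToyDataBlock` is hypothetical, statistics are stored as plain `u64` slices rather than Arrow arrays, and the non-generic `expect_single_stat` only approximates the `expect_single_stat::<UInt64Type>` call shape proposed above.

```rust
use std::collections::HashMap;

// Stand-in for the statistic keys that appear in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Stat {
    BitWidth,
    DataSize,
}

// Hypothetical toy block: the real data blocks keep Arrow arrays, not Vec<u64>.
struct ToyDataBlock {
    name: &'static str,
    stats: HashMap<Stat, Vec<u64>>,
}

impl ToyDataBlock {
    /// Fallible lookup, playing the role of `get_stat`.
    fn get_stat(&self, stat: Stat) -> Option<&[u64]> {
        self.stats.get(&stat).map(|v| v.as_slice())
    }

    /// `get_stat` + `expect` in one place, with a uniform panic message.
    fn expect_stat(&self, stat: Stat) -> &[u64] {
        self.get_stat(stat).unwrap_or_else(|| {
            panic!("{} should have valid `Stat::{:?}` statistics", self.name, stat)
        })
    }

    /// Convenience for statistics that hold exactly one value.
    fn expect_single_stat(&self, stat: Stat) -> u64 {
        self.expect_stat(stat)[0]
    }
}
```

Call sites then shrink to a single line such as `let data_size = block.expect_single_stat(Stat::DataSize);`, and the panic message is written once instead of at every lookup.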
This PR integrates the mini-block page layout with FSST compression.
During compression, it first FSST-compresses the input data, then writes the compressed data out using `BinaryMiniBlockEncoder`.
During decompression, it first uses `BinaryMiniBlockDecompressor` to decode the raw data read from disk, then applies FSST decompression.
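The ordering of the two stages can be illustrated with a self-contained toy. Nothing below is the PR's code: `toy_compress` / `toy_decompress` are a tiny fixed symbol-table substitution standing in for FSST (assuming ASCII input), and `toy_encode_block` / `toy_decode_block` are a simple length-prefixed framing standing in for `BinaryMiniBlockEncoder` / `BinaryMiniBlockDecompressor`. Only the write order (compress, then encode) and the read order (decode, then decompress) reflect the description above.

```rust
// Toy symbol table: each 2-byte symbol is replaced by a 1-byte code >= 0x80.
// Assumes ASCII input, so codes never collide with literal bytes.
const SYMBOLS: [&[u8]; 3] = [b"th", b"in", b"er"];

fn toy_compress(value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(value.len());
    let mut i = 0;
    while i < value.len() {
        match SYMBOLS.iter().position(|s| value[i..].starts_with(*s)) {
            Some(code) => {
                out.push(0x80 + code as u8);
                i += 2;
            }
            None => {
                out.push(value[i]);
                i += 1;
            }
        }
    }
    out
}

fn toy_decompress(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for &b in bytes {
        if b >= 0x80 {
            out.extend_from_slice(SYMBOLS[(b - 0x80) as usize]);
        } else {
            out.push(b);
        }
    }
    out
}

// Toy framing: value count, then each length (u32 LE), then all value bytes.
fn toy_encode_block(values: &[Vec<u8>]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
    }
    for v in values {
        buf.extend_from_slice(v);
    }
    buf
}

fn read_u32(buf: &[u8], pos: usize) -> usize {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&buf[pos..pos + 4]);
    u32::from_le_bytes(raw) as usize
}

fn toy_decode_block(buf: &[u8]) -> Vec<Vec<u8>> {
    let n = read_u32(buf, 0);
    let mut data_pos = 4 + 4 * n;
    let mut values = Vec::with_capacity(n);
    for i in 0..n {
        let len = read_u32(buf, 4 + 4 * i);
        values.push(buf[data_pos..data_pos + len].to_vec());
        data_pos += len;
    }
    values
}

/// Write path: compress each value first, then frame the compressed bytes.
fn write_page(values: &[&str]) -> Vec<u8> {
    let compressed: Vec<Vec<u8>> =
        values.iter().map(|v| toy_compress(v.as_bytes())).collect();
    toy_encode_block(&compressed)
}

/// Read path: undo the framing first, then decompress each value.
fn read_page(page: &[u8]) -> Vec<String> {
    toy_decode_block(page)
        .iter()
        .map(|v| String::from_utf8(toy_decompress(v)).expect("toy input is ASCII"))
        .collect()
}
```

Round-tripping `["thinner", "winter", "abc"]` through `write_page` and `read_page` returns the original strings. The framing layer never needs to know that its payload is compressed, which is what lets the two stages be layered in this order.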