-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support fastlanes bitpacking #2886
Changes from 5 commits
c089644
a1e3cdf
3f340a3
9a6c489
f55a445
7c21438
3b82ec5
f0bd3a8
ce0f798
fb9ede2
3ad773c
403e89d
0ae8362
23e261c
3f92fcd
dba9a48
1eb75e2
1485759
8543f54
ee78fc6
fe3fda8
697af4a
922c2fe
f09cad7
fc89bf4
d5b9201
13b757a
ca4dba3
f42af4c
1c2878b
688bb1f
4d7557f
c7ecb08
dedb306
655a063
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,6 +37,8 @@ snafu.workspace = true | |
tokio.workspace = true | ||
tracing.workspace = true | ||
zstd.workspace = true | ||
fastlanes = "0.1.5" | ||
bytemuck = "=1.18.0" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are these being used? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, they are used in the |
||
|
||
[dev-dependencies] | ||
lance-testing.workspace = true | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -203,6 +203,36 @@ impl LanceBuffer { | |
Self::Borrowed(Buffer::from_vec(vec)) | ||
} | ||
|
||
pub fn reinterpret_to_rust_native<T>(&mut self) -> Result<&[T]> | ||
where | ||
T: Copy, // Ensure `T` can be copied (as needed for safely reinterpreting bytes) | ||
{ | ||
let buffer = self.borrow_and_clone(); | ||
|
||
let buffer = buffer.into_buffer(); | ||
|
||
// Get the raw byte slice from the buffer. | ||
let byte_slice = buffer.as_slice(); | ||
|
||
// Safety check - ensure that the byte slice length is a multiple of `T`. | ||
if byte_slice.len() % std::mem::size_of::<T>() != 0 { | ||
return Err(Error::Internal { | ||
message: "Buffer size is not a multiple of the target type size".to_string(), | ||
location: location!(), | ||
}); | ||
} | ||
|
||
// Reinterpret the byte slice as a slice of `T`. | ||
let typed_slice = unsafe { | ||
std::slice::from_raw_parts( | ||
byte_slice.as_ptr() as *const T, | ||
byte_slice.len() / std::mem::size_of::<T>(), | ||
) | ||
}; | ||
|
||
Ok(typed_slice) | ||
} | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How is this different from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, this is actually the same, I should delete this function There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
/// Reinterprets a LanceBuffer into a Vec<T> | ||
/// | ||
/// Unfortunately, there is no way to do this safely in Rust without a copy, even if | ||
|
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -14,6 +14,8 @@ use snafu::{location, Location}; | |||||||||
use crate::buffer::LanceBuffer; | ||||||||||
use crate::data::DataBlock; | ||||||||||
use crate::encodings::logical::r#struct::StructFieldEncoder; | ||||||||||
use crate::encodings::physical::bitpack_fastlanes::compute_compressed_bit_width_for_non_neg; | ||||||||||
use crate::encodings::physical::bitpack_fastlanes::BitpackedForNonNegArrayEncoder; | ||||||||||
use crate::encodings::physical::block_compress::CompressionScheme; | ||||||||||
use crate::encodings::physical::dictionary::AlreadyDictionaryEncoder; | ||||||||||
use crate::encodings::physical::fsst::FsstArrayEncoder; | ||||||||||
|
@@ -331,6 +333,23 @@ impl CoreArrayEncodingStrategy { | |||||||||
|
||||||||||
Ok(Box::new(PackedStructEncoder::new(inner_encoders))) | ||||||||||
} | ||||||||||
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => { | ||||||||||
let compressed_bit_width = compute_compressed_bit_width_for_non_neg(arrays); | ||||||||||
Ok(Box::new(BitpackedForNonNegArrayEncoder::new( | ||||||||||
compressed_bit_width as usize, | ||||||||||
data_type.clone(), | ||||||||||
))) | ||||||||||
} | ||||||||||
|
||||||||||
// for signed integers, I intend to make it a cascaded encoding, a sparse array for the negative values and very wide(bit-width) values, | ||||||||||
// then a bitpacked array for the narrow(bit-width) values, I need `BitpackedForNeg` to be merged first | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Minor nit: add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||||||||||
DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => { | ||||||||||
let compressed_bit_width = compute_compressed_bit_width_for_non_neg(arrays); | ||||||||||
Ok(Box::new(BitpackedForNonNegArrayEncoder::new( | ||||||||||
compressed_bit_width as usize, | ||||||||||
data_type.clone(), | ||||||||||
))) | ||||||||||
} | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's make sure we only use the bitpacking encoder if the version is 2.1. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||||||||||
_ => Ok(Box::new(BasicEncoder::new(Box::new( | ||||||||||
ValueEncoder::default(), | ||||||||||
)))), | ||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think BitpackedWithNeg` will look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to make it a cascading encoding of
a BTreeMap
ofrow number index
toreal value
for a few very wide(bit-width) values (negative values) thenbitpacking
, for arrays that have too many negative values (for example: 50 percent), I think we should not usebitpacking
on it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory these should still may be bitpackable right? For example if all data is between
[-8, 8]
then we could shift to[0, 16]
and bitpack? Although I suppose frame-of-reference would do that for us 🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think there might be some literatures focusing on how to intuitively combine lightweight integer encoding algorithms, I will research on that.