-
-
Notifications
You must be signed in to change notification settings - Fork 671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add sparse codec #1723
add sparse codec #1723
Conversation
PSeitz
commented
Dec 13, 2022
•
edited
Loading
edited
fbf4e86
to
f7e5efa
Compare
615c0c2
to
7bac90f
Compare
Codecov Report
@@ Coverage Diff @@
## main #1723 +/- ##
========================================
Coverage 94.06% 94.07%
========================================
Files 262 263 +1
Lines 49994 50518 +524
========================================
+ Hits 47028 47525 +497
- Misses 2966 2993 +27
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
// The number of vals so far | ||
let mut offset = 0; | ||
let mut sparse_codec_blocks = Vec::new(); | ||
let num_blocks = u16::from_le_bytes([data[data.len() - 2], data[data.len() - 1]]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to introduce
#[inline(always)]
pub fn read_u16_at_offset(data: &[u8], offset: usize) -> u16 {
assert!(offset+1 < data.len());
unsafe { read_u16_at_offset_unsafe(data, offset) }
}
#[inline(always)]
pub unsafe fn read_u16_at_offset_unsafe(data: &[u8], offset: usize) -> u16 {
unsafe {
let u16_ptr: *const [u8; 2] = data.as_ptr().offset(offset as isize) as *const [u8; 2];
let u16_bytes: [u8; 2] = u16_ptr.read_unaligned();
u16::from_le_bytes(u16_bytes)
}
}
And use it everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would only use unsafe
when we can see the difference in the benchmark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's forget the unsafe part then, we already have a different implementation of this in one place...
Can we extract it as a function mark it as inline and use it everywhere in this file.
|
||
/// Splits a idx into block index and value in the block | ||
fn split_in_block_idx_and_val_in_block(idx: u32) -> (u16, u16) { | ||
let val_in_block = (idx % u16::MAX as u32) as u16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let val_in_block = (idx % u16::MAX as u32) as u16; | |
let val_in_block = (idx % u16::MAX as u32) as u16; |
That's a bug isn't it? u16::MAX is incorrect. We want 1 << 16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed offline, that was a measured decision to deal with the fact that
a len in a block defined that way takes values in [0..u16::MAX]
, so that it cannot be represented as a u16.
let (block_idx, val_in_block) = split_in_block_idx_and_val_in_block(idx); | ||
// There may be trailing nulls without data, those are not stored as blocks. It would be | ||
// possible to create empty blocks, but for that we would need to serialize the number of | ||
// values or pass them when opening |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather do that actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would add some fixed bytes to every sparse codec field
|
||
/// Return the number of non-null values in an index | ||
pub fn num_non_null_vals(&self) -> u32 { | ||
self.blocks.last().map(|block| block.offset).unwrap_or(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We seem to use num_null_vals
or num_non_null_vals
in different places. (Index traits)
I'd rather settle to num_non_null_vals
everywhere... What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, let's stick to that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make the modification in the ValueIndex
trait(s?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments
db4764d
to
f0bec47
Compare
fn value_addr(idx: u32) -> ValueAddr { | ||
/// Static assert number elements per block this method expects | ||
#[allow(clippy::assertions_on_constants)] | ||
const _: () = assert!(ELEMENTS_PER_BLOCK == (1 << 16)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the first time I see this trick. Interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work. Please have a look at clippy comments before merging.
Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Paul Masurel <paul@quickwit.io>
f0bec47
to
7f12317
Compare
7f12317
to
0d1e901
Compare
* add sparse codec * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * Apply suggestions from code review Co-authored-by: Paul Masurel <paul@quickwit.io> * add the -1 u16 fix for metadata num_vals * add dense block encoding to sparse codec * add comment, refactor u16 reading Co-authored-by: Paul Masurel <paul@quickwit.io>