-
Notifications
You must be signed in to change notification settings - Fork 796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add union_extract
kernel
#6387
Add union_extract
kernel
#6387
Conversation
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
1 similar comment
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
I've started reviewing this one, but it's going to take me a bit (learning) to make it thru. Thank you for the documentation and testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the well documented code.
.iter() | ||
.find(|field| field.1.name() == target) | ||
.ok_or_else(|| { | ||
ArrowError::InvalidArgumentError(format!("field {target} not found on union")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for discussion: this behavior matches the duckdb implementation of union_extract, whereas the postgres json/jsonb operators ->>
and #>
would return NULL.
arrow-select/src/union_extract.rs
Outdated
let set_bits = set_bits - set_bits % 64; | ||
|
||
let mut buffer = | ||
MutableBuffer::new(bit_util::ceil(type_ids.len(), 64) * 8).with_bitset(set_bits / 8, val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this bit-math needed, since the MutableBuffer::new
uses bit_util::round_upto_multiple_of_64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure relative to which part, will explain both:
Without set_bits = set_bits - set_bits % 64
, if the type_ids
len is already a multiple of 64 and aligned, and set_bits
is not multiple of 64, buffer.extend
below would push an additional partial packed mask, requiring a buffer grow. Even with EQ_SCALAR_CHUNK_SIZE = 512
, set_bits
can be a non multiple of 64 if the first bit flip happens in the last chunk, which is potentially smaller than 512
Relative to bit_util::ceil(type_ids.len(), 64) * 8
, I didn't realized that in MutableBuffer::new/with_capacity
, so yes, using 64 as divisor and then * 8
is redundant, removed it. Thanks!
I also noticed that buffer.truncate
is unnecessary so I removed it too. BooleanBuffer::new
uses the correct len from type_ids
arrow-select/src/union_extract.rs
Outdated
enum BoolValue { | ||
Scalar(bool), | ||
Buffer(BooleanBuffer), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document BoolValue? Something along of lines of Scalar == all same existence (e.g. all is, or all is not, the target type_id), and Buffer is target type_id mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
arrow-select/src/union_extract.rs
Outdated
// checks a common form of non sequential offsets, when sequential nulls reuses the same value, | ||
// pointed by the same offset, while valid values offsets increases one by one | ||
// this also checks if the last chunk/remainder is sequential relative to the first offset | ||
if offsets[0] + offsets.len() as i32 - 1 != offsets[offsets.len() - 1] { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't quite get the first explanation.
But the second reason (that it checks the remainder when we have a single remainder element past the last chunk boundary) makes sense -- and is covered in the tests. 👍🏼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
#[test] | ||
fn test_is_sequential() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add assert!(!is_sequential(&[1, 2, 3, 5, 6, 7]));
to demonstrate that it checks the incr at the chunk boundary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Added with a few more variations too.
} | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also confirmed branch test coverage with cov. 👍🏼
I pushed like dozens of faster ways of doing the same thing, felt obligated to document it 😅 Thanks @wiedld for your review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll plan to merge this tomorrow unless there are any other comments or anyone else wants to comment |
Looks like the CI failure is not related -- https://github.com/apache/arrow-rs/actions/runs/10984024899/job/30494124939?pr=6387 |
I restarted the failed CI job https://github.com/apache/arrow-rs/actions/runs/10984024899/job/30494124939?pr=6387 I think it will pass now that #6437 is merged |
I took the liberty of merging up from |
This CI run looks entirely unrelated - I will close/reopen this PR to retrigger and see if it is still happening |
integration CI is failing due to #6448 |
#6448 is fixed, so merging it in |
Which issue does this PR close?
Closes #6386
Rationale for this change
What changes are included in this PR?
union_extract
kernel implementation, docs and testsAre there any user-facing changes?
A new public kernel
union_extract
is added