-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix count(null) and count(distinct null) #8511
Conversation
} else { | ||
accumulate_indices( | ||
group_indices, | ||
values.nulls(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use logical_nulls
for all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logical_nulls
returns an owned value, so in most cases it will clone, that's why I added a branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm that's a bit unfortunate 🤔
The clone is relatively cheap though, as the buffer holding the bitmap is wrapped in Arc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, then we can use it I guess. I assumed that's why they don't want to change it in Arrow.
170f528
to
9d3a3e9
Compare
if values.data_type() == &DataType::Null { | ||
values.len() | ||
} else { | ||
values.null_count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be all based on logical_nulls
as well
e7fadf6
to
23097f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this contribution @joroKr21 -- the basic idea is great. I am just worried about the use of the logical_nulls
method as it copies things under the covers that may cause performance slowdowns.
I offered an alternate suggestion -- let me know what you think
@@ -198,16 +198,18 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize { | |||
if values.len() > 1 { | |||
let result_bool_buf: Option<BooleanBuffer> = values | |||
.iter() | |||
.map(|a| a.nulls()) | |||
.map(|a| a.logical_nulls()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am slightly worried about the need to allocate a new null buffer each time, even for arrays when we could just use the exising one
This is particularly concerning given this is on the critical path in aggregates
I reviewed the logical_nulls
method --
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls and I see the issue is that it returns an owned Option
What would you think about implemeting a method in DataFusion
that avoids the copy if it is not necessary, like
fn logical_nulls(arr: &ArrayRef) -> Cow<'_, Option<BooleanBuffer>> {
}
That only creates the nulll buffer for NullArrays
?
Then we can propose upstreaming that back to arrow-rs to avoid the potential performance issue
I know the Cow thing is not always the easiest to make happen -- if you need help I can try and find time to help code it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how I would implement this outside of the Array
trait while ensuring that all cases are covered. Originally I had some branching logic based on the datatype but removed it after the discussion here: #8511 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah @alamb in the end NullBuffer
has Arc<Bytes>
so it mostly clones this + a few usize
s etc. While not ideal I don't think it will be very expensive?
https://arrow.apache.org/rust/arrow_buffer/buffer/immutable/struct.Buffer.html
But I like the suggestion of returning a reference or Cow
in arrow-rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not opposed to this PR, but I would prefer to have the Cow
thing. Let me see if I can whip it up quickly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my proposal for improvement: coralogix#221
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @joroKr21 and @Dandandan
It would be great to do some benchmark runs to show this doesn't impact performance, but I think I am over worrying this extra copy of the NullBuffer -- it is 48 bytes and one increment
FWIW I want to be clear I think we can revert cecc493 unless it shows a performance improvement. I am sorry for all the noise |
Oh sorry, I saw your comment too late. I will force push in that case. |
Use `logical_nulls` when the array data type is `Null`.
cecc493
to
9fbe554
Compare
I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend. |
I'll run some now as I have it all setup. I'll post them here when ready |
I ran benchmarks and my conclusion is that this branch doesn't change the performance and any changes are within the level of noise
|
Thanks @joroKr21 |
Use
logical_nulls
when the array data type isNull
.Which issue does this PR close?
Closes #8509.
Rationale for this change
The semantics of
NullArray
in Arrow are confusing: apache/arrow-rs#4838So we have to handle the
Null
data type in a special way.What changes are included in this PR?
Patches in count and count distinct accumulators to handle the
Null
data type or uselogical_nulls
when appropriate.Are these changes tested?
Yes, added SQL tests.
Are there any user-facing changes?
Yes, count and count distinct now behave consistently wrt
null
regardless of the data type.