-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Produce correct answers for Group BY NULL (Option 1) #793
Conversation
"+-----------------+----+", | ||
"| COUNT(UInt8(1)) | c1 |", | ||
"+-----------------+----+", | ||
"| 1 | |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Clippy error is unrelated -- see fix in #794 |
pub(crate) fn create_key( | ||
group_by_keys: &[ArrayRef], | ||
row: usize, | ||
vec: &mut Vec<u8>, | ||
) -> Result<()> { | ||
vec.clear(); | ||
for col in group_by_keys { | ||
create_key_for_col(col, row, vec)? | ||
if !col.is_valid(row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it makes sense to improve performance here, but an optimization might be to check on null-count==0
outside of this function to avoid the is_valid
call and just always add an 0xFF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion.
If you don't mind I would like to spend time on #790 which, if successful, I expect to significantly remove all this code.
I will attempt to add that optimization at a later date.
fn scalar_try_from_dict_datatype() { | ||
let data_type = | ||
DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8)); | ||
let data_type = &data_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥳
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amusingly, supporting this behavior ended up causing a test to fail when I brought the code into IOx and I think I traced the problem to an issue in parquet file statistics: apache/arrow-rs#641 🤣 this was not a side effect I had anticipated
"+-----------------+----+-----+", | ||
"| 1 | | |", | ||
"| 2 | | bar |", | ||
"| 3 | 0 | |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
datafusion/src/scalar.rs
Outdated
// any newly added enum variant will require editing this list | ||
// or else face a compile error | ||
match (self, other) { | ||
(Boolean(v1), Boolean(v2)) => v1.eq(v2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also use ==
instead?
ddf2298
to
b0d834a
Compare
Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀 |
Thanks for the reviews @Dandandan and @jhorstmann ! I plan to wait another day or two to see if anyone else has feedback on this approach, but what I am thinking of doing is merging this PR (after addressing comments) so at least we get correct answers and then work on the more sophisticated implementation in parallel.
Thanks for double checking -- this also worried me a lot so I am glad to hear someone else did a double check too. I convinced myself that since each key had entries for the same columns in the same order, there was no way to concoct the same bytes with different column values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through this carefully, and it looks great! Great work, @alamb !
Rebased given that #786 is merged, so that this PR just shows the delta now |
b0d834a
to
b6c6a3c
Compare
Co-authored-by: Daniël Heres <danielheres@gmail.com>
#808 contains the PR that should give us back any performance we lost in this one |
takes the relevant part out of apache#793 which was ignored by cube maintainers
takes the relevant part out of apache#793 which was ignored by cube maintainers
Which issue does this PR close?
This PR closes #781 and #782;
Built on #786 so review that first.
Rationale for this change
This is a version of the "Alternative" approach described in #790 which I think is the minimum change to GroupByHash to produce the correct answers when grouping on columns that contain nulls. Thanks to @jhorstmann and @Dandandan for the ideas leading to this PR
It will likely reduce the speed of grouping as well as require more memory than the current implementation (though it does get correct answers!)
I created this PR it available for comparison and as a fallback in case I run into trouble or run out of time time trying to implement #790, which I expect will take longer to code and review.
What changes are included in this PR?
ScalarValue
On master keys are created like this:
After this PR, the keys are created as follows (note the two extra bytes, one for each grouping column)
Example of a key without any nulls:
Example of a key with NULL values:
Are there any user-facing changes?
Correct answers!
Benchmark results
The benchmarks show a slight slowdown, which is not unexpected given there is now more work being done