Implement vectorized hashing for DictionaryArray types #812
))
})?;

*hash = if multi_col {
For efficiency, the check for `multi_col` is better done outside of the loop.
good idea -- done ✅
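To illustrate the suggestion, here is a minimal sketch of hoisting the loop-invariant `multi_col` check so it is tested once per batch rather than once per row. The `combine` function and both `update_*` helpers are hypothetical stand-ins, not the code in this PR, and an optimizing compiler may already perform this transformation on its own.

```rust
/// Hypothetical way to mix two hashes; NOT DataFusion's actual function.
fn combine(l: u64, r: u64) -> u64 {
    l.wrapping_mul(31).wrapping_add(r)
}

/// Branch inside the loop: `multi_col` is tested for every row.
fn update_branch_inside(hashes: &mut [u64], values: &[u64], multi_col: bool) {
    for (hash, v) in hashes.iter_mut().zip(values) {
        *hash = if multi_col { combine(*hash, *v) } else { *v };
    }
}

/// Branch hoisted: `multi_col` is tested once, then a tight loop runs.
fn update_branch_hoisted(hashes: &mut [u64], values: &[u64], multi_col: bool) {
    if multi_col {
        for (hash, v) in hashes.iter_mut().zip(values) {
            *hash = combine(*hash, *v);
        }
    } else {
        for (hash, v) in hashes.iter_mut().zip(values) {
            *hash = *v;
        }
    }
}
```

Both versions produce the same hashes; the hoisted form only changes where the branch is evaluated.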
Another possible improvement later: avoid `iter()`, which returns `Option<T>`, and optimize for batches containing no nulls, etc.
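A rough sketch of what such a null-free fast path could look like, using a hypothetical `Column` type (plain values plus an optional validity mask) instead of real Arrow arrays. With Arrow, the equivalent check would presumably be `null_count() == 0` on the array; `hash_value` here is an arbitrary placeholder hash, not the one used in this PR.

```rust
/// Hypothetical column: values plus an optional validity mask.
/// `validity == None` means the batch contains no nulls.
struct Column {
    values: Vec<u64>,
    validity: Option<Vec<bool>>,
}

/// Placeholder per-value hash (an assumption for this sketch).
fn hash_value(v: u64) -> u64 {
    v.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn hash_column(col: &Column, hashes: &mut [u64]) {
    match &col.validity {
        // Fast path: no nulls, so no per-row Option check is needed.
        None => {
            for (hash, v) in hashes.iter_mut().zip(&col.values) {
                *hash = hash_value(*v);
            }
        }
        // Slow path: rows marked invalid (null) leave the hash unchanged.
        Some(valid) => {
            for ((hash, v), ok) in hashes.iter_mut().zip(&col.values).zip(valid) {
                if *ok {
                    *hash = hash_value(*v);
                }
            }
        }
    }
}
```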
b5b9a9e to 7ee68d4
))
})?;
*hash = dict_hashes[idx]
} // no update for Null, consistent with other hashes
Wondering now whether this is actually good for some edge cases, as it might make the hashes of values from two columns, for example `NULL,1` and `1,NULL`, the same regardless of order => probably better to set it to some fixed value and let it participate in hashing.
Yes, I agree in general the hashing of nulls needs some more thought. I think it is best to have dictionary hashing be consistent with other types but I think the hashing of nulls for everything should be considered.
I have filed #822 to track the issue
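The edge case raised in this thread can be reproduced with a small sketch. Everything here is an assumption made for illustration: `combine` is a hypothetical mixing function (not DataFusion's), rows are modelled as slices of `Option<u64>`, and `NULL_HASH` is an arbitrary constant. In this toy version a real value equal to `NULL_HASH` would still collide with a null, so it only shows the idea of letting nulls participate.

```rust
/// Arbitrary fixed value standing in for "the hash of NULL" (an assumption).
const NULL_HASH: u64 = 0x9E37_79B9_7F4A_7C15;

/// Hypothetical, order-sensitive way to mix two hashes.
fn combine(l: u64, r: u64) -> u64 {
    l.wrapping_mul(31).wrapping_add(r)
}

/// Nulls are skipped: a null column leaves the running hash untouched.
fn hash_row_skip_nulls(row: &[Option<u64>]) -> u64 {
    let mut hash = 0u64;
    for v in row {
        if let Some(v) = v {
            hash = combine(hash, *v);
        }
    }
    hash
}

/// Nulls participate: a null column mixes in the fixed NULL_HASH value.
fn hash_row_with_null_value(row: &[Option<u64>]) -> u64 {
    let mut hash = 0u64;
    for v in row {
        hash = combine(hash, v.unwrap_or(NULL_HASH));
    }
    hash
}
```

With these definitions, the rows `[None, Some(1)]` and `[Some(1), None]` hash identically when nulls are skipped, but differently once the null contributes a fixed value, because the position of the null then affects the result.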
"Unsupported data type in hasher".to_string(),
));
return Err(DataFusionError::Internal(format!(
"Unsupported data type in hasher: {}",
👍
Looks great!
Verified that this change fixes #821: #821 (comment)
Which issue does this PR close?
closes #821
Re #790
Rationale for this change
In order to implement the GroupBy approach described in #790, hashing needs to support all types that the existing grouping operator does, including `DictionaryArray`s.
Also, support for hashing dictionaries is necessary (but I don't think sufficient) to support joining on `DictionaryArray` columns.
What changes are included in this PR?
Implement hashing for `DictionaryArray` types by hashing the values in the dictionary
Are there any user-facing changes?
No (not yet), and no API changes
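As a rough illustration of the approach described above (hash each dictionary value once, then look the hash up by key for every row), here is a minimal sketch in plain Rust. `DictColumn`, `hash_one`, and `hash_dictionary` are hypothetical simplifications and do not use the Arrow `DictionaryArray` API or DataFusion's hasher; null rows are left untouched, matching the behavior noted in the diff comment.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified dictionary column: each row is a key into `values`.
struct DictColumn {
    keys: Vec<Option<usize>>, // None marks a null row
    values: Vec<String>,      // distinct dictionary values
}

/// Placeholder value hash using the standard library's default hasher.
fn hash_one(v: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    v.hash(&mut hasher);
    hasher.finish()
}

fn hash_dictionary(col: &DictColumn, hashes: &mut [u64]) -> Result<(), String> {
    // Hash each distinct dictionary value exactly once.
    let dict_hashes: Vec<u64> = col.values.iter().map(|v| hash_one(v)).collect();

    // Each row then only needs an index lookup.
    for (hash, key) in hashes.iter_mut().zip(&col.keys) {
        if let Some(idx) = key {
            *hash = *dict_hashes
                .get(*idx)
                .ok_or_else(|| format!("dictionary key {} out of range", idx))?;
        } // no update for null rows
    }
    Ok(())
}
```

The benefit is that a column with many rows but few distinct values pays the hashing cost once per distinct value rather than once per row.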