Implement vectorized hashing for DictionaryArray types #812

Merged on Aug 4, 2021 (3 commits)

@alamb alamb commented Aug 2, 2021

Which issue does this PR close?

closes #821
Re #790

Rationale for this change

In order to implement the GroupBy approach described in #790, hashing needs to support all types that the existing grouping operator does, including DictionaryArrays.

Support for hashing dictionaries is also necessary (but I don't think sufficient) to support joining on DictionaryArray columns.

What changes are included in this PR?

Implement hashing for DictionaryArray types by hashing the values in the dictionary
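
To illustrate the idea, here is a minimal sketch using only the standard library. It is not the code from this PR: `DictColumn`, `hash_one`, and `hash_dictionary` are hypothetical stand-ins for a dictionary-encoded column and its hasher. Each distinct dictionary value is hashed once, and each row then looks up its hash through its key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single value with the standard library's default hasher.
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical stand-in for a dictionary-encoded string column:
/// each row stores a key (an index into `values`), or `None` for null.
struct DictColumn<'a> {
    keys: Vec<Option<usize>>,
    values: Vec<&'a str>,
}

/// Write one hash per row into `hashes`.
fn hash_dictionary(col: &DictColumn<'_>, hashes: &mut [u64]) {
    // Hash each distinct dictionary value exactly once.
    let dict_hashes: Vec<u64> = col.values.iter().map(|v| hash_one(v)).collect();

    // Each row reuses the precomputed hash of the value its key points at.
    for (hash, key) in hashes.iter_mut().zip(col.keys.iter()) {
        if let Some(idx) = key {
            *hash = dict_hashes[*idx];
        } // null rows are left untouched in this sketch
    }
}
```

With keys `[0, 1, null, 0]` and values `["a", "b"]`, rows 0 and 3 receive the same hash, and the string "a" is hashed only once regardless of how many rows refer to it.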

Are there any user-facing changes?

No (not yet), and no API changes

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 2, 2021
@houqp houqp added the enhancement New feature or request label Aug 3, 2021
))
})?;

*hash = if multi_col {
Contributor

For efficiency, the check for multi_col would be better done outside of the loop.

Contributor Author

good idea -- done ✅
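
As a rough illustration of the suggestion, the sketch below branches on `multi_col` once and then runs a loop with no per-row branch. The names `combine_hashes` and `update_hashes`, and the combiner formula itself, are assumptions made for this example, not the code that was merged.

```rust
/// Hypothetical combiner used only for this illustration.
fn combine_hashes(value_hash: u64, previous: u64) -> u64 {
    previous.wrapping_mul(31).wrapping_add(value_hash)
}

/// Fold the per-row hashes of one column into the running `hashes` buffer.
fn update_hashes(value_hashes: &[u64], hashes: &mut [u64], multi_col: bool) {
    // `multi_col` does not change between rows, so test it once up front.
    if multi_col {
        for (hash, value_hash) in hashes.iter_mut().zip(value_hashes) {
            *hash = combine_hashes(*value_hash, *hash);
        }
    } else {
        for (hash, value_hash) in hashes.iter_mut().zip(value_hashes) {
            *hash = *value_hash;
        }
    }
}
```

The compiler may perform this hoisting on its own in some cases, but writing it explicitly makes the intent clear and keeps each loop body trivial.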

Contributor

Another possible improvement for later: instead of using iter(), which returns Option<T>, optimize for batches containing no nulls, etc.
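
A minimal sketch of what such a fast path might look like, assuming a hypothetical `Column` type whose `validity` is `None` when the batch has no nulls. This is not Arrow's actual API; it only shows the shape of the optimization.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single u64 with the standard library's default hasher.
fn hash_one(value: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical column: `validity` is `None` when no row is null.
struct Column {
    values: Vec<u64>,
    validity: Option<Vec<bool>>,
}

fn hash_column(col: &Column, hashes: &mut [u64]) {
    match &col.validity {
        // Fast path: no nulls, so no per-row Option handling is needed.
        None => {
            for (hash, value) in hashes.iter_mut().zip(&col.values) {
                *hash = hash_one(*value);
            }
        }
        // Slow path: consult the validity flag for every row.
        Some(validity) => {
            for ((hash, value), is_valid) in
                hashes.iter_mut().zip(&col.values).zip(validity)
            {
                if *is_valid {
                    *hash = hash_one(*value);
                } // null rows are left untouched in this sketch
            }
        }
    }
}
```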

))
})?;
*hash = dict_hashes[idx]
} // no update for Null, consistent with other hashes
Contributor

Wondering now whether this is actually good for some edge cases, as it might make the hash of values from two columns, for example NULL,1 and 1,NULL, the same regardless of order. It is probably better to set it to some fixed value and let it participate in hashing.

Contributor Author

Yes, I agree in general the hashing of nulls needs some more thought. I think it is best to have dictionary hashing be consistent with other types but I think the hashing of nulls for everything should be considered.

I have filed #822 to track the issue
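
The edge case raised in this thread can be reproduced with a small self-contained sketch. Everything here is an assumption for illustration: `hash_value`, `combine`, and `NULL_HASH` are hypothetical and are not the functions used in the crate. The sketch also assumes an order-sensitive combiner; with a combiner that is symmetric in its inputs, a fixed null value alone would not distinguish the two rows.

```rust
/// Arbitrary fixed value standing in for "the hash of NULL" (an assumption).
const NULL_HASH: u64 = 0x517C_C1B7_2722_0A95;

/// Toy multiplicative hash, used so the example is deterministic.
fn hash_value(v: u64) -> u64 {
    v.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Hypothetical order-sensitive combiner.
fn combine(acc: u64, h: u64) -> u64 {
    acc.wrapping_mul(31).wrapping_add(h)
}

/// Hash one row across its columns.
/// `null_hash = None` skips null cells; `Some(n)` folds `n` in for each null.
fn hash_row(row: &[Option<u64>], null_hash: Option<u64>) -> u64 {
    let mut acc = 0u64;
    for cell in row.iter().copied() {
        match (cell, null_hash) {
            (Some(v), _) => acc = combine(acc, hash_value(v)),
            (None, Some(n)) => acc = combine(acc, n),
            (None, None) => {} // null skipped: the row hash is not updated
        }
    }
    acc
}
```

When nulls are skipped, `hash_row(&[None, Some(1)], None)` and `hash_row(&[Some(1), None], None)` are equal, because each row only ever folds in the hash of 1. Passing `Some(NULL_HASH)` makes the null participate, and the two rows then hash differently under this combiner.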

"Unsupported data type in hasher".to_string(),
));
return Err(DataFusionError::Internal(format!(
"Unsupported data type in hasher: {}",
Contributor

👍

@Dandandan Dandandan left a comment

Looks great!

alamb commented Aug 4, 2021

Verified that this change fixes #821: see #821 (comment)

@alamb alamb merged commit a5a58c4 into apache:master Aug 4, 2021
@alamb alamb deleted the alamb/hash_dictionaries branch August 4, 2021 16:32
Successfully merging this pull request may close these issues.

Internal error : Unsupported data type in hasher