Refactor size estimation of Hashset into a function #8779
Co-authored-by: Marco Neumann <marco@crepererum.net>
Thank you very much for this PR @yyy1000 -- the contribution is much appreciated
@@ -83,6 +83,11 @@ macro_rules! float_distinct_count_accumulator {
}};
}

/// Returns the estimated number of hashbrown hashtables.
I think we could potentially re-use the comments here:
I think the value of this PR / issue is to consolidate the logic in datafusion/physical-expr/src/aggregate/count_distinct.rs with the logic in arrow-datafusion/datafusion/physical-plan/src/joins/hash_join.rs, with comments explaining the rationale (aka answering @crepererum's questions in comments)
To that end what would you think about:
- Adding the code to arrow-datafusion/datafusion/physical-plan/src/common.rs
- Using the comments from https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L734-L749 to explain the calculation
- Changing the code in hash_join.rs to use it too
I think this may require changing the signature to something like
/// Estimates the memory allocated by a [`hashbrown::HashTable`].
///
/// (add explanation about size calculation here)
///
/// Note a [`hashbrown::HashSet`] is implemented as a HashTable with a zero sized key
pub fn estimated_hashtable_size<T>(table: &HashTable<T, RandomState>) -> usize {
...
}
I'm 100% cool with the changes!
Thank you all for the review. :)
Ah, I have several questions that need help. :)
so passing the len of a hashtable would be possible.
Indeed -- I think that sounds like a good plan.
I think your implementation will somewhere have to use the relative sizes of
pub fn estimated_hashtable_size<K, V>(len: usize) -> usize {
    ...
}
(I think some of the calculations make assumptions on the size of keys -- like in hash join maybe that they are
If this PR is getting too complicated, perhaps it was a poor choice to suggest as a good first project. I thought it was going to be a matter of refactoring 3 copies of code into a single function, but that appears not to be the case.
Perhaps we could move the size calculation into
Maybe we don't need the 'key' and 'value' if we simply calculate the size by
pub fn estimated_hashtable_size(len: usize) -> usize {
    (len.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two()
}
What do you think? 🤔
No worry, I can finish it. 😎
Good idea!
I don't understand how that calculation can capture the size of the hash table without taking into account the sizes of keys and values. Maybe the
Maybe
And the size of the values is calculated later, here is when using
If this function just estimates the buckets, it will not use the 'key' and 'value'.
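To make the two positions above concrete, here is a minimal sketch of how a bucket-only estimate and a full size estimate could be layered. The function names are hypothetical and not taken from the PR. The 7/8 load factor comes from the formula proposed above; the cost of one control byte per bucket is an assumption about hashbrown's layout that should be checked against the comments in hash_join.rs.

```rust
use std::mem::size_of;

/// Estimates the number of buckets needed to hold `len` entries.
/// Assumption: the table keeps roughly 1/8 of its buckets empty and
/// rounds the bucket count up to a power of two.
fn estimated_buckets(len: usize) -> usize {
    (len.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two()
}

/// Estimates the bytes allocated for a table storing `len` entries of type `T`.
/// Assumption: each bucket costs `size_of::<T>()` bytes plus one control byte.
/// `T` would be `(K, V)` for a map, or just the element type for a set.
fn estimated_hashtable_size<T>(len: usize) -> usize {
    let buckets = estimated_buckets(len);
    buckets
        .saturating_mul(size_of::<T>())
        .saturating_add(buckets)
}
```

With this split, the bucket count depends only on `len`, while the entry type enters once, in the byte estimate. For example, `estimated_hashtable_size::<(u64, u64)>(100)` would estimate 128 buckets at 16 bytes each plus 128 control bytes.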
BTW when I was working on #9025 I remembered this PR. I wonder if we can use https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/proxy/trait.RawTableAllocExt.html here, which may already have the accounting we need
I checked some code and it seems that it would involve changing some API calls to leverage
This PR appears to be stalled. I am trying to go through old PRs and make sure we don't lose any. Marking as draft so it isn't on the review queue. Please feel free to reopen / mark as ready for review if it is
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Which issue does this PR close?
#8764
Rationale for this change
Title
What changes are included in this PR?
Define a new function in the same file which accepts a HashSet as its parameter and returns the estimated bucket size
Are these changes tested?
Yes