Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor size estimation of Hashset into a function #8779

Closed
wants to merge 3 commits into from

Conversation

yyy1000
Copy link
Contributor

@yyy1000 yyy1000 commented Jan 7, 2024

Which issue does this PR close?

#8764

Rationale for this change

Title

What changes are included in this PR?

Define a new function in the same file which accept a HashSet as its param and return the estimated buckets size

Are these changes tested?

Yes

Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR @yyy1000 -- the contribution is much apprecaited

@@ -83,6 +83,11 @@ macro_rules! float_distinct_count_accumulator {
}};
}

/// Returns the estimated number of hashbrown hashtables.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could potentially re-use the comments here:

https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L734-L749

I think the value of this PR / issue is to consolidate the logic in datafusion/physical-expr/src/aggregate/count_distinct.rs with the logic in arrow-datafusion/datafusion/physical-plan/src/joins /hash_join.rs with comments explaining the rationale (aka answering @crepererum 's in comments)

To that end what would you think about:

  1. Adding the code to arrow-datafusion/datafusion/physical-plan/src/common.rs
  2. Use the comments from https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L734-L749 to explain the calculation
  3. Change the code in hash_join.rs to use it too

I think this may require changing the signature to something like

/// Estimates the memory allocated by a [`hashbrown::HashTable`].
///
/// (add explanation about size calculation here)
///
/// Note a [`hashbrown::HashSet`] is implemented as a HashTable with a zero sized key
pub fn estimated_hashtable_size<T>(table: &HashTable<T, RandomState>) -> usize {
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 100% cool with the changes!
Thank you all for the review. :)

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 8, 2024

Ah, I have several questions that need help. :)

  1. The code in hash_join.rs would calculate the number of rows before Estimation of memory size so I can't pass a HashTable as a param into the function, https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L717 I wonder whether it's possible to change the signature like
pub fn estimated_hashtable_size<T>(len: usize) -> usize {
    (len.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two()
}

so passing the len of a hashtable would be possible.

  1. physical-plan seems depends on physical-expr, so adding the code to arrow-datafusion/datafusion/physical-plan/src/common.rs and add dependency in arrow-datafusion/datafusion/physical-expr/Cargo.toml will lead cyclic package dependency issue on my side.

@alamb
Copy link
Contributor

alamb commented Jan 9, 2024

I wonder whether it's possible to change the signature like

Indeed -- I think that sounds like a good plan

I think your implementation will somewhere have to use the relative sizes of key and values -- so maybe something like

pub fn estimated_hashtable_size<K, V>(len: usize) -> usize {
...
}

(I think some of the calculations make assumptions on the size of keys -- like in hash join maybe that they are u64 🤔 )

If this PR is getting too complicated, perhaps it was a poor choice to suggest as a good first project. I thought it was going to be a matter of refactoring 3 copies of code into a single function but that appears not to be the case

physical-plan seems depends on physical-expr, so adding the code to arrow-datafusion/datafusion/physical-plan/src/common.rs and add dependency in arrow-datafusion/datafusion/physical-expr/Cargo.toml will lead cyclic package dependency issue on my side.

Perhaps we could move the size calculation into datafusion_common (or into physical-expr)

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 9, 2024

I think your implementation will somewhere have to use the relative sizes of key and values -- so maybe something like

pub fn estimated_hashtable_size<K, V>(len: usize) -> usize {
...
}

Maybe we don't need the 'key' and 'value' if simply calculate the size by

pub fn estimated_hashtable_size(len: usize) -> usize {
    (len.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two()
}

What do you think? 🤔

If this PR is getting too complicated, perhaps it was a poor choice to suggest as a good first project.

No worry, I can finish it. 😎

Perhaps we could move the size calculation into datafusion_common (or into physical-expr)

Good idea!

@alamb
Copy link
Contributor

alamb commented Jan 9, 2024

Maybe we don't need the 'key' and 'value' if simply calculate the size by

I don't understand how that calculation can capture the size of the hash table without taking into account the sizes of keys and values. Maybe the 8 is the implicit size of the values for the hashset (e.g. sizeof(u64) == 8) 🤔

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 9, 2024

I don't understand how that calculation can capture the size of the hash table without taking into account the sizes of keys and values. Maybe the 8 is the implicit size of the values for the hashset (e.g. sizeof(u64) == 8) 🤔

Maybe 8 is just because it 'require 1/8 buckets to be empty (87.5% load)', we will divide it by 7 later. https://github.com/rust-lang/hashbrown/blob/f2e62124cd947b5e2309dd6a24c7e422932aae97/src/raw/mod.rs#L206-L211

And the size of values are calculated later, here is when using sizeof(u64) + sizeof(u64) https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L745-L750C1

If this function just estimate the buckets, it will not use the 'key' and 'value'.
If this function will calculate the whole size of a hashtable/hashset, it will use use 'key' and 'value', but if this, it seems that 'hash_join.rs' can't use this function easily. 🤔

@alamb
Copy link
Contributor

alamb commented Jan 28, 2024

BTW when I was working on #9025 I remember this PR

I wonder if we can use https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/proxy/trait.RawTableAllocExt.html here which may already have the accounting we need

@yyy1000
Copy link
Contributor Author

yyy1000 commented Jan 28, 2024

I checked some code and it seems that it would involve changing some API calls to leverage accounting. 🤔
For example, here JoinHashMap is created after calculating the estimated_hastable_size.
I can inspect them then. https://github.com/apache/arrow-datafusion/blob/f4fc2639f1d9d1f4dbc73d39990a83f6bf7a725f/datafusion/physical-plan/src/joins/hash_join.rs#L753-L759

@alamb
Copy link
Contributor

alamb commented Feb 2, 2024

This PR appears to be stalled.

I am trying to go through old PRs and make sure we don't lose any. Marking as draft so it isn't on the review queue. Please feel free to reopen / mark as ready for review if it is

@alamb alamb marked this pull request as draft February 2, 2024 21:50
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 13, 2024
@github-actions github-actions bot closed this Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants