
Use RawTable API in hash join #827

Merged · 4 commits · Aug 7, 2021
Conversation

@Dandandan (Contributor) commented Aug 5, 2021

Which issue does this PR close?

Closes #826

Rationale for this change

Simplifies the code by no longer requiring IdHasher and the raw entry API, and allows experimenting with different ways of detecting collisions later (not included in this PR).
Performance appears unchanged on the TPC-H benchmark queries.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 5, 2021
@Dandandan Dandandan requested a review from alamb August 5, 2021 17:33
@alamb (Contributor) commented Aug 5, 2021

I will review this carefully tomorrow

);
}
};
let item = hash_map.0.get_mut(*hash_value, |_| true);
@Dandandan (Contributor, Author) commented:

I'm not so sure this is correct anymore. I think it should be the same as in the previous implementation, but I wonder whether the hash values still need to be checked, as different hash values might end up in the same bucket. Maybe it's easier to test this now with the RawTable API.

@Dandandan (Contributor, Author) commented:

I reverted this to check for equality of hashes.
I found it impossible to test this, as creating two hashes that point to the same bucket is surprisingly hard (I tried 100M values), but it might happen more frequently when there is a lot of data in the hashmap.

impl fmt::Debug for JoinHashMap {
    fn fmt(&self, _f: &mut fmt::Formatter<'_>) -> fmt::Result {
        Ok(())
    }
@Dandandan (Contributor, Author) commented:

Todo

@alamb (Contributor) left a comment:

Looks pretty cool @Dandandan 👍 -- I do wonder about the treatment of collisions however (though this seems no better/worse than master upon second glance)

};
let item = hash_map
    .0
    .get_mut(*hash_value, |(hash, _)| *hash_value == *hash);
@alamb (Contributor) commented:

I don't fully understand how this works in the case of hash collisions. Don't you also have to check whether the existing values in the table (aka the _) are the same as the new row that came in, to ensure two rows don't get conflated just because they have the same hash value?

Although now that I write this, I am not sure how the existing structure handles hash collisions either as it seems to only look at the hash values and not the values in the rows themselves.

I found creating a test for collisions incredibly hard, btw, as I couldn't actually find any u32 values that collided and I blew out the memory on my machine searching for u64 values that collided 😆

@Dandandan (Contributor, Author) commented:

> I don't fully understand how this works in the case of hash collisions. Don't you also have to check if the existing values in the table (aka the _) are the same as new row that came in to ensure two rows don't have the same hash values?

This doesn't check the actual values, but AFAIK it is required to avoid mixing different hash values in the same bucket and finding the wrong bucket for a different hash. I tested it earlier and the case seems extremely rare (I couldn't get a single collision within 100M values), but it might happen more often when there is a lot of data in the hashmap. The problem with having different hash values in the same bucket is that another u64 hash could then point to the wrong bucket in the probing phase.
I couldn't find a way to test this properly yet. The case of hash collisions where the same offsets are present for different u64 hash values has a unit test.

See here for the note explaining that different hashes can end up in the same bucket:
https://docs.rs/hashbrown/0.11.2/src/hashbrown/raw/mod.rs.html#1043-1053

Also, the u64 output of the hasher seems to be of high quality, so it is indeed very hard to get collisions there (at least for single columns; with multiple columns, combine_hashes also affects the quality).
The actual check for whether the values collide for given offsets currently happens elsewhere in the code.

I am not sure whether it makes sense to do the "value equality" check inside the eq parameter of the insert for the hash join. It would probably only add time when there are few actual u64 collisions, as the check has to be done during probing anyway.

@@ -678,7 +685,7 @@ fn build_join_indexes(
     // This possibly contains rows with hash collisions,
     // So we have to check here whether rows are equal or not
     if let Some((_, indices)) =
-        left.raw_entry().from_hash(*hash_value, |_| true)
+        left.0.get(*hash_value, |(hash, _)| *hash_value == *hash)
@alamb (Contributor) commented:

caching the hash in the value is pretty slick

datafusion/src/physical_plan/hash_join.rs (outdated review comment, resolved)
Successfully merging this pull request may close these issues.

Use RawTable API in hash join code
2 participants