
Use RawTable API in hash join #827

Merged · 4 commits · Aug 7, 2021
Conversation

@Dandandan (Contributor) commented Aug 5, 2021

Which issue does this PR close?

Closes #826

Rationale for this change

Simplifies the code by no longer requiring IdHasher and the raw entry API, and allows experimenting with different ways of detecting collisions later (not included in this PR).
Performance appears unchanged on the TPC-H benchmark queries.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 5, 2021
@Dandandan Dandandan requested a review from alamb August 5, 2021 17:33
@alamb (Contributor) commented Aug 5, 2021

I will review this carefully tomorrow

);
}
};
let item = hash_map.0.get_mut(*hash_value, |_| true);
@Dandandan (Contributor, Author) commented:

I'm not so sure this is correct anymore. I think it should be the same as in the previous implementation, but I wonder whether the hash values still need to be checked, as different hash values might end up in the same bucket. Maybe it's easier to test this now with the RawTable API.

@Dandandan (Contributor, Author) commented:

I reverted this to check for equality of hashes.
I found it impossible to test this, as creating two hashes that point to the same bucket is surprisingly hard (I tried 100M values), but it might happen more frequently when there is a lot of data in the hashmap.

impl fmt::Debug for JoinHashMap {
    fn fmt(&self, _f: &mut fmt::Formatter<'_>) -> fmt::Result {
        Ok(())
    }
@Dandandan (Contributor, Author) commented:

Todo

@alamb (Contributor) left a comment:

Looks pretty cool @Dandandan 👍 -- I do wonder about the treatment of collisions however (though this seems no better/worse than master upon second glance)

};
let item = hash_map
    .0
    .get_mut(*hash_value, |(hash, _)| *hash_value == *hash);
@alamb (Contributor) commented:

I don't fully understand how this works in the case of hash collisions. Don't you also have to check whether the existing values in the table (aka the _) are the same as the new row that came in, to ensure two rows don't get conflated just because they have the same hash value?

Although now that I write this, I am not sure how the existing structure handles hash collisions either as it seems to only look at the hash values and not the values in the rows themselves.

I found creating a test for collisions incredibly hard, btw, as I couldn't actually find any u32 values that collided and I blew out the memory on my machine searching for u64 values that collided 😆

@Dandandan (Contributor, Author) commented:

> I don't fully understand how this works in the case of hash collisions. Don't you also have to check if the existing values in the table (aka the _) are the same as new row that came in to ensure two rows don't have the same hash values?

This doesn't check the actual values, but AFAIK it is required to avoid mixing different hash values in the same bucket and finding the wrong bucket for a different hash. I tested it earlier and the case seems extremely rare (I couldn't get a single collision within 100M values), but it might happen more often when there is a lot of data in the hashmap. The problem with having different hash values in the same bucket is that another u64 hash could then point to the wrong bucket in the probing phase.
I couldn't find a way to test this properly yet. The case of hash collisions where the same offsets are present for different u64 hash values has a unit test.

See here for the note explaining that different hashes can end up in the same bucket:
https://docs.rs/hashbrown/0.11.2/src/hashbrown/raw/mod.rs.html#1043-1053

Also, the u64 output of the hasher seems to be of high quality, so it is indeed very hard to get collisions there (at least for single columns; with multiple columns, combine_hashes also affects the quality).
The actual check for whether the values collide for given offsets currently happens elsewhere in the code.

I am not sure whether it makes sense to do the "value equality" check inside the eq parameter of the insert for the hash join. It would probably only add time when there are few actual u64 collisions, as the check has to be done during probing anyway.

@@ -678,7 +685,7 @@ fn build_join_indexes(
     // This possibly contains rows with hash collisions,
     // So we have to check here whether rows are equal or not
     if let Some((_, indices)) =
-        left.raw_entry().from_hash(*hash_value, |_| true)
+        left.0.get(*hash_value, |(hash, _)| *hash_value == *hash)
@alamb (Contributor) commented:

caching the hash in the value is pretty slick

datafusion/src/physical_plan/hash_join.rs (outdated review comment, resolved)
Successfully merging this pull request may close these issues.

Use RawTable API in hash join code
2 participants