Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing tests in master: left_join_using and left_join #235

Closed
andygrove opened this issue May 1, 2021 · 5 comments · Fixed by #238
Closed

Failing tests in master: left_join_using and left_join #235

andygrove opened this issue May 1, 2021 · 5 comments · Fixed by #238
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Member

andygrove commented May 1, 2021

Describe the bug

---- left_join_using stdout ----
thread 'left_join_using' panicked at 'assertion failed: `(left == right)`
  left: `[["11", "a", "z"], ["22", "b", "y"], ["33", "c", "NULL"], ["44", "d", "x"]]`,
 right: `[["11", "a", "z"], ["22", "b", "y"], ["44", "d", "x"]]`', datafusion/tests/sql.rs:1252:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- left_join stdout ----
thread 'left_join' panicked at 'assertion failed: `(left == right)`
  left: `[["11", "a", "z"], ["22", "b", "y"], ["33", "c", "NULL"], ["44", "d", "x"]]`,
 right: `[["11", "a", "z"], ["22", "b", "y"], ["44", "d", "x"]]`', datafusion/tests/sql.rs:1221:5

To Reproduce
Checkout commit c945b03f3a459a5c15f481f9d52819df56e1090c and run cargo test.

Expected behavior
Tests should pass.

Additional context
rustc 1.51.0 (2fd73fabe 2021-03-23)
Running on my 24 core desktop so perhaps this uncovers a race condition?

@andygrove andygrove added the bug Something isn't working label May 1, 2021
@andygrove
Copy link
Member Author

The regression seems to have been introduced in this commit (or at least I start to see it with this commit).

commit 33715747db5a3cc48936c7b26859d4ad5809cde8 (HEAD)
Author: Daniël Heres <danielheres@gmail.com>
Date:   Wed Apr 28 12:28:16 2021 +0200

    [DataFusion] Optimize hash join inner workings, null handling fix (#24)

@Dandandan fyi

@Dandandan
Copy link
Contributor

Hm. I will have a look tomorrow if someone doesn't beat me to it. There is also a known issue with the left join being wrong across multiple batches, so it could that the fix included in this particular commit triggers the issue in those tests

@Dandandan
Copy link
Contributor

I reproduced the bug by explicitly setting the concurrency for those tests to 24.

So here is my hypothesis into what happens:

  • The left join has a wrong implementation in that it will produce rows when they are missing in the right batch, instead of in the entire partition.
  • The referenced commit has some changes to (re)hashing of single columns, which means that columns could end up without any right-side rows.
  • We also use the same hashing code in hash-repartition which means that the 33 row could end up in its "own" partition. In that case, no right batch is being processed, so no row is being generated for 33.

I have a feeling that to fix this in the general case it would be best to "just" fix the left join implementation.

Another option would be maybe to cherry-pick this change which would fix just this test from PR #55:

https://github.com/apache/arrow-datafusion/pull/55/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR904

@jorgecarleitao
Copy link
Member

I was thinking (this may be completely off):

Since nulls are generally ignored on joins, can't we fix the null stuff on joins by a "drop_null" operation on both sides, prior to any hashing and actual join?

@Dandandan
Copy link
Contributor

Dandandan commented May 2, 2021

Thanks @jorgecarleitao

I added an implementation of left join where unmatched left rows are produced at the end of a stream.
I'm not totally sure what you mean, I think we still have to keep track of rows that didn't match any row at the left side.
For inner joins or on the the right/left part of a join for respectively left/right joins, we could indeed add a null filter on the columns, but this would be more of an optimization to push down null filters as far as possible. I think this is something Spark does too.

I think there might be some possible improvements in the current implementation:

  • Use a bitmap structure instead of Vec<bool>. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
  • Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".

@andygrove this also seems to fix the tests in this issue, would be nice if you could confirm this on the

More generally, maybe we should run the sql tests in some different settings (concurrency / optimizations, etc) to do some more exhaustive checking using all of the different configurations / environmental changes.

mustafasrepo pushed a commit that referenced this issue Dec 25, 2023
* Interval dt and mdn cmp

* update

* test added

* extend tests
mustafasrepo pushed a commit that referenced this issue Dec 25, 2023
* Interval dt and mdn cmp

* update

* test added

* extend tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants