Failing tests in master: left_join_using and left_join #235

andygrove · 2021-05-01T15:45:17Z

Describe the bug

---- left_join_using stdout ----
thread 'left_join_using' panicked at 'assertion failed: `(left == right)`
  left: `[["11", "a", "z"], ["22", "b", "y"], ["33", "c", "NULL"], ["44", "d", "x"]]`,
 right: `[["11", "a", "z"], ["22", "b", "y"], ["44", "d", "x"]]`', datafusion/tests/sql.rs:1252:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- left_join stdout ----
thread 'left_join' panicked at 'assertion failed: `(left == right)`
  left: `[["11", "a", "z"], ["22", "b", "y"], ["33", "c", "NULL"], ["44", "d", "x"]]`,
 right: `[["11", "a", "z"], ["22", "b", "y"], ["44", "d", "x"]]`', datafusion/tests/sql.rs:1221:5

To Reproduce
Checkout commit c945b03f3a459a5c15f481f9d52819df56e1090c and run cargo test.

Expected behavior
Tests should pass.

Additional context
rustc 1.51.0 (2fd73fabe 2021-03-23)
Running on my 24 core desktop so perhaps this uncovers a race condition?

The text was updated successfully, but these errors were encountered:

andygrove · 2021-05-01T16:11:59Z

The regression seems to have been introduced in this commit (or at least I start to see it with this commit).

commit 33715747db5a3cc48936c7b26859d4ad5809cde8 (HEAD)
Author: Daniël Heres <danielheres@gmail.com>
Date:   Wed Apr 28 12:28:16 2021 +0200

    [DataFusion] Optimize hash join inner workings, null handling fix (#24)

@Dandandan fyi

Dandandan · 2021-05-01T17:15:30Z

Hm. I will have a look tomorrow if someone doesn't beat me to it. There is also a known issue with the left join being wrong across multiple batches, so it could that the fix included in this particular commit triggers the issue in those tests

Dandandan · 2021-05-02T10:08:18Z

I reproduced the bug by explicitly setting the concurrency for those tests to 24.

So here is my hypothesis into what happens:

The left join has a wrong implementation in that it will produce rows when they are missing in the right batch, instead of in the entire partition.
The referenced commit has some changes to (re)hashing of single columns, which means that columns could end up without any right-side rows.
We also use the same hashing code in hash-repartition which means that the 33 row could end up in its "own" partition. In that case, no right batch is being processed, so no row is being generated for 33.

I have a feeling that to fix this in the general case it would be best to "just" fix the left join implementation.

Another option would be maybe to cherry-pick this change which would fix just this test from PR #55:

https://github.com/apache/arrow-datafusion/pull/55/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR904

jorgecarleitao · 2021-05-02T10:16:40Z

I was thinking (this may be completely off):

Since nulls are generally ignored on joins, can't we fix the null stuff on joins by a "drop_null" operation on both sides, prior to any hashing and actual join?

Dandandan · 2021-05-02T11:13:46Z

Thanks @jorgecarleitao

I added an implementation of left join where unmatched left rows are produced at the end of a stream.
I'm not totally sure what you mean, I think we still have to keep track of rows that didn't match any row at the left side.
For inner joins or on the the right/left part of a join for respectively left/right joins, we could indeed add a null filter on the columns, but this would be more of an optimization to push down null filters as far as possible. I think this is something Spark does too.

I think there might be some possible improvements in the current implementation:

Use a bitmap structure instead of Vec<bool>. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".

@andygrove this also seems to fix the tests in this issue, would be nice if you could confirm this on the

More generally, maybe we should run the sql tests in some different settings (concurrency / optimizations, etc) to do some more exhaustive checking using all of the different configurations / environmental changes.

* Interval dt and mdn cmp * update * test added * extend tests

andygrove added the bug Something isn't working label May 1, 2021

Dandandan mentioned this issue May 2, 2021

Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238

Merged

Dandandan closed this as completed in #238 May 2, 2021

mustafasrepo pushed a commit that referenced this issue Dec 25, 2023

[Bugfix]: IntervalDayTime and IntervalMonthDayNano Partial Cmp (#235)

f1a075b

* Interval dt and mdn cmp * update * test added * extend tests

mustafasrepo pushed a commit that referenced this issue Dec 25, 2023

[Bugfix]: IntervalDayTime and IntervalMonthDayNano Partial Cmp (#235)

6dbb7fc

* Interval dt and mdn cmp * update * test added * extend tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failing tests in master: left_join_using and left_join #235

Failing tests in master: left_join_using and left_join #235

andygrove commented May 1, 2021 •

edited

Loading

andygrove commented May 1, 2021

Dandandan commented May 1, 2021

Dandandan commented May 2, 2021

jorgecarleitao commented May 2, 2021

Dandandan commented May 2, 2021 •

edited

Loading

Failing tests in master: left_join_using and left_join #235

Failing tests in master: left_join_using and left_join #235

Comments

andygrove commented May 1, 2021 • edited Loading

andygrove commented May 1, 2021

Dandandan commented May 1, 2021

Dandandan commented May 2, 2021

jorgecarleitao commented May 2, 2021

Dandandan commented May 2, 2021 • edited Loading

andygrove commented May 1, 2021 •

edited

Loading

Dandandan commented May 2, 2021 •

edited

Loading