-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Internal error: Invalid HashJoinExec partition count mismatch 1!=2 when constructing merge plan with 1 CPU #9928
Comments
Thanks for the report @echai58 -- this certainly looks like a bug in DataFusion I think the next step for this issue would be to create a self contained (datafusion only) reproducer |
@echai58 can you post the PhysicalPlan for the failing test, if possible. It might help to produce datafusion only reproducer |
@alamb @mustafasrepo here's the output when running with
Let me know if there is any other log output that would be helpful, thanks! |
According to input plan, and how the join is planned
target partitions value is 1 (single core, works as expected, ok), since CollectLeft planned by default for single partition, otherwise it would be Auto/Partitioned. In the same time, ParquetExec is planned as if there are 146 partitions
normally datafusion plans 1 file group per target partition -- so this plan is already a bit inconsistent, as ParquetExec planned through delta-rs doesn't seem to respect partition limitations (if I'm not mistaken). After initial planning there is optimization phase -- currently DF has some issues with FULL (and some other) joins in CollectLeft mode (should be fixed by #9757), so In the same time increasing number of target partitions allows To sum up:
Any other ideas and comments are welcome. |
Thanks @korowa and @echai58 for detailed analysis. Sorry for the late reply. I will try to generate datafusion only reproducer for this use case with your findings. Then we can try to fix the problem there. The root cause of the problem seems to stem from we allow to insert data larger than the target partition from the source. This seems to cause some inconsistency internally. By adding a RepartitionExec immediately after the source to bring the source desired partitioning might solve the issue. I am not sure though, will post my findings as I progress. |
I could generate a datafusion only reproducer. The following query triggers same error when target partition is 1.
To set target partition to 1, following command can be used:
I understood the problem. It seems that while satisfying hash requirement we do not add hash repartition when target partition is 1 (assuming source has single partition, hence |
Describe the bug
In delta-rs, when constructing a merge plan via Datafusion, we've seen the following error when running in an environment with only 1 CPU. After increasing to 2 CPUs, the issue is resolved. (delta-io/delta-rs#2188)
Internal error: Invalid HashJoinExec partition count mismatch 1!=2
Why is this the case?
To Reproduce
No response
Expected behavior
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: