Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573

andygrove · 2022-05-18T20:53:20Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current support for UNION vs UNION ALL is confusing to me, in the context of trying to map this to another query engine's support for these operators.

The logical plan simply has the Union operator and this is currently assumed to represent UNION ALL.
The SQL planner has logic to wrap a union with an aggregate query and that seems like something we would want in the physical plan but not in the logical plan
The DataFrame API does not have a function for performing a regular UNION. It is possible to manually wrap a union in a distinct to achieve this but that might not be obvious to users. Also, the documentation for distinct is incorrect and says that it performs a union

DataFusion Logical Plan

❯ explain select * from foo union select * from foo;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                           |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #a, #b                                                                                             |
|               |   Aggregate: groupBy=[[#a, #b]], aggr=[[]]                                                                     |
|               |     Union                                                                                                      |
|               |       Projection: #foo.a, #foo.b                                                                               |
|               |         TableScan: foo projection=Some([0, 1])                                                                 |
|               |       Projection: #foo.a, #foo.b                                                                               |
|               |         TableScan: foo projection=Some([0, 1])                                                                 |
+---------------+----------------------------------------------------------------------------------------------------------------+

Spark Logical Plan

'Distinct
+- 'Union false, false
   :- 'Project ['a]
   :  +- 'UnresolvedRelation [t1], [], false
   +- 'Project [unresolvedalias(1, None)]
      +- 'UnresolvedRelation [t1], [], false

Spark Physical Plan

*(2) HashAggregate(keys=[a#4], functions=[], output=[a#4])
+- Exchange hashpartitioning(a#4, 200), ENSURE_REQUIREMENTS, [id=#26]
   +- *(1) HashAggregate(keys=[a#4], functions=[], output=[a#4])
      +- Union
         :- LocalTableScan [a#4]
         +- LocalTableScan [1#6]

Describe the solution you'd like
I think what we want is:

Introduce a Distinct operator in the logical plan
Have the DataFrame distinct function create the Distinct operator instead of the aggregate query
The DataFrame API should have functions for both union (existing method representing UNION ALL) and union_distinct
The physical planner should translate Distinct to an aggregate query

Describe alternatives you've considered
Leave things as they are and improve the documentation.

Additional context
For users of DataFusion for SQL query planning, it would be easier to map union/union all to other engines rather than trying to reverse engineer the aggregate query wrapping the union.

The text was updated successfully, but these errors were encountered:

andygrove · 2022-05-19T00:58:03Z

@alamb I would appreciate your insights here when you have time

alamb · 2022-05-19T10:59:53Z

Given the usecase is trying to map Union plans to what Spark does, using an explicit Distinct in the logical plan (that is converted into a GroupBy when doing physical planning makes sense to me 👍

Doing the UNION --> GroupBy transformation during SQL planning was probably a shortcut.

mrob95 · 2022-06-24T20:37:31Z

Will take a look at this

andygrove added enhancement New feature or request sql SQL Planner labels May 18, 2022

andygrove self-assigned this May 18, 2022

andygrove mentioned this issue May 18, 2022

Add DataFrame union_distinct and fix documentation for distinct #2574

Merged

alamb changed the title ~~Improve UX for UNION vs UNION ALL~~ Improve UX for UNION vs UNION ALL (introduce a LogicalPlan::Distinct) May 19, 2022

This was referenced May 23, 2022

[REVIEW] Add support for UNION dask-contrib/dask-sql#542

Merged

[DF] DataFusion Missing Features & Bugs dask-contrib/dask-sql#474

Open

galipremsagar mentioned this issue Jun 1, 2022

[ENH] Improve UNION logic to drop duplicates once Datafusion logical planner is updated dask-contrib/dask-sql#561

Open

mrob95 mentioned this issue Jun 25, 2022

Add LogicalPlan::Distinct #2792

Merged

andygrove closed this as completed in #2792 Jun 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573

Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573

andygrove commented May 18, 2022 •

edited

Loading

andygrove commented May 19, 2022

alamb commented May 19, 2022

mrob95 commented Jun 24, 2022

Improve UX for UNION vs UNION ALL (introduce a LogicalPlan::Distinct) #2573

Improve UX for UNION vs UNION ALL (introduce a LogicalPlan::Distinct) #2573

Comments

andygrove commented May 18, 2022 • edited Loading

DataFusion Logical Plan

Spark Logical Plan

Spark Physical Plan

andygrove commented May 19, 2022

alamb commented May 19, 2022

mrob95 commented Jun 24, 2022

Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573

Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct) #2573

andygrove commented May 18, 2022 •

edited

Loading