feat: Support duplicate column names in Joins in Substrait consumer #11049

Blizzara · 2024-06-21T13:57:50Z

Which issue does this PR close?

Closes #10815 , and also adds support for joining two VirtualTables with overlapping column names.

Rationale for this change

Substrait doesn't contain aliases inside the plan: neither table aliases (SubqueryAlias), nor column aliases. Columns are only named at the beginning and end of the plan, while tables are not named by substrait at all. Substrait referes to columns only by their relative indices, so that's name-independent.

When we consumer Substrait plans' ReadRels, the created DataFusion tables may or may not have qualifiers. If the tables are NamedTable reads, then DataFusion uses the name as the qualifier. If the tables are VirtualTables, then we have unqualified tables.

This causes trouble in some cases for DataFusion, as DF requires columns to be uniquely named in at least two places (there’s probably more but these are the ones we’ve hit so far):

when joining schemas, in DFSchema::check_names. This throws when there are two duplicate unqualified names, or if a unqualified name matches the name of a qualified name. This was problematic for joining VirtualTables since they would be unqualified.
Interestingly, the (1) does NOT throw when there are two duplicate qualified names. That's why the existing tests were working. However, this results in a schema that contains duplicate fields, which then fail if those fields are actually trying to be used - that's Projects require unique expressions names error in substrait producer/consumer #10815

What changes are included in this PR?

I tried first a slightly different solution in #11048, where I'd create a unique reference for each table at the beginning. However that solution has some gaps (it might not work if there's a project between join and a read, e.g.) and is a bit nasty overall.

While thinking about it, I concluded there might be a simpler solution: just (re)qualify the tables before joining them. This is what this PR does. We check, before a join, if there are columns that appear both on the left and right sides, and if there are, we requalify both sides (to left and right). As Substrait only cares about the position of the column, the (re)qualifying doesn't affect the rest of the plan (other than the qualifiers of the output columns).

This ended up looking much better to me. The table qualifiers do now change within the plan created from Substrait, but only if necessary to avoid conflicts.

This PR also separately fixes a case in the final schema aliasing where we'd unnecessarily add aliases to columns to drop table qualifiers (ie. if our plan had a column table.a but the Substrait names defines the name to be a, we'd alias that: table.a AS a. But that's unnecessary since we should only care about the name, not the qualifier.

Are these changes tested?

Tested by some stricter roundtrip unit tests, + adapting existing tests.

Are there any user-facing changes?

Plans produced by Substrait consumer may get different qualifiers for output columns. I think that's unlikely to affect anyone since I dunno why people would be looking at those qualifiers.

Otherwise just improved support for Substrait plans.

Blizzara · 2024-06-21T14:00:51Z

datafusion/substrait/src/logical_plan/consumer.rs

+            let new_name = new_field.name();
+            match &new_expr {
+                Expr::Column(c) => {
+                    // If expr is a column reference, alias_if_changed would cause an aliasing if the old expr has a qualifier


Always calling alias_if_changed was causing things like .. table.a AS a in the final project, which is just unnecessary and slightly annoying in some testing cases (but shouldn't matter much in reality since the name is the same).

Blizzara · 2024-06-21T14:04:05Z

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

@@ -628,32 +660,6 @@ async fn qualified_catalog_schema_table_reference() -> Result<()> {
    roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await
 }

-#[tokio::test]


I believe these tests are fully replaced by the stricter roundtrip_self_join tests I added above.

These tests worked before since they hit the case (1) - the tables have overlapping names but they're both qualified, so the DFSchema::check_names() doesn't mind, and they don't read those overlapping columns in the final result.

They would pass with the changes here as well, they'd just have additional SubqueryAliases.

Blizzara · 2024-06-21T14:09:22Z

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

@@ -707,20 +713,17 @@ async fn roundtrip_literal_struct() -> Result<()> {
 #[tokio::test]
 async fn roundtrip_values() -> Result<()> {
    // TODO: would be nice to have a struct inside the LargeList, but arrow_cast doesn't support that currently
-    let values = "(\


These changes are unrelated, but I was annoyed about the two assert inside this test while developing and one of them was failing - so I split them out.

There was no real benefit to using a large list of types for the empty relation.

Blizzara · 2024-06-21T14:09:38Z

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

-    // Test LogicalPlan::EmptyRelation
-    roundtrip(format!("SELECT * FROM (VALUES {values}) LIMIT 0").as_str()).await
+#[tokio::test]
+async fn roundtrip_values_duplicate_column_join() -> Result<()> {


A test for the case (2)

alamb

Thank you @Blizzara -- looks great to me

cc @waynexia

alamb · 2024-06-23T13:46:42Z

datafusion/substrait/src/logical_plan/consumer.rs

@@ -317,16 +317,27 @@ fn rename_expressions(
        .into_iter()


As rename_expressions is getting more feature-full perhaps we can add some doc comments explaining what it is doing (I think it is basically renaming expressions from substrait to match DataFusion's expectations?)

Sure, added some docs in 87829ec

alamb · 2024-06-23T13:47:43Z

datafusion/substrait/src/logical_plan/consumer.rs

@@ -846,6 +861,24 @@ pub async fn from_substrait_rel(
    }
 }

+fn requalify_sides_if_needed(


Perhaps you could add some of the excellent context from this PR's description as docstrings here to help future readers understand what is going on

This is not strictly necessary as I think the comments in the function explain it, but docstrings would still be nice

Added a short version in 87829ec

Blizzara · 2024-06-24T07:24:21Z

Thanks for the review @alamb - I added some docs in 87829ec, and also cleaned/improved stuff a bit.

However I also realized this approach falls short form providing full guarantee of uniqueness either, since the left or right side itself may contain duplicate names with different qualifiers, which then would turn into full duplicates once the side gets aliased. Unfortunately I don't see a way around that currently other than adding a lot more complexity into the renaming to basically qualify each column separately or something. (Or switching DataFusion to use something else than column name to refer to columns, e.g. position or some unique expr ID, but that's probably quite a big change 😅 ) Luckily I don't expect that to be very common case, so maybe we can go with just this for now and then improve on it later if needed.

alamb · 2024-06-24T15:27:24Z

Unfortunately I don't see a way around that currently other than adding a lot more complexity into the renaming to basically qualify each column separately or something. (Or switching DataFusion to use something else than column name to refer to columns, e.g. position or some unique expr ID, but that's probably quite a big change 😅 )

I agree this is a fundamental design difference (identifying columns by string or by offset). I think what you have in this PR is good enough for now, and we can figure out how to improve the situation if it turns out handling duplicated names within a relation is a common case.

alamb

Thanks agian @Blizzara

alamb · 2024-06-24T15:37:22Z

I pushed a commit up to this PR to get CI passing

Blizzara · 2024-06-24T15:39:41Z

I agree this is a fundamental design difference (identifying columns by string or by offset). I think what you have in this PR is good enough for now, and we can figure out how to improve the situation if it turns out handling duplicated names within a relation is a common case.

Yup

I pushed a commit up to this PR to get CI passing

Ah thanks, I hadn't noticed that failure, looks good.

Thanks agian @Blizzara

My pleasure :)

alamb · 2024-06-24T16:17:03Z

🚀

…pache#11049) * add tests for joining tables with same name * alias tables just before joins instead, for a simpler solution * test cleanup * docstrings and some cleanup * Clippy --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

github-actions bot added the substrait label Jun 21, 2024

Blizzara commented Jun 21, 2024

View reviewed changes

Blizzara mentioned this pull request Jun 21, 2024

feat: Support duplicate column names in Substrait consumer #11048

Closed

Blizzara marked this pull request as ready for review June 21, 2024 14:10

Blizzara mentioned this pull request Jun 21, 2024

feat: allow naming/aliasing relations substrait-io/substrait#649

Merged

alamb mentioned this pull request Jun 22, 2024

DataFusion weekly project plan (Andrew Lamb) - June 17, 2024 #10955

Closed

5 tasks

alamb approved these changes Jun 23, 2024

View reviewed changes

Blizzara added 4 commits June 24, 2024 09:15

add tests for joining tables with same name

4065236

alias tables just before joins instead, for a simpler solution

e2115c7

test cleanup

f54925a

docstrings and some cleanup

87829ec

Blizzara force-pushed the avo/substrait-join-qualify-sides branch from f830339 to 87829ec Compare June 24, 2024 07:19

alamb approved these changes Jun 24, 2024

View reviewed changes

Clippy

c6c61d0

alamb merged commit 3ff0bfe into apache:main Jun 24, 2024
23 checks passed

Blizzara mentioned this pull request Jun 26, 2024

Substrait: Add SubqueryAlias support #6867

Open

Blizzara deleted the avo/substrait-join-qualify-sides branch June 28, 2024 07:14

Blizzara mentioned this pull request Jul 8, 2024

fix: When consuming Substrait, temporarily rename clashing duplicate columns #11329

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Support duplicate column names in Joins in Substrait consumer #11049

feat: Support duplicate column names in Joins in Substrait consumer #11049

Blizzara commented Jun 21, 2024 •

edited

Loading

Blizzara Jun 21, 2024

Blizzara Jun 21, 2024

Blizzara Jun 21, 2024

Blizzara Jun 21, 2024

alamb left a comment

alamb Jun 23, 2024

Blizzara Jun 24, 2024

alamb Jun 23, 2024

Blizzara Jun 24, 2024

Blizzara commented Jun 24, 2024 •

edited

Loading

alamb commented Jun 24, 2024

alamb left a comment

alamb commented Jun 24, 2024

Blizzara commented Jun 24, 2024 •

edited

Loading

alamb commented Jun 24, 2024

feat: Support duplicate column names in Joins in Substrait consumer #11049

feat: Support duplicate column names in Joins in Substrait consumer #11049

Conversation

Blizzara commented Jun 21, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Blizzara commented Jun 24, 2024 • edited Loading

alamb commented Jun 24, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb commented Jun 24, 2024

Blizzara commented Jun 24, 2024 • edited Loading

alamb commented Jun 24, 2024

Blizzara commented Jun 21, 2024 •

edited

Loading

Blizzara commented Jun 24, 2024 •

edited

Loading

Blizzara commented Jun 24, 2024 •

edited

Loading