-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support duplicate column names in Joins in Substrait consumer #11049
Conversation
let new_name = new_field.name(); | ||
match &new_expr { | ||
Expr::Column(c) => { | ||
// If expr is a column reference, alias_if_changed would cause an aliasing if the old expr has a qualifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always calling alias_if_changed
was causing things like .. table.a AS a
in the final project, which is just unnecessary and slightly annoying in some testing cases (but shouldn't matter much in reality since the name is the same).
@@ -628,32 +660,6 @@ async fn qualified_catalog_schema_table_reference() -> Result<()> { | |||
roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these tests are fully replaced by the stricter roundtrip_self_join
tests I added above.
These tests worked before since they hit the case (1) - the tables have overlapping names but they're both qualified, so the DFSchema::check_names()
doesn't mind, and they don't read those overlapping columns in the final result.
They would pass with the changes here as well, they'd just have additional SubqueryAliases.
@@ -707,20 +713,17 @@ async fn roundtrip_literal_struct() -> Result<()> { | |||
#[tokio::test] | |||
async fn roundtrip_values() -> Result<()> { | |||
// TODO: would be nice to have a struct inside the LargeList, but arrow_cast doesn't support that currently | |||
let values = "(\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are unrelated, but I was annoyed about the two assert inside this test while developing and one of them was failing - so I split them out.
There was no real benefit to using a large list of types for the empty relation.
// Test LogicalPlan::EmptyRelation | ||
roundtrip(format!("SELECT * FROM (VALUES {values}) LIMIT 0").as_str()).await | ||
#[tokio::test] | ||
async fn roundtrip_values_duplicate_column_join() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A test for the case (2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -317,16 +317,27 @@ fn rename_expressions( | |||
.into_iter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As rename_expressions
is getting more feature-full perhaps we can add some doc comments explaining what it is doing (I think it is basically renaming expressions from substrait to match DataFusion's expectations?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, added some docs in 87829ec
@@ -846,6 +861,24 @@ pub async fn from_substrait_rel( | |||
} | |||
} | |||
|
|||
fn requalify_sides_if_needed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you could add some of the excellent context from this PR's description as docstrings here to help future readers understand what is going on
This is not strictly necessary as I think the comments in the function explain it, but docstrings would still be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a short version in 87829ec
f830339
to
87829ec
Compare
Thanks for the review @alamb - I added some docs in 87829ec, and also cleaned/improved stuff a bit. However I also realized this approach falls short form providing full guarantee of uniqueness either, since the left or right side itself may contain duplicate names with different qualifiers, which then would turn into full duplicates once the side gets aliased. Unfortunately I don't see a way around that currently other than adding a lot more complexity into the renaming to basically qualify each column separately or something. (Or switching DataFusion to use something else than column name to refer to columns, e.g. position or some unique expr ID, but that's probably quite a big change 😅 ) Luckily I don't expect that to be very common case, so maybe we can go with just this for now and then improve on it later if needed. |
I agree this is a fundamental design difference (identifying columns by string or by offset). I think what you have in this PR is good enough for now, and we can figure out how to improve the situation if it turns out handling duplicated names within a relation is a common case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks agian @Blizzara
I pushed a commit up to this PR to get CI passing |
Yup
Ah thanks, I hadn't noticed that failure, looks good.
My pleasure :) |
🚀 |
…pache#11049) * add tests for joining tables with same name * alias tables just before joins instead, for a simpler solution * test cleanup * docstrings and some cleanup * Clippy --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #10815 , and also adds support for joining two VirtualTables with overlapping column names.
Rationale for this change
Substrait doesn't contain aliases inside the plan: neither table aliases (SubqueryAlias), nor column aliases. Columns are only named at the beginning and end of the plan, while tables are not named by substrait at all. Substrait referes to columns only by their relative indices, so that's name-independent.
When we consumer Substrait plans' ReadRels, the created DataFusion tables may or may not have qualifiers. If the tables are NamedTable reads, then DataFusion uses the name as the qualifier. If the tables are VirtualTables, then we have unqualified tables.
This causes trouble in some cases for DataFusion, as DF requires columns to be uniquely named in at least two places (there’s probably more but these are the ones we’ve hit so far):
What changes are included in this PR?
I tried first a slightly different solution in #11048, where I'd create a unique reference for each table at the beginning. However that solution has some gaps (it might not work if there's a project between join and a read, e.g.) and is a bit nasty overall.
While thinking about it, I concluded there might be a simpler solution: just (re)qualify the tables before joining them. This is what this PR does. We check, before a join, if there are columns that appear both on the left and right sides, and if there are, we requalify both sides (to
left
andright
). As Substrait only cares about the position of the column, the (re)qualifying doesn't affect the rest of the plan (other than the qualifiers of the output columns).This ended up looking much better to me. The table qualifiers do now change within the plan created from Substrait, but only if necessary to avoid conflicts.
This PR also separately fixes a case in the final schema aliasing where we'd unnecessarily add aliases to columns to drop table qualifiers (ie. if our plan had a column
table.a
but the Substraitnames
defines the name to bea
, we'd alias that:table.a AS a
. But that's unnecessary since we should only care about the name, not the qualifier.Are these changes tested?
Tested by some stricter roundtrip unit tests, + adapting existing tests.
Are there any user-facing changes?
Plans produced by Substrait consumer may get different qualifiers for output columns. I think that's unlikely to affect anyone since I dunno why people would be looking at those qualifiers.
Otherwise just improved support for Substrait plans.