Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support duplicate column names in Substrait consumer #11048

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jun 21, 2024

Which issue does this PR close?

Closes #10815 , and also adds support for joining two VirtualTables with overlapping column names.

Rationale for this change

Substrait doesn't contain aliases inside the plan: neither table aliases (SubqueryAlias), nor column aliases. Columns are only named at the beginning and end of the plan, while tables are not named by substrait at all. Substrait referes to columns only by their relative indices, so that's name-independent.

When we consumer Substrait plans' ReadRels, the created DataFusion tables may or may not have qualifiers. If the tables are NamedTable reads, then DataFusion uses the name as the qualifier. If the tables are VirtualTables, then we have unqualified tables.

This causes trouble in some cases for DataFusion, as DF requires columns to be uniquely named in at least two places:

  1. when joining schemas, in DFSchema::check_names. This throws when there are two duplicate unqualified names, or if a unqualified name matches the name of a qualified name. This was problematic for joining VirtualTables since they would be unqualified.
  2. Interestingly, the (1) does NOT throw when there are two duplicate qualified names. That's why the existing tests were working. However, this results in a schema that contains duplicate fields, which then fail if those fields are actually trying to be used - that's Projects require unique expressions names error in substrait producer/consumer #10815

What changes are included in this PR?

In Substrait consumer, we now qualify all tables resulting from a ReadRel, with a pseudo-qualifier if there is no qualifier originally. Also, we maintain a set of qualifiers we've seen so far, and we re-qualify tables with an incremental postfix if the original qualifier has been seen already.

This ensures all tables have unique qualifiers at least at the beginning of the plan.

Note: there is probably a chance this doesn't fix cases where a project happens before the join, since the project may lose the qualifiers and that may cause duplicates again. An alternative solution could be to check the names when creating the join in Substrait consumer, and (re-)qualify one or both sides of the join at that point in case of conflicts.

Are these changes tested?

Tested by some stricter roundtrip unit tests, + adapting existing tests.

Are there any user-facing changes?

Plans produced by Substrait consumer may get different qualifiers for columns. I think that's unlikely to affect anyone since I dunno why people would be looking at those qualifiers.

Otherwise just improved support for Substrait plans.

@Blizzara
Copy link
Contributor Author

Closing in favor of #11049

@Blizzara Blizzara closed this Jun 21, 2024
@Blizzara Blizzara deleted the avo/substrait-uniquely-qualify-tables-for-joins branch June 28, 2024 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Projects require unique expressions names error in substrait producer/consumer
1 participant