Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Projects require unique expressions names error in substrait producer/consumer #10815

Closed
Tracked by #5173
richtia opened this issue Jun 6, 2024 · 7 comments · Fixed by #11049
Closed
Tracked by #5173

Projects require unique expressions names error in substrait producer/consumer #10815

richtia opened this issue Jun 6, 2024 · 7 comments · Fixed by #11049
Labels
bug Something isn't working

Comments

@richtia
Copy link

richtia commented Jun 6, 2024

Describe the bug

Datafusion substrait consumer is unable to produce/consumer a substrait plan that uses the same column names with different aliases

To Reproduce

import substrait.gen.proto.plan_pb2 as plan_pb2
from datafusion import SessionContext
from datafusion import substrait as ss

ctx = SessionContext()
sql_query = "SELECT PS_PARTKEY K1, PS_SUPPKEY AS K1 FROM 'partsupp'"

substrait_proto = plan_pb2.Plan()
substrait_plan = ss.substrait.serde.serialize_to_plan(sql_query, ctx)

Error:

DataFusion error: Plan("Projections require unique expression names but the expression \\"partsupp.ps_partkey AS k1\\" at position 0 and \\"partsupp.ps_suppkey AS k1\\" at position 1 have the same name. Consider aliasing (\\"AS\\") one of them.")')

Expected behavior

No response

Additional context

No response

@richtia richtia added the bug Something isn't working label Jun 6, 2024
@Blizzara
Copy link
Contributor

Blizzara commented Jun 7, 2024

same column names with different aliases

Isn't the repro trying to alias different column names (PS_PARTKEY, PS_SUPPKEY) to same alias (K1)? Why would you want to do that? 😅

@richtia
Copy link
Author

richtia commented Jun 7, 2024

same column names with different aliases

Isn't the repro trying to alias different column names (PS_PARTKEY, PS_SUPPKEY) to same alias (K1)? Why would you want to do that? 😅

Ahh...that was my mistake. One of those should be k2. I was trying to get a more simple repro of a much larger query with multiple joins. However...now that I have a more proper query, I am running into a different issue.

This is the query I have now:

SELECT p1.PS_PARTKEY supp_key, p2.PS_PARTKEY cust_key
FROM
    'partsupp' p1,
    'partsupp' p2

And this is the substrait error from that:

DataFusion error: NotImplemented("Unsupported operator: CrossJoin:\\n  SubqueryAlias: p1\\n    TableScan: partsupp projection=[ps_partkey]\\n  SubqueryAlias: p2\\n    TableScan: partsupp projection=[ps_partkey]")')

So the original issue that I was hitting was datafusion trying to run a substrait plan generated from DuckDB. And the error from that is the same error as I put in the description.

@alamb
Copy link
Contributor

alamb commented Jun 9, 2024

I added to the substrait support epic: #5173

@Blizzara
Copy link
Contributor

Hmm not sure how you got the NotImplemented error - maybe somehow running a quite old DataFusion? However with the query

SELECT p1.PS_PARTKEY supp_key, p2.PS_PARTKEY cust_key
FROM
    'partsupp' p1,
    'partsupp' p2

I do get the same error as you originally:

#[tokio::test]
async fn roundtrip_implicit_cross_join() -> Result<()> {
    roundtrip("SELECT p1.a p1_a, p2.a p2_a FROM data p1, data p2").await
}

Error: Plan("Projections require unique expression names but the expression \"data.a\" at position 0 and \"data.a\" at position 1 have the same name. Consider aliasing (\"AS\") one of them.")

This is because Substrait doesn't include aliases neither for tables nor for columns. I'm trying to see if I can add that into Substrait, it'd make these things easier to support: substrait-io/substrait#648

@richtia
Copy link
Author

richtia commented Jun 10, 2024

Hmm not sure how you got the NotImplemented error - maybe somehow running a quite old DataFusion? However with the query

Ahhh yea...i was on an older version.

@EpsilonPrime
Copy link

Given that names don't matter in Substrait (the final names are provided) is the problem solvable within the Substrait consumer for Datafusion? Shouldn't the consumer be able to rename the columns to whatever it wants?

Stepping further back I wonder if the check is needed at all here -- is it trying to prevent extra work or is it trying to prevent confusion on its part later on? It may be designed for the case where the fields are named the same but are from different sources which isn't happening here. Perhaps the check needs to be made more precise?

@Blizzara
Copy link
Contributor

Given that names don't matter in Substrait (the final names are provided) is the problem solvable within the Substrait consumer for Datafusion?

As discussed on the Substrait ticket, yes it can be solved, but not in a nice way.

Shouldn't the consumer be able to rename the columns to whatever it wants?

It can, however given the user has named the columns/tables in one way in the original plan, it can be quite confusing to the user if the columns/tables are named much differently in the actually executed plan.

Stepping further back I wonder if the check is needed at all here -- is it trying to prevent extra work or is it trying to prevent confusion on its part later on? It may be designed for the case where the fields are named the same but are from different sources which isn't happening here. Perhaps the check needs to be made more precise?

This plan results in a cross join, so the fields do refer to different sources, or same table but different sides of the join, so they are different columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants